Should next_ds be set to execution_date for manually triggered runs?

2018-12-12 Thread Dan Davydov
next_ds is useful when you need cron-style scheduling: a task that runs
for date "X" uses that date in its logic, e.g. to send an email to users
saying the run that was supposed to run for date "X" has completed. The
problem is that next_ds doesn't behave as expected when it comes to
manually triggered runs, as illustrated by the diagrams below.

Using execution_date in a task
*Scheduled Run (works as expected)*
execution_date1           start_date1
      \/                      \/
       *|---------------------|*
      /\                      /\
        \_____________________/
           scheduling_interval

*Manual Run* *(works as expected)*
triggered_date + execution_date + start_date
                    \/
                    *|*

Using next_ds in a task
*Scheduled Run (works as expected)*
next_ds1 + start_date1    next_ds2 + start_date2
         \/                        \/
          *|-----------------------|*
         /\                        /\
           \_______________________/
              scheduling_interval

*Manual Run* *(next_ds1 is expected to match triggered_date as in the case
for the manually triggered run that uses the regular execution_date above)*
triggered_date         next_ds1 + start_date
      \/                        \/
       *|-----------------------|*
      /\                        /\
        \_______________________/
   0 to scheduling_interval (depending on
   when the next execution date is)

Proposal
Have next_ds always set to execution_date for manually triggered runs
instead of the next schedule-interval aligned execution date.

This *might* break backwards compatibility for some users, but it can be
argued that the current functionality is a bug. If it's really desired, we
can create new aliases that behave logically, although I am against this.

prev_ds should probably also be made consistent with this logic.
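The proposed behavior can be sketched in a few lines of Python; the one-day interval and the `following_schedule` helper below are illustrative stand-ins, not Airflow's real API:

```python
from datetime import datetime, timedelta

SCHEDULE_INTERVAL = timedelta(days=1)  # illustrative interval

def following_schedule(execution_date):
    # Stand-in for the scheduler's real interval arithmetic.
    return execution_date + SCHEDULE_INTERVAL

def proposed_next_ds(execution_date, manually_triggered):
    """Proposal: for manual runs, next_ds collapses to execution_date
    instead of the next schedule-aligned date."""
    if manually_triggered:
        return execution_date
    return following_schedule(execution_date)

trigger_time = datetime(2018, 12, 12, 15, 30)
print(proposed_next_ds(trigger_time, manually_triggered=True))   # 2018-12-12 15:30:00
print(proposed_next_ds(trigger_time, manually_triggered=False))  # 2018-12-13 15:30:00
```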

Thoughts?
- Dan


Re: Will Airflow 2.0.0 support Python 2.7?

2018-11-29 Thread Dan Davydov
I think we should probably drop Python 2.7 support for 2.0, since 2.0 will
be quite a large undertaking and I expect it to take a long time to
complete, given that we should batch in as many of the backwards-incompatible
changes we want to make as possible. Even if we drop support a little bit
earlier than Jan 1, 2020, I don't think it's the end of the world (since
users can just use older package versions).

On Thu, Nov 29, 2018 at 6:50 AM airflowuser
 wrote:

> I think that many packages dropped support for Python 2.7 because they
> depend on 3rd-party packages which also dropped support.
>
> Python 2.7 will be deprecated on 1-Jan-2020.
> Assuming Airflow 2.0.0 will be released in the 1st/2nd quarter of 2019, it
> means that Airflow 3.0.0 will have to be introduced in the 3rd/4th quarter
> of 2019, as dropping support for Python 2.7 can't be done in a minor
> version...
>
> Am I wrong?
>
>
>
> ‐‐‐ Original Message ‐‐‐
> On Thursday, November 29, 2018 12:04 PM, Ash Berlin-Taylor 
> wrote:
>
> > This came up previously, and no firm conclusion was reached, but given
> Python 2.7 is still maintained for another year, yes, probably.
> >
> > > On 29 Nov 2018, at 08:48, airflowuser airflowu...@protonmail.com.INVALID
> wrote:
> > > Are there plans to drop support for Python 2.7 - if so when ?
>
>
>


Re: [DISCUSS] Apache Airflow graduation from the incubator

2018-11-27 Thread Dan Davydov
+1, and thank you to everyone who helped drive this (especially Bolke!).

On Tue, Nov 27, 2018 at 1:02 AM Tao Feng  wrote:

> And happy to help out any items for the TLP push.
>
> On Mon, Nov 26, 2018 at 9:53 PM Tao Feng  wrote:
>
> > +1 as well. (do we need to start another vote thread per Jakob's
> > suggestion?)
> >
> > Great to see the thriving community!
> >
> > On Sat, Nov 24, 2018 at 3:57 AM Bolke de Bruin 
> wrote:
> >
> >> Hi All,
> >>
> >> With the Apache Airflow community healthy and growing, I think now would
> >> be a good time to discuss where we stand regarding graduation from the
> >> Incubator, and which requirements remain.
> >>
> >> Apache Airflow entered incubation around 2 years ago; since then, the
> >> Airflow community has learned a lot about how to do things the Apache
> >> way. We are now a very helpful and engaged community, ready to help with
> >> all questions from the Airflow community. We have delivered multiple
> >> releases that have been increasing in quality ever since, and we can now
> >> do self-driving releases at a good cadence.
> >>
> >> The community is growing, and new committers and PPMC members keep
> >> joining. We have addressed almost all the maturity issues stipulated by
> >> the Apache Project Maturity Model [1]. Some final requirements remain,
> >> but those just need a final nudge. Committers and contributors are
> >> invited to verify the list and pick up the last bits (QU30, CO50).
> >> Finally (yahoo!) all the License and IP issues we can see got resolved.
> >>
> >> Based on those, I believe it's time for us to graduate to TLP. [2] Any
> >> thoughts? Advice from Airflow mentors is also welcome.
> >>
> >> Thanks,
> >>
> >> [1]
> >> https://cwiki.apache.org/confluence/display/AIRFLOW/Maturity+Evaluation
> >> [2]
> >>
> https://incubator.apache.org/guides/graduation.html#graduating_to_a_top_level_project
> >> Regards,
> >
> >
>


Re: Remove airflow from pypi

2018-11-23 Thread Dan Davydov
This could potentially break builds for some users, but I feel the pros
mentioned outweigh this, so I went ahead and deleted it.

On Fri, Nov 23, 2018 at 10:18 AM Bolke de Bruin  wrote:

> Agree! This is even a security issue.
>
> Sent from my iPhone
>
> > On 23 Nov 2018, at 15:29, Driesprong, Fokko 
> wrote:
> >
> > Hi all,
> >
> > I think we should remove airflow (not apache-airflow) from PyPI. I still
> > get questions from people who accidentally install Airflow 1.8.0. I see
> > this is maintained by mistercrunch, artwr, aeon. Anyone any objections?
> >
> > Cheers, Fokko
>


Re: Pinning dependencies for Apache Airflow

2018-10-04 Thread Dan Davydov
Relevant discussion about this:
https://github.com/apache/incubator-airflow/pull/1809#issuecomment-257502174

On Thu, Oct 4, 2018 at 11:25 AM Jarek Potiuk 
wrote:

> TL;DR; A change is coming in the way how dependencies/requirements are
> specified for Apache Airflow - they will be fixed rather than flexible (==
> rather than >=).
>
> This is follow up after Slack discussion we had with Ash and Kaxil -
> summarising what we propose we'll do.
>
> *Problem:*
> During the last few weeks we experienced quite a few downtimes of TravisCI
> builds (for all PRs/branches, including master) as some of the transitive
> dependencies were automatically upgraded. This is because a number of our
> dependencies are specified with >= rather than ==.
>
> Whenever there is a new release of such a dependency, it might cause a
> chain reaction with upgrades of transitive dependencies, which might get
> into conflict.
>
> An example was Flask-AppBuilder vs flask-login transitive dependency with
> click. They started to conflict once AppBuilder has released version
> 1.12.0.
>
> *Diagnosis:*
> Transitive dependencies with "flexible" versions (where >= is used instead
> of ==) are a recipe for "dependency hell". We will sooner or later hit
> other cases where unpinned dependencies cause similar problems with other
> transitive dependencies. We need to pin them. This causes problems both
> for released versions (because they stop working!) and for development
> (because they break master builds in TravisCI and prevent people from
> installing a development environment from scratch).
>
> *Solution:*
>
>- Following the old-but-good post
>https://nvie.com/posts/pin-your-packages/ we are going to fix the
> pinned
>dependencies to specific versions (so basically all dependencies are
>"fixed").
>- We will introduce a mechanism to be able to upgrade dependencies with
>pip-tools (https://github.com/jazzband/pip-tools). We might also take a
>look at pipenv: https://pipenv.readthedocs.io/en/latest/
>- People who would like to upgrade some dependencies for their PRs will
>still be able to do it - but such upgrades will be in their PR thus they
>will go through TravisCI tests and they will also have to be specified
> with
>pinned fixed versions (==). This should be part of review process to
> make
>sure new/changed requirements are pinned.
>- In release process there will be a point where an upgrade will be
>attempted for all requirements (using pip-tools) so that we are not
> stuck
>with older releases. This will be in controlled PR environment where
> there
>will be time to fix all dependencies without impacting others and likely
>enough time to "vet" such changes (this can be done for alpha/beta
> releases
>for example).
>- As a side effect dependencies specification will become far simpler
>and straightforward.
>
> Happy to hear the community's comments on the proposal. I am happy to take
> the lead on this, open a JIRA issue, and implement it if this is something
> the community is happy with.
>
> J.
>
> --
>
> *Jarek Potiuk, Principal Software Engineer*
> Mobile: +48 660 796 129
>
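The failure mode Jarek describes boils down to ">=" silently accepting releases that did not exist when the build was last green. A minimal illustration, using a toy version matcher (not a real resolver) and the flask-appbuilder release from the thread as the example:

```python
def satisfies(installed, spec):
    """Tiny illustrative matcher: supports only '>=' and '==' specifiers."""
    op, version = spec[:2], spec[2:]
    have = tuple(int(p) for p in installed.split("."))
    want = tuple(int(p) for p in version.split("."))
    if op == ">=":
        return have >= want
    if op == "==":
        return have == want
    raise ValueError("unsupported operator: " + op)

# Yesterday only flask-appbuilder 1.11.1 existed; today 1.12.0 is released.
print(satisfies("1.12.0", ">=1.11.1"))  # True  -> flexible spec picks up the new release
print(satisfies("1.12.0", "==1.11.1"))  # False -> pinned spec keeps the vetted version
```

This is exactly why the proposal pins everything with `==` and makes upgrades an explicit, reviewed step (e.g. via pip-tools) rather than a side effect of an upstream release.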


Re: Why not mark inactive DAGs in the main scheduler loop?

2018-08-22 Thread Dan Davydov
Agreed on delegation to a subprocess but I think that can come as part of a
larger redesign (maybe along with uploading DAG import errors etc). The
query should be quite fast so it should not have a significant impact on
the Scheduler times.
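Moving the deactivation into the main loop could look roughly like this, as a throttled sketch with stand-in names (the real call is models.DAG.deactivate_stale_dags; the throttle interval is illustrative):

```python
from datetime import datetime, timedelta

class FakeDagModel:
    """Stand-in for models.DAG; the real method flips is_active in the DB."""
    deactivated_since = None

    @classmethod
    def deactivate_stale_dags(cls, since):
        cls.deactivated_since = since

def loop_pass(now, last_run, all_files_processed, every=timedelta(minutes=5)):
    """One pass of the main scheduler loop: deactivate stale DAGs, throttled
    so the extra query doesn't run on every single iteration."""
    if all_files_processed and now - last_run >= every:
        FakeDagModel.deactivate_stale_dags(now)
        return now
    return last_run

now = datetime(2018, 8, 22, 12, 0)
last = loop_pass(now, last_run=now - timedelta(minutes=10), all_files_processed=True)
print(FakeDagModel.deactivated_since == now)  # True: deactivation ran this pass
```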

On Wed, Aug 22, 2018 at 3:52 PM Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> I'd rather the scheduler delegate that to one of the minions (subprocess)
> if possible. We should keep everything we can off the main thread.
>
> BTW I've been speaking about renaming the scheduler to "supervisor" for a
> while now. While renaming may be a bit tricky (updating all references in
> the code), we should think of the scheduler as more of a supervisor as it
> takes on all sorts of supervision-related tasks.
>
> Tangent: we need to start thinking about allowing for a distributed
> scheduler too, and I'm thinking we need to be careful around the tasks that
> shouldn't be parallelized (this may or may not be one of them).  We'll need
> to do very basic leader election and taking/releasing locks while running
> these tasks. I'm thinking we can just set flags in the database to do that.
>
> Max
>
> On Wed, Aug 22, 2018 at 12:19 PM Taylor Edmiston 
> wrote:
>
> > I'm not super familiar with this part of the scheduler.  What exactly are
> > the implications of doing this mid-loop vs at scheduler termination?
> > Is there a use case where DAGs hit this besides having been deleted?
> >
> > The deactivate_stale_dags call doesn't appear to be super expensive or
> > anything like that.
> >
> > This seems like a reasonable idea to me.
> >
> > *Taylor Edmiston*
> > Blog <https://blog.tedmiston.com/> | CV
> > <https://stackoverflow.com/cv/taylor> | LinkedIn
> > <https://www.linkedin.com/in/tedmiston/> | AngelList
> > <https://angel.co/taylor> | Stack Overflow
> > <https://stackoverflow.com/users/149428/taylor-edmiston>
> >
> >
> >
> > > On Wed, Aug 22, 2018 at 2:32 PM Dan Davydov 
> > wrote:
> >
> > > I see some PRs creating endpoints to delete DAGs and other things
> related
> > > to manually deleting DAGs from the DB, but is there a good reason why
> we
> > > can't just move the deactivating DAG logic into the main scheduler
> loop?
> > >
> > > The scheduler already has some code like this, but it only runs when
> the
> > > Scheduler terminates:
> > >   if all_files_processed:
> > >       self.log.info(
> > >           "Deactivating DAGs that haven't been touched since %s",
> > >           execute_start_time.isoformat()
> > >       )
> > >       models.DAG.deactivate_stale_dags(execute_start_time)
> > >
> >
>


Re: Broken DAG message won't go away in webserver

2018-08-10 Thread Dan Davydov
The scheduler should clear import errors for DAG files that no longer exist
or DAGs whose import errors have been fixed. Take a look at
https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L761
and where it is called.

If it's not working there might be a bug.
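The clearing rule described above can be sketched as a pure function; this is an in-memory stand-in for illustration, not Airflow's actual session/model API:

```python
def surviving_import_errors(import_errors, existing_files, files_that_parse):
    """An import error survives only if its file still exists on disk and
    still fails to import; otherwise the scheduler should drop it."""
    return {
        filename: message
        for filename, message in import_errors.items()
        if filename in existing_files and filename not in files_that_parse
    }

errors = {"dags/broken.py": "NameError: x", "dags/deleted.py": "SyntaxError"}
print(surviving_import_errors(
    errors,
    existing_files={"dags/broken.py"},    # deleted.py was removed from disk
    files_that_parse={"dags/broken.py"},  # broken.py was fixed
))  # -> {}
```

If errors stick around after both fixing the file and letting the scheduler reprocess it, that points at a bug in this clearing path rather than at the webserver.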

On Fri, Aug 10, 2018 at 10:19 AM Ben Laird  wrote:

> What is the desired behavior then? If I fix an error in a DAG, there should
> be a way to clear the state in the error table. Since these errors pop up
> in the webserver logs, and restarting the webserver with no error signals
> that the DAGs are healthy, should restarting the webserver clear the error
> table? Or have a CLI command as a fallback?
>
>
> On Thu, Aug 9, 2018 at 5:35 PM, Alex Guziel  wrote:
>
> > IIRC the scheduler sets these messages in the error table in the db.
> >
> > On Thu, Aug 9, 2018 at 2:13 PM, Ben Laird  wrote:
> >
> > > The messages persist even after restarting the webserver. I've verified
> > > with other airflow users in the office that they'd have to manually
> > delete
> > > records from the 'import_error' table.
> > >
> > > When you say 'sync your DAGs', what do you mean exactly? When we fix a
> > DAG,
> > > we'd normally kill the webserver process, push a zip containing our dag
> > > directory (with the fixed code), unzip and restart the webserver.
> > >
> > > Thanks
> > >
> > > On Thu, Aug 9, 2018 at 4:43 PM, Taylor Edmiston 
> > > wrote:
> > >
> > > > Yeah, you definitely shouldn't need to do a resetdb for that.
> > > >
> > > > Did you try restarting the webserver?
> > > >
> > > > How do you sync your DAGs to the webserver?  Is it possible the fixed
> > DAG
> > > > didn't get synced there?
> > > >
> > > > For me, IIRC, the error stops persisting once the DAG is fixed and
> > > synced.
> > > >
> > > > *Taylor Edmiston*
> > > > Blog  | CV
> > > >  | LinkedIn
> > > >  | AngelList
> > > >  | Stack Overflow
> > > > 
> > > >
> > > >
> > > > On Thu, Aug 9, 2018 at 3:35 PM, Ben Laird 
> wrote:
> > > >
> > > > > Hello -
> > > > >
> > > > > I've noticed this several times and not sure what the solution is.
> > If I
> > > > > have a DAG error at some point, I'll see message in the webserver
> > that
> > > > says
> > > > > "Broken DAG: [Error]". However, after fixing the code, restarting
> the
> > > > > webserver, etc, the error persists. After closing it out, it will
> > just
> > > > pop
> > > > > up again after reloading.
> > > > >
> > > > > The only way I was able to delete was by doing a `airflow resetdb`.
> > I'd
> > > > > like to avoid manually deleting records from the DB, as now in prod
> > we
> > > > > cannot just kill the DB state.
> > > > >
> > > > > Any suggestions?
> > > > >
> > > > > Thanks,
> > > > > Ben Laird
> > > > >
> > > >
> > >
> >
>


Re: Kerberos and Airflow

2018-08-05 Thread Dan Davydov
I look forward to reading the draft and working on it with you! Not 100%
sure I can make it to SF for the hackathon (I'm in New York now), but I can
participate remotely.



On Sat, Aug 4, 2018 at 9:30 AM Bolke de Bruin  wrote:

> Hi Dan,
>
> Don’t misunderstand me. I think what I proposed is complementary to the
> dag submit function. The only thing you mentioned that I don't think is
> needed is to fully serialize up front, thereby excluding callbacks etc.
> (although there are other serialization libraries like marshmallow that
> might be able to do it).
>
> You are right to mention that the hashes should be calculated at submit
> time and an authorized user should be able to recalculate a hash. Another
> option could be something like https://pypi.org/project/signedimp/ which
> we could use to verify dependencies.
>
> I'll start writing something up. We can then shoot holes in it (I think
> you have a point on the crypto) and maybe do some hacking on it. This could
> be part of the hackathon in September in SF; I'm sure some other people would
> have an interest in it as well.
>
> B.
>
> Sent from my iPad
>
> > On 3 Aug 2018, at 23:14, Dan Davydov 
> wrote:
> >
> > I designed a system similar to what you are describing which is in use at
> > Airbnb (only DAGs on a whitelist would be allowed to merged to the git
> repo
> > if they used certain types of impersonation), it worked for simple use
> > cases, but the problem was doing access control becomes very difficult,
> > e.g. solving the problem of which DAGs map to which manifest files, and
> > which manifest files can access which secrets.
> >
> > There is also a security risk where someone changes e.g. a python file
> > dependency of your task, or let's say you figure out a way to block those
> > kinds of changes based on your hashing, what if there is a legitimate
> > change in a dependency and you want to recalculate the hash? Then I think
> > you go back to a solution like your proposed "airflow submit" command to
> > accomplish this.
> >
> > Additional concerns:
> > - I'm not sure if I'm a fan of the first time a scheduler parses a
> DAG
> > to be what creates the hashes either, it feels to me like
> > encryption/hashing should be done before DAGs are even parsed by the
> > scheduler (at commit time or submit time of the DAGs)
> > - The type of the encrypted key seems kind of hacky to me, i.e. some kind
> of
> > custom hash based on DAG structure instead of a simple token passed in by
> > users which has a clear separation of concerns WRT security
> > - Added complexity both to Airflow code, and to users as they need to
> > define or customize hashing functions for DAGs to improve security
> > If we can get a reasonably secure solution then it might be a reasonable
> > trade-off considering the alternative is a major overhaul/restrictions to
> > DAGs.
> >
> > Maybe I'm missing some details that would alleviate my concerns here,
> and a
> > bit of a more in-depth document might help?
> >
> >
> >
> > *Also: using the Kubernetes executor combined with some of the things
> > we discussed greatly enhances the security of Airflow as the
> > environment isn’t really shared anymore.*
> > Assuming a multi-tenant scheduler, I feel the same set of hard problems
> > exist with Kubernetes, as the executor mainly just simplifies the
> > post-executor parts of task scheduling/execution which I think you
> already
> > outlined a good solution for early on in this thread (passing keys from
> the
> > executor to workers).
> >
> > Happy to set up some time to talk real-time about this by the way, once
> we
> > iron out the details I want to implement whatever the best solution we
> come
> > up with is.
> >
> >> On Thu, Aug 2, 2018 at 4:13 PM Bolke de Bruin 
> wrote:
> >>
> >> You mentioned you would like to make sure that the DAG (and its tasks)
> >> runs in a confined set of settings. Ie.
> >> A given set of connections at submission time not at run time. So here
> we
> >> can make use of the fact that both the scheduler
> >> and the worker parse the DAG.
> >>
> >> Firstly, when the scheduler evaluates a DAG it can add an integrity check
> >> (hash) for each task. The executor can encrypt the
> >> metadata with this hash ensuring that the structure of the DAG remained
> >> the same. It means that the task is only
> >> able to decrypt the metadata when it is able to calculate the same hash.
> >>
> >> Similarly, if the 

Re: The need for LocalTaskJob

2018-08-04 Thread Dan Davydov
Alex (cc'd) brought this up to me a while ago too, and I agreed with him.
It is definitely something we should do; I remember there were some things
that were a bit tricky about removing the intermediate process and that
would be a bit of work to fix (something about the tasks needing to
heartbeat the parent process, maybe?).

TLDR: No blockers from me, just might be a bit of work to implement.
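The bash indirection Bolke wants to drop can be illustrated with plain subprocess calls; the commands below are illustrative, not Airflow's real CLI:

```python
import subprocess
import sys

# Current chain (simplified): worker -> bash -c "airflow run ..." -> python.
via_bash = subprocess.run(
    ["bash", "-c", f"{sys.executable} -c \"print('task ran')\""],
    capture_output=True, text=True,
)

# Proposed: exec the interpreter directly, removing one process per hop.
direct = subprocess.run(
    [sys.executable, "-c", "print('task ran')"],
    capture_output=True, text=True,
)

print(via_bash.stdout == direct.stdout)  # True: same result, fewer processes
```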


On Sat, Aug 4, 2018 at 9:15 AM Bolke de Bruin  wrote:

> Hi Max, Dan et al,
>
> Currently, when a scheduled task runs, this happens in three steps:
>
> 1. Worker
> 2. LocalTaskJob
> 3. Raw task instance
>
> It uses (by default) 5 (!) different processes:
>
> 1. Worker
> 2. Bash + Airflow
> 3. Bash + Airflow
>
> I think we can merge the worker and LocalTaskJob, as the latter seems to
> exist only to track a particular task. This can be done within the worker
> without side effects. Next to that, I think we can limit the number of
> (airflow) processes to 2 if we remove the bash dependency. I don't see any
> reason to depend on bash.
>
> Can you guys shed some light on what the thoughts were around those
> choices? Am I missing anything on why they should exist?
>
> Cheers
> Bolke
>
> Sent from my iPad


Re: Kerberos and Airflow

2018-08-03 Thread Dan Davydov
I designed a system similar to what you are describing which is in use at
Airbnb (only DAGs on a whitelist would be allowed to merged to the git repo
if they used certain types of impersonation), it worked for simple use
cases, but the problem was doing access control becomes very difficult,
e.g. solving the problem of which DAGs map to which manifest files, and
which manifest files can access which secrets.

There is also a security risk where someone changes e.g. a python file
dependency of your task, or let's say you figure out a way to block those
kinds of changes based on your hashing, what if there is a legitimate
change in a dependency and you want to recalculate the hash? Then I think
you go back to a solution like your proposed "airflow submit" command to
accomplish this.

Additional concerns:
- I'm not sure if I'm a fan of the first time a scheduler parses a DAG
to be what creates the hashes either, it feels to me like
encryption/hashing should be done before DAGs are even parsed by the
scheduler (at commit time or submit time of the DAGs)
- The type of the encrypted key seems kind of hacky to me, i.e. some kind of
custom hash based on DAG structure instead of a simple token passed in by
users which has a clear separation of concerns WRT security
- Added complexity both to Airflow code, and to users as they need to
define or customize hashing functions for DAGs to improve security
If we can get a reasonably secure solution then it might be a reasonable
trade-off considering the alternative is a major overhaul/restrictions to
DAGs.

Maybe I'm missing some details that would alleviate my concerns here, and a
bit of a more in-depth document might help?



*Also: using the Kubernetes executor combined with some of the things
we discussed greatly enhances the security of Airflow as the
environment isn’t really shared anymore.*
Assuming a multi-tenant scheduler, I feel the same set of hard problems
exist with Kubernetes, as the executor mainly just simplifies the
post-executor parts of task scheduling/execution which I think you already
outlined a good solution for early on in this thread (passing keys from the
executor to workers).

Happy to set up some time to talk real-time about this by the way, once we
iron out the details I want to implement whatever the best solution we come
up with is.
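One possible reading of the hash-verification scheme under discussion, as a runnable sketch: derive a key from a task's structural fields and encrypt the metadata with it, so a worker only recovers the metadata if it parses an identical task. The XOR "cipher" and the field list are placeholders for illustration; a real implementation would use a proper cipher (e.g. Fernet) and Airflow's actual task attributes:

```python
import hashlib
import json

def task_hash(task):
    """Hash the structural fields of a task (field list is illustrative)."""
    structural = {k: task[k] for k in ("task_id", "operator", "upstream")}
    return hashlib.sha256(json.dumps(structural, sort_keys=True).encode()).digest()

def xor_stream(key, data):
    """Placeholder 'cipher' for the sketch; NOT real encryption."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

task = {"task_id": "t1", "operator": "BashOperator", "upstream": []}
metadata = b"conn_password=hunter2"

blob = xor_stream(task_hash(task), metadata)          # scheduler/executor side
print(xor_stream(task_hash(task), blob) == metadata)  # True: same structure decrypts

tampered = dict(task, operator="PythonOperator")
print(xor_stream(task_hash(tampered), blob) == metadata)  # False: changed task fails
```

This captures the property Bolke describes: a change to the DAG's structure changes the hash, so the tampered task can no longer recover the metadata.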

On Thu, Aug 2, 2018 at 4:13 PM Bolke de Bruin  wrote:

> You mentioned you would like to make sure that the DAG (and its tasks)
> runs in a confined set of settings. Ie.
> A given set of connections at submission time not at run time. So here we
> can make use of the fact that both the scheduler
> and the worker parse the DAG.
>
> Firstly, when the scheduler evaluates a DAG it can add an integrity check
> (hash) for each task. The executor can encrypt the
> metadata with this hash ensuring that the structure of the DAG remained
> the same. It means that the task is only
> able to decrypt the metadata when it is able to calculate the same hash.
>
> Similarly, if the scheduler parses a DAG for the first time it can
> register the hashes for the tasks. It can then verify these hashes
> at runtime to ensure the structure of the tasks has stayed the same. In
> the manifest (which could even in the DAG or
> part of the DAG definition) we could specify which fields would be used
> for hash calculation. We could even specify
> static hashes. This would give flexibility as to what freedom the users
> have in the auto-generated DAGS.
>
> Something like that?
>
> B.
>
> > On 2 Aug 2018, at 20:12, Dan Davydov 
> wrote:
> >
> > I'm very intrigued, and am curious how this would work in a bit more
> > detail, especially for dynamically created DAGs (how would static
> manifests
> > map to DAGs that are generated from rows in a MySQL table for example)?
> You
> > could of course have something like regexes in your manifest file like
> > some_dag_framework_dag_*, but then how would you make sure that other
> users
> > did not create DAGs that matched this regex?
> >
> > On Thu, Aug 2, 2018 at 1:51 PM Bolke de Bruin  wrote:
> >
> >> Hi Dan,
> >>
> >> I discussed this a little bit with one of the security architects here.
> We
> >> think that
> >> you can have a fair trade off between security and usability by having
> >> a kind of manifest with the dag you are submitting. This manifest can
> then
> >> specify what the generated tasks/dags are allowed to do and what
> metadata
> >> to provide to them. We could also let the scheduler generate hashes per
> >> generated
> >> DAG / task and verify those with an established version (1st run?). This
> >> limits the
> >> attack vector.
> >>
> >> 

Re: Apache Airflow welcome new committer/PMC member : Feng Tao (a.k.a. feng-tao)

2018-08-03 Thread Dan Davydov
Welcome Feng, awesome work :)!

On Fri, Aug 3, 2018 at 11:20 AM Taylor Edmiston  wrote:

> Congratulations, Feng!
>
> *Taylor Edmiston*
> Blog  | CV
>  | LinkedIn
>  | AngelList
>  | Stack Overflow
> 
>
>
> On Fri, Aug 3, 2018 at 7:31 AM, Driesprong, Fokko 
> wrote:
>
> > Welcome Feng! Awesome to have you on board!
> >
> > 2018-08-03 10:41 GMT+02:00 Naik Kaxil :
> >
> > > Hi Airflow'ers,
> > >
> > >
> > >
> > > Please join the Apache Airflow PMC in welcoming its newest member and
> > >
> > > co-committer, Feng Tao (a.k.a. feng-tao).
> > >
> > >
> > >
> > > Welcome Feng, great to have you on board!
> > >
> > >
> > >
> > > Cheers,
> > >
> > > Kaxil
> > >
> > >
> > >
> > >
> > >
> > >
> > > Kaxil Naik
> > >
> > > Data Reply
> > > 2nd Floor, Nova South
> > > 160 Victoria Street, Westminster
> > > London SW1E 5LB - UK
> > > phone: +44 (0)20 7730 6000
> > > k.n...@reply.com
> > > www.reply.com
> > >
> > >
> >
>


Re: Kerberos and Airflow

2018-08-02 Thread Dan Davydov
I'm very intrigued, and am curious how this would work in a bit more
detail, especially for dynamically created DAGs (how would static manifests
map to DAGs that are generated from rows in a MySQL table for example)? You
could of course have something like regexes in your manifest file like
some_dag_framework_dag_*, but then how would you make sure that other users
did not create DAGs that matched this regex?

On Thu, Aug 2, 2018 at 1:51 PM Bolke de Bruin  wrote:

> Hi Dan,
>
> I discussed this a little bit with one of the security architects here. We
> think that
> you can have a fair trade off between security and usability by having
> a kind of manifest with the dag you are submitting. This manifest can then
> specify what the generated tasks/dags are allowed to do and what metadata
> to provide to them. We could also let the scheduler generate hashes per
> generated
> DAG / task and verify those with an established version (1st run?). This
> limits the
> attack vector.
>
> A DagSerializer would be great, but I think it solves a different issue
> and the above
> is somewhat simpler to implement?
>
> Bolke
>
> > On 29 Jul 2018, at 23:47, Dan Davydov 
> wrote:
> >
> > *Let’s say we trust the owner field of the DAGs I think we could do the
> > following.*
> > *Obviously, the trusting the user part is key here. It is one of the
> > reasons I was suggesting using “airflow submit” to update / add dags in
> > Airflow*
> >
> >
> > *This is the hard part about my question.*
> > I think in a true multi-tenant environment we wouldn't be able to trust
> the
> > user, otherwise we wouldn't necessarily even need a mapping of Airflow
> DAG
> > users to secrets, because if we trust users to set the correct Airflow
> user
> > for DAGs, we are basically trusting them with all of the creds the
> Airflow
> > scheduler can access for all users anyways.
> >
> > I actually had the same thought as your "airflow submit" a while ago,
> which
> > I discussed with Alex, basically creating an API for adding DAGs instead
> of
> > having the Scheduler parse them. FWIW I think it's superior to the git
> time
> > machine approach because it's a more generic form of "serialization" and
> is
> > more correct as well because the same DAG file parsed on a given git SHA
> > can produce different DAGs. Let me know what you think, and maybe I can
> > start a more formal design doc if you are onboard:
> >
> > A user or service with an auth token sends an "airflow submit" request
> to a
> > new kind of Dag Serialization service, along with the serialized DAG
> > objects generated by parsing on the client. It's important that these
> > serialized objects are declaritive and not e.g. pickles so that the
> > scheduler/workers can consume them and reproducability of the DAGs is
> > guaranteed. The service will then store each generated DAG along with
> it's
> > access based on the provided token e.g. using Ranger, and the
> > scheduler/workers will use the stored DAGs for scheduling/execution.
> > Operators would be deployed along with the Airflow code separately from
> the
> > serialized DAGs.
> >
> > A serialed DAG would look something like this (basically Luigi-style :)):
> > MyTask - BashOperator: {
> >  cmd: "sleep 1"
> >  user: "Foo"
> >  access: "token1", "token2"
> > }
> >
> > MyDAG: {
> >  MyTask1 >> SomeOtherTask1
> >  MyTask2 >> SomeOtherTask1
> > }
> >
> > Dynamic DAGs in this case would just consist of a service calling
> "Airflow
> > Submit" that does its own form of authentication to get access to some
> > kind of tokens (or basically just forwarding the secrets the users of the
> > dynamic DAG submit).
> >
> > For the default Airflow implementation you can maybe just have the Dag
> > Serialization server bundled with the Scheduler, with auth turned off,
> and
> > to periodically update the Dag Serialization store which would emulate
> the
> > current behavior closely.
> >
> > Pros:
> > 1. Consistency across running task instances in a dagrun/scheduler,
> > reproducibility and auditability of DAGs
> > 2. Users can control when to deploy their DAGs
> > 3. Scheduler runs much faster since it doesn't have to run python files
> and
> > e.g. make network calls
> > 4. Scaling scheduler becomes easier because can have different service
> > responsible for parsing DAGs which can be trivially scaled horizontally
> > (clients are doing the parsing)
>
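The key property of the serialized format sketched in the quoted message is that it is plain declarative data, not pickled callables. A hypothetical JSON-shaped version of that example (schema invented for illustration):

```python
import json

serialized = {
    "MyDAG": {
        "tasks": {
            "MyTask1": {"operator": "BashOperator", "cmd": "sleep 1",
                        "user": "Foo", "access": ["token1", "token2"]},
            "MyTask2": {"operator": "BashOperator", "cmd": "sleep 1",
                        "user": "Foo", "access": ["token1"]},
            "SomeOtherTask1": {"operator": "BashOperator", "cmd": "true"},
        },
        "edges": [["MyTask1", "SomeOtherTask1"], ["MyTask2", "SomeOtherTask1"]],
    }
}

# Round-trips losslessly, which is what makes stored DAGs reproducible
# and auditable on the scheduler/worker side (pickles would not be).
print(json.loads(json.dumps(serialized)) == serialized)  # True
```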

Re: Kerberos and Airflow

2018-07-29 Thread Dan Davydov
sting using “airflow submit” to update / add dags in
> Airflow. We could enforce authentication on the DAG. It was kind of ruled
> out in favor of git time machines although these never happened afaik ;-).
>
> BTW: I have updated my implementation with protobuf. Metadata is now
> available at executor and task.
>
>
> > On 29 Jul 2018, at 15:47, Dan Davydov 
> wrote:
> >
> > The concern is how to secure secrets on the scheduler such that only
> > certain DAGs can access them, and in the case of files that create DAGs
> > dynamically, only some set of DAGs should be able to access these
> secrets.
> >
> > e.g. if there is a secret/keytab that can be read by DAG A generated by
> > file X, and file X generates DAG B as well, there needs to be a scheme to
> > stop the parsing of DAG B on the scheduler from being able to read the
> > secret in DAG A.
> >
> > Does that make sense?
> >
> > On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin  wrote:
> >
> >> I’m not sure what you mean. The example I created allows for dynamic
> DAGs,
> >> as the scheduler obviously knows about the tasks when they are ready to
> be
> >> scheduled.
> >> This isn’t any different from a static DAG or a dynamic one.
> >>
> >> For Kerberos it isn't that special. Basically, a keytab is the user's
> >> revocable credentials
> >> in a special format. The keytab itself can be protected by a password.
> So
> >> I can imagine
> >> that a connection is defined that sets a keytab location and password to
> >> access the keytab.
> >> The scheduler understands this (or maybe the Connection model) and
> >> serializes and sends
> >> it to the worker as part of the metadata. The worker then reconstructs
> the
> >> keytab and issues
> >> a kinit or supplies it to the other service requiring it (eg. Spark)
> >>
> >> * Obviously the worker and scheduler need to communicate over SSL.
> >> * There is a challenge at the worker level. Credentials are secured
> >> against other users, but are readable by the owning user. So imagine 2
> DAGs
> >> from two different users with different connections without sudo
> >> configured. If they end up at the same worker if DAG 2 is malicious it
> >> could read files and memory created by DAG 1. This is the reason why
> using
> >> environment variables are NOT safe (DAG 2 could read
> /proc//environ).
> >> To mitigate this we probably need to PIPE the data to the task’s STDIN.
> It
> >> won’t solve the issue but will make it harder as now it will only be in
> >> memory.
> >> * The reconstructed keytab (or the initalized version) can be stored in,
> >> most likely, the process-keyring (
> >> http://man7.org/linux/man-pages/man7/process-keyring.7.html). As
> >> mentioned earlier this poses a challenge for Java applications that
> cannot
> >> read from this location (keytab an ccache). Writing it out to the
> >> filesystem then becomes a possibility. This is essentially the same how
> >> Spark solves it (
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode).
> >>
> >> Why not work on this together? We need it as well. Airflow as it is now
> we
> >> consider the biggest security threat and it is really hard to secure it.
> >> The above would definitely be a serious improvement. Another step would
> be
> >> to stop Tasks from accessing the Airflow DB all together.
> >>
> >> Cheers
> >> Bolke
> >>
> >>> On 29 Jul 2018, at 05:36, Dan Davydov
> >> wrote:
> >>>
> >>> This makes sense, and thanks for putting this together. I might pick
> this
> >>> up myself depending on if we can get the rest of the multi-tenancy
> story
> >>> nailed down, but I still think the tricky part is figuring out how to
> >> allow
> >>> dynamic DAGs (e.g. DAGs created from rows in a Mysql table) to work
> with
> >>> Kerberos, curious what your thoughts are there. How would secrets be
> >> passed
> >>> 

Re: Kerberos and Airflow

2018-07-29 Thread Dan Davydov
The concern is how to secure secrets on the scheduler such that only
certain DAGs can access them, and in the case of files that create DAGs
dynamically, only some set of DAGs should be able to access these secrets.

e.g. if there is a secret/keytab that can be read by DAG A generated by
file X, and file X generates DAG B as well, there needs to be a scheme to
stop the parsing of DAG B on the scheduler from being able to read the
secret in DAG A.

Does that make sense?
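[Editor's sketch] A pure-Python stand-in for the scoping problem above — no Airflow imports, and the dict-shaped "DAGs" are hypothetical. It shows why per-file parsing defeats per-DAG secret isolation: both DAGs are built in the same parser process, so nothing stops code building DAG B from observing DAG A's secret.

```python
# Stand-in for a single DAG-definition file that produces two DAGs.
# In real Airflow both DAG() constructors run inside one parsing
# process, which is exactly the isolation gap described above.
SECRET_FOR_DAG_A = "fake-keytab-bytes"  # hypothetical parse-time secret

def build_dag(dag_id):
    # Imagine this returns airflow.DAG(dag_id, ...). Any code executed
    # here can observe SECRET_FOR_DAG_A, no matter which DAG it builds.
    return {"dag_id": dag_id, "can_see_secret": SECRET_FOR_DAG_A is not None}

dag_a = build_dag("dag_a")
dag_b = build_dag("dag_b")  # DAG B's build code can read DAG A's secret
```

Any isolation scheme therefore has to operate at (or below) the file-parsing boundary, not at the DAG-object boundary.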

On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin  wrote:

> I’m not sure what you mean. The example I created allows for dynamic DAGs,
> as the scheduler obviously knows about the tasks when they are ready to be
> scheduled.
> This isn’t any different from a static DAG or a dynamic one.
>
> For Kerberos it isnt that special. Basically a keytab are the revokable
> users credentials
> in a special format. The keytab itself can be protected by a password. So
> I can imagine
> that a connection is defined that sets a keytab location and password to
> access the keytab.
> The scheduler understands this (or maybe the Connection model) and
> serializes and sends
> it to the worker as part of the metadata. The worker then reconstructs the
> keytab and issues
> a kinit or supplies it to the other service requiring it (eg. Spark)
>
> * Obviously the worker and scheduler need to communicate over SSL.
> * There is a challenge at the worker level. Credentials are secured
> against other users, but are readable by the owning user. So imagine 2 DAGs
> from two different users with different connections without sudo
> configured. If they end up at the same worker if DAG 2 is malicious it
> could read files and memory created by DAG 1. This is the reason why using
> environment variables are NOT safe (DAG 2 could read /proc//environ).
> To mitigate this we probably need to PIPE the data to the task’s STDIN. It
> won’t solve the issue but will make it harder as now it will only be in
> memory.
> * The reconstructed keytab (or the initalized version) can be stored in,
> most likely, the process-keyring (
> http://man7.org/linux/man-pages/man7/process-keyring.7.html). As
> mentioned earlier this poses a challenge for Java applications that cannot
> read from this location (keytab an ccache). Writing it out to the
> filesystem then becomes a possibility. This is essentially the same how
> Spark solves it (
> https://spark.apache.org/docs/latest/security.html#yarn-mode).
>
> Why not work on this together? We need it as well. Airflow as it is now we
> consider the biggest security threat and it is really hard to secure it.
> The above would definitely be a serious improvement. Another step would be
> to stop Tasks from accessing the Airflow DB all together.
>
> Cheers
> Bolke
>
> > On 29 Jul 2018, at 05:36, Dan Davydov 
> wrote:
> >
> > This makes sense, and thanks for putting this together. I might pick this
> > up myself depending on if we can get the rest of the multi-tenancy story
> > nailed down, but I still think the tricky part is figuring out how to
> allow
> > dynamic DAGs (e.g. DAGs created from rows in a Mysql table) to work with
> > Kerberos, curious what your thoughts are there. How would secrets be
> passed
> > securely in a multi-tenant Scheduler starting from parsing the DAGs up to
> > the executor sending them off?
> >
> > On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin  wrote:
> >
> >> Here:
> >>
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections
> >>
> >> Is a working rudimentary implementation that allows securing the
> >> connections (only LocalExecutor at the moment)
> >>
> >> * It enforces the use of “conn_id” instead of the mix that we have now
> >> * A task if using “conn_id” has ‘auto-registered’ (which is a noop) its
> >> connections
> >> * The scheduler reads the connection information and serializes it to
> >> json (which should be a different format, protobuf preferably)
> >> * The scheduler then sends this info to the executor
> >> * The executor puts this in the environment of the task (environment
> most
> >> likely not secure enough for us)
> >> * The BaseHook reads out this environment variable and does not need to
> >> touch t
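[Editor's sketch] Bolke's warning about `/proc/<pid>/environ` can be illustrated with the stdlib: hand credentials to the child process over stdin instead of its environment. The payload shape and names here are illustrative, not Airflow's actual wire format.

```python
import json
import subprocess
import sys

# Hypothetical per-task connection payload from the executor.
conn = {"conn_id": "spark_default", "login": "svc", "password": "s3cret"}

# The child stands in for the task runner: it receives credentials on
# stdin, so they never appear in /proc/<pid>/environ or in its argv.
child = (
    "import sys, json; "
    "c = json.load(sys.stdin); "
    "print(c['conn_id'])"
)
proc = subprocess.run(
    [sys.executable, "-c", child],
    input=json.dumps(conn).encode(),
    capture_output=True,
)
received = proc.stdout.decode().strip()
```

As the thread notes, this only raises the bar (the secret is still in the child's memory); it does not make co-tenant tasks mutually safe.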

Re: Kerberos and Airflow

2018-07-28 Thread Dan Davydov
This makes sense, and thanks for putting this together. I might pick this
up myself depending on if we can get the rest of the multi-tenancy story
nailed down, but I still think the tricky part is figuring out how to allow
dynamic DAGs (e.g. DAGs created from rows in a Mysql table) to work with
Kerberos, curious what your thoughts are there. How would secrets be passed
securely in a multi-tenant Scheduler starting from parsing the DAGs up to
the executor sending them off?

On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin  wrote:

> Here:
>
> https://github.com/bolkedebruin/airflow/tree/secure_connections
>
> Is a working rudimentary implementation that allows securing the
> connections (only LocalExecutor at the moment)
>
> * It enforces the use of “conn_id” instead of the mix that we have now
> * A task if using “conn_id” has ‘auto-registered’ (which is a noop) its
> connections
> * The scheduler reads the connection information and serializes it to
> json (which should be a different format, protobuf preferably)
> * The scheduler then sends this info to the executor
> * The executor puts this in the environment of the task (environment most
> likely not secure enough for us)
> * The BaseHook reads out this environment variable and does not need to
> touch the database
>
> The example_http_operator works, I havent tested any other. To make it
> work I just adjusted the hook and operator to use “conn_id” instead
> of the non standard http_conn_id.
>
> Makes sense?
>
> B.
>
> * The BaseHook is adjusted to not connect to the database
> > On 28 Jul 2018, at 17:50, Bolke de Bruin  wrote:
> >
> > Well, I don’t think a hook (or task) should be obtain it by itself. It
> should be supplied.
> > At the moment you start executing the task you cannot trust it anymore
> (ie. it is unmanaged
> > / non airflow code).
> >
> > So we could change the basehook to understand supplied credentials and
> populate
> > a hash with “conn_ids”. Hooks normally call BaseHook.get_connection
> anyway, so
> > it shouldnt be too hard and should in principle not require changes to
> the hooks
> > themselves if they are well behaved.
> >
> > B.
> >
> >> On 28 Jul 2018, at 17:41, Dan Davydov  wrote:
> >>
> >> *So basically in the scheduler we parse the dag. Either from the
> manifest
> >> (new) or from smart parsing (probably harder, maybe some auto
> register?) we
> >> know what connections and keytabs are available dag wide or per task.*
> >> This is the hard part that I was curious about, for dynamically created
> >> DAGs, e.g. those generated by reading tasks in a MySQL database or a
> json
> >> file, there isn't a great way to do this.
> >>
> >> I 100% agree with deprecating the connections table (at least for the
> >> secure option). The main work there is rewriting all hooks to take
> >> credentials from arbitrary data sources by allowing a customized
> >> CredentialsReader class. Although hooks are technically private, I
> think a
> >> lot of companies depend on them so the PMC should probably discuss if
> this
> >> is an Airflow 2.0 change or not.
> >>
> >> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin  wrote:
> >>
> >>> Sure. In general I consider keytabs as a part of connection
> information.
> >>> Connections should be secured by sending the connection information a
> task
> >>> needs as part of information the executor gets. A task should then not
> need
> >>> access to the connection table in Airflow. Keytabs could then be send
> as
> >>> part of the connection information (base64 encoded) and setup by the
> >>> executor (this key) to be read only to the task it is launching.
> >>>
> >>> So basically in the scheduler we parse the dag. Either from the
> manifest
> >>> (new) or from smart parsing (probably harder, maybe some auto
> register?) we
> >>> know what connections and keytabs are available dag wide or per task.
> >>>
> >>> The credentials and connection information then are serialized into a
> >>> protobuf message and send to the executor as part of the “queue”
> action.
> >>> The worker then deserializes the information and makes it securely
> >>> available to the task (which is quite hard btw).
> >>>
> >>> On that last bit making the info securely available might be storing
> it in
> &g

Re: Kerberos and Airflow

2018-07-28 Thread Dan Davydov
*So basically in the scheduler we parse the dag. Either from the manifest
(new) or from smart parsing (probably harder, maybe some auto register?) we
know what connections and keytabs are available dag wide or per task.*
This is the hard part that I was curious about, for dynamically created
DAGs, e.g. those generated by reading tasks in a MySQL database or a json
file, there isn't a great way to do this.

I 100% agree with deprecating the connections table (at least for the
secure option). The main work there is rewriting all hooks to take
credentials from arbitrary data sources by allowing a customized
CredentialsReader class. Although hooks are technically private, I think a
lot of companies depend on them so the PMC should probably discuss if this
is an Airflow 2.0 change or not.
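[Editor's sketch] The CredentialsReader idea in the paragraph above might look roughly like this. The class and the `TASK_CONN_*` environment variable are hypothetical — nothing of this shape exists in Airflow as of this thread; it only shows hooks resolving credentials through a pluggable reader instead of the connection table.

```python
import json
import os
from abc import ABC, abstractmethod

class CredentialsReader(ABC):
    """Hypothetical pluggable source of connection credentials."""
    @abstractmethod
    def get_connection(self, conn_id: str) -> dict:
        ...

class EnvCredentialsReader(CredentialsReader):
    """Reads a JSON payload the executor placed in the task environment."""
    def get_connection(self, conn_id: str) -> dict:
        return json.loads(os.environ[f"TASK_CONN_{conn_id.upper()}"])

# Simulate the executor exporting a connection for one task.
os.environ["TASK_CONN_HTTP_DEFAULT"] = json.dumps(
    {"host": "example.com", "login": "svc", "password": "s3cret"}
)
reader = EnvCredentialsReader()
conn = reader.get_connection("http_default")
```

A database-backed reader would then be just another subclass, which is why the rewrite touches every hook but not the hook call sites.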

On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin  wrote:

> Sure. In general I consider keytabs as a part of connection information.
> Connections should be secured by sending the connection information a task
> needs as part of information the executor gets. A task should then not need
> access to the connection table in Airflow. Keytabs could then be send as
> part of the connection information (base64 encoded) and setup by the
> executor (this key) to be read only to the task it is launching.
>
> So basically in the scheduler we parse the dag. Either from the manifest
> (new) or from smart parsing (probably harder, maybe some auto register?) we
> know what connections and keytabs are available dag wide or per task.
>
> The credentials and connection information then are serialized into a
> protobuf message and send to the executor as part of the “queue” action.
> The worker then deserializes the information and makes it securely
> available to the task (which is quite hard btw).
>
> On that last bit making the info securely available might be storing it in
> the Linux KEYRING (supported by python keyring). Keytabs will be tough to
> do properly due to Java not properly supporting KEYRING and only files and
> these are hard to make secure (due to the possibility a process will list
> all files in /tmp and get credentials through that). Maybe storing the
> keytab with a password and having the password in the KEYRING might work.
> Something to find out.
>
> B.
>
> Verstuurd vanaf mijn iPad
>
> > Op 27 jul. 2018 om 22:04 heeft Dan Davydov 
> het volgende geschreven:
> >
> > I'm curious if you had any ideas in terms of ideas to enable
> multi-tenancy
> > with respect to Kerberos in Airflow.
> >
> >> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin 
> wrote:
> >>
> >> Cool. The doc will need some refinement as it isn't entirely accurate.
> In
> >> addition we need to separate between Airflow as a client of kerberized
> >> services (this is what is talked about in the astronomer doc) vs
> >> kerberizing airflow itself, which the API supports.
> >>
> >> In general to access kerberized services (airflow as a client) one needs
> >> to start the ticket renewer with a valid keytab. For the hooks it isn't
> >> always required to change the hook to support it. Hadoop cli tools often
> >> just pick it up as their client config is set to do so. Then another
> class
> >> is there for HTTP-like services which are accessed by urllib under the
> >> hood, these typically use SPNEGO. These often need to be adjusted as it
> >> requires some urllib config. Finally, there are protocols which use SASL
> >> with kerberos. Like HDFS (not webhdfs, that uses SPNEGO). These require
> per
> >> protocol implementations.
> >>
> >> From the top of my head we support kerberos client side now with:
> >>
> >> * Spark
> >> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
> >> implementation)
> >> * Hive (not metastore afaik)
> >>
> >> Two things to remember:
> >>
> >> * If a job (ie. Spark job) will finish later than the maximum ticket
> >> lifetime you probably need to provide a keytab to said application.
> >> Otherwise you will get failures after the expiry.
> >> * A keytab (used by the renewer) are credentials (user and pass) so jobs
> >> are executed under the keytab in use at that moment
> >> * Securing keytab in multi tenancy airflow is a challenge. This also
> goes
> >> for securing connections. This we need to fix at some point. Solution
> for
> >> now seems to be no multi tenancy.
> >>
> >> Kerberos seems harder than it is btw. Still, we are sometimes moving
> away
> >> from it to OAUTH2 based authentication. This gets use closer to cloud
> >> standards 

Re: Give it up for Fokko!

2018-04-16 Thread Dan Davydov
Agreed, thanks for being super active reviewing PRs and JIRAs!

On Sat, Apr 14, 2018 at 12:48 AM Bolke de Bruin  wrote:

> Hear hear Fokko. So I assume you will be joining us in SF around October
> ;-)??
>
> > On 14 Apr 2018, at 09:16, Driesprong, Fokko 
> wrote:
> >
> > Thanks you all for the kind words. I really love the energetic community
> > around Airflow and all the cool stuff that we're working on. I find it
> > truly amazing how we build such an awesome product with great people from
> > all around the world!
> >
> > Cheers, Fokko
> >
> > 2018-04-14 1:56 GMT+02:00 Sid Anand :
> >
> >> +100
> >>
> >> On Fri, Apr 13, 2018 at 4:08 PM, Alex Tronchin-James 949-412-7220 <
> >> alex.n.ja...@gmail.com> wrote:
> >>
> >>> Bravo!!! Bien fait!
> >>>
> >>> On Fri, Apr 13, 2018 at 3:54 PM Joy Gao  wrote:
> >>>
>  
> 
>  On Fri, Apr 13, 2018 at 11:47 AM, Naik Kaxil 
> wrote:
> 
> > Couldn't agree more. Thanks Fokko
> >
> > On 13/04/2018, 17:56, "Maxime Beauchemin" <
> >> maximebeauche...@gmail.com
> 
> > wrote:
> >
> >Hey all,
> >
> >I wanted to point out the amazing work that Fokko is doing,
> >reviewing/merging PRs and doing fantastic committer & maintainer work.
> >It takes a variety of contributions to make projects like Airflow
> >thrive, but without this kind of involvement it wouldn't be possible
> >to keep shipping better versions of the product steadily.
> >
> >Cheers to that!
> >
> >Max
> >
> >
> >
> >
> >
> >
> > Kaxil Naik
> >
> > Data Reply
> > 38 Grosvenor Gardens
> > London SW1W 0EB - UK
> > phone: +44 (0)20 7730 6000
> > k.n...@reply.com
> > www.reply.com
> >
> 
> >>>
> >>
>
>


Re: Airflow Scalability with Local Executor

2018-03-28 Thread Dan Davydov
The LocalExecutor is great for running small numbers of DAGs/tasks, but it
is more of a starter executor meant to make Airflow work out of the box. I
would recommend switching to a different executor like the CeleryExecutor.

You are certainly right that there is room for reducing the memory
footprint of each Airflow process (though I'm not too sure how much can be
done about the CPU usage, could be a function of how your DAGs are parsed).
Even if you fix the current bottlenecks you will likely run into more.

On Wed, Mar 28, 2018 at 7:13 AM ramandu...@gmail.com 
wrote:

> Hi All,
> We have a use case to support 1000 concurrent DAGs. These dags would
> have a couple of Http tasks which would be submitting jobs to external
> services. Each DAG could run for couple of hours.
> HTTP tasks are periodically checking (with sleep 20) the job status.
> We tried running 1000 such dags(Parallelism set to 1000) with Airflow's
> LocalExecutor Mode but after 100 concurrent runs, tasks started failing due
> to
> --> OOM error
> --> Scheduler marked them failed because of lack of heartbeat.
> We are using 4 cores and 16 GB RAM. Each airflow worker is taking ~250 MB
> of Virtual memory and ~60 MB of RES memory which seems to be on higher
> side. CPU utilisation is also ~98%.
> Is there anything that can be done to optimise Memory/CPU for airflow
> worker.
> Any pointer to airflow benchmarking with LocalExecutor would also be
> helpful
>
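[Editor's note] For reference, moving off the LocalExecutor is an airflow.cfg change along these lines. Values are illustrative, exact key names vary across Airflow versions, and a Celery broker (Redis or RabbitMQ) plus worker machines must be provisioned separately:

```ini
[core]
executor = CeleryExecutor
# Upper bound on task instances running across the whole installation.
parallelism = 1000

[celery]
broker_url = redis://redis-host:6379/0
result_backend = db+mysql://airflow:***@mysql-host/airflow
# Tasks each worker runs concurrently; add workers to scale horizontally.
worker_concurrency = 16
```

This moves task execution off the scheduler host, which addresses the OOM and missed-heartbeat failures described above.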


Re: 4/17 Airflow Meetup Slides

2017-12-27 Thread Dan Davydov
Thanks Sid!

On Wed, Dec 27, 2017 at 3:13 PM Sid Anand <san...@apache.org> wrote:

> Thx for posting the slides : I've added them to the Announcements page :
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements#Announcements-Nov1,2017
>
> On Thu, Dec 7, 2017 at 12:09 PM, Dan Davydov <dan.davy...@airbnb.com
> .invalid
> > wrote:
>
> > Here <https://drive.google.com/open?id=154jnUADKfrHXLUDvJBQOl8aiBwpaw9-w>
> > are the presentations from the 4/17 Airflow meetup at Airbnb.
> >
> > Unfortunately we weren't able to get the recording software working for
> the
> > talk.
> >
> > Thank you all for coming!
> >
>


Re: Switching minimum support version from python 3.4 -> 3.5

2017-12-19 Thread Dan Davydov
Sounds good to me

On Tue, Dec 19, 2017 at 10:18 AM Chris Riccomini 
wrote:

> :thumbsup:
>
> On Tue, Dec 19, 2017 at 5:19 AM, Bolke de Bruin  wrote:
>
> > Hi All,
> >
> > We have some issues on Travis with issues around distributed and task,
> > these are related to Python 3.4. As Python 3.4 is not very popular (
> > https://user-images.githubusercontent.com/306380/
> > 29750903-f891cb2a-8b15-11e7-84cc-e26ce5b1e095.png) I am switching the
> > builds to Python 3.5 as a minimum.
> >
> > Please let me know if you think that is a bad idea.
> >
> > Cheers
> > Bolke
> >
> >
> >
>


4/17 Airflow Meetup Slides

2017-12-07 Thread Dan Davydov
Here 
are the presentations from the 4/17 Airflow meetup at Airbnb.

Unfortunately we weren't able to get the recording software working for the
talk.

Thank you all for coming!


Re: Experimental API

2017-10-30 Thread Dan Davydov
FWIW I am hoping we can change this insecure-by-default behavior for 2.0, and there
is already some stuff in the Airflow config that lets you do this out of
the box if you tweak a couple of config values (e.g. check out secure_mode
that we can hopefully build upon).

On Mon, Oct 30, 2017 at 3:22 PM Bolke de Bruin  wrote:

> Hi All,
>
> Airflow out of the box comes without security configured. This goes for
> both the API and the UI. Currently, the API and the UI make use of
> different authentication backends due to the way authentication needed to
> be implemented. This should be better documented.
>
> So while “the web ui is protected, thus automatically the API as well” is
> the ideal situation, it is not an oversight and “not something has gone
> wrong”.
>
> Some part of this is technical debt. Which we probably won’t solve until
> the move towards FlaskApplicationBuilder, hopefully not too far out. That
> being said we might choose to have an Rest API as a separate service from
> the WebUI.
>
> Cheers
> Bolke
>
>
>
> > On 30 Oct 2017, at 16:42, Ash Berlin-Taylor <
> ash_airflowl...@firemirror.com> wrote:
> >
> > Oh gods.
> >
> > Something has gone wrong - the methods are decorated with
> `@requires_authentication` but they... don't. Oh, because the default
> backend doesn't do any authentication or protection at all.
> >
> > > I think this is CVE-worthy - using the User+Password auth for the web
> front end/using default config should not leave the API unprotected. I
> think the default API auth backend should deny all rather than allow all?
> >
> > -ash
> >
> >> On 30 Oct 2017, at 08:51, Niels Zeilemaker <
> nielszeilema...@godatadriven.com> wrote:
> >>
> >> Hi All,
> >>
> >> I've implemented HTTP Basic Authentication for the experiment API, see
> https://github.com/apache/incubator-airflow/pull/2730. This seems to work
> fine.
> >> However, while implementing this. I noticed, to my surprise, that the
> experimental API was open even though we enabled Password authentication
> for the web-interface.
> >> This seems like a bug to me, as one would expect that the experimental
> API would use the same auth backend as the web-interface.
> >>
> >> Why did Airflow choose to split the authentication for the
> web-interface  and experimental API?
> >> And if it's not possible to combine those, is it possible to lock down
> the experimental API if one chooses a non-default web-interface auth
> backend?
> >>
> >> Niels
> >> Ps with an unsecured experimental api it is possible to trigger dags,
> list pools, delete pools, etc.
> >
>
>
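[Editor's note] The config knobs alluded to above look roughly like this in airflow.cfg. Key names are from the 1.x line; `deny_all` only appeared in later 1.10 releases, and earlier versions default the API backend to allow-all:

```ini
[core]
# Disables some insecure UI features (charts, ad-hoc queries).
secure_mode = True

[api]
# The default backend allows every request; deny_all (or a real auth
# backend) closes the unauthenticated-API hole described above.
auth_backend = airflow.api.auth.backend.deny_all

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
```

Note the web UI and API backends are configured independently, which is exactly how the surprise in this thread arose.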


Airflow Meetup @ Airbnb - Mon Dec 4th

2017-10-25 Thread Dan Davydov
Hey guys, we are doing an Airflow meetup at Airbnb on December 4th!

All are welcome, and food will be provided.

Please RSVP and see the details/agenda at
https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/244525050/

Sincerely,
Dan


Re: Meetup Interest?

2017-10-16 Thread Dan Davydov
Glad to see there is interest! I'll work on setting this up.

On Sun, Oct 15, 2017 at 10:47 AM Cade Markegard <cademarkeg...@gmail.com>
wrote:

> +1 at SF meetup
>
> Would be interested in seeing that progress of airflow + k8s and any other
> advancements the community has made.
> On Sun, Oct 15, 2017 at 9:24 AM Feng Lu <fen...@google.com.invalid> wrote:
>
> > +1
> >
> > We can give an update on task secret management in K8SExecutor and also
> > want to share our thoughts and get feedback on Airflow CI/CD with the set
> > of GCP operators/hooks as an example.
> >
> > On Sat, Oct 14, 2017 at 7:06 PM, Marc Bollinger <m...@lumoslabs.com>
> > wrote:
> >
> > > +1
> > >
> > > We'd definitely be in. Would love to chat more about K8s/Airflow--Data
> > Eng
> > > has been a little twitchy about being the guinea pigs in our org, but
> the
> > > production app is now serving all traffic from it, so we're planning
> out
> > > our strategy.
> > >
> > > On Fri, Oct 13, 2017 at 1:29 PM, Daniel Imberman (BLOOMBERG/ SAN FRAN)
> <
> > > dimber...@bloomberg.net> wrote:
> > >
> > > > +1
> > > >
> > > > We're getting really close on the Kubernetes Executor PR. Would love
> to
> > > > discuss final features/architecture to make sure we cover our bases
> > > before
> > > > we try to roll out alpha.
> > > >
> > > >
> > > > From: mw...@newrelic.com
> > > > Subject: Re: Meetup Interest?
> > > >
> > > > +1 for this meetup idea! We don't use Kube+Airflow, but I'd love to
> see
> > > > talks on scaling it out team-wise and some design patterns people
> have
> > > come
> > > > up with.
> > > >
> > > > --
> > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > Engagement
> > > |
> > > > New Relic
> > > > On Fri, Oct 13, 2017 at 1:03 PM, Christopher Bockman <
> > > > ch...@fathomhealth.co> wrote:
> > > >
> > > > +1 as a vote.
> > > >
> > > > We're very actively working on Kube+Airflow, so would be particularly
> > > > interested on discussions there.
> > > >
> > > > On Fri, Oct 13, 2017 at 12:59 PM, Joy Gao <j...@wepay.com> wrote:
> > > >
> > > > > Hi Dan,
> > > > >
> > > > > I'd be happy to give an update on progress of the new RBAC UI we've
> > > been
> > > > > working on here at WePay.
> > > > >
> > > > > Cheers,
> > > > > Joy
> > > > >
> > > > > On Fri, Oct 13, 2017 at 12:10 PM, Dan Davydov <
> > > > > dan.davy...@airbnb.com.invalid> wrote:
> > > > >
> > > > > > Is there interest in doing an Airflow meet-up? Airbnb can host
> one
> > in
> > > > San
> > > > > > Francisco.
> > > > > >
> > > > > > Some talk ideas can include the progress on Kubernetes
> integration
> > > and
> > > > > > Scaling & Operations with Airflow. If you want to see other
> topics
> > > > > covered,
> > > > > > feel free to suggest them!
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> >
>


Meetup Interest?

2017-10-13 Thread Dan Davydov
Is there interest in doing an Airflow meet-up? Airbnb can host one in San
Francisco.

Some talk ideas can include the progress on Kubernetes integration and
Scaling & Operations with Airflow. If you want to see other topics covered,
feel free to suggest them!


Re: Apache Airflow welcome new committer/PMC member : Fokko Driespong (a.k.a. fokko)

2017-10-04 Thread Dan Davydov
Welcome!

On Wed, Oct 4, 2017 at 2:31 PM Maxime Beauchemin 
wrote:

> Welcome on board Fokko!
>
> Max
>
> On Wed, Oct 4, 2017 at 2:18 PM, Chris Riccomini 
> wrote:
>
> > Welcome!!
> >
> > On Wed, Oct 4, 2017 at 12:51 PM, Sid Anand  wrote:
> >
> > > Folks,
> > > Please join the Apache Airflow PMC in welcoming its newest member and
> > > co-committer, Fokko Driespong (a.k.a. fokko  >).
> > >
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/
> > > Announcements#Announcements-Oct1,2017
> > >
> > >
> > > -s
> > >
> >
>


Re: Some random fun

2017-09-25 Thread Dan Davydov
Haha, this is great.

On Mon, Sep 25, 2017 at 11:37 AM Shah Altaf  wrote:

> **CupcakeSensor activated**
>
>
>
> On Mon, Sep 25, 2017 at 7:31 PM Laura Lorenz 
> wrote:
>
> > Just thought everyone here would appreciate the nerdy party our data team
> > threw ourselves for completing a milestone on a difficult DAG recently.
> We
> > played Pin the Task on the DAG and ate Task State cupcakes: see pics at
> > https://twitter.com/lalorenz6/status/912383049354096641
> >
>


Re: Airflow 1.8.2 released

2017-09-05 Thread Dan Davydov
+1, this was a lot more work than anticipated.

On Tue, Sep 5, 2017 at 10:10 AM Chris Riccomini 
wrote:

> Thanks so much for slogging through this!
>
> On Mon, Sep 4, 2017 at 10:26 AM, Sumit Maheshwari 
> wrote:
>
> > Awesome!!
> >
> > Thanks a lot Max for being the RM for this release.
> >
> >
> >
> > On Mon, Sep 4, 2017 at 10:51 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > Dear Airflow community,
> > >
> > > Airflow 1.8.2 was just released.
> > >
> > > The source release as well as the binary "sdist" release are available
> > > here:
> > > https://dist.apache.org/repos/dist/release/incubator/
> > > airflow/1.8.2-incubating/
> > >
> > > We also made this version available on Pypi for convenience (`pip
> install
> > > apache-airflow`):
> > > https://pypi.python.org/pypi/apache-airflow
> > >
> > > Note that 1.8.2 is a minor release that is several months behind the
> > > current `master` branch. We're trying to increase our release cadence
> as
> > we
> > > iron out the process and Apache requirements. The process requires a
> fair
> > > amount of back and forth with the community and Apache, and I have to
> > admit
> > > that I wasn't exactly on top of it as the person in charge of this
> > release.
> > > Some of the work I've done around the LICENSE files should make future
> > > releases easier though, so that's a positive thing.
> > >
> > > Find the CHANGELOG here for more details:
> > > https://github.com/apache/incubator-airflow/pull/2562
> > >
> > > Also note that 1.9.0rc1 will be cut off of master shortly and should
> > > include all of the latest development.
> > >
> > > Enjoy!
> > >
> > > Max
> > >
> >
>


Re: Airflow + Kubernetes update meeting

2017-09-05 Thread Dan Davydov
Works for me as well!

On Tue, Sep 5, 2017 at 10:43 AM Daniel Imberman 
wrote:

> @Marc we will make sure to record the meeting/supply notes. This should be
> a pretty straightforward update/overview meeint.
> @ChrisB this meeting will be a virtual meeting, though Bloomberg is
> definitely interested in hosting an airflow meetup at our SF location if
> there is sufficient interest :).
> @ChrisR Great to hear :). We've been working with members of the openshift
> community so we can definitely speak to those requirements.
>
>
> On Tue, Sep 5, 2017 at 10:24 AM Feng Lu  wrote:
>
>> +1, either way works for me.
>>
>> On Tue, Sep 5, 2017 at 10:10 AM, Chris Riccomini 
>> wrote:
>>
>> > Works for me.
>> >
>> > On Tue, Sep 5, 2017 at 7:44 AM, Grant Nicholas > > northwestern.edu> wrote:
>> >
>> >> +1 for me if it works with others.
>> >>
>> >> On Mon, Sep 4, 2017 at 11:02 PM, Anirudh Ramanathan <
>> >> ramanath...@google.com> wrote:
>> >>
>> >>> Date/time work for me if we get quorum from this group.
>> >>>
>> >>> On Thu, Aug 31, 2017 at 7:54 PM, Christopher Bockman <
>> >>> ch...@fathomhealth.co> wrote:
>> >>>
>>  Hi Daniel, would this be remote or in person?
>> 
>> 
>>  On Aug 31, 2017 4:16 PM, "Daniel Imberman" <
>> daniel.imber...@gmail.com>
>>  wrote:
>> 
>>  Hey guys!
>> 
>>  So I wanted to set up a meeting to discuss some of the
>> updates/current
>>  work
>>  that is going on with both the kubernetes operator and kubernetes
>>  executor
>>  efforts. There has been some really cool updates/proposals on the
>>  design of
>>  these two features and I would love to get some community feedback to
>>  make
>>  sure that we are taking this in a direction that benefits everyone.
>> 
>>  I am thinking of having this meeting at 10:00AM on Thursday,
>> September
>>  7th
>>  PST. Would this time/place work?
>> 
>>  Thanks!
>> 
>>  Daniel
>> 
>> 
>> 
>> >>>
>> >>>
>> >>> --
>> >>> Anirudh Ramanathan
>> >>>
>> >>
>> >>
>> >
>>
>


Re: Airflow + Kubernetes Talk video

2017-07-28 Thread Dan Davydov
Thanks for organizing, and leading this effort in general!

On Fri, Jul 28, 2017 at 2:40 PM, Daniel Imberman 
wrote:

> Hi guys!
>
> Thank you again to everyone who attended the talk yesterday. I've posted
> the video of the conversation to youtube, and will soon add the video and
> slides to the airflow Wiki
>
> Cheers,
> Daniel
>
> https://www.youtube.com/watch?v=5BU3YPYYRno
>


Re: Role Based Access Control for Airflow UI

2017-07-25 Thread Dan Davydov
> > > > >>>>>> B.
> > > > > >>>>>>
> > > > > >>>>>>> On 22 Jun 2017, at 09:36, Bolke de Bruin <
> bdbr...@gmail.com>
> > > > > >>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> Hi Guys,
> > > > > >>>>>>>
> > > > > >>>>>>> Thanks for putting the thinking in! It is about time that
> we
> > > get
> > > > > >>> this
> > > > > >>>>>> moving.
> > > > > >>>>>>>
> > > > > >>>>>>> The design looks pretty sound. One can argue about the
> > > different
> > > > > >>>> roles
> > > > > >>>>>> that are required, but that will be situation dependent I
> > guess.
> > > > > >>>>>>>
> > > > > >>>>>>> Implementation wise I would argue together with Max that
> FAB
> > > is a
> > > > > >>>>> better
> > > > > >>>>>> or best fit. The ER model that is being described is pretty
> > > much a
> > > > > >>> copy
> > > > > >>>>> of
> > > > > >>>>>> a normal security model. So a reimplementation of that is 1)
> > > > > >>>> significant
> > > > > >>>>>> duplication of effort and 2) bound to have bugs that have
> been
> > > > > >> solved
> > > > > >>>> in
> > > > > >>>>>> the other framework. Moreover, FAB does have integration out
> > of
> > > > the
> > > > > >>> box
> > > > > >>>>>> with some enterprisey systems like IPA, ActiveDirectory, and
> > > LDAP.
> > > > > >>>>>>>
> > > > > >>>>>>> So while you argue that using FAB would increase the scope
> of
> > > the
> > > > > >>>>>> proposal significantly, but I think that is not true. Using
> > FAB
> > > > > >> would
> > > > > >>>>> allow
> > > > > >>>>>> you to focus on what kind of out-of-the-box permission sets
> > and
> > > > > >> roles
> > > > > >>>> we
> > > > > >>>>>> would need and maybe address some issues that FAB lacks
> (maybe
> > > how
> > > > > >> to
> > > > > >>>>> deal
> > > > > >>>>>> with non web access - ie. in DAGs, maybe Kerberos, probably
> > how
> > > to
> > > > > >>> deal
> > > > > >>>>>> with API calls that are not CRUD). Implementation wise it
> > > probably
> > > > > >>>>>> simplifies what we need to do. Maybe - using Max’s early POC
> > as
> > > an
> > > > > >>>>> example
> > > > > >>>>>> - we can slowly move over?
> > > > > >>>>>>>
> > > > > >>>>>>> On a side note: Im planning to hire 2-3 ppl to work on
> > Airflow
> > > > > >>> coming
> > > > > >>>>>> year. Improvement of Security, Enterprise Integration,
> Revamp
> > UI
> > > > > >> are
> > > > > >>> on
> > > > > >>>>> the
> > > > > >>>>>> todo list. However, this is not confirmed yet as business
> > > > > >> priorities
> > > > > >>>>> might
> > > > > >>>>>> change.
> > > > > >>>>>>>
> > > > > >>>>>>> Bolke.
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>> On 15 Jun 2017, at 21:45, kalpesh dharwadkar <
> > > > > >>>>>> kalpeshdharwad...@gmail.com> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> @Dan:
> > > > > >>>>>>>>
> > > > &

Re: What argument does -A / --ignore_all_dependencies expect when triggering via airflow run?

2017-07-12 Thread Dan Davydov
Probably caused by the issue being fixed here:
https://github.com/apache/incubator-airflow/pull/2327
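The "expected one argument" error in the question quoted below is characteristic of an argparse option defined without a boolean action. A minimal standalone sketch of the symptom and the presumable fix (a toy parser, not Airflow's actual CLI module):

```python
import argparse

# Buggy definition: without an action, argparse treats -A as taking a
# value, which yields "expected one argument" when the flag is used bare.
buggy = argparse.ArgumentParser()
buggy.add_argument('-A', '--ignore_all_dependencies')

# Presumable fix: store_true turns the flag into a plain boolean switch,
# so `airflow run -A ...` would work without an extra token after -A.
fixed = argparse.ArgumentParser()
fixed.add_argument('-A', '--ignore_all_dependencies',
                   action='store_true', default=False)

args = fixed.parse_args(['-A'])
print(args.ignore_all_dependencies)  # True
```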

On Wed, Jul 12, 2017 at 4:25 AM, Tobias Feldhaus <
tobias.feldh...@localsearch.ch> wrote:

> I am trying to force airflow to run a task so that the depends_on_past
> setting of the DAG is honoured and I can get it to run the rest of it:
>
> I have tried using the -A / --ignore_all_dependencies parameter of the
> airflow run command, but I don’t know what argument it does expect:
>
>
>
> airflow run -l -A IGNORE_ALL_DEPENDENCIES -f google_pipelines
> search_log_sensor 2017-07-01
>
> airflow run -l --ignore_all_dependencies IGNORE_ALL_DEPENDENCIES -f
> google_pipelines search_log_sensor 2017-07-01
>
> both give me:
>
> airflow run: error: argument -A/--ignore_all_dependencies: expected one
> argument
>
>
> Am I using it wrong?
>
>
>
> Best,
> Tobias
>
>


Re: airflow backfill seems to ignore -I

2017-07-12 Thread Dan Davydov
Airflow dependencies were simplified a bit: -i no longer ignores failed
state tasks. Check out the -A flag, which ignores pretty much all
dependencies (including the failed state tasks), though depending on the
version you are using there is a bug that is being fixed here:
https://github.com/apache/incubator-airflow/pull/2327

On Wed, Jul 5, 2017 at 8:45 AM, Weiwei Zhang  wrote:

> I am using airflow 1.8.1 as well. It is able to pick up the rest of the
> tasks when using backfill with the only exception which is when there is a
> task failed and I had to clear the status to allow the backfill to work.
> Any ideas why it is behaving like this? The previous version 1.6.2 didn't
> require clearing the failed task before doing backfill.
>
> Thx a lot,
> Viv
>
> > On Jul 5, 2017, at 7:38 AM, Tobias Feldhaus <
> tobias.feldh...@localsearch.ch> wrote:
> >
> > I've just pulled the newest master and built it; the behaviour is the
> > same. How can it be that "-i" is not honoured and dependencies are checked?
> >
> >
> > On 05.07.2017, 15:49, "Tobias Feldhaus" 
> wrote:
> >
> >But nonetheless, is it not possible to backfill and ignore the
> upstream dependencies with “-i” ?
> >
> >On 05.07.2017, 14:34, "Tobias Feldhaus"  ch> wrote:
> >
> >I meant -i, but I just needed to manually set the upstream
> things to success and it worked. Nevermind.
> >
> >Best,
> >Tobi
> >
> >On 05.07.2017, 14:28, "Tobias Feldhaus" <
> tobias.feldh...@localsearch.ch>
> wrote:
> >
> >Hi,
> >
> >When running airflow (1.8.1) backfill with -I and -t like:
> >
> >airflow backfill -t 'nonspider_sessions' -i -I -s 2017-05-30 -e
> 2017-05-31 google_pipelines
> >
> >I would expect it to rerun that specific task and ignoring the
> dependencies. Instead I see this:
> >
> >[2017-07-05 12:23:30,419] {base_task_runner.py:95} INFO -
> Subtask: [2017-07-05 12:23:30,419] {models.py:1145} INFO - Dependencies not
> met for  05:30:00 [queued]>, dependency 'Trigger Rule' FAILED: Task's trigger rule
> 'all_success' requires all upstream tasks to have succeeded, but found 3
> non-success(es). upstream_tasks_state={'successes': 0L, 'failed': 0L,
> 'upstream_failed': 0L, 'skipped': 0L, 'done': 0L},
> upstream_task_ids=['frontend_sensor', 'log_sensor', 'tracker_pipeline']
> >
> >Am I doing it wrong?
> >
> >
> >
> >Best,
> >Tobi
> >
> >
> >
> >
>


Re: Airflow Logging Improvements

2017-06-22 Thread Dan Davydov
I don't think Allison's PR fixes logging, but it's a step in the right
direction. The current approach creates an abstraction around reading logs,
whereas the final solution should define an interface for writings to logs
in addition to reading logs (which could indeed use something like
https://github.com/cmanaha/python-elasticsearch-logger for the writing
part). I agree we should move the logging towards something like log4j
(with context awareness of task id/dag id/execution date/attempt #). If
there are incompatibilities with this approach and the log4j solution (or
reasons why it would be difficult to port the PR over to the final model),
we should definitely address those concerns, but otherwise I still feel
this is a step in the right direction.

The concept of "attempt" is needed regardless of logging, the way retries
are stored/handled right now is not very sane.
- Old TI state is permanently deleted
- In the task logs you get "Try 1/6"... 2/6... 3/6... 1/6... 2/6 in logs
which doesn't make sense (if it's the Nth time the task is running it
should be logged as the Nth time). I recall other strange behaviors in
these log lines too (maybe something like Try 4/3).
- The "primary key" for a task instance run is not complete (which is what
Allison's logging change needs), you could say that TaskInstance should
only keep track of the latest TaskInstance run, but we still want to store
all tries for a task instance somewhere in the database, and we still need
to key this off of "attempt".
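The log4j-style context awareness mentioned above (task id / dag id / execution date / attempt) can be approximated with the stdlib LoggerAdapter; a rough sketch, with the format string and all context values hypothetical:

```python
import logging

# Formatter that expects task context fields on every record.
fmt = logging.Formatter(
    '%(asctime)s %(dag_id)s.%(task_id)s %(execution_date)s '
    'try=%(try_number)s %(levelname)s %(message)s')

handler = logging.StreamHandler()
handler.setFormatter(fmt)
logger = logging.getLogger('airflow.task.demo')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# LoggerAdapter injects the context into each record, so any handler
# (file, syslog, an Elasticsearch shipper) sees the same keys.
task_logger = logging.LoggerAdapter(logger, {
    'dag_id': 'example_dag',
    'task_id': 'example_task',
    'execution_date': '2017-06-22T00:00:00',
    'try_number': 2,
})
task_logger.info('starting attempt')
```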

On Thu, Jun 22, 2017 at 11:00 AM, Allison Wang <allisonwang...@gmail.com>
wrote:

> Hi Bolke,
>
> I agree that we should make logging configurable but I wouldn't think
> using handlers like python-elasticsearch-logger is a good idea over
> flushing logs into files. Here are some reasons:
>
>1. Such handlers do not have the built-in backpressure-sensitive
>    protocol that can prevent overwhelming Elasticsearch.
>    2. Logs will be lost if the Elasticsearch cluster is down for reasons like
>upgrading.
>
> In general, it's not good practice for a Python logger to talk directly to
> Elasticsearch. Flushing logs into files gives us more flexibility to use
> tools like Filebeat and Logstash to collect and index logs into
> ElasticSearch.
>
> Thanks,
> Allison
>
> On Thu, Jun 22, 2017 at 12:05 AM Bolke de Bruin <bdbr...@gmail.com> wrote:
>
>> In the light of fixing logging, I would definitely appreciate a written
>> design. Especially as there have been multiple attempts to fix some issues,
>> but these have been more like stop-gap fixes.
>>
>> In my opinion Airflow should not stipulate in a hard coded fashion where
>> and how logging takes place. It should behave more like ‘log4j’
>> configurations. So it should not just use “dag_id + task+id +
>> execution_date” and write this to an arbitrary location on the filesystem.
>> I could imagine a settings file “logging.conf” that setups something like
>> this:
>>
>> [logger_scheduler]
>> level = INFO
>> handler = stderr
>> qualname = airflow.scheduler
>> formatter=scheduler_formatter
>>
>> In airflow.cfg it should allow setting something like this:
>>
>> [scheduler]
>> use_syslog = True
>> syslog_log_facility = LOG_LOCAL0
>>
>> To allow logging to syslog so it can be moved to a centralised location
>> if required (syslog being a special case afaik).
>>
>> Elasticsearch and any other backend can then just be a handler and we can
>> remove the custom stuff that is proposed in PR
>> https://github.com/apache/incubator-airflow/pull/2380 by
>> https://github.com/cmanaha/python-elasticsearch-logger for example.
>>
>> I then can be convinced to add something like “attempt”, but probably
>> there are more friendly ways to solve it at that time. In addition
>> ‘attempts' should then imho not be managed by the task or cli, but rather
>> by the executor as that is the process which “attempts” a task.
>>
>> Bolke.
>>
>>
>> > On 22 Jun 2017, at 01:21, Dan Davydov <dan.davy...@airbnb.com> wrote:
>> >
>> > Responding to some of Bolke's concerns in the github PR for this change:
>> >
>> > > Mmm still not convinced. Especially on elastic search it is just
>> easier to use the start_date to shard on.
>> > sharding on start_date isn't great because there is still some risk of
>> collisions and it means that we are coupling the primary key with
>> start_date unnecessarily (e.g. hypothetically you could allow two tasks to

Re: Airflow Logging Improvements

2017-06-21 Thread Dan Davydov
Responding to some of Bolke's concerns in the github PR for this change:

> Mmm still not convinced. Especially on elastic search it is just easier
to use the start_date to shard on.
sharding on start_date isn't great because there is still some risk of
collisions and it means that we are coupling the primary key with
start_date unnecessarily (e.g. hypothetically you could allow two tasks to
run at the same in Airflow and in this case start_date would no longer be a
valid primary key), using monotonically increasing IDs for DB entries like
this is pretty standard practice.

> In addition I'm very against the managing of log files this way. Log
files are already a mess and should be refactored to be consistent and to
be managed from one place.

I agree about the logging mess, and there seem to have been efforts
attempting to fix this but they have all been abandoned so we decided to
move ahead with this change. I need to take a look at the PR first, but
this change should actually make logging less messy, since it should add an
abstraction for logging modules, and because you know exactly which try
numbers (and how many) ran on which workers from the file path. The log
folder structure already kind of mimicked the primary key of the
task_instance table (dag_id + task_id + execution_date), but really
try_number logically belongs in this key as well (at least for the key for
log files).


> The docker packagers can already not package airflow correctly without
jumping through hoops. Arbitrarily naming it certainly does not help here.

If this is referring to the // in the path, I don't think this
is arbitrarily naming it. A log "unit" really should be a single task run
(not an arbitrary grouping of a variable number of multiple runs), and each
unit should have a unique key or location. One of the reasons we are
working on this effort is to actually make Airflow play nicer with
Kubernetes/Docker (since airflow workers should ideally be ephemeral), and
allowing a separate service to read and ship the logs is necessary in this
case since the logs will be destroyed along with the worker instance. I
think in the future we should also allow custom logging modules (e.g.
directly writing logs to some service).
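The per-try log layout described above — the task_instance key (dag_id + task_id + execution_date) plus try_number — could look like the following sketch, where the base directory and naming are purely illustrative:

```python
import os

def task_log_path(base_dir, dag_id, task_id, execution_date, try_number):
    """One log file per task run: the TI primary key plus the try number.

    Each attempt gets its own file, so an ephemeral worker can ship a
    completed file elsewhere before the worker instance is destroyed.
    """
    return os.path.join(base_dir, dag_id, task_id,
                        execution_date, '{}.log'.format(try_number))

path = task_log_path('/var/log/airflow', 'example_dag', 'load_task',
                     '2017-06-21T00:00:00', 1)
print(path)
```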



On Wed, Jun 21, 2017 at 3:11 PM, Allison Wang 
wrote:

> Hi,
>
> I am in the process of making airflow logging backed by Elasticsearch
> (more detail please check AIRFLOW-1325
> ). Here are several
> more logging improvements we are considering:
>
> *1. Log streaming.* Auto-refresh the logs if tasks are running.
>
> *2. Separate logs by attempts.*
> [image: Screen Shot 2017-06-21 at 2.49.11 PM.png]
> Instead of logging everything into one file, logs can be separated by
> attempt number and displayed using tabs. Attempt number here is a
> monotonically increasing number that represents each task instance run
> (unlike try_number, clear task instance won't reset attempt number).
> *try_number:* n^th retry by the task instance. try_number should not be
> greater than retries. Clear task will set try_number to 0.
> *attempt:* number of times current task instance got executed.
>
> *3. Collapsable logs.* Collapse logs that are mainly for debugging
> airflow internal and aren't really related to users' tasks (for example,
> logs showed before "starting attempt 1 of 1")
>
> All suggestions are welcome.
>
> Thanks,
> Allison
>
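The try_number/attempt distinction Allison describes — try_number resets when a task instance is cleared, attempt only ever increases — could be modeled roughly as:

```python
class TaskInstanceCounters(object):
    """Toy model of the proposed counters, not Airflow's actual model."""

    def __init__(self):
        self.try_number = 0   # n-th retry within the current run cycle
        self.attempt = 0      # total executions; survives clears

    def run_once(self):
        self.try_number += 1
        self.attempt += 1

    def clear(self):
        # Clearing resets retries but keeps the global attempt count,
        # so old per-attempt logs remain uniquely addressable.
        self.try_number = 0

ti = TaskInstanceCounters()
ti.run_once()
ti.run_once()   # two tries
ti.clear()
ti.run_once()
print(ti.try_number, ti.attempt)  # 1 3
```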


Re: Role Based Access Control for Airflow UI

2017-06-12 Thread Dan Davydov
Looks good to me in general, thanks for putting this together!

I think the ability to integrate with external RBAC systems like LDAP is
important (i.e. the Airflow DB should not be decoupled from the RBAC
database wherever possible).

I wouldn't be too worried about the permissions about refreshing DAGs, as
far as I know this functionality is no longer required with the new
webservers which reload state periodically, and will certainly be removed
when we have a better DAG consistency story.

I think it would also be good to think about this proposal/implementation
and how it applied in the API-driven world (e.g. when webserver hits APIs
like /clear on behalf of users instead of running commands against the
database directly).

On Mon, Jun 12, 2017 at 11:12 AM, Bolke de Bruin  wrote:

> Will respond but im traveling at the moment. Give me a few days.
>
> Sent from my iPhone
>
> > On 12 Jun 2017, at 13:39, Chris Riccomini  wrote:
> >
> > Hey all,
> >
> > Checking in on this. We spent a good chunk of time thinking about this,
> and
> > want to move forward with it, but want to make sure we're all on the same
> > page.
> >
> > Max? Bolke? Dan? Jeremiah?
> >
> > Cheers,
> > Chris
> >
> > On Thu, Jun 8, 2017 at 1:49 PM, kalpesh dharwadkar <
> > kalpeshdharwad...@gmail.com> wrote:
> >
> >> Hello everyone,
> >>
> >> As you all know, currently Airflow doesn’t have a built-in Role Based
> >> Access Control(RBAC) capability.  It does provide very limited
> >> authorization capability by providing admin, data_profiler, and user
> roles.
> >> However, associating these roles to authenticated identities is not a
> >> simple effort.
> >>
> >> To address this issue, I have created a design proposal for building
> RBAC
> >> into Airflow and simplifying user access management via the Airflow UI.
> >>
> >> The design proposal is located at https://cwiki.apache.org/
> >> confluence/display/AIRFLOW/Airflow+RBAC+proposal
> >>
> >> Any comments/questions/feedback are much appreciated.
> >>
> >> Thanks
> >> Kalpesh
> >>
>


Re: dag file processing times

2017-04-24 Thread Dan Davydov
Was talking with Alex about the DB case offline, for those we could support
a force refresh arg with an interval param.

Manifests would need to be hierarchical, but I feel like it would inevitably
spin out into a full-blown build system.
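A daemonless stand-in for the inotify approach discussed in this thread is a plain mtime poll: snapshot the modification times of the DAG files, then reprocess only the files whose mtime changed. A sketch with hypothetical helper names:

```python
import os

def snapshot_mtimes(paths):
    """Record current modification times for the watched DAG files."""
    return {p: os.path.getmtime(p) for p in paths if os.path.exists(p)}

def changed_files(paths, last):
    """Return files whose mtime differs from the snapshot, updating it.

    A polling stand-in for inotify: the scheduler could reprocess (or
    fire an API call for) only these files instead of re-parsing every
    DAG file each loop.
    """
    changed = []
    for p in paths:
        try:
            m = os.path.getmtime(p)
        except OSError:
            continue  # file deleted between listing and stat
        if m != last.get(p):
            last[p] = m
            changed.append(p)
    return changed
```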

On Mon, Apr 24, 2017 at 3:02 PM, Arthur Wiedmer <arthur.wied...@gmail.com>
wrote:

> What if the DAG actually depends on configuration that only exists in a
> database and is retrieved by the Python code generating the DAG?
>
> Just asking because we have this case in production here. It is slowly
> changing, so still fits within the Airflow framework, but you cannot just
> watch a file...
>
> Best,
> Arthur
>
> On Mon, Apr 24, 2017 at 2:55 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> > Inotify can work without a daemon. Just fire a call to the API when a
> file
> > changes. Just a few lines in bash.
> >
> > If you bundle your dependencies in a zip you should be fine with the
> above.
> > Or if we start using manifests that list the files that are needed in a
> > dag...
> >
> >
> > Sent from my iPhone
> >
> > > On 24 Apr 2017, at 22:46, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> > wrote:
> > >
> > > One idea to solve this is to use a daemon that uses inotify to watch
> for
> > > changes in files and then reprocesses just those files. The hard part
> is
> > > without any kind of dependency/build system for DAGs it can be hard to
> > tell
> > > which DAGs depend on which files.
> > >
> > > On Mon, Apr 24, 2017 at 1:21 PM, Gerard Toonstra <gtoons...@gmail.com>
> > > wrote:
> > >
> > >> Hey,
> > >>
> > >> I've seen some people complain about DAG file processing times. An
> issue
> > >> was raised about this today:
> > >>
> > >> https://issues.apache.org/jira/browse/AIRFLOW-1139
> > >>
> > >> I attempted to provide a good explanation what's going on. Feel free
> to
> > >> validate and comment.
> > >>
> > >>
> > >> I'm noticing that the file processor is a bit naive in the way it
> > >> reprocesses DAGs. It doesn't look at the DAG interval for example, so
> it
> > >> looks like it reprocesses all files continuously in one big batch,
> even
> > if
> > >> we can determine that the next "schedule"  for all its dags are in the
> > >> future?
> > >>
> > >>
> > >> Wondering if a change in the DagFileProcessingManager could optimize
> > things
> > >> a bit here.
> > >>
> > >> In the part where it gets the simple_dags from a file it's currently
> > >> processing:
> > >>
> > >>for simple_dag in processor.result:
> > >>simple_dags.append(simple_dag)
> > >>
> > >> the file_path is in the context and the simple_dags should be able to
> > >> provide the next interval date for each dag in the file.
> > >>
> > >> The idea is to add files to a sorted deque by "next_schedule_datetime"
> > (the
> > >> minimum next interval date), so that when we build the list
> > >> "files_paths_to_queue", it can remove files that have dags that we
> know
> > >> won't have a new dagrun for a while.
> > >>
> > >> One gotcha to resolve after that is to deal with files getting updated
> > with
> > >> new dags or changed dag definitions and renames and different interval
> > >> schedules.
> > >>
> > >> Worth a PR to glance over?
> > >>
> > >> Rgds,
> > >>
> > >> Gerard
> > >>
> >
>
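The sorted-deque idea in the quoted proposal could be sketched with a heap keyed on the minimum next-interval date across the DAGs in each file; the paths and field layout below are hypothetical:

```python
import heapq
from datetime import datetime, timedelta

# (next_schedule_datetime, file_path): the datetime is the minimum
# next-run time over all DAGs defined in that file.
now = datetime(2017, 4, 24, 12, 0)
file_heap = [
    (now + timedelta(hours=6), '/dags/daily_report.py'),
    (now - timedelta(minutes=5), '/dags/five_min_etl.py'),
    (now + timedelta(days=1), '/dags/weekly_rollup.py'),
]
heapq.heapify(file_heap)

def files_to_queue(heap, current_time):
    """Pop only files with at least one DAG due now or earlier; files
    whose earliest next run is still in the future are skipped this
    scheduler cycle instead of being reprocessed in one big batch."""
    due = []
    while heap and heap[0][0] <= current_time:
        due.append(heapq.heappop(heap)[1])
    return due

print(files_to_queue(file_heap, now))  # ['/dags/five_min_etl.py']
```

As the thread notes, the gotcha is invalidating these entries when a file is edited, renamed, or gains a DAG with a different schedule.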


Welcome @saguziel as a committer and PMC member!

2017-04-13 Thread Dan Davydov
Alex (@saguziel - AirBnB) has been making contributions and reviews for
quite a long time now and I'm very happy to say he has just become an
official committer and PMC member.

He has ~13 commits, most of which are to the core of Airflow, and has been
active reviewing open source PRs, contributing in the recent release (e.g.
fixing blocking issues), and has a strong understanding of the the core
Airflow logic (he has submitted a couple of patches to remove race
conditions, and security patches).

Congratulations and welcome Alex!
-Dan


Re: 1.8.1 release

2017-03-21 Thread Dan Davydov
It seemed to only affect some subdags (but I think even when restarted they
were still affected); it seemed like a race condition.

For the second question, we did not yet (but checking this is part of the
ticket).

On Tue, Mar 21, 2017 at 3:01 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> @dan
>
> I'm obviously interested in the subdag issue as it is executed by the
> backfill logic. Do you have anything to reproduce it with? Can also talk
> about it tomorrow.
>
> Secondly, did you verify the all success / skipped 'fix' against 'wait for
> all tasks to finish'?
>
> @chris I also suggest using/embracing jira more (as you are doing), as it
> helps with cleaner changelogs, tracking and targeting releases.
>
> Also note that I already included some fixes in v1-8-test.
>
> Bolke
>
> Sent from my iPhone
>
> > On 21 Mar 2017, at 14:29, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
> >
> > Some of the issues I ran into while testing 1.8rc5 :
> >
> > https://issues.apache.org/jira/browse/AIRFLOW-1015
> >> https://issues.apache.org/jira/browse/AIRFLOW-1013
> >> https://issues.apache.org/jira/browse/AIRFLOW-1004
> >> https://issues.apache.org/jira/browse/AIRFLOW-1003
> >> https://issues.apache.org/jira/browse/AIRFLOW-1001
> >> https://issues.apache.org/jira/browse/AIRFLOW-1015
> >
> >
> > It would be great to have at least some of them fixed in 1.8.1.
> >
> > Thank you.
> >
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> > On Tue, Mar 21, 2017 at 3:02 PM, Dan Davydov <dan.davy...@airbnb.com.
> invalid
> >> wrote:
> >
> >> Here is my list for targeted 1.8.1 fixes:
> >> https://issues.apache.org/jira/browse/AIRFLOW-982
> >> https://issues.apache.org/jira/browse/AIRFLOW-983
> >> https://issues.apache.org/jira/browse/AIRFLOW-1019 (and in general the
> >> slow
> >> startup time from this new logic of orphaned/reset task)
> >> https://issues.apache.org/jira/browse/AIRFLOW-1017 (which I will
> hopefully
> >> have a fix out for soon just finishing up tests)
> >>
> >> We are also hitting a new issue with subdags with rc5 that we weren't
> >> hitting with rc4 where subdags will occasionally just hang (had to roll
> >> back from rc5 to rc4), I'll try to spin up a JIRA for it soon which
> should
> >> be on the list too.
> >>
> >>
> >> On Tue, Mar 21, 2017 at 1:54 PM, Chris Riccomini <criccom...@apache.org
> >
> >> wrote:
> >>
> >>> Agreed. I'm looking for a list of checksums/JIRAs that we want in the
> >>> bugfix release.
> >>>
> >>> On Tue, Mar 21, 2017 at 12:54 PM, Bolke de Bruin <bdbr...@gmail.com>
> >>> wrote:
> >>>
> >>>>
> >>>>
> >>>>> On 21 Mar 2017, at 12:51, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>
> >>>>> My suggestion, as we are using semantic versioning is:
> >>>>>
> >>>>> 1) no new features in the 1.8 branch
> >>>>> 2) only bug fixes in the 1.8 branch
> >>>>> 3) new features to land in 1.9
> >>>>>
> >>>>> This allows companies to
> >>>>
> >>>> Have a "known" version and can move to the new branch when they want
> to
> >>>> get new features. Obviously we only support N-1, so when 1.10 comes
> out
> >>> we
> >>>> stop supporting 1.8.X.
> >>>>
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On 21 Mar 2017, at 11:22, Chris Riccomini <criccom...@apache.org>
> >>>> wrote:
> >>>>>>
> >>>>>> Hey all,
> >>>>>>
> >>>>>> I suggest that we start a 1.8.1 Airflow release now. The goal would
> >>> be:
> >>>>>>
> >>>>>> 1) get a second release under our belt
> >>>>>> 2) patch known issues with the 1.8.0 release
> >>>>>>
> >>>>>> I'm happy to run it, but I saw Maxime mentioning that Airbnb might
> >>> want
> >>>> to.
> >>>>>> @Max et al, can you comment?
> >>>>>>
> >>>>>> Also, can folks supply JIRAs for stuff that think needs to be in the
> >>>> 1.8.1
> >>>>>> bugfix release?
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Chris
> >>>>
> >>>
> >>
>


Re: 1.8.1 release

2017-03-21 Thread Dan Davydov
Here is my list for targeted 1.8.1 fixes:
https://issues.apache.org/jira/browse/AIRFLOW-982
https://issues.apache.org/jira/browse/AIRFLOW-983
https://issues.apache.org/jira/browse/AIRFLOW-1019 (and in general the slow
startup time from this new logic of orphaned/reset task)
https://issues.apache.org/jira/browse/AIRFLOW-1017 (which I will hopefully
have a fix out for soon just finishing up tests)

We are also hitting a new issue with subdags with rc5 that we weren't
hitting with rc4 where subdags will occasionally just hang (had to roll
back from rc5 to rc4), I'll try to spin up a JIRA for it soon which should
be on the list too.


On Tue, Mar 21, 2017 at 1:54 PM, Chris Riccomini 
wrote:

> Agreed. I'm looking for a list of checksums/JIRAs that we want in the
> bugfix release.
>
> On Tue, Mar 21, 2017 at 12:54 PM, Bolke de Bruin 
> wrote:
>
> >
> >
> > > On 21 Mar 2017, at 12:51, Bolke de Bruin  wrote:
> > >
> > > My suggestion, as we are using semantic versioning is:
> > >
> > > 1) no new features in the 1.8 branch
> > > 2) only bug fixes in the 1.8 branch
> > > 3) new features to land in 1.9
> > >
> > > This allows companies to
> >
> > Have a "known" version and can move to the new branch when they want to
> > get new features. Obviously we only support N-1, so when 1.10 comes out
> we
> > stop supporting 1.8.X.
> >
> > >
> > > Sent from my iPhone
> > >
> > >> On 21 Mar 2017, at 11:22, Chris Riccomini 
> > wrote:
> > >>
> > >> Hey all,
> > >>
> > >> I suggest that we start a 1.8.1 Airflow release now. The goal would
> be:
> > >>
> > >> 1) get a second release under our belt
> > >> 2) patch known issues with the 1.8.0 release
> > >>
> > >> I'm happy to run it, but I saw Maxime mentioning that Airbnb might
> want
> > to.
> > >> @Max et al, can you comment?
> > >>
> > >> Also, can folks supply JIRAs for stuff that think needs to be in the
> > 1.8.1
> > >> bugfix release?
> > >>
> > >> Cheers,
> > >> Chris
> >
>


Re: [RESULT][VOTE]Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-17 Thread Dan Davydov
That's reasonable (treating it as a bug instead of a change in behavior).
Full speed ahead!

On Thu, Mar 16, 2017 at 9:01 AM, Bolke de Bruin  wrote:

> Hello,
>
> Apache Airflow (incubating) 1.8.0 (RC5) has been accepted.
>
> 9 “+1” votes received:
>
> - Maxime Beauchemin (binding)
> - Chris Riccomini (binding)
> - Arthur Wiedmer (binding)
> - Jeremiah Lowin (binding)
> - Siddharth Anand (binding)
> - Alex van Boxel (binding)
> - Bolke de Bruin (binding)
>
> - Daniel Huang (non-binding)
>
> Vote thread (start):
> http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/201703.mbox/%3cB1833A3A-05FB-4112-B395-135caf930...@gmail.com%3e
>
> Next steps:
> 1) will start the voting process at the IPMC mailinglist. I don’t expect
> changes.
> 2) Only after the positive voting on the IPMC and finalisation I will
> rebrand the RC to Release.
> 3) I will upload it to the incubator release page, then the tar ball needs
> to propagate to the mirrors.
> 4) Update the website (can someone volunteer please?)
> 5) Finally I will ask Maxime to upload it to pypi. It seems we can keep
> the apache branding as libcloud is doing this as well (
> https://libcloud.apache.org/downloads.html#pypi-package).
>
> Cheers,
>
> Bolke


Re: Make Scheduler More Centralized

2017-03-17 Thread Dan Davydov
I'm not convinced that this would add *that* much more load, we could
probably change this functionality now if we wanted to. Just my two cents.

On Thu, Mar 16, 2017 at 4:06 PM, Rui Wang  wrote:

> Thanks all your comments!
>
> Then it looks like we should focus on the scalability of the scheduler now
> rather than adding more load on it. I will give up this centralized idea now.
>
> On Tue, Mar 14, 2017 at 3:08 PM, Rui Wang  wrote:
>
>> Hi,
>> The design doc below I created is trying to make airflow scheduler more
>> centralized. Briefly speaking, I propose moving state change of
>> TaskInstance to scheduler. You can see the reasons for this change below.
>>
>>
>> Could you take a look and comment if you see anything does not make sense?
>>
>> -Rui
>>
>> 
>> --
>> Current: The state of TaskInstance is changed by both scheduler and
>> worker. On worker side, worker monitors TaskInstance and changes the state
>> to RUNNING, then to SUCCESS if the task succeeds, or to UP_FOR_RETRY or FAILED if the task
>> fails. The worker also handles the failure email logic and failure callback logic.
>> Proposal: The general idea is to make a centralized scheduler and make
>> workers dumb. Worker should not change state of TaskInstance, but just
>> executes what it is assigned and reports the result of the task. Instead,
>> the scheduler should make the decision on TaskInstance state change.
>> Ideally, workers should not even handle the failure emails and callbacks
>> unless the scheduler asks it to do so.
>> Why: The worker does not have as much information as the scheduler has. There
>> were bugs observed, caused by the worker, when the worker got into trouble but
>> could not make a decision to change task state due to lack of information.
>> Although there is airflow metadata DB, it is still not easy to share all
>> information that scheduler has with workers.
>>
>> We can also ensure a consistent environment. There are slight differences
>> in the chef recipes for the different workers which can cause strange
>> issues when DAGs parse on one but not the other.
>>
>> In the meantime, moving state changes to the scheduler can reduce the
>> complexity of airflow. It especially helps when airflow needs to move to
>> distributed schedulers. In that case state change everywhere by both
>> schedulers and workers are harder to maintain.
>> How to change: After lots of discussion, the following steps will be done:
>>
>> 1. Add a new column to TaskInstance table. Worker will fill this column
>> with the task process exit code.
>>
>> 2. Worker will only set TaskInstance state to RUNNING when it is ready to
>> run task. There was debate on moving RUNNING to scheduler as well. If
>> moving RUNNING to scheduler, either scheduler marks TaskInstance RUNNING
>> before it gets into queue, or scheduler checks the status code in column
>> above, which is updated by the worker when it is ready to run the task. In the
>> former case, from the user's perspective, it is bad to mark TaskInstance as
>> RUNNING when worker is not ready to run. User could be confused. In the
>> latter case, scheduler could mark task as RUNNING late due to schedule
>> interval. It is still not a good user experience. Since only worker knows
>> when it is ready to run task, worker should still deliver this message to user
>> by setting RUNNING state.
>>
>> 3. In any other cases, worker should not change state of TaskInstance,
>> but save defined status code into column above.
>>
>> 4. Worker still handles failure emails and callbacks because there were
>> concerns that the scheduler could use too many resources to run failure callbacks
>> given unpredictable callback sizes. ( I think ideally scheduler should
>> treat failure callbacks and emails as tasks, and assign such tasks to
>> workers after TaskInstance state changes correspondingly). Eventually this
>> logic will be moved to the workers once there is support for multiple
>> distributed schedulers.
>>
>> 5. In scheduler's loop, scheduler should check TaskInstance status code,
>> then change state and retry/fail TaskInstance correspondingly.
>>
>
>
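The handoff in steps 1-5 above can be sketched as a small state machine. This is illustrative only: `TaskInstance`, `exit_code`, and the state strings here are stand-ins, not Airflow's actual models.

```python
# Sketch of steps 1-5: the worker writes an exit code into a new column,
# and the scheduler (not the worker) performs the terminal state change.
RUNNING, SUCCESS, UP_FOR_RETRY, FAILED = "running", "success", "up_for_retry", "failed"

class TaskInstance:
    def __init__(self, try_number=1, max_tries=3):
        self.state = RUNNING       # set when the worker started the task
        self.exit_code = None      # proposed new column, filled in by the worker
        self.try_number = try_number
        self.max_tries = max_tries

def scheduler_heartbeat(ti):
    """Step 5: resolve the state from the exit code the worker recorded."""
    if ti.exit_code is None:       # worker has not finished the task yet
        return ti.state
    if ti.exit_code == 0:
        ti.state = SUCCESS
    elif ti.try_number < ti.max_tries:
        ti.state = UP_FOR_RETRY    # the scheduler, not the worker, retries
        ti.try_number += 1
    else:
        ti.state = FAILED
    return ti.state

ti = TaskInstance()
ti.exit_code = 1                   # worker reports a non-zero exit
print(scheduler_heartbeat(ti))     # up_for_retry
```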


Re: Airflow Committers: Landscape checks doing more harm than good?

2017-03-16 Thread Dan Davydov
+1 as well, though I have found it useful on larger PRs to help me catch
some issues, so it probably makes sense to add the Travis linting at the
same time we remove Landscape. Not sure how much usability we lose by
dropping the Landscape UI, but I like that all of the errors would be in
one place.

On Thu, Mar 16, 2017 at 4:51 PM, Bolke de Bruin  wrote:

> We can do it in Travis’ afaik. We should replace it.
>
> So +1.
>
> B.
>
> > On 16 Mar 2017, at 16:48, Jeremiah Lowin  wrote:
> >
> > This may be an unpopular opinion, but most Airflow PRs have a little red
> > "x" next to them not because they have failing unit tests, but because
> the
> > Landscape check has decided they introduce bad code.
> >
> > Unfortunately Landscape is often wrong -- here it is telling me my latest
> > PR introduced no less than 30 errors... in files I didn't touch!
> > https://github.com/apache/incubator-airflow/pull/2157 (however, it
> gives me
> > credit for fixing 23 errors in those same files, so I've got that going
> for
> > me... which is nice.)
> >
> > The upshot is that Github's "health" indicator can be swayed by minor or
> > erroneous issues, and therefore it serves little purpose other than
> making
> > it look like every PR is bad. This creates committer fatigue, since every
> > PR needs to be parsed to see if it actually is OK or not.
> >
> > Don't get me wrong, I'm all for proper style and on occasion Landscape
> has
> > pointed out problems that I've gone and fixed. But on the whole, I
> believe
> > that having it as part of our red / green PR evaluation -- equal to and
> > often superseding unit tests -- is harmful. I'd much rather be able to
> scan
> > the PR list and know unequivocally that "green" indicates ready to merge.
> >
> > J
>
>


Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Dan Davydov
The only thing is that this is a change in semantics, and changing semantics
(breaking some DAGs) and then changing them back (and breaking things
again) isn't great.

On Wed, Mar 15, 2017 at 7:02 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Indeed that could be the case. Let's get 1.8.0 out the door so we can
> focus on these bug fixes for 1.8.1.
>
> Bolke
>
> Sent from my iPhone
>
> > On 15 Mar 2017, at 18:25, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >
> > Another issue we are seeing is
> > https://issues.apache.org/jira/browse/AIRFLOW-992 - tasks that have both
> > skipped children and successful children are run instead of skipped. Not
> > blocking the release on this just letting you guys know for the release
> bug
> > notes. We will be cherrypicking a fix for this onto our production when
> we
> > release 1.8 once we come up with one.
> >
> > It's possibly, though not necessarily, related to an incomplete/incorrect
> > fix of https://issues.apache.org/jira/browse/AIRFLOW-719 .
> >
> >> On Wed, Mar 15, 2017 at 4:53 PM, siddharth anand <san...@apache.org>
> wrote:
> >>
> >> Confirmed that Bolke's PR above fixes the issue.
> >>
> >> Also, I agree this is not a blocker for the current airflow release, so
> my
> >> +1 (binding) stands.
> >> -s
> >>
> >>> On Wed, Mar 15, 2017 at 3:11 PM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >>>
> >>> PR is available: https://github.com/apache/incubator-airflow/pull/2154
> >>>
> >>> But marked for 1.8.1.
> >>>
> >>> - Bolke
> >>>
> >>>> On 15 Mar 2017, at 14:37, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>
> >>>> On second thought I do consider it a bug and can have a fix out pretty
> >>> quickly, but I don’t consider it a blocker.
> >>>>
> >>>> - B.
> >>>>
> >>>>> On 15 Mar 2017, at 14:21, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>
> >>>>> Just to be clear: Also in 1.7.1 the DagRun was marked successful, but
> >>> its tasks continued to be scheduled. So one could also consider 1.7.1
> >>> behaviour a bug. I am not sure here, but I think it kind of makes sense
> >> to
> >>> consider the behaviour of 1.7.1 a bug. It has been present throughout
> all
> >>> the 1.8 rc/beta/apha series.
> >>>>>
> >>>>> So yes it is a change in behaviour whether it is a regression or an
> >>> integrity improvement is up for discussion. Either way I don’t consider
> >> it
> >>> a blocker.
> >>>>>
> >>>>> Bolke.
> >>>>>
> >>>>>> On 15 Mar 2017, at 14:06, siddharth anand <san...@apache.org>
> wrote:
> >>>>>>
> >>>>>> Here's the JIRA :
> >>>>>> https://issues.apache.org/jira/browse/AIRFLOW-989
> >>>>>>
> >>>>>> I confirmed it is a regression from 1.7.1.3, which I installed via
> >> pip
> >>> and
> >>>>>> tested against the same DAG in the JIRA.
> >>>>>>
> >>>>>> The issue occurs if a leaf / last / terminal downstream task is not
> >>>>>> cleared. You won't see this issue if you clear the entire DAG Run or
> >>> clear
> >>>>>> a task and all of its downstream tasks. If you truly want to only
> >>> clear and
> >>>>>> rerun a task, but not its downstream tasks, you can use the CLI to
> >>> execute
> >>>>>> a specific task (e.g. vial airflow run).
> >>>>>>
> >>>>>> This is a change in behavior -- if we do go ahead with the release,
> >>> then
> >>>>>> this JIRA should be in a list of JIRAs of known issues related to
> the
> >>> new
> >>>>>> version.
> >>>>>> -s
> >>>>>>
> >>>>>> On Wed, Mar 15, 2017 at 9:17 AM, Chris Riccomini <
> >>> criccom...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> @Sid, does this happen if you clear downstream as well?
> >>>>>>>
> >>>>>>> On Wed, Mar 15, 2017 at 9:04 AM, Chris Riccomini <
> >>> criccom...@apache

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-13 Thread Dan Davydov
I'll test this on staging as soon as I get a chance (the testing is
non-blocking on the rc5). Bolke very much in particular :).

On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin  wrote:

> +1 (binding) extremely impressed by the work and diligence all contributors
> have put in to getting these blockers fixed, Bolke in particular.
>
> On Mon, Mar 13, 2017 at 1:07 AM Arthur Wiedmer  wrote:
>
> > +1 (binding)
> >
> > Thanks again for steering us through Bolke.
> >
> > Best,
> > Arthur
> >
> > On Sun, Mar 12, 2017 at 9:59 PM, Bolke de Bruin 
> wrote:
> >
> > > Dear All,
> > >
> > > Finally, I have been able to make the FIFTH RELEASE CANDIDATE of
> Airflow
> > > 1.8.0 available at: https://dist.apache.org/repos/
> > > dist/dev/incubator/airflow/ , public keys are available at
> > > https://dist.apache.org/repos/dist/release/incubator/airflow/ . It is
> > > tagged with a local version “apache.incubating” so it allows upgrading
> > from
> > > earlier releases.
> > >
> > > Issues fixed since rc4:
> > >
> > > [AIRFLOW-900] Double trigger should not kill original task instance
> > > [AIRFLOW-900] Fixes bugs in LocalTaskJob for double run protection
> > > [AIRFLOW-932] Do not mark tasks removed when backfilling
> > > [AIRFLOW-961] run onkill when SIGTERMed
> > > [AIRFLOW-910] Use parallel task execution for backfills
> > > [AIRFLOW-967] Wrap strings in native for py2 ldap compatibility
> > > [AIRFLOW-941] Use defined parameters for psycopg2
> > > [AIRFLOW-719] Prevent DAGs from ending prematurely
> > > [AIRFLOW-938] Use test for True in task_stats queries
> > > [AIRFLOW-937] Improve performance of task_stats
> > > [AIRFLOW-933] use ast.literal_eval rather than eval because ast.literal_eval
> > > does not execute input.
> > > [AIRFLOW-919] Running tasks with no start date shouldn't break a DAGs
> UI
> > > [AIRFLOW-897] Prevent dagruns from failing with unfinished tasks
> > > [AIRFLOW-861] make pickle_info endpoint be login_required
> > > [AIRFLOW-853] use utf8 encoding for stdout line decode
> > > [AIRFLOW-856] Make sure execution date is set for local client
> > > [AIRFLOW-830][AIRFLOW-829][AIRFLOW-88] Reduce Travis log verbosity
> > > [AIRFLOW-794] Access DAGS_FOLDER and SQL_ALCHEMY_CONN exclusively from
> > > settings
> > > [AIRFLOW-694] Fix config behaviour for empty envvar
> > > [AIRFLOW-365] Set dag.fileloc explicitly and use for Code view
> > > [AIRFLOW-931] Do not set QUEUED in TaskInstances
> > > [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI
> instead
> > > of black
> > > [AIRFLOW-895] Address Apache release incompliancies
> > > [AIRFLOW-893][AIRFLOW-510] Fix crashing webservers when a dagrun has no
> > > start date
> > > [AIRFLOW-793] Enable compressed loading in S3ToHiveTransfer
> > > [AIRFLOW-863] Example DAGs should have recent start dates
> > > [AIRFLOW-869] Refactor mark success functionality
> > > [AIRFLOW-856] Make sure execution date is set for local client
> > > [AIRFLOW-814] Fix Presto*CheckOperator.__init__
> > > [AIRFLOW-844] Fix cgroups directory creation
> > >
> > > No known issues anymore.
> > >
> > > I would also like to raise a VOTE for releasing 1.8.0 based on release
> > > candidate 5, i.e. just renaming release candidate 5 to 1.8.0 release.
> > >
> > > Please respond to this email by:
> > >
> > > +1,0,-1 with *binding* if you are a PMC member or *non-binding* if you
> > are
> > > not.
> > >
> > > Thanks!
> > > Bolke
> > >
> > > My VOTE: +1 (binding)
> >
>


Re: High load in CPU of MySQL when running airflow

2017-03-07 Thread Dan Davydov
We will need to come up with a plan soon (better DB indexes and/or the
ability to rotate out old task instances according to some policy). Nothing
concrete as of yet though.
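(Related config knobs for the polling-load discussion quoted below live in `airflow.cfg`; the values here are illustrative and defaults vary by version.)

```ini
[scheduler]
# Seconds the scheduler waits between loops; raising this lowers the
# query load on MySQL at the cost of slower task pickup.
scheduler_heartbeat_sec = 15
# Seconds between job heartbeats recorded in the metadata DB.
job_heartbeat_sec = 15
```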

On Tue, Mar 7, 2017 at 6:18 PM, Jason Chen <chingchien.c...@gmail.com>
wrote:

> Hi Dan,
>
>  Thanks so much. This is exactly what I am looking for.
>
> Is there a plan on the future airflow road map to clean this up from
> Airflow system level? Say, in airflow.cfg, a setting to clean up data older
> than specified time.
>
> Your solution is to run an airflow job to clean up the data. That's great.
> In a short term for us, I will be just running the SQL command directly
> from MySQL CLI and then setup an airflow job to do that periodically.
>
> Thanks.
> -Jason
>
> On Tue, Mar 7, 2017 at 5:47 PM, Dan Davydov <dan.davy...@airbnb.com.
> invalid>
> wrote:
>
> > FWIW we use the following DAG at Airbnb to reap the task instances table
> > (this is a stopgap):
> >
> > # DAG to delete old TIs so that UI operations on the webserver are fast.
> > This DAG is a
> > # stopgap, ideally we would make the UI not query all task instances and
> > add indexes to
> > # the task_instance table where appropriate to speed up the remaining
> > webserver table
> > # queries.
> > # Note that there is a slight risk that some of these deleted task
> > instances may break
> > # the depends_on_past dependency for the following tasks but this should
> > rarely happen
> > # and is easy to diagnose and fix.
> >
> > from datetime import datetime
> >
> > from airflow import DAG
> > from airflow.operators import MySqlOperator
> >
> > args = {
> > 'owner': 'xxx',
> > 'email': ['xxx'],
> > 'start_date': datetime(2017, 1, 30),
> > 'mysql_conn_id': 'airflow_db',
> > }
> >
> > dag = DAG(
> > 'airflow_old_task_instance_pruning',
> > default_args=args,
> > )
> >
> > # TODO: TIs that are successful without a start date will never be
> > # reaped because they have been mark-success'd in the UI. One fix for
> this
> > would be to
> > # make airflow set start_date when mark-success-ing.
> > sql = """\
> > DELETE ti FROM task_instance ti
> > LEFT OUTER JOIN dag_run dr
> > ON ti.execution_date = dr.execution_date AND
> >ti.dag_id = dr.dag_id
> > WHERE ((ti.start_date <= DATE_SUB(NOW(), INTERVAL 30 DAY) AND
> > ti.state != "running") OR
> >(ISNULL(ti.start_date) AND
> > ti.state = "failed")) AND
> >   (ISNULL(dr.id) OR dr.state != "running")
> > """
> > MySqlOperator(
> > task_id='delete_old_tis',
> > sql=sql,
> > dag=dag,
> > )
> >
> >
> >
> > On Tue, Mar 7, 2017 at 5:39 PM, Jason Chen <chingchien.c...@gmail.com>
> > wrote:
> >
> > > Hi Bolke,
> > >
> > >  Thanks, but it looks you are actually talking about Harish's use case.
> > >
> > >  My use case is about 50 Dags (each one with about 2-3 tasks). I feel
> our
> > > run interval setting for the dags are too low (~15 mins). It may result
> > in
> > > high CPU of MySQL.
> > >
> > >  Meanwhile, I dug into MySQL and noticed a frequently running SQL
> > > statement, shown below, which has no proper index on column
> > > task_instance.state.
> > >
> > > Shouldn't it index "state", given that there could be millions of rows
> in
> > > task_instance?
> > >
> > > SQL Statement:
> > > "SELECT task_instance.task_id AS task_instance_task_id,
> > > task_instance.dag_id AS task_instance_dag_id, FROM task_instance
> > WHERE
> > > task_instance.state = 'queued'"
> > >
> > > Also, is there a possibility to clean some "unneeded" entries in the
> > tables
> > > (say, task_instance) ?  I mean, for example, removing task states older
> > > than 6 months?
> > >
> > > Feedback are welcome.
> > >
> > > Thanks.
> > >
> > > -Jason
> > >
> > >
> > >
> > > On Tue, Mar 7, 2017 at 11:45 AM, Bolke de Bruin <bdbr...@gmail.com>
> > wrote:
> > >
> > > > Hi Jason
> > > >
> > > > I think you need to back it up with more numbers. You assume that a
> > load
> > > > of 100% is bad and also that 16GB of mem is a lot.
> 

Re: High load in CPU of MySQL when running airflow

2017-03-07 Thread Dan Davydov
FWIW we use the following DAG at Airbnb to reap the task instances table
(this is a stopgap):

# DAG to delete old TIs so that UI operations on the webserver are fast.
# This DAG is a stopgap; ideally we would make the UI not query all task
# instances and add indexes to the task_instance table where appropriate to
# speed up the remaining webserver table queries.
# Note that there is a slight risk that some of these deleted task instances
# may break the depends_on_past dependency for the following tasks, but this
# should rarely happen and is easy to diagnose and fix.

from datetime import datetime

from airflow import DAG
from airflow.operators import MySqlOperator

args = {
    'owner': 'xxx',
    'email': ['xxx'],
    'start_date': datetime(2017, 1, 30),
    'mysql_conn_id': 'airflow_db',
}

dag = DAG(
    'airflow_old_task_instance_pruning',
    default_args=args,
)

# TODO: TIs that are successful without a start date will never be reaped
# because they have been mark-success'd in the UI. One fix for this would be
# to make airflow set start_date when mark-success-ing.
sql = """\
DELETE ti FROM task_instance ti
LEFT OUTER JOIN dag_run dr
ON ti.execution_date = dr.execution_date AND
   ti.dag_id = dr.dag_id
WHERE ((ti.start_date <= DATE_SUB(NOW(), INTERVAL 30 DAY) AND
        ti.state != "running") OR
       (ISNULL(ti.start_date) AND
        ti.state = "failed")) AND
      (ISNULL(dr.id) OR dr.state != "running")
"""
MySqlOperator(
    task_id='delete_old_tis',
    sql=sql,
    dag=dag,
)
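Jason's indexing question (quoted below) is easy to demonstrate. Here is a self-contained sketch using SQLite's query planner, with column names taken from the quoted query; Airflow's real schema has more columns, and on MySQL the equivalent fix would be `CREATE INDEX ti_state ON task_instance (state)`, normally applied via a migration.

```python
import sqlite3

# Toy schema standing in for Airflow's task_instance table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, dag_id TEXT, state TEXT)")
conn.execute("CREATE INDEX ti_state ON task_instance (state)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [("t1", "d1", "queued"), ("t2", "d1", "success"), ("t3", "d2", "queued")],
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT task_id FROM task_instance WHERE state = 'queued'"
).fetchall()
# The plan should report a SEARCH ... USING INDEX ti_state rather than a
# full table SCAN.
print(plan)
```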



On Tue, Mar 7, 2017 at 5:39 PM, Jason Chen 
wrote:

> Hi Bolke,
>
>  Thanks, but it looks you are actually talking about Harish's use case.
>
>  My use case is about 50 Dags (each one with about 2-3 tasks). I feel our
> run interval setting for the dags are too low (~15 mins). It may result in
> high CPU of MySQL.
>
>  Meanwhile, I dug into MySQL and noticed a frequently running SQL statement,
> shown below, which has no proper index on column task_instance.state.
>
> Shouldn't it index "state", given that there could be millions of rows in
> task_instance?
>
> SQL Statement:
> "SELECT task_instance.task_id AS task_instance_task_id,
> task_instance.dag_id AS task_instance_dag_id, FROM task_instance WHERE
> task_instance.state = 'queued'"
>
> Also, is there a possibility to clean some "unneeded" entries in the tables
> (say, task_instance) ?  I mean, for example, removing task states older
> than 6 months?
>
> Feedback are welcome.
>
> Thanks.
>
> -Jason
>
>
>
> On Tue, Mar 7, 2017 at 11:45 AM, Bolke de Bruin  wrote:
>
> > Hi Jason
> >
> > I think you need to back it up with more numbers. You assume that a load
> > of 100% is bad and also that 16GB of mem is a lot.
> >
> > 30x25 = 750 tasks per hour = 12.5 tasks per minute. For every task we
> > launch a couple of processes (at least 2) that do not share memory, this
> is
> > to ensure tasks cannot hurt each other. Curl tasks are probably launched
> by
> > using a BashOperator, which means another process. Curl is itself another
> > process. So 4 processes per task, that cannot share memory. Curl can
> cache
> > memory itself as well. You probably have peak times and longer running
> > tasks so it is not evenly spread, then it starts adding up quickly?
> >
> > Bolke.
> >
> >
> > > On 7 Mar 2017, at 19:41, Jason Chen  wrote:
> > >
> > > Hi Harish,
> > > Thanks for the fast response and feedback.
> > > Yeah, I want to see the fix or more discussion !
> > >
> > > BTW, I assume that, given your 30 dags, airflow runs fine after your
> > > increase of heartbeat ?
> > > The default is 5 secs.
> > >
> > >
> > > Thanks.
> > > Jason
> > >
> > >
> > > On Tue, Mar 7, 2017 at 10:24 AM, harish singh <
> harish.sing...@gmail.com>
> > > wrote:
> > >
> > >> I had seen a similar behavior, a year ago, when we were are < 5 Dags.
> > Even
> > >> then the cpu utilization was reaching 100%.
> > >> One way to deal with this is - You could play with "heatbeat" numbers
> > (i.e
> > >> increase heartbeat).
> > >> But then you are introducing more delay to start jobs that are ready
> to
> > run
> > >> (ready to be queued -> queued -> run)
> > >>
> > >> Right now, we have more than 30 dags (each with ~ 20-25 tasks) that
> runs
> > >> every hour.
> > >> We are giving airflow about 5-6 cores (which still seems less for
> > airflow).
> > >> Also, for so many tasks every hour,  our mem consumption is over 16G.
> > >> All our tasks are basically doing "curl". So 16G seems too high.
> > >>
> > >> Having said that, I remember reading somewhere that there was a fix
> > coming
> > >> for this.
> > >> If not, I would definitely want to see more discussion on this.
> > >>
> > >> Thanks for opening this. I would love to hear on how people are
> working
> > >> around this.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Tue, Mar 7, 2017 at 9:42 AM, Jason 

Re: Proposal to simplify start/end dates

2017-03-07 Thread Dan Davydov
Those semantics should not change with my specific proposal, but I think
logically moving a DAG's start date back should backfill those old dagruns
(not the current behavior which continues running from the current date).
Interval changes are a bit of a hairy topic, I think those kinds of changes
along with mutations of DAG topology (new tasks, task dependency changes,
etc.) need to be thought out a little bit more and have a proposal drafted
(e.g. I think that Airflow should support dags with tasks with different
scheduling intervals). For the purposes of this proposal (and internally at
Airbnb) we do not support interval changes and recommend a new DAG be
created for these cases.

On Tue, Mar 7, 2017 at 1:52 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Ok sounds good. What do you do with a dag that gets predated and with an
> existing dag run? What happens if the interval changes, i.e. non cron
> syntax?
>
> (Just thinking out loud)
>
> B.
>
> Sent from my iPhone
>
> > On 7 Mar 2017, at 22:27, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >
> > Sure thing.
> >
> > Current Behavior:
> > - User creates DAG with default_args start date to 2015
> > - dagrun gets kicked off for 2015
> > - User changes default_args start date to 2016
> > - dagruns continue running for 2015
> >
> > New Behavior:
> > - User creates DAG with default_args start date to 2015
> > - dagrun gets kicked off for 2015
> > - User changes default_args start date to 2016
> > - *dagruns start running for the 2016 start date instead of 2015*
> >
> >> On Tue, Mar 7, 2017 at 11:49 AM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >>
> >> Hey Dan,
> >>
> >> Im not sure if I am seeing a difference for #1 vs now, except you are
> >> excluding backfills now from the calculation? Can you provide an
> example?
> >>
> >> Bolke
> >>
> >>> On 7 Mar 2017, at 20:38, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> >> wrote:
> >>>
> >>> A very common source of confusion for our users is when they specify
> >>> start_date in default_args but not in their DAG arguments and then try
> to
> >>> change this start_date to move the execution of their DAG forward (e.g.
> >>> from 2015 to 2016). This doesn't work because the logic that is used to
> >>> calculate the "initial" start date of a dag differs from the logic to
> >>> calculate subsequent dagrun start dates.
> >>>
> >>> Current Airflow Logic:
> >>> DS to schedule initial dagrun: dag.start_date if it exists, else
> >> min(start
> >>> date of tasks_of_dag)
> >>> DS to schedule subsequent dagruns: last_dagrun + scheduled_interval
> >>>
> >>> There are a couple ways of addressing this:
> >>> 1. Change the definition of start date for subsequent dagruns to match
> >> the
> >>> "initial" dagrun start date (calculated from the minimum of task start
> >>> dates)
> >>> 2. Force explicit dag start dates
> >>>
> >>> I personally like 1.
> >>>
> >>> I also propose that we throw errors for DAGs that have tasks that
> depend
> >> on
> >>> other tasks with start dates that occur after theirs (otherwise there
> >> could
> >>> be deadlocks).
> >>>
> >>> What do people think?
> >>
> >>
>


Re: Proposal to simplify start/end dates

2017-03-07 Thread Dan Davydov
Sure thing.

Current Behavior:
- User creates DAG with default_args start date to 2015
- dagrun gets kicked off for 2015
- User changes default_args start date to 2016
- dagruns continue running for 2015

New Behavior:
- User creates DAG with default_args start date to 2015
- dagrun gets kicked off for 2015
- User changes default_args start date to 2016
- *dagruns start running for the 2016 start date instead of 2015*

On Tue, Mar 7, 2017 at 11:49 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hey Dan,
>
> Im not sure if I am seeing a difference for #1 vs now, except you are
> excluding backfills now from the calculation? Can you provide an example?
>
> Bolke
>
> > On 7 Mar 2017, at 20:38, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >
> > A very common source of confusion for our users is when they specify
> > start_date in default_args but not in their DAG arguments and then try to
> > change this start_date to move the execution of their DAG forward (e.g.
> > from 2015 to 2016). This doesn't work because the logic that is used to
> > calculate the "initial" start date of a dag differs from the logic to
> > calculate subsequent dagrun start dates.
> >
> > Current Airflow Logic:
> > DS to schedule initial dagrun: dag.start_date if it exists, else
> min(start
> > date of tasks_of_dag)
> > DS to schedule subsequent dagruns: last_dagrun + scheduled_interval
> >
> > There are a couple ways of addressing this:
> > 1. Change the definition of start date for subsequent dagruns to match
> the
> > "initial" dagrun start date (calculated from the minimum of task start
> > dates)
> > 2. Force explicit dag start dates
> >
> > I personally like 1.
> >
> > I also propose that we throw errors for DAGs that have tasks that depend
> on
> > other tasks with start dates that occur after theirs (otherwise there
> could
> > be deadlocks).
> >
> > What do people think?
>
>


Proposal to simplify start/end dates

2017-03-07 Thread Dan Davydov
A very common source of confusion for our users is when they specify
start_date in default_args but not in their DAG arguments and then try to
change this start_date to move the execution of their DAG forward (e.g.
from 2015 to 2016). This doesn't work because the logic that is used to
calculate the "initial" start date of a dag differs from the logic to
calculate subsequent dagrun start dates.

Current Airflow Logic:
DS to schedule initial dagrun: dag.start_date if it exists, else min(start
date of tasks_of_dag)
DS to schedule subsequent dagruns: last_dagrun + scheduled_interval
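The two code paths above can be sketched as follows. `Dag` and `Task` here are minimal stand-ins, not Airflow's real classes; the sketch only illustrates the asymmetry being described.

```python
from datetime import datetime, timedelta

# Minimal stand-ins for illustration; not Airflow's real Dag/Task classes.
class Task:
    def __init__(self, start_date):
        self.start_date = start_date

class Dag:
    def __init__(self, tasks, start_date=None,
                 schedule_interval=timedelta(days=1)):
        self.tasks = tasks
        self.start_date = start_date
        self.schedule_interval = schedule_interval

def next_dagrun_date(dag, last_dagrun=None):
    if last_dagrun is None:
        # Initial run: dag.start_date if set, else the earliest task start date.
        if dag.start_date is not None:
            return dag.start_date
        return min(t.start_date for t in dag.tasks)
    # Subsequent runs ignore start dates entirely, hence the confusion.
    return last_dagrun + dag.schedule_interval

dag = Dag([Task(datetime(2016, 1, 1))])   # user moved start_date up to 2016...
print(next_dagrun_date(dag, last_dagrun=datetime(2015, 6, 1)))
# ...but runs keep marching on from 2015
```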

There are a couple ways of addressing this:
1. Change the definition of start date for subsequent dagruns to match the
"initial" dagrun start date (calculated from the minimum of task start
dates)
2. Force explicit dag start dates

I personally like 1.

I also propose that we throw errors for DAGs that have tasks that depend on
other tasks with start dates that occur after theirs (otherwise there could
be deadlocks).

What do people think?


Re: Airflow running different with different user id ?

2017-03-03 Thread Dan Davydov
Yes, it is available starting in 1.8.0, which will be released soon; you can look in
the documentation/grep for "run_as".
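(For reference, a hedged sketch: per-task impersonation is exposed via a `run_as_user` operator argument, and later configs add a `default_impersonation` fallback in `airflow.cfg`; verify both against your version's docs, as suggested above.)

```ini
[core]
# Fallback unix user to impersonate when a task does not set run_as_user.
# Illustrative value; check your version's configuration reference.
default_impersonation = etl_user
```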

On Mar 3, 2017 8:50 AM, "Michael Gong"  wrote:

> Hi,
>
>
> Suppose I have 1 airflow instance running 2 different DAGs, is it possible
> to specify the 2 DAGs running under 2 different ids ?
>
>
> Any advice is welcome.
>
>
> Thanks.
>
> Michael
>
>
>
>
>


Re: Getting to RC5: Update

2017-03-01 Thread Dan Davydov
Agreed, I created a JIRA a couple of minutes ago (it's a subtask in the
JIRA I mentioned).

On Wed, Mar 1, 2017 at 10:58 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Please create a Jira and provide context when this happens. “REMOVED”
> marked means the taskinstance does not have a task equivalent anymore in
> the dag (or so it should :)).
>
> Bolke
>
> > On 1 Mar 2017, at 19:55, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >
> > We are seeing another major issue with backfills where task instances are
> > being deleted and marked as "removed", I am still investigating. Let's
> keep
> > discussion about these in https://issues.apache.org/
> jira/browse/AIRFLOW-921
> > and the subtask comments to have it one place. I will look at the other
> > points you cc'd me on too. Thanks for continuing to drive this forward!
> >
> > On Wed, Mar 1, 2017 at 8:22 AM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> Just wanted to give an update about the progress getting to RC5. As
> >> reported we have 6 blockers listed.
> >>
> >> 1. Double run job should not terminate the existing running job. ->
> Patch
> >> Available
> >> 2. Parallelize dag runs in backfills -> Patch Available, Tests need to
> be
> >> updated, see below
> >> 3. Setting a task to running manually breaks a DAGs UI -> Patch merged
> >> 4. Can't mark non-existent tasks as successful from graph view ->
> >> Workaround available (t.b.c.), Patch Available unit tests to be added
> >> 5. (Named)HivePartitionSensor broken if hook attr not set -> Patch
> merged
> >> 6. Skipped tasks potentially cause a dagrun to be marked as
> >> failure/success prematurely -> see below
> >>
> >> On 2 I would like to have some more discussion on whether this would be
> acceptable
> >> (https://github.com/apache/incubator-airflow/pull/2107). I have written
> >> the patch for this, however we are not large backfill users. So I need
> >> feedback specifically on ripping out the “executor” part: @dan, @max.
> >>
> >> On 6 Alex has reported this earlier and written a PR for this (
> >> https://github.com/apache/incubator-airflow/pull/1961). Maxime had some
> >> thoughts about this, which are currently blocking the integration.
> However,
> >> in testing it seems to solve the issue. Can we finalise the discussion
> >> please @max @dan @alex?
> >>
> >> Cheers
> >> Bolke
>
>


Re: Getting to RC5: Update

2017-03-01 Thread Dan Davydov
We are seeing another major issue with backfills where task instances are
being deleted and marked as "removed", I am still investigating. Let's keep
discussion about these in https://issues.apache.org/jira/browse/AIRFLOW-921
and the subtask comments to have it one place. I will look at the other
points you cc'd me on too. Thanks for continuing to drive this forward!

On Wed, Mar 1, 2017 at 8:22 AM, Bolke de Bruin  wrote:

> Hi,
>
> Just wanted to give an update about the progress getting to RC5. As
> reported we have 6 blockers listed.
>
> 1. Double run job should not terminate the existing running job. -> Patch
> Available
> 2. Parallelize dag runs in backfills -> Patch Available, Tests need to be
> updated, see below
> 3. Setting a task to running manually breaks a DAGs UI -> Patch merged
> 4. Can't mark non-existent tasks as successful from graph view ->
> Workaround available (t.b.c.), Patch Available unit tests to be added
> 5. (Named)HivePartitionSensor broken if hook attr not set -> Patch merged
> 6. Skipped tasks potentially cause a dagrun to be marked as
> failure/success prematurely -> see below
>
> On 2 I would like to have some more discussion on whether this would be acceptable
> (https://github.com/apache/incubator-airflow/pull/2107). I have written
> the patch for this, however we are not large backfill users. So I need
> feedback specifically on ripping out the “executor” part: @dan, @max.
>
> On 6 Alex has reported this earlier and written a PR for this (
> https://github.com/apache/incubator-airflow/pull/1961). Maxime had some
> thoughts about this, which are currently blocking the integration. However,
> in testing it seems to solve the issue. Can we finalise the discussion
> please @max @dan @alex?
>
> Cheers
> Bolke


Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-27 Thread Dan Davydov
rc + your patch (and a couple of our own custom ones)

On Mon, Feb 27, 2017 at 2:11 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Dan
>
> Btw are you running with my patch for this? Or still plain rc?
>
> Cheers
> Bolke
>
> Sent from my iPhone
>
> > On 27 Feb 2017, at 22:46, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > I'll have a look. I verified and the code is there to take of this.
> >
> > B.
> >
> > Sent from my iPhone
> >
> >> On 27 Feb 2017, at 22:34, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >>
> >> Repro steps:
> >> - Create a DAG with a dummy task
> >> - Let this DAG run for one dagrun
> >> - Add a new subdag operator that contains a dummy operator to this DAG
> that
> >> has depends_on_past set to true
> >> - click on the white square for the new subdag operator in the DAGs
> first
> >> dagrun
> >> - Click "Zoom into subdag" (takes you to the graph view for the subdag)
> >> - Click the dummy task in the graph view and click "Mark Success"
> >> - Observe that the list of tasks to mark as success is empty (it should
> >> contain the dummy task)
> >>
> >>> On Mon, Feb 27, 2017 at 1:03 PM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >>>
> >>> Dan
> >>>
> >>> Can you elaborate on 2, cause I thought I specifically took care of
> that.
> >>>
> >>> Cheers
> >>> Bolke
> >>>
> >>> Sent from my iPhone
> >>>
> >>>> On 27 Feb 2017, at 20:27, Dan Davydov <dan.davy...@airbnb.com.
> INVALID>
> >>> wrote:
> >>>>
> >>>> I created https://issues.apache.org/jira/browse/AIRFLOW-921 to track
> the
> >>>> pending issues.
> >>>>
> >>>> There are two more issues we found which I included there:
> >>>> 1. Task instances that have their state manually set to running make
> the
> >>> UI
> >>>> for their DAG unable to parse
> >>>> 2. Mark success doesn't work for non existent task instances/dagruns
> >>> which
> >>>> breaks the subdag use case (setting tasks as successful via the graph
> >>> view)
> >>>>
> >>>>> On Mon, Feb 27, 2017 at 11:06 AM, Bolke de Bruin <bdbr...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>> Hey Max
> >>>>>
> >>>>> It is massive for sure. Sorry about that ;-). However it is not as
> >>> massive
> >>>>> as you might deduct from a first view. 0) run tasks concurrently
> across
> >>> dag
> >>>>> runs 1) ordering of the tasks was added to the loop. 2) calculating
> of
> >>>>> deadlocks, running tasks, tasks to run was corrected, 3) relying on
> the
> >>>>> executor for status updates was replaced, 4) (tbd) executor failure
> >>> check
> >>>>> to protect against endless Ioops.
> >>>>>
> >>>>> 0+1 seem bigger than they are due to the amount of lines changed. 2
> is a
> >>>>> subtle change, that touches a couple of lines to pop/push properly.
> 3)
> >>> is
> >>>>> bigger, as I didn't like the reliance on the executor. 4) is old code
> >>> that
> >>>>> needs to be added again.
> >>>>>
> >>>>> I probably can leave out 3 which makes 4 moot. The change would be
> >>>>> smaller. Maybe I could even completely remove 3 and just add 4. What
> are
> >>>>> your thoughts?
> >>>>>
> >>>>> The random failures we were seeing were the "implicit" test of not
> >>>>> executing in the right order and then deadlocking. But no explicit
> tests
> >>>>> exist. Help would definitely be appreciated.
> >>>>>
> >>>>> Yes I thought about using the scheduler and/or reusing logic from the
> >>>>> scheduler. I even experimented a little with it but it didn't allow
> me
> >>> to
> >>>>> pass the tests effectively.
> >>>>>
> >>>>> What I am planning to do is split the function and make it unit
> testable
> >>>>> if you agree with the current approach.
> >>>>>
> >>>>> Bolke
> >>>>>

Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-27 Thread Dan Davydov
Repro steps:
- Create a DAG with a dummy task
- Let this DAG run for one dagrun
- Add a new subdag operator that contains a dummy operator to this DAG that
has depends_on_past set to true
- click on the white square for the new subdag operator in the DAGs first
dagrun
- Click "Zoom into subdag" (takes you to the graph view for the subdag)
- Click the dummy task in the graph view and click "Mark Success"
- Observe that the list of tasks to mark as success is empty (it should
contain the dummy task)

On Mon, Feb 27, 2017 at 1:03 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Dan
>
> Can you elaborate on 2, cause I thought I specifically took care of that.
>
> Cheers
> Bolke
>
> Sent from my iPhone
>
> > On 27 Feb 2017, at 20:27, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >
> > I created https://issues.apache.org/jira/browse/AIRFLOW-921 to track the
> > pending issues.
> >
> > There are two more issues we found which I included there:
> > 1. Task instances that have their state manually set to running make the
> UI
> > for their DAG unable to parse
> > 2. Mark success doesn't work for non existent task instances/dagruns
> which
> > breaks the subdag use case (setting tasks as successful via the graph
> view)
> >
> >> On Mon, Feb 27, 2017 at 11:06 AM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >>
> >> Hey Max
> >>
> >> It is massive for sure. Sorry about that ;-). However it is not as
> massive
> >> as you might deduce from a first view. 0) run tasks concurrently across
> dag
> >> runs 1) ordering of the tasks was added to the loop. 2) calculating of
> >> deadlocks, running tasks, tasks to run was corrected, 3) relying on the
> >> executor for status updates was replaced, 4) (tbd) executor failure
> check
> >> to protect against endless loops.
> >>
> >> 0+1 seem bigger than they are due to the amount of lines changed. 2 is a
> >> subtle change, that touches a couple of lines to pop/push properly. 3)
> is
> >> bigger, as I didn't like the reliance on the executor. 4) is old code
> that
> >> needs to be added again.
> >>
> >> I probably can leave out 3 which makes 4 moot. The change would be
> >> smaller. Maybe I could even completely remove 3 and just add 4. What are
> >> your thoughts?
> >>
> >> The random failures we were seeing were the "implicit" test of not
> >> executing in the right order and then deadlocking. But no explicit tests
> >> exist. Help would definitely be appreciated.
> >>
> >> Yes I thought about using the scheduler and/or reusing logic from the
> >> scheduler. I even experimented a little with it but it didn't allow me
> to
> >> pass the tests effectively.
> >>
> >> What I am planning to do is split the function and make it unit testable
> >> if you agree with the current approach.
> >>
> >> Bolke
> >>
> >> Sent from my iPhone
> >>
> >>> On 27 Feb 2017, at 18:35, Maxime Beauchemin <
> maximebeauche...@gmail.com>
> >> wrote:
> >>>
> >>> This PR is pretty massive and complex! It looks like solid work but
> let's
> >>> be really careful around testing and rolling this out.
> >>>
> >>> This may be out of scope for this PR, but wanted to discuss the idea of
> >>> using the scheduler's logic to perform backfills. It'd be nice to have
> >> that
> >>> logic in one place, though I lost grasp on the details around
> feasibility
> >>> around this approach. I'm sure you looked into this option before
> issuing
> >>> this PR and I'm curious to hear your thoughts on blockers/challenges
> >> around
> >>> this alternate approach.
> >>>
> >>> Also I'm wondering whether we have any sort of mechanisms in our
> >>> integration test to validate that task dependencies are respected and
> run
> >>> in the right order. If not I was thinking we could build some
> abstraction
> >>> to make it easy to write this type of tests in an expressive way.
> >>>
> >>> ```
> >>> #[some code to run a backfill, or a scheduler session]
> >>> it = IntegrationTestResults(dag_id='example1')
> >>> assert it.ran_before('task1', 'task_2')
> >>> assert it.overlapped('task1', 'task_3') # confirms 2 tasks ran in
> >> parallel
> >>> assert it.none_failed()
> >>> assert it.ran_last('root')
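Max's `IntegrationTestResults` idea can be prototyped independently of Airflow. The class below is a hypothetical sketch, not an existing Airflow API: it is fed per-task `(start, end, state)` tuples, which a real version would load from the `task_instance` table, and answers the ordering assertions from them.

```python
from datetime import datetime, timedelta

class IntegrationTestResults:
    """Toy helper: answers ordering queries from {task_id: (start, end, state)}.
    All names and the record format are invented for this sketch."""

    def __init__(self, runs):
        self.runs = runs

    def ran_before(self, first, second):
        # `first` must have finished before `second` started
        return self.runs[first][1] <= self.runs[second][0]

    def overlapped(self, a, b):
        # two tasks ran in parallel iff their [start, end] windows intersect
        sa, ea, _ = self.runs[a]
        sb, eb, _ = self.runs[b]
        return sa < eb and sb < ea

    def none_failed(self):
        return all(state != "failed" for _, _, state in self.runs.values())

    def ran_last(self, task_id):
        return self.runs[task_id][1] == max(end for _, end, _ in self.runs.values())

t0 = datetime(2017, 2, 27, 12, 0)
m = timedelta(minutes=1)
it = IntegrationTestResults({
    "task1":  (t0,          t0 + 2 * m, "success"),
    "task_2": (t0 + 3 * m,  t0 + 4 * m, "success"),
    "task_3": (t0 + 1 * m,  t0 + 5 * m, "success"),
    "root":   (t0 + 6 * m,  t0 + 7 * m, "success"),
})
assert it.ran_before("task1", "task_2")
assert it.overlapped("task1", "task_3")  # windows [0,2] and [1,5] intersect
assert it.none_failed()
assert it.ran_last("root")
print("ordering assertions hold")
```

The point of the abstraction is that a backfill or scheduler integration test only needs the recorded start/end timestamps to verify dependency ordering, not any scheduler internals.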

Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-27 Thread Dan Davydov
On 25 Feb 2017, at 09:07, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>
> >>> Hi Dan,
> >>>
> >>> - Backfill indeed runs only one dagrun at the time, see line 1755 of
> >> jobs.py. I’ll think about how to fix this over the weekend (I think it
> was
> >> my change that introduced this). Suggestions always welcome. Depending
> the
> >> impact it is a blocker or not. We don’t often use backfills and
> definitely
> >> not at your size, so that is why it didn’t pop up with us. I’m assuming
> >> blocker for now, btw.
> >>> - Speculation on the High DB Load. I’m not sure what your benchmark is
> >> here (1.7.1 + multi processor dags?), but as you mentioned in the code
> >> dependencies are checked a couple of times for one run and even task
> >> instance. Dependency checking requires aggregation on the DB, which is a
> >> performance killer. Annoying but not a blocker.
> >>> - Skipped tasks potentially cause a dagrun to be marked failure/success
> >> prematurely. BranchOperators are widely used; if it affects these
> operators,
> >> then it is a blocker.
> >>>
> >>> - Bolke
> >>>
> >>>> On 25 Feb 2017, at 02:04, Dan Davydov <dan.davy...@airbnb.com.
> INVALID>
> >> wrote:
> >>>>
> >>>> Update on old pending issues:
> >>>> - Black Squares in UI: Fix merged
> >>>> - Double Trigger Issue That Alex G Mentioned: Alex has a PR in flight
> >>>>
> >>>> New Issues:
> >>>> - Backfill seems to be having issues (only running one dagrun at a
> >> time),
> >>>> we are still investigating - might be a blocker
> >>>> - High DB Load (~8x more than 1.7) - We are still investigating but
> it's
> >>>> probably not a blocker for the release
> >>>> - Skipped tasks potentially cause a dagrun to be marked as
> >> failure/success
> >>>> prematurely - not sure whether or not to classify this as a blocker
> >> (only
> >>>> really an issue for users who use the BranchingPythonOperator, which
> >> AirBnB
> >>>> does)
> >>>>
> >>>> On Thu, Feb 23, 2017 at 5:59 PM, siddharth anand <san...@apache.org>
> >> wrote:
> >>>>
> >>>>> IMHO, a DAG run without a start date is non-sensical but is not
> >> enforced
> >>>>> That said, our UI allows for the manual creation of DAG Runs without
> a
> >>>>> start date as shown in the images below:
> >>>>>
> >>>>>
> >>>>> - https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot%202017-02-22%2016.00.40.png?dl=0
> >>>>> - https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot%202017-02-22%2016.02.22.png?dl=0
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 22, 2017 at 2:26 PM, Maxime Beauchemin <
> >>>>> maximebeauche...@gmail.com> wrote:
> >>>>>
> >>>>>> Our database may have edge cases that could be associated with
> running
> >>>>> any
> >>>>>> previous version that may or may not have been part of an official
> >>>>> release.
> >>>>>>
> >>>>>> Let's see if anyone else reports the issue. If no one does, one
> >> option is
> >>>>>> to release 1.8.0 as is with a comment in the release notes, and
> have a
> >>>>>> future official minor apache release 1.8.1 that would fix these
> minor
> >>>>>> issues that are not deal breaker.
> >>>>>>
> >>>>>> @bolke, I'm curious, how long does it take you to go through one
> >> release
> >>>>>> cycle? Oh, and do you have a documented step by step process for
> >>>>> releasing?
> >>>>>> I'd like to add the Pypi part to this doc and add committers that
> are
> >>>>>> interested to have rights on the project on Pypi.
> >>>>>>
> >>>>>> Max
> >>>>>>

Re: Cutting down on testing time

2017-02-27 Thread Dan Davydov
This looks like a great effort to me at least in the short term (in the
long term I think most of the integration tests should be run together if
the infra allows this). Another thing we could start looking into is
parallelizing tests (though this may require beefier machines from Travis).
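Parallelizing could start with something as simple as deterministic sharding of the test list across CI workers; the function and test names below are illustrative only, not an existing Travis or Airflow feature.

```python
# Hypothetical sketch: round-robin sharding of a test suite so that N
# parallel workers each run a disjoint subset and every test is owned
# by exactly one worker.

def shard(tests, num_shards, shard_index):
    """Return the subset of `tests` that worker `shard_index` should run."""
    return [t for i, t in enumerate(sorted(tests)) if i % num_shards == shard_index]

tests = [
    "tests.BackfillJobTest.test_backfill_examples",
    "tests.SchedulerJobTest.test_scheduler_start_date",
    "tests.CoreTest.test_scheduler_job",
    "tests.CliTests.test_backfill",
]
shards = [shard(tests, 2, i) for i in range(2)]

# Disjoint and complete: each test lands in exactly one shard.
assert set(shards[0]).isdisjoint(shards[1])
assert sorted(shards[0] + shards[1]) == sorted(tests)
print(shards[0])
```

Sorting before splitting makes the assignment stable across workers, so each Travis job can compute its own shard from just its index and the shard count.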

On Sat, Feb 25, 2017 at 8:58 AM, Bolke de Bruin  wrote:

> Hi All,
>
> Jeremiah and I have been looking into optimising the time that is spend on
> tests. The reason for this was that Travis’ runs are taking more and more
> time and we are being throttled by travis. As part of that we enabled color
> coding of test outcomes and timing of tests. The results are kind of
> …surprising.
>
> This is the top 20 of tests were we spend the most time. MySQL (remember
> concurrent access enabled) - https://s3.amazonaws.com/
> archive.travis-ci.org/jobs/205277617/log.txt:
>
> tests.BackfillJobTest.test_backfill_examples:  287.9209s
> tests.BackfillJobTest.test_backfill_multi_dates:  53.5198s
> tests.SchedulerJobTest.test_scheduler_start_date:  36.4935s
> tests.CoreTest.test_scheduler_job:  35.5852s
> tests.CliTests.test_backfill:  29.7484s
> tests.SchedulerJobTest.test_scheduler_multiprocessing:  26.1573s
> tests.DaskExecutorTest.test_backfill_integration:  24.5456s
> tests.CoreTest.test_schedule_dag_no_end_date_up_to_today_only:  17.3278s
> tests.SubDagOperatorTests.test_subdag_deadlock:  16.1957s
> tests.SensorTimeoutTest.test_timeout:  15.1000s
> tests.SchedulerJobTest.test_dagrun_deadlock_ignore_depends_on_past:
> 13.8812s
> tests.BackfillJobTest.test_cli_backfill_depends_on_past:  12.9539s
> tests.SchedulerJobTest.test_dagrun_deadlock_ignore_
> depends_on_past_advance_ex_date:  12.8779s
> tests.SchedulerJobTest.test_dagrun_success:  12.8177s
> tests.SchedulerJobTest.test_dagrun_root_fail:  10.3953s
> tests.SchedulerJobTest.test_dag_with_system_exit:  10.1132s
> tests.TransferTests.test_mysql_to_hive:  8.5939s
> tests.SchedulerJobTest.test_retry_still_in_executor:  8.1739s
> tests.SchedulerJobTest.test_dagrun_fail:  7.9855s
> tests.ImpersonationTest.test_default_impersonation:  7.4993s
>
> Yes we spend a whopping 5 minutes on executing all examples. Another
> interesting one is “tests.CoreTest.test_scheduler_job”. This test just
> checks whether certain directories are created as part of logging. This
> could have been covered by a real unit test just covering the functionality
> of the function that creates the files - now it takes 35s.
>
> We discussed several strategies for reducing time apart from rewriting
> some of the tests (that would be a herculean job!). What the most optimal
> seems is:
>
> 1. Run the scheduler tests apart from all other tests.
> 2. Run “operator” integration tests in their own unit.
> 3. Run UI tests separate
> 4. Run API tests separate
>
> This creates the following build matrix (warning ASCII art):
>
> +-----------+-----------+-----------+----+-----+
> |           | Scheduler | Operators | UI | API |
> +-----------+-----------+-----------+----+-----+
> | Python 2  |     x     |     x     | x  |  x  |
> | Python 3  |     x     |     x     | x  |  x  |
> | Kerberos  |           |           | x  |  x  |
> | Ldap      |           |           | x  |     |
> | Hive      |           |     x     | x  |  x  |
> | SSH       |           |     x     |    |     |
> | Postgres  |     x     |     x     | x  |  x  |
> | MySQL     |     x     |     x     | x  |  x  |
> | SQLite    |     x     |     x     | x  |  x  |
> +-----------+-----------+-----------+----+-----+
>
>
> So from this build matrix one can deduce that Postgres and MySQL are generic
> services that will be present in every build. In addition all builds will
> use Python 2 and Python 3. And I propose using Python 3.4 and Python 3.5.
>
>
> Furthermore, I would like us to label our tests correctly, e.g. unit test
> or integration test.
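The unit/integration labeling Bolke asks for can be sketched with nothing but the standard library; the decorator and the environment variable name below are invented for illustration, not Airflow's actual test harness.

```python
import os
import tempfile
import unittest

def integration(test):
    # Label a test "integration" and skip it unless explicitly enabled,
    # so the fast unit suite and the slow suite can run as separate CI
    # jobs. The environment variable name is made up for this sketch.
    return unittest.skipUnless(
        os.environ.get("RUN_INTEGRATION_TESTS"), "integration tests disabled"
    )(test)

class TestLogDirCreation(unittest.TestCase):
    # A "real unit test" in Bolke's sense: it exercises only a
    # directory-creation helper instead of a 35s scheduler run.
    def test_makes_dir(self):
        base = tempfile.mkdtemp()
        target = os.path.join(base, "dag_id", "task_id")
        os.makedirs(target)  # stand-in for the log-directory helper
        self.assertTrue(os.path.isdir(target))

class TestBackfillEndToEnd(unittest.TestCase):
    @integration
    def test_full_backfill(self):
        raise RuntimeError("would need a scheduler and a metadata DB")

loader = unittest.TestLoader()
suite = unittest.TestSuite()
suite.addTests(loader.loadTestsFromTestCase(TestLogDirCreation))
suite.addTests(loader.loadTestsFromTestCase(TestBackfillEndToEnd))
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("ok:", result.wasSuccessful(), "skipped:", len(result.skipped))
```

With the flag unset, only the fast unit test runs and the integration test is reported as skipped, which maps directly onto the proposed build-matrix split.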


Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-24 Thread Dan Davydov
Update on old pending issues:
- Black Squares in UI: Fix merged
- Double Trigger Issue That Alex G Mentioned: Alex has a PR in flight

New Issues:
- Backfill seems to be having issues (only running one dagrun at a time),
we are still investigating - might be a blocker
- High DB Load (~8x more than 1.7) - We are still investigating but it's
probably not a blocker for the release
- Skipped tasks potentially cause a dagrun to be marked as failure/success
prematurely - not sure whether or not to classify this as a blocker (only
really an issue for users who use the BranchingPythonOperator, which AirBnB
does)

On Thu, Feb 23, 2017 at 5:59 PM, siddharth anand <san...@apache.org> wrote:

> IMHO, a DAG run without a start date is nonsensical but is not enforced.
>  That said, our UI allows for the manual creation of DAG Runs without a
> start date as shown in the images below:
>
>
>- https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot%202017-02-22%2016.00.40.png?dl=0
>- https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot%202017-02-22%2016.02.22.png?dl=0
>
>
> On Wed, Feb 22, 2017 at 2:26 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > Our database may have edge cases that could be associated with running
> any
> > previous version that may or may not have been part of an official
> release.
> >
> > Let's see if anyone else reports the issue. If no one does, one option is
> > to release 1.8.0 as is with a comment in the release notes, and have a
> > future official minor apache release 1.8.1 that would fix these minor
> > issues that are not deal breaker.
> >
> > @bolke, I'm curious, how long does it take you to go through one release
> > cycle? Oh, and do you have a documented step by step process for
> releasing?
> > I'd like to add the Pypi part to this doc and add committers that are
> > interested to have rights on the project on Pypi.
> >
> > Max
> >
> > On Wed, Feb 22, 2017 at 2:00 PM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >
> > > So it is a database integrity issue? Afaik a start_date should always
> be
> > > set for a DagRun (create_dagrun does so). I didn't check the code
> though.
> > >
> > > Sent from my iPhone
> > >
> > > > On 22 Feb 2017, at 22:19, Dan Davydov <dan.davy...@airbnb.com.
> INVALID>
> > > wrote:
> > > >
> > > > Should clarify this occurs when a dagrun does not have a start date,
> > not
> > > a
> > > > dag (which makes it even less likely to happen). I don't think this
> is
> > a
> > > > blocker for releasing.
> > > >
> > > >> On Wed, Feb 22, 2017 at 1:15 PM, Dan Davydov <
> dan.davy...@airbnb.com>
> > > wrote:
> > > >>
> > > >> I rolled this out in our prod and the webservers failed to load due
> to
> > > >> this commit:
> > > >>
> > > >> [AIRFLOW-510] Filter Paused Dags, show Last Run & Trigger Dag
> > > >> 7c94d81c390881643f94d5e3d7d6fb351a445b72
> > > >>
> > > >> This fixed it:
> > > >> -<span class="glyphicon glyphicon-info-sign" aria-hidden="true"
> > > >>  title="Start Date: {{last_run.start_date.strftime('%Y-%m-%d %H:%M')}}"></span>
> > > >> +<span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span>
> > > >>
> > > >> This is caused by assuming that all DAGs have start dates set, so a
> > > broken
> > > >> DAG will take down the whole UI. Not sure if we want to make this a
> > > blocker
> > > >> for the release or not, I'm guessing for most deployments this would
> > > occur
> > > >> pretty rarely. I'll submit a PR to fix it soon.
> > > >>
> > > >>
> > > >>
> > > >> On Tue, Feb 21, 2017 at 9:49 AM, Chris Riccomini <
> > criccom...@apache.org
> > > >
> > > >> wrote:
> > > >>
> > > >>> Ack that the vote has already passed, but belated +1 (binding)
> > > >>>
> > > >>> On Tue, Feb 21, 2017 at 7:42 AM, Bolke de Bruin <bdbr...@gmail.com
> >
> > > >

Re: scheduler running on multiple nodes

2017-02-24 Thread Dan Davydov
We just had two running by accident for some period of time.

On Feb 24, 2017 5:52 AM, "Jason Jho" <jason@blueapron.com.invalid>
wrote:

> Hi Dan / Sid,
>
> Would you be able to elaborate on the multiple scheduler setup? Curious how
> that would have been deployed. Was the purpose to have some kind of
> failover or to distribute execution of jobs?
>
> Thanks!
> On Fri, Feb 24, 2017 at 3:49 AM Dan Davydov <dan.davy...@airbnb.com.
> invalid>
> wrote:
>
> > Fwiw Airbnb was running multiple schedulers for a short while on 1.7.1
> and
> > we didn't seem to have issues.
> >
> > On Feb 24, 2017 12:25 AM, "Bolke de Bruin" <bdbr...@gmail.com> wrote:
> >
> > > While I agree with the assessment of Sid that a lot has changed and we
> do
> > > not officially test on multiple schedulers, many changes were in the
> area
> > > of proper locking which benefit multiple schedulers. In addition the
> > tasks
> > > themselves have built in checks that they don’t run twice at the same
> > time.
> > >
> > > Yet YMMV.
> > >
> > > Bolke
> > >
> > > > On 24 Feb 2017, at 03:13, siddharth anand <san...@apache.org> wrote:
> > > >
> > > > I did  run 2 or more schedulers with Local Executors up until mid
> last
> > > > year. There have been enough changes to the code and feature
> additions
> > > that
> > > > I don't think this is a recommended practice at this point. Also,
> there
> > > is
> > > > not a lot of synchronization in the scheduler to ensure this will
> work.
> > > >
> > > > -s
> > > >
> > > > On Thu, Feb 9, 2017 at 6:47 AM, matus valo <matusv...@gmail.com>
> > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >>
> > > >>
> > > >> I am considering deployment of airflow as pipeline framework. I have
> > > found
> > > >> out multiple articles explaining deployment of airflow in
> distributed
> > > >> environment (e.g. [1]). Unfortunately, I was not able to find out
> any
> > > use
> > > >> case where scheduler is deployed distributed on multiple nodes. Is
> it
> > > >> possible to have scheduler distributed on multiple nodes to prevent
> > > single
> > > >> point of failure? I haven’t found any mention about it in
> > > documentation. I
> > > >> have found out in [2] that it is not possible but on the other hand
> in
> > > [3]
> > > >> is reference that this can be solved in new version of airflow.
> > > >>
> > > >>
> > > >>
> > > >> Thanks,
> > > >>
> > > >>
> > > >> Matus
> > > >>
> > > >>
> > > >>
> > > >> [1] http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/
> > > >>
> > > >> [2]
> > https://groups.google.com/forum/#!topic/airbnb_airflow/-1wKa3OcwME
> > > >>
> > > >> [3] https://issues.apache.org/jira/browse/AIRFLOW-678
> > > >>
> > >
> > >
> >
>


Re: scheduler running on multiple nodes

2017-02-24 Thread Dan Davydov
Fwiw Airbnb was running multiple schedulers for a short while on 1.7.1 and
we didn't seem to have issues.

On Feb 24, 2017 12:25 AM, "Bolke de Bruin"  wrote:

> While I agree with the assessment of Sid that a lot has changed and we do
> not officially test on multiple schedulers, many changes were in the area
> of proper locking which benefit multiple schedulers. In addition the tasks
> themselves have built in checks that they don’t run twice at the same time.
>
> Yet YMMV.
>
> Bolke
>
> > On 24 Feb 2017, at 03:13, siddharth anand  wrote:
> >
> > I did  run 2 or more schedulers with Local Executors up until mid last
> > year. There have been enough changes to the code and feature additions
> that
> > I don't think this is a recommended practice at this point. Also, there
> is
> > not a lot of synchronization in the scheduler to ensure this will work.
> >
> > -s
> >
> > On Thu, Feb 9, 2017 at 6:47 AM, matus valo  wrote:
> >
> >> Hi all,
> >>
> >>
> >>
> >> I am considering deployment of airflow as pipeline framework. I have
> found
> >> out multiple articles explaining deployment of airflow in distributed
> >> environment (e.g. [1]). Unfortunately, I was not able to find out any
> use
> >> case where scheduler is deployed distributed on multiple nodes. Is it
> >> possible to have scheduler distributed on multiple nodes to prevent
> single
> >> point of failure? I haven’t found any mention about it in
> documentation. I
> >> have found out in [2] that it is not possible but on the other hand in
> [3]
> >> is reference that this can be solved in new version of airflow.
> >>
> >>
> >>
> >> Thanks,
> >>
> >>
> >> Matus
> >>
> >>
> >>
> >> [1] http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/
> >>
> >> [2] https://groups.google.com/forum/#!topic/airbnb_airflow/-1wKa3OcwME
> >>
> >> [3] https://issues.apache.org/jira/browse/AIRFLOW-678
> >>
>
>
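The "proper locking" and don't-run-twice checks mentioned in this thread come down to atomic claims on rows in the metadata database. A toy sketch of that idea, with sqlite standing in for the real MySQL/Postgres metadata DB and the table and key format invented for illustration (Airflow's actual implementation differs):

```python
import sqlite3

# In-memory DB in autocommit mode; one table of task-instance states.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE ti (key TEXT PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO ti VALUES ('dag1.task1.2017-02-24', 'scheduled')")

def try_claim(conn, key):
    # UPDATE ... WHERE state='scheduled' is atomic: only one caller flips
    # the row; any other claimant sees rowcount 0 and backs off instead
    # of queuing the same task instance a second time.
    cur = conn.execute(
        "UPDATE ti SET state='queued' WHERE key=? AND state='scheduled'", (key,))
    return cur.rowcount == 1

first = try_claim(conn, "dag1.task1.2017-02-24")   # this "scheduler" wins
second = try_claim(conn, "dag1.task1.2017-02-24")  # a concurrent one loses
print(first, second)  # -> True False
```

Because the database serializes the two UPDATEs, two schedulers racing on the same task instance cannot both claim it, which is the property that made the accidental multi-scheduler setups above mostly survivable.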


Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-23 Thread Dan Davydov
Some more issues found by our users in addition to the one Alex reported
and the UI issue when a dagrun doesn't have a start date:
1. If a task fails, the whole dagrun immediately fails; this is a
very large change to how control flow works, as the rest of the tasks in the
DAG are not run (even e.g. leaf tasks). The same is true of the skipped
status (if a leaf task is skipped then the root task for the DAG will get
skipped and none of the other tasks in the DAG will run).
2. The black squares in the UI for tasks that aren't ready to run yet are
confusing and make it hard for users to see which tasks haven't run yet
(lower contrast). We should never initialize tasks in the DB that do not
have a state (or at the least these should be white).
3. The Dagrun has a get_task_instance method that will fail if a dagrun
doesn't have a copy of a task instance created, which we have seen happen
for some DAGs. This prevents those tasks from getting scheduled.

I already patched 3 (and have a PR in flight for open source), and am
working on a patch for 1 internally. 1 should be a blocker for releasing.
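Issue 1 above is about when a dagrun may be finalized. The sketch below illustrates the intended semantics, deciding the run state from leaf-task states only once all of them are terminal; the exact terminal-state set and the skipped-counts-as-success rule are assumptions for illustration, not Airflow's code.

```python
# States from which a task can make no further progress (assumed set).
TERMINAL = {"success", "failed", "skipped", "upstream_failed"}

def dagrun_state(leaf_states):
    """leaf_states: state string (or None if unrun) for each leaf task."""
    if any(s not in TERMINAL for s in leaf_states):
        return "running"  # something can still run: don't finalize yet
    if any(s in ("failed", "upstream_failed") for s in leaf_states):
        return "failed"
    return "success"

# A failure elsewhere must not short-circuit leaves that haven't run yet:
assert dagrun_state(["success", None]) == "running"
assert dagrun_state(["success", "upstream_failed"]) == "failed"
assert dagrun_state(["success", "skipped"]) == "success"
print("leaf-based finalization works")
```

The bug described above is equivalent to finalizing as soon as any single task fails, which skips the first branch and never lets the remaining leaf tasks run.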

On Wed, Feb 22, 2017 at 4:38 PM, Alex Guziel <alex.guz...@airbnb.com.invalid
> wrote:

> I have some concern that this change
> https://github.com/apache/incubator-airflow/pull/1939
> [AIRFLOW-679] may be having issues because we are seeing lots of double
> triggers
> of tasks and tasks being killed as a result.
>
>
>
>
>
> On Wed, Feb 22, 2017 4:35 PM, Dan Davydov dan.davy...@airbnb.com.INVALID
> wrote:
> Bumping the thread so another user can comment.
>
>
>
>
> On Wed, Feb 22, 2017 at 3:12 PM, Maxime Beauchemin <
>
> maximebeauche...@gmail.com> wrote:
>
>
>
>
> > What I meant to ask is "how much engineering effort it takes to bake a
>
> > single RC?", I guess it depends on how much git-fu is necessary plus some
>
> > overhead cost of doing the series of actions/commands/emails/jira.
>
> >
>
> > I can volunteer for 1.8.1 (hopefully I can do it along with another Airbnb
>
> > engineer/volunteer to tag along) and will try to document/automate
>
> > everything I can as I go through the process. The goal of 1.8.1 could be
> to
>
> > basically package 1.8.0 + Dan's bugfix, and for Airbnb to get familiar
> with
>
> > the process.
>
> >
>
> > It'd be great if you can dump your whole process on the wiki, and we'll
>
> > improve it on this next pass.
>
> >
>
> > Thanks again for the mountain of work that went into packaging this
>
> > release.
>
> >
>
> > Max
>
> >
>
> > On Wed, Feb 22, 2017 at 2:44 PM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
>
> >
>
> > > I thought you volunteered to baby sit 1.8.1 Chris ;-)?
>
> > >
>
> > > Sent from my iPhone
>
> > >
>
> > > > On 22 Feb 2017, at 23:31, Chris Riccomini <criccom...@apache.org>
>
> > wrote:
>
> > > >
>
> > > > I'm +1 for doing a 1.8.1 fast follow-on
>
> > > >
>
> > > > On Wed, Feb 22, 2017 at 2:26 PM, Maxime Beauchemin <
>
> > > > maximebeauche...@gmail.com> wrote:
>
> > > >
>
> > > >> Our database may have edge cases that could be associated with
> running
>
> > > any
>
> > > >> previous version that may or may not have been part of an official
>
> > > release.
>
> > > >>
>
> > > >> Let's see if anyone else reports the issue. If no one does, one
> option
>
> > > is
>
> > > >> to release 1.8.0 as is with a comment in the release notes, and
> have a
>
> > > >> future official minor apache release 1.8.1 that would fix these
> minor
>
> > > >> issues that are not deal breaker.
>
> > > >>
>
> > > >> @bolke, I'm curious, how long does it take you to go through one
>
> > release
>
> > > >> cycle? Oh, and do you have a documented step by step process for
>
> > > releasing?
>
> > > >> I'd like to add the Pypi part to this doc and add committers that
> are
>
> > > >> interested to have rights on the project on Pypi.
>
> > > >>
>
> > > >> Max
>
> > > >>
>
> > > >>> On Wed, Feb 22, 2017 at 2:00 PM, Bolke de Bruin <bdbr...@gmail.com
> >
>
> > > wrote:
>
> > > >>>
>
> > > >>> So it is a database integrity issue? Afaik a start_date should
> always
>
> > > be
>
> > > >>> set fo

Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-22 Thread Dan Davydov
Bumping the thread so another user can comment.

On Wed, Feb 22, 2017 at 3:12 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> What I meant to ask is "how much engineering effort it takes to bake a
> single RC?", I guess it depends on how much git-fu is necessary plus some
> overhead cost of doing the series of actions/commands/emails/jira.
>
> I can volunteer for 1.8.1 (hopefully I can do it along with another Airbnb
> engineer/volunteer to tag along) and will try to document/automate
> everything I can as I go through the process. The goal of 1.8.1 could be to
> basically package 1.8.0 + Dan's bugfix, and for Airbnb to get familiar with
> the process.
>
> It'd be great if you can dump your whole process on the wiki, and we'll
> improve it on this next pass.
>
> Thanks again for the mountain of work that went into packaging this
> release.
>
> Max
>
> On Wed, Feb 22, 2017 at 2:44 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> > I thought you volunteered to baby sit 1.8.1 Chris ;-)?
> >
> > Sent from my iPhone
> >
> > > On 22 Feb 2017, at 23:31, Chris Riccomini <criccom...@apache.org>
> wrote:
> > >
> > > I'm +1 for doing a 1.8.1 fast follow-on
> > >
> > > On Wed, Feb 22, 2017 at 2:26 PM, Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > >> Our database may have edge cases that could be associated with running
> > any
> > >> previous version that may or may not have been part of an official
> > release.
> > >>
> > >> Let's see if anyone else reports the issue. If no one does, one option
> > is
> > >> to release 1.8.0 as is with a comment in the release notes, and have a
> > >> future official minor apache release 1.8.1 that would fix these minor
> > >> issues that are not deal breaker.
> > >>
> > >> @bolke, I'm curious, how long does it take you to go through one
> release
> > >> cycle? Oh, and do you have a documented step by step process for
> > releasing?
> > >> I'd like to add the Pypi part to this doc and add committers that are
> > >> interested to have rights on the project on Pypi.
> > >>
> > >> Max
> > >>
> > >>> On Wed, Feb 22, 2017 at 2:00 PM, Bolke de Bruin <bdbr...@gmail.com>
> > wrote:
> > >>>
> > >>> So it is a database integrity issue? Afaik a start_date should always
> > be
> > >>> set for a DagRun (create_dagrun does so). I didn't check the code
> > though.
> > >>>
> > >>> Sent from my iPhone
> > >>>
> > >>>> On 22 Feb 2017, at 22:19, Dan Davydov <dan.davy...@airbnb.com.
> > INVALID>
> > >>> wrote:
> > >>>>
> > >>>> Should clarify this occurs when a dagrun does not have a start date,
> > >> not
> > >>> a
> > >>>> dag (which makes it even less likely to happen). I don't think this
> is
> > >> a
> > >>>> blocker for releasing.
> > >>>>
> > >>>>> On Wed, Feb 22, 2017 at 1:15 PM, Dan Davydov <
> dan.davy...@airbnb.com
> > >
> > >>> wrote:
> > >>>>>
> > >>>>> I rolled this out in our prod and the webservers failed to load due
> > to
> > >>>>> this commit:
> > >>>>>
> > >>>>> [AIRFLOW-510] Filter Paused Dags, show Last Run & Trigger Dag
> > >>>>> 7c94d81c390881643f94d5e3d7d6fb351a445b72
> > >>>>>
> > >>>>> This fixed it:
> > >>>>> -<span class="glyphicon glyphicon-info-sign" aria-hidden="true"
> > >>>>>  title="Start Date: {{last_run.start_date.strftime('%Y-%m-%d %H:%M')}}"></span>
> > >>>>> +<span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span>
> > >>>>>
> > >>>>> This is caused by assuming that all DAGs have start dates set, so a
> > >>> broken
> > >>>>> DAG will take down the whole UI. Not sure if we want to make this a
> > >>> blocker
> > >>>>> for the release or not, I'm guessing for most deployments this
> would
> > >>> occur
> > >>>>

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-17 Thread Dan Davydov
+1 (binding). Mark success works great now, thanks to Bolke for fixing.

On Fri, Feb 17, 2017 at 12:22 AM, Bolke de Bruin  wrote:

> Dear All,
>
> I have made the FOURTH RELEASE CANDIDATE of Airflow 1.8.0 available at:
> https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public keys
> are available at https://dist.apache.org/repos/dist/release/incubator/airflow/ .
> It is tagged with a local version
> “apache.incubating” so it allows upgrading from earlier releases.
>
> One issue has been fixed since release candidate 3:
>
> * mark success was not working properly
>
> No known issues anymore.
>
> I would also like to raise a VOTE for releasing 1.8.0 based on release
> candidate 4, i.e. just renaming release candidate 4 to 1.8.0 release.
>
> Please respond to this email by:
>
> +1,0,-1 with *binding* if you are a PMC member or *non-binding* if you are
> not.
>
> Thanks!
> Bolke
>
> My VOTE: +1 (binding)


Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc3

2017-02-10 Thread Dan Davydov
Our staging looks good, all the DAGs there pass.
+1 (binding)

On Fri, Feb 10, 2017 at 10:21 AM, Chris Riccomini 
wrote:

> Running in all environments. Will vote after the weekend to make sure
> things are working properly, but so far so good.
>
> On Fri, Feb 10, 2017 at 6:05 AM, Bolke de Bruin  wrote:
>
> > Dear All,
> >
> > Let’s try again!
> >
> > I have made the THIRD RELEASE CANDIDATE of Airflow 1.8.0 available at:
> > https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public keys
> > are available at https://dist.apache.org/repos/dist/release/incubator/airflow/ .
> > It is tagged with a local version “apache.incubating” so it allows
> > upgrading from earlier releases.
> >
> > Two issues have been fixed since release candidate 2:
> >
> > * trigger_dag could create dags with fractional seconds, not supported by
> > logging and UI at the moment
> > * local api client trigger_dag had a hardcoded execution_date of None
> >
> > Known issue:
> > * Airflow on kubernetes and num_runs -1 (default) can expose import
> issues.
> >
> > I have extensively discussed this with Alex (reporter) and we consider
> > this a known issue with a workaround available as we are unable to
> > replicate this in a different environment. UPDATING.md has been updated
> > with the work around.
> >
> > As these issues are confined to a very specific area and full unit tests
> > were added I would also like to raise a VOTE for releasing 1.8.0 based on
> > release candidate 3, i.e. just renaming release candidate 3 to 1.8.0
> > release.
> >
> > Please respond to this email by:
> >
> > +1,0,-1 with *binding* if you are a PMC member or *non-binding* if you
> are
> > not.
> >
> > Thanks!
> > Bolke
> >
> > My VOTE: +1 (binding)
>


Re: Airflow 1.8.0 Release Candidate 1

2017-02-06 Thread Dan Davydov
Bolke, attached is the patch for the cgroups fix. Let me know which
branches you would like me to merge it to. If anyone has complaints about
the patch let me know (but it does not touch the core of airflow, only the
new cgroups task runner).

On Mon, Feb 6, 2017 at 4:24 PM, siddharth anand  wrote:

> Actually, I see the error is further down..
>
>   File
> "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py",
> line
> 469, in do_execute
>
> cursor.execute(statement, parameters)
>
> sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) null value in
> column "dag_id" violates not-null constraint
>
> DETAIL:  Failing row contains (null, running, 1, f).
>
>  [SQL: 'INSERT INTO dag_stats (state, count, dirty) VALUES (%(state)s,
> %(count)s, %(dirty)s)'] [parameters: {'count': 1L, 'state': u'running',
> 'dirty': False}]
>
> It looks like an autoincrement is missing for this table.
>
>
> I'm running `SQLAlchemy==1.1.4` - I see our setup.py specifies any version
> greater than 0.9.8
>
> -s
>
>
>
> On Mon, Feb 6, 2017 at 4:11 PM, siddharth anand  wrote:
>
> > I tried upgrading to 1.8.0rc1 from 1.7.1.3 via pip install
> > https://dist.apache.org/repos/dist/dev/incubator/airflow/
> > airflow-1.8.0rc1+apache.incubating.tar.gz and then running airflow
> > upgradedb didn't quite work. First, I thought it completed successfully,
> > then saw errors some tables were indeed missing. I ran it again and
> > encountered the following exception :
> >
> > DB: postgresql://app_coust...@db-cousteau.ep.stage.agari.com:
> 5432/airflow
> >
> > [2017-02-07 00:03:20,309] {db.py:284} INFO - Creating tables
> >
> > INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
> >
> > INFO  [alembic.runtime.migration] Will assume transactional DDL.
> >
> > INFO  [alembic.runtime.migration] Running upgrade 2e82aab8ef20 ->
> > 211e584da130, add TI state index
> >
> > INFO  [alembic.runtime.migration] Running upgrade 211e584da130 ->
> > 64de9cddf6c9, add task fails journal table
> >
> > INFO  [alembic.runtime.migration] Running upgrade 64de9cddf6c9 ->
> > f2ca10b85618, add dag_stats table
> >
> > INFO  [alembic.runtime.migration] Running upgrade f2ca10b85618 ->
> > 4addfa1236f1, Add fractional seconds to mysql tables
> >
> > INFO  [alembic.runtime.migration] Running upgrade 4addfa1236f1 ->
> > 8504051e801b, xcom dag task indices
> >
> > INFO  [alembic.runtime.migration] Running upgrade 8504051e801b ->
> > 5e7d17757c7a, add pid field to TaskInstance
> >
> > INFO  [alembic.runtime.migration] Running upgrade 5e7d17757c7a ->
> > 127d2bf2dfa7, Add dag_id/state index on dag_run table
> >
> > /usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/crud.py:692:
> > SAWarning: Column 'dag_stats.dag_id' is marked as a member of the primary
> > key for table 'dag_stats', but has no Python-side or server-side default
> > generator indicated, nor does it indicate 'autoincrement=True' or
> > 'nullable=True', and no explicit value is passed.  Primary key columns
> > typically may not store NULL. Note that as of SQLAlchemy 1.1,
> > 'autoincrement=True' must be indicated explicitly for composite (e.g.
> > multicolumn) primary keys if AUTO_INCREMENT/SERIAL/IDENTITY behavior is
> > expected for one of the columns in the primary key. CREATE TABLE
> statements
> > are impacted by this change as well on most backends.
> >
>
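The IntegrityError above comes down to a composite primary key: dag_stats keys on (dag_id, state), so no autoincrement or default applies to dag_id, and an INSERT that omits it writes NULL. A minimal, hedged sketch of the pattern (toy schema on SQLite, not Airflow's actual migration code; column types are assumptions):

```python
import sqlite3

# Toy reconstruction of the dag_stats table from the traceback above.
# The point is the composite primary key with a non-integer first column.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dag_stats (
        dag_id TEXT NOT NULL,
        state  TEXT NOT NULL,
        count  INTEGER NOT NULL DEFAULT 0,
        dirty  BOOLEAN NOT NULL DEFAULT 0,
        PRIMARY KEY (dag_id, state)
    )
    """
)

# The failing pattern from the log: dag_id is omitted, so the row arrives
# as (null, 'running', 1, false) and violates the NOT NULL constraint.
try:
    conn.execute(
        "INSERT INTO dag_stats (state, count, dirty) VALUES (?, ?, ?)",
        ("running", 1, False),
    )
except sqlite3.IntegrityError as exc:
    print("fails as reported:", exc)

# Supplying every primary-key column succeeds; no autoincrement is involved.
conn.execute(
    "INSERT INTO dag_stats (dag_id, state, count, dirty) VALUES (?, ?, ?, ?)",
    ("example_dag", "running", 1, False),
)
```

So the fix belongs in the INSERT (always bind dag_id) rather than in adding an autoincrement to the table.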


Re: Flow-based Airflow?

2017-02-06 Thread Dan Davydov
Woops looks like I replied to the wrong thread! Thanks Bolke.

On Mon, Feb 6, 2017 at 1:42 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Dataflow or 1.8?
>
> Sent from my iPhone
>
> > On 6 Feb 2017, at 22:35, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >
> > We have been running in our staging and have found a couple of issues. I
> > will report back with them soon.
> >
> >> On Thu, Feb 2, 2017 at 2:23 PM, Jeremiah Lowin <jlo...@apache.org>
> wrote:
> >>
> >> Very good point -- however I'm hesitant to overcomplicate the base
> class.
> >> At the moment users only have to override "serialize()" and
> "deserialize()"
> >> to build any form of remote-backed dataflow, and I like the simplicity
> of
> >> that.
> >>
> >> However, if you look at my implementation of the GCSDataflow, the
> >> constructor gets passed serializer and deserializer functions that are
> >> applied to the data before storage and after recovery. I think that
> sort of
> >> runtime-configurable serialization is in the spirit of what you're
> >> describing and it should be straightforward to adapt it for more
> specific
> >> requirements.
> >>
> >> On Thu, Feb 2, 2017 at 12:37 PM Laura Lorenz <llor...@industrydive.com>
> >> wrote:
> >>
> >>> This is great!
> >>>
> >>> We work with a lot of external data in wildly non-standard formats so
> >>> another enhancement here we'd use and support is passing customizable
> >>> serializers to Dataflow subclasses. This would let the dataflows
> keyword
> >>> arg for a task handle dependency management, the Dataflow class or
> >>> subclasses handle IO, and the Serializer subclasses handle parsing.
> >>>
> >>> Happy to contribute here, perhaps to create an S3Dataflow subclass in
> the
> >>> style of your Google Cloud storage one for this PR.
> >>>
> >>> Laura
> >>>
> >>> On Wed, Feb 1, 2017 at 6:14 PM, Jeremiah Lowin <jlo...@apache.org>
> >> wrote:
> >>>
> >>>> Great point. I think the best solution is to solve this for all XComs
> >> by
> >>>> checking object size before adding it to the DB. I don't see a built
> in
> >>> way
> >>>> of handling it (though apparently MySQL is internally limited to
> 64kb).
> >>>> I'll look into a PR that would enforce a similar limit for all
> >> databases.
> >>>>
> >>>> On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <
> >>>> maximebeauche...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> I'm not sure about XCom being the default, it seems pretty dangerous.
> >> It
> >>>> just takes one person that is not fully aware of the size of the data,
> >> or
> >>>> one day with an outlier and that could put the Airflow db in jeopardy.
> >>>>
> >>>> I guess it's always been an aspect of XCom, and it could be good to
> >> have
> >>>> some explicit gatekeeping there regardless of this PR/feature. Perhaps
> >>> the
> >>>> DB itself has protection against large blobs?
> >>>>
> >>>> Max
> >>>>
> >>>> On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin <jlo...@apache.org>
> >>> wrote:
> >>>>
> >>>>> Yesterday I began converting a complex script to a DAG. It turned out
> >>> to
> >>>> be
> >>>>> a perfect test case for the dataflow model: a big chunk of data
> >> moving
> >>>>> through a series of modification steps.
> >>>>>
> >>>>> So I have built an extensible dataflow extension for Airflow on top
> >> of
> >>>> XCom
> >>>>> and the existing dependency engine:
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-825
> >>>>> https://github.com/apache/incubator-airflow/pull/2046 (still waiting
> >>> for
> >>>>> tests... it will be quite embarrassing if they don't pass)
> >>>>>
> >>>>> The philosophy is simple:
> >>>>> Dataflow objects represent the output of upstream tasks. Downstream
> >>> tasks
> >>>>> add Dataflows with a specific key. When the downstream task runs, the
> >>>>> (optionally indexed) upstream result is available in the downstream
> >>>> context
> >>>>> under context['dataflows'][key]. In addition, PythonOperators receive
> >>> the
> >>>>> data as a keyword argument.
> >>>>>
> >>>>> The basic Dataflow serializes the data through XComs, but is
> >> trivially
> >>>>> extended to alternative storage via subclasses. I have provided (in
> >>>>> contrib) implementations of a local filesystem-based Dataflow as well
> >>> as
> >>>> a
> >>>>> Google Cloud Storage dataflow.
> >>>>>
> >>>>> Laura, I hope you can have a look and see if this will bring some of
> >>> your
> >>>>> requirements in to Airflow as first-class citizens.
> >>>>>
> >>>>> Jeremiah
> >>>>>
> >>>>
> >>>
> >>
>
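The extension point Jeremiah describes (a base Dataflow whose storage backend is swapped by overriding serialize() and deserialize()) can be sketched roughly as follows. This mirrors the spirit of the contrib local-filesystem variant; the class shape and method names here are assumptions for illustration, not the exact API of the PR:

```python
import json
import os
import tempfile

class Dataflow:
    """Base behavior: hand the value straight to XCom, unchanged."""

    def serialize(self, data):
        return data

    def deserialize(self, token):
        return token

class LocalFileDataflow(Dataflow):
    """Store the payload on disk; pass only a small path token through XCom."""

    def __init__(self, directory):
        self.directory = directory

    def serialize(self, data):
        # Write the payload to a file and return its path as the XCom value.
        fd, path = tempfile.mkstemp(dir=self.directory, suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        return path

    def deserialize(self, token):
        # Recover the payload from the path stored in XCom.
        with open(token) as f:
            return json.load(f)

flow = LocalFileDataflow(tempfile.mkdtemp())
token = flow.serialize({"rows": [1, 2, 3]})
print(flow.deserialize(token))  # {'rows': [1, 2, 3]}
```

Pushing only a path through XCom keeps the metadata database small, which is the gatekeeping concern Max raises in the thread.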


Re: Airflow 1.8.0 Release Candidate 1

2017-02-06 Thread Dan Davydov
On the Airbnb side we should be good once https://github.com/apache/
incubator-airflow/pull/2057/ is merged.

On Mon, Feb 6, 2017 at 9:23 AM, Chris Riccomini 
wrote:

> Upgraded to RC1 in all environments this morning. So far so good.
>
> On Fri, Feb 3, 2017 at 6:04 PM, Jeremiah Lowin  wrote:
>
> > For what it's worth -- everything running smoothly after 24+ hours in a
> > production(ish) environment.
> >
> > On Thu, Feb 2, 2017 at 11:25 PM Jayesh Senjaliya 
> > wrote:
> >
> > > Thank You Bolke for all the efforts you are putting in !!
> > >
> > > I have deployed this RC now.
> > >
> > > On Thu, Feb 2, 2017 at 3:02 PM, Jeremiah Lowin 
> > wrote:
> > >
> > > > Fantastic work on this Bolke, thank you!
> > > >
> > > > We've deployed the RC and will report if there are any issues...
> > > >
> > > > On Thu, Feb 2, 2017 at 4:32 PM Bolke de Bruin 
> > wrote:
> > > >
> > > > > Now I am blushing :-)
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On 2 Feb 2017, at 22:05, Boris Tyukin 
> > wrote:
> > > > > >
> > > > > > LOL awesome!
> > > > > >
> > > > > > On Thu, Feb 2, 2017 at 4:00 PM, Maxime Beauchemin <
> > > > > > maximebeauche...@gmail.com> wrote:
> > > > > >
> > > > > >> The Apache mailing doesn't support images so here's a link:
> > > > > >>
> > > > > >> http://i.imgur.com/DUkpjZu.png
> > > > > >>
> > > > > >> On Thu, Feb 2, 2017 at 12:52 PM, Boris Tyukin <
> > > bo...@boristyukin.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Bolke, you are our hero! I am sure you put a lot of your time
> to
> > > make
> > > > > it
> > > > > >>> happen
> > > > > >>>
> > > > > >>> On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin <
> > bdbr...@gmail.com>
> > > > > >> wrote:
> > > > > >>>
> > > > >  Hi All,
> > > > > 
> > > > >  I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0
> > > available
> > > > > >> at:
> > > > >  https://dist.apache.org/repos/dist/dev/incubator/airflow/ ,
> > > public
> > > > > >> keys
> > > > >  are available at
> > > > > https://dist.apache.org/repos/dist/release/incubator/
> > > > >  airflow/ . It is tagged with a local version
> “apache.incubating”
> > > so
> > > > it
> > > > >  allows upgrading from earlier releases. This should be
> > considered
> > > of
> > > > >  release quality, but not yet officially vetted as a release
> yet.
> > > > > 
> > > > >  Issues fixed:
> > > > >  * Use static nvd3 and d3
> > > > >  * Python 3 incompatibilities
> > > > >  * CLI API trigger dag issue
> > > > > 
> > > > >  As the difference between beta 5 and the release candidate is
> > > > > >> relatively
> > > > >  small I hope to start the VOTE for releasing 1.8.0 quite soon
> (2
> > > > > >> days?),
> > > > > >>> if
> > > > >  the vote passes also a vote needs to happen at the IPMC
> > > mailinglist.
> > > > > As
> > > > >  this is our first Apache release I expect some comments and
> > > required
> > > > >  changes and probably a RC 2.
> > > > > 
> > > > >  Furthermore, we now have a “v1-8-stable” branch. This has
> > version
> > > > >  “1.8.0rc1” and will graduate to “1.8.0” when we release. The
> > > > > >> “v1-8-test”
> > > > >  branch now has version “1.8.1alpha0” as version and “master”
> has
> > > > > >> version
> > > > >  “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means
> > > that,
> > > > > >> per
> > > > >  release guidelines, patches accompanied with an ASSIGNED Jira
> > and
> > > a
> > > > >  sign-off from a committer. Only then the release manager
> applies
> > > the
> > > > > >>> patch
> > > > >  to stable (In this case that would be me). The release manager
> > > then
> > > > > >>> closes
> > > > >  the bug when the patches have landed in the appropriate
> > branches.
> > > > For
> > > > > >>> more
> > > > >  information please see: https://cwiki.apache.org/
> > > > >  confluence/display/AIRFLOW/Airflow+Release+Planning+and+
> > > > >  Supported+Release+Lifetime  > > > >  confluence/display/AIRFLOW/Airflow+Release+Planning+and+
> > > > >  Supported+Release+Lifetime> .
> > > > > 
> > > > >  Any questions or suggestions don’t hesitate to ask!
> > > > > 
> > > > >  Cheers
> > > > >  Bolke
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
>


Re: Flow-based Airflow?

2017-02-06 Thread Dan Davydov
We have been running in our staging and have found a couple of issues. I
will report back with them soon.

On Thu, Feb 2, 2017 at 2:23 PM, Jeremiah Lowin  wrote:

> Very good point -- however I'm hesitant to overcomplicate the base class.
> At the moment users only have to override "serialize()" and "deserialize()"
> to build any form of remote-backed dataflow, and I like the simplicity of
> that.
>
> However, if you look at my implementation of the GCSDataflow, the
> constructor gets passed serializer and deserializer functions that are
> applied to the data before storage and after recovery. I think that sort of
> runtime-configurable serialization is in the spirit of what you're
> describing and it should be straightforward to adapt it for more specific
> requirements.
>
> On Thu, Feb 2, 2017 at 12:37 PM Laura Lorenz 
> wrote:
>
> > This is great!
> >
> > We work with a lot of external data in wildly non-standard formats so
> > another enhancement here we'd use and support is passing customizable
> > serializers to Dataflow subclasses. This would let the dataflows keyword
> > arg for a task handle dependency management, the Dataflow class or
> > subclasses handle IO, and the Serializer subclasses handle parsing.
> >
> > Happy to contribute here, perhaps to create an S3Dataflow subclass in the
> > style of your Google Cloud storage one for this PR.
> >
> > Laura
> >
> > On Wed, Feb 1, 2017 at 6:14 PM, Jeremiah Lowin 
> wrote:
> >
> > > Great point. I think the best solution is to solve this for all XComs
> by
> > > checking object size before adding it to the DB. I don't see a built in
> > way
> > > of handling it (though apparently MySQL is internally limited to 64kb).
> > > I'll look into a PR that would enforce a similar limit for all
> databases.
> > >
> > > On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <
> > > maximebeauche...@gmail.com>
> > > wrote:
> > >
> > > I'm not sure about XCom being the default, it seems pretty dangerous.
> It
> > > just takes one person that is not fully aware of the size of the data,
> or
> > > one day with an outlier and that could put the Airflow db in jeopardy.
> > >
> > > I guess it's always been an aspect of XCom, and it could be good to
> have
> > > some explicit gatekeeping there regardless of this PR/feature. Perhaps
> > the
> > > DB itself has protection against large blobs?
> > >
> > > Max
> > >
> > > On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin 
> > wrote:
> > >
> > > > Yesterday I began converting a complex script to a DAG. It turned out
> > to
> > > be
> > > > a perfect test case for the dataflow model: a big chunk of data
> moving
> > > > through a series of modification steps.
> > > >
> > > > So I have built an extensible dataflow extension for Airflow on top
> of
> > > XCom
> > > > and the existing dependency engine:
> > > > https://issues.apache.org/jira/browse/AIRFLOW-825
> > > > https://github.com/apache/incubator-airflow/pull/2046 (still waiting
> > for
> > > > tests... it will be quite embarrassing if they don't pass)
> > > >
> > > > The philosophy is simple:
> > > > Dataflow objects represent the output of upstream tasks. Downstream
> > tasks
> > > > add Dataflows with a specific key. When the downstream task runs, the
> > > > (optionally indexed) upstream result is available in the downstream
> > > context
> > > > under context['dataflows'][key]. In addition, PythonOperators receive
> > the
> > > > data as a keyword argument.
> > > >
> > > > The basic Dataflow serializes the data through XComs, but is
> trivially
> > > > extended to alternative storage via subclasses. I have provided (in
> > > > contrib) implementations of a local filesystem-based Dataflow as well
> > as
> > > a
> > > > Google Cloud Storage dataflow.
> > > >
> > > > Laura, I hope you can have a look and see if this will bring some of
> > your
> > > > requirements in to Airflow as first-class citizens.
> > > >
> > > > Jeremiah
> > > >
> > >
> >
>
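Jeremiah's proposed gate (checking serialized size before an XCom value reaches the database, with the 64 KB MySQL limit mentioned above as the cap) could look roughly like this. The constant, function name, and pickle-based sizing are illustrative assumptions, not Airflow's actual implementation:

```python
import pickle

# Assumed cap, matching the MySQL 64 kb limit mentioned in the thread.
MAX_XCOM_SIZE = 64 * 1024

def check_xcom_size(value):
    """Serialize the value and refuse to store it if it exceeds the cap."""
    blob = pickle.dumps(value)
    if len(blob) > MAX_XCOM_SIZE:
        raise ValueError(
            "XCom value is %d bytes; max is %d. Push a reference "
            "(e.g. a GCS/S3 path) instead of the data." % (len(blob), MAX_XCOM_SIZE)
        )
    return blob

small = check_xcom_size({"rows": 123})       # a few bytes: accepted
print(len(small), "bytes accepted")
try:
    check_xcom_size(list(range(1_000_000)))  # megabytes of data: rejected
except ValueError as exc:
    print("rejected:", exc)
```

Rejecting the value at push time is exactly the "explicit gatekeeping" Max asks for: the outlier day fails one task instead of bloating the shared metadata database.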


Re: Airflow 1.8.0 BETA 5

2017-01-30 Thread Dan Davydov
@Alex
I'm not able to reproduce locally (assuming the two python files are in the
same folder or are on your PYTHONPATH). I don't see that import error
anyway.

Just in case, what is your complete DAG definition? Is anyone else able to
repro?

On Mon, Jan 30, 2017 at 3:09 PM, Alex Van Boxel <a...@vanboxel.be> wrote:

> Well this means none of my DAG's work anymore:
>
> you just can't do this anymore:
>
> file bqschema.py with
>
> def marketing_segment():
>     return [
>         {"name": "user_id", "type": "integer", "mode": "nullable"},
>         {"name": "bucket_date", "type": "timestamp", "mode": "nullable"},
>         {"name": "segment_main", "type": "string", "mode": "nullable"},
>         {"name": "segment_sub", "type": "integer", "mode": "nullable"},
>     ]
>
>
> In marketing_segmentation.py:
>
>
> import bqschema
>
> Gives an error:
>
> Traceback (most recent call last):
>   File
> "/usr/local/lib/python2.7/site-packages/airflow-1.8.0b5+
> apache.incubating-py2.7.egg/airflow/models.py",
> line 264, in process_file
> m = imp.load_source(mod_name, filepath)
>   File "/home/airflow/dags/marketing_segmentation.py", line 17, in
> 
> import bqschema
> ImportError: No module named bqschema
>
> *I don't think this is incorrect?!*
>
>
>
> On Mon, Jan 30, 2017 at 11:46 PM Dan Davydov <dan.davy...@airbnb.com.
> invalid>
> wrote:
>
> > The latest commit fixed a regression since 1.7 that files with parsing
> > errors no longer showed up on the UI.
> >
> > On Mon, Jan 30, 2017 at 2:42 PM, Alex Van Boxel <a...@vanboxel.be>
> wrote:
> >
> > > Just installed beta 5 on our dev environment and it lit up like a
> christmas
> > > tree. I got a screen full of import errors. I see that the latest
> > commit
> > > did something with import errors... is it correct?!
> > >
> > > On Sun, Jan 29, 2017 at 4:37 PM Bolke de Bruin <bdbr...@gmail.com>
> > wrote:
> > >
> > > > Hey Boris
> > > >
> > > > The scheduler is a bit more aggressive and can use multiple
> processors,
> > > so
> > > > higher CPU usage is actually a good thing.
> > > >
> > In case it is really out of hand look at the new scheduler options and
> > > > heartbeat options (see PR for updating.md not in the beta yet).
> > > >
> > > > Bolke
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On 29 Jan 2017, at 15:35, Boris Tyukin <bo...@boristyukin.com>
> > wrote:
> > > > >
> > > > > I am not sure if it is my config or something, but looks like after
> > the
> > > > > upgrade and start of scheduler, airflow would totally hose CPU. The
> > > > reason
> > > > > is two new examples that start running right away - latest only and
> > > > latest
> > > > > with trigger. Once I pause them, CPU goes back to idle. Is this
> > because
> > > > now
> > > > > dags are not paused by default like it was before?
> > > > >
> > > > > As I mentioned before, I also had to upgrade mysql to 5.7 - if
> > someone
> > > > > needs a step by step instruction, make sure to follow all steps
> > > precisely
> > > > > here for in-place upgrade or you will have a heck of a time (like
> > me).
> > > > >
> > > > https://dev.mysql.com/doc/refman/5.7/en/upgrading.html#
> > > upgrade-procedure-inplace
> > > > >
> > > > > BTW official Oracle repository for Oracle Linux only has MySql 5.6
> -
> > > for
> > > > > 5.7 you have to use MySql community repo.
> > > > >
> > > > >> On Sat, Jan 28, 2017 at 10:07 AM, Bolke de Bruin <
> bdbr...@gmail.com
> > >
> > > > wrote:
> > > > >>
> > > > >> Hi All,
> > > > >>
> > > > >> I have made the FIFTH beta of Airflow 1.8.0 available at:
> > > > >> https://dist.apache.org/repos/dist/dev/incubator/airflow/ <
> > > > >> https://dist.apache.org/repos/dist/dev/incubator/airflow/> ,
> public
> > > > keys
> > > > >> are available at https://dist.apache.org/repos/
> > > dist/release/incubator/
> > > > >> airflow/ <https://dist.apache.org/repos/dist/release/incubator/
> > > airflow/
> > > > >
> > > > >> . It is tagged with a local version “apache.incubating” so it
> allows
> > > > >> upgrading from earlier releases.
> > > > >>
> > > > >> Issues fixed:
> > > > >> * Parsing errors not showing up in UI fixing a regression**
> > > > >> * Scheduler would terminate immediately if no dag files present
> > > > >>
> > > > >> ** As this touches the scheduler logic I though it warranted
> another
> > > > beta.
> > > > >>
> > > > >> This should be the last beta in my opinion and we can prepare
> > > changelog,
> > > > >> upgrade notes and release notes for the RC (Feb 2).
> > > > >>
> > > > >> Cheers
> > > > >> Bolke
> > > >
> > > --
> > >   _/
> > > _/ Alex Van Boxel
> > >
> >
> --
>   _/
> _/ Alex Van Boxel
>
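The import error Alex reports follows from how the DAG processor loads each file individually by absolute path (imp.load_source in models.py), so the dags folder itself may not end up on sys.path and sibling modules like bqschema fail to resolve. A self-contained sketch of the failure and the usual workaround; the module and path names are illustrative:

```python
import importlib
import os
import sys
import tempfile

# Simulate a dags folder containing a sibling helper module, as in
# Alex's bqschema.py example.
dags = tempfile.mkdtemp()
with open(os.path.join(dags, "bqschema.py"), "w") as f:
    f.write("def marketing_segment():\n    return [{'name': 'user_id'}]\n")

# Without the dags folder on sys.path, the sibling import fails, matching
# the reported traceback.
try:
    import bqschema  # noqa: F401
except ImportError as exc:
    print("fails as reported:", exc)

# The workaround: make sure the dags folder is on sys.path before the
# sibling import runs (or prepend os.path.dirname(__file__) inside the
# DAG file itself so it is robust to how the file is loaded).
sys.path.insert(0, dags)
bqschema = importlib.import_module("bqschema")
print(bqschema.marketing_segment())  # [{'name': 'user_id'}]
```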


Re: Airflow 1.8.0 BETA 5

2017-01-30 Thread Dan Davydov
The latest commit fixed a regression since 1.7 where files with parsing
errors no longer showed up in the UI.

On Mon, Jan 30, 2017 at 2:42 PM, Alex Van Boxel  wrote:

> Just installed beta 5 on our dev environment and it lit up like a christmas
> tree. I got a screen full of import errors. I see that the latest commit
> did something with import errors... is it correct?!
>
> On Sun, Jan 29, 2017 at 4:37 PM Bolke de Bruin  wrote:
>
> > Hey Boris
> >
> > The scheduler is a bit more aggressive and can use multiple processors,
> so
> > higher CPU usage is actually a good thing.
> >
> > In case it is really out of hand look at the new scheduler options and
> > heartbeat options (see PR for updating.md not in the beta yet).
> >
> > Bolke
> >
> > Sent from my iPhone
> >
> > > On 29 Jan 2017, at 15:35, Boris Tyukin  wrote:
> > >
> > > I am not sure if it is my config or something, but looks like after the
> > > upgrade and start of scheduler, airflow would totally hose CPU. The
> > reason
> > > is two new examples that start running right away - latest only and
> > latest
> > > with trigger. Once I pause them, CPU goes back to idle. Is this because
> > now
> > > dags are not paused by default like it was before?
> > >
> > > As I mentioned before, I also had to upgrade mysql to 5.7 - if someone
> > > needs a step by step instruction, make sure to follow all steps
> precisely
> > > here for in-place upgrade or you will have a heck of a time (like me).
> > >
> > https://dev.mysql.com/doc/refman/5.7/en/upgrading.html#
> upgrade-procedure-inplace
> > >
> > > BTW official Oracle repository for Oracle Linux only has MySql 5.6 -
> for
> > > 5.7 you have to use MySql community repo.
> > >
> > >> On Sat, Jan 28, 2017 at 10:07 AM, Bolke de Bruin 
> > wrote:
> > >>
> > >> Hi All,
> > >>
> > >> I have made the FIFTH beta of Airflow 1.8.0 available at:
> > >> https://dist.apache.org/repos/dist/dev/incubator/airflow/ <
> > >> https://dist.apache.org/repos/dist/dev/incubator/airflow/> , public
> > keys
> > >> are available at https://dist.apache.org/repos/
> dist/release/incubator/
> > >> airflow/  airflow/
> > >
> > >> . It is tagged with a local version “apache.incubating” so it allows
> > >> upgrading from earlier releases.
> > >>
> > >> Issues fixed:
> > >> * Parsing errors not showing up in UI fixing a regression**
> > >> * Scheduler would terminate immediately if no dag files present
> > >>
> > >> ** As this touches the scheduler logic I thought it warranted another
> > beta.
> > >>
> > >> This should be the last beta in my opinion and we can prepare
> changelog,
> > >> upgrade notes and release notes for the RC (Feb 2).
> > >>
> > >> Cheers
> > >> Bolke
> >
> --
>   _/
> _/ Alex Van Boxel
>


Re: Experiences with 1.8.0

2017-01-20 Thread Dan Davydov
I'd be happy to lend a hand fixing these issues and hopefully some others
are too. Do you mind creating jiras for these since you have the full
context? I have created a JIRA for (1) and have assigned it to myself:
https://issues.apache.org/jira/browse/AIRFLOW-780

On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin  wrote:

> This is to report back on some of the (early) experiences we have with
> Airflow 1.8.0 (beta 1 at the moment):
>
> 1. The UI does not show a faulty DAG, leading to confusion for developers.
> When a faulty dag is placed in the dags folder the UI would report a
> parsing error. Now it doesn’t due to the separate parsing (but not
> reporting back errors)
>
> 2. The hive hook sets ‘airflow.ctx.dag_id’ in hive
> We run in a secure environment which requires this variable to be
> whitelisted if it is modified (needs to be added to UPDATING.md)
>
> 3. DagRuns do not exist for certain tasks, but don’t get fixed
> Log gets flooded without a suggestion what to do
>
> 4. At start up all running dag_runs are being checked, we seemed to have a
> lot of “left over” dag_runs (couple of thousand)
> - Checking was logged to INFO -> requires a fsync for every log message
> making it very slow
> - Checking would happen at every restart, but dag_runs’ states were not
> being updated
> - These dag_runs would never be marked anything other than running for
> some reason
> -> Applied work around to update all dag_run in sql before a certain date
> to -> finished
> -> need to investigate why dag_runs did not get marked “finished/failed”
>
> 5. Our umask is set to 027
>
>
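The workaround in item 4 (bulk-updating leftover "running" dag_runs older than a cutoff) can be sketched as below against a toy SQLite copy of the dag_run table. The cutoff date and the target state are assumptions to adapt per installation: 'failed' is used here, while Bolke's note says "finished"; on a real metadata database the same UPDATE would be issued in MySQL or Postgres:

```python
import sqlite3

# Toy dag_run table with one stale leftover run and one current run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [
        ("etl", "2016-11-01", "running"),   # stale leftover
        ("etl", "2017-01-19", "running"),   # recent, keep running
    ],
)

# The workaround: anything still "running" before the cutoff is closed out,
# so the scheduler stops re-checking thousands of dead runs at startup.
conn.execute(
    "UPDATE dag_run SET state = 'failed' "
    "WHERE state = 'running' AND execution_date < ?",
    ("2017-01-01",),
)
print(conn.execute("SELECT execution_date, state FROM dag_run ORDER BY 1").fetchall())
# -> [('2016-11-01', 'failed'), ('2017-01-19', 'running')]
```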


Re: Airflow 1.8.0 alpha 4

2017-01-11 Thread Dan Davydov
The task dependency engine code is well commented, but I can provide a high
level overview specifically for developers if there is interest (note that
this would be the first documentation of it's kind in that it would be
developer-only documentation). The disadvantage is that it would create
duplication with the logic itself on quite a large scale. Let me know Bolke.

On Wed, Jan 11, 2017 at 1:30 PM, Chris Riccomini 
wrote:

> @bolke, this sounds like a good list.
>
> On Wed, Jan 11, 2017 at 12:01 PM, Bolke de Bruin 
> wrote:
>
> > Ok.
> >
> > For now to call it “beta” 4 items seems to be left:
> >
> > Blocker:
> > * retry_delay not respected
> > * poison pill due to re-queue before process has finished (to be
> > investigated)
> >
> > Features:
> > * cgroups + impersonation
> > * dag.catchup (Ben Tallman -> Only documentation is missing).
> >
> > PRs that contain documentation would really be appreciated. In my opinion
> > we are lacking there. Think about docs covering:
> > * new scheduler behaviour and options
> > * task dependency engine
> > * api / kerberized api
> > * …
> >
> > Cheers
> > Bolke
> >
> > > On 11 Jan 2017, at 18:59, Arthur Wiedmer 
> > wrote:
> > >
> > > +1
> > >
> > > We can always think about different ways of doing this later (fair
> share
> > > scheduling etc...)
> > >
> > > Best,
> > > Arthur
> > >
> > > On Wed, Jan 11, 2017 at 4:46 AM, Bolke de Bruin 
> > wrote:
> > >
> > >> Dear All,
> > >>
> > >> I would like to drop "Schedule all pending DAG runs in a single
> > scheduler
> > >> loop” from the 1.8.0 release (updated: https://github.com/apache/
> > >> incubator-airflow/pull/1980, original: https://github.com/apache/
> > >> incubator-airflow/pull/1906). The reason for this is that it, imho,
> > >> biases the scheduler towards a single DAG as it fills the queue with
> > tasks
> > >> from one DAG and then goes to the next DAG. Starving DAGs that come
> > after
> > >> the first for resources. As such it should be updated and that will
> take
> > >> time.
> > >>
> > >> Please let me know if I am incorrect.
> > >>
> > >> Thanks
> > >> Bolke
> > >>
> > >>> On 10 Jan 2017, at 09:25, Bolke de Bruin  wrote:
> > >>>
> > >>> Dear All,
> > >>>
> > >>> I have made Airflow 1.8.0 alpha 4 available at
> > >> https://people.apache.org/~bolke/ 
> .
> > >> Again no Apache release yet - this is for testing purposes. I consider
> > this
> > >> Alpha to be a Beta if not for the pending features. If the pending
> > features
> > >> are merged within a reasonable time frame (except for **, as no
> progress
> > >> currently) then I am planning to mark the tarball as Beta and only
> allow
> > >> bug fixes and (very) minor features. This week hopefully.
> > >>>
> > >>> Blockers:
> > >>>
> > >>> * None
> > >>>
> > >>> Fixed issues
> > >>> * Regression in email
> > >>> * LDAP case sensitivity
> > >>> * one_failed task not being run: now seems to pass suddenly (so
> fixed?)
> > >> -> need to investigate why
> > >>> * Email attachments
> > >>> * Pinned jinja2 to < 2.9.0 (2.9.1 has a confirmed regression)
> > >>> * Improve time units for task performance charts
> > >>> * XCom throws an duplicate / locking error
> > >>> * Add execution_date to trigger_dag
> > >>>
> > >>> Pending features:
> > >>> * DAG.catchup : minor changes needed, documentation still required,
> > >> integration tests seem to pass flawlessly
> > >>> * Cgroups + impersonation: clean up of patches on going, more tests
> and
> > >> more elaborate documentation required. Integration tests not executed
> > yet
> > >>> * Schedule all pending DAG runs in a single scheduler loop: no
> progress
> > >> (**)
> > >>>
> > >>> Cheers!
> > >>> Bolke
> > >>
> > >>
> >
> >
>
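Bolke's starvation concern can be illustrated with a toy comparison of DAG-at-a-time queueing versus a fair round-robin interleave across DAGs. This is purely illustrative and not the scheduler's actual queueing code; the DAG and task names are made up:

```python
import itertools

dag_tasks = {
    "dag_a": ["a1", "a2", "a3"],
    "dag_b": ["b1", "b2"],
    "dag_c": ["c1"],
}

# DAG-at-a-time: dag_a's entire backlog is queued before any task of
# dag_b or dag_c, which is the starvation described above.
sequential = [t for tasks in dag_tasks.values() for t in tasks]

# Round-robin: each DAG gets one slot per pass, so no single DAG
# monopolizes the queue.
fair = [
    t
    for batch in itertools.zip_longest(*dag_tasks.values())
    for t in batch
    if t is not None
]

print(sequential)  # ['a1', 'a2', 'a3', 'b1', 'b2', 'c1']
print(fair)        # ['a1', 'b1', 'c1', 'a2', 'b2', 'a3']
```

A weighted or fair-share variant (as Arthur suggests) would replace the plain round-robin with per-DAG quotas, but the interleaving idea is the same.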


Re: Last minute open meetup speaking slot

2017-01-11 Thread Dan Davydov
Thank you Arthur for stepping up, and my sincere apologies. I just don't
want to get anyone sick. I hope to present the talk at the next meetup.

On Jan 10, 2017 11:22 PM, "George Leslie-Waksman"

wrote:

> Listing updated. Thanks for stepping in on short notice.
>
> On Tue, Jan 10, 2017 at 8:15 PM Arthur Wiedmer 
> wrote:
>
> > George,
> >
> > I'm in. Here is some info if you want to update the meetup page. Thanks
> for
> > the opportunity !
> >
> >
> > The working title is :
> > Using Apache Airflow as a platform for data engineering frameworks.
> >
> > A short abstract would be :
> > Airbnb uses Airflow's ability to dynamically generate pipelines to power
> > frameworks addressing the needs of the data teams. We will explore some
> of
> > Airflow expressiveness via a couple of examples running in production at
> > Airbnb.
> >
> > Best,
> > Arthur.
> >
> > On Jan 10, 2017 8:03 PM, "George Leslie-Waksman"
> >  wrote:
> >
> > 20 minutes. If you want to fill in, that would be great.
> >
> > Regards,
> > --George
> >
> > On Tue, Jan 10, 2017 at 4:56 PM Arthur Wiedmer  >
> > wrote:
> >
> > > George,
> > >
> > > How long are the time slots?
> > >
> > > I might be able to put something together about some of the frameworks
> we
> > > have been using at Airbnb on top of Airflow.
> > >
> > > Best,
> > > Arthur
> > >
> > > On Tue, Jan 10, 2017 at 4:36 PM, George Leslie-Waksman <
> > > geo...@cloverhealth.com.invalid> wrote:
> > >
> > > > One of the speakers for tomorrow's meetup has come down with a cold.
> > > >
> > > > Is there anyone that would like to claim the third time slot?
> > > >
> > > > If not, we'll have extra time for Q&A, updates, and general
> meeting-up.
> > > >
> > > > --George
> > > >
> > >
> >
>


Re: Subsequent Airflow Meetup: 2017/01/11

2017-01-04 Thread Dan Davydov
Title: Operations & Support for Airflow
Brief Description: Several ideas for how to help catch and debug
operational issues with Airflow, as well as how to effectively deal with
common user issues.

On Wed, Jan 4, 2017 at 10:32 AM, George Leslie-Waksman <
geo...@cloverhealth.com.invalid> wrote:

> Kevin, Dan, do you have titles and (maybe) a brief paragraph for the meetup
> description, or should I just make something from the descriptions earlier
> in this thread?
>
> --George
>
> On Tue, Jan 3, 2017 at 4:00 PM Kevin Mandich <kevinmand...@gmail.com>
> wrote:
>
> > Hi George,
> >
> > Confirmed - would like give a talk. Thanks,
> >
> > Kevin Mandich
> >
> > On Tue, Jan 3, 2017 at 5:40 AM, Dan Davydov <dan.davy...@airbnb.com
> > .invalid>
> > wrote:
> >
> > > Confirmed.
> > >
> > > On Sun, Jan 1, 2017 at 9:16 PM, George Leslie-Waksman <
> > > geo...@cloverhealth.com.invalid> wrote:
> > >
> > > > Sorry for the delayed response, end of year and holidays stole my
> > > attention
> > > > for a bit.
> > > >
> > > > With the new year, I was just looking to pick things back up and
> > solicit
> > > > presenters for the meetup. Given we're looking for two more, and
> > > > Dan(Airbnb) and Kevin(Agari) have already expressed interest, I'd be
> > > happy
> > > > to give them the spots.
> > > >
> > > > I hope the delay in my response isn't too much of an inconvenience
> for
> > > > anyone. Dan, Kevin: confirm and I'll add you to the line up.
> > > >
> > > > --George
> > > >
> > > > On Sun, Nov 20, 2016 at 8:44 PM siddharth anand <san...@apache.org>
> > > wrote:
> > > >
> > > > > I suspect Clover Health is extremely busy with all of the benefit
> > > > > enrollments going on right now..
> > > > >
> > > > > George,
> > > > > When you come up for air, it looks like both Dan(Airbnb) and
> > > Kevin(Agari)
> > > > > have talk ideas.
> > > > >
> > > > > -s
> > > > >
> > > > > On Wed, Nov 16, 2016 at 11:50 PM, Dan Davydov <
> > > > > dan.davy...@airbnb.com.invalid> wrote:
> > > > >
> > > > > > Based on chatting with a couple of people today at the Airflow
> > > meet-up
> > > > I
> > > > > > think there has been some demand for an airflow operations talk,
> > > > > > specifically around monitoring/alerting. If there is still room I
> > can
> > > > > give
> > > > > > a talk about this, let me know George.
> > > > > >
> > > > > > On Thu, Nov 10, 2016 at 10:17 AM, siddharth anand <
> > san...@apache.org
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Kevin,
> > > > > > > Here's a link to the 1Q17 meet-up.
> > > > > > >
> > > > > https://www.meetup.com/Bay-Area-Apache-Airflow-
> > > Incubating-Meetup/events/
> > > > > > > 235259523/
> > > > > > >
> > > > > > > Both upcoming meet-ups (next week at WePay and 1Q17 at Clover
> > > Health)
> > > > > can
> > > > > > > be found on http://www.meetup.com/Bay-Area-Apache-Airflow-
> > > > > > > Incubating-Meetup/
> > > > > > >
> > > > > > > -s
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 9, 2016 at 4:24 PM, Kevin Mandich <
> > > > kevinmand...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi George,
> > > > > > > >
> > > > > > > > If there is still room, I'd like to give a talk about how we
> > use
> > > > > > Airflow
> > > > > > > at
> > > > > > > > my company, Agari. We are a data company that is working to
> > > > eliminate
> > > > > > > > inbound, targeted e-mail attacks to our customers
> > > > (spear-phishing). I
> > > > > > am
> > > > > > > > currently working as a data scientist who is also responsible
> > for
> > > > > > > shipping
> > > > > > > > my work to production.
> > > > > > > >
> > > > > > > > We currently use Airflow to build models from our telemetry
> > data
> > > > > which
> > > > > > > are
> > > > > > > > then used for scoring in our near-real-time pipeline. I'd
> like
> > to
> > > > > talk
> > > > > > > > about some of the DAGs we've set up to do this.
> > > > > > > >
> > > > > > > > Please let me know if this sounds reasonable. Thank you,
> > > > > > > >
> > > > > > > > Kevin Mandich
> > > > > > > > Agari Data, Inc.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Oct 31, 2016 at 11:27 PM, George Leslie-Waksman <
> > > > > > > > geo...@cloverhealth.com.invalid> wrote:
> > > > > > > >
> > > > > > > > > I know it's a bit far in advance, but to make sure there's
> > > space
> > > > > (and
> > > > > > > > food
> > > > > > > > > and drink), I've scheduled and booked the subsequent meetup
> > for
> > > > > > January
> > > > > > > > > 11th at Clover Health in SF.
> > > > > > > > >
> > > > > > > > > If anyone wants to volunteer to talk, let me know,
> otherwise
> > > I'll
> > > > > > > > probably
> > > > > > > > > start bugging folks sometime after Thanksgiving and before
> > the
> > > > > > December
> > > > > > > > > holidays.
> > > > > > > > >
> > > > > > > > > --George Leslie-Waksman
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Airflow 1.8.0 Alpha 1

2017-01-04 Thread Dan Davydov
It should be fine to delete them; hopefully no one is depending on them.

On Jan 4, 2017 11:41 AM, "Chris Riccomini" <criccom...@apache.org> wrote:

> @Bolke, thanks for creating the branch! Your plan sounds good to me. Re:
> deleting airbnb branches, I'll leave Dan/Max/Paul/Arthur/etc to comment on
> that. :)
>
> On Wed, Jan 4, 2017 at 7:59 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> > Hi Chris,
> >
> > I have created branch “v1-8-test”. For now I want to keep master and
> > v1-8-test in sync and do not do any cherry picking. The reason for this
> is
> > that we have a lot of catching up to do between 1.7.1.3 and 1.8.0, next
> to
> > that master is (at least to me) in an unknown state. If someone has a
> > better way to do this I am open to suggestions.
> >
> > When we release 1.8.0 I will create branch v-1-8-stable. This should
> track
> > point releases (e.g., 1.8.1, 1.8.2).
> >
> > On a side note I have deleted many old branches. This is what is left:
> >
> >   remotes/apache/airbnb_rb1.7.1
> >   remotes/apache/airbnb_rb1.7.1_2
> >   remotes/apache/airbnb_rb1.7.1_3
> >   remotes/apache/airbnb_rb1.7.1_4
> >   remotes/apache/master
> >   remotes/apache/v1-8-test
> >
> > I would like to remove the Airbnb branches as well. Can I? Maybe leave
> one
> > in as it reflect 1.7.1.3? (Which one?)
> >
> > - Bolke
> >
> >
> > > On 3 Jan 2017, at 20:34, Chris Riccomini <criccom...@apache.org>
> wrote:
> > >
> > > Hey Bolke,
> > >
> > > Thanks for taking this on. I'm definitely up for running stuff in our
> > > environments to verify everything is working.
> > >
> > > Can I ask that you create a 1.8 alpha 1 branch in the git repo? This
> will
> > > make it easier for us to track what changes are getting cherry picked
> > into
> > > the branch, and will also make it easier for users to pip install, if
> > they
> > > want to do so via github.
> > >
> > > Also, yea, when we switch to beta, we need to stop merging anything
> other
> > > than bug fixes into the release branch.
> > >
> > > Cheers,
> > > Chris
> > >
> > > On Tue, Jan 3, 2017 at 10:31 AM, Dan Davydov <dan.davy...@airbnb.com.
> > invalid
> > >> wrote:
> > >
> > >> All very reasonable to me, one reason we may not have hit the bugs in
> > our
> > >> production is because we are running off a different merge base and
> our
> > >> cherries aren't 1-1 with what we are running in production (we still
> > test
> > >> them but we can't run them in production), that being said I don't
> > think I
> > >> authored the commits you are referring to so I don't have full
> context.
> > >>
> > >> On Tue, Jan 3, 2017 at 1:27 PM, Bolke de Bruin <bdbr...@gmail.com>
> > wrote:
> > >>
> > >>> Hi Dan et al,
> > >>>
> > >>> That sounds good to me, however I will be pretty critical of the
> > changes
> > >>> in the scheduler and the cleanliness of the patches. This is due to
> the
> > >>> fact I have been chasing quite some bugs in master that were pretty
> > hard
> > >> to
> > >>> track down even with a debugger at hand. I’m surprised that those
> > didn’t
> > >>> pop up in your production or maybe I am concerned ;-). Anyways, I
> hope
> > >> you
> > >>> understand I might be a bit picky in understanding and needing
> (design)
> > >>> documentation for some of the changes.
> > >>>
> > >>> What I would like to suggest is that for the Alpha versions we still
> > >>> accept “new” features so these PRs can get in, but from Beta we will
> > not
> > >>> accept new features anymore. For new features in the area of the
> > >> scheduler
> > >>> an integration DummyDag should be supplied, so others can test the
> > >>> behaviour. Does this sound ok?
> > >>>
> > >>> My list of open code items for a release looks now like this:
> > >>>
> > >>> Blockers
> > >>> * one_failed not honoured
> > >>> * Alex’s sensor issue
> > >>>
> > >>> New features:
> > >>> * Schedule all pending DAGs in a single loop
> > >>> * Add support for backfill true/false
> > >>> * Impersonation
> >

Re: Airflow 1.8.0 Alpha 1

2017-01-03 Thread Dan Davydov
All very reasonable to me, one reason we may not have hit the bugs in our
production is because we are running off a different merge base and our
cherries aren't 1-1 with what we are running in production (we still test
them but we can't run them in production), that being said I don't think I
authored the commits you are referring to so I don't have full context.

On Tue, Jan 3, 2017 at 1:27 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi Dan et al,
>
> That sounds good to me, however I will be pretty critical of the changes
> in the scheduler and the cleanliness of the patches. This is due to the
> fact I have been chasing quite some bugs in master that were pretty hard to
> track down even with a debugger at hand. I’m surprised that those didn’t
> pop up in your production or maybe I am concerned ;-). Anyways, I hope you
> understand I might be a bit picky in understanding and needing (design)
> documentation for some of the changes.
>
> What I would like to suggest is that for the Alpha versions we still
> accept “new” features so these PRs can get in, but from Beta we will not
> accept new features anymore. For new features in the area of the scheduler
> an integration DummyDag should be supplied, so others can test the
> behaviour. Does this sound ok?
>
> My list of open code items for a release looks now like this:
>
> Blockers
> * one_failed not honoured
> * Alex’s sensor issue
>
> New features:
> * Schedule all pending DAGs in a single loop
> * Add support for backfill true/false
> * Impersonation
> * CGroups
> * Add Cloud Storage updated sensor
>
> Alpha2 I will package tomorrow. Packages are signed now by my apache.org
> key. Please verify and let me know if something is
> off. I’m still waiting for access to the incubating dist repository.
>
> Bolke
>
>
> > On 3 Jan 2017, at 14:38, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >
> > I have also started on this effort, recently Alex Guziel and I have been
> > pushing Airbnb's custom cherries onto master to get Airbnb back onto
> master
> > in order for us to do a release.
> >
> > I think it might make sense to wait for these two commits to get merged
> in
> > since they would be quite nice to have for all Airflow users and seem
> like
> > they will be merged soon:
> > Schedule all pending DAG runs in a single scheduler loop -
> > https://github.com/apache/incubator-airflow/pull/1906 <
> https://github.com/apache/incubator-airflow/pull/1906>
> > Add Support for dag.backfill=(True|False) Option -
> > https://github.com/apache/incubator-airflow/pull/1830 <
> https://github.com/apache/incubator-airflow/pull/1830>
> > Impersonation Support + Cgroups -
> > https://github.com/apache/incubator-airflow/pull/1934
> > (this is kind of important from the Airbnb
> side
> > so that we can help test the new master without having to cherrypick this
> > PR on top of it which would make the testing unreliable for others).
> >
> > If there are PRs that affect the core of Airflow that other committers
> > think are important to merge we could include these too. I can commit to
> > pushing out the Impersonation/Cgroups PR this week pending PR comments.
> > What do you think Bolke?
> >
> > On Tue, Jan 3, 2017 at 4:26 AM, Bolke de Bruin <bdbr...@gmail.com
> <mailto:bdbr...@gmail.com>> wrote:
> >
> >> Hey Alex,
> >>
> >> I have noticed the same, and it is also the reason why we have Alpha
> >> versions. For now I have noticed the following:
> >>
> >> * Tasks can get in limbo between scheduler and executor:
> >> https://github.com/apache/incubator-airflow/pull/1948 <
> https://github.com/apache/incubator-airflow/pull/1948> <
> >> https://github.com/apache/incubator-airflow/pull/1948 <
> https://github.com/apache/incubator-airflow/pull/1948>>
> >> * Try_number not increased due to reset in LocalTaskJob:
> >> https://github.com/apache/incubator-airflow/pull/1969 <
> https://github.com/apache/incubator-airflow/pull/1969> <
> >> https://github.com/apache/incubator-airflow/pull/1969 <
> https://github.com/apache/incubator-airflow/pull/1969>>
> >> * one_failed trigger not executed
> >>
> >> My idea is to move to a Samba style of releases eventually, but for now
> I
> >> would like to get master into a state that we understand and therefore
> not
> >> accept any patches that do not address any bugs.
> >>
> >> If you (or anyone else) can review the above PRs and

Re: Subsequent Airflow Meetup: 2017/01/11

2017-01-03 Thread Dan Davydov
Confirmed.

On Sun, Jan 1, 2017 at 9:16 PM, George Leslie-Waksman <
geo...@cloverhealth.com.invalid> wrote:

> Sorry for the delayed response, end of year and holidays stole my attention
> for a bit.
>
> With the new year, I was just looking to pick things back up and solicit
> presenters for the meetup. Given we're looking for two more, and
> Dan(Airbnb) and Kevin(Agari) have already expressed interest, I'd be happy
> to give them the spots.
>
> I hope the delay in my response isn't too much of an inconvenience for
> anyone. Dan, Kevin: confirm and I'll add you to the line up.
>
> --George
>
> On Sun, Nov 20, 2016 at 8:44 PM siddharth anand <san...@apache.org> wrote:
>
> > I suspect Clover Health is extremely busy with all of the benefit
> > enrollments going on right now..
> >
> > George,
> > When you come up for air, it looks like both Dan(Airbnb) and Kevin(Agari)
> > have talk ideas.
> >
> > -s
> >
> > On Wed, Nov 16, 2016 at 11:50 PM, Dan Davydov <
> > dan.davy...@airbnb.com.invalid> wrote:
> >
> > > Based on chatting with a couple of people today at the Airflow meet-up
> I
> > > think there has been some demand for an airflow operations talk,
> > > specifically around monitoring/alerting. If there is still room I can
> > give
> > > a talk about this, let me know George.
> > >
> > > On Thu, Nov 10, 2016 at 10:17 AM, siddharth anand <san...@apache.org>
> > > wrote:
> > >
> > > > Kevin,
> > > > Here's a link to the 1Q17 meet-up.
> > > >
> > https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/
> > > > 235259523/
> > > >
> > > > Both upcoming meet-ups (next week at WePay and 1Q17 at Clover Health)
> > can
> > > > be found on http://www.meetup.com/Bay-Area-Apache-Airflow-
> > > > Incubating-Meetup/
> > > >
> > > > -s
> > > >
> > > >
> > > > On Wed, Nov 9, 2016 at 4:24 PM, Kevin Mandich <
> kevinmand...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi George,
> > > > >
> > > > > If there is still room, I'd like to give a talk about how we use
> > > Airflow
> > > > at
> > > > > my company, Agari. We are a data company that is working to
> eliminate
> > > > > inbound, targeted e-mail attacks to our customers
> (spear-phishing). I
> > > am
> > > > > currently working as a data scientist who is also responsible for
> > > > shipping
> > > > > my work to production.
> > > > >
> > > > > We currently use Airflow to build models from our telemetry data
> > which
> > > > are
> > > > > then used for scoring in our near-real-time pipeline. I'd like to
> > talk
> > > > > about some of the DAGs we've set up to do this.
> > > > >
> > > > > Please let me know if this sounds reasonable. Thank you,
> > > > >
> > > > > Kevin Mandich
> > > > > Agari Data, Inc.
> > > > >
> > > > >
> > > > > On Mon, Oct 31, 2016 at 11:27 PM, George Leslie-Waksman <
> > > > > geo...@cloverhealth.com.invalid> wrote:
> > > > >
> > > > > > I know it's a bit far in advance, but to make sure there's space
> > (and
> > > > > food
> > > > > > and drink), I've scheduled and booked the subsequent meetup for
> > > January
> > > > > > 11th at Clover Health in SF.
> > > > > >
> > > > > > If anyone wants to volunteer to talk, let me know, otherwise I'll
> > > > > probably
> > > > > > start bugging folks sometime after Thanksgiving and before the
> > > December
> > > > > > holidays.
> > > > > >
> > > > > > --George Leslie-Waksman
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Airflow 1.8.0 Alpha 1

2017-01-03 Thread Dan Davydov
I have also started on this effort, recently Alex Guziel and I have been
pushing Airbnb's custom cherries onto master to get Airbnb back onto master
in order for us to do a release.

I think it might make sense to wait for these two commits to get merged in
since they would be quite nice to have for all Airflow users and seem like
they will be merged soon:
Schedule all pending DAG runs in a single scheduler loop -
https://github.com/apache/incubator-airflow/pull/1906
Add Support for dag.backfill=(True|False) Option -
https://github.com/apache/incubator-airflow/pull/1830
Impersonation Support + Cgroups -
https://github.com/apache/incubator-airflow/pull/1934 (this is kind of important from the Airbnb side
so that we can help test the new master without having to cherrypick this
PR on top of it which would make the testing unreliable for others).

If there are PRs that affect the core of Airflow that other committers
think are important to merge we could include these too. I can commit to
pushing out the Impersonation/Cgroups PR this week pending PR comments.
What do you think Bolke?

On Tue, Jan 3, 2017 at 4:26 AM, Bolke de Bruin  wrote:

> Hey Alex,
>
> I have noticed the same, and it is also the reason why we have Alpha
> versions. For now I have noticed the following:
>
> * Tasks can get in limbo between scheduler and executor:
> https://github.com/apache/incubator-airflow/pull/1948 <
> https://github.com/apache/incubator-airflow/pull/1948>
> * Try_number not increased due to reset in LocalTaskJob:
> https://github.com/apache/incubator-airflow/pull/1969 <
> https://github.com/apache/incubator-airflow/pull/1969>
> * one_failed trigger not executed
>
> My idea is to move to a Samba style of releases eventually, but for now I
> would like to get master into a state that we understand and therefore not
> accept any patches that do not address any bugs.
>
> If you (or anyone else) can review the above PRs and add your own as well
> then I can create another Alpha version. I’ll be on gitter as much as I can
> so we can speed up if needed.
>
> - Bolke
>
> > On 3 Jan 2017, at 08:51, Alex Van Boxel  wrote:
> >
> > Hey Bolke,
> >
> > thanks for getting this moving. But I already have some blockers, since I
> > moved up master to this release (moved from end November to now)
> stability
> > has gone down (certainly on Celery). I'm trying to identify the core
> > problems and see if I can fix them.
> >
> > On Sat, Dec 31, 2016 at 9:52 PM Bolke de Bruin  > wrote:
> >
> > Dear All,
> >
> > On the verge of the New Year, I decided to be a little bit cheeky and to
> > make available an Airflow 1.8.0 Alpha 1. We have been talking about it
> for
> > a long time now and by doing this I wanted to bootstrap the process. It
> should
> > by no means be considered an Apache release yet. This is for testing
> > purposes in the dev community around Airflow, nothing else.
> >
> > The build is exactly the same as the state of master (git 410736d) plus
> the
> > change to version “1.8.0.alpha1” in version.py.
> >
> > I am dedicating quite some time next week and beyond to get a release
> out.
> > Hopefully we can get some help with testing, changelog etc. To make this
> > possible I would like to propose a freeze to adding new features for at
> > least two weeks - say until Jan 15.
> >
> > You can find the tar here: http://people.apache.org/~bolke/ .
> It isn’t signed. Following versions
> It isn’t signed. Following versions
> > will be. SHA is available.
> >
> > Lastly, Alpha 1 does not have the fix for retries yet. So we will get an
> > Alpha 2 :-). @Max / @Dan / @Paul: a potential fix is in
> > https://github.com/apache/incubator-airflow/pull/1948 <
> https://github.com/apache/incubator-airflow/pull/1948> <
> > https://github.com/apache/incubator-airflow/pull/1948 <
> https://github.com/apache/incubator-airflow/pull/1948>> , but your
> feedback
> > is required as it is entrenched in new processing code that you are
> running
> > in production afaik - so I wonder what happens in your fork.
> >
> > Happy New Year!
> >
> > Bolke
> >
> >
> >
> > --
> >  _/
> > _/ Alex Van Boxel
>
>


Re: Subsequent Airflow Meetup: 2017/01/11

2016-11-16 Thread Dan Davydov
Based on chatting with a couple of people today at the Airflow meet-up I
think there has been some demand for an airflow operations talk,
specifically around monitoring/alerting. If there is still room I can give
a talk about this, let me know George.

On Thu, Nov 10, 2016 at 10:17 AM, siddharth anand  wrote:

> Kevin,
> Here's a link to the 1Q17 meet-up.
> https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/
> 235259523/
>
> Both upcoming meet-ups (next week at WePay and 1Q17 at Clover Health) can
> be found on http://www.meetup.com/Bay-Area-Apache-Airflow-
> Incubating-Meetup/
>
> -s
>
>
> On Wed, Nov 9, 2016 at 4:24 PM, Kevin Mandich 
> wrote:
>
> > Hi George,
> >
> > If there is still room, I'd like to give a talk about how we use Airflow
> at
> > my company, Agari. We are a data company that is working to eliminate
> > inbound, targeted e-mail attacks to our customers (spear-phishing). I am
> > currently working as a data scientist who is also responsible for
> shipping
> > my work to production.
> >
> > We currently use Airflow to build models from our telemetry data which
> are
> > then used for scoring in our near-real-time pipeline. I'd like to talk
> > about some of the DAGs we've set up to do this.
> >
> > Please let me know if this sounds reasonable. Thank you,
> >
> > Kevin Mandich
> > Agari Data, Inc.
> >
> >
> > On Mon, Oct 31, 2016 at 11:27 PM, George Leslie-Waksman <
> > geo...@cloverhealth.com.invalid> wrote:
> >
> > > I know it's a bit far in advance, but to make sure there's space (and
> > food
> > > and drink), I've scheduled and booked the subsequent meetup for January
> > > 11th at Clover Health in SF.
> > >
> > > If anyone wants to volunteer to talk, let me know, otherwise I'll
> > probably
> > > start bugging folks sometime after Thanksgiving and before the December
> > > holidays.
> > >
> > > --George Leslie-Waksman
> > >
> >
>


Re: Shout out!

2016-08-12 Thread Dan Davydov
+1

On Aug 12, 2016 8:02 AM, "Chris Riccomini"  wrote:

> Same. It's awesome.
>
> On Thu, Aug 11, 2016 at 7:28 PM, siddharth anand 
> wrote:
> > FYI!
> > Just wanted to give a special shout-out for jlowin for writing a great
> > merge tool for committers. Thx to this tool, merging your PR is super
> easy.
> >
> > -s
>


Re: Speeding up the scheduler - request for comments

2016-06-03 Thread Dan Davydov
Scheduler loop times are definitely a concern (at least for Airbnb), and +1
for option 2 as well if it can be implemented correctly. What is important
for me is that we should always be able to easily tell which of the
dependencies are met and which aren't in the event-based model.
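Bolke's option 2 could be sketched roughly as follows. This is a hypothetical, in-memory illustration only — the `Task` class, the state strings, and the `build_dag` helper are invented for this sketch and are not Airflow's actual scheduler code. The idea is that each task keeps a set of unmet upstream dependencies that shrinks as upstream tasks succeed, so the scheduler can always list exactly which dependencies are unmet (the requirement above) without re-running database aggregations on every loop:

```python
class Task:
    """Minimal task-instance stand-in for the event-based model."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.state = "none"
        self.unmet_upstream = set()  # upstream task_ids not yet succeeded
        self.downstream = []         # task_ids to notify on success

    def set_state(self, state, dag):
        # On a state change, push a notification to downstream tasks
        # instead of having them poll/aggregate upstream states.
        self.state = state
        if state == "success":
            for child_id in self.downstream:
                dag[child_id].upstream_succeeded(self.task_id)

    def upstream_succeeded(self, upstream_id):
        self.unmet_upstream.discard(upstream_id)

    def dependencies_met(self):
        # O(1) check: no DB aggregation needed, and unmet_upstream
        # tells us exactly which dependencies are still outstanding.
        return not self.unmet_upstream


def build_dag(edges):
    """edges: iterable of (upstream_id, downstream_id) pairs."""
    dag = {}
    for up, down in edges:
        dag.setdefault(up, Task(up))
        dag.setdefault(down, Task(down))
        dag[down].unmet_upstream.add(up)
        dag[up].downstream.append(down)
    return dag


dag = build_dag([("extract", "transform"), ("transform", "load")])
dag["extract"].set_state("success", dag)
print(dag["transform"].dependencies_met())  # True
print(sorted(dag["load"].unmet_upstream))   # ['transform']
```

A periodic integrity check (re-deriving `unmet_upstream` from the database) could guard against the consistency drift Bolke mentions.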

On Fri, Jun 3, 2016 at 5:53 PM, Chris Riccomini 
wrote:

> Hey Bolke,
>
> > Are scheduler loop times a concern at all?
>
> Yes, I strongly believe that they are. Especially as we add more
> DAGs/tasks.
>
> I am not a fan of (1). Caching is just going to create cache consistency
> issues, and be really annoying to manage, IMO.
>
> I agree that (2) seems more appealing. I can't comment on the feasibility
> of it, as I'm not well acquainted enough with the scheduler yet.
>
> Cheers,
> Chris
>
> On Fri, Jun 3, 2016 at 2:26 PM, Bolke de Bruin  wrote:
>
> > Hi,
> >
> > I am looking at speeding up the scheduler. Currently loop times increase
> > with the amount of tasks in a dag. This is due to
> > TaskInstance.are_dependencies_met executing several aggregation functions
> on
> > the database. These calls are expensive: between 0.05-0.15s per task and
> > for every scheduler loop this gets called twice. This call is where the
> > scheduler spends around 90% of its time when evaluating dags and is the
> > reason that people who have a large number of tasks per DAG see quite
> > large loop times (north of 600s).
> >
> > I see 2 options to optimize the loop without going to a multiprocessing
> > approach which will just put the problem down the line (ie. the db or
> when
> > you don’t have enough cores anymore).
> >
> > 1. Cache the call to TI.are_dependencies_met by either caching it in
> > something like memcache or removing the need for the double call
> > (update_state and process_dag both make the call to
> > TI.are_dependencies_met). This would more or less cut the time in half.
> >
> > 2. Notify the downstream tasks of a state change of an upstream task. This
> > would remove the need for the aggregation as the task would just ‘know’.
> It
> > is a bit harder to implement correctly as you need to make sure you keep
> > being in a consistent state. Obviously you could still run an integrity
> > check once in a while. This option would make the aggregation event based
> > and significantly reduce the time spent here to around 1-5% of the
> current
> > scheduler. There is a slight overhead added at a state change of the
> > TaskInstance (managed by the TaskInstance itself).
> >
> > What do you think? My preferred option is #2. Am i missing any other
> > options? Are scheduler loop times a concern at all?
> >
> > Thanks
> > Bolke
> >
> >
> >
>


Re: Voting Changes for Scheduler-related PRs/Commits

2016-05-12 Thread Dan Davydov
@Jakob
What if we made it more generic, e.g. a +1 from any committer from a company
that is running at a certain scale (e.g. at least X workers) and willing to
help stage releases in their prods until we have more comprehensive test
coverage/an open source staging environment? This is in Airflow's best
interests as otherwise stability will suffer.

On Thu, May 12, 2016 at 1:44 PM, Chris Riccomini 
wrote:

> @Sid, perhaps defining a cool-off window before a scheduler change can be
> committed. That way, everyone that cares can have a look at it? Also,
> having more than one +1 seems OK with me for scheduler changes. We will
> have to decide what "scheduler change" means, though.
>
> On Thu, May 12, 2016 at 1:39 PM, Jakob Homan  wrote:
>
> > Hey Sid-
> >Thanks for the discussion.  It's a good chance to the new
> > contributors to get more experience with the ASF.
> >
> >Unfortunately, what you propose is not possible in ASF.  As a
> > meritocracy, ASF does not recognize individual's employers (or lack
> > thereof).  Merit is earned by the individual and follows them as they
> > move from organization to organization.  This is true even for
> > podlings.  Employees of certain organizations are not given extra
> > power over a project or vote due to their relationship with the
> > employer.
> >
> >ASF does recognize that at times people will be representing their
> > employer (with my $EMPLOYER hat on, is a common way of expressing
> > this), but expects that everyone is acting in the best interest of the
> > project.
> >
> > -Jakob
> >
> > On 12 May 2016 at 12:58, Siddharth Anand  wrote:
> > > Hi Folks! As many of you know, Apache Airflow (incubating) came from
> > Airbnb, where it currently still represents the largest Airflow
> deployment.
> > Airflow entered the Apache Incubator shortly over a month ago but still
> > depends on Airbnb's production deployment to vet its release candidates.
> As
> > Airflow's adoption increases, we expect to leverage multiple companies in
> > conjunction with Apache Infra resources to vet some of the more
> performance
> > critical pieces of the code base (e.g. scheduler). We're not there yet.
> > > So, for future commits and PRs involving the scheduler (and possibly
> > other components, e.g. executors), I propose a 2-vote system: at least 1
> > vote from an Airbnb committer and at least 1 vote from a non-Airbnb
> > committer, separate from the PR author. This will more readily stabilize
> > the Airbnb production system that we rely on to vet and cut releases,
> > speeding up our release cycle.
> > > Please share your thoughts on the matter along with a vote for/against.
> > > -s
> >
>


Re: 1.7.1 release status

2016-04-28 Thread Dan Davydov
Definitely, here were the issues we hit:
- airbnb/airflow#1365 occurred
- Webservers/scheduler were timing out and stuck in restart cycles because
of increased time spent parsing DAGs (airbnb/airflow#1213/files)
- Failed tasks that ran after the upgrade and the revert (after we reverted
the upgrade) were unable to be cleared (but running the tasks through the
UI worked without clearing them)
- The way log files were stored on S3 was changed (airflow now requires a
connection to be set up), which broke log storage
- Some DAGs were broken (unable to be parsed) because the open-source
package reorganization changed import paths (the utils refactor commit)

On Thu, Apr 28, 2016 at 12:17 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Dan,
>
> Are you able to share some of the bugs you have been hitting and connected
> commits?
>
> We could at the very least learn from them and maybe even improve testing.
>
> Bolke
>
>
> > Op 28 apr. 2016, om 06:51 heeft Dan Davydov
> <dan.davy...@airbnb.com.INVALID> het volgende geschreven:
> >
> > All of the blockers were fixed as of yesterday (there was some issue that
> > Jeremiah was looking at with the last release candidate which I think is
> > fixed but I'm not sure). I started staging the airbnb_1.7.1rc3 tag
> earlier
> > today, so as long as metrics look OK and the 1.7.1rc2 issues seem
> resolved
> > tomorrow I will release internally either tomorrow or Monday (we try to
> > avoid releases on Friday). If there aren't any issues we can push the
> 1.7.1
> > tag on Monday/Tuesday.
> >
> > @Sid
> > I think we were originally aiming to deploy internally once every two
> weeks
> > but we decided to do it once a month in the end. I'm not too sure about
> > that so Max can comment there.
> >
> > We have been running 1.7.0 in production for about a month now and it is
> > stable.
> >
> > I think what really slowed down this release cycle is some commits that
> > caused severe bugs that we decided to roll-forward with instead of
> rolling
> > back. We can potentially try reverting these commits next time while the
> > fixes are applied for the next version, although this is not always
> trivial
> > to do.
> >
> > On Wed, Apr 27, 2016 at 9:31 PM, Siddharth Anand <
> > siddharthan...@yahoo.com.invalid> wrote:
> >
> >> Btw, is anyone of the committers running 1.7.0 or later in any staging
> or
> >> production env? I have to say that given that 1.6.2 was the most stable
> >> release and is 4 or more months old does not say much for our release
> >> cadence or process. What's our plan for 1.7.1?
> >>
> >> Sent from Sid's iPhone
> >>
> >>> On Apr 27, 2016, at 9:05 PM, Chris Riccomini <criccom...@apache.org>
> >> wrote:
> >>>
> >>> Hey all,
> >>>
> >>> I just wanted to check in on the 1.7.1 release status. I know there
> have
> >>> been some major-ish bugs, as well as several people doing tests. Should
> >> we
> >>> create a 1.7.1 release JIRA, and track outstanding issues there?
> >>>
> >>> Cheers,
> >>> Chris
> >>
> >>
>
>