Re: Airflow 2.0

2016-11-21 Thread twinkle
Hi,

Just as we have an admin panel where we can configure the database connections
and query them, similar information should be provided for the chosen
executor backend.

For example, with the Airflow + RabbitMQ + Celery backend, if RabbitMQ goes
down, it keeps showing the message that the task has been submitted to the
queue, which turns out to be false. There should be some area in the UI to view
the state of the executor backend.
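
As a rough illustration of the kind of check such a UI area could surface, here
is a minimal sketch that only tests whether the broker's TCP port accepts
connections. The function name `broker_reachable` and the host are assumptions
made up for this example (5672 is RabbitMQ's default AMQP port); none of this
is existing Airflow or Celery code, and it does not speak AMQP.

```python
import socket


def broker_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to the broker can be opened.

    This only checks that the port accepts connections; it does not
    verify that the broker is healthy at the protocol level.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out: treat the broker as down.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Hypothetical local RabbitMQ listening on its default AMQP port.
    status = "up" if broker_reachable("localhost", 5672) else "down"
    print("broker is " + status)
```

A real health view would probably want a protocol-level check as well, since an
open port does not guarantee that queued tasks will be delivered.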

Regards,
Twinkle



On Sat, Nov 19, 2016 at 11:34 PM, siddharth anand  wrote:

> I feel a lot of changes happen to areas of the code shared by both
> scheduler and webserver, such as models. Any time we have changes to these
> shared areas, we will need to release the scheduler as well.
>
> Also, it's not clear to me (out of ignorance perhaps) how the above would
> speed up releasing.
>
> FYI, on the topic of fixing bugs, nice fix just popped up (and got merged)
> from Vijay Bhat:
> https://github.com/apache/incubator-airflow/pull/1892
>
> -s
>
> On Fri, Nov 18, 2016 at 5:58 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > Totally agree on all your points Sid.
> >
> > My feeling is that at the moment the most critical thing for the project is
> > to get a release out and get to a steady pace of high quality releases.
> >
> > Somehow breaking down the package seems to me like it would really help with
> > the release process. Maybe an idea is to make 1.9 a release made of smaller
> > packages where airflow = `(airflow-core + airflow-operators +
> > airflow-webserver)` or something like that. I'm thinking it would allow us to
> > release often on airflow-operators & airflow-webserver.
> >
> > Max
> >
> > On Fri, Nov 18, 2016 at 5:34 PM, siddharth anand 
> > wrote:
> >
> > > David
> > > https://issues.apache.org/jira/browse/AIRFLOW-558 (i.e.
> > > https://github.com/apache/incubator-airflow/pull/1830 ) is on my plate. We
> > > have already gone through many rounds of reviews, testing, and fixes with the
> > > submitter and it does not need to wait till 2.0. We should be able to merge it
> > > soon. BTW, you are encouraged to vote on these PRs so maintainers can
> > > prioritize their time.
> > >
> > > Max,
> > >
> > > Thanks for kicking off this thread.
> > >
> > > Regarding 2.0, we've associated feature deprecation and non-backward
> > > compatible changes with 2.0. Some of this work might be pretty
> > > earth-shaking to Airflow users. IMHO, changes that increase user pain at
> > > upgrade time need to be carefully balanced against value.
> > >
> > > Watching both Gitter and the email list, there are a variety of stumbling
> > > points (for new users) that many of us who have been using the product for
> > > 1-2 years have forgotten. A fair number of people still mention that
> > > getting Airflow up and running is no simple task - i.e. Alex mentioned this
> > > in his talk at the last meet-up. The recent BlueYonder talk referenced
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls
> > >
> > > Though we may be numerically near 2.0 in terms of release numbers, I'd
> > > prefer to prioritize a few things higher than releasing 2.0. We need to
> > > build and exercise a few necessary muscles: timely PR processing & timely
> > > Apache releases (i.e. quarterly). Beyond that, I'd like to prioritize the
> > > "common pitfall" problems to ease on-boarding. Some of these don't need to
> > > wait for a major release. The ones that do can be developed on a separate
> > > 2.0 branch and baked, reviewed, and voted on by the community before we
> > > consider dropping it into master.
> > >
> > > That way, we can keep master healthy to support the increasing rate of
> > > community-submitted PRs that we are seeing and reduce the cycle time of
> > > cutting stable releases, all while working on big-bang changes for 2.0
> > > independently.
> > >
> > > Just my $0.02
> > > -s
> > >
> > > On Fri, Nov 18, 2016 at 3:57 PM, Chris Riccomini <criccom...@apache.org>
> > > wrote:
> > >
> > > > > RIP out the charting application and the data profiler
> > > >
> > > > Yes please! +1
> > > >
> > > > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin
> > > >  wrote:
> > > > > Another point that may be controversial for Airflow 2.0: RIP out the
> > > > > charting application and the data profiler. Even though it's nice to have
> > > > > it there, it's just out of scope and has major security
> > > > > issues/implications.
> > > > >
> > > > > I'm not sure how popular it actually is. We may need to run a survey at
> > > > > some point around this kind of questions.
> > > > >
> > > > > Max
> > > > >
> > > > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> > > > > maximebeauche...@gmail.com> wrote:
> > > > >
> > > > >> Using FAB's Model, we get pretty much all of that (REST API, auth/perms,
> > > > >> CRUD) for free:
> > > > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> 

RE: Airflow 2.0

2016-11-21 Thread Ryabchuk, Pavlo
-1. We rely heavily on data profiling as a pipeline health monitoring tool.

-----Original Message-----
From: Chris Riccomini [mailto:criccom...@apache.org] 
Sent: Saturday, November 19, 2016 1:57 AM
To: dev@airflow.incubator.apache.org
Subject: Re: Airflow 2.0

> RIP out the charting application and the data profiler

Yes please! +1

On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin  
wrote:
> Another point that may be controversial for Airflow 2.0: RIP out the 
> charting application and the data profiler. Even though it's nice to 
> have it there, it's just out of scope and has major security 
> issues/implications.
>
> I'm not sure how popular it actually is. We may need to run a survey 
> at some point around this kind of questions.
>
> Max
>
> On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin < 
> maximebeauche...@gmail.com> wrote:
>
>> Using FAB's Model, we get pretty much all of that (REST API, auth/perms,
>> CRUD) for free:
>> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
>>
>> I'm pretty intimate with FAB since I use it (and contributed to it) 
>> for Superset/Caravel.
>>
>> All that's needed is to derive FAB's model class instead of 
>> SqlAlchemy's model class (which FAB's model wraps and adds 
>> functionality to and is 100% compatible AFAICT).
>>
>> Max
>>
>> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini wrote:
>>
>>> > It may be doable to run this as a different package `airflow-webserver`, an
>>> > alternate UI at first, and to eventually rip out the old UI off of the main
>>> > package.
>>>
>>> This is the same strategy that I was thinking of for AIRFLOW-85. You 
>>> can build the new UI in parallel, and then delete the old one later. 
>>> I really think that a REST interface should be a pre-req to any 
>>> large/new UI changes, though. Getting unified so that everything is 
>>> driven through REST will be a big win.
>>>
>>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin 
>>>  wrote:
>>> > A multi-tenant UI with composable roles on top of granular permissions.
>>> >
>>> > Migrating from Flask-Admin to Flask App Builder would be an
>>> > easy-ish win (since they're both Flask). FAB provides a good
>>> > authentication and permission model that ships out-of-the-box with
>>> > a REST API. It suffices to define FAB models (derivative of
>>> > SQLAlchemy's model) and you get a set of
>>> > perms for the model (can_show, can_list, can_add, can_change, can_delete,
>>> > ...) and a set of CRUD REST endpoints. It would also allow us to
>>> > rip the authentication backend code out of Airflow and rely on FAB
>>> > for that.
>>> > Also every single view gets permissions auto-created for it, and there are
>>> > easy ways to define row-level type filters based on user permissions.
>>> >
>>> > It may be doable to run this as a different package `airflow-webserver`, an
>>> > alternate UI at first, and to eventually rip out the old UI off of the main
>>> > package.
>>> >
>>> > https://flask-appbuilder.readthedocs.io/en/latest/
>>> >
>>> > I'd love to carve some time and lead this.
>>> >
>>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini wrote:
>>> >
>>> >> Full-fledged REST API (that the UI also uses) would be great in 2.0.
>>> >>
>>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley  wrote:
>>> >> > Hi All,
>>> >> >
>>> >> > We have been using Airflow heavily for the last couple months and it’s
>>> >> > been great so far. Here are a few things we’d like to see prioritized in
>>> >> > 2.0.
>>> >> >
>>> >> > 1) Role based access to DAGs:
>>> >> > We would like to see better role based access through the UI. There’s a
>>> >> > related ticket out there but it hasn’t seen any action in a few months
>>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
>>> >> >
>>> >> > We use a templating system to create/deploy DAGs dynamically based on
>>> >> > some directory/file structure. This allows analysts to quickly deploy and
>>> >> > schedule their ETL code without having to interact with the
>>> >> > Airflow installation directly. It would be great if those same
>>> >> > analy

Re: Airflow 2.0

2016-11-21 Thread Sergei Iakhnin
I've been running Airflow on 1500 cores in the context of scientific
workflows for the past year and a half. Features that would be important to
me for 2.0:

- Add FK to dag_run to the task_instance table on Postgres so that
task_instances can be uniquely attributed to dag runs.
- Ensure scheduler can be run continuously without needing restarts. Right
now it gets into some ill-determined bad state forcing me to restart it
every 20 minutes.
- Ensure scheduler can handle tens of thousands of active workflows. Right
now this results in extremely long scheduling times and inconsistent
scheduling even at 2 thousand active workflows.
- Add more flexible task scheduling prioritization. The default
prioritization is the opposite of the behaviour I want. I would prefer that
downstream tasks always have higher priority than upstream tasks to cause
entire workflows to tend to complete sooner, rather than scheduling tasks
from other workflows. Having a few scheduling prioritization strategies
would be beneficial here.
- Provide better support for manually-triggered DAGs on the UI i.e. by
showing them as queued.
- Provide some resource management capabilities via something like slots
that can be defined on workers and occupied by tasks. Using celery's
concurrency parameter at the airflow server level is too coarse-grained as
it forces all workers to be the same, and does not allow proper resource
management when different workflow tasks have different resource
requirements thus hurting utilization (a worker could run 8 parallel tasks
with small memory footprint, but only 1 task with large memory footprint
for instance).
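
To make the slot idea concrete, here is a minimal sketch under stated
assumptions: `Worker`, `try_assign`, `release` and `schedule` are hypothetical
names invented for this example, not existing Airflow or Celery APIs, and the
placement policy is a simple greedy first-fit.

```python
class Worker:
    """A worker that advertises a fixed number of resource slots."""

    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots
        self.running = []  # list of (task_id, slots_held)

    def try_assign(self, task_id, slots_needed):
        """Occupy slots for a task if enough are free; report success."""
        if slots_needed > self.free_slots:
            return False
        self.free_slots -= slots_needed
        self.running.append((task_id, slots_needed))
        return True

    def release(self, task_id):
        """Free the slots held by a finished task."""
        for entry in self.running:
            if entry[0] == task_id:
                self.running.remove(entry)
                self.free_slots += entry[1]
                return
        raise KeyError(task_id)


def schedule(tasks, workers):
    """Greedy first-fit: place each task on the first worker with room.

    tasks is a list of (task_id, slots_needed). Tasks that fit on no
    worker are returned in the pending list.
    """
    placed, pending = {}, []
    for task_id, slots_needed in tasks:
        for worker in workers:
            if worker.try_assign(task_id, slots_needed):
                placed[task_id] = worker.name
                break
        else:
            pending.append(task_id)
    return placed, pending


if __name__ == "__main__":
    # One 8-slot worker: a large-memory task takes all 8 slots,
    # so the small task has to wait until it is released.
    worker = Worker("w1", 8)
    print(schedule([("big", 8), ("small", 1)], [worker]))
```

With this model the same 8-slot worker can run either eight 1-slot tasks or a
single 8-slot task, which is the utilization case described above.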

With best regards,

Sergei.



Re: Airflow 2.0

2016-11-21 Thread David Batista
A small request, which might be handy.

Having the possibility to select multiple tasks and mark them as
Success/Clear/etc.

Allow the UI to select individual tasks (i.e., inside the Tree View) and
then have a button to mark them as Success/Clear/etc.


Dynamic creation of DAG

2016-11-21 Thread Deepak Kumar Malladi
Hi,

I want to dynamically create a DAG at run time. I tried the snippet given
in the documentation, but it didn't work for me.

Any pointers on how to trigger DAGs which aren't actually present in the DAG
folder but are created through code execution (dynamically created)?


Thanks & Regards,
Deepak


Re: Airflow 2.0

2016-11-21 Thread Chris Riccomini
> Ensure scheduler can be run continuously without needing restarts

+1


Re: Airflow 2.0

2016-11-21 Thread Arunprasad Venkatraman
> Add FK to dag_run to the task_instance table on Postgres so that
> task_instances can be uniquely attributed to dag runs.
> Ensure scheduler can be run continuously without needing restarts.
> Ensure scheduler can handle tens of thousands of active workflows

+1

We are planning to run around 40,000 tasks a day using Airflow, and some of
them are critical for giving quick feedback to developers. Currently, using the
execution date to uniquely identify tasks does not work for us, since we
mainly trigger DAGs (instead of running them on a schedule), and we have
collided at 1-second granularity on several occasions. Having a task UUID, or
associating dag_run with the task_instance table as suggested by Sergei, would
help mitigate this issue for us and would make it easy for us to update task
results too. We would be happy to start working on this if it makes sense.
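
To make the collision concrete, here is a minimal sketch. The key functions and
`new_run_id` are hypothetical names for illustration and do not reflect
Airflow's actual schema; the UUID-based run id is just one possible identifier.

```python
import uuid
from datetime import datetime


def key_by_execution_date(dag_id, task_id, execution_date):
    """Identify a task instance by execution date truncated to whole seconds."""
    return (dag_id, task_id, execution_date.replace(microsecond=0))


def key_by_run_id(dag_id, task_id, run_id):
    """Identify a task instance by the id of the dag run that owns it."""
    return (dag_id, task_id, run_id)


def new_run_id():
    """Hypothetical run identifier: a random UUID generated per trigger."""
    return uuid.uuid4().hex


# Two manual triggers of the same DAG that land within the same second.
first = datetime(2016, 11, 21, 9, 57, 3, 120000)
second = datetime(2016, 11, 21, 9, 57, 3, 870000)

# With second granularity both triggers map to the same key.
collide = (key_by_execution_date("build", "compile", first)
           == key_by_execution_date("build", "compile", second))

# With a per-run id the two triggers stay distinguishable.
distinct = (key_by_run_id("build", "compile", new_run_id())
            != key_by_run_id("build", "compile", new_run_id()))

print(collide, distinct)
```

The first comparison shows the two triggers collapsing into one key, while the
second shows that keying on a run id keeps them apart.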

Also, we are wondering whether any work has been done in the community to
support multiple schedulers (or alternatives to MySQL/Postgres), because a
single scheduler does not scale well for us and we sometimes see slowdowns of
up to a couple of minutes when there are several pending tasks.

Thanks




Re: Airflow 2.0

2016-11-21 Thread Gerard Toonstra
+1 on driving everything through a REST API including the UI. This unifies
the access to the scheduler and increases stability.

Consider running a very small webserver (node.js + socket.io), which
enables airflow to communicate scheduler events as they happen
to anything that connects to it through socket.io, including browsers. This
way, the scheduler can forward any task state changes to the UI
so that explicit refreshes are no longer needed. It is possible to make
this optional functionality. If the nodejs server is not there, it won't
affect the functionality, because standard REST still gets the latest state.



> >> > To: dev@airflow.incubator.apache.org
> >> > Subject: Re: Airflow 2.0
> >> >
> >> > > RIP out the charting application and the data profiler
> >> >
> >> > Yes please! +1
> >> >
> >> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> >> > maximebeauche...@gmail.com> wrote:
> >> > > Another point that may be controversial for Airflow 2.0: RIP out the
> >> > > charting application and the data profiler. Even though it's nice to
> >> > > have it there, it's just out of scope and has major security
> >> > issues/implications.
> >> > >
> >> > > I'm not sure how popular it actually is. We may need to run a survey
> >> > > at some point around this kind of questions.
> >> > >
> >> > > Max
> >> > >
> >> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> >> > > maximebeauche...@gmail.com> wrote:
> >> > >
> >> > >> Using FAB's Model, we get pretty much all of that (REST API,
> >> > >> auth/perms,
> >> > >> CRUD) for free:
> >> > >> https://emea01.safelinks.protection.outlook.com/?url=
> http%3A%2F%2Ffla
> >> > >> sk-appbuilder.readthedocs.io%2Fen%2Flatest%2F&data=01%7C01%
> 7C%7C0064f
> >> > >> 74fd0d940ab732808d4100e9c3f%7C6d4034cd72254f72b85391feaea6
> 4919%7C1&sd
> >> > >> ata=uIJcFlm02IJ0Yo2cYLxAJZlkbCF2ZMk6dR%2FkhazZwVE%3D&reserved=0
> >> > >> quickhowto.html?highlight=rest#exposed-methods
> >> > >>
> >> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
> >> > >> for Superset/Caravel.
> >> > >>
> >> > >> All that's needed is to derive FAB's model class instead of
> >> > >> SqlAlchemy's model class (which FAB's model wraps and adds
> >> > >> functionality to and is 100% compa

Re: Airflow 2.0

2016-11-21 Thread siddharth anand
Sergei,
These are some great ideas -- I would classify at least half of them as
pain points.

Folks!
I suggest people (on the dev list) keep feeding this thread at least for
the next 2 days. I can then float a survey based on these ideas and give
the community a chance to vote so we can prioritize the wish list.

-s

On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin  wrote:

> I've been running Airflow on 1500 cores in the context of scientific
> workflows for the past year and a half. Features that would be important to
> me for 2.0:
>
> - Add FK to dag_run to the task_instance table on Postgres so that
> task_instances can be uniquely attributed to dag runs.
> - Ensure scheduler can be run continuously without needing restarts. Right
> now it gets into some ill-determined bad state forcing me to restart it
> every 20 minutes.
> - Ensure scheduler can handle tens of thousands of active workflows. Right
> now this results in extremely long scheduling times and inconsistent
> scheduling even at 2 thousand active workflows.
> - Add more flexible task scheduling prioritization. The default
> prioritization is the opposite of the behaviour I want. I would prefer that
> downstream tasks always have higher priority than upstream tasks to cause
> entire workflows to tend to complete sooner, rather than scheduling tasks
> from other workflows. Having a few scheduling prioritization strategies
> would be beneficial here.
> - Provide better support for manually-triggered DAGs on the UI i.e. by
> showing them as queued.
> - Provide some resource management capabilities via something like slots
> that can be defined on workers and occupied by tasks. Using celery's
> concurrency parameter at the airflow server level is too coarse-grained as
> it forces all workers to be the same, and does not allow proper resource
> management when different workflow tasks have different resource
> requirements thus hurting utilization (a worker could run 8 parallel tasks
> with small memory footprint, but only 1 task with large memory footprint
> for instance).
>
> With best regards,
>
> Sergei.
>
>
> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
> ext-pavlo.ryabc...@here.com>
> wrote:
>
> > -1. We extremely rely on data profiling, as a pipeline health monitoring
> > tool
> >
> > -Original Message-
> > From: Chris Riccomini [mailto:criccom...@apache.org]
> > Sent: Saturday, November 19, 2016 1:57 AM
> > To: dev@airflow.incubator.apache.org
> > Subject: Re: Airflow 2.0
> >
> > > RIP out the charting application and the data profiler
> >
> > Yes please! +1
> >
> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> > > Another point that may be controversial for Airflow 2.0: RIP out the
> > > charting application and the data profiler. Even though it's nice to
> > > have it there, it's just out of scope and has major security
> > issues/implications.
> > >
> > > I'm not sure how popular it actually is. We may need to run a survey
> > > at some point around this kind of questions.
> > >
> > > Max
> > >
> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > >> Using FAB's Model, we get pretty much all of that (REST API,
> > >> auth/perms,
> > >> CRUD) for free:
> > >> https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Ffla
> > >> sk-appbuilder.readthedocs.io%2Fen%2Flatest%2F&data=01%7C01%7C%7C0064f
> > >> 74fd0d940ab732808d4100e9c3f%7C6d4034cd72254f72b85391feaea64919%7C1&sd
> > >> ata=uIJcFlm02IJ0Yo2cYLxAJZlkbCF2ZMk6dR%2FkhazZwVE%3D&reserved=0
> > >> quickhowto.html?highlight=rest#exposed-methods
> > >>
> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
> > >> for Superset/Caravel.
> > >>
> > >> All that's needed is to derive FAB's model class instead of
> > >> SqlAlchemy's model class (which FAB's model wraps and adds
> > >> functionality to and is 100% compatible AFAICT).
> > >>
> > >> Max
> > >>
> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
> > >> 
> > >> wrote:
> > >>
> > >>> > It may be doable to run this as a different package
> > >>> `airflow-webserver`, an
> > >>> > alternate UI at first, and to eventually rip out the old UI off of
> > >>> > the
> > >>> main
> > >>> > package.
> > >>>
> > >>> This is the same strategy that I was thinking of for AIRFLOW-85. You
> > >>> can build the new UI in parallel, and then delete the old one later.
> > >>> I really think that a REST interface should be a pre-req to any
> > >>> large/new UI changes, though. Getting unified so that everything is
> > >>> driven through REST will be a big win.
> > >>>
> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
> > >>>  wrote:
> > >>> > A multi-tenant UI with composable roles on top of granular
> > permissions.
> > >>> >
> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
> > >>> > easy-ish win (since they're both Flask). FAB Provides a good
> > >>> > authentication and permissio

Re: Airflow 2.0

2016-11-21 Thread siddharth anand
Also, a survey will be a little less noisy and easier to summarize than +1s
in this email thread.
-s (Sid)

On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand  wrote:

> Sergei,
> These are some great ideas -- I would classify at least half of them as
> pain points.
>
> Folks!
> I suggest people (on the dev list) keep feeding this thread at least for
> the next 2 days. I can then float a survey based on these ideas and give
> the community a chance to vote so we can prioritize the wish list.
>
> -s
>
> On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin  wrote:
>
>> I've been running Airflow on 1500 cores in the context of scientific
>> workflows for the past year and a half. Features that would be important
>> to
>> me for 2.0:
>>
>> - Add FK to dag_run to the task_instance table on Postgres so that
>> task_instances can be uniquely attributed to dag runs.
>> - Ensure scheduler can be run continuously without needing restarts. Right
>> now it gets into some ill-determined bad state forcing me to restart it
>> every 20 minutes.
>> - Ensure scheduler can handle tens of thousands of active workflows. Right
>> now this results in extremely long scheduling times and inconsistent
>> scheduling even at 2 thousand active workflows.
>> - Add more flexible task scheduling prioritization. The default
>> prioritization is the opposite of the behaviour I want. I would prefer
>> that
>> downstream tasks always have higher priority than upstream tasks to cause
>> entire workflows to tend to complete sooner, rather than scheduling tasks
>> from other workflows. Having a few scheduling prioritization strategies
>> would be beneficial here.
>> - Provide better support for manually-triggered DAGs on the UI i.e. by
>> showing them as queued.
>> - Provide some resource management capabilities via something like slots
>> that can be defined on workers and occupied by tasks. Using celery's
>> concurrency parameter at the airflow server level is too coarse-grained as
>> it forces all workers to be the same, and does not allow proper resource
>> management when different workflow tasks have different resource
>> requirements thus hurting utilization (a worker could run 8 parallel tasks
>> with small memory footprint, but only 1 task with large memory footprint
>> for instance).
>>
>> With best regards,
>>
>> Sergei.
>>
>>
>> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
>> ext-pavlo.ryabc...@here.com>
>> wrote:
>>
>> > -1. We extremely rely on data profiling, as a pipeline health monitoring
>> > tool
>> >
>> > -Original Message-
>> > From: Chris Riccomini [mailto:criccom...@apache.org]
>> > Sent: Saturday, November 19, 2016 1:57 AM
>> > To: dev@airflow.incubator.apache.org
>> > Subject: Re: Airflow 2.0
>> >
>> > > RIP out the charting application and the data profiler
>> >
>> > Yes please! +1
>> >
>> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
>> > maximebeauche...@gmail.com> wrote:
>> > > Another point that may be controversial for Airflow 2.0: RIP out the
>> > > charting application and the data profiler. Even though it's nice to
>> > > have it there, it's just out of scope and has major security
>> > issues/implications.
>> > >
>> > > I'm not sure how popular it actually is. We may need to run a survey
>> > > at some point around this kind of questions.
>> > >
>> > > Max
>> > >
>> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
>> > > maximebeauche...@gmail.com> wrote:
>> > >
>> > >> Using FAB's Model, we get pretty much all of that (REST API,
>> > >> auth/perms,
>> > >> CRUD) for free:
>> > >> https://emea01.safelinks.protection.outlook.com/?url=http%
>> 3A%2F%2Ffla
>> > >> sk-appbuilder.readthedocs.io%2Fen%2Flatest%2F&data=01%7C01%7
>> C%7C0064f
>> > >> 74fd0d940ab732808d4100e9c3f%7C6d4034cd72254f72b85391feaea649
>> 19%7C1&sd
>> > >> ata=uIJcFlm02IJ0Yo2cYLxAJZlkbCF2ZMk6dR%2FkhazZwVE%3D&reserved=0
>> > >> quickhowto.html?highlight=rest#exposed-methods
>> > >>
>> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
>> > >> for Superset/Caravel.
>> > >>
>> > >> All that's needed is to derive FAB's model class instead of
>> > >> SqlAlchemy's model class (which FAB's model wraps and adds
>> > >> functionality to and is 100% compatible AFAICT).
>> > >>
>> > >> Max
>> > >>
>> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
>> > >> 
>> > >> wrote:
>> > >>
>> > >>> > It may be doable to run this as a different package
>> > >>> `airflow-webserver`, an
>> > >>> > alternate UI at first, and to eventually rip out the old UI off of
>> > >>> > the
>> > >>> main
>> > >>> > package.
>> > >>>
>> > >>> This is the same strategy that I was thinking of for AIRFLOW-85. You
>> > >>> can build the new UI in parallel, and then delete the old one later.
>> > >>> I really think that a REST interface should be a pre-req to any
>> > >>> large/new UI changes, though. Getting unified so that everything is
>> > >>> driven through REST will be a big win.
>> > >>>
>> > >>> On Fri, Nov 18, 2016 at 1:51 PM, M

Re: Airflow 2.0

2016-11-21 Thread Boris Tyukin
I am still deciding between Airflow and Oozie for our brand new Hadoop
project, but here are a few things that I did not like during my limited
testing:

1) pain with scheduler/webserver restarts - things magically begin working
after restart or disappear (like DAG tasks that are no longer part of DAG)
2) no security - a big deal for enterprise-like companies like the one I
work for (a large healthcare organization).
3) backfill concept is a bit weird to me. I think Gerard put it pretty well
- backfills should be run for the entire missing window, not day by day.
Logging for backfills should be consistent with normal DAG Runs.
4) confusion around execution time and start time - I wish the UI would
clearly distinguish them. Execution time only covers the interval to the
previous DAG run - I wish it would go back to the LAST successful DAG run.
That way I could rely on it as a watermark for incremental processes.
5) UTC confusion - not all companies have the luxury of running all their
systems on UTC.
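
The watermark idea in point 4 could be sketched roughly as below. The run records are hypothetical plain dicts, not Airflow's `DagRun` model, and `last_success_watermark` is an illustrative helper name.

```python
from datetime import datetime


def last_success_watermark(runs, default):
    """Return the latest execution_date among successful runs, else default."""
    successes = [r["execution_date"] for r in runs if r["state"] == "success"]
    return max(successes) if successes else default


# Hypothetical run history: the two most recent runs failed.
runs = [
    {"execution_date": datetime(2016, 11, 18), "state": "success"},
    {"execution_date": datetime(2016, 11, 19), "state": "failed"},
    {"execution_date": datetime(2016, 11, 20), "state": "failed"},
]

watermark = last_success_watermark(runs, default=datetime(2016, 1, 1))
```

An incremental load would then process everything newer than `watermark`, so the two failed intervals are picked up again instead of being skipped.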


On Mon, Nov 21, 2016 at 5:26 PM, siddharth anand  wrote:

> Also, a survey will be a little less noisy and easier to summarize than +1s
> in this email thread.
> -s (Sid)
>
> On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand 
> wrote:
>
> > Sergei,
> > These are some great ideas -- I would classify at least half of them as
> > pain points.
> >
> > Folks!
> > I suggest people (on the dev list) keep feeding this thread at least for
> > the next 2 days. I can then float a survey based on these ideas and give
> > the community a chance to vote so we can prioritize the wish list.
> >
> > -s
> >
> > On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin 
> wrote:
> >
> >> I've been running Airflow on 1500 cores in the context of scientific
> >> workflows for the past year and a half. Features that would be important
> >> to
> >> me for 2.0:
> >>
> >> - Add FK to dag_run to the task_instance table on Postgres so that
> >> task_instances can be uniquely attributed to dag runs.
> >> - Ensure scheduler can be run continuously without needing restarts.
> Right
> >> now it gets into some ill-determined bad state forcing me to restart it
> >> every 20 minutes.
> >> - Ensure scheduler can handle tens of thousands of active workflows.
> Right
> >> now this results in extremely long scheduling times and inconsistent
> >> scheduling even at 2 thousand active workflows.
> >> - Add more flexible task scheduling prioritization. The default
> >> prioritization is the opposite of the behaviour I want. I would prefer
> >> that
> >> downstream tasks always have higher priority than upstream tasks to
> cause
> >> entire workflows to tend to complete sooner, rather than scheduling
> tasks
> >> from other workflows. Having a few scheduling prioritization strategies
> >> would be beneficial here.
> >> - Provide better support for manually-triggered DAGs on the UI i.e. by
> >> showing them as queued.
> >> - Provide some resource management capabilities via something like slots
> >> that can be defined on workers and occupied by tasks. Using celery's
> >> concurrency parameter at the airflow server level is too coarse-grained
> as
> >> it forces all workers to be the same, and does not allow proper resource
> >> management when different workflow tasks have different resource
> >> requirements thus hurting utilization (a worker could run 8 parallel
> tasks
> >> with small memory footprint, but only 1 task with large memory footprint
> >> for instance).
> >>
> >> With best regards,
> >>
> >> Sergei.
> >>
> >>
> >> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
> >> ext-pavlo.ryabc...@here.com>
> >> wrote:
> >>
> >> > -1. We extremely rely on data profiling, as a pipeline health
> monitoring
> >> > tool
> >> >
> >> > -Original Message-
> >> > From: Chris Riccomini [mailto:criccom...@apache.org]
> >> > Sent: Saturday, November 19, 2016 1:57 AM
> >> > To: dev@airflow.incubator.apache.org
> >> > Subject: Re: Airflow 2.0
> >> >
> >> > > RIP out the charting application and the data profiler
> >> >
> >> > Yes please! +1
> >> >
> >> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> >> > maximebeauche...@gmail.com> wrote:
> >> > > Another point that may be controversial for Airflow 2.0: RIP out the
> >> > > charting application and the data profiler. Even though it's nice to
> >> > > have it there, it's just out of scope and has major security
> >> > issues/implications.
> >> > >
> >> > > I'm not sure how popular it actually is. We may need to run a survey
> >> > > at some point around this kind of questions.
> >> > >
> >> > > Max
> >> > >
> >> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> >> > > maximebeauche...@gmail.com> wrote:
> >> > >
> >> > >> Using FAB's Model, we get pretty much all of that (REST API,
> >> > >> auth/perms,
> >> > >> CRUD) for free:
> >> > >> https://emea01.safelinks.protection.outlook.com/?url=http%
> >> 3A%2F%2Ffla
> >> > >> sk-appbuilder.readthedocs.io%2Fen%2Flatest%2F&data=01%7C01%7
> >> C%7C0064f
> >> > >> 74fd0d94

Re: Airflow 2.0

2016-11-21 Thread siddharth anand
1) The restart should not be needed, but if folks are reporting it, I'm
curious what the problem might be. If you are running on master, then you
may not be aware of the min_file_process_interval setting.

[scheduler]
min_file_process_interval = 0
max_threads = 4
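
A quick way to check what these two settings resolve to is to parse the section with Python's standard `configparser`. This is only a sketch: the config text is inlined here rather than read from a real airflow.cfg. As far as I understand, min_file_process_interval is the minimum number of seconds between re-parses of a DAG file and max_threads is the number of scheduler threads; treat both readings as assumptions.

```python
import configparser

# Inlined copy of the section above; a real setup would read airflow.cfg.
AIRFLOW_CFG = """
[scheduler]
min_file_process_interval = 0
max_threads = 4
"""

parser = configparser.ConfigParser()
parser.read_string(AIRFLOW_CFG)

# getint raises if the option is missing or is not an integer.
interval = parser.getint("scheduler", "min_file_process_interval")
max_threads = parser.getint("scheduler", "max_threads")
```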

2) Yes.. security is not there. It's often something added to a maturing
project a little late in its growth - after feature completeness,
performance, etc... For example, Azkaban grew at LinkedIn to be widely
adopted for a few years before Azkaban2 came around and introduced security
features. If it's important to you, then vote. It may not be there on your
timeframe, but it will surely be something we land in 2017. Also, if you run
in the cloud, there are some options that can make your installation more
secure.

Great feedback. I know Max kicked this thread off in order to figure out
how to get his team to consider the community's needs when picking what to
fix. This information is in fact helpful to us all.

-s

On Mon, Nov 21, 2016 at 6:13 PM, Boris Tyukin  wrote:

> I am still deciding between Airflow and oozie for our brand new Hadoop
> project but here is a few things that I did not like during my limited
> testing:
>
> 1) pain with scheduler/webserver restarts - things magically begin working
> after restart or disappear (like DAG tasks that are no longer part of DAG)
> 2) no security - a big deal for enterprise-like companies like the one I
> work for (a large healthcare organization).
> 3) backfill concept is a bit weird to me. I think Gerard put it pretty well
> - backfills should be run for the entire missing window, not day by day.
> Logging for backfills should be consistent with normal DAG Runs.
> 4) confusion around execution time and start time - i wish UI would clearly
> distinct them. Execution time only covers interval to a previous DAG run -
> I wish it would go back the LAST successful DAG run. That way I can rely on
> it to use it as watermarks for incremental processes.
> 5) UTC confusion - not all companies have a luxury to run all the systems
> on UTC.
>
>
> On Mon, Nov 21, 2016 at 5:26 PM, siddharth anand 
> wrote:
>
> > Also, a survey will be a little less noisy and easier to summarize than
> +1s
> > in this email thread.
> > -s (Sid)
> >
> > On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand 
> > wrote:
> >
> > > Sergei,
> > > These are some great ideas -- I would classify at least half of them as
> > > pain points.
> > >
> > > Folks!
> > > I suggest people (on the dev list) keep feeding this thread at least
> for
> > > the next 2 days. I can then float a survey based on these ideas and
> give
> > > the community a chance to vote so we can prioritize the wish list.
> > >
> > > -s
> > >
> > > On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin 
> > wrote:
> > >
> > >> I've been running Airflow on 1500 cores in the context of scientific
> > >> workflows for the past year and a half. Features that would be
> important
> > >> to
> > >> me for 2.0:
> > >>
> > >> - Add FK to dag_run to the task_instance table on Postgres so that
> > >> task_instances can be uniquely attributed to dag runs.
> > >> - Ensure scheduler can be run continuously without needing restarts.
> > Right
> > >> now it gets into some ill-determined bad state forcing me to restart
> it
> > >> every 20 minutes.
> > >> - Ensure scheduler can handle tens of thousands of active workflows.
> > Right
> > >> now this results in extremely long scheduling times and inconsistent
> > >> scheduling even at 2 thousand active workflows.
> > >> - Add more flexible task scheduling prioritization. The default
> > >> prioritization is the opposite of the behaviour I want. I would prefer
> > >> that
> > >> downstream tasks always have higher priority than upstream tasks to
> > cause
> > >> entire workflows to tend to complete sooner, rather than scheduling
> > tasks
> > >> from other workflows. Having a few scheduling prioritization
> strategies
> > >> would be beneficial here.
> > >> - Provide better support for manually-triggered DAGs on the UI i.e. by
> > >> showing them as queued.
> > >> - Provide some resource management capabilities via something like
> slots
> > >> that can be defined on workers and occupied by tasks. Using celery's
> > >> concurrency parameter at the airflow server level is too
> coarse-grained
> > as
> > >> it forces all workers to be the same, and does not allow proper
> resource
> > >> management when different workflow tasks have different resource
> > >> requirements thus hurting utilization (a worker could run 8 parallel
> > tasks
> > >> with small memory footprint, but only 1 task with large memory
> footprint
> > >> for instance).
> > >>
> > >> With best regards,
> > >>
> > >> Sergei.
> > >>
> > >>
> > >> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
> > >> ext-pavlo.ryabc...@here.com>
> > >> wrote:
> > >>
> > >> > -1. We extremely rely on data profiling, as a pipeline health
> > monitoring
> > >> > tool
> > >> >
> > >> > -Original Message-

Re: Dynamic creation of DAG

2016-11-21 Thread Maxime Beauchemin
I just added a bit of information about dynamic DAG creation here:
https://github.com/apache/incubator-airflow/pull/1889/files#diff-c6f0a0722c6a2f86277535d7bcec7f8cR162

Let me know if it helps.

Max

On Mon, Nov 21, 2016 at 2:58 AM, Deepak Kumar Malladi 
wrote:

> Hi,
>
> I want to dynamically create a DAG at run time. I tried the snippet given
> in the documentation, but it didn't work for me.
>
> Any pointers on how to trigger DAGs which aren't actually present in the DAG
> folder but are created through code execution (dynamically created)?
>
>
> Thanks & Regards,
> Deepak
>


Re: Airflow 2.0

2016-11-21 Thread Gerard Toonstra
More ideas:

- An "airflow" plugin at the moment is more of an extension; operators,
  hooks, macros. Consider an additional plugin API + default implementation
  for code inside airflow that has a cross-cutting concern, like:
   * We start to use datadog for heavier monitoring of what's going on.
     That's a very specific API. Rather than build something specific, we
     should create a generic API that monitoring implementations can use to
     record how many tasks get scheduled, queued, executed and
     succeeded/failed over time. So this considers all kinds of metrics
     important to running airflow and what we want to monitor to determine
     if things run properly.
   * Same thing for "alerting", or wrap this in the same component.
   * Security concerns that do not fit the "role-based" access behavior.
   * Better secret management: at the company level, it's usually better
     to keep passwords and secrets in a single place. API secrets, keys,
     etc. Some tools exist that integrate with AWS / gcloud that will
     create temporary access keys for you that are valid for one hour. This
     way, airflow has less work to do and access management is done from a
     centralized place.

  An example of such a tool is vault:
  https://www.hashicorp.com/blog/vault.html

- A way for tasks/operators to communicate to airflow how much work was
  done in a given task instance as a simple dict:
   * number of records read/written
   * number of API calls
   * number of lines read/written/transferred.

- Data lineage: Add meta-description elements to DAG and task instances
  that add information as to how data flows through airflow workflows,
  then visualize how that data gets used through a Sankey diagram.
  Maxime once hinted about data lineage in a youtube video of 2015 about
  airflow, but I haven't seen steps taken on that from there. It is
  something that is increasingly more important for us from a data
  security and analysis perspective.
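
A rough sketch of what the generic monitoring API and the "work done" dict could look like together. Everything here is hypothetical (`MetricsBackend`, `InMemoryMetrics` and `record_task_event` are not existing Airflow interfaces); a datadog or statsd implementation would subclass the same interface.

```python
from collections import Counter


class MetricsBackend:
    """Hypothetical plugin interface for monitoring implementations."""

    def incr(self, name, value=1):
        raise NotImplementedError


class InMemoryMetrics(MetricsBackend):
    """Default implementation that just counts in memory."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, name, value=1):
        self.counts[name] += value


def record_task_event(backend, event, stats=None):
    # Count the lifecycle event (scheduled, queued, succeeded, ...).
    backend.incr("tasks." + event)
    # Forward the optional "work done" dict reported by the task.
    for key, value in (stats or {}).items():
        backend.incr("work." + key, value)


metrics = InMemoryMetrics()
record_task_event(metrics, "scheduled")
record_task_event(metrics, "queued")
record_task_event(metrics, "succeeded",
                  {"records_written": 1200, "api_calls": 3})
```

Keeping the interface to a single `incr` call is a deliberate simplification; timers and gauges would need additional methods.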

Gerard



On Tue, Nov 22, 2016 at 3:47 AM, siddharth anand  wrote:

> 1) The restart should not be needed, but if folks are reporting it, I'm
> curious what the problem might be. If you are running on master, then you
> may not be aware of the min_file_process_interval setting.
>
> [scheduler]
>
> min_file_process_interval = 0
>
> max_threads = 4
>
> 2) Yes.. security is not there. It's often something added to a maturing
> project a little late in its growth - after feature completeness,
> performance, etc... For example, Azkaban grew at LinkedIn to be widely
> adopted for a few years before Azkaban2 came around and introduced security
> features. If it's important to you, then vote. It may not be there on your
> timeframe, but it will surely be something we land in 2017. Also, if you run
> in the cloud, there are some options that can make your installation more
> secure.
>
> Great feedback. I know Max kicked this thread off in order to figure out
> how to get his team to consider the community's needs when picking what to
> fix. This information is in fact helpful to us all.
>
> -s
>
> On Mon, Nov 21, 2016 at 6:13 PM, Boris Tyukin 
> wrote:
>
> > I am still deciding between Airflow and oozie for our brand new Hadoop
> > project but here is a few things that I did not like during my limited
> > testing:
> >
> > 1) pain with scheduler/webserver restarts - things magically begin
> working
> > after restart or disappear (like DAG tasks that are no longer part of
> DAG)
> > 2) no security - a big deal for enterprise-like companies like the one I
> > work for (a large healthcare organization).
> > 3) backfill concept is a bit weird to me. I think Gerard put it pretty
> well
> > - backfills should be run for the entire missing window, not day by day.
> > Logging for backfills should be consistent with normal DAG Runs.
> > 4) confusion around execution time and start time - i wish UI would
> clearly
> > distinct them. Execution time only covers interval to a previous DAG run
> -
> > I wish it would go back the LAST successful DAG run. That way I can rely
> on
> > it to use it as watermarks for incremental processes.
> > 5) UTC confusion - not all companies have a luxury to run all the systems
> > on UTC.
> >
> >
> > On Mon, Nov 21, 2016 at 5:26 PM, siddharth anand 
> > wrote:
> >
> > > Also, a survey will be a little less noisy and easier to summarize than
> > +1s
> > > in this email thread.
> > > -s (Sid)
> > >
> > > On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand 
> > > wrote:
> > >
> > > > Sergei,
> > > > These are some great ideas -- I would classify at least half of them
> as
> > > > pain points.
> > > >
> > > > Folks!
> > > > I suggest people (on the dev list) keep feeding this thread at least
> > for
> > > > the next 2 days. I can then float a survey based on these ideas and
> > give
> > > > the community a chance to vote so we can prioritize the wish list.
> > > >
> > > > -s
> > > >
> > > > On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin 
> > > wrote:
> > > >
> > > >