More ideas:

- An "airflow" plugin at the moment is more of an extension: operators, hooks, macros. Consider an additional plugin API, plus a default implementation, for code inside Airflow that has a cross-cutting concern. For example:
  * We are starting to use Datadog for heavier monitoring of what's going on. That's a very specific API. Rather than build something Datadog-specific, we should create a generic API that monitoring implementations can use to record how many tasks get scheduled, queued, executed, and succeeded/failed over time. This covers all the metrics that matter for running Airflow and that we want to monitor to determine whether things run properly.
  * The same goes for "alerting"; or wrap it into the same component.
  * Security concerns that do not fit the role-based access model.
  * Better secret management: at the company level, it is usually better to keep passwords and secrets (API secrets, keys, etc.) in a single place. Some tools that integrate with AWS / gcloud will create temporary access keys for you that are valid for one hour. This way Airflow has less work to do, and access management is done from a centralized place.
    An example of such a tool is Vault: https://www.hashicorp.com/blog/vault.html
- A way for tasks/operators to communicate to Airflow how much work was done in a given task instance, as a simple dict:
  * number of records read/written
  * number of API calls
  * number of lines read/written/transferred
- Data lineage: add meta-description elements to DAGs and task instances that describe how data flows through Airflow workflows, then visualize how that data gets used through a Sankey diagram. Maxime once hinted at data lineage in a 2015 YouTube video about Airflow, but I haven't seen steps taken on it since. It is increasingly important for us from a data security and analysis perspective.

Gerard

On Tue, Nov 22, 2016 at 3:47 AM, siddharth anand <san...@apache.org> wrote:
> 1) The restart should not be needed, but if folks are reporting it, I'm
> curious what the problem might be. If you are running on master, then you
> may not be aware of the min_file_process_interval setting:
>
> [scheduler]
> min_file_process_interval = 0
> max_threads = 4
>
> 2) Yes, security is not there. It's often something added to a maturing
> project a little late in its growth, after feature completeness,
> performance, etc. For example, Azkaban grew at LinkedIn and was widely
> adopted for a few years before Azkaban2 came around and introduced
> security features. If it's important to you, then vote. It may not land
> on your timeframe, but it will surely be something we ship in 2017. Also,
> if you run in the cloud, there are some options that can make your
> installation more secure.
>
> Great feedback. I know Max kicked this thread off in order to figure out
> how to get his team to consider the community's needs when picking what
> to fix. This information is in fact helpful to us all.
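Gerard's generic monitoring/alerting API above could be sketched roughly as follows. Every name here is hypothetical; nothing like this existed in Airflow at the time, and the sketch only illustrates the cross-cutting plugin idea (backends such as Datadog or StatsD would implement the interface):

```python
# Hypothetical sketch of a generic monitoring plugin API; none of these
# classes exist in Airflow, they only illustrate the idea.

class MonitoringBackend:
    """Interface a monitoring implementation (Datadog, StatsD, ...) provides."""

    def record(self, metric, value=1, tags=None):
        raise NotImplementedError

class InMemoryBackend(MonitoringBackend):
    """Default implementation: just count events in a dict."""

    def __init__(self):
        self.counters = {}

    def record(self, metric, value=1, tags=None):
        self.counters[metric] = self.counters.get(metric, 0) + value

class Monitoring:
    """The scheduler/executor would call this at task lifecycle points
    (scheduled, queued, running, success, failed)."""

    def __init__(self):
        self.backends = []

    def register(self, backend):
        self.backends.append(backend)

    def task_state_change(self, dag_id, task_id, state):
        for b in self.backends:
            b.record("task.%s" % state, tags={"dag": dag_id, "task": task_id})

monitoring = Monitoring()
backend = InMemoryBackend()
monitoring.register(backend)
monitoring.task_state_change("my_dag", "my_task", "queued")
monitoring.task_state_change("my_dag", "my_task", "success")
print(backend.counters)  # {'task.queued': 1, 'task.success': 1}
```

Alerting could reuse the same registration mechanism with a different lifecycle hook, which is why wrapping both in one component, as suggested above, seems natural.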
>
> -s
>
> On Mon, Nov 21, 2016 at 6:13 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
> > I am still deciding between Airflow and Oozie for our brand new Hadoop
> > project, but here are a few things that I did not like during my limited
> > testing:
> >
> > 1) Pain with scheduler/webserver restarts: things magically begin working
> > after a restart, or disappear (like DAG tasks that are no longer part of
> > the DAG).
> > 2) No security: a big deal for enterprise-like companies like the one I
> > work for (a large healthcare organization).
> > 3) The backfill concept is a bit weird to me. I think Gerard put it
> > pretty well: backfills should be run for the entire missing window, not
> > day by day. Logging for backfills should be consistent with normal DAG
> > runs.
> > 4) Confusion around execution time and start time: I wish the UI would
> > clearly distinguish them. Execution time only covers the interval to the
> > previous DAG run; I wish it would go back to the LAST successful DAG run.
> > That way I could rely on it as a watermark for incremental processes.
> > 5) UTC confusion: not all companies have the luxury of running all their
> > systems on UTC.
> >
> > On Mon, Nov 21, 2016 at 5:26 PM, siddharth anand <san...@apache.org>
> > wrote:
> > > Also, a survey will be a little less noisy and easier to summarize
> > > than +1s in this email thread.
> > > -s (Sid)
> > >
> > > On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand <san...@apache.org>
> > > wrote:
> > > > Sergei,
> > > > These are some great ideas -- I would classify at least half of them
> > > > as pain points.
> > > >
> > > > Folks!
> > > > I suggest people (on the dev list) keep feeding this thread at least
> > > > for the next 2 days. I can then float a survey based on these ideas
> > > > and give the community a chance to vote so we can prioritize the
> > > > wish list.
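Boris's point 4 above, using the last *successful* run rather than the previous run as a watermark, can be illustrated with a small standalone helper. In a real deployment this would query Airflow's dag_run metadata table; the in-memory list of (execution_date, state) tuples below just stands in for those rows:

```python
from datetime import datetime

def last_successful_watermark(dag_runs):
    """Return the execution_date of the most recent successful run,
    or None if there has never been one.

    `dag_runs` is a list of (execution_date, state) tuples standing in
    for rows of Airflow's dag_run table.
    """
    successes = [d for d, state in dag_runs if state == "success"]
    return max(successes) if successes else None

runs = [
    (datetime(2016, 11, 18), "success"),
    (datetime(2016, 11, 19), "success"),
    (datetime(2016, 11, 20), "failed"),   # the watermark must not advance
    (datetime(2016, 11, 21), "failed"),   # past these failed runs
]
wm = last_successful_watermark(runs)
print(wm)  # 2016-11-19 00:00:00
```

An incremental load that resumes from `wm` instead of from "the previous run" never skips the window covered by failed runs, which is exactly the guarantee Boris is asking for.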
> > > >
> > > > -s
> > > >
> > > > On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin <lle...@gmail.com>
> > > > wrote:
> > > >> I've been running Airflow on 1500 cores in the context of scientific
> > > >> workflows for the past year and a half. Features that would be
> > > >> important to me for 2.0:
> > > >>
> > > >> - Add an FK to dag_run in the task_instance table on Postgres so
> > > >> that task instances can be uniquely attributed to DAG runs.
> > > >> - Ensure the scheduler can run continuously without needing
> > > >> restarts. Right now it gets into some ill-determined bad state,
> > > >> forcing me to restart it every 20 minutes.
> > > >> - Ensure the scheduler can handle tens of thousands of active
> > > >> workflows. Right now this results in extremely long scheduling times
> > > >> and inconsistent scheduling even at two thousand active workflows.
> > > >> - Add more flexible task scheduling prioritization. The default
> > > >> prioritization is the opposite of the behaviour I want: I would
> > > >> prefer that downstream tasks always have higher priority than
> > > >> upstream tasks, so that entire workflows tend to complete sooner,
> > > >> rather than scheduling tasks from other workflows. Having a few
> > > >> scheduling prioritization strategies would be beneficial here.
> > > >> - Provide better support for manually-triggered DAGs in the UI,
> > > >> i.e. by showing them as queued.
> > > >> - Provide some resource management capabilities via something like
> > > >> slots that can be defined on workers and occupied by tasks.
> > > >> Using celery's concurrency parameter at the Airflow server level is
> > > >> too coarse-grained, as it forces all workers to be the same, and it
> > > >> does not allow proper resource management when different workflow
> > > >> tasks have different resource requirements, thus hurting utilization
> > > >> (a worker could run 8 parallel tasks with a small memory footprint,
> > > >> but only 1 task with a large memory footprint, for instance).
> > > >>
> > > >> With best regards,
> > > >>
> > > >> Sergei.
> > > >>
> > > >> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo
> > > >> <ext-pavlo.ryabc...@here.com> wrote:
> > > >> > -1. We rely heavily on data profiling as a pipeline health
> > > >> > monitoring tool.
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Chris Riccomini [mailto:criccom...@apache.org]
> > > >> > Sent: Saturday, November 19, 2016 1:57 AM
> > > >> > To: dev@airflow.incubator.apache.org
> > > >> > Subject: Re: Airflow 2.0
> > > >> >
> > > >> > > RIP out the charting application and the data profiler
> > > >> >
> > > >> > Yes please! +1
> > > >> >
> > > >> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin
> > > >> > <maximebeauche...@gmail.com> wrote:
> > > >> > > Another point that may be controversial for Airflow 2.0: rip
> > > >> > > out the charting application and the data profiler. Even though
> > > >> > > it's nice to have it there, it's just out of scope and has major
> > > >> > > security issues/implications.
> > > >> > >
> > > >> > > I'm not sure how popular it actually is. We may need to run a
> > > >> > > survey at some point around this kind of question.
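Sergei's worker-slot idea above could be sketched like this (hypothetical, not an existing Airflow feature: workers advertise a slot capacity, and each task declares how many slots it occupies, so small and large tasks can share a heterogeneous fleet):

```python
class Worker:
    """A worker advertising a fixed number of resource slots (hypothetical)."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots  # total capacity declared by this worker
        self.used = 0

    def can_run(self, task_slots):
        return self.used + task_slots <= self.slots

    def acquire(self, task_slots):
        if not self.can_run(task_slots):
            raise RuntimeError("not enough free slots on %s" % self.name)
        self.used += task_slots

    def release(self, task_slots):
        self.used -= task_slots

# Matches the utilization example above: a worker can hold 8 one-slot
# (small-memory) tasks, or a single 8-slot (large-memory) task.
w = Worker("worker-1", slots=8)
for _ in range(8):
    w.acquire(1)          # eight small tasks fit side by side
assert not w.can_run(1)   # the worker is now full

w.release(8)              # all small tasks finish...
w.acquire(8)              # ...and one big task takes the whole worker
```

Per-worker slot declarations would avoid the one-size-fits-all celery concurrency setting Sergei criticizes, since each worker could size its capacity to its own memory and CPU.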
> > > >> > >
> > > >> > > Max
> > > >> > >
> > > >> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin
> > > >> > > <maximebeauche...@gmail.com> wrote:
> > > >> > >> Using FAB's Model, we get pretty much all of that (REST API,
> > > >> > >> auth/perms, CRUD) for free:
> > > >> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> > > >> > >>
> > > >> > >> I'm pretty intimate with FAB since I use it (and have
> > > >> > >> contributed to it) for Superset/Caravel.
> > > >> > >>
> > > >> > >> All that's needed is to derive FAB's model class instead of
> > > >> > >> SQLAlchemy's model class (which FAB's model wraps, adds
> > > >> > >> functionality to, and is 100% compatible with, AFAICT).
> > > >> > >>
> > > >> > >> Max
> > > >> > >>
> > > >> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
> > > >> > >> <criccom...@apache.org> wrote:
> > > >> > >>> > It may be doable to run this as a different package
> > > >> > >>> > `airflow-webserver`, an alternate UI at first, and to
> > > >> > >>> > eventually rip the old UI out of the main package.
> > > >> > >>>
> > > >> > >>> This is the same strategy that I was thinking of for
> > > >> > >>> AIRFLOW-85. You can build the new UI in parallel, and then
> > > >> > >>> delete the old one later. I really think that a REST interface
> > > >> > >>> should be a pre-req to any large/new UI changes, though.
> > > >> > >>> Getting unified so that everything is driven through REST will
> > > >> > >>> be a big win.
> > > >> > >>>
> > > >> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
> > > >> > >>> <maximebeauche...@gmail.com> wrote:
> > > >> > >>> > A multi-tenant UI with composable roles on top of granular
> > > >> > >>> > permissions.
> > > >> > >>> >
> > > >> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
> > > >> > >>> > easy-ish win (since they're both Flask). FAB provides a good
> > > >> > >>> > authentication and permission model that ships
> > > >> > >>> > out-of-the-box with a REST API. It suffices to define FAB
> > > >> > >>> > models (derivatives of SQLAlchemy's model) and you get a set
> > > >> > >>> > of perms for the model (can_show, can_list, can_add,
> > > >> > >>> > can_change, can_delete, ...) and a set of CRUD REST
> > > >> > >>> > endpoints. It would also allow us to rip the authentication
> > > >> > >>> > backend code out of Airflow and rely on FAB for that. Also,
> > > >> > >>> > every single view gets permissions auto-created for it, and
> > > >> > >>> > there are easy ways to define row-level-type filters based
> > > >> > >>> > on user permissions.
> > > >> > >>> >
> > > >> > >>> > It may be doable to run this as a different package
> > > >> > >>> > `airflow-webserver`, an alternate UI at first, and to
> > > >> > >>> > eventually rip the old UI out of the main package.
> > > >> > >>> >
> > > >> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
> > > >> > >>> >
> > > >> > >>> > I'd love to carve out some time and lead this.
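The "composable roles on top of granular permissions" model Max describes can be shown in a few lines. This is a toy sketch, not FAB's actual implementation (FAB's real security machinery lives in `flask_appbuilder`); only the perm names mirror the ones FAB auto-creates per view:

```python
# Toy sketch of composable roles over granular, per-view permissions.
# Perm names mirror FAB's auto-created ones, but this is not FAB code.

PERMS = {"can_list", "can_show", "can_add", "can_change", "can_delete"}

class Role:
    def __init__(self, name, perms):
        self.name = name
        self.perms = set(perms) & PERMS  # ignore unknown perm names

class User:
    def __init__(self, roles):
        self.roles = roles

    def can(self, perm):
        # Composable: a user holds a perm if any of their roles grants it.
        return any(perm in r.perms for r in self.roles)

viewer = Role("Viewer", {"can_list", "can_show"})
editor = Role("Editor", {"can_add", "can_change"})

analyst = User([viewer])          # read-only access
admin = User([viewer, editor])    # roles compose additively

print(analyst.can("can_change"))  # False
print(admin.can("can_change"))    # True
```

Row-level filters would then be one more predicate evaluated per query against the same role set, which is how a multi-tenant UI could keep analysts inside their own DAGs, as discussed further down the thread.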
> > > >> > >>> >
> > > >> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini
> > > >> > >>> > <criccom...@apache.org> wrote:
> > > >> > >>> >> A full-fledged REST API (that the UI also uses) would be
> > > >> > >>> >> great in 2.0.
> > > >> > >>> >>
> > > >> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <k...@b23.io>
> > > >> > >>> >> wrote:
> > > >> > >>> >> > Hi All,
> > > >> > >>> >> >
> > > >> > >>> >> > We have been using Airflow heavily for the last couple of
> > > >> > >>> >> > months and it's been great so far. Here are a few things
> > > >> > >>> >> > we'd like to see prioritized in 2.0.
> > > >> > >>> >> >
> > > >> > >>> >> > 1) Role-based access to DAGs:
> > > >> > >>> >> > We would like to see better role-based access through the
> > > >> > >>> >> > UI. There's a related ticket out there, but it hasn't
> > > >> > >>> >> > seen any action in a few months:
> > > >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
> > > >> > >>> >> >
> > > >> > >>> >> > We use a templating system to create/deploy DAGs
> > > >> > >>> >> > dynamically based on some directory/file structure. This
> > > >> > >>> >> > allows analysts to quickly deploy and schedule their ETL
> > > >> > >>> >> > code without having to interact with the Airflow
> > > >> > >>> >> > installation directly.
> > > >> > >>> >> > It would be great if those same analysts could access
> > > >> > >>> >> > their own DAGs in the UI so that they can clear DAG runs,
> > > >> > >>> >> > mark success, etc., while keeping them away from our core
> > > >> > >>> >> > ETL and other people's/organizations' DAGs. Some of this
> > > >> > >>> >> > can be accomplished with 'filter by owner', but it
> > > >> > >>> >> > doesn't address the use case where a DAG can be
> > > >> > >>> >> > maintained by multiple users in the same organization
> > > >> > >>> >> > when they have separate Airflow user accounts.
> > > >> > >>> >> >
> > > >> > >>> >> > 2) An option to turn off backfill:
> > > >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
> > > >> > >>> >> > For cases where a DAG does an insert overwrite on a table
> > > >> > >>> >> > every day. This might be a realistic option for the
> > > >> > >>> >> > current version, but I just wanted to call attention to
> > > >> > >>> >> > this feature request.
> > > >> > >>> >> >
> > > >> > >>> >> > Best,
> > > >> > >>> >> > David
> > > >> > >>> >> >
> > > >> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin
> > > >> > >>> >> > <maximebeauche...@gmail.com<mailto:maximebeauchemin@gmail.com>>
> > > >> > >>> >> > wrote:
> > > >> > >>> >> >
> > > >> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
> > > >> > >>> >> >
> > > >> > >>> >> > I wanted to share some ideas around what I would like to
> > > >> > >>> >> > do in Airflow 2.0 and would love to hear what others are
> > > >> > >>> >> > thinking. I'll compile the ideas that are shared in this
> > > >> > >>> >> > thread into a Wiki once the conversation fades.
> > > >> > >>> >> >
> > > >> > >>> >> > -------------------------------------------
> > > >> > >>> >> >
> > > >> > >>> >> > First idea, to get the conversation started:
> > > >> > >>> >> >
> > > >> > >>> >> > *Breaking down the package*
> > > >> > >>> >> > `pip install airflow-common airflow-scheduler
> > > >> > >>> >> > airflow-webserver airflow-operators-googlecloud ...`
> > > >> > >>> >> >
> > > >> > >>> >> > It seems to me like we're getting to a point where having
> > > >> > >>> >> > different repositories and different packages would make
> > > >> > >>> >> > things much easier in all sorts of ways. For instance,
> > > >> > >>> >> > the web server is a lot less sensitive than the
> > > >> > >>> >> > scheduler, and changes to operators should/could be
> > > >> > >>> >> > deployed at will, independently from the main package.
> > > >> > >>> >> > People could upgrade only certain packages in their
> > > >> > >>> >> > environment when needed. Travis builds would be more
> > > >> > >>> >> > targeted and take less time, ...
> > > >> > >>> >> >
> > > >> > >>> >> > Also, the whole current "extra_requires" approach to
> > > >> > >>> >> > optional dependencies (in setup.py) is kind of getting
> > > >> > >>> >> > out of hand.
> > > >> > >>> >> >
> > > >> > >>> >> > Of course `pip install airflow` would bring in a
> > > >> > >>> >> > collection of sub-packages similar in functionality to
> > > >> > >>> >> > what it does now, perhaps without so many of the
> > > >> > >>> >> > operators you probably don't need in your environment.
> > > >> > >>> >> >
> > > >> > >>> >> > The release process is the main pain point and the
> > > >> > >>> >> > biggest risk for the project, and I feel like this is a
> > > >> > >>> >> > solid solution to address it.
> > > >> > >>> >> >
> > > >> > >>> >> > Max
> > > >> --
> > > >> Sergei