More ideas:

- An "airflow" plugin at the moment is more of an extension: operators, hooks, macros. Consider an additional plugin API, plus a default implementation, for code inside Airflow that has a cross-cutting concern. For example:
  * We are starting to use Datadog for heavier monitoring of what's going on. That's a very specific API. Rather than build something Datadog-specific, we should create a generic API that monitoring implementations can use to record how many tasks get scheduled, queued, executed, and succeeded/failed over time. This covers all the metrics that matter for running Airflow and that we want to monitor to determine whether things run properly.
  * The same goes for "alerting"; or wrap it into the same component.
  * Security concerns that do not fit the role-based access model.
  * Better secret management: at the company level, it is usually better to keep passwords and secrets (API secrets, keys, etc.) in a single place. Some tools that integrate with AWS / gcloud will create temporary access keys for you that are valid for one hour. This way Airflow has less work to do, and access management is done from a centralized place.
    An example of such a tool is Vault: https://www.hashicorp.com/blog/vault.html
- A way for tasks/operators to communicate to Airflow how much work was done in a given task instance, as a simple dict:
  * number of records read/written
  * number of API calls
  * number of lines read/written/transferred
- Data lineage: add meta-description elements to DAGs and task instances that describe how data flows through Airflow workflows, then visualize how that data gets used through a Sankey diagram. Maxime once hinted at data lineage in a 2015 YouTube video about Airflow, but I haven't seen steps taken on it since. It is increasingly important for us from a data security and analysis perspective.

Gerard

On Tue, Nov 22, 2016 at 3:47 AM, siddharth anand <san...@apache.org> wrote:
> 1) The restart should not be needed, but if folks are reporting it, I'm
> curious what the problem might be. If you are running on master, then you
> may not be aware of the min_file_process_interval setting:
>
> [scheduler]
> min_file_process_interval = 0
> max_threads = 4
>
> 2) Yes, security is not there. It's often something added to a maturing
> project a little late in its growth, after feature completeness,
> performance, etc. For example, Azkaban grew at LinkedIn and was widely
> adopted for a few years before Azkaban2 came around and introduced
> security features. If it's important to you, then vote. It may not land
> on your timeframe, but it will surely be something we ship in 2017. Also,
> if you run in the cloud, there are some options that can make your
> installation more secure.
>
> Great feedback. I know Max kicked this thread off in order to figure out
> how to get his team to consider the community's needs when picking what
> to fix. This information is in fact helpful to us all.
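Gerard's generic monitoring/alerting API above could be sketched roughly as follows. Every name here is hypothetical; nothing like this existed in Airflow at the time, and the sketch only illustrates the cross-cutting plugin idea (backends such as Datadog or StatsD would implement the interface):

```python
# Hypothetical sketch of a generic monitoring plugin API; none of these
# classes exist in Airflow, they only illustrate the idea.

class MonitoringBackend:
    """Interface a monitoring implementation (Datadog, StatsD, ...) provides."""

    def record(self, metric, value=1, tags=None):
        raise NotImplementedError

class InMemoryBackend(MonitoringBackend):
    """Default implementation: just count events in a dict."""

    def __init__(self):
        self.counters = {}

    def record(self, metric, value=1, tags=None):
        self.counters[metric] = self.counters.get(metric, 0) + value

class Monitoring:
    """The scheduler/executor would call this at task lifecycle points
    (scheduled, queued, running, success, failed)."""

    def __init__(self):
        self.backends = []

    def register(self, backend):
        self.backends.append(backend)

    def task_state_change(self, dag_id, task_id, state):
        for b in self.backends:
            b.record("task.%s" % state, tags={"dag": dag_id, "task": task_id})

monitoring = Monitoring()
backend = InMemoryBackend()
monitoring.register(backend)
monitoring.task_state_change("my_dag", "my_task", "queued")
monitoring.task_state_change("my_dag", "my_task", "success")
print(backend.counters)  # {'task.queued': 1, 'task.success': 1}
```

Alerting could reuse the same registration mechanism with a different lifecycle hook, which is why wrapping both in one component, as suggested above, seems natural.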
>
> -s
>
> On Mon, Nov 21, 2016 at 6:13 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
> > I am still deciding between Airflow and Oozie for our brand new Hadoop
> > project, but here are a few things that I did not like during my limited
> > testing:
> >
> > 1) Pain with scheduler/webserver restarts: things magically begin working
> > after a restart, or disappear (like DAG tasks that are no longer part of
> > the DAG).
> > 2) No security: a big deal for enterprise-like companies like the one I
> > work for (a large healthcare organization).
> > 3) The backfill concept is a bit weird to me. I think Gerard put it
> > pretty well: backfills should be run for the entire missing window, not
> > day by day. Logging for backfills should be consistent with normal DAG
> > runs.
> > 4) Confusion around execution time and start time: I wish the UI would
> > clearly distinguish them. Execution time only covers the interval to the
> > previous DAG run; I wish it would go back to the LAST successful DAG run.
> > That way I could rely on it as a watermark for incremental processes.
> > 5) UTC confusion: not all companies have the luxury of running all their
> > systems on UTC.
> >
> > On Mon, Nov 21, 2016 at 5:26 PM, siddharth anand <san...@apache.org>
> > wrote:
> > > Also, a survey will be a little less noisy and easier to summarize
> > > than +1s in this email thread.
> > > -s (Sid)
> > >
> > > On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand <san...@apache.org>
> > > wrote:
> > > > Sergei,
> > > > These are some great ideas -- I would classify at least half of them
> > > > as pain points.
> > > >
> > > > Folks!
> > > > I suggest people (on the dev list) keep feeding this thread at least
> > > > for the next 2 days. I can then float a survey based on these ideas
> > > > and give the community a chance to vote so we can prioritize the
> > > > wish list.
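Boris's point 4 above, using the last *successful* run rather than the previous run as a watermark, can be illustrated with a small standalone helper. In a real deployment this would query Airflow's dag_run metadata table; the in-memory list of (execution_date, state) tuples below just stands in for those rows:

```python
from datetime import datetime

def last_successful_watermark(dag_runs):
    """Return the execution_date of the most recent successful run,
    or None if there has never been one.

    `dag_runs` is a list of (execution_date, state) tuples standing in
    for rows of Airflow's dag_run table.
    """
    successes = [d for d, state in dag_runs if state == "success"]
    return max(successes) if successes else None

runs = [
    (datetime(2016, 11, 18), "success"),
    (datetime(2016, 11, 19), "success"),
    (datetime(2016, 11, 20), "failed"),   # the watermark must not advance
    (datetime(2016, 11, 21), "failed"),   # past these failed runs
]
wm = last_successful_watermark(runs)
print(wm)  # 2016-11-19 00:00:00
```

An incremental load that resumes from `wm` instead of from "the previous run" never skips the window covered by failed runs, which is exactly the guarantee Boris is asking for.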
> > > >
> > > > -s
> > > >
> > > > On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin <lle...@gmail.com>
> > > > wrote:
> > > >> I've been running Airflow on 1500 cores in the context of scientific
> > > >> workflows for the past year and a half. Features that would be
> > > >> important to me for 2.0:
> > > >>
> > > >> - Add an FK to dag_run in the task_instance table on Postgres so
> > > >> that task instances can be uniquely attributed to DAG runs.
> > > >> - Ensure the scheduler can run continuously without needing
> > > >> restarts. Right now it gets into some ill-determined bad state,
> > > >> forcing me to restart it every 20 minutes.
> > > >> - Ensure the scheduler can handle tens of thousands of active
> > > >> workflows. Right now this results in extremely long scheduling times
> > > >> and inconsistent scheduling even at two thousand active workflows.
> > > >> - Add more flexible task scheduling prioritization. The default
> > > >> prioritization is the opposite of the behaviour I want: I would
> > > >> prefer that downstream tasks always have higher priority than
> > > >> upstream tasks, so that entire workflows tend to complete sooner,
> > > >> rather than scheduling tasks from other workflows. Having a few
> > > >> scheduling prioritization strategies would be beneficial here.
> > > >> - Provide better support for manually-triggered DAGs in the UI,
> > > >> i.e. by showing them as queued.
> > > >> - Provide some resource management capabilities via something like
> > > >> slots that can be defined on workers and occupied by tasks.
> > > >> Using celery's concurrency parameter at the Airflow server level is
> > > >> too coarse-grained, as it forces all workers to be the same, and it
> > > >> does not allow proper resource management when different workflow
> > > >> tasks have different resource requirements, thus hurting utilization
> > > >> (a worker could run 8 parallel tasks with a small memory footprint,
> > > >> but only 1 task with a large memory footprint, for instance).
> > > >>
> > > >> With best regards,
> > > >>
> > > >> Sergei.
> > > >>
> > > >> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo
> > > >> <ext-pavlo.ryabc...@here.com> wrote:
> > > >> > -1. We rely heavily on data profiling as a pipeline health
> > > >> > monitoring tool.
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Chris Riccomini [mailto:criccom...@apache.org]
> > > >> > Sent: Saturday, November 19, 2016 1:57 AM
> > > >> > To: dev@airflow.incubator.apache.org
> > > >> > Subject: Re: Airflow 2.0
> > > >> >
> > > >> > > RIP out the charting application and the data profiler
> > > >> >
> > > >> > Yes please! +1
> > > >> >
> > > >> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin
> > > >> > <maximebeauche...@gmail.com> wrote:
> > > >> > > Another point that may be controversial for Airflow 2.0: rip
> > > >> > > out the charting application and the data profiler. Even though
> > > >> > > it's nice to have it there, it's just out of scope and has major
> > > >> > > security issues/implications.
> > > >> > >
> > > >> > > I'm not sure how popular it actually is. We may need to run a
> > > >> > > survey at some point around this kind of question.
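Sergei's worker-slot idea above could be sketched like this (hypothetical, not an existing Airflow feature: workers advertise a slot capacity, and each task declares how many slots it occupies, so small and large tasks can share a heterogeneous fleet):

```python
class Worker:
    """A worker advertising a fixed number of resource slots (hypothetical)."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots  # total capacity declared by this worker
        self.used = 0

    def can_run(self, task_slots):
        return self.used + task_slots <= self.slots

    def acquire(self, task_slots):
        if not self.can_run(task_slots):
            raise RuntimeError("not enough free slots on %s" % self.name)
        self.used += task_slots

    def release(self, task_slots):
        self.used -= task_slots

# Matches the utilization example above: a worker can hold 8 one-slot
# (small-memory) tasks, or a single 8-slot (large-memory) task.
w = Worker("worker-1", slots=8)
for _ in range(8):
    w.acquire(1)          # eight small tasks fit side by side
assert not w.can_run(1)   # the worker is now full

w.release(8)              # all small tasks finish...
w.acquire(8)              # ...and one big task takes the whole worker
```

Per-worker slot declarations would avoid the one-size-fits-all celery concurrency setting Sergei criticizes, since each worker could size its capacity to its own memory and CPU.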
> > > >> > >
> > > >> > > Max
> > > >> > >
> > > >> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin
> > > >> > > <maximebeauche...@gmail.com> wrote:
> > > >> > >> Using FAB's Model, we get pretty much all of that (REST API,
> > > >> > >> auth/perms, CRUD) for free:
> > > >> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> > > >> > >>
> > > >> > >> I'm pretty intimate with FAB since I use it (and have
> > > >> > >> contributed to it) for Superset/Caravel.
> > > >> > >>
> > > >> > >> All that's needed is to derive FAB's model class instead of
> > > >> > >> SQLAlchemy's model class (which FAB's model wraps, adds
> > > >> > >> functionality to, and is 100% compatible with, AFAICT).
> > > >> > >>
> > > >> > >> Max
> > > >> > >>
> > > >> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
> > > >> > >> <criccom...@apache.org> wrote:
> > > >> > >>> > It may be doable to run this as a different package
> > > >> > >>> > `airflow-webserver`, an alternate UI at first, and to
> > > >> > >>> > eventually rip the old UI out of the main package.
> > > >> > >>>
> > > >> > >>> This is the same strategy that I was thinking of for
> > > >> > >>> AIRFLOW-85. You can build the new UI in parallel, and then
> > > >> > >>> delete the old one later. I really think that a REST interface
> > > >> > >>> should be a pre-req to any large/new UI changes, though.
> > > >> > >>> Getting unified so that everything is driven through REST will
> > > >> > >>> be a big win.
> > > >> > >>>
> > > >> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
> > > >> > >>> <maximebeauche...@gmail.com> wrote:
> > > >> > >>> > A multi-tenant UI with composable roles on top of granular
> > > >> > >>> > permissions.
> > > >> > >>> >
> > > >> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
> > > >> > >>> > easy-ish win (since they're both Flask). FAB provides a good
> > > >> > >>> > authentication and permission model that ships
> > > >> > >>> > out-of-the-box with a REST API. It suffices to define FAB
> > > >> > >>> > models (derivatives of SQLAlchemy's model) and you get a set
> > > >> > >>> > of perms for the model (can_show, can_list, can_add,
> > > >> > >>> > can_change, can_delete, ...) and a set of CRUD REST
> > > >> > >>> > endpoints. It would also allow us to rip the authentication
> > > >> > >>> > backend code out of Airflow and rely on FAB for that. Also,
> > > >> > >>> > every single view gets permissions auto-created for it, and
> > > >> > >>> > there are easy ways to define row-level-type filters based
> > > >> > >>> > on user permissions.
> > > >> > >>> >
> > > >> > >>> > It may be doable to run this as a different package
> > > >> > >>> > `airflow-webserver`, an alternate UI at first, and to
> > > >> > >>> > eventually rip the old UI out of the main package.
> > > >> > >>> >
> > > >> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
> > > >> > >>> >
> > > >> > >>> > I'd love to carve out some time and lead this.
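The "composable roles on top of granular permissions" model Max describes can be shown in a few lines. This is a toy sketch, not FAB's actual implementation (FAB's real security machinery lives in `flask_appbuilder`); only the perm names mirror the ones FAB auto-creates per view:

```python
# Toy sketch of composable roles over granular, per-view permissions.
# Perm names mirror FAB's auto-created ones, but this is not FAB code.

PERMS = {"can_list", "can_show", "can_add", "can_change", "can_delete"}

class Role:
    def __init__(self, name, perms):
        self.name = name
        self.perms = set(perms) & PERMS  # ignore unknown perm names

class User:
    def __init__(self, roles):
        self.roles = roles

    def can(self, perm):
        # Composable: a user holds a perm if any of their roles grants it.
        return any(perm in r.perms for r in self.roles)

viewer = Role("Viewer", {"can_list", "can_show"})
editor = Role("Editor", {"can_add", "can_change"})

analyst = User([viewer])          # read-only access
admin = User([viewer, editor])    # roles compose additively

print(analyst.can("can_change"))  # False
print(admin.can("can_change"))    # True
```

Row-level filters would then be one more predicate evaluated per query against the same role set, which is how a multi-tenant UI could keep analysts inside their own DAGs, as discussed further down the thread.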
> > > >> > >>> >
> > > >> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini
> > > >> > >>> > <criccom...@apache.org> wrote:
> > > >> > >>> >> A full-fledged REST API (that the UI also uses) would be
> > > >> > >>> >> great in 2.0.
> > > >> > >>> >>
> > > >> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <k...@b23.io>
> > > >> > >>> >> wrote:
> > > >> > >>> >> > Hi All,
> > > >> > >>> >> >
> > > >> > >>> >> > We have been using Airflow heavily for the last couple of
> > > >> > >>> >> > months and it's been great so far. Here are a few things
> > > >> > >>> >> > we'd like to see prioritized in 2.0.
> > > >> > >>> >> >
> > > >> > >>> >> > 1) Role-based access to DAGs:
> > > >> > >>> >> > We would like to see better role-based access through the
> > > >> > >>> >> > UI. There's a related ticket out there, but it hasn't
> > > >> > >>> >> > seen any action in a few months:
> > > >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
> > > >> > >>> >> >
> > > >> > >>> >> > We use a templating system to create/deploy DAGs
> > > >> > >>> >> > dynamically based on some directory/file structure. This
> > > >> > >>> >> > allows analysts to quickly deploy and schedule their ETL
> > > >> > >>> >> > code without having to interact with the Airflow
> > > >> > >>> >> > installation directly.
> > > >> > >>> >> > It would be great if those same analysts could access
> > > >> > >>> >> > their own DAGs in the UI so that they can clear DAG runs,
> > > >> > >>> >> > mark success, etc., while keeping them away from our core
> > > >> > >>> >> > ETL and other people's/organizations' DAGs. Some of this
> > > >> > >>> >> > can be accomplished with 'filter by owner', but it
> > > >> > >>> >> > doesn't address the use case where a DAG can be
> > > >> > >>> >> > maintained by multiple users in the same organization
> > > >> > >>> >> > when they have separate Airflow user accounts.
> > > >> > >>> >> >
> > > >> > >>> >> > 2) An option to turn off backfill:
> > > >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
> > > >> > >>> >> > For cases where a DAG does an insert overwrite on a table
> > > >> > >>> >> > every day. This might be a realistic option for the
> > > >> > >>> >> > current version, but I just wanted to call attention to
> > > >> > >>> >> > this feature request.
> > > >> > >>> >> >
> > > >> > >>> >> > Best,
> > > >> > >>> >> > David
> > > >> > >>> >> >
> > > >> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin
> > > >> > >>> >> > <maximebeauche...@gmail.com<mailto:maximebeauchemin@gmail.com>>
> > > >> > >>> >> > wrote:
> > > >> > >>> >> >
> > > >> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
> > > >> > >>> >> >
> > > >> > >>> >> > I wanted to share some ideas around what I would like to
> > > >> > >>> >> > do in Airflow 2.0 and would love to hear what others are
> > > >> > >>> >> > thinking. I'll compile the ideas that are shared in this
> > > >> > >>> >> > thread into a Wiki once the conversation fades.
> > > >> > >>> >> >
> > > >> > >>> >> > -------------------------------------------
> > > >> > >>> >> >
> > > >> > >>> >> > First idea, to get the conversation started:
> > > >> > >>> >> >
> > > >> > >>> >> > *Breaking down the package*
> > > >> > >>> >> > `pip install airflow-common airflow-scheduler
> > > >> > >>> >> > airflow-webserver airflow-operators-googlecloud ...`
> > > >> > >>> >> >
> > > >> > >>> >> > It seems to me like we're getting to a point where having
> > > >> > >>> >> > different repositories and different packages would make
> > > >> > >>> >> > things much easier in all sorts of ways. For instance,
> > > >> > >>> >> > the web server is a lot less sensitive than the
> > > >> > >>> >> > scheduler, and changes to operators should/could be
> > > >> > >>> >> > deployed at will, independently from the main package.
> > > >> > >>> >> > People could upgrade only certain packages in their
> > > >> > >>> >> > environment when needed. Travis builds would be more
> > > >> > >>> >> > targeted and take less time, ...
> > > >> > >>> >> >
> > > >> > >>> >> > Also, the whole current "extra_requires" approach to
> > > >> > >>> >> > optional dependencies (in setup.py) is kind of getting
> > > >> > >>> >> > out of hand.
> > > >> > >>> >> >
> > > >> > >>> >> > Of course `pip install airflow` would bring in a
> > > >> > >>> >> > collection of sub-packages similar in functionality to
> > > >> > >>> >> > what it does now, perhaps without so many of the
> > > >> > >>> >> > operators you probably don't need in your environment.
> > > >> > >>> >> >
> > > >> > >>> >> > The release process is the main pain point and the
> > > >> > >>> >> > biggest risk for the project, and I feel like this is a
> > > >> > >>> >> > solid solution to address it.
> > > >> > >>> >> >
> > > >> > >>> >> > Max
> > > >> --
> > > >> Sergei