Re: Graduation resolution passed - Airflow is a TLP

2018-12-20 Thread Maxime Beauchemin
"They grow up so fast!" :) This is huge! Congratulations to everyone involved. On Thu, Dec 20, 2018 at 3:53 PM Feng Lu wrote: > Fantastic news!! Congrats everyone! > > On Thu, Dec 20, 2018 at 2:18 PM Tao Feng wrote: > > > Thanks Jakob for driving the graduation! Great news! > > > > On Thu,

Re: timeout not working in SqlSensor?

2018-12-20 Thread Maxime Beauchemin
I think it's `timeout` not `time_out`. https://airflow.apache.org/code.html#basesensoroperator On Thu, Dec 20, 2018 at 12:12 PM Scott Halgrim wrote: > Does the timeout param work the way I think it should in Airflow 1.8? My > query just pokes at the poke interval indefinitely. I want it to

Re: [IE] [VOTE] Graduate the Apache Airflow as a TLP

2018-11-30 Thread Maxime Beauchemin
e > Apache Airflow Project: > > * Alex Guziel > * Alex Van Boxel > * Arthur Wiedmer > * Ash Berlin-Taylor > * Bolke de Bruin > * Chris Riccomini > * Dan Davydov > * Fokko Driesprong > * Hitesh Shah > * Jakob Homan > * Jeremiah Lowin > * Joy Gao >

Re: grant edit permission for airflow wiki

2018-11-26 Thread Maxime Beauchemin
Granted! Max On Mon, Nov 26, 2018 at 10:21 PM Tao Feng wrote: > Could any Airflow wiki admin grant me the edit permission? My wiki user > name is tfeng. > > Thanks, > -Tao >

Re: [DISCUSS] Apache Airflow graduation from the incubator

2018-11-26 Thread Maxime Beauchemin
This is great to see happen, it's been a long time coming! Also count me in, I'll be happy to help on that last push! On Mon, Nov 26, 2018 at 5:21 PM Hitesh Shah wrote: > +1. The Airflow community has come a long way since its addition to the > Incubator and I believe it is more than ready to

Re: programmatically creating and airflow quirks

2018-11-25 Thread Maxime Beauchemin
The historical reason is that people would check in scripts in the repo that had actual compute or other forms of undesired effects in module scope (scripts with no "if __name__ == '__main__':") and Airflow would just run this script while looking for DAGs. So we added this mitigation patch that

Re: A Naive Multi-Scheduler Architecture Experiment of Airflow

2018-11-09 Thread Maxime Beauchemin
rawbacks > mentioned already. > > Get Outlook for Android<https://aka.ms/ghei36> > > > From: Maxime Beauchemin > Sent: Friday, November 9, 2018 5:03:02 PM > To: dev@airflow.incubator.apache.org > Cc: d...@airflow.apache.org; yr

Re: A Naive Multi-Scheduler Architecture Experiment of Airflow

2018-11-09 Thread Maxime Beauchemin
I mean at that point it's just as easy (or easier) to do things properly: get the scheduler subprocesses to take a lock on the DAG it's about to process, and release it when it's done. Add a lock timestamp and bit of logic to expire locks (to self heal if the process ever crashed and failed at
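
The locking idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `DagLockTable` and its methods are hypothetical names, and the in-memory dict stands in for what would really be a row in the metadata database updated atomically.

```python
import time


class DagLockTable:
    """In-memory stand-in for a lock table (hypothetical, not Airflow code)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._locks = {}  # dag_id -> (owner, locked_at)

    def acquire(self, dag_id, owner, now=None):
        """Take the lock if it is free or if the current holder's lock expired."""
        now = time.time() if now is None else now
        held = self._locks.get(dag_id)
        if held is not None:
            _, locked_at = held
            if now - locked_at < self.ttl_seconds:
                return False  # still held and not yet expired
        # Free, or expired because the previous holder crashed before releasing.
        self._locks[dag_id] = (owner, now)
        return True

    def release(self, dag_id, owner):
        """Release the lock, but only if `owner` is the one holding it."""
        held = self._locks.get(dag_id)
        if held is not None and held[0] == owner:
            del self._locks[dag_id]
```

The expiry check is what provides the self-healing: a lock left behind by a crashed process simply ages out after `ttl_seconds` and the next subprocess can take it.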

Re: Duplicate key unique constraint error

2018-11-02 Thread Maxime Beauchemin
throw this error. I am not sure then when > can this error happen. > > > > On 2 November 2018 at 8:37:20 AM, Maxime Beauchemin ( > maximebeauche...@gmail.com) wrote: > > The scheduler should never fail hard. The schedule logic that tries to > insert the new task instance s

Re: Duplicate key unique constraint error

2018-11-02 Thread Maxime Beauchemin
k Sinha (abhis...@infoworks.io) > wrote: > > Max, > > The schedule interval is 1 day. > > > > Sent from my iPhone > > > On 30-Oct-2018, at 9:29 PM, Maxime Beauchemin < > maximebeauche...@gmail.com> > wrote: > > > > Also what's your schedul

Re: Deployment / Execution Model

2018-10-31 Thread Maxime Beauchemin
Deploying the DAGs should be decoupled from deploying Airflow itself. You can just use a resource that syncs the DAGs repo to the boxes on the Airflow cluster periodically (say every minute). Resource orchestrators like Chef, Ansible, Puppet, should have some easy way to do that. Either that or

Re: A Naive Multi-Scheduler Architecture Experiment of Airflow

2018-10-31 Thread Maxime Beauchemin
A few related thoughts: * there may be hiccups around concurrency (pools, queues), though the worker should double-check that the constraints are still met when firing the task, so in theory this should be ok * there may be more "misfires" meaning the task gets sent to the worker, but by the time

Re: Duplicate key unique constraint error

2018-10-30 Thread Maxime Beauchemin
till > any > >> backfill involved. > >> > >> Is there a way where I can find out in logs, if more than one instance > of > >> scheduler is running? > >> > >> > >> On 29 October 2018 at 10:43:19 PM, Maxime Beauchemin ( > >>

Re: Duplicate key unique constraint error

2018-10-29 Thread Maxime Beauchemin
om multiple scheduler instances > running? > > > On 29 October 2018 at 9:30:56 PM, Maxime Beauchemin ( > maximebeauche...@gmail.com) wrote: > > Abhishek, are you running more than one scheduler instance at once? > > Max > > On Mon, Oct 29, 2018 at 8:17 AM Abhishek Sinha

Re: Duplicate key unique constraint error

2018-10-29 Thread Maxime Beauchemin
Abhishek, are you running more than one scheduler instance at once? Max On Mon, Oct 29, 2018 at 8:17 AM Abhishek Sinha wrote: > The issue is happening more frequently now. Can someone please look into > this? > > > > > On 24 September 2018 at 12:42:49 PM, Abhishek Sinha (abhis...@infoworks.io

Re: Pinning dependencies for Apache Airflow

2018-10-19 Thread Maxime Beauchemin
ttps://pip.pypa.io/en/latest/user_guide/#constraints-files > > (sorry for the brief message) > > > On 19 Oct 2018, at 17:02, Maxime Beauchemin > wrote: > > > >> releases in pip should have stable (pinned deps) > > I think that's an issue. When setup.py (the o

Re: Pinning dependencies for Apache Airflow

2018-10-19 Thread Maxime Beauchemin
> releases in pip should have stable (pinned deps) I think that's an issue. When setup.py (the only reqs that setuptools/pip knows about) is restrictive, there's no way to change that in your environment, install will just fail if you deviate (are there any hacks/solutions around that that I don't

Re: Pinning dependencies for Apache Airflow

2018-10-07 Thread Maxime Beauchemin
pip-tools can definitely help here to ship a reference [locked] `requirements.txt` that can be used in [all or part of] the CI. It's actually kind of important to get CI to fail when a new [backward incompatible] lib comes out and breaks things while allowing version ranges. I think there may be

Re: Manual validation operator

2018-10-05 Thread Maxime Beauchemin
It's a bit of a hack, but to save up slots you could just have an instantly-failing PythonOperator (just raise an exception in the callable) that would go in a failed state. Marking it as "success" when the conditions are met would act as a trigger. On Fri, Oct 5, 2018 at 9:07 AM Brian Greene

Re: Airflow Docs - RTD vs Apache Site

2018-10-05 Thread Maxime Beauchemin
A few thoughts: * we absolutely have to serve a project site off of `airflow.apache.org`, that's an ASF requirement * maybe `airflow.apache.org` could be setup as a proxy to readthedocs-latest (?) [I'm on vacation and have very slow internet, so didn't research whether that's a documented

Re: execution_date - can we stop the confusion?

2018-10-01 Thread Maxime Beauchemin
> > > > started > > > > > > > > > > > > > > by a worker? > > > > > The lack of clarity and completeness around these suggestions, > > > > > alongside > > > > > > > > > inane declarations l

Flask App Builder [FAB] support

2018-09-28 Thread Maxime Beauchemin
The new [experimental] web UI is based on Flask App Builder [FAB] to which we contributed security fixes recently. It's been hard to get the main maintainer of the project's attention to release to Pypi. For context, I have write access to the repository, but no access to release to Pypi. Please

Re: execution_date - can we stop the confusion?

2018-09-26 Thread Maxime Beauchemin
I think if you have a functional mindset (as in "functional data engineering") as opposed to a cron mindset, using the left bound of the time interval makes a lot of sense.

Re: Airflow: Apache Graduation

2018-09-20 Thread Maxime Beauchemin
Yeah let's make it happen! I'm happy to set some time aside to help with the final push. Max On Thu, Sep 20, 2018 at 9:53 AM Sid Anand wrote: > Folks! (specifically Bolke, Fokko, Ash) > What's needed to graduate from Apache? > > Can we make 1.10.1 be about meeting our licensing needs to allow

Re: Connection Management in Multi-tenancy Scenario

2018-09-19 Thread Maxime Beauchemin
Another clear solution is for connection management to go through the [upcoming] REST API we've been talking about. Then of course we'll need one permission per connection and a "all_connections" perm that can be added to roles (much like DAGs but for connections). Max On Wed, Sep 19, 2018 at

Re: Database referral integrity

2018-09-18 Thread Maxime Beauchemin
The database migration creating the FK will/would need to have something that either creates dummy missing PKs first, or deletes the orphaned keys to ensure the operation of creating the FK doesn't error out. Seems like adding dummy keys is a better approach. Then you'll have to make sure that

Re: Guidelines on Contrib vs Non-contrib

2018-09-18 Thread Maxime Beauchemin
+1 for deprecating operators/hooks as plugins, let's use Python's good old python packages and maybe python "entry points" if we want to inject them in "airflow.operators"/"airflow.hooks" (which is probably not necessary) On Tue, Sep 18, 2018 at 2:12 AM Ash Berlin-Taylor wrote: > Operators and

Re: Sep Airflow Bay Area Meetup @ Google

2018-09-17 Thread Maxime Beauchemin
: > > >> > > >> We are 3 weeks away from the meetup and still have a few lightning > > talks > > >> open, please take the chance and share your cool ideas/work ;) > > >> Meanwhile, speakers could you please send me and Trishka ( > > t

Re: Duplicate key unique constraint error

2018-09-12 Thread Maxime Beauchemin
Can you share the full python stack trace? On Wed, Sep 12, 2018 at 5:31 PM Abhishek Sinha wrote: > Got the following error on Airflow 1.8.2 version: > > duplicate key value violates unique constraint "task_instance_pkey" > > DETAIL: Key (task_id, dag_id, execution_date)=(PB_BPNZ, master_v2, >

Re: Cold-case PRs

2018-09-10 Thread Maxime Beauchemin
It doesn't deal with Jiras, just PRs and GH issues (which we don't use...) Max On Mon, Sep 10, 2018 at 6:58 PM Sid Anand wrote: > Max, > How do these manage the JIRAs? > > -s > > On Mon, Sep 10, 2018 at 6:14 PM Maxime Beauchemin < > maximebeauche...@gmail.com> w

Re: Cold-case PRs

2018-09-10 Thread Maxime Beauchemin
I've used https://github.com/bstriner/github-bot-close-inactive-issues in the past to auto-close issues / PRs based on a policy around inactivity. It worked alright. There's also https://github.com/probot/stale which seems to be one of the leading solutions, but it may require an Apache INFRA

Re: TriggerDagRunOperator sub tasks are scheduled to run after few hours

2018-09-07 Thread Maxime Beauchemin
Is the issue timezone related? Personally I've only used Airflow in UTC-aligned environments so I can't help much on this topic. Bolke has contributed timezone awareness to the codebase in the past, I'm not sure what the common caveats may be. Max On Fri, Sep 7, 2018 at 4:29 AM Goutam Kumar Sahoo

Re: Missing operators in the docs

2018-08-29 Thread Maxime Beauchemin
e it. > > > > On Wed, Aug 29, 2018 at 8:25 PM Maxime Beauchemin < > > maximebeauche...@gmail.com> wrote: > > > >> Looks like both. > >> > >> On Wed, Aug 29, 2018 at 12:18 PM Kaxil Naik > wrote: > >> > >> > Hi Max,

Re: Missing operators in the docs

2018-08-29 Thread Maxime Beauchemin
Looks like both. On Wed, Aug 29, 2018 at 12:18 PM Kaxil Naik wrote: > Hi Max, > > Did you see that on readthedocs or airflow.apache one? > > On Wed, 29 Aug 2018, 20:15 Maxime Beauchemin, > wrote: > > > Hey committers, > > > > I noticed that some of

Missing operators in the docs

2018-08-29 Thread Maxime Beauchemin
Hey committers, I noticed that some of the operators are missing from the API reference part of the docs (HiveOperator for instance). I'm guessing a committer generated / pushed the docs with some libs missing and that the operators depending on those missing libs got skipped. We may have to

Re: PR Review Dashboard?

2018-08-28 Thread Maxime Beauchemin
t with JIRA for now and if we end up > moving to GH we can move it over. I've got a fork of the dashboard I've > been using in Apache Beam as well as the Spark one so it shouldn't take me > too long to generalize it again. > > > >> On Sun, Aug 26, 2018 at 8:50 PM Maxime Bea

Re: Lazily load input for Airflow operators

2018-08-27 Thread Maxime Beauchemin
This is reasonable, it could be nice to have a generic way to replace operators kwargs with callables. In the meantime you can try this hack deriving an operator inline with your DAG definition. In this hack, the callable receives the operator's context object which is nice, it provides a handle

Re: Airflow 1.10.0 is released on PyPI

2018-08-27 Thread Maxime Beauchemin
Thanks to everyone who contributed to this release, and special thanks to those who worked on packaging and releasing 1.10 ! I took a quick look at the change log and it looks like this release adds *780 commits* on top of `1.9.0` which was released last January. Let's take a moment to appreciate

Re: PR Review Dashboard?

2018-08-26 Thread Maxime Beauchemin
nd doc is >>> here: >>> >>> >>> https://github.com/kubernetes/kubernetes/blob/master/hack/cherry_pick_pull.sh >>> >>> https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md >>> >>> And more gene

Re: Why not mark inactive DAGs in the main scheduler loop?

2018-08-22 Thread Maxime Beauchemin
I'd rather the scheduler delegate that to one of the minions (subprocess) if possible. We should keep everything we can off the main thread. BTW I've been speaking about renaming the scheduler to "supervisor" for a while now. While renaming may be a bit tricky (updating all references in the

Re: [RESULT][VOTE] Release Airflow 1.10.0

2018-08-21 Thread Maxime Beauchemin
I can, what's your PyPI ID? Max On Mon, Aug 20, 2018 at 2:11 PM Driesprong, Fokko wrote: > Thanks Bolke! > > I've just pushed the artifacts to Apache Dist: > > https://dist.apache.org/repos/dist/release/incubator/airflow/1.10.0-incubating/ > > I don't have any access to pypi, this means that

Re: Sep Airflow Bay Area Meetup @ Google

2018-08-12 Thread Maxime Beauchemin
Hey Feng, Sign me up for a session on "Challenges ahead - taking airflow to the next level". I'm planning on recycling the content from the talk @Google next Friday. Max On Fri, Aug 10, 2018 at 3:22 PM Feng Lu wrote: > Hi all, > > We still have 1-2 regular sessions and 4-5 lightning sessions

Re: Replacement ShortCircuitOperator

2018-08-10 Thread Maxime Beauchemin
Hey, I agree, I always thought this would be handled through a BaseOperator flag `only_run_latest=True`. There are some potentially confusing / incompatible scenarios like: * can't have both depend_on_past and only_run_latest (obviously...) * can't have any downstream tasks of only_run_latest

Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Maxime Beauchemin
The change on perf for the DAG table would be extremely negligible. Maybe for task_instances (large table with millions of rows, 3 fields composite key) it *could* be a decent idea. Though you'd then need to have two indexes to store and maintain and we may have to change the code to actually use

Re: Custom authentication with RBAC

2018-08-08 Thread Maxime Beauchemin
You can define your own AirflowSecurityManager based on FAB's SecurityManager http://flask-appbuilder.readthedocs.io/en/latest/security.html docs. We should publish docs on how to do this. Max On Wed, Aug 8, 2018 at 2:31 PM Gabriel Silk wrote: > Hello Airflow devs, > > It seems that it is not

Re: Basic modeling question

2018-08-08 Thread Maxime Beauchemin
There's also the hack of using templating to skip executions. Say for a BashOperator: {% if execution_date.weekday() == 1 %} echo "skipping today" {% else %} ./run_workload.sh {% endif %} On Wed, Aug 8, 2018 at 4:27 PM Gabriel Silk wrote: > Alexis, do you mean you would have done this using an
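
For readers unsure which day the template above skips, here is a plain-Python sketch of the same branch, with no Jinja or Airflow involved. The function name `bash_command_for` is made up for illustration; the only fact relied on is that `datetime.weekday()` returns 0 for Monday, so `1` means Tuesday.

```python
from datetime import datetime


def bash_command_for(execution_date):
    """Mirror the Jinja branch: pick the shell command for a given execution_date."""
    # datetime.weekday() counts from Monday == 0, so 1 is Tuesday.
    if execution_date.weekday() == 1:
        return 'echo "skipping today"'
    return "./run_workload.sh"
```

With this, an execution_date of 2018-08-07 (a Tuesday) yields the echo command and any other weekday yields the real workload.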

Re: The need for LocalTaskJob

2018-08-06 Thread Maxime Beauchemin
vement from where we are now. > > > > > B. > > > > Verstuurd vanaf mijn iPad > > > >> Op 4 aug. 2018 om 19:40 heeft Ash Berlin-Taylor < > ash_airflowl...@firemirror.com> het volgende geschreven: > >> > >> Comments inline.

Re: The need for LocalTaskJob

2018-08-04 Thread Maxime Beauchemin
Let me confirm I'm understanding this right, we're talking specifically about the CeleryExecutor not starting an `airflow run` (not --raw) command, and fire up a LocalTaskJob instead? Then we'd still have the worker fire up the `airflow run --raw` command? Seems reasonable. One thing to keep in

Re: Use 'watch' feature of Github instead of this list?

2018-08-03 Thread Maxime Beauchemin
We have an open issue with Apache Infra about this that you can track here: https://issues.apache.org/jira/browse/INFRA-16854 On Fri, Aug 3, 2018 at 11:29 AM Trent Robbins wrote: > Hi All, > > Is it possible that people who want to see a notification for every issue > can subscribe to

Re: Apache Airflow welcome new committer/PMC member : Feng Tao (a.k.a. feng-tao)

2018-08-03 Thread Maxime Beauchemin
Well deserved, welcome aboard! On Fri, Aug 3, 2018 at 9:07 AM Mark Grover wrote: > Congrats Tao! > > On Fri, Aug 3, 2018, 08:52 Jin Chang wrote: > > > Congrats, Tao!! > > > > On Fri, Aug 3, 2018 at 8:20 AM Taylor Edmiston > > wrote: > > > > > Congratulations, Feng! > > > > > > *Taylor

Re: We've migrated to Github to repo!

2018-07-31 Thread Maxime Beauchemin
started plowing though my mailbox and merged a commit > without squash and merge, but it changes history as you mention. > Nice thing of Github is if you change it, it remembers your preference > which is Squash and Merge :-) > > Love the Gitbox so far, great work! > > Cheers,

Re: We've migrated to Github to repo!

2018-07-31 Thread Maxime Beauchemin
"Squash & Merge" (the default) does the right thing (squashes the multiple commit and replays the resulting commit on top of master), we should use that most of the times. We'd only want to merge if we wanted to preserve history from within the PR (multiple collaborators or multiple important

Re: We've migrated to Github to repo!

2018-07-30 Thread Maxime Beauchemin
We should ask Apache infra to send the GH notifs to another mailing list. Max On Mon, Jul 30, 2018 at 11:35 AM Ash Berlin-Taylor < ash_airflowl...@firemirror.com> wrote: > It appears we also have comments on Github issues being auto-duplicated to > the dev mailing list -- this will increase the

Re: Using large numbers of sensors, resource consumption

2018-07-15 Thread Maxime Beauchemin
There have been conversations in the past around the idea of adding an `evaluation_method` argument in BaseSensor that would allow for different options: 1. the current approach which is taking up a slot and poking periodically (heavy on slot usage) 2. one approach closer to fail/retry approach,

Re: [DISCUSS] AIP - Time for Airflow Improvement Proposals?

2018-07-15 Thread Maxime Beauchemin
+1 On Tue, Jul 10, 2018 at 1:09 PM Sid Anand wrote: > +1 > > On Tue, Jul 10, 2018 at 1:02 PM George Leslie-Waksman > wrote: > > > +1 > > > > On Tue, Jul 10, 2018 at 11:50 AM Jakob Homan wrote: > > > > > Lots of Apache projects use ?IPs - Whatever Improvement Proposal - to > > > document and

Re: Airflow's JS code (and dependencies) manageable via npm and webpack

2018-07-15 Thread Maxime Beauchemin
Glad to see this is happening! Max On Mon, Jul 9, 2018 at 6:37 AM Ash Berlin-Taylor < ash_airflowl...@firemirror.com> wrote: > Great! Thanks for doing this. I've left some review comments on your PR. > > -ash > > > On 9 Jul 2018, at 11:45, Verdan Mahmood > wrote: > > > > ​Hey Guys, ​ > > > >

Re: What information is passed around different components of Airflow?

2018-07-06 Thread Maxime Beauchemin
The MQ (rabbit / redis / ...) gets the `airflow run {dag_id} {task_id} {...}` command to execute, and I think the worker runs it blindly as far as I remember it. It's not ideal as far as security goes since if the MQ is compromised, there's an open vector to the workers. Eventually it would be

Re: Deprecating Run task from Airflow webUI

2018-07-01 Thread Maxime Beauchemin
Few thoughts: * in our environment at Lyft, cleared tasks do get picked up by the scheduler. Is there an issue opened for the bug you are referring to? is that on 1.9.0? * "clearing" in both the web ui and CLI also flips the DagRun state back to running as the intent of clearing is usually to get

Re: Scheduler crashed due to mysql connectivity errors

2018-06-29 Thread Maxime Beauchemin
I'd open an issue with the full stack trace. There should be exception handling wrapping the scheduler loop so I'm curious to see which part isn't handled properly. In the meantime I would highly recommend using something like `runit` to restart the process if it exits for some reason. The fact

Re: conn_id breaking change; once more with feeling

2018-06-29 Thread Maxime Beauchemin
> Breaking changes could be maintained via deprecation warnings for a number of releases to avoid deterring users, whilst pushing towards a cleaner interface. No one wants to go and alter hundreds of DAGs, thousands of operator calls. I know for a fact that the task would be monumental at both

Re: Securing Connections

2018-06-29 Thread Maxime Beauchemin
It certainly sounds doable and similar to the DAG-level access controls in many ways (see the soon-to-be-merged PR). The new `airflow sync_perm` CLI command could ensure the existence of one perm per "conn_id" as well as an "all_conn_id" perm.

Re: Apache Airflow 1.10.0b3

2018-06-28 Thread Maxime Beauchemin
It would be so nice to have a fast test suite. Having to wait for Travis for up to an hour makes many workflows (like working on a release) super painful. I spoke with folks at Astronomer recently about moving all operators and hooks to another Python package that airflow would import. This would

Re: airflow.exceptions.AirflowException dag_id not found

2018-06-11 Thread Maxime Beauchemin
> > > > On Jun 11, 2018, at 3:11 PM, Maxime Beauchemin < > maximebeauche...@gmail.com> wrote: > > > > DagBag import timeouts happen when people do more than just > "configuration > > as code" in their module scope (say doing actual compute in module s

Re: airflow.exceptions.AirflowException dag_id not found

2018-06-11 Thread Maxime Beauchemin
DagBag import timeouts happen when people do more than just "configuration as code" in their module scope (say doing actual compute in module scope, which is a no-no). They may also happen if you read things from flimsy external systems that may introduce delays. Say you read pipeline

Re: Accessing execution_date inside a function

2018-06-08 Thread Maxime Beauchemin
If you look at the source code for the TimeDeltaSensor, it's about 10 lines of code. You can easily derive `BaseSensorOperator` and write your own DynamicTimeDeltaSensor that has its own logic, or one that receives a callable `are_conditions_met(execution_date)` that receives the execution date
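
A rough sketch of the callable-based variant mentioned above. To keep it runnable without Airflow installed, this is a plain class rather than a real subclass of `BaseSensorOperator`; the names `CallableSensorSketch` and `two_hours_after` are hypothetical, and the only thing borrowed from Airflow is the `poke(context)` shape with `context["execution_date"]`.

```python
from datetime import datetime, timedelta


class CallableSensorSketch:
    """Plain-Python stand-in for a sensor; hypothetical, not an Airflow class."""

    def __init__(self, are_conditions_met):
        self.are_conditions_met = are_conditions_met

    def poke(self, context):
        # Hand the run's execution_date to the user-supplied callable.
        return bool(self.are_conditions_met(context["execution_date"]))


def two_hours_after(now):
    """Build a condition that holds once `now` is at least 2h past execution_date."""
    return lambda execution_date: now >= execution_date + timedelta(hours=2)
```

In a real deployment the class would derive from `BaseSensorOperator` and the condition would read the wall clock itself; `now` is passed in here only so the behavior is easy to check.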

Re: Is `airflow backfill` disfunctional?

2018-06-08 Thread Maxime Beauchemin
>> the run will immediately time out (since it still thinks it's been > running > >> since the previous backfill). This will cause tasks to deadlock > spuriously, > >> making backfills extremely cumbersome to carry out. > >> > >> *Mark Whitfield* >

Re: Is `airflow backfill` disfunctional?

2018-06-06 Thread Maxime Beauchemin
e side of > “it’s a better behavior [to have failed tasks re-run when cleared in a > backfill" > >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin < > maximebeauche...@gmail.com>, wrote: > >> @Jeremiah Lowin & @Bolke de Bruin > I > >> think

Airflow-related Talk: Functional Data Engineering, a set of Best Practices

2018-06-05 Thread Maxime Beauchemin
I'm taking the freedom to share my talk from DataEngConf 2018 with this group since it's somewhat related to Airflow. https://www.youtube.com/watch?v=4Spo2QRTz1k Related blog post:

Re: Is `airflow backfill` disfunctional?

2018-06-05 Thread Maxime Beauchemin
anks, > -Tao > > On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin < > maximebeauche...@gmail.com> wrote: > > > So I'm running a backfill for what feels like the first time in years > using > > a simple `airflow backfill --local` commands. > > > > Fi

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Maxime Beauchemin
>> indexing of public instances (as most public instances will be > >> accidentally > >>> public, statistically speaking). If you truly want your Airflow > instance > >>> public and indexed, you should have to go out of your way to permit > that.

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Maxime Beauchemin
What about a clear alert on the UI showing when auth is off? Perhaps a large red triangle-exclamation icon on the navbar with a tooltip "Authentication is off, this Airflow instance is not secure." and clicking it takes you to the docs' security page. Well and then of course people should make sure

Re: Dealing with data latency

2018-06-04 Thread Maxime Beauchemin
The common standard is to have the execution_date aligned with the partition date in the database (say 2018-08-08) and contain data from 2018-08-08T00:00:00.000 to 2018-08-08T23:59:59.999. The partition date and execution_date match and correspond to the left bound of the time interval processed. Then

Re: Enable Travis CI Auto Cancellation?

2018-05-31 Thread Maxime Beauchemin
Good call, I opened https://issues.apache.org/jira/browse/INFRA-16604 since none of the committers have the required rights to do this. Max On Wed, May 30, 2018 at 9:26 PM Craig Rodrigues wrote: > Can someone who has administrator access to Travis CI enable > Auto-Cancellation on branch and

Re: Disable Processing of DAG file

2018-05-30 Thread Maxime Beauchemin
The TLDR of how the processor works is: while True: * sets a multiprocessing queue with N processes (say 32) * main process looks for the list of all .py files in DAGS_FOLDER * fills in the queue with all .py * each one of the 32 subprocesses opens a file and interprets it (it's insulated from the

Re: What are the rules / policies for graduating classes out of airflow.contrib?

2018-05-29 Thread Maxime Beauchemin
* At least one active committer that runs that code in their environment and cares enough and has enough context to review / fix things if need be * Decent code quality * Decent unit test coverage * Decent underlying libraries (no dependencies on unmaintained/unpopular libs) About the wiki I

Re: conn_id breaking change; once more with feeling

2018-05-29 Thread Maxime Beauchemin
The main reason for the conn_id prefix is to facilitate the use of `default_args`. Because of this you can set all your connections at the top of your script and from that point on you just instantiate tasks without re-stating connections. It's common for people to define multiple "operating

Re: Using Airflow with dataset dependant flows (not date)

2018-05-29 Thread Maxime Beauchemin
Hi, Assuming the shape of your DAG is the same across runs, the prescribed way is to go with the DAG with a schedule_interval=None and to create your DAG Runs on demand. You can do so programmatically (using the ORM: airflow.models.DagRun) (cli: airflow trigger_dag) or through REST. If your DAG

Re: Convert Dag Run from Backfill to Scheduled?

2018-05-29 Thread Maxime Beauchemin
various tasks via `task_failed_deps` indicated > the > > tasks had all their dependencies filled. After running the update query, > > they’re all `scheduled__` dag runs. > > > > On May 29, 2018, 5:02 PM -0700, Maxime Beauchemin < > > maximebeauche...@gm

Re: Convert Dag Run from Backfill to Scheduled?

2018-05-29 Thread Maxime Beauchemin
While this may work it's clearly not the prescribed way to do this. Clearing should just work. I'm trying to understand why the scheduler is not picking up the cleared task. Clearing should remove the task instance state and set the state of the related DAG Run to running so that the scheduler

Re: Problem with the scheduler?

2018-05-24 Thread Maxime Beauchemin
Also note that for example when setting up monthly jobs, the job with an execution_date of `2018-02-01` will be triggered soon after the wall clock hits `2018-03-01`, and that your start_date for the tasks in the DAG needs to be prior to that execution_date, not the time at which you're expecting
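
The monthly example above can be checked with a few lines of plain Python. This is a sketch only: `month_interval_end` is a hypothetical helper, not an Airflow function, and it assumes a monthly schedule aligned to the first of the month.

```python
from datetime import datetime


def month_interval_end(execution_date):
    """Return the first day of the month after execution_date.

    For a monthly schedule this is the right bound of the interval, i.e. the
    earliest wall-clock time at which the run for execution_date can trigger.
    """
    year = execution_date.year + (execution_date.month // 12)  # roll over after December
    month = execution_date.month % 12 + 1
    return execution_date.replace(year=year, month=month, day=1)
```

So the run labelled `2018-02-01` covers February and becomes eligible once the clock reaches the value this helper returns for it, `2018-03-01`.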

Is `airflow backfill` disfunctional?

2018-05-24 Thread Maxime Beauchemin
So I'm running a backfill for what feels like the first time in years using a simple `airflow backfill --local` command. First I start getting a ton of `logging.info` of each task that cannot be started just yet at every tick flooding my terminal with the keyword `FAILED` in it, looking like a

Re: Airflow cli to remote host

2018-05-23 Thread Maxime Beauchemin
A quick side note to say that it's common to deploy one or many Airflow sandboxes which are effectively the same configuration as a worker without an actual worker instance working on it. It's similar to the concept of a "gateway node" in Hadoop. Users typically work in user space with a modified

Re: 答复: Airflow REST API proof of concept.

2018-05-21 Thread Maxime Beauchemin
Personally I think we should keep the architecture as simple as possible and use the same web server for REST and UI. As mentioned FAB (Flask App Builder) manages authentication and RBAC, so we can have consistent access rights in the UI and CLI. Max On Fri, May 11, 2018 at 5:42 AM Luke Diment

Re: Dags getting failed after 24 hours

2018-05-21 Thread Maxime Beauchemin
Even though it's possible to set an `execution_timeout` on any task and/or a dagrun_timeout on DAG runs, by default it's all set to None (unless you're somehow setting the DAG's default parameters in some other ways). Maybe you have some OS-level policies on long-running processes in your

Re: Improving Airflow SLAs

2018-05-03 Thread Maxime Beauchemin
About de-coupling the SLA management process, I've had conversations in the direction of renaming the scheduler to "supervisor" to reflect the fact that it's not just scheduling processes, it does a lot more tasks than just that, SLA management being one of them. I still think the default should

Re: Managed Apache Airflow Service on Google Cloud Platform

2018-05-01 Thread Maxime Beauchemin
I'm sure the community agrees when I say that we're happy and honored to have Googlers on board. Congrats on the launch! Max On Tue, May 1, 2018 at 9:58 AM, Feng Lu wrote: > *Hello everyone,I want to let everyone know that today Google Cloud > launched a new managed

Re: How to clear all failed tasks for a DAG with batch ?

2018-04-30 Thread Maxime Beauchemin
x.n.ja...@gmail.com> wrote: > It would be a good webui update to add a multiselect option to clear by > task state. Or maybe clear anything but running/success by default and add > an "include success" option. > On Fri, Apr 27, 2018 at 06:47 Maxime Beauchemin

Re: How to clear all failed tasks for a DAG with batch ?

2018-04-27 Thread Maxime Beauchemin
https://airflow.apache.org/cli.html#clear `airflow clear mydagid --only_failed` You can specify a date range, a task_id regex and other flags as well using this command. Max On Thu, Apr 26, 2018 at 11:04 PM, dong.yajun wrote: > Hi list, > > We run a DAG with about 450

Re: About how to pause the running task

2018-04-26 Thread Maxime Beauchemin
There are no semantics or concept of pause within a task, though you can clear/kill tasks which is essentially sending a poison pill to kill the task subprocess. Even if BaseOperator (common to all tasks) was to implement pausing semantics, they probably wouldn't be implemented for most

Re: About the project support in Airflow

2018-04-24 Thread Maxime Beauchemin
even use > airflow to manage your ci/cd pipeline. > > B. > > Sent from my iPhone > > > On 24 Apr 2018, at 18:33, Maxime Beauchemin <maximebeauche...@gmail.com> > wrote: > > > > People have been talking about namespacing DAGs in the past. I'd > reco

Re: About the project support in Airflow

2018-04-24 Thread Maxime Beauchemin
People have been talking about namespacing DAGs in the past. I'd recommend using tags (many to many) instead of categories/projects (one to many). It should be fairly easy to add this feature. One question is whether tags are defined as code or in the UI/db only. Max On Tue, Apr 24, 2018 at

Re: Benchmarking of Airflow Scheduler with Celery Executor

2018-04-13 Thread Maxime Beauchemin
If you're concerned about scheduler scalability I'd go with a bigger box. The scheduler uses multiprocessing so more CPU power means more throughput. Also you may want to provision a beefy MySQL box to make sure that doesn't become the bottleneck. 10k tasks heartbeating to the DB every 30 seconds

Give it up for Fokko!

2018-04-13 Thread Maxime Beauchemin
Hey all, I wanted to point out the amazing work that Fokko is doing, reviewing/merging PRs and doing fantastic committer & maintainer work. It takes a variety of contributions to make projects like Airflow thrive, but without this kind of involvement it wouldn't be possible to keep shipping

Re: Slides / Presentations for Airflow

2018-04-12 Thread Maxime Beauchemin
I have some slides but they're very Airbnb-styled, I need to make a new deck... Max On Thu, Apr 12, 2018 at 9:41 AM, Chris Riccomini wrote: > > I suspect we will be hearing from Joy, Chris R., et al shortly. > > Correct. Video editing is going on as we speak. :) > > On

Re: slow scheduler

2018-04-04 Thread Maxime Beauchemin
As a batch scheduler Airflow doesn't currently guarantee super low latency. The aim for the project has been to make it possible to do sub-minute latency at scale, but it's common for this to go up to a few minutes in larger environments. I'd recommend making it clear to your users what your Airflow

Re: DAG Level permissions (was Re: RBAC Update)

2018-03-30 Thread Maxime Beauchemin
ssions table : >> 1) One toggle on the user level for broad access (ALL:ALL, RUN/CLEAR:ALL, >> VIEW:ALL) default NULL >> 2) More granular permissions at the DAG level. >> >> So in order, check the user's broad level permission first, then DAG >> level. For lar

Re: What are the advantages of plugins, not sure I see any?

2018-03-30 Thread Maxime Beauchemin
seem good > for redistribution, but if you're only working with operators and hooks and > aren't sharing that code then it might not make too much sense to use them. > > On Fri, Mar 30, 2018 at 4:23 PM Maxime Beauchemin < > maximebeauche...@gmail.com> wrote: > > > The orig

Re: What are the advantages of plugins, not sure I see any?

2018-03-30 Thread Maxime Beauchemin
The original intent was to use plugins as a way to share sets of objects and applications built on top of Airflow. For instance it'd be possible to ship the things listed below as Airflow plugins: * "validate-and-schedule my query" UI * a set of ML-related hooks and operators that match a

Re: RBAC Update

2018-03-29 Thread Maxime Beauchemin
> PyPi access :D I'll make the change to have Airflow Webserver's FAB > > >> dependency pointing to my fork for the mean time. > > >> > > >> For folks who are interested in RBAC, I will be giving a talk/demo at > > the Airflow > > >>

Awesome list of resources around Apache Airflow

2018-03-28 Thread Maxime Beauchemin
I was pleasantly surprised to stumble upon this recently: https://github.com/jghoman/awesome-apache-airflow Please contribute anything that you think is missing from the list! Max
