I think this is a great improvement and should be merged. To Daniel’s concerns, I would argue this is not a change to what a dag run is, it is rather a change to WHEN that dag run will be scheduled. I had implemented a similar change in my own version but ultimately backed so I didn’t have to patch after each new release. In my opinion the main flaw in the current scheduler, and I have brought this up before, is when you don’t have a consistent schedule interval (e.g. only run M-F). After backing out the “schedule at interval start” I had to switch to a daily schedule and go through and put a short circuit operator in each of my M-F dags to get the behavior that I wanted. This results in putting scheduling logic inside the dag, when scheduling logic should be in the scheduler.
-James > On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com> wrote: > > Re > >> What are people's feelings on changing the default execution to schedule >> interval start > > and > >> I'm in favor of doing that, but then exposing new variables of >> "interval_start" and "interval_end", etc. so that people write >> clearer-looking at-a-glance DAGs > > > While I am def on board with the spirit of this PR, I would vote we do not > accept this PR as is, because it cements a confusing option. > > *What is the right representation of a dag run?* > > Right now the representation is "dag run-at date minus 1 interval". It > should just be "dag run-at date". > > We don't need to address the question of whether execution date is the > start or the end of an interval; it doesn't matter. > > In all cases, a given dag run will be targeted for *some* initial "run-at > time"; so *that* should be the time that is part of the PK of a dag run, > and *that *is the time that should be exposed as the dag run "execution > date" > > *Interval of interest is not a dag_run attribute* > > We also mix in this question of the date interval that the *tasks* are > interested in. But the *dag run* need not concern itself with this in any > way. That is for the tasks to figure out: if they happen to need "dag > run-at date," then they can reference that; if they want the prior one, ask > for the prior one. > > Previously, I was in the camp that thought it was a great idea to rename > "execution_date" to "period_start" or "interval_start". But I now think > this is folly. It invokes this question of the "interval of interest" or > "period of interest". But the dag doesn't need to know anything about > that. > > Within the same dag you may have tasks with different intervals of > interest. So why make assumptions in the dag; just give the facts: this is > my run date; this is the prior run date, etc. It would be a regression > from the perspective of providing accurate names. > > *Proposal* > > So, I would propose we change "execution_date" to mean "dag run-at date" as > opposed to "dag run-at date minus 1". But we should do so without > reference to interval end or interval start. > > *Configurability* > > The more configuration options we have, the more noise there is as a user > trying to understand how to use airflow, so I'd rather us not make this > configurable at all. > > That said, perhaps a more clear and more explicit means making this > configurable would be to define an integer param > "dag_run_execution_date_interval_offset", which would control how many > intervals back from actual "dag run-at date" the "execution date" should > be. (current behavior = 1, new behavior = 0). > > *Side note* > > Hopefully not to derail discussion: I think there are additional, related > task attributes that may want to come into being: namely, low_watermark and > high_watermark. There is the potential, with attributes like this, for > adding better out-of-the-box support for common data workflows that we now > need to use xcom for, namely incremental loads. But I want to give it more > thought before proposing anything specific. > > > > > > > On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <jarek.pot...@polidea.com> > wrote: > >> Good one Damian. I will have a list of issues that can be possible to >> handle at the workshop, so that one goes there. >> >> J. >> >> Principal Software Engineer >> Phone: +48660796129 >> >> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. < >> damian.sha...@credit-suisse.com> napisał: >> >>> I can't understate what a conceptual improvement this would be for the >> end >>> users of Airflow in our environment. I've written a lot of code so all >> our >>> configuration works like this anyway. But the UI still shows the Airflow >>> dates which still to this day sometimes confuse me. >>> >>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my first >>> PRs could be additional test cases around edge cases to do with DST and >>> cron scheduling that I have concerns about :) >>> >>> Damian >>> >>> -----Original Message----- >>> From: Ash Berlin-Taylor [mailto:a...@apache.org] >>> Sent: Friday, August 23, 2019 6:50 AM >>> To: dev@airflow.apache.org >>> Subject: Setting to add choice of schedule at end or schedule at start of >>> interval >>> >>> This has come up a few times before, someone has now opened a PR that >>> makes this a global+per-dag setting: >>> https://github.com/apache/airflow/pull/5787 and it also includes docs >>> that I think does a good job of illustrating the two modes. >>> >>> Does anyone object to this being merged? If no one says anything by >> midday >>> on Tuesday I will take that as assent and will merge it. >>> >>> The docs from the PR included below. >>> >>> Thanks, >>> Ash >>> >>> Scheduled Time vs Execution Time >>> '''''''''''''''''''''''''''''''' >>> >>> A DAG with a ``schedule_interval`` will execute once per interval. By >>> default, the execution of a DAG will occur at the **end** of the >>> schedule interval. >>> >>> A few examples: >>> >>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes >>> 2019-08-16 17:00 will start running just after 2019-08-16 17:59:59, >>> i.e. once that hour is over. >>> - A DAG with ``schedule_interval='@daily'``: The DAG run that processes >>> 2019-08-16 will start running shortly after 2019-08-17 00:00. >>> >>> The reasoning behind this execution vs scheduling behaviour is that >>> data for the interval to be processed won't be fully available until >>> the interval has elapsed. >>> >>> In cases where you wish the DAG to be executed at the **start** of the >>> interval, specify ``schedule_at_interval_end=False``, either in >>> ``airflow.cfg``, or on a per-DAG basis. >>> >>> >>> >>> >> =============================================================================== >>> >>> Please access the attached hyperlink for an important electronic >>> communications disclaimer: >>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html >>> >> =============================================================================== >>> >>> >>