Re

> What are people's feelings on changing the default execution to schedule
> interval start

and

> I'm in favor of doing that, but then exposing new variables of
> "interval_start" and "interval_end", etc. so that people write
> clearer-looking at-a-glance DAGs

While I am def on board with the spirit of this PR, I would vote we do not accept this PR as is, because it cements a confusing option.

*What is the right representation of a dag run?*

Right now the representation is "dag run-at date minus 1 interval". It should just be "dag run-at date".

We don't need to address the question of whether execution date is the start or the end of an interval; it doesn't matter.

In all cases, a given dag run will be targeted for *some* initial "run-at time"; so *that* should be the time that is part of the PK of a dag run, and *that* is the time that should be exposed as the dag run "execution date".

*Interval of interest is not a dag_run attribute*

We also mix in this question of the date interval that the *tasks* are interested in. But the *dag run* need not concern itself with this in any way. That is for the tasks to figure out: if they happen to need "dag run-at date," then they can reference that; if they want the prior one, ask for the prior one.

Previously, I was in the camp that thought it was a great idea to rename "execution_date" to "period_start" or "interval_start". But I now think this is folly. It invokes this question of the "interval of interest" or "period of interest". But the dag doesn't need to know anything about that. Within the same dag you may have tasks with different intervals of interest. So why make assumptions in the dag; just give the facts: this is my run date; this is the prior run date, etc. It would be a regression from the perspective of providing accurate names.

*Proposal*

So, I would propose we change "execution_date" to mean "dag run-at date" as opposed to "dag run-at date minus 1". But we should do so without reference to interval end or interval start.
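To make the two representations concrete, here is a rough sketch in plain Python. It is not Airflow code: the function name `execution_date_for` and its `intervals_back` argument are made up for illustration, and it assumes a fixed `timedelta` interval rather than a cron schedule.

```python
from datetime import datetime, timedelta


def execution_date_for(run_at: datetime, interval: timedelta, intervals_back: int) -> datetime:
    """Return the date a dag run would be labelled with (hypothetical helper).

    intervals_back=1 mimics the current representation
    ("dag run-at date minus 1 interval");
    intervals_back=0 is the proposed one ("dag run-at date").
    """
    return run_at - intervals_back * interval


# A daily dag whose run is actually triggered at midnight on 2019-08-17.
run_at = datetime(2019, 8, 17, 0, 0)
daily = timedelta(days=1)

current = execution_date_for(run_at, daily, intervals_back=1)
proposed = execution_date_for(run_at, daily, intervals_back=0)

print(current)   # 2019-08-16 00:00:00
print(proposed)  # 2019-08-17 00:00:00
```

Under the proposed representation the label is simply the trigger time, and a task that cares about the previous day's data would subtract the interval itself.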
*Configurability*

The more configuration options we have, the more noise there is for a user trying to understand how to use airflow, so I'd rather we not make this configurable at all. That said, perhaps a clearer and more explicit means of making this configurable would be to define an integer param "dag_run_execution_date_interval_offset", which would control how many intervals back from the actual "dag run-at date" the "execution date" should be (current behavior = 1, new behavior = 0).

*Side note*

Hopefully not to derail discussion: I think there are additional, related task attributes that may want to come into being: namely, low_watermark and high_watermark. There is the potential, with attributes like this, for adding better out-of-the-box support for common data workflows that we now need to use xcom for, namely incremental loads. But I want to give it more thought before proposing anything specific.

On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Good one Damian. I will have a list of issues that can be possible to
> handle at the workshop, so that one goes there.
>
> J.
>
> Principal Software Engineer
> Phone: +48660796129
>
> On Fri, 23 Aug 2019 at 11:09, Shaw, Damian P. <
> damian.sha...@credit-suisse.com> wrote:
>
> > I can't understate what a conceptual improvement this would be for the end
> > users of Airflow in our environment. I've written a lot of code so all our
> > configuration works like this anyway. But the UI still shows the Airflow
> > dates which still to this day sometimes confuse me.
> >
> > I'll be at the NY meet ups on Monday and Tuesday, maybe some of my first
> > PRs could be additional test cases around edge cases to do with DST and
> > cron scheduling that I have concerns about :)
> >
> > Damian
> >
> > -----Original Message-----
> > From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > Sent: Friday, August 23, 2019 6:50 AM
> > To: dev@airflow.apache.org
> > Subject: Setting to add choice of schedule at end or schedule at start of
> > interval
> >
> > This has come up a few times before, someone has now opened a PR that
> > makes this a global+per-dag setting:
> > https://github.com/apache/airflow/pull/5787 and it also includes docs
> > that I think does a good job of illustrating the two modes.
> >
> > Does anyone object to this being merged? If no one says anything by midday
> > on Tuesday I will take that as assent and will merge it.
> >
> > The docs from the PR included below.
> >
> > Thanks,
> > Ash
> >
> > Scheduled Time vs Execution Time
> > ''''''''''''''''''''''''''''''''
> >
> > A DAG with a ``schedule_interval`` will execute once per interval. By
> > default, the execution of a DAG will occur at the **end** of the
> > schedule interval.
> >
> > A few examples:
> >
> > - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes
> >   2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
> >   i.e. once that hour is over.
> > - A DAG with ``schedule_interval='@daily'``: The DAG run that processes
> >   2019-08-16 will start running shortly after 2019-08-17 00:00.
> >
> > The reasoning behind this execution vs scheduling behaviour is that
> > data for the interval to be processed won't be fully available until
> > the interval has elapsed.
> >
> > In cases where you wish the DAG to be executed at the **start** of the
> > interval, specify ``schedule_at_interval_end=False``, either in
> > ``airflow.cfg``, or on a per-DAG basis.