Can't execution date already mean different things depending on whether the dag run was initiated by the scheduler or manually via the command line/API? I agree that making it consistent might make it easier to explain to new users, but should we exchange that for breaking pretty much every existing dag by redefining what execution date is?

-James

On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com> wrote:

> > To Daniel’s concerns, I would argue this is not a change to what a dag run
> > is, it is rather a change to WHEN that dag run will be scheduled.
>
> Execution date is part of the definition of a dag_run; it is uniquely
> identified by an execution_date and dag_id.
>
> When someone asks what is a dag_run, we should be able to provide an answer.
>
> Imagine trying to explain what a dag run is, when execution_date can mean
> different things.
> Admin: "A dag run is an execution_date and a dag_id".
> New user: "Ok. Clear as a bell. What's an execution_date?"
> Admin: "Well, it can be one of two things. It *could* be when the dag
> will be run... but it could *also* be 'the time when dag should be run
> minus one schedule interval". It depends on whether you choose 'end' or
> 'start' for 'schedule_interval_edge.' If you choose 'start' then
> execution_date means 'when dag will be run'. If you choose 'end' then
> execution_date means 'when dag will be run minus one interval.' If you
> change the parameter after some time, then we don't necessarily know what
> it means at all times".
>
> Why would we do this to ourselves?
>
> Alternatively, we can give dag_run a clear, unambiguous meaning:
> * dag_run is dag_id + execution_date
> * execution_date is when dag will be run (notwithstanding scheduler delay,
>   queuing)
>
> Execution_date is defined as "run-at date minus 1 interval". The
> assumption in this is that you tasks care about this particular date.
> Obviously this makes sense for some tasks but not for others.
>
> I would prop
>
> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> wrote:
>
> > I think this is a great improvement and should be merged. To Daniel’s
> > concerns, I would argue this is not a change to what a dag run is, it is
> > rather a change to WHEN that dag run will be scheduled.
> > I had implemented a similar change in my own version but ultimately backed
> > so I didn’t have to patch after each new release. In my opinion the main
> > flaw in the current scheduler, and I have brought this up before, is when
> > you don’t have a consistent schedule interval (e.g. only run M-F). After
> > backing out the “schedule at interval start” I had to switch to a daily
> > schedule and go through and put a short circuit operator in each of my M-F
> > dags to get the behavior that I wanted. This results in putting scheduling
> > logic inside the dag, when scheduling logic should be in the scheduler.
> >
> > -James
> >
> > > On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com> wrote:
> > >
> > > Re
> > >
> > >> What are people's feelings on changing the default execution to schedule
> > >> interval start
> > >
> > > and
> > >
> > >> I'm in favor of doing that, but then exposing new variables of
> > >> "interval_start" and "interval_end", etc. so that people write
> > >> clearer-looking at-a-glance DAGs
> > >
> > > While I am def on board with the spirit of this PR, I would vote we do not
> > > accept this PR as is, because it cements a confusing option.
> > >
> > > *What is the right representation of a dag run?*
> > >
> > > Right now the representation is "dag run-at date minus 1 interval". It
> > > should just be "dag run-at date".
> > >
> > > We don't need to address the question of whether execution date is the
> > > start or the end of an interval; it doesn't matter.
> > >
> > > In all cases, a given dag run will be targeted for *some* initial "run-at
> > > time"; so *that* should be the time that is part of the PK of a dag run,
> > > and *that* is the time that should be exposed as the dag run "execution
> > > date"
> > >
> > > *Interval of interest is not a dag_run attribute*
> > >
> > > We also mix in this question of the date interval that the *tasks* are
> > > interested in.
> > > But the *dag run* need not concern itself with this in any
> > > way. That is for the tasks to figure out: if they happen to need "dag
> > > run-at date," then they can reference that; if they want the prior one,
> > > ask for the prior one.
> > >
> > > Previously, I was in the camp that thought it was a great idea to rename
> > > "execution_date" to "period_start" or "interval_start". But I now think
> > > this is folly. It invokes this question of the "interval of interest" or
> > > "period of interest". But the dag doesn't need to know anything about
> > > that.
> > >
> > > Within the same dag you may have tasks with different intervals of
> > > interest. So why make assumptions in the dag; just give the facts: this is
> > > my run date; this is the prior run date, etc. It would be a regression
> > > from the perspective of providing accurate names.
> > >
> > > *Proposal*
> > >
> > > So, I would propose we change "execution_date" to mean "dag run-at date" as
> > > opposed to "dag run-at date minus 1". But we should do so without
> > > reference to interval end or interval start.
> > >
> > > *Configurability*
> > >
> > > The more configuration options we have, the more noise there is as a user
> > > trying to understand how to use airflow, so I'd rather us not make this
> > > configurable at all.
> > >
> > > That said, perhaps a more clear and more explicit means making this
> > > configurable would be to define an integer param
> > > "dag_run_execution_date_interval_offset", which would control how many
> > > intervals back from actual "dag run-at date" the "execution date" should
> > > be. (current behavior = 1, new behavior = 0).
> > >
> > > *Side note*
> > >
> > > Hopefully not to derail discussion: I think there are additional, related
> > > task attributes that may want to come into being: namely, low_watermark and
> > > high_watermark.
> > > There is the potential, with attributes like this, for
> > > adding better out-of-the-box support for common data workflows that we now
> > > need to use xcom for, namely incremental loads. But I want to give it more
> > > thought before proposing anything specific.
> > >
> > > On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > > wrote:
> > >
> > >> Good one Damian. I will have a list of issues that can be possible to
> > >> handle at the workshop, so that one goes there.
> > >>
> > >> J.
> > >>
> > >> Principal Software Engineer
> > >> Phone: +48660796129
> > >>
> > >> On Fri, 23 Aug 2019, 11:09, Shaw, Damian P. <
> > >> damian.sha...@credit-suisse.com> wrote:
> > >>
> > >>> I can't overstate what a conceptual improvement this would be for the end
> > >>> users of Airflow in our environment. I've written a lot of code so all our
> > >>> configuration works like this anyway. But the UI still shows the Airflow
> > >>> dates which still to this day sometimes confuse me.
> > >>>
> > >>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my first
> > >>> PRs could be additional test cases around edge cases to do with DST and
> > >>> cron scheduling that I have concerns about :)
> > >>>
> > >>> Damian
> > >>>
> > >>> -----Original Message-----
> > >>> From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > >>> Sent: Friday, August 23, 2019 6:50 AM
> > >>> To: dev@airflow.apache.org
> > >>> Subject: Setting to add choice of schedule at end or schedule at start of
> > >>> interval
> > >>>
> > >>> This has come up a few times before, someone has now opened a PR that
> > >>> makes this a global+per-dag setting:
> > >>> https://github.com/apache/airflow/pull/5787 and it also includes docs
> > >>> that I think does a good job of illustrating the two modes.
> > >>>
> > >>> Does anyone object to this being merged?
> > >>> If no one says anything by midday
> > >>> on Tuesday I will take that as assent and will merge it.
> > >>>
> > >>> The docs from the PR included below.
> > >>>
> > >>> Thanks,
> > >>> Ash
> > >>>
> > >>> Scheduled Time vs Execution Time
> > >>> ''''''''''''''''''''''''''''''''
> > >>>
> > >>> A DAG with a ``schedule_interval`` will execute once per interval. By
> > >>> default, the execution of a DAG will occur at the **end** of the
> > >>> schedule interval.
> > >>>
> > >>> A few examples:
> > >>>
> > >>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes
> > >>>   2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
> > >>>   i.e. once that hour is over.
> > >>> - A DAG with ``schedule_interval='@daily'``: The DAG run that processes
> > >>>   2019-08-16 will start running shortly after 2019-08-17 00:00.
> > >>>
> > >>> The reasoning behind this execution vs scheduling behaviour is that
> > >>> data for the interval to be processed won't be fully available until
> > >>> the interval has elapsed.
> > >>>
> > >>> In cases where you wish the DAG to be executed at the **start** of the
> > >>> interval, specify ``schedule_at_interval_end=False``, either in
> > >>> ``airflow.cfg``, or on a per-DAG basis.
> > >>>
> > >>> ===============================================================================
> > >>>
> > >>> Please access the attached hyperlink for an important electronic
> > >>> communications disclaimer:
> > >>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> > >>>
> > >>> ===============================================================================
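
To make the two readings of execution date discussed in this thread concrete, here is a minimal sketch in plain Python. It is not Airflow's scheduler code: the helper name `run_at` and its `offset` argument are hypothetical, and fixed `timedelta` intervals stand in for cron schedules. `offset=1` mirrors the current behaviour (the run starts once the interval has elapsed), and `offset=0` mirrors running at the start of the interval, roughly what `schedule_at_interval_end=False` or an interval offset of 0 would mean in the proposals quoted above.

```python
from datetime import datetime, timedelta


def run_at(execution_date: datetime, interval: timedelta, offset: int = 1) -> datetime:
    """Return the earliest moment a run labelled `execution_date` may start.

    Hypothetical helper, for illustration only.
    offset=1: run after the interval has elapsed (current behaviour).
    offset=0: run at the start of the interval.
    """
    return execution_date + offset * interval


hourly = timedelta(hours=1)
daily = timedelta(days=1)

# The hourly run that processes 2019-08-16 17:00 starts once that hour is over.
print(run_at(datetime(2019, 8, 16, 17), hourly))       # 2019-08-16 18:00:00

# The daily run that processes 2019-08-16 starts at 2019-08-17 00:00.
print(run_at(datetime(2019, 8, 16), daily))            # 2019-08-17 00:00:00

# With offset 0, execution date and run-at time coincide.
print(run_at(datetime(2019, 8, 16), daily, offset=0))  # 2019-08-16 00:00:00
```

The sketch deliberately ignores scheduler delay, queuing, manually triggered runs, and irregular schedules such as M-F, which are the cases where the two readings are hardest to reconcile.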