I agree with Daniel's rationale but I am also worried about backwards compatibility as this would perhaps be the most disruptive breaking change possible. I think maybe we should write down the different options available to us (AIP?) and call for a vote. What does everyone think?
On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcode...@gmail.com> wrote: > Can't execution date can already mean different things depending on if the > dag run was initiated via the scheduler or manually via command line/API? > I agree that making it consistent might make it easier to explain to new > users, but should we exchange that for breaking pretty much every existing > dag by re-defining what execution date is? > -James > > On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com> > wrote: > > > > > > > To Daniel’s concerns, I would argue this is not a change to what a dag > > run > > > is, it is rather a change to WHEN that dag run will be scheduled. > > > > > > Execution date is part of the definition of a dag_run; it is uniquely > > identified by an execution_date and dag_id. > > > > When someone asks what is a dag_run, we should be able to provide an > > answer. > > > > Imagine trying to explain what a dag run is, when execution_date can mean > > different things. > > Admin: "A dag run is an execution_date and a dag_id". > > New user: "Ok. Clear as a bell. What's an execution_date?" > > Admin: "Well, it can be one of two things. It *could* be when the > dag > > will be run... but it could *also* be 'the time when dag should be run > > minus one schedule interval". It depends on whether you choose 'end' or > > 'start' for 'schedule_interval_edge.' If you choose 'start' then > > execution_date means 'when dag will be run'. If you choose 'end' then > > execution_date means 'when dag will be run minus one interval.' If you > > change the parameter after some time, then we don't necessarily know what > > it means at all times". > > > > Why would we do this to ourselves? > > > > Alternatively, we can give dag_run a clear, unambiguous meaning: > > * dag_run is dag_id + execution_date > > * execution_date is when dag will be run (notwithstanding scheduler > delay, > > queuing) > > > > > > > > > > > > > > > > > > > > > > Execution_date is defined as "run-at date minus 1 interval". The > > assumption in this is that you tasks care about this particular date. > > Obviously this makes sense for some tasks but not for others. > > > > I would prop > > > > > > > > > > On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> wrote: > > > > > I think this is a great improvement and should be merged. To Daniel’s > > > concerns, I would argue this is not a change to what a dag run is, it > is > > > rather a change to WHEN that dag run will be scheduled. > > > I had implemented a similar change in my own version but ultimately > > backed > > > so I didn’t have to patch after each new release. In my opinion the > main > > > flaw in the current scheduler, and I have brought this up before, is > when > > > you don’t have a consistent schedule interval (e.g. only run M-F). > After > > > backing out the “schedule at interval start” I had to switch to a daily > > > schedule and go through and put a short circuit operator in each of my > > M-F > > > dags to get the behavior that I wanted. This results in putting > > scheduling > > > logic inside the dag, when scheduling logic should be in the scheduler. > > > > > > -James > > > > > > > > > > On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com> > > > wrote: > > > > > > > > Re > > > > > > > >> What are people's feelings on changing the default execution to > > schedule > > > >> interval start > > > > > > > > and > > > > > > > >> I'm in favor of doing that, but then exposing new variables of > > > >> "interval_start" and "interval_end", etc. so that people write > > > >> clearer-looking at-a-glance DAGs > > > > > > > > > > > > While I am def on board with the spirit of this PR, I would vote we > do > > > not > > > > accept this PR as is, because it cements a confusing option. > > > > > > > > *What is the right representation of a dag run?* > > > > > > > > Right now the representation is "dag run-at date minus 1 interval". > It > > > > should just be "dag run-at date". > > > > > > > > We don't need to address the question of whether execution date is > the > > > > start or the end of an interval; it doesn't matter. > > > > > > > > In all cases, a given dag run will be targeted for *some* initial > > "run-at > > > > time"; so *that* should be the time that is part of the PK of a dag > > run, > > > > and *that *is the time that should be exposed as the dag run > "execution > > > > date" > > > > > > > > *Interval of interest is not a dag_run attribute* > > > > > > > > We also mix in this question of the date interval that the *tasks* > are > > > > interested in. But the *dag run* need not concern itself with this > in > > > any > > > > way. That is for the tasks to figure out: if they happen to need > "dag > > > > run-at date," then they can reference that; if they want the prior > one, > > > ask > > > > for the prior one. > > > > > > > > Previously, I was in the camp that thought it was a great idea to > > rename > > > > "execution_date" to "period_start" or "interval_start". But I now > > think > > > > this is folly. It invokes this question of the "interval of > interest" > > or > > > > "period of interest". But the dag doesn't need to know anything > about > > > > that. > > > > > > > > Within the same dag you may have tasks with different intervals of > > > > interest. So why make assumptions in the dag; just give the facts: > > this > > > is > > > > my run date; this is the prior run date, etc. It would be a > regression > > > > from the perspective of providing accurate names. > > > > > > > > *Proposal* > > > > > > > > So, I would propose we change "execution_date" to mean "dag run-at > > date" > > > as > > > > opposed to "dag run-at date minus 1". But we should do so without > > > > reference to interval end or interval start. > > > > > > > > *Configurability* > > > > > > > > The more configuration options we have, the more noise there is as a > > user > > > > trying to understand how to use airflow, so I'd rather us not make > this > > > > configurable at all. > > > > > > > > That said, perhaps a more clear and more explicit means making this > > > > configurable would be to define an integer param > > > > "dag_run_execution_date_interval_offset", which would control how > many > > > > intervals back from actual "dag run-at date" the "execution date" > > should > > > > be. (current behavior = 1, new behavior = 0). > > > > > > > > *Side note* > > > > > > > > Hopefully not to derail discussion: I think there are additional, > > related > > > > task attributes that may want to come into being: namely, > low_watermark > > > and > > > > high_watermark. There is the potential, with attributes like this, > for > > > > adding better out-of-the-box support for common data workflows that > we > > > now > > > > need to use xcom for, namely incremental loads. But I want to give > it > > > more > > > > thought before proposing anything specific. > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk < > jarek.pot...@polidea.com > > > > > > > wrote: > > > > > > > >> Good one Damian. I will have a list of issues that can be possible > to > > > >> handle at the workshop, so that one goes there. > > > >> > > > >> J. > > > >> > > > >> Principal Software Engineer > > > >> Phone: +48660796129 > > > >> > > > >> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. < > > > >> damian.sha...@credit-suisse.com> napisał: > > > >> > > > >>> I can't understate what a conceptual improvement this would be for > > the > > > >> end > > > >>> users of Airflow in our environment. I've written a lot of code so > > all > > > >> our > > > >>> configuration works like this anyway. But the UI still shows the > > > Airflow > > > >>> dates which still to this day sometimes confuse me. > > > >>> > > > >>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my > > > first > > > >>> PRs could be additional test cases around edge cases to do with DST > > and > > > >>> cron scheduling that I have concerns about :) > > > >>> > > > >>> Damian > > > >>> > > > >>> -----Original Message----- > > > >>> From: Ash Berlin-Taylor [mailto:a...@apache.org] > > > >>> Sent: Friday, August 23, 2019 6:50 AM > > > >>> To: dev@airflow.apache.org > > > >>> Subject: Setting to add choice of schedule at end or schedule at > > start > > > of > > > >>> interval > > > >>> > > > >>> This has come up a few times before, someone has now opened a PR > that > > > >>> makes this a global+per-dag setting: > > > >>> https://github.com/apache/airflow/pull/5787 and it also includes > > docs > > > >>> that I think does a good job of illustrating the two modes. > > > >>> > > > >>> Does anyone object to this being merged? If no one says anything by > > > >> midday > > > >>> on Tuesday I will take that as assent and will merge it. > > > >>> > > > >>> The docs from the PR included below. > > > >>> > > > >>> Thanks, > > > >>> Ash > > > >>> > > > >>> Scheduled Time vs Execution Time > > > >>> '''''''''''''''''''''''''''''''' > > > >>> > > > >>> A DAG with a ``schedule_interval`` will execute once per interval. > By > > > >>> default, the execution of a DAG will occur at the **end** of the > > > >>> schedule interval. > > > >>> > > > >>> A few examples: > > > >>> > > > >>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that > > > processes > > > >>> 2019-08-16 17:00 will start running just after 2019-08-16 17:59:59, > > > >>> i.e. once that hour is over. > > > >>> - A DAG with ``schedule_interval='@daily'``: The DAG run that > > processes > > > >>> 2019-08-16 will start running shortly after 2019-08-17 00:00. > > > >>> > > > >>> The reasoning behind this execution vs scheduling behaviour is that > > > >>> data for the interval to be processed won't be fully available > until > > > >>> the interval has elapsed. > > > >>> > > > >>> In cases where you wish the DAG to be executed at the **start** of > > the > > > >>> interval, specify ``schedule_at_interval_end=False``, either in > > > >>> ``airflow.cfg``, or on a per-DAG basis. > > > >>> > > > >>> > > > >>> > > > >>> > > > >> > > > > > > =============================================================================== > > > >>> > > > >>> Please access the attached hyperlink for an important electronic > > > >>> communications disclaimer: > > > >>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html > > > >>> > > > >> > > > > > > =============================================================================== > > > >>> > > > >>> > > > >> > > > > > >