Execution date is execution date for a dag run no matter what. There is no end interval or start interval for a dag run. The only time this is relevant is when we calculate the next or previous dagrun.
So I don't Daniels rationale makes sense (?) Sent from my iPhone > On 27 Aug 2019, at 17:40, Philippe Gagnon <philgagn...@gmail.com> wrote: > > I agree with Daniel's rationale but I am also worried about backwards > compatibility as this would perhaps be the most disruptive breaking change > possible. I think maybe we should write down the different options > available to us (AIP?) and call for a vote. What does everyone think? > >> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcode...@gmail.com> wrote: >> >> Can't execution date can already mean different things depending on if the >> dag run was initiated via the scheduler or manually via command line/API? >> I agree that making it consistent might make it easier to explain to new >> users, but should we exchange that for breaking pretty much every existing >> dag by re-defining what execution date is? >> -James >> >> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com> >> wrote: >> >>>> >>>> To Daniel’s concerns, I would argue this is not a change to what a dag >>> run >>>> is, it is rather a change to WHEN that dag run will be scheduled. >>> >>> >>> Execution date is part of the definition of a dag_run; it is uniquely >>> identified by an execution_date and dag_id. >>> >>> When someone asks what is a dag_run, we should be able to provide an >>> answer. >>> >>> Imagine trying to explain what a dag run is, when execution_date can mean >>> different things. >>> Admin: "A dag run is an execution_date and a dag_id". >>> New user: "Ok. Clear as a bell. What's an execution_date?" >>> Admin: "Well, it can be one of two things. It *could* be when the >> dag >>> will be run... but it could *also* be 'the time when dag should be run >>> minus one schedule interval". It depends on whether you choose 'end' or >>> 'start' for 'schedule_interval_edge.' If you choose 'start' then >>> execution_date means 'when dag will be run'. If you choose 'end' then >>> execution_date means 'when dag will be run minus one interval.' If you >>> change the parameter after some time, then we don't necessarily know what >>> it means at all times". >>> >>> Why would we do this to ourselves? >>> >>> Alternatively, we can give dag_run a clear, unambiguous meaning: >>> * dag_run is dag_id + execution_date >>> * execution_date is when dag will be run (notwithstanding scheduler >> delay, >>> queuing) >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> Execution_date is defined as "run-at date minus 1 interval". The >>> assumption in this is that you tasks care about this particular date. >>> Obviously this makes sense for some tasks but not for others. >>> >>> I would prop >>> >>> >>> >>> >>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> wrote: >>>> >>>> I think this is a great improvement and should be merged. To Daniel’s >>>> concerns, I would argue this is not a change to what a dag run is, it >> is >>>> rather a change to WHEN that dag run will be scheduled. >>>> I had implemented a similar change in my own version but ultimately >>> backed >>>> so I didn’t have to patch after each new release. In my opinion the >> main >>>> flaw in the current scheduler, and I have brought this up before, is >> when >>>> you don’t have a consistent schedule interval (e.g. only run M-F). >> After >>>> backing out the “schedule at interval start” I had to switch to a daily >>>> schedule and go through and put a short circuit operator in each of my >>> M-F >>>> dags to get the behavior that I wanted. This results in putting >>> scheduling >>>> logic inside the dag, when scheduling logic should be in the scheduler. >>>> >>>> -James >>>> >>>> >>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com> >>>> wrote: >>>>> >>>>> Re >>>>> >>>>>> What are people's feelings on changing the default execution to >>> schedule >>>>>> interval start >>>>> >>>>> and >>>>> >>>>>> I'm in favor of doing that, but then exposing new variables of >>>>>> "interval_start" and "interval_end", etc. so that people write >>>>>> clearer-looking at-a-glance DAGs >>>>> >>>>> >>>>> While I am def on board with the spirit of this PR, I would vote we >> do >>>> not >>>>> accept this PR as is, because it cements a confusing option. >>>>> >>>>> *What is the right representation of a dag run?* >>>>> >>>>> Right now the representation is "dag run-at date minus 1 interval". >> It >>>>> should just be "dag run-at date". >>>>> >>>>> We don't need to address the question of whether execution date is >> the >>>>> start or the end of an interval; it doesn't matter. >>>>> >>>>> In all cases, a given dag run will be targeted for *some* initial >>> "run-at >>>>> time"; so *that* should be the time that is part of the PK of a dag >>> run, >>>>> and *that *is the time that should be exposed as the dag run >> "execution >>>>> date" >>>>> >>>>> *Interval of interest is not a dag_run attribute* >>>>> >>>>> We also mix in this question of the date interval that the *tasks* >> are >>>>> interested in. But the *dag run* need not concern itself with this >> in >>>> any >>>>> way. That is for the tasks to figure out: if they happen to need >> "dag >>>>> run-at date," then they can reference that; if they want the prior >> one, >>>> ask >>>>> for the prior one. >>>>> >>>>> Previously, I was in the camp that thought it was a great idea to >>> rename >>>>> "execution_date" to "period_start" or "interval_start". But I now >>> think >>>>> this is folly. It invokes this question of the "interval of >> interest" >>> or >>>>> "period of interest". But the dag doesn't need to know anything >> about >>>>> that. >>>>> >>>>> Within the same dag you may have tasks with different intervals of >>>>> interest. So why make assumptions in the dag; just give the facts: >>> this >>>> is >>>>> my run date; this is the prior run date, etc. It would be a >> regression >>>>> from the perspective of providing accurate names. >>>>> >>>>> *Proposal* >>>>> >>>>> So, I would propose we change "execution_date" to mean "dag run-at >>> date" >>>> as >>>>> opposed to "dag run-at date minus 1". But we should do so without >>>>> reference to interval end or interval start. >>>>> >>>>> *Configurability* >>>>> >>>>> The more configuration options we have, the more noise there is as a >>> user >>>>> trying to understand how to use airflow, so I'd rather us not make >> this >>>>> configurable at all. >>>>> >>>>> That said, perhaps a more clear and more explicit means making this >>>>> configurable would be to define an integer param >>>>> "dag_run_execution_date_interval_offset", which would control how >> many >>>>> intervals back from actual "dag run-at date" the "execution date" >>> should >>>>> be. (current behavior = 1, new behavior = 0). >>>>> >>>>> *Side note* >>>>> >>>>> Hopefully not to derail discussion: I think there are additional, >>> related >>>>> task attributes that may want to come into being: namely, >> low_watermark >>>> and >>>>> high_watermark. There is the potential, with attributes like this, >> for >>>>> adding better out-of-the-box support for common data workflows that >> we >>>> now >>>>> need to use xcom for, namely incremental loads. But I want to give >> it >>>> more >>>>> thought before proposing anything specific. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk < >> jarek.pot...@polidea.com >>>> >>>>> wrote: >>>>> >>>>>> Good one Damian. I will have a list of issues that can be possible >> to >>>>>> handle at the workshop, so that one goes there. >>>>>> >>>>>> J. >>>>>> >>>>>> Principal Software Engineer >>>>>> Phone: +48660796129 >>>>>> >>>>>> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. < >>>>>> damian.sha...@credit-suisse.com> napisał: >>>>>> >>>>>>> I can't understate what a conceptual improvement this would be for >>> the >>>>>> end >>>>>>> users of Airflow in our environment. I've written a lot of code so >>> all >>>>>> our >>>>>>> configuration works like this anyway. But the UI still shows the >>>> Airflow >>>>>>> dates which still to this day sometimes confuse me. >>>>>>> >>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my >>>> first >>>>>>> PRs could be additional test cases around edge cases to do with DST >>> and >>>>>>> cron scheduling that I have concerns about :) >>>>>>> >>>>>>> Damian >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Ash Berlin-Taylor [mailto:a...@apache.org] >>>>>>> Sent: Friday, August 23, 2019 6:50 AM >>>>>>> To: dev@airflow.apache.org >>>>>>> Subject: Setting to add choice of schedule at end or schedule at >>> start >>>> of >>>>>>> interval >>>>>>> >>>>>>> This has come up a few times before, someone has now opened a PR >> that >>>>>>> makes this a global+per-dag setting: >>>>>>> https://github.com/apache/airflow/pull/5787 and it also includes >>> docs >>>>>>> that I think does a good job of illustrating the two modes. >>>>>>> >>>>>>> Does anyone object to this being merged? If no one says anything by >>>>>> midday >>>>>>> on Tuesday I will take that as assent and will merge it. >>>>>>> >>>>>>> The docs from the PR included below. >>>>>>> >>>>>>> Thanks, >>>>>>> Ash >>>>>>> >>>>>>> Scheduled Time vs Execution Time >>>>>>> '''''''''''''''''''''''''''''''' >>>>>>> >>>>>>> A DAG with a ``schedule_interval`` will execute once per interval. >> By >>>>>>> default, the execution of a DAG will occur at the **end** of the >>>>>>> schedule interval. >>>>>>> >>>>>>> A few examples: >>>>>>> >>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that >>>> processes >>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16 17:59:59, >>>>>>> i.e. once that hour is over. >>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that >>> processes >>>>>>> 2019-08-16 will start running shortly after 2019-08-17 00:00. >>>>>>> >>>>>>> The reasoning behind this execution vs scheduling behaviour is that >>>>>>> data for the interval to be processed won't be fully available >> until >>>>>>> the interval has elapsed. >>>>>>> >>>>>>> In cases where you wish the DAG to be executed at the **start** of >>> the >>>>>>> interval, specify ``schedule_at_interval_end=False``, either in >>>>>>> ``airflow.cfg``, or on a per-DAG basis. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>> >>> >> =============================================================================== >>>>>>> >>>>>>> Please access the attached hyperlink for an important electronic >>>>>>> communications disclaimer: >>>>>>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html >>>>>>> >>>>>> >>>> >>> >> =============================================================================== >>>>>>> >>>>>>> >>>>>> >>>> >>> >>