How about an alternative approach that would introduce 2 new keyword arguments that are clear (something like, but maybe better than `period_start_dttm`, `period_end_dttm`) and leave `execution_date` unchanged, but plan it's deprecation. As a first step `execution_date` would be inferred from the new args, and warn about deprecation when used.
Max On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bdbr...@gmail.com> wrote: > Execution date is execution date for a dag run no matter what. There is no > end interval or start interval for a dag run. The only time this is > relevant is when we calculate the next or previous dagrun. > > So I don't Daniels rationale makes sense (?) > > Sent from my iPhone > > > On 27 Aug 2019, at 17:40, Philippe Gagnon <philgagn...@gmail.com> wrote: > > > > I agree with Daniel's rationale but I am also worried about backwards > > compatibility as this would perhaps be the most disruptive breaking > change > > possible. I think maybe we should write down the different options > > available to us (AIP?) and call for a vote. What does everyone think? > > > >> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcode...@gmail.com> wrote: > >> > >> Can't execution date can already mean different things depending on if > the > >> dag run was initiated via the scheduler or manually via command > line/API? > >> I agree that making it consistent might make it easier to explain to new > >> users, but should we exchange that for breaking pretty much every > existing > >> dag by re-defining what execution date is? > >> -James > >> > >> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com> > >> wrote: > >> > >>>> > >>>> To Daniel’s concerns, I would argue this is not a change to what a dag > >>> run > >>>> is, it is rather a change to WHEN that dag run will be scheduled. > >>> > >>> > >>> Execution date is part of the definition of a dag_run; it is uniquely > >>> identified by an execution_date and dag_id. > >>> > >>> When someone asks what is a dag_run, we should be able to provide an > >>> answer. > >>> > >>> Imagine trying to explain what a dag run is, when execution_date can > mean > >>> different things. > >>> Admin: "A dag run is an execution_date and a dag_id". > >>> New user: "Ok. Clear as a bell. What's an execution_date?" > >>> Admin: "Well, it can be one of two things. It *could* be when the > >> dag > >>> will be run... but it could *also* be 'the time when dag should be run > >>> minus one schedule interval". It depends on whether you choose 'end' > or > >>> 'start' for 'schedule_interval_edge.' If you choose 'start' then > >>> execution_date means 'when dag will be run'. If you choose 'end' then > >>> execution_date means 'when dag will be run minus one interval.' If you > >>> change the parameter after some time, then we don't necessarily know > what > >>> it means at all times". > >>> > >>> Why would we do this to ourselves? > >>> > >>> Alternatively, we can give dag_run a clear, unambiguous meaning: > >>> * dag_run is dag_id + execution_date > >>> * execution_date is when dag will be run (notwithstanding scheduler > >> delay, > >>> queuing) > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> Execution_date is defined as "run-at date minus 1 interval". The > >>> assumption in this is that you tasks care about this particular date. > >>> Obviously this makes sense for some tasks but not for others. > >>> > >>> I would prop > >>> > >>> > >>> > >>> > >>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> > wrote: > >>>> > >>>> I think this is a great improvement and should be merged. To Daniel’s > >>>> concerns, I would argue this is not a change to what a dag run is, it > >> is > >>>> rather a change to WHEN that dag run will be scheduled. > >>>> I had implemented a similar change in my own version but ultimately > >>> backed > >>>> so I didn’t have to patch after each new release. In my opinion the > >> main > >>>> flaw in the current scheduler, and I have brought this up before, is > >> when > >>>> you don’t have a consistent schedule interval (e.g. only run M-F). > >> After > >>>> backing out the “schedule at interval start” I had to switch to a > daily > >>>> schedule and go through and put a short circuit operator in each of my > >>> M-F > >>>> dags to get the behavior that I wanted. This results in putting > >>> scheduling > >>>> logic inside the dag, when scheduling logic should be in the > scheduler. > >>>> > >>>> -James > >>>> > >>>> > >>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com> > >>>> wrote: > >>>>> > >>>>> Re > >>>>> > >>>>>> What are people's feelings on changing the default execution to > >>> schedule > >>>>>> interval start > >>>>> > >>>>> and > >>>>> > >>>>>> I'm in favor of doing that, but then exposing new variables of > >>>>>> "interval_start" and "interval_end", etc. so that people write > >>>>>> clearer-looking at-a-glance DAGs > >>>>> > >>>>> > >>>>> While I am def on board with the spirit of this PR, I would vote we > >> do > >>>> not > >>>>> accept this PR as is, because it cements a confusing option. > >>>>> > >>>>> *What is the right representation of a dag run?* > >>>>> > >>>>> Right now the representation is "dag run-at date minus 1 interval". > >> It > >>>>> should just be "dag run-at date". > >>>>> > >>>>> We don't need to address the question of whether execution date is > >> the > >>>>> start or the end of an interval; it doesn't matter. > >>>>> > >>>>> In all cases, a given dag run will be targeted for *some* initial > >>> "run-at > >>>>> time"; so *that* should be the time that is part of the PK of a dag > >>> run, > >>>>> and *that *is the time that should be exposed as the dag run > >> "execution > >>>>> date" > >>>>> > >>>>> *Interval of interest is not a dag_run attribute* > >>>>> > >>>>> We also mix in this question of the date interval that the *tasks* > >> are > >>>>> interested in. But the *dag run* need not concern itself with this > >> in > >>>> any > >>>>> way. That is for the tasks to figure out: if they happen to need > >> "dag > >>>>> run-at date," then they can reference that; if they want the prior > >> one, > >>>> ask > >>>>> for the prior one. > >>>>> > >>>>> Previously, I was in the camp that thought it was a great idea to > >>> rename > >>>>> "execution_date" to "period_start" or "interval_start". But I now > >>> think > >>>>> this is folly. It invokes this question of the "interval of > >> interest" > >>> or > >>>>> "period of interest". But the dag doesn't need to know anything > >> about > >>>>> that. > >>>>> > >>>>> Within the same dag you may have tasks with different intervals of > >>>>> interest. So why make assumptions in the dag; just give the facts: > >>> this > >>>> is > >>>>> my run date; this is the prior run date, etc. It would be a > >> regression > >>>>> from the perspective of providing accurate names. > >>>>> > >>>>> *Proposal* > >>>>> > >>>>> So, I would propose we change "execution_date" to mean "dag run-at > >>> date" > >>>> as > >>>>> opposed to "dag run-at date minus 1". But we should do so without > >>>>> reference to interval end or interval start. > >>>>> > >>>>> *Configurability* > >>>>> > >>>>> The more configuration options we have, the more noise there is as a > >>> user > >>>>> trying to understand how to use airflow, so I'd rather us not make > >> this > >>>>> configurable at all. > >>>>> > >>>>> That said, perhaps a more clear and more explicit means making this > >>>>> configurable would be to define an integer param > >>>>> "dag_run_execution_date_interval_offset", which would control how > >> many > >>>>> intervals back from actual "dag run-at date" the "execution date" > >>> should > >>>>> be. (current behavior = 1, new behavior = 0). > >>>>> > >>>>> *Side note* > >>>>> > >>>>> Hopefully not to derail discussion: I think there are additional, > >>> related > >>>>> task attributes that may want to come into being: namely, > >> low_watermark > >>>> and > >>>>> high_watermark. There is the potential, with attributes like this, > >> for > >>>>> adding better out-of-the-box support for common data workflows that > >> we > >>>> now > >>>>> need to use xcom for, namely incremental loads. But I want to give > >> it > >>>> more > >>>>> thought before proposing anything specific. > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk < > >> jarek.pot...@polidea.com > >>>> > >>>>> wrote: > >>>>> > >>>>>> Good one Damian. I will have a list of issues that can be possible > >> to > >>>>>> handle at the workshop, so that one goes there. > >>>>>> > >>>>>> J. > >>>>>> > >>>>>> Principal Software Engineer > >>>>>> Phone: +48660796129 > >>>>>> > >>>>>> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. < > >>>>>> damian.sha...@credit-suisse.com> napisał: > >>>>>> > >>>>>>> I can't understate what a conceptual improvement this would be for > >>> the > >>>>>> end > >>>>>>> users of Airflow in our environment. I've written a lot of code so > >>> all > >>>>>> our > >>>>>>> configuration works like this anyway. But the UI still shows the > >>>> Airflow > >>>>>>> dates which still to this day sometimes confuse me. > >>>>>>> > >>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my > >>>> first > >>>>>>> PRs could be additional test cases around edge cases to do with DST > >>> and > >>>>>>> cron scheduling that I have concerns about :) > >>>>>>> > >>>>>>> Damian > >>>>>>> > >>>>>>> -----Original Message----- > >>>>>>> From: Ash Berlin-Taylor [mailto:a...@apache.org] > >>>>>>> Sent: Friday, August 23, 2019 6:50 AM > >>>>>>> To: dev@airflow.apache.org > >>>>>>> Subject: Setting to add choice of schedule at end or schedule at > >>> start > >>>> of > >>>>>>> interval > >>>>>>> > >>>>>>> This has come up a few times before, someone has now opened a PR > >> that > >>>>>>> makes this a global+per-dag setting: > >>>>>>> https://github.com/apache/airflow/pull/5787 and it also includes > >>> docs > >>>>>>> that I think does a good job of illustrating the two modes. > >>>>>>> > >>>>>>> Does anyone object to this being merged? If no one says anything by > >>>>>> midday > >>>>>>> on Tuesday I will take that as assent and will merge it. > >>>>>>> > >>>>>>> The docs from the PR included below. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Ash > >>>>>>> > >>>>>>> Scheduled Time vs Execution Time > >>>>>>> '''''''''''''''''''''''''''''''' > >>>>>>> > >>>>>>> A DAG with a ``schedule_interval`` will execute once per interval. > >> By > >>>>>>> default, the execution of a DAG will occur at the **end** of the > >>>>>>> schedule interval. > >>>>>>> > >>>>>>> A few examples: > >>>>>>> > >>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that > >>>> processes > >>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16 17:59:59, > >>>>>>> i.e. once that hour is over. > >>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that > >>> processes > >>>>>>> 2019-08-16 will start running shortly after 2019-08-17 00:00. > >>>>>>> > >>>>>>> The reasoning behind this execution vs scheduling behaviour is that > >>>>>>> data for the interval to be processed won't be fully available > >> until > >>>>>>> the interval has elapsed. > >>>>>>> > >>>>>>> In cases where you wish the DAG to be executed at the **start** of > >>> the > >>>>>>> interval, specify ``schedule_at_interval_end=False``, either in > >>>>>>> ``airflow.cfg``, or on a per-DAG basis. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>> > >>> > >> > =============================================================================== > >>>>>>> > >>>>>>> Please access the attached hyperlink for an important electronic > >>>>>>> communications disclaimer: > >>>>>>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html > >>>>>>> > >>>>>> > >>>> > >>> > >> > =============================================================================== > >>>>>>> > >>>>>>> > >>>>>> > >>>> > >>> > >> >