What if we merely add a property "run_date" to DagRun? At present this would be essentially the same as "next_execution_date".
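
A minimal sketch of what such a property might look like, assuming a fixed-length schedule interval (not a cron expression). `DagRunSketch` and its `schedule_interval` field are hypothetical stand-ins for illustration, not Airflow's actual DagRun model:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DagRunSketch:
    """Hypothetical stand-in for a DagRun; not Airflow's actual model."""

    dag_id: str
    execution_date: datetime
    schedule_interval: timedelta  # assumption: fixed-length interval

    @property
    def run_date(self) -> datetime:
        # Earliest moment the run may start: the end of the interval that
        # begins at execution_date (the "run-at date" in this thread).
        return self.execution_date + self.schedule_interval


run = DagRunSketch("example_dag", datetime(2019, 1, 1), timedelta(days=1))
print(run.run_date)  # 2019-01-02 00:00:00
```
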
Then no change to the scheduler would be required, and no new dag parameter or config. Perhaps you could add a toggle to the DAGs UI view that lets you choose whether to display "last run" by "run_date" or "execution_date".

If you want your dags to be parameterized by the date when they are meant to be run -- as opposed to their implicit interval-of-interest -- then you can reference "run_date".

One potential source of confusion with this is backfilling: what does "run_date" mean in the context of a backfill? You could say it means essentially "initial run date", i.e. "do not run before date", i.e. "run after date" or "run-at date". So, for a daily job, the 2019-01-02 "run_date" corresponds to a 2019-01-01 execution_date. This makes sense, right?

Perhaps in the future, the relationship between "run_date" and "execution_date" can be more dynamic. Perhaps in the future we rename "execution_date" for clarity, or to be more generic. But it makes sense that a dag run will always have a run date, so it doesn't seem like a terrible idea to add a property representing this.

Would this meet the goals of the PR?

On Wed, Aug 28, 2019 at 11:50 AM James Meickle <jmeic...@quantopian.com.invalid> wrote:

> Totally agree with Daniel here. I think that if we implement this feature
> as proposed, it will actively discourage us from implementing a better
> data-aware feature that would remain invisible to most users while neatly
> addressing a lot of edge cases that currently require really ugly hacks. I
> believe that having more data awareness features in Airflow (like the data
> lineage work, or other metadata integrations) is worth investing in if we
> can do it without too much required user-facing complexity. The Airflow
> project isn't a full data warehouse suite but it's also not just "cron with
> a UI", so we should try to be pragmatic and fit in power-user features
> where we can do so without compromising the project's overall goals.
>
> On Wed, Aug 28, 2019 at 2:24 PM Daniel Standish <dpstand...@gmail.com> wrote:
>
> > I am just thinking there is the potential for a more comprehensive
> > enhancement here, and I worry that this is a band-aid that, like all new
> > features, has the potential to constrain future options. It does not help
> > us to do anything we cannot already do.
> >
> > The source of this problem is that scheduling and interval-of-interest
> > are mixed together.
> >
> > My thought is there may be a way to separate scheduling and
> > interval-of-interest to uniformly resolve "execution_date" vs "run_date"
> > confusion. We could make *explicit* instead of *implicit* the
> > relationship between run_date *(not currently a concept in airflow)* and
> > "interval-of-interest" *(currently represented by execution_date)*.
> >
> > I also see in this the potential to unlock some other improvements:
> > * support a greater diversity of incremental processes
> > * allow more flexible backfilling
> > * provide better views of data you have vs data you don't.
> >
> > The canonical airflow job is a date-partitioned idempotent data pull. Your
> > interval of interest is from execution_date to execution_date + 1
> > interval. Schedule_interval is not just the scheduling cadence but it is
> > also your interval-of-interest partition function. If that doesn't work
> > for your job, you set catchup=False and roll your own.
> >
> > What if there was a way to generalize? E.g. could we allow for a more
> > flexible partition function that deviated from scheduler cadence? E.g.
> > what if your interval-of-interest partitions could be governed by "min 1
> > day, max 30 days". Then on an ongoing basis, your daily loads would be a
> > range of 1 day, but if the server is down for a couple of days, this could
> > be caught up in one task, and if you backfill it could be up to 30-day
> > batches.
> >
> > Perhaps there is an abstraction that could be used by a greater diversity
> > of incremental processes.
> > Such a thing could support a nice "data contiguity view". I imagine a
> > horizontal bar that is solid where we have the data and empty where we
> > don't. Then you click on a "missing" section and you can trigger a
> > backfill task with that date interval according to your partitioning
> > rules.
> >
> > I can imagine using this for an incremental job where each time we pull
> > the new data since last time; in the `execute` method the operator could
> > set `self.high_watermark` with the max datetime processed. Or maybe a
> > callback function could be used to gather this value. This value could be
> > used in the next run, and could be depicted in a view.
> >
> > Default intervals of interest could be status quo -- i.e. partitions
> > equal to schedule interval -- but could be overridden using templating or
> > callbacks or setting it during `execute`.
> >
> > So anyway, I don't have a master plan all figured out. But I think there
> > is opportunity in this area for more comprehensive enhancement that goes
> > more directly at the root of the problem.
> >
> > On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
> >
> > > How about an alternative approach that would introduce 2 new keyword
> > > arguments that are clear (something like, but maybe better than
> > > `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
> > > unchanged, but plan its deprecation. As a first step `execution_date`
> > > would be inferred from the new args, and warn about deprecation when
> > > used.
> > >
> > > Max
> > >
> > > On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bdbr...@gmail.com> wrote:
> > >
> > > > Execution date is execution date for a dag run no matter what. There
> > > > is no end interval or start interval for a dag run. The only time this
> > > > is relevant is when we calculate the next or previous dagrun.
> > > >
> > > > So I don't think Daniel's rationale makes sense (?)
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On 27 Aug 2019, at 17:40, Philippe Gagnon <philgagn...@gmail.com> wrote:
> > > > >
> > > > > I agree with Daniel's rationale but I am also worried about backwards
> > > > > compatibility as this would perhaps be the most disruptive breaking
> > > > > change possible. I think maybe we should write down the different
> > > > > options available to us (AIP?) and call for a vote. What does everyone
> > > > > think?
> > > > >
> > > > >> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcode...@gmail.com> wrote:
> > > > >>
> > > > >> Can't execution date already mean different things depending on
> > > > >> if the dag run was initiated via the scheduler or manually via
> > > > >> command line/API? I agree that making it consistent might make it
> > > > >> easier to explain to new users, but should we exchange that for
> > > > >> breaking pretty much every existing dag by re-defining what
> > > > >> execution date is?
> > > > >> -James
> > > > >>
> > > > >> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com> wrote:
> > > > >>
> > > > >>>> To Daniel’s concerns, I would argue this is not a change to what a
> > > > >>>> dag run is, it is rather a change to WHEN that dag run will be
> > > > >>>> scheduled.
> > > > >>>
> > > > >>> Execution date is part of the definition of a dag_run; it is uniquely
> > > > >>> identified by an execution_date and dag_id.
> > > > >>>
> > > > >>> When someone asks what is a dag_run, we should be able to provide an
> > > > >>> answer.
> > > > >>>
> > > > >>> Imagine trying to explain what a dag run is, when execution_date can
> > > > >>> mean different things.
> > > > >>> Admin: "A dag run is an execution_date and a dag_id".
> > > > >>> New user: "Ok. Clear as a bell. What's an execution_date?"
> > > > >>> Admin: "Well, it can be one of two things.
> > > > >>> It *could* be when the dag will be run... but it could *also* be
> > > > >>> 'the time when dag should be run minus one schedule interval'. It
> > > > >>> depends on whether you choose 'end' or 'start' for
> > > > >>> 'schedule_interval_edge.' If you choose 'start' then
> > > > >>> execution_date means 'when dag will be run'. If you choose 'end'
> > > > >>> then execution_date means 'when dag will be run minus one interval.'
> > > > >>> If you change the parameter after some time, then we don't
> > > > >>> necessarily know what it means at all times".
> > > > >>>
> > > > >>> Why would we do this to ourselves?
> > > > >>>
> > > > >>> Alternatively, we can give dag_run a clear, unambiguous meaning:
> > > > >>> * dag_run is dag_id + execution_date
> > > > >>> * execution_date is when dag will be run (notwithstanding scheduler
> > > > >>>   delay, queuing)
> > > > >>>
> > > > >>> Execution_date is defined as "run-at date minus 1 interval". The
> > > > >>> assumption in this is that your tasks care about this particular date.
> > > > >>> Obviously this makes sense for some tasks but not for others.
> > > > >>>
> > > > >>> I would prop
> > > > >>>
> > > > >>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> wrote:
> > > > >>>>
> > > > >>>> I think this is a great improvement and should be merged. To Daniel’s
> > > > >>>> concerns, I would argue this is not a change to what a dag run is,
> > > > >>>> it is rather a change to WHEN that dag run will be scheduled.
> > > > >>>> I had implemented a similar change in my own version but ultimately
> > > > >>>> backed it out so I didn’t have to patch after each new release.
> > > > >>>> In my opinion the main flaw in the current scheduler, and I have
> > > > >>>> brought this up before, is when you don’t have a consistent schedule
> > > > >>>> interval (e.g. only run M-F). After backing out the “schedule at
> > > > >>>> interval start” I had to switch to a daily schedule and go through
> > > > >>>> and put a short circuit operator in each of my M-F dags to get the
> > > > >>>> behavior that I wanted. This results in putting scheduling logic
> > > > >>>> inside the dag, when scheduling logic should be in the scheduler.
> > > > >>>>
> > > > >>>> -James
> > > > >>>>
> > > > >>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>> Re
> > > > >>>>>
> > > > >>>>>> What are people's feelings on changing the default execution to
> > > > >>>>>> schedule interval start
> > > > >>>>>
> > > > >>>>> and
> > > > >>>>>
> > > > >>>>>> I'm in favor of doing that, but then exposing new variables of
> > > > >>>>>> "interval_start" and "interval_end", etc. so that people write
> > > > >>>>>> clearer-looking at-a-glance DAGs
> > > > >>>>>
> > > > >>>>> While I am def on board with the spirit of this PR, I would vote we
> > > > >>>>> do not accept this PR as is, because it cements a confusing option.
> > > > >>>>>
> > > > >>>>> *What is the right representation of a dag run?*
> > > > >>>>>
> > > > >>>>> Right now the representation is "dag run-at date minus 1 interval".
> > > > >>>>> It should just be "dag run-at date".
> > > > >>>>>
> > > > >>>>> We don't need to address the question of whether execution date is
> > > > >>>>> the start or the end of an interval; it doesn't matter.
> > > > >>>>>
> > > > >>>>> In all cases, a given dag run will be targeted for *some* initial
> > > > >>>>> "run-at time"; so *that* should be the time that is part of the PK of
> > > > >>>>> a dag run, and *that* is the time that should be exposed as the dag
> > > > >>>>> run "execution date".
> > > > >>>>>
> > > > >>>>> *Interval of interest is not a dag_run attribute*
> > > > >>>>>
> > > > >>>>> We also mix in this question of the date interval that the *tasks*
> > > > >>>>> are interested in. But the *dag run* need not concern itself with
> > > > >>>>> this in any way. That is for the tasks to figure out: if they happen
> > > > >>>>> to need "dag run-at date," then they can reference that; if they want
> > > > >>>>> the prior one, ask for the prior one.
> > > > >>>>>
> > > > >>>>> Previously, I was in the camp that thought it was a great idea to
> > > > >>>>> rename "execution_date" to "period_start" or "interval_start". But I
> > > > >>>>> now think this is folly. It invokes this question of the "interval of
> > > > >>>>> interest" or "period of interest". But the dag doesn't need to know
> > > > >>>>> anything about that.
> > > > >>>>>
> > > > >>>>> Within the same dag you may have tasks with different intervals of
> > > > >>>>> interest. So why make assumptions in the dag? Just give the facts:
> > > > >>>>> this is my run date; this is the prior run date, etc. It would be a
> > > > >>>>> regression from the perspective of providing accurate names.
> > > > >>>>>
> > > > >>>>> *Proposal*
> > > > >>>>>
> > > > >>>>> So, I would propose we change "execution_date" to mean "dag run-at
> > > > >>>>> date" as opposed to "dag run-at date minus 1".
> > > > >>>>> But we should do so without reference to interval end or interval
> > > > >>>>> start.
> > > > >>>>>
> > > > >>>>> *Configurability*
> > > > >>>>>
> > > > >>>>> The more configuration options we have, the more noise there is as a
> > > > >>>>> user trying to understand how to use airflow, so I'd rather us not
> > > > >>>>> make this configurable at all.
> > > > >>>>>
> > > > >>>>> That said, perhaps a clearer and more explicit means of making this
> > > > >>>>> configurable would be to define an integer param
> > > > >>>>> "dag_run_execution_date_interval_offset", which would control how
> > > > >>>>> many intervals back from the actual "dag run-at date" the "execution
> > > > >>>>> date" should be (current behavior = 1, new behavior = 0).
> > > > >>>>>
> > > > >>>>> *Side note*
> > > > >>>>>
> > > > >>>>> Hopefully not to derail discussion: I think there are additional,
> > > > >>>>> related task attributes that may want to come into being: namely,
> > > > >>>>> low_watermark and high_watermark. There is the potential, with
> > > > >>>>> attributes like this, for adding better out-of-the-box support for
> > > > >>>>> common data workflows that we now need to use xcom for, namely
> > > > >>>>> incremental loads. But I want to give it more thought before
> > > > >>>>> proposing anything specific.
> > > > >>>>>
> > > > >>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > > >>>>>
> > > > >>>>>> Good one Damian. I will have a list of issues that can be possible
> > > > >>>>>> to handle at the workshop, so that one goes there.
> > > > >>>>>>
> > > > >>>>>> J.
> > > > >>>>>>
> > > > >>>>>> Principal Software Engineer
> > > > >>>>>> Phone: +48660796129
> > > > >>>>>>
> > > > >>>>>> On Fri, Aug 23, 2019 at 11:09, Shaw, Damian P. <damian.sha...@credit-suisse.com> wrote:
> > > > >>>>>>
> > > > >>>>>>> I can't overstate what a conceptual improvement this would be for
> > > > >>>>>>> the end users of Airflow in our environment. I've written a lot of
> > > > >>>>>>> code so all our configuration works like this anyway. But the UI
> > > > >>>>>>> still shows the Airflow dates, which still to this day sometimes
> > > > >>>>>>> confuse me.
> > > > >>>>>>>
> > > > >>>>>>> I'll be at the NY meetups on Monday and Tuesday, maybe some of my
> > > > >>>>>>> first PRs could be additional test cases around edge cases to do
> > > > >>>>>>> with DST and cron scheduling that I have concerns about :)
> > > > >>>>>>>
> > > > >>>>>>> Damian
> > > > >>>>>>>
> > > > >>>>>>> -----Original Message-----
> > > > >>>>>>> From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > > > >>>>>>> Sent: Friday, August 23, 2019 6:50 AM
> > > > >>>>>>> To: dev@airflow.apache.org
> > > > >>>>>>> Subject: Setting to add choice of schedule at end or schedule at
> > > > >>>>>>> start of interval
> > > > >>>>>>>
> > > > >>>>>>> This has come up a few times before, someone has now opened a PR
> > > > >>>>>>> that makes this a global+per-dag setting:
> > > > >>>>>>> https://github.com/apache/airflow/pull/5787 and it also includes
> > > > >>>>>>> docs that I think do a good job of illustrating the two modes.
> > > > >>>>>>>
> > > > >>>>>>> Does anyone object to this being merged? If no one says anything by
> > > > >>>>>>> midday on Tuesday I will take that as assent and will merge it.
> > > > >>>>>>>
> > > > >>>>>>> The docs from the PR are included below.
> > > > >>>>>>>
> > > > >>>>>>> Thanks,
> > > > >>>>>>> Ash
> > > > >>>>>>>
> > > > >>>>>>> Scheduled Time vs Execution Time
> > > > >>>>>>> ''''''''''''''''''''''''''''''''
> > > > >>>>>>>
> > > > >>>>>>> A DAG with a ``schedule_interval`` will execute once per interval.
> > > > >>>>>>> By default, the execution of a DAG will occur at the **end** of the
> > > > >>>>>>> schedule interval.
> > > > >>>>>>>
> > > > >>>>>>> A few examples:
> > > > >>>>>>>
> > > > >>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that
> > > > >>>>>>>   processes 2019-08-16 17:00 will start running just after
> > > > >>>>>>>   2019-08-16 17:59:59, i.e. once that hour is over.
> > > > >>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that
> > > > >>>>>>>   processes 2019-08-16 will start running shortly after
> > > > >>>>>>>   2019-08-17 00:00.
> > > > >>>>>>>
> > > > >>>>>>> The reasoning behind this execution vs scheduling behaviour is that
> > > > >>>>>>> data for the interval to be processed won't be fully available
> > > > >>>>>>> until the interval has elapsed.
> > > > >>>>>>>
> > > > >>>>>>> In cases where you wish the DAG to be executed at the **start** of
> > > > >>>>>>> the interval, specify ``schedule_at_interval_end=False``, either in
> > > > >>>>>>> ``airflow.cfg``, or on a per-DAG basis.
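
To make the two modes described in the quoted docs concrete, here is a rough sketch, assuming fixed-length intervals. `earliest_start` is a hypothetical helper written for illustration and is not code from the PR; only the `schedule_at_interval_end` name comes from the docs above.

```python
from datetime import datetime, timedelta


def earliest_start(interval_start: datetime,
                   interval: timedelta,
                   schedule_at_interval_end: bool = True) -> datetime:
    """Return the earliest time the run covering `interval_start` may begin.

    Hypothetical helper for illustration; not part of Airflow or the PR.
    """
    if schedule_at_interval_end:
        # Default behaviour: wait until the whole interval has elapsed.
        return interval_start + interval
    # Opt-in behaviour: run as soon as the interval opens.
    return interval_start


hourly = earliest_start(datetime(2019, 8, 16, 17, 0), timedelta(hours=1))
daily = earliest_start(datetime(2019, 8, 16), timedelta(days=1))
at_start = earliest_start(datetime(2019, 8, 16), timedelta(days=1),
                          schedule_at_interval_end=False)
print(hourly)    # 2019-08-16 18:00:00
print(daily)     # 2019-08-17 00:00:00
print(at_start)  # 2019-08-16 00:00:00
```

The hourly and daily results line up with the two examples in the docs: the run for 17:00 starts once the hour is over, and the run for 2019-08-16 starts at the following midnight.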