Re: Setting to add choice of schedule at end or schedule at start of interval

Daniel Standish Wed, 28 Aug 2019 11:25:23 -0700

I am just thinking there is the potential for a more comprehensive
enhancement here, and I worry that this is a band-aid that, like all new
features has the potential to constrain future options.  It does not help
us to do anything we cannot already do.


The source of this problem is that scheduling and interval-of-interest are
mixed together.

My thought is there may be a way to separate scheduling and
interval-of-interest to uniformly resolve "execution_date" vs "run_date"
confusion.  We could make *explicit* instead of *implicit* the relationship
between run_date *(not currently a concept in airflow)* and
"interval-of-interest" *(currently represented by execution_date)*.

I also see in this the potential to unlock some other improvements:
* support a greater diversity of incremental processes
* allow more flexible backfilling
* provide better views of data you have vs data you don't.

The canonical airflow job is date-partitioned idempotent data pull.  Your
interval of interest is from execution_date to execution_date + 1
interval.  Schedule_interval is not just the scheduling cadence but it is
also your interval-of-interest partition function.   If that doesn't work
for your job, you set catchup=False and roll your own.

What if there was a way to generalize?  E.g. could we allow for more
flexible partition function that deviated from scheduler cadence?  E.g.
what if your interval-of-interest partitions could be governed by "min 1
day, max 30 days".  Then on on-going basis, your daily loads would be a
range of 1 day but then if server down for couple days, this could be
caught up in one task and if you backfill it could be up to 30-day batches.

Perhaps there is an abstraction that could be used by a greater diversity
of incremental processes.  Such a thing could support a nice "data
contiguity view". I imagine a horizontal bar that is solid where we have
the data and empty where we don't.  Then you click on a "missing" section
and you can  trigger a backfill task with that date interval according to
your partitioning rules.

I can imagine using this for an incremental job where each time we pull the
new data since last time; in the `execute` method the operator could set
`self.high_watermark` with the max datetime processed.  Or maybe a callback
function could be used to gather this value.  This value could be used in
next run, and cold be depicted in a view.

Default intervals of interest could be status quo -- i.e. partitions equal
to schedule interval -- but could be overwritten using templating or
callbacks or setting it during `execute`.

So anyway, I don't have a master plan all figured out.  But I think there
is opportunity in this area for more comprehensive enhancement that goes
more directly at the root of the problem.




On Tue, Aug 27, 2019 at 10:00 AM Maxime Beauchemin <
[email protected]> wrote:

> How about an alternative approach that would introduce 2 new keyword
> arguments that are clear (something like, but maybe better than
> `period_start_dttm`, `period_end_dttm`) and leave `execution_date`
> unchanged, but plan it's deprecation. As a first step `execution_date`
> would be inferred from the new args, and warn about deprecation when used.
>
> Max
>
> On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <[email protected]> wrote:
>
> > Execution date is execution date for a dag run no matter what. There is
> no
> > end interval or start interval for a dag run. The only time this is
> > relevant is when we calculate the next or previous dagrun.
> >
> > So I don't Daniels rationale makes sense (?)
> >
> > Sent from my iPhone
> >
> > > On 27 Aug 2019, at 17:40, Philippe Gagnon <[email protected]>
> wrote:
> > >
> > > I agree with Daniel's rationale but I am also worried about backwards
> > > compatibility as this would perhaps be the most disruptive breaking
> > change
> > > possible. I think maybe we should write down the different options
> > > available to us (AIP?) and call for a vote. What does everyone think?
> > >
> > >> On Tue, Aug 27, 2019 at 9:25 AM James Coder <[email protected]>
> wrote:
> > >>
> > >> Can't execution date can already mean different things depending on if
> > the
> > >> dag run was initiated via the scheduler or manually via command
> > line/API?
> > >> I agree that making it consistent might make it easier to explain to
> new
> > >> users, but should we exchange that for breaking pretty much every
> > existing
> > >> dag by re-defining what execution date is?
> > >> -James
> > >>
> > >> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <
> [email protected]>
> > >> wrote:
> > >>
> > >>>>
> > >>>> To Daniel’s concerns, I would argue this is not a change to what a
> dag
> > >>> run
> > >>>> is, it is rather a change to WHEN that dag run will be scheduled.
> > >>>
> > >>>
> > >>> Execution date is part of the definition of a dag_run; it is uniquely
> > >>> identified by an execution_date and dag_id.
> > >>>
> > >>> When someone asks what is a dag_run, we should be able to provide an
> > >>> answer.
> > >>>
> > >>> Imagine trying to explain what a dag run is, when execution_date can
> > mean
> > >>> different things.
> > >>>    Admin: "A dag run is an execution_date and a dag_id".
> > >>>    New user: "Ok. Clear as a bell. What's an execution_date?"
> > >>>    Admin: "Well, it can be one of two things.  It *could* be when the
> > >> dag
> > >>> will be run... but it could *also* be 'the time when dag should be
> run
> > >>> minus one schedule interval".  It depends on whether you choose 'end'
> > or
> > >>> 'start' for 'schedule_interval_edge.'  If you choose 'start' then
> > >>> execution_date means 'when dag will be run'.  If you choose 'end'
> then
> > >>> execution_date means 'when dag will be run minus one interval.'  If
> you
> > >>> change the parameter after some time, then we don't necessarily know
> > what
> > >>> it means at all times".
> > >>>
> > >>> Why would we do this to ourselves?
> > >>>
> > >>> Alternatively, we can give dag_run a clear, unambiguous meaning:
> > >>> * dag_run is dag_id + execution_date
> > >>> * execution_date is when dag will be run (notwithstanding scheduler
> > >> delay,
> > >>> queuing)
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Execution_date is defined as "run-at date minus 1 interval".  The
> > >>> assumption in this is that you tasks care about this particular date.
> > >>> Obviously this makes sense for some tasks but not for others.
> > >>>
> > >>> I would prop
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <[email protected]>
> > wrote:
> > >>>>
> > >>>> I think this is a great improvement and should be merged. To
> Daniel’s
> > >>>> concerns, I would argue this is not a change to what a dag run is,
> it
> > >> is
> > >>>> rather a change to WHEN that dag run will be scheduled.
> > >>>> I had implemented a similar change in my own version but ultimately
> > >>> backed
> > >>>> so I didn’t have to patch after each new release. In my opinion the
> > >> main
> > >>>> flaw in the current scheduler, and I have brought this up before, is
> > >> when
> > >>>> you don’t have a consistent schedule interval (e.g. only run M-F).
> > >> After
> > >>>> backing out the “schedule at interval start” I had to switch to a
> > daily
> > >>>> schedule and go through and put a short circuit operator in each of
> my
> > >>> M-F
> > >>>> dags to get the behavior that I wanted. This results in putting
> > >>> scheduling
> > >>>> logic inside the dag, when scheduling logic should be in the
> > scheduler.
> > >>>>
> > >>>> -James
> > >>>>
> > >>>>
> > >>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <[email protected]
> >
> > >>>> wrote:
> > >>>>>
> > >>>>> Re
> > >>>>>
> > >>>>>> What are people's feelings on changing the default execution to
> > >>> schedule
> > >>>>>> interval start
> > >>>>>
> > >>>>> and
> > >>>>>
> > >>>>>> I'm in favor of doing that, but then exposing new variables of
> > >>>>>> "interval_start" and "interval_end", etc. so that people write
> > >>>>>> clearer-looking at-a-glance DAGs
> > >>>>>
> > >>>>>
> > >>>>> While I am def on board with the spirit of this PR, I would vote we
> > >> do
> > >>>> not
> > >>>>> accept this PR as is, because it cements a confusing option.
> > >>>>>
> > >>>>> *What is the right representation of a dag run?*
> > >>>>>
> > >>>>> Right now the representation is "dag run-at date minus 1 interval".
> > >> It
> > >>>>> should just be "dag run-at date".
> > >>>>>
> > >>>>> We don't need to address the question of whether execution date is
> > >> the
> > >>>>> start or the end of an interval; it doesn't matter.
> > >>>>>
> > >>>>> In all cases, a given dag run will be targeted for *some* initial
> > >>> "run-at
> > >>>>> time"; so *that* should be the time that is part of the PK of a dag
> > >>> run,
> > >>>>> and *that *is the time that should be exposed as the dag run
> > >> "execution
> > >>>>> date"
> > >>>>>
> > >>>>> *Interval of interest is not a dag_run attribute*
> > >>>>>
> > >>>>> We also mix in this question of the date interval that the *tasks*
> > >> are
> > >>>>> interested in.  But the *dag run* need not concern itself with this
> > >> in
> > >>>> any
> > >>>>> way.  That is for the tasks to figure out: if they happen to need
> > >> "dag
> > >>>>> run-at date," then they can reference that; if they want the prior
> > >> one,
> > >>>> ask
> > >>>>> for the prior one.
> > >>>>>
> > >>>>> Previously, I was in the camp that thought it was a great idea to
> > >>> rename
> > >>>>> "execution_date" to "period_start" or "interval_start".  But I now
> > >>> think
> > >>>>> this is folly.  It invokes this question of the "interval of
> > >> interest"
> > >>> or
> > >>>>> "period of interest".  But the dag doesn't need to know anything
> > >> about
> > >>>>> that.
> > >>>>>
> > >>>>> Within the same dag you may have tasks with different intervals of
> > >>>>> interest.  So why make assumptions in the dag; just give the facts:
> > >>> this
> > >>>> is
> > >>>>> my run date; this is the prior run date, etc.  It would be a
> > >> regression
> > >>>>> from the perspective of providing accurate names.
> > >>>>>
> > >>>>> *Proposal*
> > >>>>>
> > >>>>> So, I would propose we change "execution_date" to mean "dag run-at
> > >>> date"
> > >>>> as
> > >>>>> opposed to "dag run-at date minus 1".  But we should do so without
> > >>>>> reference to interval end or interval start.
> > >>>>>
> > >>>>> *Configurability*
> > >>>>>
> > >>>>> The more configuration options we have, the more noise there is as
> a
> > >>> user
> > >>>>> trying to understand how to use airflow, so I'd rather us not make
> > >> this
> > >>>>> configurable at all.
> > >>>>>
> > >>>>> That said, perhaps a more clear and more explicit means making this
> > >>>>> configurable would be to define an integer param
> > >>>>> "dag_run_execution_date_interval_offset", which would control how
> > >> many
> > >>>>> intervals back from actual "dag run-at date" the "execution date"
> > >>> should
> > >>>>> be.  (current behavior = 1, new behavior = 0).
> > >>>>>
> > >>>>> *Side note*
> > >>>>>
> > >>>>> Hopefully not to derail discussion: I think there are additional,
> > >>> related
> > >>>>> task attributes that may want to come into being: namely,
> > >> low_watermark
> > >>>> and
> > >>>>> high_watermark.  There is the potential, with attributes like this,
> > >> for
> > >>>>> adding better out-of-the-box support for common data workflows that
> > >> we
> > >>>> now
> > >>>>> need to use xcom for, namely incremental loads.  But I want to give
> > >> it
> > >>>> more
> > >>>>> thought before proposing anything specific.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <
> > >> [email protected]
> > >>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Good one Damian. I will have a list of issues that can be possible
> > >> to
> > >>>>>> handle at the workshop, so that one goes there.
> > >>>>>>
> > >>>>>> J.
> > >>>>>>
> > >>>>>> Principal Software Engineer
> > >>>>>> Phone: +48660796129
> > >>>>>>
> > >>>>>> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. <
> > >>>>>> [email protected]> napisał:
> > >>>>>>
> > >>>>>>> I can't understate what a conceptual improvement this would be
> for
> > >>> the
> > >>>>>> end
> > >>>>>>> users of Airflow in our environment. I've written a lot of code
> so
> > >>> all
> > >>>>>> our
> > >>>>>>> configuration works like this anyway. But the UI still shows the
> > >>>> Airflow
> > >>>>>>> dates which still to this day sometimes confuse me.
> > >>>>>>>
> > >>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of
> my
> > >>>> first
> > >>>>>>> PRs could be additional test cases around edge cases to do with
> DST
> > >>> and
> > >>>>>>> cron scheduling that I have concerns about :)
> > >>>>>>>
> > >>>>>>> Damian
> > >>>>>>>
> > >>>>>>> -----Original Message-----
> > >>>>>>> From: Ash Berlin-Taylor [mailto:[email protected]]
> > >>>>>>> Sent: Friday, August 23, 2019 6:50 AM
> > >>>>>>> To: [email protected]
> > >>>>>>> Subject: Setting to add choice of schedule at end or schedule at
> > >>> start
> > >>>> of
> > >>>>>>> interval
> > >>>>>>>
> > >>>>>>> This has come up a few times before, someone has now opened a PR
> > >> that
> > >>>>>>> makes this a global+per-dag setting:
> > >>>>>>> https://github.com/apache/airflow/pull/5787 and it also includes
> > >>> docs
> > >>>>>>> that I think does a good job of illustrating the two modes.
> > >>>>>>>
> > >>>>>>> Does anyone object to this being merged? If no one says anything
> by
> > >>>>>> midday
> > >>>>>>> on Tuesday I will take that as assent and will merge it.
> > >>>>>>>
> > >>>>>>> The docs from the PR included below.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Ash
> > >>>>>>>
> > >>>>>>> Scheduled Time vs Execution Time
> > >>>>>>> ''''''''''''''''''''''''''''''''
> > >>>>>>>
> > >>>>>>> A DAG with a ``schedule_interval`` will execute once per
> interval.
> > >> By
> > >>>>>>> default, the execution of a DAG will occur at the **end** of the
> > >>>>>>> schedule interval.
> > >>>>>>>
> > >>>>>>> A few examples:
> > >>>>>>>
> > >>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that
> > >>>> processes
> > >>>>>>> 2019-08-16 17:00 will start running just after 2019-08-16
> 17:59:59,
> > >>>>>>> i.e. once that hour is over.
> > >>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that
> > >>> processes
> > >>>>>>> 2019-08-16 will start running shortly after 2019-08-17 00:00.
> > >>>>>>>
> > >>>>>>> The reasoning behind this execution vs scheduling behaviour is
> that
> > >>>>>>> data for the interval to be processed won't be fully available
> > >> until
> > >>>>>>> the interval has elapsed.
> > >>>>>>>
> > >>>>>>> In cases where you wish the DAG to be executed at the **start**
> of
> > >>> the
> > >>>>>>> interval, specify ``schedule_at_interval_end=False``, either in
> > >>>>>>> ``airflow.cfg``, or on a per-DAG basis.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>>
> > >>
> >
> ===============================================================================
> > >>>>>>>
> > >>>>>>> Please access the attached hyperlink for an important electronic
> > >>>>>>> communications disclaimer:
> > >>>>>>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>>
> > >>
> >
> ===============================================================================
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>>
> > >>
> >
>

Re: Setting to add choice of schedule at end or schedule at start of interval

Reply via email to