How about an alternative approach that would introduce 2 new keyword
arguments that are clear (something like, but maybe better than
`period_start_dttm`, `period_end_dttm`) and leave `execution_date`
unchanged, but plan its deprecation? As a first step `execution_date`
would be inferred from the new args, and warn about deprecation when used.
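
A rough sketch of what that first step could look like, in plain Python
rather than actual Airflow code. `RunContext` is a hypothetical name, and the
sketch assumes the legacy `execution_date` equals the start of the period
(the current "run-at date minus 1 interval" meaning):

```python
import warnings
from datetime import datetime


class RunContext:
    """Hypothetical container for the dates handed to a task."""

    def __init__(self, period_start_dttm, period_end_dttm):
        self.period_start_dttm = period_start_dttm
        self.period_end_dttm = period_end_dttm

    @property
    def execution_date(self):
        # Assumption: the legacy value is inferred from the period start.
        warnings.warn(
            "execution_date is deprecated; use period_start_dttm "
            "or period_end_dttm instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.period_start_dttm


ctx = RunContext(datetime(2019, 8, 16), datetime(2019, 8, 17))
print(ctx.period_start_dttm, ctx.period_end_dttm)
```

Existing DAGs that read `execution_date` would keep working and only see a
warning, while new DAGs could use the two explicit names.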

Max

On Tue, Aug 27, 2019 at 9:26 AM Bolke de Bruin <bdbr...@gmail.com> wrote:

> Execution date is execution date for a dag run no matter what. There is no
> end interval or start interval for a dag run. The only time this is
> relevant is when we calculate the next or previous dagrun.
>
> So I don't think Daniel's rationale makes sense (?)
>
> Sent from my iPhone
>
> > On 27 Aug 2019, at 17:40, Philippe Gagnon <philgagn...@gmail.com> wrote:
> >
> > I agree with Daniel's rationale but I am also worried about backwards
> > compatibility as this would perhaps be the most disruptive breaking change
> > possible. I think maybe we should write down the different options
> > available to us (AIP?) and call for a vote. What does everyone think?
> >
> >> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcode...@gmail.com> wrote:
> >>
> >> Can't execution date already mean different things depending on if the
> >> dag run was initiated via the scheduler or manually via command line/API?
> >> I agree that making it consistent might make it easier to explain to new
> >> users, but should we exchange that for breaking pretty much every existing
> >> dag by re-defining what execution date is?
> >> -James
> >>
> >> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com>
> >> wrote:
> >>
> >>>>
> >>>> To Daniel’s concerns, I would argue this is not a change to what a dag run
> >>>> is, it is rather a change to WHEN that dag run will be scheduled.
> >>>
> >>>
> >>> Execution date is part of the definition of a dag_run; it is uniquely
> >>> identified by an execution_date and dag_id.
> >>>
> >>> When someone asks what is a dag_run, we should be able to provide an
> >>> answer.
> >>>
> >>> Imagine trying to explain what a dag run is, when execution_date can mean
> >>> different things.
> >>>    Admin: "A dag run is an execution_date and a dag_id".
> >>>    New user: "Ok. Clear as a bell. What's an execution_date?"
> >>>    Admin: "Well, it can be one of two things.  It *could* be when the dag
> >>> will be run... but it could *also* be 'the time when dag should be run
> >>> minus one schedule interval'.  It depends on whether you choose 'end' or
> >>> 'start' for 'schedule_interval_edge.'  If you choose 'start' then
> >>> execution_date means 'when dag will be run'.  If you choose 'end' then
> >>> execution_date means 'when dag will be run minus one interval.'  If you
> >>> change the parameter after some time, then we don't necessarily know what
> >>> it means at all times".
> >>>
> >>> Why would we do this to ourselves?
> >>>
> >>> Alternatively, we can give dag_run a clear, unambiguous meaning:
> >>> * dag_run is dag_id + execution_date
> >>> * execution_date is when dag will be run (notwithstanding scheduler delay,
> >>> queuing)
> >>>
> >>>
> >>> Execution_date is defined as "run-at date minus 1 interval".  The
> >>> assumption in this is that your tasks care about this particular date.
> >>> Obviously this makes sense for some tasks but not for others.
> >>>
> >>> I would prop
> >>>
> >>>
> >>>
> >>>
> >>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> wrote:
> >>>>
> >>>> I think this is a great improvement and should be merged. To Daniel’s
> >>>> concerns, I would argue this is not a change to what a dag run is, it is
> >>>> rather a change to WHEN that dag run will be scheduled.
> >>>> I had implemented a similar change in my own version but ultimately backed
> >>>> it out so I didn’t have to patch after each new release. In my opinion the
> >>>> main flaw in the current scheduler, and I have brought this up before, is
> >>>> when you don’t have a consistent schedule interval (e.g. only run M-F).
> >>>> After backing out the “schedule at interval start” I had to switch to a
> >>>> daily schedule and go through and put a short circuit operator in each of
> >>>> my M-F dags to get the behavior that I wanted. This results in putting
> >>>> scheduling logic inside the dag, when scheduling logic should be in the
> >>>> scheduler.
> >>>>
> >>>> -James
> >>>>
> >>>>
> >>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Re
> >>>>>
> >>>>>> What are people's feelings on changing the default execution to
> >>>>>> schedule interval start
> >>>>>
> >>>>> and
> >>>>>
> >>>>>> I'm in favor of doing that, but then exposing new variables of
> >>>>>> "interval_start" and "interval_end", etc. so that people write
> >>>>>> clearer-looking at-a-glance DAGs
> >>>>>
> >>>>>
> >>>>> While I am def on board with the spirit of this PR, I would vote we do
> >>>>> not accept this PR as is, because it cements a confusing option.
> >>>>>
> >>>>> *What is the right representation of a dag run?*
> >>>>>
> >>>>> Right now the representation is "dag run-at date minus 1 interval".  It
> >>>>> should just be "dag run-at date".
> >>>>>
> >>>>> We don't need to address the question of whether execution date is the
> >>>>> start or the end of an interval; it doesn't matter.
> >>>>>
> >>>>> In all cases, a given dag run will be targeted for *some* initial "run-at
> >>>>> time"; so *that* should be the time that is part of the PK of a dag run,
> >>>>> and *that* is the time that should be exposed as the dag run "execution
> >>>>> date".
> >>>>>
> >>>>> *Interval of interest is not a dag_run attribute*
> >>>>>
> >>>>> We also mix in this question of the date interval that the *tasks* are
> >>>>> interested in.  But the *dag run* need not concern itself with this in
> >>>>> any way.  That is for the tasks to figure out: if they happen to need
> >>>>> "dag run-at date," then they can reference that; if they want the prior
> >>>>> one, ask for the prior one.
> >>>>>
> >>>>> Previously, I was in the camp that thought it was a great idea to rename
> >>>>> "execution_date" to "period_start" or "interval_start".  But I now think
> >>>>> this is folly.  It invokes this question of the "interval of interest" or
> >>>>> "period of interest".  But the dag doesn't need to know anything about
> >>>>> that.
> >>>>>
> >>>>> Within the same dag you may have tasks with different intervals of
> >>>>> interest.  So why make assumptions in the dag; just give the facts: this
> >>>>> is my run date; this is the prior run date, etc.  It would be a
> >>>>> regression from the perspective of providing accurate names.
> >>>>>
> >>>>> *Proposal*
> >>>>>
> >>>>> So, I would propose we change "execution_date" to mean "dag run-at date"
> >>>>> as opposed to "dag run-at date minus 1".  But we should do so without
> >>>>> reference to interval end or interval start.
> >>>>>
> >>>>> *Configurability*
> >>>>>
> >>>>> The more configuration options we have, the more noise there is as a user
> >>>>> trying to understand how to use airflow, so I'd rather us not make this
> >>>>> configurable at all.
> >>>>>
> >>>>> That said, perhaps a clearer and more explicit means of making this
> >>>>> configurable would be to define an integer param
> >>>>> "dag_run_execution_date_interval_offset", which would control how many
> >>>>> intervals back from actual "dag run-at date" the "execution date" should
> >>>>> be.  (current behavior = 1, new behavior = 0).
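
To make the offset idea concrete, here is a minimal sketch in plain Python.
It is not Airflow code: `execution_date_for` is a hypothetical helper, and it
assumes a fixed-length interval expressed as a `timedelta` (cron schedules
with uneven gaps would need more care).

```python
from datetime import datetime, timedelta


def execution_date_for(run_at, interval, interval_offset=1):
    """Return the execution date for a run targeted at `run_at`.

    interval_offset mirrors the proposed
    "dag_run_execution_date_interval_offset" param:
    1 keeps the current behavior, 0 makes execution_date the run-at date.
    """
    return run_at - interval_offset * interval


run_at = datetime(2019, 8, 17)
daily = timedelta(days=1)
print(execution_date_for(run_at, daily, interval_offset=1))  # current behavior
print(execution_date_for(run_at, daily, interval_offset=0))  # proposed behavior
```

With a daily interval and a run targeted at 2019-08-17 00:00, an offset of 1
gives 2019-08-16 and an offset of 0 gives 2019-08-17.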
> >>>>>
> >>>>> *Side note*
> >>>>>
> >>>>> Hopefully not to derail discussion: I think there are additional, related
> >>>>> task attributes that may want to come into being: namely, low_watermark
> >>>>> and high_watermark.  There is the potential, with attributes like this,
> >>>>> for adding better out-of-the-box support for common data workflows that
> >>>>> we now need to use xcom for, namely incremental loads.  But I want to
> >>>>> give it more thought before proposing anything specific.
> >>>>>
> >>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <jarek.pot...@polidea.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Good one Damian. I will have a list of issues that could be handled
> >>>>>> at the workshop, so that one goes there.
> >>>>>>
> >>>>>> J.
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Aug 23, 2019 at 11:09, Shaw, Damian P. <
> >>>>>> damian.sha...@credit-suisse.com> wrote:
> >>>>>>
> >>>>>>> I can't overstate what a conceptual improvement this would be for the
> >>>>>>> end users of Airflow in our environment. I've written a lot of code so
> >>>>>>> all our configuration works like this anyway. But the UI still shows the
> >>>>>>> Airflow dates which still to this day sometimes confuse me.
> >>>>>>>
> >>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my
> >>>>>>> first PRs could be additional test cases around edge cases to do with
> >>>>>>> DST and cron scheduling that I have concerns about :)
> >>>>>>>
> >>>>>>> Damian
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Ash Berlin-Taylor [mailto:a...@apache.org]
> >>>>>>> Sent: Friday, August 23, 2019 6:50 AM
> >>>>>>> To: dev@airflow.apache.org
> >>>>>>> Subject: Setting to add choice of schedule at end or schedule at start
> >>>>>>> of interval
> >>>>>>>
> >>>>>>> This has come up a few times before, someone has now opened a PR that
> >>>>>>> makes this a global+per-dag setting:
> >>>>>>> https://github.com/apache/airflow/pull/5787 and it also includes docs
> >>>>>>> that I think do a good job of illustrating the two modes.
> >>>>>>>
> >>>>>>> Does anyone object to this being merged? If no one says anything by
> >>>>>>> midday on Tuesday I will take that as assent and will merge it.
> >>>>>>>
> >>>>>>> The docs from the PR included below.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ash
> >>>>>>>
> >>>>>>> Scheduled Time vs Execution Time
> >>>>>>> ''''''''''''''''''''''''''''''''
> >>>>>>>
> >>>>>>> A DAG with a ``schedule_interval`` will execute once per interval. By
> >>>>>>> default, the execution of a DAG will occur at the **end** of the
> >>>>>>> schedule interval.
> >>>>>>>
> >>>>>>> A few examples:
> >>>>>>>
> >>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes
> >>>>>>>   2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
> >>>>>>>   i.e. once that hour is over.
> >>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that processes
> >>>>>>>   2019-08-16 will start running shortly after 2019-08-17 00:00.
> >>>>>>>
> >>>>>>> The reasoning behind this execution vs scheduling behaviour is that
> >>>>>>> data for the interval to be processed won't be fully available until
> >>>>>>> the interval has elapsed.
> >>>>>>>
> >>>>>>> In cases where you wish the DAG to be executed at the **start** of the
> >>>>>>> interval, specify ``schedule_at_interval_end=False``, either in
> >>>>>>> ``airflow.cfg``, or on a per-DAG basis.
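
The two modes described in these docs can be sketched in plain Python. This
is only an illustration, not code from the PR: `run_start_time` is a
hypothetical helper, and it assumes a fixed-length interval (a `timedelta`)
rather than a cron expression. Only the `schedule_at_interval_end` name comes
from the docs.

```python
from datetime import datetime, timedelta


def run_start_time(execution_date, interval, schedule_at_interval_end=True):
    """Earliest time at which the run for `execution_date` may start."""
    if schedule_at_interval_end:
        # Default: wait until the whole interval has elapsed.
        return execution_date + interval
    # Alternative mode: run at the start of the interval.
    return execution_date


hourly = timedelta(hours=1)
daily = timedelta(days=1)
print(run_start_time(datetime(2019, 8, 16, 17, 0), hourly))
print(run_start_time(datetime(2019, 8, 16), daily))
print(run_start_time(datetime(2019, 8, 16), daily, schedule_at_interval_end=False))
```

In the default mode the hourly run for 2019-08-16 17:00 becomes eligible at
18:00 and the daily run for 2019-08-16 at 2019-08-17 00:00; with the setting
turned off, both become eligible at the execution date itself.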
> >>>>>>>