I agree with Daniel's rationale but I am also worried about backwards
compatibility as this would perhaps be the most disruptive breaking change
possible. I think maybe we should write down the different options
available to us (AIP?) and call for a vote. What does everyone think?

On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcode...@gmail.com> wrote:

> Can't execution date can already mean different things depending on if the
> dag run was initiated via the scheduler or manually via command line/API?
>  I agree that making it consistent might make it easier to explain to new
> users, but should we exchange that for breaking pretty much every existing
> dag by re-defining what execution date is?
> -James
>
> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com>
> wrote:
>
> > >
> > > To Daniel’s concerns, I would argue this is not a change to what a dag
> > run
> > > is, it is rather a change to WHEN that dag run will be scheduled.
> >
> >
> > Execution date is part of the definition of a dag_run; it is uniquely
> > identified by an execution_date and dag_id.
> >
> > When someone asks what is a dag_run, we should be able to provide an
> > answer.
> >
> > Imagine trying to explain what a dag run is, when execution_date can mean
> > different things.
> >     Admin: "A dag run is an execution_date and a dag_id".
> >     New user: "Ok. Clear as a bell. What's an execution_date?"
> >     Admin: "Well, it can be one of two things.  It *could* be when the
> dag
> > will be run... but it could *also* be 'the time when dag should be run
> > minus one schedule interval".  It depends on whether you choose 'end' or
> > 'start' for 'schedule_interval_edge.'  If you choose 'start' then
> > execution_date means 'when dag will be run'.  If you choose 'end' then
> > execution_date means 'when dag will be run minus one interval.'  If you
> > change the parameter after some time, then we don't necessarily know what
> > it means at all times".
> >
> > Why would we do this to ourselves?
> >
> > Alternatively, we can give dag_run a clear, unambiguous meaning:
> > * dag_run is dag_id + execution_date
> > * execution_date is when dag will be run (notwithstanding scheduler
> delay,
> > queuing)
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > Execution_date is defined as "run-at date minus 1 interval".  The
> > assumption in this is that you tasks care about this particular date.
> > Obviously this makes sense for some tasks but not for others.
> >
> > I would prop
> >
> >
> >
> >
> > On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> wrote:
> >
> > > I think this is a great improvement and should be merged. To Daniel’s
> > > concerns, I would argue this is not a change to what a dag run is, it
> is
> > > rather a change to WHEN that dag run will be scheduled.
> > > I had implemented a similar change in my own version but ultimately
> > backed
> > > so I didn’t have to patch after each new release. In my opinion the
> main
> > > flaw in the current scheduler, and I have brought this up before, is
> when
> > > you don’t have a consistent schedule interval (e.g. only run M-F).
> After
> > > backing out the “schedule at interval start” I had to switch to a daily
> > > schedule and go through and put a short circuit operator in each of my
> > M-F
> > > dags to get the behavior that I wanted. This results in putting
> > scheduling
> > > logic inside the dag, when scheduling logic should be in the scheduler.
> > >
> > > -James
> > >
> > >
> > > > On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com>
> > > wrote:
> > > >
> > > > Re
> > > >
> > > >> What are people's feelings on changing the default execution to
> > schedule
> > > >> interval start
> > > >
> > > > and
> > > >
> > > >> I'm in favor of doing that, but then exposing new variables of
> > > >> "interval_start" and "interval_end", etc. so that people write
> > > >> clearer-looking at-a-glance DAGs
> > > >
> > > >
> > > > While I am def on board with the spirit of this PR, I would vote we
> do
> > > not
> > > > accept this PR as is, because it cements a confusing option.
> > > >
> > > > *What is the right representation of a dag run?*
> > > >
> > > > Right now the representation is "dag run-at date minus 1 interval".
> It
> > > > should just be "dag run-at date".
> > > >
> > > > We don't need to address the question of whether execution date is
> the
> > > > start or the end of an interval; it doesn't matter.
> > > >
> > > > In all cases, a given dag run will be targeted for *some* initial
> > "run-at
> > > > time"; so *that* should be the time that is part of the PK of a dag
> > run,
> > > > and *that *is the time that should be exposed as the dag run
> "execution
> > > > date"
> > > >
> > > > *Interval of interest is not a dag_run attribute*
> > > >
> > > > We also mix in this question of the date interval that the *tasks*
> are
> > > > interested in.  But the *dag run* need not concern itself with this
> in
> > > any
> > > > way.  That is for the tasks to figure out: if they happen to need
> "dag
> > > > run-at date," then they can reference that; if they want the prior
> one,
> > > ask
> > > > for the prior one.
> > > >
> > > > Previously, I was in the camp that thought it was a great idea to
> > rename
> > > > "execution_date" to "period_start" or "interval_start".  But I now
> > think
> > > > this is folly.  It invokes this question of the "interval of
> interest"
> > or
> > > > "period of interest".  But the dag doesn't need to know anything
> about
> > > > that.
> > > >
> > > > Within the same dag you may have tasks with different intervals of
> > > > interest.  So why make assumptions in the dag; just give the facts:
> > this
> > > is
> > > > my run date; this is the prior run date, etc.  It would be a
> regression
> > > > from the perspective of providing accurate names.
> > > >
> > > > *Proposal*
> > > >
> > > > So, I would propose we change "execution_date" to mean "dag run-at
> > date"
> > > as
> > > > opposed to "dag run-at date minus 1".  But we should do so without
> > > > reference to interval end or interval start.
> > > >
> > > > *Configurability*
> > > >
> > > > The more configuration options we have, the more noise there is as a
> > user
> > > > trying to understand how to use airflow, so I'd rather us not make
> this
> > > > configurable at all.
> > > >
> > > > That said, perhaps a more clear and more explicit means making this
> > > > configurable would be to define an integer param
> > > > "dag_run_execution_date_interval_offset", which would control how
> many
> > > > intervals back from actual "dag run-at date" the "execution date"
> > should
> > > > be.  (current behavior = 1, new behavior = 0).
> > > >
> > > > *Side note*
> > > >
> > > > Hopefully not to derail discussion: I think there are additional,
> > related
> > > > task attributes that may want to come into being: namely,
> low_watermark
> > > and
> > > > high_watermark.  There is the potential, with attributes like this,
> for
> > > > adding better out-of-the-box support for common data workflows that
> we
> > > now
> > > > need to use xcom for, namely incremental loads.  But I want to give
> it
> > > more
> > > > thought before proposing anything specific.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <
> jarek.pot...@polidea.com
> > >
> > > > wrote:
> > > >
> > > >> Good one Damian. I will have a list of issues that can be possible
> to
> > > >> handle at the workshop, so that one goes there.
> > > >>
> > > >> J.
> > > >>
> > > >> Principal Software Engineer
> > > >> Phone: +48660796129
> > > >>
> > > >> pt., 23 sie 2019, 11:09 użytkownik Shaw, Damian P. <
> > > >> damian.sha...@credit-suisse.com> napisał:
> > > >>
> > > >>> I can't understate what a conceptual improvement this would be for
> > the
> > > >> end
> > > >>> users of Airflow in our environment. I've written a lot of code so
> > all
> > > >> our
> > > >>> configuration works like this anyway. But the UI still shows the
> > > Airflow
> > > >>> dates which still to this day sometimes confuse me.
> > > >>>
> > > >>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my
> > > first
> > > >>> PRs could be additional test cases around edge cases to do with DST
> > and
> > > >>> cron scheduling that I have concerns about :)
> > > >>>
> > > >>> Damian
> > > >>>
> > > >>> -----Original Message-----
> > > >>> From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > > >>> Sent: Friday, August 23, 2019 6:50 AM
> > > >>> To: dev@airflow.apache.org
> > > >>> Subject: Setting to add choice of schedule at end or schedule at
> > start
> > > of
> > > >>> interval
> > > >>>
> > > >>> This has come up a few times before, someone has now opened a PR
> that
> > > >>> makes this a global+per-dag setting:
> > > >>> https://github.com/apache/airflow/pull/5787 and it also includes
> > docs
> > > >>> that I think does a good job of illustrating the two modes.
> > > >>>
> > > >>> Does anyone object to this being merged? If no one says anything by
> > > >> midday
> > > >>> on Tuesday I will take that as assent and will merge it.
> > > >>>
> > > >>> The docs from the PR included below.
> > > >>>
> > > >>> Thanks,
> > > >>> Ash
> > > >>>
> > > >>> Scheduled Time vs Execution Time
> > > >>> ''''''''''''''''''''''''''''''''
> > > >>>
> > > >>> A DAG with a ``schedule_interval`` will execute once per interval.
> By
> > > >>> default, the execution of a DAG will occur at the **end** of the
> > > >>> schedule interval.
> > > >>>
> > > >>> A few examples:
> > > >>>
> > > >>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that
> > > processes
> > > >>> 2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
> > > >>> i.e. once that hour is over.
> > > >>> - A DAG with ``schedule_interval='@daily'``: The DAG run that
> > processes
> > > >>> 2019-08-16 will start running shortly after 2019-08-17 00:00.
> > > >>>
> > > >>> The reasoning behind this execution vs scheduling behaviour is that
> > > >>> data for the interval to be processed won't be fully available
> until
> > > >>> the interval has elapsed.
> > > >>>
> > > >>> In cases where you wish the DAG to be executed at the **start** of
> > the
> > > >>> interval, specify ``schedule_at_interval_end=False``, either in
> > > >>> ``airflow.cfg``, or on a per-DAG basis.
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > >
> >
> ===============================================================================
> > > >>>
> > > >>> Please access the attached hyperlink for an important electronic
> > > >>> communications disclaimer:
> > > >>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> > > >>>
> > > >>
> > >
> >
> ===============================================================================
> > > >>>
> > > >>>
> > > >>
> > >
> >
>

Reply via email to