Re

> What are people's feelings on changing the default execution to schedule
> interval start

 and

> I'm in favor of doing that, but then exposing new variables of
> "interval_start" and "interval_end", etc. so that people write
> clearer-looking at-a-glance DAGs


While I am definitely on board with the spirit of this PR, I would vote that
we not accept it as is, because it cements a confusing option.

*What is the right representation of a dag run?*

Right now the representation is "dag run-at date minus 1 interval".  It
should just be "dag run-at date".

We don't need to address the question of whether execution date is the
start or the end of an interval; it doesn't matter.

In all cases, a given dag run will be targeted for *some* initial "run-at
time"; so *that* should be the time that is part of the PK of a dag run,
and *that* is the time that should be exposed as the dag run "execution
date".

*Interval of interest is not a dag_run attribute*

We also mix in this question of the date interval that the *tasks* are
interested in.  But the *dag run* need not concern itself with this in any
way.  That is for the tasks to figure out: if they happen to need "dag
run-at date," then they can reference that; if they want the prior one, ask
for the prior one.

Previously, I was in the camp that thought it was a great idea to rename
"execution_date" to "period_start" or "interval_start".  But I now think
this is folly.  It invokes this question of the "interval of interest" or
"period of interest".  But the dag doesn't need to know anything about
that.

Within the same dag you may have tasks with different intervals of
interest.  So why make assumptions in the dag?  Just give the facts: this is
my run date; this is the prior run date, etc.  Renaming to an interval-based
name would be a regression from the perspective of providing accurate names.
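
To illustrate, here is a rough sketch in plain Python (not Airflow code) of two
tasks in the same dag deriving different intervals of interest from the same
facts.  The names RunFacts, run_date, prior_run_date, since_prior_run and
trailing_window are all hypothetical and exist only for this example.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class RunFacts(NamedTuple):
    """Facts a dag run could expose (hypothetical names)."""
    run_date: datetime
    prior_run_date: datetime


def since_prior_run(facts):
    """Window for a task that processes everything since the prior run."""
    return facts.prior_run_date, facts.run_date


def trailing_window(facts, days):
    """Window for a task that wants a trailing N-day lookback instead."""
    return facts.run_date - timedelta(days=days), facts.run_date


facts = RunFacts(
    run_date=datetime(2019, 8, 17),
    prior_run_date=datetime(2019, 8, 16),
)

# Same dag run, two different intervals of interest.
incremental = since_prior_run(facts)
weekly_report = trailing_window(facts, days=7)
```

The dag run only states when it runs and when the prior run was; each task
decides for itself which window it cares about.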

*Proposal*

So, I would propose we change "execution_date" to mean "dag run-at date" as
opposed to "dag run-at date minus 1".  But we should do so without
reference to interval end or interval start.

*Configurability*

The more configuration options we have, the more noise there is for a user
trying to understand how to use Airflow, so I'd rather we not make this
configurable at all.

That said, perhaps a clearer and more explicit means of making this
configurable would be to define an integer param
"dag_run_execution_date_interval_offset", which would control how many
intervals back from the actual "dag run-at date" the "execution date" should
be (current behavior = 1, new behavior = 0).
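
As a minimal sketch of what that offset would mean, in plain Python rather
than Airflow code, and assuming a fixed-length interval expressed as a
timedelta (cron schedules and DST would need more care).  The helper name
execution_date_for is hypothetical; only the param name comes from the
suggestion above.

```python
from datetime import datetime, timedelta


def execution_date_for(run_at, interval, offset):
    """Return the date `offset` intervals before `run_at`.

    Hypothetical helper: `offset` plays the role of the suggested
    dag_run_execution_date_interval_offset param.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return run_at - offset * interval


run_at = datetime(2019, 8, 17)  # a daily run that starts at 2019-08-17 00:00
daily = timedelta(days=1)

current_behavior = execution_date_for(run_at, daily, offset=1)   # 2019-08-16
proposed_behavior = execution_date_for(run_at, daily, offset=0)  # 2019-08-17
```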

*Side note*

Hopefully not to derail discussion: I think there are additional, related
task attributes that may want to come into being: namely, low_watermark and
high_watermark.  There is the potential, with attributes like this, for
adding better out-of-the-box support for common data workflows that we now
need to use xcom for, namely incremental loads.  But I want to give it more
thought before proposing anything specific.






On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <jarek.pot...@polidea.com>
wrote:

> Good one, Damian. I will have a list of issues that could be handled at
> the workshop, so that one goes there.
>
> J.
>
> Principal Software Engineer
> Phone: +48660796129
>
> On Fri, 23 Aug 2019, 11:09, Shaw, Damian P. <
> damian.sha...@credit-suisse.com> wrote:
>
> > I can't overstate what a conceptual improvement this would be for the end
> > users of Airflow in our environment. I've written a lot of code so all our
> > configuration works like this anyway. But the UI still shows the Airflow
> > dates, which to this day still sometimes confuse me.
> >
> > I'll be at the NY meet ups on Monday and Tuesday, maybe some of my first
> > PRs could be additional test cases around edge cases to do with DST and
> > cron scheduling that I have concerns about :)
> >
> > Damian
> >
> > -----Original Message-----
> > From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > Sent: Friday, August 23, 2019 6:50 AM
> > To: dev@airflow.apache.org
> > Subject: Setting to add choice of schedule at end or schedule at start of
> > interval
> >
> > This has come up a few times before. Someone has now opened a PR that
> > makes this a global+per-dag setting:
> > https://github.com/apache/airflow/pull/5787 and it also includes docs
> > that I think do a good job of illustrating the two modes.
> >
> > Does anyone object to this being merged? If no one says anything by midday
> > on Tuesday I will take that as assent and will merge it.
> >
> > The docs from the PR included below.
> >
> > Thanks,
> > Ash
> >
> > Scheduled Time vs Execution Time
> > ''''''''''''''''''''''''''''''''
> >
> > A DAG with a ``schedule_interval`` will execute once per interval. By
> > default, the execution of a DAG will occur at the **end** of the
> > schedule interval.
> >
> > A few examples:
> >
> > - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes
> > 2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
> > i.e. once that hour is over.
> > - A DAG with ``schedule_interval='@daily'``: The DAG run that processes
> > 2019-08-16 will start running shortly after 2019-08-17 00:00.
> >
> > The reasoning behind this execution vs scheduling behaviour is that
> > data for the interval to be processed won't be fully available until
> > the interval has elapsed.
> >
> > In cases where you wish the DAG to be executed at the **start** of the
> > interval, specify ``schedule_at_interval_end=False``, either in
> > ``airflow.cfg``, or on a per-DAG basis.
> >
>
