Can't execution date already mean different things depending on whether the
dag run was initiated via the scheduler or manually via command line/API?
I agree that making it consistent might make it easier to explain to new
users, but should we trade that for breaking pretty much every existing
dag by redefining what execution date is?
-James

On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com>
wrote:

> >
> > To Daniel’s concerns, I would argue this is not a change to what a dag run
> > is, it is rather a change to WHEN that dag run will be scheduled.
>
>
> Execution date is part of the definition of a dag_run; it is uniquely
> identified by an execution_date and dag_id.
>
> When someone asks what is a dag_run, we should be able to provide an
> answer.
>
> Imagine trying to explain what a dag run is, when execution_date can mean
> different things.
>     Admin: "A dag run is an execution_date and a dag_id".
>     New user: "Ok. Clear as a bell. What's an execution_date?"
>     Admin: "Well, it can be one of two things.  It *could* be when the dag
> will be run... but it could *also* be 'the time when dag should be run
> minus one schedule interval'.  It depends on whether you choose 'end' or
> 'start' for 'schedule_interval_edge.'  If you choose 'start' then
> execution_date means 'when dag will be run'.  If you choose 'end' then
> execution_date means 'when dag will be run minus one interval.'  If you
> change the parameter after some time, then we don't necessarily know what
> it means at all times".
>
> Why would we do this to ourselves?
>
> Alternatively, we can give dag_run a clear, unambiguous meaning:
> * dag_run is dag_id + execution_date
> * execution_date is when dag will be run (notwithstanding scheduler delay,
> queuing)
>
> Execution_date is defined as "run-at date minus 1 interval".  The
> assumption here is that your tasks care about this particular date.
> Obviously this makes sense for some tasks but not for others.
>
> I would prop
>
>
>
>
> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> wrote:
>
> > I think this is a great improvement and should be merged. To Daniel’s
> > concerns, I would argue this is not a change to what a dag run is, it is
> > rather a change to WHEN that dag run will be scheduled.
> > I had implemented a similar change in my own version but ultimately backed
> > it out so I didn’t have to patch after each new release. In my opinion the
> > main flaw in the current scheduler, and I have brought this up before, is
> > when you don’t have a consistent schedule interval (e.g. only run M-F).
> > After backing out the “schedule at interval start” change I had to switch
> > to a daily schedule and go through and put a short circuit operator in each
> > of my M-F dags to get the behavior that I wanted. This results in putting
> > scheduling logic inside the dag, when scheduling logic should be in the
> > scheduler.
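
A minimal sketch of the kind of guard the short circuit workaround above needs in each M-F dag. The function name `runs_on_weekday` is hypothetical, and the sketch assumes a fixed daily interval where the run-at time is execution_date plus one interval; in a real dag a callable like this would be handed to a short circuit operator.

```python
from datetime import datetime, timedelta


def runs_on_weekday(execution_date: datetime,
                    interval: timedelta = timedelta(days=1)) -> bool:
    """Return True when the run-at time falls on Monday through Friday.

    Assumption: the run-at time is execution_date plus one schedule interval
    (the "schedule at interval end" behaviour discussed in this thread).
    """
    run_at = execution_date + interval
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return run_at.weekday() < 5


# Thursday's interval is run on Friday 2019-08-23, so the guard passes.
thursday_ok = runs_on_weekday(datetime(2019, 8, 22))
# Friday's interval would be run on Saturday 2019-08-24, so the guard skips it.
friday_ok = runs_on_weekday(datetime(2019, 8, 23))
```

The date arithmetic here is exactly the scheduling logic that ends up living inside the dag rather than in the scheduler.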
> >
> > -James
> >
> >
> > > On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com>
> > > wrote:
> > >
> > > Re
> > >
> > >> What are people's feelings on changing the default execution to schedule
> > >> interval start
> > >
> > > and
> > >
> > >> I'm in favor of doing that, but then exposing new variables of
> > >> "interval_start" and "interval_end", etc. so that people write
> > >> clearer-looking at-a-glance DAGs
> > >
> > >
> > > While I am def on board with the spirit of this PR, I would vote we do not
> > > accept this PR as is, because it cements a confusing option.
> > >
> > > *What is the right representation of a dag run?*
> > >
> > > Right now the representation is "dag run-at date minus 1 interval".  It
> > > should just be "dag run-at date".
> > >
> > > We don't need to address the question of whether execution date is the
> > > start or the end of an interval; it doesn't matter.
> > >
> > > In all cases, a given dag run will be targeted for *some* initial "run-at
> > > time"; so *that* should be the time that is part of the PK of a dag run,
> > > and *that* is the time that should be exposed as the dag run "execution
> > > date".
> > >
> > > *Interval of interest is not a dag_run attribute*
> > >
> > > We also mix in this question of the date interval that the *tasks* are
> > > interested in.  But the *dag run* need not concern itself with this in any
> > > way.  That is for the tasks to figure out: if they happen to need "dag
> > > run-at date," then they can reference that; if they want the prior one,
> > > ask for the prior one.
> > >
> > > Previously, I was in the camp that thought it was a great idea to rename
> > > "execution_date" to "period_start" or "interval_start".  But I now think
> > > this is folly.  It invokes this question of the "interval of interest" or
> > > "period of interest".  But the dag doesn't need to know anything about
> > > that.
> > >
> > > Within the same dag you may have tasks with different intervals of
> > > interest.  So why make assumptions in the dag; just give the facts: this is
> > > my run date; this is the prior run date, etc.  Renaming would be a
> > > regression from the perspective of providing accurate names.
> > >
> > > *Proposal*
> > >
> > > So, I would propose we change "execution_date" to mean "dag run-at date" as
> > > opposed to "dag run-at date minus 1".  But we should do so without
> > > reference to interval end or interval start.
> > >
> > > *Configurability*
> > >
> > > The more configuration options we have, the more noise there is for a user
> > > trying to understand how to use airflow, so I'd rather we not make this
> > > configurable at all.
> > >
> > > That said, perhaps a clearer and more explicit means of making this
> > > configurable would be to define an integer param
> > > "dag_run_execution_date_interval_offset", which would control how many
> > > intervals back from the actual "dag run-at date" the "execution date"
> > > should be.  (current behavior = 1, new behavior = 0).
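
The offset idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `execution_date_for` and its `interval_offset` argument are hypothetical names standing in for the proposed "dag_run_execution_date_interval_offset" param, and the sketch handles fixed-length intervals only, not cron schedules.

```python
from datetime import datetime, timedelta


def execution_date_for(run_at: datetime, interval: timedelta,
                       interval_offset: int) -> datetime:
    """Step back interval_offset whole intervals from the run-at date.

    interval_offset=1 mirrors the current behaviour ("run-at date minus 1
    interval"); interval_offset=0 makes execution_date the run-at date.
    """
    if interval_offset < 0:
        raise ValueError("interval_offset must be non-negative")
    return run_at - interval_offset * interval


run_at = datetime(2019, 8, 17)          # dag run-at date
daily = timedelta(days=1)               # fixed daily interval (assumption)
current = execution_date_for(run_at, daily, 1)    # 2019-08-16 00:00
proposed = execution_date_for(run_at, daily, 0)   # 2019-08-17 00:00
```

With a single integer there is no need to talk about interval start or end at all; the dag run only states how far its execution date sits behind its run-at date.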
> > >
> > > *Side note*
> > >
> > > Hopefully not to derail discussion: I think there are additional, related
> > > task attributes that may want to come into being: namely, low_watermark and
> > > high_watermark.  There is the potential, with attributes like this, for
> > > adding better out-of-the-box support for common data workflows that we now
> > > need to use xcom for, namely incremental loads.  But I want to give it more
> > > thought before proposing anything specific.
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > > wrote:
> > >
> > >> Good one Damian. I will have a list of issues that could be handled
> > >> at the workshop, so that one goes there.
> > >>
> > >> J.
> > >>
> > >> On Fri, 23 Aug 2019 at 11:09, Shaw, Damian P. <
> > >> damian.sha...@credit-suisse.com> wrote:
> > >>
> > >>> I can't overstate what a conceptual improvement this would be for the
> > >>> end users of Airflow in our environment. I've written a lot of code so
> > >>> all our configuration works like this anyway. But the UI still shows the
> > >>> Airflow dates, which still to this day sometimes confuse me.
> > >>>
> > >>> I'll be at the NY meetups on Monday and Tuesday, maybe some of my first
> > >>> PRs could be additional test cases around edge cases to do with DST and
> > >>> cron scheduling that I have concerns about :)
> > >>>
> > >>> Damian
> > >>>
> > >>> -----Original Message-----
> > >>> From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > >>> Sent: Friday, August 23, 2019 6:50 AM
> > >>> To: dev@airflow.apache.org
> > >>> Subject: Setting to add choice of schedule at end or schedule at start
> > >>> of interval
> > >>>
> > >>> This has come up a few times before; someone has now opened a PR that
> > >>> makes this a global+per-dag setting:
> > >>> https://github.com/apache/airflow/pull/5787 and it also includes docs
> > >>> that I think do a good job of illustrating the two modes.
> > >>>
> > >>> Does anyone object to this being merged? If no one says anything by
> > >>> midday on Tuesday I will take that as assent and will merge it.
> > >>>
> > >>> The docs from the PR included below.
> > >>>
> > >>> Thanks,
> > >>> Ash
> > >>>
> > >>> Scheduled Time vs Execution Time
> > >>> ''''''''''''''''''''''''''''''''
> > >>>
> > >>> A DAG with a ``schedule_interval`` will execute once per interval. By
> > >>> default, the execution of a DAG will occur at the **end** of the
> > >>> schedule interval.
> > >>>
> > >>> A few examples:
> > >>>
> > >>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes
> > >>>   2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
> > >>>   i.e. once that hour is over.
> > >>> - A DAG with ``schedule_interval='@daily'``: The DAG run that processes
> > >>>   2019-08-16 will start running shortly after 2019-08-17 00:00.
> > >>>
> > >>> The reasoning behind this execution vs scheduling behaviour is that
> > >>> data for the interval to be processed won't be fully available until
> > >>> the interval has elapsed.
> > >>>
> > >>> In cases where you wish the DAG to be executed at the **start** of the
> > >>> interval, specify ``schedule_at_interval_end=False``, either in
> > >>> ``airflow.cfg``, or on a per-DAG basis.
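
To make the two modes in the quoted docs concrete, here is a small sketch of when a run would start under each setting. The function `scheduled_run_time` is a hypothetical helper, not part of Airflow or of the PR; only the ``schedule_at_interval_end`` name comes from the docs above, and fixed-length intervals are assumed.

```python
from datetime import datetime, timedelta


def scheduled_run_time(interval_start: datetime, interval: timedelta,
                       schedule_at_interval_end: bool = True) -> datetime:
    """Return the earliest time the run for this interval would start."""
    if schedule_at_interval_end:
        # Default mode: wait until the whole interval has elapsed.
        return interval_start + interval
    # Alternative mode: run as soon as the interval begins.
    return interval_start


hour = timedelta(hours=1)
day = timedelta(days=1)

# '@hourly' run that processes 2019-08-16 17:00
hourly_end = scheduled_run_time(datetime(2019, 8, 16, 17), hour)
hourly_start = scheduled_run_time(datetime(2019, 8, 16, 17), hour,
                                  schedule_at_interval_end=False)
# '@daily' run that processes 2019-08-16
daily_end = scheduled_run_time(datetime(2019, 8, 16), day)
```

Under the default the hourly run starts at 18:00, once the 17:00 hour is over; with the setting turned off it starts at 17:00 itself.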
> > >>>
>
