Execution date is execution date for a dag run no matter what. There is no end 
interval or start interval for a dag run. The only time this is relevant is 
when we calculate the next or previous dagrun. 

So I don't think Daniel's rationale makes sense (?)

> On 27 Aug 2019, at 17:40, Philippe Gagnon <philgagn...@gmail.com> wrote:
> 
> I agree with Daniel's rationale but I am also worried about backwards
> compatibility as this would perhaps be the most disruptive breaking change
> possible. I think maybe we should write down the different options
> available to us (AIP?) and call for a vote. What does everyone think?
> 
>> On Tue, Aug 27, 2019 at 9:25 AM James Coder <jcode...@gmail.com> wrote:
>> 
>> Can't execution date already mean different things depending on whether the
>> dag run was initiated via the scheduler or manually via command line/API?
>> I agree that making it consistent might make it easier to explain to new
>> users, but should we exchange that for breaking pretty much every existing
>> dag by re-defining what execution date is?
>> -James
>> 
>> On Mon, Aug 26, 2019 at 11:12 PM Daniel Standish <dpstand...@gmail.com>
>> wrote:
>> 
>>>> 
>>>> To Daniel’s concerns, I would argue this is not a change to what a dag run
>>>> is, it is rather a change to WHEN that dag run will be scheduled.
>>> 
>>> 
>>> Execution date is part of the definition of a dag_run; it is uniquely
>>> identified by an execution_date and dag_id.
>>> 
>>> When someone asks what is a dag_run, we should be able to provide an
>>> answer.
>>> 
>>> Imagine trying to explain what a dag run is, when execution_date can mean
>>> different things.
>>>    Admin: "A dag run is an execution_date and a dag_id".
>>>    New user: "Ok. Clear as a bell. What's an execution_date?"
>>>    Admin: "Well, it can be one of two things.  It *could* be when the dag
>>> will be run... but it could *also* be 'the time when dag should be run
>>> minus one schedule interval'.  It depends on whether you choose 'end' or
>>> 'start' for 'schedule_interval_edge.'  If you choose 'start' then
>>> execution_date means 'when dag will be run'.  If you choose 'end' then
>>> execution_date means 'when dag will be run minus one interval.'  If you
>>> change the parameter after some time, then we don't necessarily know what
>>> it means at all times".
>>> 
>>> Why would we do this to ourselves?
>>> 
>>> Alternatively, we can give dag_run a clear, unambiguous meaning:
>>> * dag_run is dag_id + execution_date
>>> * execution_date is when dag will be run (notwithstanding scheduler delay,
>>> queuing)
>>> 
>>> Execution_date is defined as "run-at date minus 1 interval".  The
>>> assumption in this is that your tasks care about this particular date.
>>> Obviously this makes sense for some tasks but not for others.
>>> 
>>> I would prop
>>> 
>>> 
>>> 
>>> 
>>>> On Sat, Aug 24, 2019 at 5:08 AM James Coder <jcode...@gmail.com> wrote:
>>>> 
>>>> I think this is a great improvement and should be merged. To Daniel’s
>>>> concerns, I would argue this is not a change to what a dag run is, it is
>>>> rather a change to WHEN that dag run will be scheduled.
>>>> I had implemented a similar change in my own version but ultimately backed
>>>> it out so I didn’t have to patch after each new release. In my opinion the
>>>> main flaw in the current scheduler, and I have brought this up before, is
>>>> when you don’t have a consistent schedule interval (e.g. only run M-F).
>>>> After backing out the “schedule at interval start” I had to switch to a
>>>> daily schedule and go through and put a short circuit operator in each of
>>>> my M-F dags to get the behavior that I wanted. This results in putting
>>>> scheduling logic inside the dag, when scheduling logic should be in the
>>>> scheduler.
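
A minimal sketch of the irregular-interval case described above, assuming end-of-interval triggering. The helper `next_weekday_midnight` is hypothetical (it is not an Airflow function) and only stands in for a Monday-to-Friday midnight schedule. Under those assumptions, the run whose execution date is a Friday is not triggered until the following Monday:

```python
from datetime import datetime, timedelta


def next_weekday_midnight(moment):
    """Hypothetical helper: next Mon-Fri midnight strictly after `moment`."""
    candidate = datetime(moment.year, moment.month, moment.day) + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


friday = datetime(2019, 8, 23)  # execution date of the Friday run
# With end-of-interval triggering, that run starts at the next schedule point.
print(next_weekday_midnight(friday))  # 2019-08-26 00:00:00 (Monday)
```
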
>>>> 
>>>> -James
>>>> 
>>>> 
>>>>> On Aug 23, 2019, at 3:14 PM, Daniel Standish <dpstand...@gmail.com> wrote:
>>>>> 
>>>>> Re
>>>>> 
>>>>>> What are people's feelings on changing the default execution to schedule
>>>>>> interval start
>>>>> 
>>>>> and
>>>>> 
>>>>>> I'm in favor of doing that, but then exposing new variables of
>>>>>> "interval_start" and "interval_end", etc. so that people write
>>>>>> clearer-looking at-a-glance DAGs
>>>>> 
>>>>> 
>>>>> While I am def on board with the spirit of this PR, I would vote we do not
>>>>> accept this PR as is, because it cements a confusing option.
>>>>> 
>>>>> *What is the right representation of a dag run?*
>>>>> 
>>>>> Right now the representation is "dag run-at date minus 1 interval".  It
>>>>> should just be "dag run-at date".
>>>>> 
>>>>> We don't need to address the question of whether execution date is the
>>>>> start or the end of an interval; it doesn't matter.
>>>>> 
>>>>> In all cases, a given dag run will be targeted for *some* initial "run-at
>>>>> time"; so *that* should be the time that is part of the PK of a dag run,
>>>>> and *that* is the time that should be exposed as the dag run "execution
>>>>> date".
>>>>> 
>>>>> *Interval of interest is not a dag_run attribute*
>>>>> 
>>>>> We also mix in this question of the date interval that the *tasks* are
>>>>> interested in.  But the *dag run* need not concern itself with this in any
>>>>> way.  That is for the tasks to figure out: if they happen to need "dag
>>>>> run-at date," then they can reference that; if they want the prior one,
>>>>> ask for the prior one.
>>>>> 
>>>>> Previously, I was in the camp that thought it was a great idea to rename
>>>>> "execution_date" to "period_start" or "interval_start".  But I now think
>>>>> this is folly.  It invokes this question of the "interval of interest" or
>>>>> "period of interest".  But the dag doesn't need to know anything about
>>>>> that.
>>>>> 
>>>>> Within the same dag you may have tasks with different intervals of
>>>>> interest.  So why make assumptions in the dag; just give the facts: this is
>>>>> my run date; this is the prior run date, etc.  It would be a regression
>>>>> from the perspective of providing accurate names.
>>>>> 
>>>>> *Proposal*
>>>>> 
>>>>> So, I would propose we change "execution_date" to mean "dag run-at date" as
>>>>> opposed to "dag run-at date minus 1".  But we should do so without
>>>>> reference to interval end or interval start.
>>>>> 
>>>>> *Configurability*
>>>>> 
>>>>> The more configuration options we have, the more noise there is as a user
>>>>> trying to understand how to use airflow, so I'd rather us not make this
>>>>> configurable at all.
>>>>> 
>>>>> That said, perhaps a more clear and more explicit means of making this
>>>>> configurable would be to define an integer param
>>>>> "dag_run_execution_date_interval_offset", which would control how many
>>>>> intervals back from actual "dag run-at date" the "execution date" should
>>>>> be.  (current behavior = 1, new behavior = 0).
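
A rough sketch of how such an offset could behave, assuming a fixed-length interval. The function `execution_date_for` and its `interval_offset` argument are hypothetical illustrations of the proposed "dag_run_execution_date_interval_offset" param; nothing here exists in Airflow:

```python
from datetime import datetime, timedelta


def execution_date_for(run_at, interval, interval_offset):
    """Execution date lying `interval_offset` intervals before `run_at`."""
    return run_at - interval_offset * interval


run_at = datetime(2019, 8, 17)  # when the dag run is actually targeted to start
one_day = timedelta(days=1)

print(execution_date_for(run_at, one_day, 1))  # 2019-08-16 00:00:00 (current)
print(execution_date_for(run_at, one_day, 0))  # 2019-08-17 00:00:00 (proposed)
```

With an offset of 0 the execution date is simply the run-at date, which is the meaning argued for in the proposal above.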
>>>>> 
>>>>> *Side note*
>>>>> 
>>>>> Hopefully not to derail discussion: I think there are additional, related
>>>>> task attributes that may want to come into being: namely, low_watermark and
>>>>> high_watermark.  There is the potential, with attributes like this, for
>>>>> adding better out-of-the-box support for common data workflows that we now
>>>>> need to use xcom for, namely incremental loads.  But I want to give it more
>>>>> thought before proposing anything specific.
>>>>> 
>>>>> On Fri, Aug 23, 2019 at 9:42 AM Jarek Potiuk <jarek.pot...@polidea.com>
>>>>> wrote:
>>>>> 
>>>>>> Good one Damian. I will have a list of issues that could be handled at
>>>>>> the workshop, so that one goes there.
>>>>>> 
>>>>>> J.
>>>>>> 
>>>>>> On Fri, 23 Aug 2019 at 11:09, Shaw, Damian P. <
>>>>>> damian.sha...@credit-suisse.com> wrote:
>>>>>> 
>>>>>>> I can't overstate what a conceptual improvement this would be for the end
>>>>>>> users of Airflow in our environment. I've written a lot of code so all our
>>>>>>> configuration works like this anyway. But the UI still shows the Airflow
>>>>>>> dates which still to this day sometimes confuse me.
>>>>>>> 
>>>>>>> I'll be at the NY meet ups on Monday and Tuesday, maybe some of my first
>>>>>>> PRs could be additional test cases around edge cases to do with DST and
>>>>>>> cron scheduling that I have concerns about :)
>>>>>>> 
>>>>>>> Damian
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Ash Berlin-Taylor [mailto:a...@apache.org]
>>>>>>> Sent: Friday, August 23, 2019 6:50 AM
>>>>>>> To: dev@airflow.apache.org
>>>>>>> Subject: Setting to add choice of schedule at end or schedule at start of
>>>>>>> interval
>>>>>>> 
>>>>>>> This has come up a few times before, someone has now opened a PR that
>>>>>>> makes this a global+per-dag setting:
>>>>>>> https://github.com/apache/airflow/pull/5787 and it also includes docs
>>>>>>> that I think do a good job of illustrating the two modes.
>>>>>>> 
>>>>>>> Does anyone object to this being merged? If no one says anything by midday
>>>>>>> on Tuesday I will take that as assent and will merge it.
>>>>>>> 
>>>>>>> The docs from the PR included below.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ash
>>>>>>> 
>>>>>>> Scheduled Time vs Execution Time
>>>>>>> ''''''''''''''''''''''''''''''''
>>>>>>> 
>>>>>>> A DAG with a ``schedule_interval`` will execute once per interval. By
>>>>>>> default, the execution of a DAG will occur at the **end** of the
>>>>>>> schedule interval.
>>>>>>> 
>>>>>>> A few examples:
>>>>>>> 
>>>>>>> - A DAG with ``schedule_interval='@hourly'``: The DAG run that processes
>>>>>>>   2019-08-16 17:00 will start running just after 2019-08-16 17:59:59,
>>>>>>>   i.e. once that hour is over.
>>>>>>> - A DAG with ``schedule_interval='@daily'``: The DAG run that processes
>>>>>>>   2019-08-16 will start running shortly after 2019-08-17 00:00.
>>>>>>> 
>>>>>>> The reasoning behind this execution vs scheduling behaviour is that
>>>>>>> data for the interval to be processed won't be fully available until
>>>>>>> the interval has elapsed.
>>>>>>> 
>>>>>>> In cases where you wish the DAG to be executed at the **start** of the
>>>>>>> interval, specify ``schedule_at_interval_end=False``, either in
>>>>>>> ``airflow.cfg``, or on a per-DAG basis.
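
The two modes in the docs excerpt above can be sketched in plain Python. This is only an illustration under stated assumptions: `trigger_time` is a hypothetical helper, the interval is treated as a fixed `timedelta` (cron schedules need not be fixed-length), and only the flag name `schedule_at_interval_end` comes from the PR docs:

```python
from datetime import datetime, timedelta


def trigger_time(execution_date, interval, schedule_at_interval_end=True):
    """Earliest moment the run for `execution_date` would be triggered."""
    if schedule_at_interval_end:
        # Default mode: wait until the interval being processed has elapsed.
        return execution_date + interval
    # Start-of-interval mode: trigger at the execution date itself.
    return execution_date


hourly = trigger_time(datetime(2019, 8, 16, 17), timedelta(hours=1))
daily = trigger_time(datetime(2019, 8, 16), timedelta(days=1))
at_start = trigger_time(datetime(2019, 8, 16), timedelta(days=1),
                        schedule_at_interval_end=False)

print(hourly)    # 2019-08-16 18:00:00
print(daily)     # 2019-08-17 00:00:00
print(at_start)  # 2019-08-16 00:00:00
```
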
>>>>>>> 