I am in favor of this change, since the start of the interval is generally
more intuitive (I have been working on Airflow for 3 years and this still
trips me up). That said, I'm not sure how I feel about allowing
customization at the DAG level instead of the cluster level, as it makes it
harder for ops to make assumptions about DAGs on the cluster. Maybe this
isn't a huge deal, given there are tools available that show you why tasks
aren't running.

I agree with Bole that we should communicate recommended migration
strategies if they can't be done automatically.

I'm not a fan of arbitrary customization of the interval via a callback. My
feeling is that it would not provide significant value and could be an ops
nightmare.
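
To make the comparison concrete, here is a rough sketch in plain Python of
what the flag decides versus what a callback would allow. This is not
Airflow code: `run_time` and `midpoint` are names I made up for
illustration, and only `schedule_at_interval_end` comes from the PR.

```python
from datetime import datetime, timedelta


def run_time(interval_start, interval_end, schedule_at_interval_end=True):
    """Hypothetical helper: earliest moment a run for the interval may start."""
    # True  -> wait until the interval is over (current default behaviour)
    # False -> run at the start of the interval (cron-like behaviour)
    return interval_end if schedule_at_interval_end else interval_start


def midpoint(interval_start, interval_end):
    """Hypothetical callback: one of the 'in-between' schedules a mapping
    function of (start, end) would make possible."""
    return interval_start + (interval_end - interval_start) / 2


# The hourly interval used in the PR docs: 2019-08-16 17:00 to 18:00.
start = datetime(2019, 8, 16, 17, 0)
end = start + timedelta(hours=1)

print(run_time(start, end, schedule_at_interval_end=True))   # 2019-08-16 18:00:00
print(run_time(start, end, schedule_at_interval_end=False))  # 2019-08-16 17:00:00
print(midpoint(start, end))                                  # 2019-08-16 17:30:00
```

With the flag, ops only ever has two answers to reason about. With a
callback, every DAG could return something different.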

On Fri, Aug 23, 2019 at 9:11 AM Jarek Potiuk <jarek.pot...@polidea.com>
wrote:

> DST: I recall problems with DST, especially when the hour goes back and the
> daily schedule time technically occurs twice on the same day, or does not
> occur at all. We have some code that arbitrarily chooses the first
> occurrence in the latter case (there was a problem that it worked
> differently in Python 3.6 vs 3.5 (!)). The case when we move forward is
> also an interesting one. I am not 100% sure it will work correctly after
> changing the scheduling mechanisms, but it's rather easy to test and there
> is no harm in adding it.
> There is DST-specific logic implemented in our next/previous run
> calculation and I imagine it could go wrong.
>
> The tests I am talking about:
>
> DagTest.test_following_previous_schedule_daily_dag_CEST_to_CET/DagTest.test_following_previous_schedule_daily_dag_CET_to_CEST.
>
> Re: arbitrary customisation/converting DAGs. I think there is no need to
> convert existing DAGs: the default behaviour remains as it is, as far as I
> understand. And this flag is much simpler to understand and reason about
> than an arbitrary function, and it corresponds to real business cases:
>
> 1) schedule_at_interval_end = True -> wait for the data to be ready for the
> interval (current/default behaviour related to processing batches of data)
> 2) schedule_at_interval_end = False -> CRON-like behaviour where we simply
> run an arbitrary operation at regular intervals (more intuitive for people
> who are used to CRON-like jobs)
>
> You can always build your schedule differently if you need something
> "in-between" IMHO.
>
> J.
>
>
>
>
> On Fri, Aug 23, 2019 at 8:44 AM James Meickle
> <jmeic...@quantopian.com.invalid> wrote:
>
> > This is a change to one of Airflow's core concepts, and it would require
> > a lot of work for existing DAGs to cut over to it. Given that, my
> > personal preference would be to allow arbitrary customization rather than
> > just a bit toggle, such as allowing a mapping function to be passed in:
> > given an interval's start date and end date, when should it be executed?
> >
> > On Fri, Aug 23, 2019 at 8:24 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> > > Happy for it as well. There are a number of cases where scheduling at
> > > the start makes more sense, and as we see, Airflow is now used in many
> > > cases where there is no need to process data from an interval and wait
> > > until that data is ready.
> > > But indeed some more tests would be great, especially for edge cases.
> > > Changing mid-air is one, but I think there should also be a test for
> > > Daylight Saving Time changes.
> > > There are some tests for DST, so they just need to be extended to cover
> > > those two different cases.
> > >
> > >
> > > J.
> > >
> > > On Fri, Aug 23, 2019 at 7:37 AM Kaxil Naik <kaxiln...@gmail.com>
> > > wrote:
> > >
> > > > Happy for this feature to be merged
> > > >
> > > > On Fri, Aug 23, 2019, 11:49 Ash Berlin-Taylor <a...@apache.org>
> > > > wrote:
> > > >
> > > > > This has come up a few times before. Someone has now opened a PR
> > > > > that makes this a global+per-dag setting:
> > > > > https://github.com/apache/airflow/pull/5787 and it also includes
> > > > > docs that I think do a good job of illustrating the two modes.
> > > > >
> > > > > Does anyone object to this being merged? If no one says anything
> > > > > by midday on Tuesday I will take that as assent and will merge it.
> > > > >
> > > > > The docs from the PR included below.
> > > > >
> > > > > Thanks,
> > > > > Ash
> > > > >
> > > > > Scheduled Time vs Execution Time
> > > > > ''''''''''''''''''''''''''''''''
> > > > >
> > > > > A DAG with a ``schedule_interval`` will execute once per interval.
> > > > > By default, the execution of a DAG will occur at the **end** of the
> > > > > schedule interval.
> > > > >
> > > > > A few examples:
> > > > >
> > > > > - A DAG with ``schedule_interval='@hourly'``: The DAG run that
> > > > > processes 2019-08-16 17:00 will start running just after
> > > > > 2019-08-16 17:59:59, i.e. once that hour is over.
> > > > > - A DAG with ``schedule_interval='@daily'``: The DAG run that
> > > > > processes 2019-08-16 will start running shortly after
> > > > > 2019-08-17 00:00.
> > > > >
> > > > > The reasoning behind this execution vs scheduling behaviour is that
> > > > > data for the interval to be processed won't be fully available
> > > > > until the interval has elapsed.
> > > > >
> > > > > In cases where you wish the DAG to be executed at the **start** of
> > > > > the interval, specify ``schedule_at_interval_end=False``, either in
> > > > > ``airflow.cfg``, or on a per-DAG basis.
> > > >
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > >
> >
>
>
>
