On May 28 2020, at 1:12 pm, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> I am for it as long as we can thoroughly test the edge cases and maybe have
> some way to visualize resulting "data intervals" in this case.
>
> I think a big part of Airflow is that it is not just a "scheduled job"
> engine, but that it works on data intervals that are defined by the
> schedules.
I agree with this, but also: it is impossible right now to express this
fairly common use case:
"At the end of every _working_ day, I want to process that day's data".
I.e. "daily", but don't process Saturday or Sunday (in the western
world; let's not forget about the Middle East, where the working week is
Sunday->Thursday).
If you tried the "naive" approach of a cron expression of "0 0 * *
MON-FRI" <https://crontab.guru/#0_0_*_*_MON-FRI> then this would run
"fine"/as expected at 00:00:01 on Wed, Thurs, Fri, Sat -- but when it
comes to the run for "Monday" (which happens at 00:00:01 on Tuesday;
let's not restart that discussion here) the "interval" will be wrong.
Yes, this can be handled with a Skip operator as the first step. But this
feels like a very common use case that we should handle natively in the
scheduler -- why go through the effort of creating a dag run and kicking
off tasks if we could know we wouldn't need it?
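
To make the interval problem with the MON-FRI expression concrete, here
is a minimal stdlib-only sketch. It does not use Airflow or croniter: the
fire times of "0 0 * * MON-FRI" are computed by hand, and it assumes that
each pair of consecutive fire times forms one data interval. The helper
names (weekday_midnights, data_intervals) are made up for illustration.

```python
from datetime import datetime, timedelta


def weekday_midnights(start, count):
    """Return the first `count` Mon-Fri midnights at or after `start`,
    i.e. the fire times of the cron expression "0 0 * * MON-FRI"."""
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if day < start:
        day += timedelta(days=1)
    fires = []
    while len(fires) < count:
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            fires.append(day)
        day += timedelta(days=1)
    return fires


def data_intervals(fire_times):
    """Pair consecutive fire times into (start, end) data intervals.

    Assumption: one interval per pair of neighbouring fire times."""
    return list(zip(fire_times, fire_times[1:]))


fires = weekday_midnights(datetime(2020, 5, 25), 6)  # Monday 25 May 2020
intervals = data_intervals(fires)
for start, end in intervals:
    print(f"{start:%a %d %b} -> {end:%a %d %b}  ({end - start})")
```

Under that assumption the first four intervals are each one day long,
while the interval that starts on Friday only closes at the next fire
time on Monday, so it spans three days and swallows the weekend.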
So yeah, I'm a big +1 for this idea, but let's work out the warts and
edge cases, and document them.
Does anyone want to try and tackle this?
This PR (which isn't mine originally, I've just rebased/updated and
changed how it's doing it slightly) may help expose this info:
https://github.com/apache/airflow/pull/9052. See
https://github.com/apache/airflow/pull/2460 for screenshots, and a lot
of the discussion.
-ash
> With regular schedules, it's fairly easy to reason about what data
> intervals Airflow works on, but with complex multi-cron job
> expressions it will become much less obvious.
>
> In the example you mentioned above, Mauricio, "every 10 min between
> 16:30 and 18:10", we will have ten 10-minute data intervals ending at
> 16:40, 16:50, 17:00, 17:10, 17:20, 17:30, 17:40, 17:50, 18:00, 18:10 and
> one ~22.3h (22 h 20 min) interval between 18:10 and 16:30 the next day.
>
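
A small stdlib-only sketch of the intervals for "every 10 min between
16:30 and 18:10". The schedule is hard-coded rather than parsed from cron
expressions, the date is arbitrary, and it assumes each pair of
consecutive fire times is one data interval. The function name
fire_times is made up for illustration.

```python
from datetime import datetime, timedelta


def fire_times(day):
    """Fire times for "every 10 min between 16:30 and 18:10" on `day`."""
    t = day.replace(hour=16, minute=30, second=0, microsecond=0)
    end = day.replace(hour=18, minute=10, second=0, microsecond=0)
    times = []
    while t <= end:
        times.append(t)
        t += timedelta(minutes=10)
    return times


day = datetime(2020, 5, 27)  # arbitrary example date
# All fire times of one day, plus the first fire time of the next day.
fires = fire_times(day) + fire_times(day + timedelta(days=1))[:1]
intervals = list(zip(fires, fires[1:]))

short = [(s, e) for s, e in intervals if e - s == timedelta(minutes=10)]
overnight = intervals[-1]

print(len(short), "ten-minute intervals")
print("overnight interval length:", overnight[1] - overnight[0])
```

With these assumptions there are ten 10-minute intervals and one
overnight interval of 22 hours 20 minutes, from 18:10 to 16:30 the next
day.
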
> This is a simple example of course and it is currently also possible for
> fixed hours in cron, but I can imagine if we introduce the capability of
> multiple cron job expressions that allow arbitrarily complex schedules,
> the schedules might be super-difficult to reason about if you start mixing
> them.
>
> I think it is fine if the users want to do it, but for the convenience
> of the users themselves, maybe there should be some way (Web UI? CLI?)
> where you can take such a schedule and see the data intervals you can
> expect to have?
>
> WDYT?
>
> J.
>
> On Thu, May 28, 2020 at 2:37 AM Shaw, Damian P. <
> damian.sha...@credit-suisse.com> wrote:
>
>> Big +1 to anything that extends the limitations of Airflow's current
>> scheduling capability.
>>
>> For me the only drawback of this is that it doesn't go far enough and
>> further additions would need to be made later. It would still be
>> difficult to express things that require updatable calendars like
>> "Every Business Day" or things which are hard to express even with
>> composable crontabs like "The first week day of the month".
>>
>> But if this is an easy win I hope it's taken seriously.
>>
>> Damian.
>>
>> -----Original Message-----
>> From: Mauricio De Diana <mdedi...@gmail.com>
>> Sent: Wednesday, May 27, 2020 14:27
>> To: dev@airflow.apache.org
>> Subject: Support for multiple cron expressions
>>
>> Hello all,
>>
>> At the moment some schedules are not possible in Airflow, for example,
>> "every 10 min between 16:30 and 18:10". Such schedules would be
>> possible if
>> Airflow supported multiple cron expressions, as described in
>> https://github.com/apache/airflow/issues/8649. In the issue, it was
>> suggested that I bring the discussion here because this may not be a
>> desirable feature.
>>
>> In terms of implementation, I gave the idea a try and I have something
>> working. For that, besides str, timedelta and relativedelta, a schedule
>> interval can also be a list of strings representing cron expressions. There
>> is a class that is a composite of croniter objects and provides the same
>> methods. It works seamlessly for one or many cron expressions, so changes
>> in the scheduler code are mostly replacing croniter with this class.
>>
>> I can create a PR if there is interest in discussing the implementation,
>> but first I would like to hear opinions about this feature. Is it an idea
>> worth following?
>>
>> Thanks,
>> Mauricio
>>
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
>