My thoughts (no final solution in mind, just wild thoughts):

1) I think we should add support for regular CRON behaviour. Simply "cron"
schedule for dag execution, without the "data interval" rhetoric.

There are a number of good cases where Airflow can be used as just a
scheduler to run jobs. This should be akin to CI jobs -> either trigger
the run on some event (trigger) or at regular intervals, but each run
should not be tied to a particular "data interval" - which means that
backfill, re-running, idempotency of runs etc. will not be applicable.
These should IMHO even be a different type of DAG, treated differently
in the UI (for example, every rerun should result in a NEW run rather
than a repetition of the previous run for a specific interval). I think
we should very clearly distinguish it from the "Data interval" kind -
maybe even the base class should be called differently for those
(CronDAG)? It should be very clear what kind of DAG you have when you
look at it, both in the code and in the UI.
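
To make the difference concrete, here is a rough sketch in plain Python
(not Airflow code). The names Run, data_interval_run and plain_cron_run
are made up for illustration only. It assumes the "data interval" model
starts a run once its interval has closed, while a plain cron run starts
at the tick itself and carries no interval at all.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Run(NamedTuple):
    """Hypothetical record of one run; not an Airflow class."""
    runs_at: datetime                    # wall-clock moment the run is started
    interval_start: Optional[datetime]   # None when no data interval is attached
    interval_end: Optional[datetime]


def data_interval_run(tick: datetime, period: timedelta) -> Run:
    """Data-interval model: the run covers [tick, tick + period) and only
    starts once that interval has closed."""
    return Run(runs_at=tick + period, interval_start=tick, interval_end=tick + period)


def plain_cron_run(tick: datetime) -> Run:
    """Plain cron model: the run starts at the tick, with no interval."""
    return Run(runs_at=tick, interval_start=None, interval_end=None)


tick = datetime(2021, 1, 20, 4, 5)
print(data_interval_run(tick, timedelta(days=1)))  # starts a day after the tick
print(plain_cron_run(tick))                        # starts at the tick
```

With no interval attached, a rerun of the second kind has nothing to
"repeat", which is why it would naturally show up as a new run.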

2) Get rid of CRON in the "Data Interval" (i.e. all current DAGs!). This
might be bold, but I think it might be best.

It is very confusing that we use the CRON syntax but not the CRON
execution model. I think this is a major source of confusion among
users. The current way of specifying the schedule should be deprecated
and either dropped in 3.0 or automatically converted to a new form.
To that end, I am all for Elad's proposal of using a python function
(with a predefined set of parameterizable ones expressing intervals, not
start/end times). The CRON specification is the only part that is
declarative rather than imperative in Airflow. All other stuff is python
code. Heck, why not the schedule? It of course has a number of problems
to solve (largely optimisations in the scheduler, which needs to look
ahead and plan scheduling in the future), but it is all solvable IMHO.
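
As a rough illustration of what "schedule as a python function" could
look like, here is a sketch for Case 2 from Ash's mail quoted below (one
run per weekday, started when that day ends). ScheduledRun and
next_weekday_run are hypothetical names, not an existing or proposed
Airflow API, and the sketch assumes naive datetimes in a single timezone.

```python
from datetime import datetime, time, timedelta
from typing import NamedTuple


class ScheduledRun(NamedTuple):
    """Hypothetical result type: the data interval plus when to run."""
    interval_start: datetime
    interval_end: datetime
    run_at: datetime


def next_weekday_run(after: datetime) -> ScheduledRun:
    """Return the first weekday run whose run_at is strictly after `after`.

    Each run covers one full weekday (Mon..Fri) and is started at the
    midnight that closes that day, so Friday's data runs on Saturday 00:00.
    """
    day = after.date()
    # date.weekday(): Monday is 0 ... Saturday is 5, Sunday is 6.
    while day.weekday() >= 5:
        day += timedelta(days=1)
    start = datetime.combine(day, time.min)
    end = start + timedelta(days=1)
    return ScheduledRun(interval_start=start, interval_end=end, run_at=end)


# Friday 22 Jan 2021, midday: Friday's data is processed as soon as Friday ends.
friday_run = next_weekday_run(datetime(2021, 1, 22, 12, 0))
print(friday_run)

# Asking again from that run time skips the weekend and lands on Monday's data.
print(next_weekday_run(friday_run.run_at))
```

The scheduler would only need to call such a function repeatedly to look
ahead, which is where the optimisation problems mentioned above come in.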

J.

On Wed, Jan 20, 2021 at 6:20 PM Deng Xiaodong <xd.den...@gmail.com> wrote:

> A quick thought (*maybe not making sense*): if *schedule_interval* accepts
> a list of values, we may support much higher complexity.
>
> For example, I may want to schedule my jobs every day at 04:05 AND 02:31,
> which cannot be expressed by a single cron pattern. Then I may want to
> have *schedule_interval = ["5 4 * * *", "31 2 * * *"]*.
>
> Maybe I missed something or the idea doesn't make sense. Please let me
> know.
>
>
> XD
>
> On Wed, Jan 20, 2021 at 6:09 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
>> Yes, we quite possibly could do this -- I'm trying to work out what the
>> needs are here.
>>
>> In the example of a twice-a-month dag (not sure if you have this use
>> case too?) what do you expect the "data interval" (i.e. execution_date) to
>> be?
>>
>> Or for this case does it not matter?
>>
>> -ash
>>
>>
>> On Wed, 20 Jan, 2021 at 19:06, Elad Kalif <elad...@gmail.com> wrote:
>>
>> Another case that is mentioned in one of the issues is the ability to
>> schedule a bi-weekly job (equivalent of bi-weekly meeting that you can set
>> in a calendar) which is very much needed.
>>
>> Maybe this is unrealistic, but I think the game changer would be letting
>> users define their own logic and having Airflow use it to schedule DAGs.
>> My thought here is: if I can define the logic in a python function
>> (regardless of what this logic is), can't Airflow utilize it?
>>
>> On Wed, Jan 20, 2021 at 5:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to (re)start the discussion about a new feature I'd like to add
>>> for Airflow 2.1, that I am loosely calling "improving schedule_interval"
>>> (catchy name I know!)
>>>
>>> I have two main high-level goals in mind here:
>>>
>>> 1. To reduce the confusion around execution_date (specifically the
>>> naming of the parameter!) - the whole start vs end discussion.
>>> 2. To support more complex schedules.
>>>
>>> Previous thread on point 1 here:
>>> https://lists.apache.org/thread.html/2b12ae265795ff2e655a5161c972f5c7bbe60722a12849a0e2c5c55f%40%3Cdev.airflow.apache.org%3E
>>> (but I'm taking a bit of a step back from that to think if there's a bigger
>>> change we could make that encompasses this)
>>>
>>>
>>> I don't yet have a concrete plan, nor implementation in mind, but I'd
>>> like to start collecting people's "wish list" when it comes to scheduling
>>> DAGs:
>>>
>>> - What do you wish you could express natively in terms of scheduling
>>> your DAGs? (I.e. without using "hacks" such as date sensor/skip tasks at
>>> start)
>>> - What schedules do you wish you could express now, that you just can't?
>>> - Do you have example workflows that show where you want to schedule
>>> at the start? Follow up question: do you also want this to be
>>> different for different DAGs in your Airflow install?
>>>
>>>
>>> Existing issues:
>>> https://github.com/apache/airflow/issues/8649 "Add support for more
>>> than 1 cron exp per DAG"
>>> https://github.com/apache/airflow/issues/10194 "Ability to better
>>> support odd scheduling time"
>>> https://github.com/apache/airflow/issues/10449 "Dynamic Schedule
>>> Intervals"
>>> https://github.com/apache/airflow/issues/10123 "Job Schedule Interval
>>> on 2nd & 4th Tuesday"
>>>
>>> I'll start:
>>>
>>> Case1:
>>>
>>> One example that came up recently in slack was an actual astronomer
>>> wanting a DAG to run with a schedule of "@sunset"! This also brings up the
>>> subject of "running dags at interval start or end"
>>>
>>> Case2:
>>>
>>> I'd like to be able to run a daily process at the end of each weekday,
>>> i.e. to process data for Monday..Friday. The naive way of expressing this
>>> would be "0 0 * * MON-FRI", but that means that the dags would run Tuesday,
>>> Wednesday, Thursday, Friday, Monday -- meaning Friday's data isn't
>>> processed until Monday!
>>>
>>> My thought on this is that we need to separate the schedule interval (when
>>> to run a task) from the period duration (i.e. look at one day's worth of data).
>>>
>>> Thanks,
>>> Ash
>>>
>>>
>>>
>>>

-- 
+48 660 796 129
