@Jarek the problem with just cron is I don’t think cron can handle “every third 
Thursday” or “the next open market day after the 15th.” I think we need 
something more flexible than just cron (though agree that cron syntax can get a 
fair bit of mileage)
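
To make those two cases concrete, here is a rough sketch in plain Python (standard library only, nothing Airflow-specific). It assumes "every third Thursday" means the third Thursday of each month, and that an "open market day" is any weekday not in a caller-supplied holiday set. The names `nth_weekday` and `next_open_day_after` are made up for illustration.

```python
import calendar
from datetime import date, timedelta


def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """Return the n-th (1-based) occurrence of `weekday` in the given month."""
    first = date(year, month, 1)
    # Days from the 1st of the month to the first occurrence of `weekday`.
    offset = (weekday - first.weekday()) % 7
    day = 1 + offset + 7 * (n - 1)
    if day > calendar.monthrange(year, month)[1]:
        raise ValueError("the month has no such occurrence of that weekday")
    return date(year, month, day)


def next_open_day_after(d: date, holidays=frozenset()) -> date:
    """Return the first Mon-Fri date strictly after `d` that is not a holiday."""
    candidate = d + timedelta(days=1)
    while candidate.weekday() >= 5 or candidate in holidays:
        candidate += timedelta(days=1)
    return candidate


third_thursday = nth_weekday(2021, 1, calendar.THURSDAY, 3)
open_day = next_open_day_after(date(2021, 1, 15), frozenset({date(2021, 1, 18)}))
```

Neither rule depends only on fixed minute/hour/day fields; the second one also needs an external holiday calendar.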

On Wed, Jan 20, 2021 at 10:24 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
My thoughts (no final solution in mind, just wild thoughts):
1) I think we should add support for regular CRON behaviour. Simply a "cron" 
schedule for DAG execution, without the "data interval" rhetoric.
There are a number of good cases where Airflow can be used as just a scheduler 
to run jobs. This should be akin to CI jobs -> either trigger the run on 
some event (trigger) or at regular intervals, but each run should not be tied 
to a particular "data interval" - which means that the whole backfill, 
re-running, idempotency of runs etc. will not be applicable. This should 
IMHO even be a different type of DAG, treated differently in the UI (for example 
every rerun should result in a NEW run rather than a repetition of the previous 
run for a specific interval). I think we should very, very clearly distinguish 
it from the "Data interval" kind - maybe even the base class should be called 
differently for those (CronDAG)? It should be very, very clear what kind of 
DAG you have when you look at it, both in the code and in the UI.
2) Get rid of CRON in the "Data Interval" (i.e. all current DAGs!). This might 
be bold, but I think it might be best.
It is very confusing that we are using the CRON syntax but not the CRON 
execution model. I think this is a major source of confusion among users. The 
current way of specifying the schedule should be deprecated and dropped in 3.0, 
or automatically converted to a new form. To that end, I am all for Elad's 
proposal of using a Python function (with a predefined set of parameterizable 
ones expressing intervals, not start/end times). The CRON specification part is 
the only part that is declarative rather than imperative in Airflow. All the 
other stuff is Python code. Heck, why not the schedule? It has of course a 
number of problems to solve (largely optimisations in the scheduler, which needs 
to look ahead and plan scheduling in the future), but it is all solvable IMHO.
J.
On Wed, Jan 20, 2021 at 6:20 PM Deng Xiaodong <xd.den...@gmail.com> wrote:
A quick thought (maybe not making sense): if schedule_interval accepted a list 
of values, we could support much higher complexity.
For example, I may want to schedule my jobs every day at 04:05 AND 02:31, 
which cannot be expressed by a single cron pattern. Then I may want to have 
schedule_interval = ["5 4 * * *", "31 2 * * *"].
Maybe I missed something or the idea doesn't make sense. Please let me know.
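
A rough sketch of how such a list could be evaluated: compute the next run for each expression and take the earliest. This is not how Airflow handles schedule_interval. The helper names are made up, and the toy parser below only understands daily expressions of the form "M H * * *" (a real implementation would need a full cron parser).

```python
from datetime import datetime, timedelta
from typing import Iterable


def next_daily_run(expr: str, after: datetime) -> datetime:
    """Next run strictly after `after` for an expression of the form 'M H * * *'."""
    minute, hour, *rest = expr.split()
    if rest != ["*", "*", "*"]:
        raise ValueError("this sketch only handles 'M H * * *'")
    candidate = after.replace(
        hour=int(hour), minute=int(minute), second=0, microsecond=0
    )
    if candidate <= after:
        # Today's slot has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def next_run(exprs: Iterable[str], after: datetime) -> datetime:
    """Earliest next run across all expressions in the list."""
    return min(next_daily_run(expr, after) for expr in exprs)


schedule = ["5 4 * * *", "31 2 * * *"]
first = next_run(schedule, datetime(2021, 1, 20, 3, 0))
second = next_run(schedule, first)
```

Starting from 03:00 on Jan 20, `first` is 04:05 the same day and `second` is 02:31 the next day, so the gaps between consecutive runs are uneven. That is where the question of what the "data interval" should be comes in.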

XD
On Wed, Jan 20, 2021 at 6:09 PM Ash Berlin-Taylor <a...@apache.org> wrote:
Yes, we quite possibly could do this -- I'm trying to work out what the needs 
are here.
In the example of a twice-a-month DAG (not sure if you have this use case 
too?) what do you expect the "data interval" (i.e. execution_date) to be?
Or for this case does it not matter?
-ash

On Wed, 20 Jan, 2021 at 19:06, Elad Kalif <elad...@gmail.com> wrote:
Another case that is mentioned in one of the issues is the ability to schedule 
a bi-weekly job (the equivalent of a bi-weekly meeting that you can set in a 
calendar), which is very much needed.

Maybe this is unrealistic but I think the game changer is if it would be 
possible to let the users define their own logic and airflow will use it to 
schedule DAGs.
My thought here is: if I can define the logic in a Python function (regardless 
of what this logic is), can't Airflow utilize it?
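
As a rough illustration of what "logic in a Python function" might look like for the bi-weekly case: the interface below (a callable that takes the previous run time and returns the next one) is purely hypothetical and not an existing Airflow API, and `upcoming_runs` merely stands in for whatever the scheduler would do with it.

```python
from datetime import datetime, timedelta
from typing import Callable, List

# Hypothetical interface: given the previous run time, return the next one.
NextRunFn = Callable[[datetime], datetime]


def every_two_weeks(previous: datetime) -> datetime:
    """User-defined schedule logic: run again exactly 14 days later."""
    return previous + timedelta(weeks=2)


def upcoming_runs(next_run: NextRunFn, start: datetime, count: int) -> List[datetime]:
    """Stand-in for the scheduler: call the user function repeatedly."""
    runs = []
    current = start
    for _ in range(count):
        current = next_run(current)
        runs.append(current)
    return runs


runs = upcoming_runs(every_two_weeks, datetime(2021, 1, 20), 3)
```

Here `runs` holds Feb 3, Feb 17 and Mar 3, 2021. The scheduler would only ever call the function, so any rule that can be written in Python would fit the same shape.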

On Wed, Jan 20, 2021 at 5:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
Hi everyone,
I'd like to (re)start the discussion about a new feature I'd like to add for 
Airflow 2.1, that I am loosely calling "improving schedule_interval" (catchy 
name I know!)
I have two main high-level goals in mind here:
1. To reduce the confusion around execution_date (specifically the naming of 
   the parameter!) - the whole start vs end discussion.
2. To support more complex schedules.
Previous thread on point 1 here: 
https://lists.apache.org/thread.html/2b12ae265795ff2e655a5161c972f5c7bbe60722a12849a0e2c5c55f%40%3Cdev.airflow.apache.org%3E
(but I'm taking a bit of a step back from that to think if there's a bigger 
change we could make that encompasses this)

I don't yet have a concrete plan, nor an implementation in mind, but I'd like to 
start collecting people's "wish list" when it comes to scheduling DAGs:
- What do you wish you could express natively in terms of scheduling your DAGs? 
  (I.e. without using "hacks" such as date sensor/skip tasks at start)
- What schedules do you wish you could express now, that you just can't?
- Do you have good example workflows that show where you want to schedule at 
  start? Follow up question: do you also want this to be different for different 
  DAGs in your Airflow install?

Existing issues:
- https://github.com/apache/airflow/issues/8649 "Add support for more than 1 
  cron exp per DAG"
- https://github.com/apache/airflow/issues/10194 "Ability to better support odd 
  scheduling time"
- https://github.com/apache/airflow/issues/10449 "Dynamic Schedule Intervals"
- https://github.com/apache/airflow/issues/10123 "Job Schedule Interval on 2nd 
  & 4th Tuesday"

I'll start:
Case1:
One example that came up recently in Slack was an actual astronomer wanting a 
DAG to run with a schedule of "@sunset"! This also brings up the subject of 
"running DAGs at interval start or end".
Case2:
I'd like to be able to run a daily process at the end of each weekday, i.e. to 
process data for Monday..Friday. The naive way of expressing this would be "0 0 
* * MON-FRI", but that means that the DAGs would run Tuesday, Wednesday, 
Thursday, Friday, Monday -- meaning Friday's data isn't processed until 
Monday! My thought on this is that we need to separate the schedule interval 
(when to run a task) from the period duration (i.e. look at one day's worth of 
data).
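
A rough sketch of that separation in plain Python, with no Airflow API involved: `weekday_runs` is a made-up helper, and it assumes each weekday's data covers midnight to midnight and should be processed as soon as that day ends.

```python
from datetime import date, datetime, time, timedelta


def weekday_runs(start: date, end: date):
    """Yield (run_at, interval_start, interval_end) for each Mon-Fri in [start, end].

    The data interval is the weekday itself; the run fires at the interval end,
    so Friday's data is processed at Saturday 00:00 rather than on Monday.
    """
    day = start
    while day <= end:
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            interval_start = datetime.combine(day, time.min)
            interval_end = interval_start + timedelta(days=1)
            yield interval_end, interval_start, interval_end
        day += timedelta(days=1)


# One calendar week, Monday 2021-01-18 through Sunday 2021-01-24.
week = list(weekday_runs(date(2021, 1, 18), date(2021, 1, 24)))
```

The week yields five runs, and the last one fires on Saturday Jan 23 at 00:00 for Friday's interval, which a single "0 0 * * MON-FRI" expression cannot describe under the current model.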
Thanks, Ash




--
+48 660 796 129
