@Jarek the problem with just cron is that I don't think cron can handle "every third Thursday" or "the next open market day after the 15th". I think we need something more flexible than cron alone (though I agree that cron syntax gets you a fair bit of mileage).
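To make that concrete, here is a minimal standard-library sketch (the helper name is mine, purely illustrative) of the "every third Thursday" rule that a single cron expression can't express:

    import calendar
    from datetime import date

    def third_thursday(year: int, month: int) -> date:
        """Return the date of the third Thursday of the given month."""
        thursdays = [
            d for d in calendar.Calendar().itermonthdates(year, month)
            if d.weekday() == calendar.THURSDAY and d.month == month
        ]
        return thursdays[2]

    # e.g. third_thursday(2021, 1) -> datetime.date(2021, 1, 21)

"The next open market day after the 15th" would need a market-holiday calendar on top of this, which is exactly the kind of logic that only fits in code, not in a cron field.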
On Wed, Jan 20, 2021 at 10:24 AM, Jarek Potiuk <ja...@potiuk.com> wrote:

My thoughts (no final solution in mind, just wild thoughts):

1) I think we should add support for regular CRON behaviour: simply a "cron" schedule for DAG execution, without the "data interval" rhetoric. There are a number of good cases where Airflow can be used as just a scheduler to run jobs. This would be akin to CI jobs -- either trigger the run on some event (trigger) or at regular intervals, but each run is not tied to a particular "data interval", which means that backfill, re-running, idempotency of runs etc. would not be applicable. This should IMHO even be a different type of DAG, treated differently in the UI (for example, every rerun should result in a NEW run rather than a repetition of the previous run for a specific interval). I think we should very, very clearly distinguish it from the "data interval" kind -- maybe even the base class should be named differently (CronDAG?). It should be very clear what kind of DAG you have when you look at it, both in the code and in the UI.

2) Get rid of CRON in the "data interval" DAGs (i.e. all current DAGs!). This might be bold, but I think it might be best. It is very confusing that we use the CRON syntax but not the CRON execution model; I think this is a major source of confusion among users. The current way of specifying the schedule should be deprecated and dropped in 3.0, or automatically converted to a new form. To that end, I am all for Elad's proposal of using a Python function (with a predefined set of parameterizable ones expressing intervals rather than start/end times). The CRON specification is the only part of Airflow that is declarative rather than imperative -- all the other stuff is Python code. Heck, why not the schedule too? It of course has a number of problems to solve (largely optimisations in the scheduler, which needs to look ahead and plan scheduling in the future), but it is all solvable IMHO.

J.

On Wed, Jan 20, 2021 at 6:20 PM Deng Xiaodong <xd.den...@gmail.com> wrote:

A quick thought (maybe not making sense): if schedule_interval accepted a list of values, we could support much higher complexity. For example, I may want to schedule my jobs at 04:05 AND 02:31 every day, which cannot be expressed by a single cron pattern. Then I may want to have schedule_interval = ["5 4 * * *", "31 2 * * *"]. Maybe I missed something or the idea doesn't make sense -- please let me know.

XD

On Wed, Jan 20, 2021 at 6:09 PM Ash Berlin-Taylor <a...@apache.org> wrote:

Yes, we quite possibly could do this -- I'm trying to work out what the needs are here. In the example of a twice-a-month DAG (not sure if you have this use case too?), what do you expect the "data interval" (i.e. execution_date) to be? Or does it not matter for this case?

-ash

On Wed, 20 Jan, 2021 at 19:06, Elad Kalif <elad...@gmail.com> wrote:

Another case mentioned in one of the issues is the ability to schedule a bi-weekly job (the equivalent of a bi-weekly meeting you can set in a calendar), which is very much needed. Maybe this is unrealistic, but I think the game changer would be letting users define their own logic and having Airflow use it to schedule DAGs. My thought here is: if I can define the logic in a Python function (regardless of what that logic is), can't Airflow utilize it?
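Purely as an illustration of what Elad's "just give Airflow a function" idea could look like (the callable-based schedule argument is hypothetical -- schedule_interval does not accept a callable today -- while croniter is the library Airflow already uses to evaluate cron expressions), here is a sketch that also folds in Xiaodong's multi-cron case:

    from datetime import datetime
    from croniter import croniter

    def next_run(after: datetime) -> datetime:
        """Earliest upcoming fire time across several cron expressions."""
        exprs = ["5 4 * * *", "31 2 * * *"]   # run at 04:05 AND 02:31 every day
        return min(croniter(e, after).get_next(datetime) for e in exprs)

    # Hypothetical usage -- NOT a real Airflow API today:
    # dag = DAG("multi_cron_example", schedule_interval=next_run, start_date=...)

The same callable shape could just as easily encode "every other Monday" or any other user-defined rule, which is what makes the function-based proposal attractive.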
On Wed, Jan 20, 2021 at 5:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:

Hi everyone,

I'd like to (re)start the discussion about a new feature I'd like to add for Airflow 2.1, which I am loosely calling "improving schedule_interval" (catchy name, I know!). I have two main high-level goals in mind here:

1. To reduce the confusion around execution_date (specifically the naming of the parameter!) -- the whole start vs end discussion.
2. To support more complex schedules.

Previous thread on this point: https://lists.apache.org/thread.html/2b12ae265795ff2e655a5161c972f5c7bbe60722a12849a0e2c5c55f%40%3Cdev.airflow.apache.org%3E (but I'm taking a bit of a step back from that to consider whether there's a bigger change we could make that encompasses it).

I don't yet have a concrete plan or implementation in mind, but I'd like to start collecting people's "wish lists" when it comes to scheduling DAGs:

- What do you wish you could express natively when scheduling your DAGs (i.e. without "hacks" such as a date sensor or skip tasks at the start)?
- What schedules do you wish you could express now, but just can't?
- Do you have example workflows that illustrate where you want scheduling at interval start? Follow-up question: do you also want this to differ between DAGs in your Airflow install?

Existing issues:

https://github.com/apache/airflow/issues/8649 "Add support for more than 1 cron exp per DAG"
https://github.com/apache/airflow/issues/10194 "Ability to better support odd scheduling time"
https://github.com/apache/airflow/issues/10449 "Dynamic Schedule Intervals"
https://github.com/apache/airflow/issues/10123 "Job Schedule Interval on 2nd & 4th Tuesday"

I'll start:

Case 1: One example that came up recently in Slack was an actual astronomer wanting a DAG to run with a schedule of "@sunset"! This also brings up the subject of running DAGs at interval start or end.

Case 2: I'd like to be able to run a daily process at the end of each weekday, i.e. to process data for Monday..Friday. The naive way of expressing this would be "0 0 * * MON-FRI", but that means the DAG would run on Tuesday, Wednesday, Thursday, Friday, and Monday -- meaning Friday's data isn't processed until Monday! My thought here is that we need to separate the schedule interval (when to run a task) from the period duration (i.e. look at one day's worth of data).

Thanks,
Ash
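To make Case 2 concrete, a quick croniter sketch (croniter is what Airflow uses internally to evaluate cron expressions) shows why, under the run-at-interval-end model, Friday's data only gets picked up by the Monday run:

    from datetime import datetime
    from croniter import croniter

    # "0 0 * * 1-5" is the numeric form of "0 0 * * MON-FRI"
    it = croniter("0 0 * * 1-5", datetime(2021, 1, 21))  # Thursday 2021-01-21
    for _ in range(3):
        print(it.get_next(datetime))
    # 2021-01-22 00:00:00  (Friday's run, covering Thursday's data)
    # 2021-01-25 00:00:00  (Monday's run -- the one that ends up covering Friday's data)
    # 2021-01-26 00:00:00  (Tuesday's run, covering Monday's data)

Because the run that closes Friday's interval is the next cron tick, and that tick does not come until Monday, the one-day "period duration" and the Mon-Fri "schedule interval" end up entangled -- which is exactly the separation Ash is arguing for.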