As discussed on Slack, I agree with Malthe that the current
timetable interface is complex. But my assessment of the situation and
my proposal, which include a bit more context and the plans we had for
AIP-39, are a bit different.

TL;DR: I think it is about time to complete what we planned in AIP-39
as "Future Enhancement" and implement a few simple timetables
that handle the most popular use cases (built on the
"complex" timetable API) and that regular users can use
without writing new code. My proposal is that we
define which timetables to add and aim to include them in
Airflow 2.3. That sounds doable and should solve the real problem of our users.

Assessment of the situation.

I do not think the current interface is "too complex". Not at all. But I
think that it is targeted at a different audience than the one Malthe and
Bas talk about. It is addressed to "power users" - not only because it requires
deep understanding of Airflow scheduling internals and optimizations, but
also because it requires "admin" rights to develop, test and install it.
Regular users who are DAG authors cannot create new Timetables. This is
mostly because of security. The "regular users" need to convince the
admins to do so. And yes, I am talking about the important segment of our
users where you have professional admins/devops configuring Airflow and DAG
authors who just write DAGs. I think this is the most interesting and
biggest segment of our users, to be honest. We should always think about
this segment of our users first IMHO.

But what I very strongly agree with - we have a very limited "offering" for
the "DAG authors" to be able to harness the powers of our
non-cron-based timetables. The typical ask that cannot be easily fulfilled
(and which I have seen many cases of) is this one from Friday's Slack
discussion: *"Can someone provide me some codesamples of scheduling a job
on the second to last day of every month using timetables?"*
https://apache-airflow.slack.com/archives/CCPRP7943/p1645697960286899 .
Currently, Airflow out of the box has no way of supporting that (rather
typical) use case without actually becoming the "power user with admin
rights". You need to have "someone else" provide it as a plugin that
you install. This is what we are missing currently (and it has already been
planned as a future enhancement in AIP-39):
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval
.

What current API provides and for whom

The current API is great when it comes to power users who know Airflow's
scheduling internals and the optimizations that Ash explained. Looking at
the goals AIP-39 came from:

* have a "versatile API" with which you can implement literally ANY
timetable
* where there are no fixed schedule intervals
* where manual runs can co-exist with scheduled runs
* where you can specify a backfill range and it will figure out how many
dagruns there will be in this range and run them
* and which allows optimizing scheduling decisions (the date of the next run
is stored in the database for easy DB queries, among others)
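
To make the backfill point concrete, here is a minimal sketch in plain
Python. It is NOT the Airflow Timetable API: `runs_in_range` and the
"next run after" callable are hypothetical names I made up for
illustration. It only shows the idea that the schedule answers "when is
the next run after X?" and the backfill range is walked from there,
without assuming any fixed interval. The range is the one from the
backfill command in the quoted thread below.

```python
from datetime import date, timedelta
from typing import Callable, Iterator


def runs_in_range(
    next_run_after: Callable[[date], date],  # hypothetical "timetable" callable
    start: date,
    end: date,
) -> Iterator[date]:
    """Yield every run date in [start, end] by repeatedly asking for the next run."""
    # Ask for the first run strictly after the day before `start`,
    # so that a run falling exactly on `start` is included.
    current = next_run_after(start - timedelta(days=1))
    while current <= end:
        yield current
        current = next_run_after(current)


def daily(after: date) -> date:
    """A trivial schedule: one run per day."""
    return after + timedelta(days=1)


# How many runs would a daily schedule produce for that backfill range?
backfill_runs = list(runs_in_range(daily, date(2020, 1, 1), date(2021, 6, 30)))
print(len(backfill_runs))
```

Any other "next run after" function (for example a day-of-month one) can
be plugged in without changing the loop.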

I think the current API fulfills those very well and is a great "low level"
API that we can build more "higher-level" implementations of Custom
Timetables on.
But the current API is terrible for casual DAG Authors who want to use
non-cron-compatible timetables - both because of complexity and security
limitations.

What can we do?

I think we should design and write a few (literally a few) higher-level
timetables meant to be used by "regular" DAG authors without
installing anything. Not many. Just a few. We could rather easily ask our
users and produce a list of several timetables that do not have "cron"
limitations but also handle just a subset of the "general timetable"
cases. For example, a Timetable that allows the user to run on
"-2 day of every month" (the second to last day). Those
timetables should be available in Airflow out of the box. No package
installation or admin permission necessary. We literally need two or three
such schedules, and we should be open to users expressing their
non-cron-compliant "typical" schedules and add them as needed.
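
Just to illustrate how small the date logic behind such a "-2 day of
every month" timetable is, here is a sketch using only the Python
standard library. It is not an Airflow Timetable implementation, and the
function names (`nth_day_of_month`, `next_run_on_or_after`) are
hypothetical. It assumes negative values count from the end of the month
(-1 = last day, -2 = second to last day).

```python
import calendar
from datetime import date


def nth_day_of_month(year: int, month: int, day: int) -> date:
    """Resolve a day-of-month; negative values count from the end of the month."""
    days_in_month = calendar.monthrange(year, month)[1]
    resolved = day if day > 0 else days_in_month + day + 1
    if not 1 <= resolved <= days_in_month:
        raise ValueError(f"day {day} does not exist in {year}-{month:02d}")
    return date(year, month, resolved)


def next_run_on_or_after(today: date, day: int) -> date:
    """Return the first matching date that is not before `today`."""
    candidate = nth_day_of_month(today.year, today.month, day)
    if candidate >= today:
        return candidate
    # This month's date has already passed, so move to the next month.
    if today.month == 12:
        return nth_day_of_month(today.year + 1, 1, day)
    return nth_day_of_month(today.year, today.month + 1, day)


# Second to last day of February 2022 (a 28-day month).
print(nth_day_of_month(2022, 2, -2))
```

A real timetable would still have to wrap this in the existing
"low level" API (data intervals, timezones, catchup), which is exactly
the part regular DAG authors should not have to write.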

I do not yet have a clear idea of the "UX/declarative configuration" for such
timetables, but something that comes to my mind is that one of those could
allow a textual description of the schedule - it would be extremely cool if
users could create the schedule like timetable="run on the second to
last day of every month". With the NLP solutions out there, it should be
possible because the domain of "typical" scheduling is really narrow. Maybe
there are some libraries we could use for that :D. But this is just an
idea, maybe we can do it differently.

Those are my thoughts :).


J.



On Wed, Feb 23, 2022 at 4:23 PM Malthe <[email protected]> wrote:

> On Wed, 23 Feb 2022 at 15:20, Ash Berlin-Taylor <[email protected]> wrote:
> >
> > On Wed, Feb 23 2022 at 15:17:48 +0000, Malthe <[email protected]> wrote:
> >
> > Backfilling is not out of scope for a timetable at all. If I run
> `airflow dags backfill mydagid --start-date 2020-01-01 --end-date
> 2021-06-30` how many DagRuns are created and what are logical
> dates/intervals of them?
> >
> > If the timetable has a daily frequency, then one dagrun per day in that
> interval.
> >
> >
> > DAGs don't have a frequency. They have a timetable. They don't even have
> a scheduler_interval anymore -- that gets converted to an instance of the
> CronDataIntervalTimetable
>
> Yes, if the timetable has a daily frequency internally – that is, if
> the timetable has a logic that produces dagruns spaced out daily –
> then I would expect one dagrun per day in the given interval.
>
