I never understood the requirement for start_date — 99% of the use cases simply want to start from the time the DAG is first added and do not explicitly need to start on a certain date. There is certainly a use case for start_date, but defaulting to None would make more sense IMO, and we could internally register the “first added date” as a start date instead.
Bas

> On 9 May 2022, at 09:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> I think the only real need for start_date is "catchup=True".
> I think start_date is really part of the metadata of the DAG, which is
> useful, for example, to determine the range of a backfill. So it is more
> the DAG author's intention to describe when we actually want the DAG
> lifecycle to start.
> As such it is nice to keep in the "records": if we do not have it, we simply
> do not know when the DAG should "start". We could infer it from
> historical DagRuns, but the problem is that if DagRuns are removed, that
> information is lost.
>
> But it does not have to be specified in the DAG() object in Python IMHO.
>
> I do not think we should actually remove "start_date" from the Dag model, but
> I also think it should be perfectly fine to simply set start_date in the Dag
> model to "NOW()" if it is not passed. It should not be exactly NOW(), though:
> because of the intricacies of "execution_date", "start_interval" and
> "end_interval", it should be automatically adjusted. And here I am not sure
> exactly how: either so that when you create a DAG without start_date, it
> starts immediately for the current interval, or so that it starts for the
> future interval (not 100% sure how well it will play with custom timetables,
> but I think it can be worked out rather easily).
>
> J.
>
> On Thu, May 5, 2022 at 2:30 PM Malthe <mbo...@gmail.com> wrote:
> There's been some prior discussion on removing the requirement for a
> DAG without a schedule:
>
> - https://issues.apache.org/jira/browse/AIRFLOW-3739
> - https://github.com/apache/airflow/pull/5423
>
> But why actually have the requirement at all?
>
> The documentation isn't particularly clear on why we need "start_date"
> and the whole idea seems somewhat confusing:
>
> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date
>
> Consider:
>
>     croniter("*/5 * * * *", start_time=None).get_next(datetime.datetime)
>
> My UTC time is "2022-05-05T12:22:16.914769" and the above expression
> evaluates to:
>
>     2022-05-05T12:25:00
>
> That is, it's nicely aligned as you would expect. I would assume from
> reading the code that this carries over to `CronDataIntervalTimetable`
> since it uses croniter in exactly this way.
>
> Must we require a "start_date"?
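
For reference, the alignment Malthe describes and the two options Jarek mentions can be sketched with the standard library alone. This is only an illustration, not croniter or Airflow code: the helper names `next_boundary` and `candidate_first_intervals` are hypothetical, and the arithmetic only matches a "*/N * * * *" schedule when N divides 60.

```python
import datetime as dt


def next_boundary(now, step_minutes=5):
    """Return the first aligned instant strictly after `now`.

    Hypothetical helper: mimics an every-N-minutes cron schedule,
    assuming step_minutes divides 60.
    """
    # Floor `now` to the previous aligned minute, then step forward once.
    floored = now.replace(
        minute=now.minute - now.minute % step_minutes,
        second=0,
        microsecond=0,
    )
    return floored + dt.timedelta(minutes=step_minutes)


def candidate_first_intervals(now, step_minutes=5):
    """Return (current, future) as (start, end) pairs.

    `current` is the interval that contains `now`; `future` is the
    first interval that begins after `now`. These are the two possible
    defaults discussed in the thread when no start_date is given.
    """
    step = dt.timedelta(minutes=step_minutes)
    end_of_current = next_boundary(now, step_minutes)
    current = (end_of_current - step, end_of_current)
    future = (end_of_current, end_of_current + step)
    return current, future


now = dt.datetime(2022, 5, 5, 12, 22, 16, 914769)
print(next_boundary(now).isoformat())  # 2022-05-05T12:25:00
current, future = candidate_first_intervals(now)
print(current[0].isoformat(), current[1].isoformat())
print(future[0].isoformat(), future[1].isoformat())
```

With the timestamp from Malthe's mail, the "current interval" option would cover 12:20 to 12:25 and the "future interval" option 12:25 to 12:30. How custom timetables would choose between the two is left open here, as it is in the thread.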