Re: Missing "start_date" or why must a DAG have one

2022-05-18 Thread Malthe
On Wed, 18 May 2022 at 17:18, Ash Berlin-Taylor wrote: > Start date also makes sense for a cron-based dag with catch-up too though... True. So, 1. A timedelta without a `start_date` is not wrong, but it'll use midnight as the reference time (I think this is better than "date first added"
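
A minimal sketch of the "midnight as the reference time" idea for a timedelta schedule with no `start_date`. It assumes "midnight" means 00:00 of the current day in the timezone of `now`, which the truncated message does not confirm. The function name `next_run_after` is hypothetical and is not part of Airflow.

```python
from datetime import datetime, timedelta, timezone


def next_run_after(now: datetime, interval: timedelta) -> datetime:
    """Return the first run time strictly after `now`.

    Assumption: intervals are counted from midnight of `now`'s own date,
    so the grid restarts every day. This is a simplification for intervals
    that do not divide 24 hours evenly.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Number of whole intervals already elapsed since midnight, plus one.
    steps = (now - midnight) // interval + 1
    return midnight + steps * interval


now = datetime(2022, 5, 18, 17, 18, tzinfo=timezone.utc)
print(next_run_after(now, timedelta(hours=6)))
print(next_run_after(now, timedelta(days=1)))
```

With a daily interval this lands on the next 00:00 regardless of when the DAG was first added, which is the property the message prefers over "date first added".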

Re: Missing "start_date" or why must a DAG have one

2022-05-18 Thread Ash Berlin-Taylor
Start date also makes sense for a cron-based dag with catch-up too though... On 18 May 2022 16:58:54 BST, Malthe wrote: >On Sat, 14 May 2022 at 11:21, Bas Harenslak wrote: >> I think we have the following options when no start_date is given: >> >> 1. schedule_interval is alias e.g. “@daily”

Re: Missing "start_date" or why must a DAG have one

2022-05-18 Thread Malthe
On Sat, 14 May 2022 at 11:21, Bas Harenslak wrote: > I think we have the following options when no start_date is given: > > 1. schedule_interval is alias e.g. “@daily” —> is a cron expression > internally (0 0 * * *), so run at 00:00 > 2. schedule_interval is cron e.g. “0 0 * * *” —> cron
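
The first option in the quoted list (an alias such as "@daily" being a cron expression internally) can be sketched as below. The preset table follows the conventional cron shorthands. The names `CRON_PRESETS`, `normalize_schedule` and `next_daily_run` are hypothetical and only illustrate the idea; they are not Airflow's API.

```python
from datetime import datetime, timedelta

# Conventional cron shorthands and the expressions they stand for.
CRON_PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def normalize_schedule(schedule: str) -> str:
    """Expand an alias to its cron expression; pass plain cron through."""
    return CRON_PRESETS.get(schedule, schedule)


def next_daily_run(now: datetime) -> datetime:
    """First 00:00 strictly after `now`, i.e. the next firing of '0 0 * * *'."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


print(normalize_schedule("@daily"))
print(next_daily_run(datetime(2022, 5, 14, 11, 21)))
```

Because the cron expression already fixes the wall-clock time of each run, no `start_date` is needed in this sketch to decide when the next run happens.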

Re: Missing "start_date" or why must a DAG have one

2022-05-14 Thread Bas Harenslak
Not in favour of a special marker because that’s essentially what start_date is for. Say somebody has a schedule_interval=timedelta(days=1) and wants their DAG to run at 00:00 without having to think of a specific start date, then they’d have to set start_date="random date and time 00:00" and

Re: Missing "start_date" or why must a DAG have one

2022-05-13 Thread Ping Zhang
"Starts whenever you first deploy it" makes DAGs nondeterministic. It is true that this is currently very hard to achieve. Maybe we could use a special start_date marker to indicate this behavior so that users are fully aware of what they are doing. There is also another case where

Re: Missing "start_date" or why must a DAG have one

2022-05-13 Thread Collin McNulty
I disagree: start_date=None with catchup=True still describes a useful behavior that's currently difficult to achieve in Airflow: a DAG that starts whenever you first deploy it and then catches up missed runs if you pause and unpause it or have downtime. On Thu, May 12, 2022 at 5:49 AM Jarek

Re: Missing "start_date" or why must a DAG have one

2022-05-12 Thread Jarek Potiuk
Yeah. Maybe start_date should simply be required only when catchup=True, then? That sounds like it would correctly reflect the intention of catchup=True, while giving an explicit start_date very solid semantics. J. On Tue, May 10, 2022 at 11:14 PM Ping Zhang wrote: > I agree that for the
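
A minimal sketch of the rule proposed here, where `start_date` is required only when `catchup=True`. `DagSpec` is a hypothetical stand-in class, not Airflow's `DAG`, and this is not how Airflow validates its arguments.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DagSpec:
    """Hypothetical DAG description used only to illustrate the rule."""

    dag_id: str
    catchup: bool = True
    start_date: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Catch-up needs a lower bound for the runs to backfill.
        if self.catchup and self.start_date is None:
            raise ValueError(
                f"DAG {self.dag_id!r}: start_date is required when catchup=True"
            )


# Accepted: no catch-up, so no start_date is needed.
DagSpec("no_catchup", catchup=False)

# Accepted: catch-up with an explicit lower bound.
DagSpec("with_catchup", catchup=True, start_date=datetime(2022, 5, 1))
```

Under this rule a DAG declared with `catchup=True` and no `start_date` would be rejected at definition time, which is the case the later reply in the thread argues should remain possible.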

Re: Missing "start_date" or why must a DAG have one

2022-05-10 Thread Ping Zhang
I agree that for the crontab interval with `catchup=False`, the start_date does not make sense. However, the start_date is still very useful with `catchup=True`, which is the default: https://github.com/apache/airflow/blob/main/airflow/config_templates/default_airflow.cfg#L989. If

Re: Missing "start_date" or why must a DAG have one

2022-05-09 Thread Jarek Potiuk
Coincidentally, this discussion, started just now in GitHub Discussions, has a clear use case where omitting start_date makes perfect sense: https://github.com/apache/airflow/discussions/23594 On Mon, May 9, 2022 at 4:01 PM Bas Harenslak wrote: > I never understood the requirement for start_date

Re: Missing "start_date" or why must a DAG have one

2022-05-09 Thread Bas Harenslak
I never understood the requirement for start_date — 99% of the use cases simply want to start from the time the DAG is first added and do not explicitly need to start on a certain date. There is certainly a use case for start_date, but defaulting to None would make more sense IMO, and we could

Re: Missing "start_date" or why must a DAG have one

2022-05-09 Thread Jarek Potiuk
I think the only real need for start_date is "catchup=True". I think start_date is really part of the metadata of the DAG, which is useful for determining the range of a backfill, for example. So it's more an intention of the DAG author to describe when we actually want the DAG lifecycle

Missing "start_date" or why must a DAG have one

2022-05-05 Thread Malthe
There's been some prior discussion on removing the requirement for a DAG without a schedule: - https://issues.apache.org/jira/browse/AIRFLOW-3739 - https://github.com/apache/airflow/pull/5423 But why actually have the requirement at all? The documentation isn't particularly clear on why we need