Making backwards incompatible changes that require altering the thousands (millions?!) of DAGs in the wild will alienate the community and prevent many from orchestrating an upgrade. Upgrading hundreds of DAGs and Airflow atomically would be hard and dangerous.
To mitigate this, changes to the DAG / core API should be backward compatible, at least for a release range, and alert on upcoming deprecation (if needed) To be clear if `execution_date` is renamed to `foo`, we should have a version range where both work just as well, but using `execution_date` would result is warning messages about its upcoming deprecation, maybe something like "It appears you're using the task parameter `execution_date` which will be deprecated in favor or `foo` in upcoming release 2.0.0, find more info at link.to/foo" Max On Tue, Apr 9, 2019 at 7:58 AM James Meickle <jmeic...@quantopian.com.invalid> wrote: > I agree with Ash here. The naming of "execution_date" is incredibly > confusing to people who are new to Airflow, who think it has something to > do with... execution. > > However, I think that there's still room for improvement with > "period_start" and "period_end". Think about manually triggered tasks - > they'd have a "period_start", but no useful "period_end" until the next > trigger occurs. And mid-day ETL tasks that are dated to "today" still have > to reference "period_end" to get the correct date, even though the DAGrun > won't be over yet! > > If we're going to go through and rename "execution_date", I'd rather see a > wider-ranging rework that understands a mapping from a date to a window > (like "every day, process the data from one week ago", or "every Saturday, > process today's data"). Maybe something closer to how Beam does it > (but retaining simplicity for the 99% of cases that are just "run daily, > for the past day"). > > https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions > > I could see something where there are two default DAG parameters are: > schedule="@daily" > window=-1 (or some schedule_delta object that provides a windowing impl, or > None if there is no window) > > In this world, all DAGRuns would have a "schedule_date" (the date this > would have normally started at), and an "execution_date" (when it was > actually executed). If a given DAGrun has a window (one-off DAGs may or may > not!), there would be variables for "window_start" (the start of the > window) and "window_end" (the end of the window; this would default to the > same as schedule_date, and to the next DAGrun's window_start). > > Disconnecting schedule date, execution date, and window/period would also > open a pathway for more advanced users to implement irregular schedules and > windows, without having to rely on hacks. For example: a DAG that runs at > "the end of the week" to produce a end-of-week would normally run on > Friday, but on Friday holidays, instead complete a day early (and scan one > fewer day in its window). > > I know there's some historical baggage here, but I think the above is much > more natural to new users than what we're offering today. And I think from > a current user perspective, there would be breaking variable name changes, > but no logical differences for the majority of DAGs. > > On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <a...@apache.org> wrote: > > > To (slightly) hijack this thread: > > > > On the subject of execuction_date: as I'm sure we're all aware the > concept > > of execution_date is confusing to new-commers to Airflow (there are many > > questions about "why hasn't my DAG run yet"? "Why is my dag a day > behind?" > > etc.) and although we mention this in the docs it's a confusing concept. > > > > What to people think about adding two new parameters: `period_start` and > > `period_end` and making these the preferred terms in place of > > execution_date and next_execution_date? > > > > This hopefully avoids any ambitious terms like "execution" or "run" which > > is understandably easy to conflate with the time the task is being run > > (i.e. `now()`) > > > > If people think this naming is better and less confusing I would suggest > > we update all the docs and examples to use these terms (but still mention > > the old names somewhere, probably in the macros docs) > > > > Thoughts? > > > > -ash > > > > > > > On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wied...@gmail.com> > > wrote: > > > > > > Hi Bas, > > > > > > 1) I am aware of a few places where those parameters are used in > > production > > > in a few hundred jobs. I highly recommend we don't deprecate them > unless > > we > > > do it in a major version. > > > > > > 2) As James mentioned, inlets and outlets are a lineage annotation > > feature > > > which is still under development. Let's leave them in, but we can guard > > > them behind a feature flag if you prefer. > > > > > > 3) the yesterday*/tomorrow* params are convenience ones if you use a > > daily > > > ETL. I agree with you that they are simple to compute, but not everyone > > > using Apache Airflow is amazing with Python. Some users are only trying > > to > > > get a query scheduled and rely on a couple of niceties like these to > get > > by. > > > > > > 4) latest_date, end_date (I feel like there used to be start_date, but > > > maybe it got lost) were a blend of things which were used by a backfill > > > framework used internally at Airbnb. Latest date was used if you needed > > to > > > join to a dimension for which you only wanted the latest version of the > > > attributes in you backfill. end_date was used for time ranges where > > several > > > days were processed together in a range to save on compute. I don't see > > an > > > issue with removing them. > > > > > > Best regards, > > > Arthur > > > > > > > > > > > > On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak < > > basharens...@godatadriven.com> > > > wrote: > > > > > >> Hi all, > > >> > > >> Following Tao Feng’s question to discuss this PR< > > >> https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192< > > >> https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss > > here > > >> if you agree/disagree/would change. > > >> > > >> ----------- > > >> > > >> The summary of the PR: > > >> > > >> I was confused by the task context values and suggest to clean up and > > >> clarify these variables. Some are derivations from other variables, > some > > >> are undocumented and unused, some are wrong (name doesn’t match the > > value). > > >> Please discuss what you think of the removal of these variables: > > >> > > >> > > >> * Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds, > > >> tomorrow_ds_nodash. IMO the next_* and previous_* variables are useful > > >> since these require complex logic to compute the next execution date, > > >> however would leave computing the yesterday* and tomorrow* variables > up > > to > > >> the user since they are simple one-liners and don't relate to the DAG > > >> interval. > > >> * Removed tables. This is a field in params, and is thus also > > >> accessible by the user ({{ params.tables }}). Also, it was > undocumented. > > >> * Removed latest_date. It's the same as ds and was also > undocumented. > > >> * Removed inlets and outlets. Also undocumented, and have the > > >> inlets/outlets ever worked/ever been used by anybody? > > >> * Removed end_date and END_DATE. Both have the same value, so it > > >> doesn't make sense to have both variables. Also, the value is ds which > > >> contains the start date of the interval, so the naming didn't make > > sense to > > >> me. However, if anybody argues in favour of adding "start_date" and > > >> "end_date" to provide the start and end datetime of task instance > > >> intervals, I'd be happy to add them. > > >> > > >> Cheers, > > >> Bas > > >> > > > > >