It should be labelled as a "bridge" or "transition" release somehow, and tell the community to absolutely to go there first and address all deprecation warnings prior to upgrade to 2.0. I wonder if semver has something for that.
It might be the time to deprecate more things. Spring cleaning! Should we start another thread labeled "[2.0 spring cleaning] what Airflow features should we deprecate!?" Max On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor <[email protected]> wrote: > I am happy to do another small(!) 1.10.x release. > > (There was a small bug I introduced where I broke the rendering of doc_md > on dags) > > On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" <[email protected]> > wrote: > >I agree with Max here, we should be careful. > > > >Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds, > >tomorrow_ds_nodash. I'm not against having better readable shorthands, > >but > >more about the fact that having a tomorrow_ds doesn't make sense when > >we > >have an hourly (or weekly) job. > > > >I'm against guarding the {in,out}lets. Since the guarding will add > >extra > >logic, and this is not worth the added complexity (and another flag) in > >my > >opinion. > > > >- We start by putting deprecation warnings on tables, latest_date, > >end_date > >and END_DATE and remove them in Airflow 2.0. > > > >This is only possible if there is another version before 2.0. > >Currently, > >we're preparing to move to 2.0 > > > >- We add a lineage_enabled config which is false by default and thus > >inlets/outlets aren’t provided, unless set to true. > > > >I'm against guarding the {in,out}lets. Since the guarding will add > >extra > >logic, and this is not worth the added complexity (and another flag) in > >my > >opinion. > > > >- We continue discussion about clarification of execution_date and > >better > >named variables such as something_start and something_end > > > > > https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E > > > >Cheers, Fokko > > > >Op do 11 apr. 2019 om 08:44 schreef Bas Harenslak < > >[email protected]>: > > > >> Great discussion, let’s stay on track. If I can summarise: > >> > >> > >> * yesterday_ds, yesterday_ds_nodash, tomorrow_ds, > >tomorrow_ds_nodash > >> * Arthur: some users use these for convenience > >> * Bas/Fokko: these are values that can be easily derived in a > >> one-liner > >> > >> * tables > >> * nobody? > >> > >> * latest_date > >> * Arthur: used by internal backfill framework at Airbnb, but > >no > >> issue removing them. > >> > >> * inlets/outlets > >> * Arthur: still in development, could be guarded behind > >feature > >> flag. > >> > >> * end_date/END_DATE > >> * Arthur: used by internal backfill framework at Airbnb, but > >no > >> issue removing them. > >> > >> And to add to the discussion: > >> > >> > >> * execution_date > >> * The meaning of execution_date is ambivalent and super > >confusing > >> to everybody because it’s not the date of execution. > >> * Suggestions to add period_start/period_end, or > >> interval_start/interval_end, or something like that, to provide > >datetime > >> variables for start & end datetime of a DAG run. > >> * Other: > >> * Removal of variables should be done in major version and > >> deprecation warnings should be added. > >> > >> So how about the following: > >> - We start by putting deprecation warnings on tables, latest_date, > >> end_date and END_DATE and remove them in Airflow 2.0. > >> - We add a lineage_enabled config which is false by default and thus > >> inlets/outlets aren’t provided, unless set to true. > >> - We continue discussion about yesterday* and tomorrow* > >> - We continue discussion about clarification of execution_date and > >better > >> named variables such as something_start and something_end > >> > >> Cheers, > >> Bas > >> > >> > >> On 11 Apr 2019, at 04:52, Maxime Beauchemin > ><[email protected] > >> <mailto:[email protected]>> wrote: > >> > >> Making backwards incompatible changes that require altering the > >thousands > >> (millions?!) of DAGs in the wild will alienate the community and > >prevent > >> many from orchestrating an upgrade. Upgrading hundreds of DAGs and > >Airflow > >> atomically would be hard and dangerous. > >> > >> To mitigate this, changes to the DAG / core API should be backward > >> compatible, at least for a release range, and alert on upcoming > >deprecation > >> (if needed) > >> > >> To be clear if `execution_date` is renamed to `foo`, we should have a > >> version range where both work just as well, but using > >`execution_date` > >> would result is warning messages about its upcoming deprecation, > >maybe > >> something like "It appears you're using the task parameter > >`execution_date` > >> which will be deprecated in favor or `foo` in upcoming release 2.0.0, > >find > >> more info at link.to/foo<http://link.to/foo>" > >> > >> Max > >> > >> On Tue, Apr 9, 2019 at 7:58 AM James Meickle > >> > ><[email protected]<mailto:[email protected]>> > >> wrote: > >> > >> I agree with Ash here. The naming of "execution_date" is incredibly > >> confusing to people who are new to Airflow, who think it has > >something to > >> do with... execution. > >> > >> However, I think that there's still room for improvement with > >> "period_start" and "period_end". Think about manually triggered tasks > >- > >> they'd have a "period_start", but no useful "period_end" until the > >next > >> trigger occurs. And mid-day ETL tasks that are dated to "today" still > >have > >> to reference "period_end" to get the correct date, even though the > >DAGrun > >> won't be over yet! > >> > >> If we're going to go through and rename "execution_date", I'd rather > >see a > >> wider-ranging rework that understands a mapping from a date to a > >window > >> (like "every day, process the data from one week ago", or "every > >Saturday, > >> process today's data"). Maybe something closer to how Beam does it > >> (but retaining simplicity for the 99% of cases that are just "run > >daily, > >> for the past day"). > >> > >> > >> > > > https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions > >> > >> I could see something where there are two default DAG parameters are: > >> schedule="@daily" > >> window=-1 (or some schedule_delta object that provides a windowing > >impl, or > >> None if there is no window) > >> > >> In this world, all DAGRuns would have a "schedule_date" (the date > >this > >> would have normally started at), and an "execution_date" (when it was > >> actually executed). If a given DAGrun has a window (one-off DAGs may > >or may > >> not!), there would be variables for "window_start" (the start of the > >> window) and "window_end" (the end of the window; this would default > >to the > >> same as schedule_date, and to the next DAGrun's window_start). > >> > >> Disconnecting schedule date, execution date, and window/period would > >also > >> open a pathway for more advanced users to implement irregular > >schedules and > >> windows, without having to rely on hacks. For example: a DAG that > >runs at > >> "the end of the week" to produce a end-of-week would normally run on > >> Friday, but on Friday holidays, instead complete a day early (and > >scan one > >> fewer day in its window). > >> > >> I know there's some historical baggage here, but I think the above is > >much > >> more natural to new users than what we're offering today. And I think > >from > >> a current user perspective, there would be breaking variable name > >changes, > >> but no logical differences for the majority of DAGs. > >> > >> On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <[email protected]> > >wrote: > >> > >> To (slightly) hijack this thread: > >> > >> On the subject of execuction_date: as I'm sure we're all aware the > >> concept > >> of execution_date is confusing to new-commers to Airflow (there are > >many > >> questions about "why hasn't my DAG run yet"? "Why is my dag a day > >> behind?" > >> etc.) and although we mention this in the docs it's a confusing > >concept. > >> > >> What to people think about adding two new parameters: `period_start` > >and > >> `period_end` and making these the preferred terms in place of > >> execution_date and next_execution_date? > >> > >> This hopefully avoids any ambitious terms like "execution" or "run" > >which > >> is understandably easy to conflate with the time the task is being > >run > >> (i.e. `now()`) > >> > >> If people think this naming is better and less confusing I would > >suggest > >> we update all the docs and examples to use these terms (but still > >mention > >> the old names somewhere, probably in the macros docs) > >> > >> Thoughts? > >> > >> -ash > >> > >> > >> On 8 Apr 2019, at 16:20, Arthur Wiedmer <[email protected]> > >> wrote: > >> > >> Hi Bas, > >> > >> 1) I am aware of a few places where those parameters are used in > >> production > >> in a few hundred jobs. I highly recommend we don't deprecate them > >> unless > >> we > >> do it in a major version. > >> > >> 2) As James mentioned, inlets and outlets are a lineage annotation > >> feature > >> which is still under development. Let's leave them in, but we can > >guard > >> them behind a feature flag if you prefer. > >> > >> 3) the yesterday*/tomorrow* params are convenience ones if you use a > >> daily > >> ETL. I agree with you that they are simple to compute, but not > >everyone > >> using Apache Airflow is amazing with Python. Some users are only > >trying > >> to > >> get a query scheduled and rely on a couple of niceties like these to > >> get > >> by. > >> > >> 4) latest_date, end_date (I feel like there used to be start_date, > >but > >> maybe it got lost) were a blend of things which were used by a > >backfill > >> framework used internally at Airbnb. Latest date was used if you > >needed > >> to > >> join to a dimension for which you only wanted the latest version of > >the > >> attributes in you backfill. end_date was used for time ranges where > >> several > >> days were processed together in a range to save on compute. I don't > >see > >> an > >> issue with removing them. > >> > >> Best regards, > >> Arthur > >> > >> > >> > >> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak < > >> [email protected]> > >> wrote: > >> > >> Hi all, > >> > >> Following Tao Feng’s question to discuss this PR< > >> https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192< > >> https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss > >> here > >> if you agree/disagree/would change. > >> > >> ----------- > >> > >> The summary of the PR: > >> > >> I was confused by the task context values and suggest to clean up and > >> clarify these variables. Some are derivations from other variables, > >> some > >> are undocumented and unused, some are wrong (name doesn’t match the > >> value). > >> Please discuss what you think of the removal of these variables: > >> > >> > >> * Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds, > >> tomorrow_ds_nodash. IMO the next_* and previous_* variables are > >useful > >> since these require complex logic to compute the next execution date, > >> however would leave computing the yesterday* and tomorrow* variables > >> up > >> to > >> the user since they are simple one-liners and don't relate to the DAG > >> interval. > >> * Removed tables. This is a field in params, and is thus also > >> accessible by the user ({{ params.tables }}). Also, it was > >> undocumented. > >> * Removed latest_date. It's the same as ds and was also > >> undocumented. > >> * Removed inlets and outlets. Also undocumented, and have the > >> inlets/outlets ever worked/ever been used by anybody? > >> * Removed end_date and END_DATE. Both have the same value, so it > >> doesn't make sense to have both variables. Also, the value is ds > >which > >> contains the start date of the interval, so the naming didn't make > >> sense to > >> me. However, if anybody argues in favour of adding "start_date" and > >> "end_date" to provide the start and end datetime of task instance > >> intervals, I'd be happy to add them. > >> > >> Cheers, > >> Bas > >> > >> > >> > >> > >> > >> >
