I would like to add to this discussion a change to `TaskInstance.previous_ti`. It is a useful property in template context, but it has an identity crisis.
Currently this property may return a different value depending on whether your dag is `catchup` and whether your dag has a schedule interval. - If the dag is catchup, it will calculate the prior execution date according to schedule and return that TI if one exists. - If catchup is true and schedule is none then it will return none even if there is a prior TI. - If dag is not catchup it will ignore the schedule and return the absolute prior TI. I propose that in 2.0 behavior should be as follows: - *previous_ti* always returns absolute prior ti (i.e. latest prev_ti such that prev_ti.exec_date < ti.exec_date), if one exists - add property *previous_scheduled_ti* that returns the prior ti according to schedule if *it* exists - consistent with current behavior, if latest ti exec date < prev sched exec date, it will return None; and if no schedule, return none. - *note:* there is a PR in progress for *previous_ti_success* -- this will pick up absolute prior successful TI This way, the behavior of these properties will not change depending on unrelated dag params; they should do what advertise, always. On Tue, Apr 30, 2019 at 11:08 AM Bas Harenslak < basharens...@godatadriven.com> wrote: > Nope. I never planned to have a discussion on execution_date but it > appeared more important than the other variables mentioned in my first > email :) > > I already wrote a summary halfway this thread. With the updates since then: > > > * yesterday_ds, yesterday_ds_nodash, tomorrow_ds, tomorrow_ds_nodash > * Arthur: some users use these for convenience > * Bas/Fokko: these are values that can be easily derived in a > one-liner > * Max agrees these should have been previous_* and next_* execution > date. No such thing as temporary code :-) > > * tables > * Fokko: deprecate and remove in Airflow 2.0 > * latest_date > * Arthur: used by internal backfill framework at Airbnb, but no > issue removing them > * Fokko: deprecate and remove in Airflow 2.0 > * inlets/outlets > * Arthur: still in development, could be guarded behind feature > flag. > * Fokko: no guarding, it adds additional logic > * Bolke: still in development > * end_date/END_DATE > * Arthur: used by internal backfill framework at Airbnb, but no > issue removing them. > * Fokko: deprecate and remove in Airflow 2.0 > * Max finds it useful with a reason (but ignored the fact it’s > actually the start date of the interval?) > * Bolke says there’s some users > * Others question if there are any > * execution_date > * The meaning of execution_date is ambivalent and super confusing > to everybody because it’s not the date of execution. > * Suggestions to add period_start/period_end, or > interval_start/interval_end, or something like that, to provide datetime > variables for start & end datetime of a DAG run. > * Most agree to add new variables with more sensible names and keep > execution_date for backwards compatibility > * Other: > * Removal of variables should be done in major version and > deprecation warnings should be added. > > So, given the discussion it appears almost everybody agrees these > variables are incorrect and/or confusing. However, there are also users > depending on them. How about we: > > > 1. Create new variables with sensible names > 2. At least deprecate tables & latest_date as these appear not > important to anybody > > Cheers, > Bas > > On 30 Apr 2019, at 19:23, Ash Berlin-Taylor <a...@apache.org<mailto: > a...@apache.org>> wrote: > > Did we come to any conclusion on this topic? > > > On 12 Apr 2019, at 10:26, Ash Berlin-Taylor <a...@apache.org<mailto: > a...@apache.org>> wrote: > > Does anyone actually use end_date (of either spelling) given it's value is > currently the same as `ds`: > https://github.com/apache/airflow/blob/3020d9733cb189f091489a62f58f0f586dc8d4a9/airflow/models/taskinstance.py#L1183-L1184 > > > 'END_DATE': ds, > 'end_date': ds, > > That seems like it is just the wrong value - I could see it being useful > as either the same value as `next_execution_date` or `dag.end_date` but the > current value (from a quick thought) doesn't seem like it is useful? > > Given what the value is currently set to I vote for deprecating/removing > these two. > > On 12 Apr 2019, at 09:39, Bolke de Bruin <bdbr...@gmail.com<mailto: > bdbr...@gmail.com>> wrote: > > I don’t think guarding inlets and outlets makes sense, cause if they are > not defined they code won’t be executed anyway. Ie. The logic is already > there, no need for a config flag. > > End-date is actually used by some beyond Airbnb as far as I know. What is > the reason for deprecating it? > > Execution_date in finance is actually very well known. A transaction has > an execution date it is not the time of the sending of the data. I honestly > think that a change here would make it a thing that techies understand but > does not make business sense. What is “today’s data?” it might not be the > data that was sent today. > > > > Verstuurd vanaf mijn iPad > > Op 12 apr. 2019 om 08:35 heeft Maxime Beauchemin < > maximebeauche...@gmail.com<mailto:maximebeauche...@gmail.com>> het > volgende geschreven: > > It should be labelled as a "bridge" or "transition" release somehow, and > tell the community to absolutely to go there first and address all > deprecation warnings prior to upgrade to 2.0. I wonder if semver has > something for that. > > It might be the time to deprecate more things. Spring cleaning! > > Should we start another thread labeled "[2.0 spring cleaning] what Airflow > features should we deprecate!?" > > Max > > On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor <a...@apache.org<mailto: > a...@apache.org>> wrote: > > I am happy to do another small(!) 1.10.x release. > > (There was a small bug I introduced where I broke the rendering of doc_md > on dags) > > On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" <fo...@driesprong.frl > <mailto:fo...@driesprong.frl>> > wrote: > I agree with Max here, we should be careful. > > Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds, > tomorrow_ds_nodash. I'm not against having better readable shorthands, > but > more about the fact that having a tomorrow_ds doesn't make sense when > we > have an hourly (or weekly) job. > > I'm against guarding the {in,out}lets. Since the guarding will add > extra > logic, and this is not worth the added complexity (and another flag) in > my > opinion. > > - We start by putting deprecation warnings on tables, latest_date, > end_date > and END_DATE and remove them in Airflow 2.0. > > This is only possible if there is another version before 2.0. > Currently, > we're preparing to move to 2.0 > > - We add a lineage_enabled config which is false by default and thus > inlets/outlets aren’t provided, unless set to true. > > I'm against guarding the {in,out}lets. Since the guarding will add > extra > logic, and this is not worth the added complexity (and another flag) in > my > opinion. > > - We continue discussion about clarification of execution_date and > better > named variables such as something_start and something_end > > > > https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E > > Cheers, Fokko > > Op do 11 apr. 2019 om 08:44 schreef Bas Harenslak < > basharens...@godatadriven.com>: > > Great discussion, let’s stay on track. If I can summarise: > > > * yesterday_ds, yesterday_ds_nodash, tomorrow_ds, > tomorrow_ds_nodash > * Arthur: some users use these for convenience > * Bas/Fokko: these are values that can be easily derived in a > one-liner > > * tables > * nobody? > > * latest_date > * Arthur: used by internal backfill framework at Airbnb, but > no > issue removing them. > > * inlets/outlets > * Arthur: still in development, could be guarded behind > feature > flag. > > * end_date/END_DATE > * Arthur: used by internal backfill framework at Airbnb, but > no > issue removing them. > > And to add to the discussion: > > > * execution_date > * The meaning of execution_date is ambivalent and super > confusing > to everybody because it’s not the date of execution. > * Suggestions to add period_start/period_end, or > interval_start/interval_end, or something like that, to provide > datetime > variables for start & end datetime of a DAG run. > * Other: > * Removal of variables should be done in major version and > deprecation warnings should be added. > > So how about the following: > - We start by putting deprecation warnings on tables, latest_date, > end_date and END_DATE and remove them in Airflow 2.0. > - We add a lineage_enabled config which is false by default and thus > inlets/outlets aren’t provided, unless set to true. > - We continue discussion about yesterday* and tomorrow* > - We continue discussion about clarification of execution_date and > better > named variables such as something_start and something_end > > Cheers, > Bas > > > On 11 Apr 2019, at 04:52, Maxime Beauchemin > <maximebeauche...@gmail.com > <mailto:maximebeauche...@gmail.com>> wrote: > > Making backwards incompatible changes that require altering the > thousands > (millions?!) of DAGs in the wild will alienate the community and > prevent > many from orchestrating an upgrade. Upgrading hundreds of DAGs and > Airflow > atomically would be hard and dangerous. > > To mitigate this, changes to the DAG / core API should be backward > compatible, at least for a release range, and alert on upcoming > deprecation > (if needed) > > To be clear if `execution_date` is renamed to `foo`, we should have a > version range where both work just as well, but using > `execution_date` > would result is warning messages about its upcoming deprecation, > maybe > something like "It appears you're using the task parameter > `execution_date` > which will be deprecated in favor or `foo` in upcoming release 2.0.0, > find > more info at link.to/foo<http://link.to/foo>" > > Max > > On Tue, Apr 9, 2019 at 7:58 AM James Meickle > > <jmeic...@quantopian.com.invalid<mailto:jmeic...@quantopian.com.invalid>> > wrote: > > I agree with Ash here. The naming of "execution_date" is incredibly > confusing to people who are new to Airflow, who think it has > something to > do with... execution. > > However, I think that there's still room for improvement with > "period_start" and "period_end". Think about manually triggered tasks > - > they'd have a "period_start", but no useful "period_end" until the > next > trigger occurs. And mid-day ETL tasks that are dated to "today" still > have > to reference "period_end" to get the correct date, even though the > DAGrun > won't be over yet! > > If we're going to go through and rename "execution_date", I'd rather > see a > wider-ranging rework that understands a mapping from a date to a > window > (like "every day, process the data from one week ago", or "every > Saturday, > process today's data"). Maybe something closer to how Beam does it > (but retaining simplicity for the 99% of cases that are just "run > daily, > for the past day"). > > > > > > https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions > > I could see something where there are two default DAG parameters are: > schedule="@daily" > window=-1 (or some schedule_delta object that provides a windowing > impl, or > None if there is no window) > > In this world, all DAGRuns would have a "schedule_date" (the date > this > would have normally started at), and an "execution_date" (when it was > actually executed). If a given DAGrun has a window (one-off DAGs may > or may > not!), there would be variables for "window_start" (the start of the > window) and "window_end" (the end of the window; this would default > to the > same as schedule_date, and to the next DAGrun's window_start). > > Disconnecting schedule date, execution date, and window/period would > also > open a pathway for more advanced users to implement irregular > schedules and > windows, without having to rely on hacks. For example: a DAG that > runs at > "the end of the week" to produce a end-of-week would normally run on > Friday, but on Friday holidays, instead complete a day early (and > scan one > fewer day in its window). > > I know there's some historical baggage here, but I think the above is > much > more natural to new users than what we're offering today. And I think > from > a current user perspective, there would be breaking variable name > changes, > but no logical differences for the majority of DAGs. > > On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <a...@apache.org> > wrote: > > To (slightly) hijack this thread: > > On the subject of execuction_date: as I'm sure we're all aware the > concept > of execution_date is confusing to new-commers to Airflow (there are > many > questions about "why hasn't my DAG run yet"? "Why is my dag a day > behind?" > etc.) and although we mention this in the docs it's a confusing > concept. > > What to people think about adding two new parameters: `period_start` > and > `period_end` and making these the preferred terms in place of > execution_date and next_execution_date? > > This hopefully avoids any ambitious terms like "execution" or "run" > which > is understandably easy to conflate with the time the task is being > run > (i.e. `now()`) > > If people think this naming is better and less confusing I would > suggest > we update all the docs and examples to use these terms (but still > mention > the old names somewhere, probably in the macros docs) > > Thoughts? > > -ash > > > On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wied...@gmail.com> > wrote: > > Hi Bas, > > 1) I am aware of a few places where those parameters are used in > production > in a few hundred jobs. I highly recommend we don't deprecate them > unless > we > do it in a major version. > > 2) As James mentioned, inlets and outlets are a lineage annotation > feature > which is still under development. Let's leave them in, but we can > guard > them behind a feature flag if you prefer. > > 3) the yesterday*/tomorrow* params are convenience ones if you use a > daily > ETL. I agree with you that they are simple to compute, but not > everyone > using Apache Airflow is amazing with Python. Some users are only > trying > to > get a query scheduled and rely on a couple of niceties like these to > get > by. > > 4) latest_date, end_date (I feel like there used to be start_date, > but > maybe it got lost) were a blend of things which were used by a > backfill > framework used internally at Airbnb. Latest date was used if you > needed > to > join to a dimension for which you only wanted the latest version of > the > attributes in you backfill. end_date was used for time ranges where > several > days were processed together in a range to save on compute. I don't > see > an > issue with removing them. > > Best regards, > Arthur > > > > On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak < > basharens...@godatadriven.com> > wrote: > > Hi all, > > Following Tao Feng’s question to discuss this PR< > https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192< > https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss > here > if you agree/disagree/would change. > > ----------- > > The summary of the PR: > > I was confused by the task context values and suggest to clean up and > clarify these variables. Some are derivations from other variables, > some > are undocumented and unused, some are wrong (name doesn’t match the > value). > Please discuss what you think of the removal of these variables: > > > * Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds, > tomorrow_ds_nodash. IMO the next_* and previous_* variables are > useful > since these require complex logic to compute the next execution date, > however would leave computing the yesterday* and tomorrow* variables > up > to > the user since they are simple one-liners and don't relate to the DAG > interval. > * Removed tables. This is a field in params, and is thus also > accessible by the user ({{ params.tables }}). Also, it was > undocumented. > * Removed latest_date. It's the same as ds and was also > undocumented. > * Removed inlets and outlets. Also undocumented, and have the > inlets/outlets ever worked/ever been used by anybody? > * Removed end_date and END_DATE. Both have the same value, so it > doesn't make sense to have both variables. Also, the value is ds > which > contains the start date of the interval, so the naming didn't make > sense to > me. However, if anybody argues in favour of adding "start_date" and > "end_date" to provide the start and end datetime of task instance > intervals, I'd be happy to add them. > > Cheers, > Bas > > > > > > > > > > >