I would like to add to this discussion a change to
`TaskInstance.previous_ti`.  It is a useful property in template context,
but it has an identity crisis.

Currently this property may return a different value depending on whether
your dag is `catchup` and whether your dag has a schedule interval.

   - If the dag is catchup, it will calculate the prior execution date
   according to schedule and return that TI if one exists.
   - If catchup is true and schedule is none then it will return none even
   if there is a prior TI.
   - If dag is not catchup it will ignore the schedule and return the
   absolute prior TI.

I propose that in 2.0 behavior should be as follows:

   - *previous_ti* always returns absolute prior ti (i.e. latest prev_ti
   such that prev_ti.exec_date < ti.exec_date), if one exists
   - add property *previous_scheduled_ti* that returns the prior ti
   according to schedule if *it* exists
      - consistent with current behavior, if latest ti exec date < prev
      sched exec date, it will return None; and if no schedule, return none.
   - *note:* there is a PR in progress for *previous_ti_success* -- this
   will pick up absolute prior successful TI

This way, the behavior of these properties will not change depending on
unrelated dag params; they should do what advertise, always.


On Tue, Apr 30, 2019 at 11:08 AM Bas Harenslak <
basharens...@godatadriven.com> wrote:

> Nope. I never planned to have a discussion on execution_date but it
> appeared more important than the other variables mentioned in my first
> email :)
>
> I already wrote a summary halfway this thread. With the updates since then:
>
>
>   *   yesterday_ds, yesterday_ds_nodash, tomorrow_ds, tomorrow_ds_nodash
>      *   Arthur: some users use these for convenience
>      *   Bas/Fokko: these are values that can be easily derived in a
> one-liner
>      *   Max agrees these should have been previous_* and next_* execution
> date. No such thing as temporary code :-)
>
>   *   tables
>      *   Fokko: deprecate and remove in Airflow 2.0
>   *   latest_date
>      *   Arthur: used by internal backfill framework at Airbnb, but no
> issue removing them
>      *   Fokko: deprecate and remove in Airflow 2.0
>   *   inlets/outlets
>      *   Arthur: still in development, could be guarded behind feature
> flag.
>      *   Fokko: no guarding, it adds additional logic
>      *   Bolke: still in development
>   *   end_date/END_DATE
>      *   Arthur: used by internal backfill framework at Airbnb, but no
> issue removing them.
>      *   Fokko: deprecate and remove in Airflow 2.0
>      *   Max finds it useful with a reason (but ignored the fact it’s
> actually the start date of the interval?)
>      *   Bolke says there’s some users
>      *   Others question if there are any
>   *   execution_date
>      *   The meaning of execution_date is ambivalent and super confusing
> to everybody because it’s not the date of execution.
>      *   Suggestions to add period_start/period_end, or
> interval_start/interval_end, or something like that, to provide datetime
> variables for start & end datetime of a DAG run.
>      *   Most agree to add new variables with more sensible names and keep
> execution_date for backwards compatibility
>   *   Other:
>      *   Removal of variables should be done in major version and
> deprecation warnings should be added.
>
> So, given the discussion it appears almost everybody agrees these
> variables are incorrect and/or confusing. However, there are also users
> depending on them. How about we:
>
>
>   1.  Create new variables with sensible names
>   2.  At least deprecate tables & latest_date as these appear not
> important to anybody
>
> Cheers,
> Bas
>
> On 30 Apr 2019, at 19:23, Ash Berlin-Taylor <a...@apache.org<mailto:
> a...@apache.org>> wrote:
>
> Did we come to any conclusion on this topic?
>
>
> On 12 Apr 2019, at 10:26, Ash Berlin-Taylor <a...@apache.org<mailto:
> a...@apache.org>> wrote:
>
> Does anyone actually use end_date (of either spelling) given it's value is
> currently the same as `ds`:
> https://github.com/apache/airflow/blob/3020d9733cb189f091489a62f58f0f586dc8d4a9/airflow/models/taskinstance.py#L1183-L1184
>
>
>   'END_DATE': ds,
>   'end_date': ds,
>
> That seems like it is just the wrong value - I could see it being useful
> as either the same value as `next_execution_date` or `dag.end_date` but the
> current value (from a quick thought) doesn't seem like it is useful?
>
> Given what the value is currently set to I vote for deprecating/removing
> these two.
>
> On 12 Apr 2019, at 09:39, Bolke de Bruin <bdbr...@gmail.com<mailto:
> bdbr...@gmail.com>> wrote:
>
> I don’t think guarding inlets and outlets makes sense, cause if they are
> not defined they code won’t be executed anyway. Ie. The logic is already
> there, no need for a config flag.
>
> End-date is actually used by some beyond Airbnb as far as I know. What is
> the reason for deprecating it?
>
> Execution_date in finance is actually very well known. A transaction has
> an execution date it is not the time of the sending of the data. I honestly
> think that a change here would make it a thing that techies understand but
> does not make business sense. What is “today’s data?” it might not be the
> data that was sent today.
>
>
>
> Verstuurd vanaf mijn iPad
>
> Op 12 apr. 2019 om 08:35 heeft Maxime Beauchemin <
> maximebeauche...@gmail.com<mailto:maximebeauche...@gmail.com>> het
> volgende geschreven:
>
> It should be labelled as a "bridge" or "transition" release somehow, and
> tell the community to absolutely to go there first and address all
> deprecation warnings prior to upgrade to 2.0. I wonder if semver has
> something for that.
>
> It might be the time to deprecate more things. Spring cleaning!
>
> Should we start another thread labeled "[2.0 spring cleaning] what Airflow
> features should we deprecate!?"
>
> Max
>
> On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor <a...@apache.org<mailto:
> a...@apache.org>> wrote:
>
> I am happy to do another small(!) 1.10.x release.
>
> (There was a small bug I introduced where I broke the rendering of doc_md
> on dags)
>
> On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" <fo...@driesprong.frl
> <mailto:fo...@driesprong.frl>>
> wrote:
> I agree with Max here, we should be careful.
>
> Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> tomorrow_ds_nodash. I'm not against having better readable shorthands,
> but
> more about the fact that having a tomorrow_ds doesn't make sense when
> we
> have an hourly (or weekly) job.
>
> I'm against guarding the {in,out}lets. Since the guarding will add
> extra
> logic, and this is not worth the added complexity (and another flag) in
> my
> opinion.
>
> - We start by putting deprecation warnings on tables, latest_date,
> end_date
> and END_DATE and remove them in Airflow 2.0.
>
> This is only possible if there is another version before 2.0.
> Currently,
> we're preparing to move to 2.0
>
> - We add a lineage_enabled config which is false by default and thus
> inlets/outlets aren’t provided, unless set to true.
>
> I'm against guarding the {in,out}lets. Since the guarding will add
> extra
> logic, and this is not worth the added complexity (and another flag) in
> my
> opinion.
>
> - We continue discussion about clarification of execution_date and
> better
> named variables such as something_start and something_end
>
>
>
> https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E
>
> Cheers, Fokko
>
> Op do 11 apr. 2019 om 08:44 schreef Bas Harenslak <
> basharens...@godatadriven.com>:
>
> Great discussion, let’s stay on track. If I can summarise:
>
>
> *   yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> tomorrow_ds_nodash
>  *   Arthur: some users use these for convenience
>  *   Bas/Fokko: these are values that can be easily derived in a
> one-liner
>
> *   tables
>  *   nobody?
>
> *   latest_date
>  *   Arthur: used by internal backfill framework at Airbnb, but
> no
> issue removing them.
>
> *   inlets/outlets
>  *   Arthur: still in development, could be guarded behind
> feature
> flag.
>
> *   end_date/END_DATE
>  *   Arthur: used by internal backfill framework at Airbnb, but
> no
> issue removing them.
>
> And to add to the discussion:
>
>
> *   execution_date
>  *   The meaning of execution_date is ambivalent and super
> confusing
> to everybody because it’s not the date of execution.
>  *   Suggestions to add period_start/period_end, or
> interval_start/interval_end, or something like that, to provide
> datetime
> variables for start & end datetime of a DAG run.
> *   Other:
>  *   Removal of variables should be done in major version and
> deprecation warnings should be added.
>
> So how about the following:
> - We start by putting deprecation warnings on tables, latest_date,
> end_date and END_DATE and remove them in Airflow 2.0.
> - We add a lineage_enabled config which is false by default and thus
> inlets/outlets aren’t provided, unless set to true.
> - We continue discussion about yesterday* and tomorrow*
> - We continue discussion about clarification of execution_date and
> better
> named variables such as something_start and something_end
>
> Cheers,
> Bas
>
>
> On 11 Apr 2019, at 04:52, Maxime Beauchemin
> <maximebeauche...@gmail.com
> <mailto:maximebeauche...@gmail.com>> wrote:
>
> Making backwards incompatible changes that require altering the
> thousands
> (millions?!) of DAGs in the wild will alienate the community and
> prevent
> many from orchestrating an upgrade. Upgrading hundreds of DAGs and
> Airflow
> atomically would be hard and dangerous.
>
> To mitigate this, changes to the DAG / core API should be backward
> compatible, at least for a release range, and alert on upcoming
> deprecation
> (if needed)
>
> To be clear if `execution_date` is renamed to `foo`, we should have a
> version range where both work just as well, but using
> `execution_date`
> would result is warning messages about its upcoming deprecation,
> maybe
> something like "It appears you're using the task parameter
> `execution_date`
> which will be deprecated in favor or `foo` in upcoming release 2.0.0,
> find
> more info at link.to/foo<http://link.to/foo>"
>
> Max
>
> On Tue, Apr 9, 2019 at 7:58 AM James Meickle
>
> <jmeic...@quantopian.com.invalid<mailto:jmeic...@quantopian.com.invalid>>
> wrote:
>
> I agree with Ash here. The naming of "execution_date" is incredibly
> confusing to people who are new to Airflow, who think it has
> something to
> do with... execution.
>
> However, I think that there's still room for improvement with
> "period_start" and "period_end". Think about manually triggered tasks
> -
> they'd have a "period_start", but no useful "period_end" until the
> next
> trigger occurs. And mid-day ETL tasks that are dated to "today" still
> have
> to reference "period_end" to get the correct date, even though the
> DAGrun
> won't be over yet!
>
> If we're going to go through and rename "execution_date", I'd rather
> see a
> wider-ranging rework that understands a mapping from a date to a
> window
> (like "every day, process the data from one week ago", or "every
> Saturday,
> process today's data"). Maybe something closer to how Beam does it
> (but retaining simplicity for the 99% of cases that are just "run
> daily,
> for the past day").
>
>
>
>
>
> https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions
>
> I could see something where there are two default DAG parameters are:
> schedule="@daily"
> window=-1 (or some schedule_delta object that provides a windowing
> impl, or
> None if there is no window)
>
> In this world, all DAGRuns would have a "schedule_date" (the date
> this
> would have normally started at), and an "execution_date" (when it was
> actually executed). If a given DAGrun has a window (one-off DAGs may
> or may
> not!), there would be variables for "window_start" (the start of the
> window) and "window_end" (the end of the window; this would default
> to the
> same as schedule_date, and to the next DAGrun's window_start).
>
> Disconnecting schedule date, execution date, and window/period would
> also
> open a pathway for more advanced users to implement irregular
> schedules and
> windows, without having to rely on hacks. For example: a DAG that
> runs at
> "the end of the week" to produce a end-of-week would normally run on
> Friday, but on Friday holidays, instead complete a day early (and
> scan one
> fewer day in its window).
>
> I know there's some historical baggage here, but I think the above is
> much
> more natural to new users than what we're offering today. And I think
> from
> a current user perspective, there would be breaking variable name
> changes,
> but no logical differences for the majority of DAGs.
>
> On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <a...@apache.org>
> wrote:
>
> To (slightly) hijack this thread:
>
> On the subject of execuction_date: as I'm sure we're all aware the
> concept
> of execution_date is confusing to new-commers to Airflow (there are
> many
> questions about "why hasn't my DAG run yet"? "Why is my dag a day
> behind?"
> etc.) and although we mention this in the docs it's a confusing
> concept.
>
> What to people think about adding two new parameters: `period_start`
> and
> `period_end` and making these the preferred terms in place of
> execution_date and next_execution_date?
>
> This hopefully avoids any ambitious terms like "execution" or "run"
> which
> is understandably easy to conflate with the time the task is being
> run
> (i.e. `now()`)
>
> If people think this naming is better and less confusing I would
> suggest
> we update all the docs and examples to use these terms (but still
> mention
> the old names somewhere, probably in the macros docs)
>
> Thoughts?
>
> -ash
>
>
> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wied...@gmail.com>
> wrote:
>
> Hi Bas,
>
> 1) I am aware of a few places where those parameters are used in
> production
> in a few hundred jobs. I highly recommend we don't deprecate them
> unless
> we
> do it in a major version.
>
> 2) As James mentioned, inlets and outlets are a lineage annotation
> feature
> which is still under development. Let's leave them in, but we can
> guard
> them behind a feature flag if you prefer.
>
> 3) the yesterday*/tomorrow* params are convenience ones if you use a
> daily
> ETL. I agree with you that they are simple to compute, but not
> everyone
> using Apache Airflow is amazing with Python. Some users are only
> trying
> to
> get a query scheduled and rely on a couple of niceties like these to
> get
> by.
>
> 4) latest_date, end_date (I feel like there used to be start_date,
> but
> maybe it got lost) were a blend of things which were used by a
> backfill
> framework used internally at Airbnb. Latest date was used if you
> needed
> to
> join to a dimension for which you only wanted the latest version of
> the
> attributes in you backfill. end_date was used for time ranges where
> several
> days were processed together in a range to save on compute. I don't
> see
> an
> issue with removing them.
>
> Best regards,
> Arthur
>
>
>
> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <
> basharens...@godatadriven.com>
> wrote:
>
> Hi all,
>
> Following Tao Feng’s question to discuss this PR<
> https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192<
> https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
> here
> if you agree/disagree/would change.
>
> -----------
>
> The summary of the PR:
>
> I was confused by the task context values and suggest to clean up and
> clarify these variables. Some are derivations from other variables,
> some
> are undocumented and unused, some are wrong (name doesn’t match the
> value).
> Please discuss what you think of the removal of these variables:
>
>
> *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> tomorrow_ds_nodash. IMO the next_* and previous_* variables are
> useful
> since these require complex logic to compute the next execution date,
> however would leave computing the yesterday* and tomorrow* variables
> up
> to
> the user since they are simple one-liners and don't relate to the DAG
> interval.
> *   Removed tables. This is a field in params, and is thus also
> accessible by the user ({{ params.tables }}). Also, it was
> undocumented.
> *   Removed latest_date. It's the same as ds and was also
> undocumented.
> *   Removed inlets and outlets. Also undocumented, and have the
> inlets/outlets ever worked/ever been used by anybody?
> *   Removed end_date and END_DATE. Both have the same value, so it
> doesn't make sense to have both variables. Also, the value is ds
> which
> contains the start date of the interval, so the naming didn't make
> sense to
> me. However, if anybody argues in favour of adding "start_date" and
> "end_date" to provide the start and end datetime of task instance
> intervals, I'd be happy to add them.
>
> Cheers,
> Bas
>
>
>
>
>
>
>
>
>
>
>

Reply via email to