Nope. I never planned to have a discussion on execution_date but it appeared 
more important than the other variables mentioned in my first email :)

I already wrote a summary halfway this thread. With the updates since then:


  *   yesterday_ds, yesterday_ds_nodash, tomorrow_ds, tomorrow_ds_nodash
     *   Arthur: some users use these for convenience
     *   Bas/Fokko: these are values that can be easily derived in a one-liner
     *   Max agrees these should have been previous_* and next_* execution 
date. No such thing as temporary code :-)

  *   tables
     *   Fokko: deprecate and remove in Airflow 2.0
  *   latest_date
     *   Arthur: used by internal backfill framework at Airbnb, but no issue 
removing them
     *   Fokko: deprecate and remove in Airflow 2.0
  *   inlets/outlets
     *   Arthur: still in development, could be guarded behind feature flag.
     *   Fokko: no guarding, it adds additional logic
     *   Bolke: still in development
  *   end_date/END_DATE
     *   Arthur: used by internal backfill framework at Airbnb, but no issue 
removing them.
     *   Fokko: deprecate and remove in Airflow 2.0
     *   Max finds it useful with a reason (but ignored the fact it’s actually 
the start date of the interval?)
     *   Bolke says there’s some users
     *   Others question if there are any
  *   execution_date
     *   The meaning of execution_date is ambivalent and super confusing to 
everybody because it’s not the date of execution.
     *   Suggestions to add period_start/period_end, or 
interval_start/interval_end, or something like that, to provide datetime 
variables for start & end datetime of a DAG run.
     *   Most agree to add new variables with more sensible names and keep 
execution_date for backwards compatibility
  *   Other:
     *   Removal of variables should be done in major version and deprecation 
warnings should be added.

So, given the discussion it appears almost everybody agrees these variables are 
incorrect and/or confusing. However, there are also users depending on them. 
How about we:


  1.  Create new variables with sensible names
  2.  At least deprecate tables & latest_date as these appear not important to 
anybody

Cheers,
Bas

On 30 Apr 2019, at 19:23, Ash Berlin-Taylor 
<a...@apache.org<mailto:a...@apache.org>> wrote:

Did we come to any conclusion on this topic?


On 12 Apr 2019, at 10:26, Ash Berlin-Taylor 
<a...@apache.org<mailto:a...@apache.org>> wrote:

Does anyone actually use end_date (of either spelling) given it's value is 
currently the same as `ds`: 
https://github.com/apache/airflow/blob/3020d9733cb189f091489a62f58f0f586dc8d4a9/airflow/models/taskinstance.py#L1183-L1184


  'END_DATE': ds,
  'end_date': ds,

That seems like it is just the wrong value - I could see it being useful as 
either the same value as `next_execution_date` or `dag.end_date` but the 
current value (from a quick thought) doesn't seem like it is useful?

Given what the value is currently set to I vote for deprecating/removing these 
two.

On 12 Apr 2019, at 09:39, Bolke de Bruin 
<bdbr...@gmail.com<mailto:bdbr...@gmail.com>> wrote:

I don’t think guarding inlets and outlets makes sense, cause if they are not 
defined they code won’t be executed anyway. Ie. The logic is already there, no 
need for a config flag.

End-date is actually used by some beyond Airbnb as far as I know. What is the 
reason for deprecating it?

Execution_date in finance is actually very well known. A transaction has an 
execution date it is not the time of the sending of the data. I honestly think 
that a change here would make it a thing that techies understand but does not 
make business sense. What is “today’s data?” it might not be the data that was 
sent today.



Verstuurd vanaf mijn iPad

Op 12 apr. 2019 om 08:35 heeft Maxime Beauchemin 
<maximebeauche...@gmail.com<mailto:maximebeauche...@gmail.com>> het volgende 
geschreven:

It should be labelled as a "bridge" or "transition" release somehow, and
tell the community to absolutely to go there first and address all
deprecation warnings prior to upgrade to 2.0. I wonder if semver has
something for that.

It might be the time to deprecate more things. Spring cleaning!

Should we start another thread labeled "[2.0 spring cleaning] what Airflow
features should we deprecate!?"

Max

On Thu, Apr 11, 2019 at 11:26 PM Ash Berlin-Taylor 
<a...@apache.org<mailto:a...@apache.org>> wrote:

I am happy to do another small(!) 1.10.x release.

(There was a small bug I introduced where I broke the rendering of doc_md
on dags)

On 11 April 2019 22:52:13 BST, "Driesprong, Fokko" 
<fo...@driesprong.frl<mailto:fo...@driesprong.frl>>
wrote:
I agree with Max here, we should be careful.

Regarding the yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
tomorrow_ds_nodash. I'm not against having better readable shorthands,
but
more about the fact that having a tomorrow_ds doesn't make sense when
we
have an hourly (or weekly) job.

I'm against guarding the {in,out}lets. Since the guarding will add
extra
logic, and this is not worth the added complexity (and another flag) in
my
opinion.

- We start by putting deprecation warnings on tables, latest_date,
end_date
and END_DATE and remove them in Airflow 2.0.

This is only possible if there is another version before 2.0.
Currently,
we're preparing to move to 2.0

- We add a lineage_enabled config which is false by default and thus
inlets/outlets aren’t provided, unless set to true.

I'm against guarding the {in,out}lets. Since the guarding will add
extra
logic, and this is not worth the added complexity (and another flag) in
my
opinion.

- We continue discussion about clarification of execution_date and
better
named variables such as something_start and something_end


https://lists.apache.org/thread.html/31423aa7feba421c53356a1e566f777c7a7973966c3320611286a2fb@%3Cdev.airflow.apache.org%3E

Cheers, Fokko

Op do 11 apr. 2019 om 08:44 schreef Bas Harenslak <
basharens...@godatadriven.com>:

Great discussion, let’s stay on track. If I can summarise:


*   yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
tomorrow_ds_nodash
 *   Arthur: some users use these for convenience
 *   Bas/Fokko: these are values that can be easily derived in a
one-liner

*   tables
 *   nobody?

*   latest_date
 *   Arthur: used by internal backfill framework at Airbnb, but
no
issue removing them.

*   inlets/outlets
 *   Arthur: still in development, could be guarded behind
feature
flag.

*   end_date/END_DATE
 *   Arthur: used by internal backfill framework at Airbnb, but
no
issue removing them.

And to add to the discussion:


*   execution_date
 *   The meaning of execution_date is ambivalent and super
confusing
to everybody because it’s not the date of execution.
 *   Suggestions to add period_start/period_end, or
interval_start/interval_end, or something like that, to provide
datetime
variables for start & end datetime of a DAG run.
*   Other:
 *   Removal of variables should be done in major version and
deprecation warnings should be added.

So how about the following:
- We start by putting deprecation warnings on tables, latest_date,
end_date and END_DATE and remove them in Airflow 2.0.
- We add a lineage_enabled config which is false by default and thus
inlets/outlets aren’t provided, unless set to true.
- We continue discussion about yesterday* and tomorrow*
- We continue discussion about clarification of execution_date and
better
named variables such as something_start and something_end

Cheers,
Bas


On 11 Apr 2019, at 04:52, Maxime Beauchemin
<maximebeauche...@gmail.com
<mailto:maximebeauche...@gmail.com>> wrote:

Making backwards incompatible changes that require altering the
thousands
(millions?!) of DAGs in the wild will alienate the community and
prevent
many from orchestrating an upgrade. Upgrading hundreds of DAGs and
Airflow
atomically would be hard and dangerous.

To mitigate this, changes to the DAG / core API should be backward
compatible, at least for a release range, and alert on upcoming
deprecation
(if needed)

To be clear if `execution_date` is renamed to `foo`, we should have a
version range where both work just as well, but using
`execution_date`
would result is warning messages about its upcoming deprecation,
maybe
something like "It appears you're using the task parameter
`execution_date`
which will be deprecated in favor or `foo` in upcoming release 2.0.0,
find
more info at link.to/foo<http://link.to/foo>"

Max

On Tue, Apr 9, 2019 at 7:58 AM James Meickle

<jmeic...@quantopian.com.invalid<mailto:jmeic...@quantopian.com.invalid>>
wrote:

I agree with Ash here. The naming of "execution_date" is incredibly
confusing to people who are new to Airflow, who think it has
something to
do with... execution.

However, I think that there's still room for improvement with
"period_start" and "period_end". Think about manually triggered tasks
-
they'd have a "period_start", but no useful "period_end" until the
next
trigger occurs. And mid-day ETL tasks that are dated to "today" still
have
to reference "period_end" to get the correct date, even though the
DAGrun
won't be over yet!

If we're going to go through and rename "execution_date", I'd rather
see a
wider-ranging rework that understands a mapping from a date to a
window
(like "every day, process the data from one week ago", or "every
Saturday,
process today's data"). Maybe something closer to how Beam does it
(but retaining simplicity for the 99% of cases that are just "run
daily,
for the past day").




https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions

I could see something where there are two default DAG parameters are:
schedule="@daily"
window=-1 (or some schedule_delta object that provides a windowing
impl, or
None if there is no window)

In this world, all DAGRuns would have a "schedule_date" (the date
this
would have normally started at), and an "execution_date" (when it was
actually executed). If a given DAGrun has a window (one-off DAGs may
or may
not!), there would be variables for "window_start" (the start of the
window) and "window_end" (the end of the window; this would default
to the
same as schedule_date, and to the next DAGrun's window_start).

Disconnecting schedule date, execution date, and window/period would
also
open a pathway for more advanced users to implement irregular
schedules and
windows, without having to rely on hacks. For example: a DAG that
runs at
"the end of the week" to produce a end-of-week would normally run on
Friday, but on Friday holidays, instead complete a day early (and
scan one
fewer day in its window).

I know there's some historical baggage here, but I think the above is
much
more natural to new users than what we're offering today. And I think
from
a current user perspective, there would be breaking variable name
changes,
but no logical differences for the majority of DAGs.

On Tue, Apr 9, 2019 at 6:17 AM Ash Berlin-Taylor <a...@apache.org>
wrote:

To (slightly) hijack this thread:

On the subject of execuction_date: as I'm sure we're all aware the
concept
of execution_date is confusing to new-commers to Airflow (there are
many
questions about "why hasn't my DAG run yet"? "Why is my dag a day
behind?"
etc.) and although we mention this in the docs it's a confusing
concept.

What to people think about adding two new parameters: `period_start`
and
`period_end` and making these the preferred terms in place of
execution_date and next_execution_date?

This hopefully avoids any ambitious terms like "execution" or "run"
which
is understandably easy to conflate with the time the task is being
run
(i.e. `now()`)

If people think this naming is better and less confusing I would
suggest
we update all the docs and examples to use these terms (but still
mention
the old names somewhere, probably in the macros docs)

Thoughts?

-ash


On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wied...@gmail.com>
wrote:

Hi Bas,

1) I am aware of a few places where those parameters are used in
production
in a few hundred jobs. I highly recommend we don't deprecate them
unless
we
do it in a major version.

2) As James mentioned, inlets and outlets are a lineage annotation
feature
which is still under development. Let's leave them in, but we can
guard
them behind a feature flag if you prefer.

3) the yesterday*/tomorrow* params are convenience ones if you use a
daily
ETL. I agree with you that they are simple to compute, but not
everyone
using Apache Airflow is amazing with Python. Some users are only
trying
to
get a query scheduled and rely on a couple of niceties like these to
get
by.

4) latest_date, end_date (I feel like there used to be start_date,
but
maybe it got lost) were a blend of things which were used by a
backfill
framework used internally at Airbnb. Latest date was used if you
needed
to
join to a dimension for which you only wanted the latest version of
the
attributes in you backfill. end_date was used for time ranges where
several
days were processed together in a range to save on compute. I don't
see
an
issue with removing them.

Best regards,
Arthur



On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <
basharens...@godatadriven.com>
wrote:

Hi all,

Following Tao Feng’s question to discuss this PR<
https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192<
https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss
here
if you agree/disagree/would change.

-----------

The summary of the PR:

I was confused by the task context values and suggest to clean up and
clarify these variables. Some are derivations from other variables,
some
are undocumented and unused, some are wrong (name doesn’t match the
value).
Please discuss what you think of the removal of these variables:


*   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
tomorrow_ds_nodash. IMO the next_* and previous_* variables are
useful
since these require complex logic to compute the next execution date,
however would leave computing the yesterday* and tomorrow* variables
up
to
the user since they are simple one-liners and don't relate to the DAG
interval.
*   Removed tables. This is a field in params, and is thus also
accessible by the user ({{ params.tables }}). Also, it was
undocumented.
*   Removed latest_date. It's the same as ds and was also
undocumented.
*   Removed inlets and outlets. Also undocumented, and have the
inlets/outlets ever worked/ever been used by anybody?
*   Removed end_date and END_DATE. Both have the same value, so it
doesn't make sense to have both variables. Also, the value is ds
which
contains the start date of the interval, so the naming didn't make
sense to
me. However, if anybody argues in favour of adding "start_date" and
"end_date" to provide the start and end datetime of task instance
intervals, I'd be happy to add them.

Cheers,
Bas










Reply via email to