From my perspective the confusion is twofold:

  1.  A DAG run starts at the end of an interval. At first this is confusing to 
people but after some time it starts to make sense. Some visual change in the 
UI would be helpful though.
  2.  execution_date in the TI context - the name is confusing to everybody and 
should be changed IMO. My suggestion is to leave execution_date as is since 
many users depend on it, and add two new variables “interval_start” and 
“interval_end” to denote the start and end of the DAG run interval.
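
As a rough illustration of the two proposed variables, here is a minimal sketch in plain Python. It assumes a fixed-length schedule interval. The function name `describe_run` is hypothetical, and `interval_start` / `interval_end` are only the names suggested above, not existing Airflow context variables.

```python
from datetime import datetime, timedelta, timezone


def describe_run(execution_date, interval):
    """Return (interval_start, interval_end, earliest_start) for one DAG run.

    Assumes a fixed-length interval; the names are illustrative only.
    """
    interval_start = execution_date            # what Airflow calls execution_date
    interval_end = execution_date + interval   # end of the period being processed
    earliest_start = interval_end              # the run starts once the interval is over
    return interval_start, interval_end, earliest_start


start, end, earliest = describe_run(
    datetime(2019, 4, 15, tzinfo=timezone.utc), timedelta(days=1)
)
print(start.isoformat(), end.isoformat(), earliest.isoformat())
```

With a daily interval, the run labelled 15 April covers 15 April to 16 April and cannot start before 16 April, which is the behaviour that surprises people.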

Bas

On 15 Apr 2019, at 15:52, Dan Davydov 
<ddavy...@twitter.com.INVALID<mailto:ddavy...@twitter.com.INVALID>> wrote:

You could start a [VOTE][PMC ONLY] thread on this topic (
https://www.apache.org/foundation/voting.html). Not sure if that's the best
Apache way of doing things, but seems fine to me. My PMC vote personally
would maybe be to switch the semantics to the opposite of what they are now
without having an additional config value, but since that's not very
realistic given the migration effort required by users I think a flag would
probably be worth the costs I mentioned in my previous email although it's
definitely a trade-off. Some kind of survey of new and existing users would
probably help collect data to support a decision but could be tough to
conduct.

On Mon, Apr 15, 2019 at 4:34 PM James Meickle
<jmeic...@quantopian.com.invalid<mailto:jmeic...@quantopian.com.invalid>> wrote:

Personally I would be very interested in working on a flexible schedule
window/window projection patch. But it would be a big undertaking so it
doesn't make sense to start it unless there's a lot of community buy-in to
the idea that we aren't just for day-after ETL systems.

On Mon, Apr 15, 2019 at 8:52 AM airflowuser
<airflowu...@protonmail.com.invalid<mailto:airflowu...@protonmail.com.invalid>> 
wrote:

To quote my user-experience professor from ages ago:
"If too many people misuse something you wrote it means that YOU are
doing something wrong".

Something can be well documented but if it's not intuitive it's likely
that people will get it wrong.

Say someone asks "When did you execute the code?" Your answer will be
direct: the time the code started to run. This is why so many people
misunderstand execution_date in Airflow's terms. Airflow took a
word that is well defined in our minds and replaced its meaning.


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, April 15, 2019 3:35 PM, Dan Davydov
<ddavy...@twitter.com.INVALID<mailto:ddavy...@twitter.com.INVALID>> wrote:

I think if the mission of Airflow is to be a generic Workflow engine, the
current semantics of execution date aren't a good default. This might be an
unpopular opinion given past threads on this topic :).

The execution_date = end_date semantics make sense for the ETL use case but
not for other use cases. I think cron syntax is more intuitive to users,
i.e. start_date should match execution_date (although I don't have data to
back this up). This is especially prevalent in ML; it's almost a rite of
passage for users to get confused by execution date semantics. I think a
flag to support different execution date semantics makes sense, even at the
cost of the headache of supporting both and the added complexity, which
could lead to bugs and trickier mailing list support.

On Wed, Apr 10, 2019 at 9:00 PM Gabriel Silk 
gs...@dropbox.com.invalid<mailto:gs...@dropbox.com.invalid>
wrote:

My two cents:
"execution_date" is definitely confusing to newcomers, and it's partly the
ambiguity of the wording, and partly the UI's fault. When I first saw
execution date, I assumed it meant the earliest time at which the task
will execute, which is wrong. I was confused when no tasks appeared for
3pm until 4pm.
My proposal to fix that:

1.  Always show the next task to be executed in the UI, but explain to the
    user that it's not running because its interval has not yet completed.
    Indicate this state visually, perhaps by using some transparency or
    another color.

2.  Instead of just showing execution date in the UI, show the low/high
    range of the time period it covers (for periodic jobs).


As for what we call the low/high timestamps, I like these two options:

-   low_ts, high_ts
-   interval_start, interval_end

On Wed, Apr 10, 2019 at 6:43 AM James Meickle
jmeic...@quantopian.com.invalid<mailto:jmeic...@quantopian.com.invalid> wrote:

Strictly tying execution start to interval end doesn't work for some
workflows (my guess, 1-5% of them?):

-   You need to start performing tasks before the interval is over
-   You have tasks that reference a single interval, but can't be completed
    until several intervals later (due to data latency)
-   The frequency you need to run the task on is different than the
    frequency of the interval you need to process (like processing all
    records from the last five days, every day)
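
The third case (a run frequency that differs from the window being processed) can be sketched as follows. This is a minimal illustration in plain Python, assuming a daily schedule and a five-day lookback; `lookback_window` is a hypothetical helper, not an Airflow API.

```python
from datetime import datetime, timedelta


def lookback_window(execution_date,
                    schedule=timedelta(days=1),
                    lookback=timedelta(days=5)):
    """Return the (start, end) of the data window for one run.

    The window ends where the schedule interval ends, but reaches back
    further than a single interval.
    """
    window_end = execution_date + schedule
    window_start = window_end - lookback
    return window_start, window_end


window = lookback_window(datetime(2019, 4, 10))
print(window)
```

Each daily run then reprocesses an overlapping five-day range, something a single execution_date does not express on its own.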


Airflow doesn't handle any of these situations gracefully and I've seen
people attempt all sorts of workarounds for them. Probably even more people
would try, if we provided decent idioms for doing it rather than those
workarounds.
On Wed, Apr 10, 2019 at 9:30 AM Driesprong, Fokko
fo...@driesprong.frl<mailto:fo...@driesprong.frl>
wrote:

I see what you mean. I don't really like the `period_{start,end}` name, but
something such as `interval_{start,end}` might do it for me.

Personally, I think running the job after the interval closes (since then
you have all the data over the interval) makes complete sense for ETL
jobs. I agree it requires some time to get used to. Maybe we're lacking on
documentation here.

Cheers, Fokko

On Wed, 10 Apr 2019 at 10:08, Flo Rance
troura...@gmail.com<mailto:troura...@gmail.com> wrote:

I didn't expect to participate in any debate on that software, as I'm a
complete newcomer. But I'm almost forced to, as I am the target audience
too.

To answer your initial question: after reading a lot of documentation I
find the term execution_date really counterintuitive, so yes, maybe
period_start and period_end would be better names to help understand
how all the initial scheduling works. Even after reading the
scheduling section of the doc and the FAQ, it was still not clear in my
mind. Btw, I find some ideas exposed by James Meickle in the [DISCUSS]
AIRFLOW-4192 thread very interesting and I share his opinion that there's
still room for improvement.

But a mode to change from "run at end of period, I need all the data
available for this period" (the current) to "run at this time on the
schedule_interval" would be awesome.

Regards,
Flo
On Tue, Apr 9, 2019 at 4:41 PM Ash Berlin-Taylor
a...@apache.org<mailto:a...@apache.org>
wrote:

Yeah, that's the other thing that has been talked about from time to time,
which is a mode to change from "run at end of period, I need all the data
available for this period" (the current) to "run at this time on the
schedule_interval, don't wait for the period to end".

(No such flag exists right now, before you go looking.)

On 9 Apr 2019, at 15:31, Shaw, Damian P. <
damian.sha...@credit-suisse.com> wrote:
Hi all,
I'm new to this Airflow Dev mailing list so I wasn't expecting to reply
to anything but I feel I am the target audience for this question. I am
quite new to airflow and have been setting up an airflow environment for my
business this last month.

I find the current "execution_date" a small technical burden and a large
cognitive burden. Our workflow is based on DAGs running at a specified time
in a specified timezone using the same date as the current calendar date.

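I have worked around this by creating my own macro and context
variables, with the logic looking like this:

import datetime

# execution_date marks the start of the schedule interval
airflow_execution_date = context['execution_date']
dag_timezone = context['dag'].timezone
# convert to the DAG's timezone, then shift one day to reach the calendar
# date on which the (daily) run actually happens
local_execution_date = dag_timezone.convert(airflow_execution_date)
local_cal_date = local_execution_date + datetime.timedelta(days=1)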

As you can see this isn't a lot of technical effort, but having a date
that 1) is in the timezone the business users are working in, and 2) is the
same calendar date the business users are working in significantly
reduces the cognitive effort required to set up tasks. Of course this
doesn't help with cron format scheduling, for which I just let the business
give me the requirements and set it up myself, as the date logic there is
still confusing: it doesn't work like real cron scheduling, which everyone
is familiar with.

Maybe "period_start" and "period_end" might help people on Day 0 of
understanding Airflow get that the dates you are dealing with are not what
you expect, but Day 1+ there's still a lot of cognitive overhead if you
don't have the exact same model as Airbnb for running DAGs and tasks.

My 2 cents anyway,
Damian Shaw
-----Original Message-----
From: Ash Berlin-Taylor [mailto:a...@apache.org]
Sent: Tuesday, April 09, 2019 10:08 AM
To: dev@airflow.apache.org
Subject: [DISCUSS] period_start/period_end instead of
execution_date/next_execution_date
(trying to break this out in to another thread)
The ML doesn't allow images, but I can guess that it is the deps
section of a task instance details screen?

I'm not saying it's not clear once you know to look there, but I'm
trying to remove/reduce the confusion in the first place. And I think we as
committers aren't best placed to know what makes sense as we have
internalised how Airflow works :)

So I guess this is a question to the newest people on the list: Would
`period_start` and `period_end` be more or less confusing for you when you
were first getting started with Airflow?

-ash

On 9 Apr 2019, at 14:47, Driesprong, Fokko <fo...@driesprong.frl> wrote:

Ash,
Personally, I think this is quite clear, there is a list of reasons why
the job isn't being scheduled:

Coming back to the question of Bas, I believe that yesterday_ds does
not make sense since we cannot assume that the schedule is daily. I don't
see any usage of this variable. Personally, I do use next_execution_date
quite extensively. Say you have a job that runs daily, but you want to
change it to an hourly job. In such a case you don't want to change
{{ (execution_date + macros.timedelta(days=1)) }} to
{{ (execution_date - macros.timedelta(hours=1)) }} everywhere.
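
The maintenance argument can be sketched in plain Python. This is an illustration only, assuming a fixed-length schedule: `hardcoded_end` mimics a template with `timedelta(days=1)` baked in, and `schedule_aware_end` is a hypothetical stand-in for what next_execution_date provides.

```python
from datetime import datetime, timedelta


def hardcoded_end(execution_date):
    # Equivalent of execution_date + macros.timedelta(days=1) in a template:
    # correct only while the DAG runs daily.
    return execution_date + timedelta(days=1)


def schedule_aware_end(execution_date, schedule_interval):
    # Stand-in for next_execution_date: follows whatever the schedule is.
    return execution_date + schedule_interval


run = datetime(2019, 4, 9, 12, 0)

daily_ok = hardcoded_end(run) == schedule_aware_end(run, timedelta(days=1))
hourly_ok = hardcoded_end(run) == schedule_aware_end(run, timedelta(hours=1))
print(daily_ok, hourly_ok)
```

The hard-coded offset agrees with the schedule only in the daily case, so every such expression would have to be edited when the DAG moves to an hourly schedule.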

I'm just not sure if the aggressive deprecation of these is really worth
it. I don't see too much harm in letting them stay.

Cheers, Fokko
On Tue, 9 Apr 2019 at 12:17, Ash Berlin-Taylor
<a...@apache.org<mailto:a...@apache.org>> wrote:

To (slightly) hijack this thread:
On the subject of execution_date: as I'm sure we're all aware, the
concept of execution_date is confusing to newcomers to Airflow (there are
many questions about "why hasn't my DAG run yet?", "Why is my DAG a day
behind?" etc.) and although we mention this in the docs it's a confusing
concept.

What do people think about adding two new parameters, `period_start`
and `period_end`, and making these the preferred terms in place of
execution_date and next_execution_date?

This hopefully avoids any ambiguous terms like "execution" or "run",
which are understandably easy to conflate with the time the task is being
run (i.e. `now()`).

If people think this naming is better and less confusing I would
suggest we update all the docs and examples to use these terms (but still
mention the old names somewhere, probably in the macros docs).

Thoughts?
-ash

On 8 Apr 2019, at 16:20, Arthur Wiedmer
<arthur.wied...@gmail.com<mailto:arthur.wied...@gmail.com>> wrote:

Hi Bas,

1.  I am aware of a few places where those parameters are used in
    production in a few hundred jobs. I highly recommend we don't deprecate
    them unless we do it in a major version.

2.  As James mentioned, inlets and outlets are a lineage annotation feature
    which is still under development. Let's leave them in, but we can guard
    them behind a feature flag if you prefer.

3.  The yesterday*/tomorrow* params are convenience ones if you use a daily
    ETL. I agree with you that they are simple to compute, but not everyone
    using Apache Airflow is amazing with Python. Some users are only trying
    to get a query scheduled and rely on a couple of niceties like these to
    get by.

4.  latest_date, end_date (I feel like there used to be start_date, but
    maybe it got lost) were a blend of things which were used by a backfill
    framework used internally at Airbnb. Latest date was used if you needed
    to join to a dimension for which you only wanted the latest version of
    the attributes in your backfill. end_date was used for time ranges where
    several days were processed together in a range to save on compute. I
    don't see an issue with removing them.
Best regards,
Arthur
On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak
<basharens...@godatadriven.com<mailto:basharens...@godatadriven.com>> wrote:

Hi all,
Following Tao Feng’s question to discuss this PR
<https://github.com/apache/airflow/pull/5010>
(AIRFLOW-4192 <https://issues.apache.org/jira/browse/AIRFLOW-4192>), please
discuss here if you agree/disagree/would change.

The summary of the PR:
I was confused by the task context values and suggest to clean up and
clarify these variables. Some are derivations from other variables, some
are undocumented and unused, some are wrong (name doesn’t match the
value).

Please discuss what you think of the removal of these variables:

-   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
    tomorrow_ds_nodash. IMO the next_* and previous_* variables are useful
    since these require complex logic to compute the next execution date,
    however I would leave computing the yesterday* and tomorrow* variables
    up to the user since they are simple one-liners and don't relate to the
    DAG interval.

-   Removed tables. This is a field in params, and is thus also
    accessible by the user ({{ params.tables }}). Also, it was
    undocumented.

-   Removed latest_date. It's the same as ds and was also undocumented.

-   Removed inlets and outlets. Also undocumented, and have the
    inlets/outlets ever worked/ever been used by anybody?

-   Removed end_date and END_DATE. Both have the same value, so it
    doesn't make sense to have both variables. Also, the value is ds, which
    contains the start date of the interval, so the naming didn't make
    sense to me. However, if anybody argues in favour of adding "start_date"
    and "end_date" to provide the start and end datetime of task instance
    intervals, I'd be happy to add them.
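
For reference, the "simple one-liners" mentioned in the first bullet could look roughly like this in plain Python. This is a sketch, assuming `ds` is a YYYY-MM-DD string as in the template context; `shift_ds` is a hypothetical helper, not part of Airflow.

```python
from datetime import datetime, timedelta


def shift_ds(ds, days):
    """Shift a YYYY-MM-DD date string by a number of days."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


ds = "2019-04-08"
yesterday_ds = shift_ds(ds, -1)
tomorrow_ds = shift_ds(ds, 1)
yesterday_ds_nodash = yesterday_ds.replace("-", "")
print(yesterday_ds, tomorrow_ds, yesterday_ds_nodash)
```

Note that these values follow the calendar day only; they do not follow the DAG's schedule interval, which is the reason given above for treating them differently from next_* and previous_*.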
Cheers,
Bas


