Personally I would be very interested in working on a flexible schedule
window/window projection patch. But it would be a big undertaking so it
doesn't make sense to start it unless there's a lot of community buy-in to
the idea that we aren't just for day-after ETL systems.
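
For reference, the "day-after" semantics debated in this thread can be
sketched in a few lines of plain Python. This is only an illustration under
stated assumptions, not Airflow code: schedule_window is a hypothetical
helper, the interval is assumed to have a fixed length, and execution_date is
taken to mark the start of the interval, with the run becoming eligible only
once the interval has ended.

```python
from datetime import datetime, timedelta


def schedule_window(execution_date, interval):
    """Return (interval_start, interval_end) for one periodic run.

    Hypothetical helper, not an Airflow API. Assumes a fixed-length
    interval where execution_date labels the START of the interval and
    the run may only begin once the interval has ended.
    """
    interval_start = execution_date
    interval_end = execution_date + interval
    return interval_start, interval_end


# An hourly run labelled 3pm covers 3pm to 4pm, so it cannot start before 4pm.
start, end = schedule_window(datetime(2019, 4, 10, 15, 0), timedelta(hours=1))
print("covers", start, "to", end, "and starts no earlier than", end)

# A daily run labelled April 9 is the one that starts on April 10.
start, end = schedule_window(datetime(2019, 4, 9), timedelta(days=1))
print("covers", start, "to", end, "and starts no earlier than", end)
```

The gap between the label (interval start) and the earliest start time
(interval end) is what the "why is my DAG a day behind?" questions quoted
below run into.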

On Mon, Apr 15, 2019 at 8:52 AM airflowuser
<airflowu...@protonmail.com.invalid> wrote:

> To quote my user-experience professor from ages ago:
> "If too many people misuse something you wrote it means that YOU are doing
> something wrong".
>
> Something can be well documented but if it's not intuitive it's likely
> that people will get it wrong.
>
> Say someone asks, "When did you execute the code?" Your answer will be
> direct: the time the code started to run. This is why so many people
> misunderstand execution_date in Airflow's terms. Airflow took a
> word that is well defined in our minds and replaced its meaning.
>
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, April 15, 2019 3:35 PM, Dan Davydov
> <ddavy...@twitter.com.INVALID> wrote:
>
> > I think if the mission of Airflow is to be a generic workflow engine, the
> > current semantics of execution date aren't a good default. This might be an
> > unpopular opinion given past threads on this topic :).
> >
> > The execution_date = end_date semantics make sense for the ETL use case but
> > not for other use cases. I think cron syntax is more intuitive to users,
> > i.e. start_date should match execution_date (although I don't have data to
> > back this up). This is especially prevalent in ML, where it's almost a rite
> > of passage for users to get confused by execution date semantics. I think a
> > flag to support different execution date semantics makes sense, even at the
> > cost of the headache of supporting both, and even though the added
> > complexity could lead to bugs and trickier mailing list support.
> >
> > On Wed, Apr 10, 2019 at 9:00 PM Gabriel Silk gs...@dropbox.com.invalid
> > wrote:
> >
> > > My two cents:
> > > "execution_date" is definitely confusing to newcomers, and it's partly the
> > > ambiguity of the wording, and partly the UI's fault. When I first saw
> > > execution date, I assumed it meant the earliest time at which the task
> > > will execute, which is wrong. I was confused when no tasks appeared for
> > > 3pm until 4pm.
> > > My proposal to fix that:
> > >
> > > 1.  Always show the next task to be executed in the UI, but explain to the
> > >     user that it's not running because its interval has not yet completed.
> > >     Indicate this state visually, perhaps by using some transparency or
> > >     another color.
> > >
> > > 2.  Instead of just showing execution date in the UI, show the low/high
> > >     range of the time period it covers (for periodic jobs).
> > >
> > >
> > > As for what we call the low/high timestamps, I like these two options:
> > >
> > > -   low_ts, high_ts
> > > -   interval_start, interval_end
> > >
> > > On Wed, Apr 10, 2019 at 6:43 AM James Meickle
> > > jmeic...@quantopian.com.invalid wrote:
> > >
> > > > Strictly tying execution start to interval end doesn't work for some
> > > > workflows (my guess, 1-5% of them?):
> > > >
> > > > -   You need to start performing tasks before the interval is over
> > > > -   You have tasks that reference a single interval, but can't be
> > > >     completed until several intervals later (due to data latency)
> > > > -   The frequency you need to run the task on is different from the
> > > >     frequency of the interval you need to process (like processing all
> > > >     records from the last five days, every day)
> > > >
> > > > Airflow doesn't handle any of these situations gracefully and I've seen
> > > > people attempt all sorts of workarounds for them. Probably even more
> > > > people would try, if we provided decent idioms for doing it rather than
> > > > those workarounds.
> > > >
> > > > On Wed, Apr 10, 2019 at 9:30 AM Driesprong, Fokko fo...@driesprong.frl
> > > > wrote:
> > > >
> > > > > I see what you mean. I don't really like the `period_{start,end}` name,
> > > > > but something such as `interval_{start,end}` might do it for me.
> > > > > Personally, I think running the job after the interval closes (since
> > > > > then you have all the data over the interval) makes complete sense for
> > > > > ETL jobs. I agree it requires some time to get used to. Maybe we're
> > > > > lacking on documentation here.
> > > > > Cheers, Fokko
> > > > > On Wed, 10 Apr 2019 at 10:08, Flo Rance troura...@gmail.com wrote:
> > > > >
> > > > > > I didn't expect to participate in any debate on this software, as
> > > > > > I'm a complete newcomer. But I'm almost forced to, as I am the
> > > > > > target audience, too.
> > > > > > To answer your initial question, after reading a lot of
> > > > > > documentation I find the term execution_date really
> > > > > > counterintuitive, so yes, maybe period_start and period_end might
> > > > > > be better names to help understand how all the initial scheduling
> > > > > > works. Even after reading the scheduling section of the docs and
> > > > > > the FAQ, it was still not clear in my mind. Btw, I find some ideas
> > > > > > put forward by James Meickle in the [DISCUSS] AIRFLOW-4192 thread
> > > > > > very interesting and I share his opinion that there's still room
> > > > > > for improvement.
> > > > > > But a mode to change from "run at end of period, I need all the
> > > > > > data available for this period" (the current) to "run at this time
> > > > > > on the schedule_interval" would be awesome.
> > > > > > Regards,
> > > > > > Flo
> > > > > > On Tue, Apr 9, 2019 at 4:41 PM Ash Berlin-Taylor a...@apache.org
> > > > > > wrote:
> > > > >
> > > > > > > Yeah, that's the other thing that has been talked about from
> > > > > > > time to time, which is a mode to change from "run at end of
> > > > > > > period, I need all the data available for this period" (the
> > > > > > > current) to "run at this time on the schedule_interval, don't
> > > > > > > wait for the period to end".
> > > > > > > (No such flag exists right now, before you go looking.)
> > > > > > >
> > > > > > > > On 9 Apr 2019, at 15:31, Shaw, Damian P. <
> > > > > > > > damian.sha...@credit-suisse.com> wrote:
> > > > > > > > Hi all,
> > > > > > > > I'm new to this Airflow Dev mailing list so I wasn't expecting to
> > > > > > > > reply to anything, but I feel I am the target audience for this
> > > > > > > > question. I am quite new to Airflow and have been setting up an
> > > > > > > > Airflow environment for my business this last month.
> > > > > > > >
> > > > > > > > I find the current "execution_date" a small technical burden and
> > > > > > > > a large cognitive burden. Our workflow is based on DAGs running
> > > > > > > > at a specified time in a specified timezone using the same date
> > > > > > > > as the current calendar date.
> > > > > >
> > > > > > > > I have worked around this by creating my own macro and context
> > > > > > > > variables, with the logic looking like this:
> > > > > > > > airflow_execution_date = context['execution_date']
> > > > > > > > dag_timezone = context['dag'].timezone
> > > > > > > > local_execution_date = dag_timezone.convert(airflow_execution_date)
> > > > > > > > local_cal_date = local_execution_date + datetime.timedelta(days=1)
> > > > > > >
> > > > > > > > As you can see this isn't a lot of technical effort, but having a
> > > > > > > > date that 1) is in the timezone the business users are working
> > > > > > > > in, and 2) is the same calendar date the business users are
> > > > > > > > working in significantly reduces the cognitive effort required to
> > > > > > > > set up tasks. Of course this doesn't help with cron-format
> > > > > > > > scheduling, for which I just let the business give me the
> > > > > > > > requirements and set it up myself, as the date logic there is
> > > > > > > > still confusing: it doesn't work like real cron scheduling, which
> > > > > > > > everyone is familiar with.
> > > > > > >
> > > > > > > > Maybe "period_start" and "period_end" might help people on Day 0
> > > > > > > > of understanding Airflow get that the dates you are dealing with
> > > > > > > > are not what you expect, but Day 1+ there's still a lot of
> > > > > > > > cognitive overhead if you don't have the exact same model as
> > > > > > > > Airbnb for running DAGs and tasks.
> > > > > > > >
> > > > > > > > My 2 cents anyway,
> > > > > > > > Damian Shaw
> > > > > > > > -----Original Message-----
> > > > > > > > From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > > > > > > > Sent: Tuesday, April 09, 2019 10:08 AM
> > > > > > > > To: dev@airflow.apache.org
> > > > > > > > Subject: [DISCUSS] period_start/period_end instead of
> > > > > > > > execution_date/next_execution_date
> > > > > > > > (trying to break this out in to another thread)
> > > > > > > > The ML doesn't allow images, but I can guess that it is the deps
> > > > > > > > section of a task instance details screen?
> > > > > > > > I'm not saying it's not clear once you know to look there, but
> > > > > > > > I'm trying to remove/reduce the confusion in the first place. And
> > > > > > > > I think we as committers aren't best placed to know what makes
> > > > > > > > sense as we have internalised how Airflow works :)
> > > > > > > >
> > > > > > > > So I guess this is a question to the newest people on the list:
> > > > > > > > Would `period_start` and `period_end` be more or less confusing
> > > > > > > > for you when you were first getting started with Airflow?
> > > > > > > >
> > > > > > > > -ash
> > > > > > > >
> > > > > > > > > On 9 Apr 2019, at 14:47, Driesprong, Fokko
> > > > > > > > > <fo...@driesprong.frl> wrote:
> > > > > > > > >
> > > > > > > > > Ash,
> > > > > > > > > Personally, I think this is quite clear, there is a list of
> > > > > > > > > reasons why the job isn't being scheduled:
> > > > > > > > >
> > > > > > > > > Coming back to the question of Bas, I believe that yesterday_ds
> > > > > > > > > does not make sense since we cannot assume that the schedule is
> > > > > > > > > daily. I don't see any usage of this variable. Personally, I do
> > > > > > > > > use next_execution_date quite extensively. Say you have a job
> > > > > > > > > that runs daily, but you want to change it to an hourly job. In
> > > > > > > > > such a case you don't want to change
> > > > > > > > > {{ (execution_date + macros.timedelta(days=1)) }} to
> > > > > > > > > {{ (execution_date - macros.timedelta(hours=1)) }} everywhere.
> > > > > > > > >
> > > > > > > > > I'm just not sure if the aggressive deprecation is really worth
> > > > > > > > > it. I don't see too much harm in letting them stay.
> > > > > > > > >
> > > > > > > > > Cheers, Fokko
> > > > > > > > > On Tue, 9 Apr 2019 at 12:17, Ash Berlin-Taylor
> > > > > > > > > <a...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > To (slightly) hijack this thread:
> > > > > > > > > On the subject of execution_date: as I'm sure we're all aware,
> > > > > > > > > the concept of execution_date is confusing to newcomers to
> > > > > > > > > Airflow (there are many questions about "why hasn't my DAG run
> > > > > > > > > yet?" "Why is my DAG a day behind?" etc.) and although we
> > > > > > > > > mention this in the docs it's a confusing concept.
> > > > > > > > >
> > > > > > > > > What do people think about adding two new parameters,
> > > > > > > > > `period_start` and `period_end`, and making these the preferred
> > > > > > > > > terms in place of execution_date and next_execution_date?
> > > > > > > > >
> > > > > > > > > This hopefully avoids any ambiguous terms like "execution" or
> > > > > > > > > "run", which are understandably easy to conflate with the time
> > > > > > > > > the task is being run (i.e. `now()`).
> > > > > > > > >
> > > > > > > > > If people think this naming is better and less confusing I
> > > > > > > > > would suggest we update all the docs and examples to use these
> > > > > > > > > terms (but still mention the old names somewhere, probably in
> > > > > > > > > the macros docs).
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > > -ash
> > > > > > > > >
> > > > > > > > > > On 8 Apr 2019, at 16:20, Arthur Wiedmer
> > > > > > > > > > <arthur.wied...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Bas,
> > > > > > > > > >
> > > > > > > > > > 1.  I am aware of a few places where those parameters are
> > > > > > > > > >     used in production in a few hundred jobs. I highly
> > > > > > > > > >     recommend we don't deprecate them unless we do it in a
> > > > > > > > > >     major version.
> > > > > > > > > >
> > > > > > > > > > 2.  As James mentioned, inlets and outlets are a lineage
> > > > > > > > > >     annotation feature which is still under development.
> > > > > > > > > >     Let's leave them in, but we can guard them behind a
> > > > > > > > > >     feature flag if you prefer.
> > > > > > > > > >
> > > > > > > > > > 3.  The yesterday*/tomorrow* params are convenience ones if
> > > > > > > > > >     you use a daily ETL. I agree with you that they are
> > > > > > > > > >     simple to compute, but not everyone using Apache Airflow
> > > > > > > > > >     is amazing with Python. Some users are only trying to get
> > > > > > > > > >     a query scheduled and rely on a couple of niceties like
> > > > > > > > > >     these to get by.
> > > > > > > > > >
> > > > > > > > > > 4.  latest_date, end_date (I feel like there used to be
> > > > > > > > > >     start_date, but maybe it got lost) were a blend of things
> > > > > > > > > >     which were used by a backfill framework used internally
> > > > > > > > > >     at Airbnb. Latest date was used if you needed to join to
> > > > > > > > > >     a dimension for which you only wanted the latest version
> > > > > > > > > >     of the attributes in your backfill. end_date was used for
> > > > > > > > > >     time ranges where several days were processed together in
> > > > > > > > > >     a range to save on compute. I don't see an issue with
> > > > > > > > > >     removing them.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Arthur
> > > > > > > > > > On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak
> > > > > > > > > > <basharens...@godatadriven.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > > Following Tao Feng’s question to discuss this PR
> > > > > > > > > > > <https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192
> > > > > > > > > > > <https://issues.apache.org/jira/browse/AIRFLOW-4192>),
> > > > > > > > > > > please discuss here if you agree/disagree/would change.
> > > > > > > > > > >
> > > > > > > > > > > The summary of the PR:
> > > > > > > > > > > I was confused by the task context values and suggest
> > > > > > > > > > > cleaning up and clarifying these variables. Some are
> > > > > > > > > > > derivations from other variables, some are undocumented and
> > > > > > > > > > > unused, some are wrong (name doesn’t match the value).
> > > > > > > > > > >
> > > > > > > > > > > Please discuss what you think of the removal of these
> > > > > > > > > > > variables:
> > > > > > > > > > >
> > > > > > > > > > > -   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> > > > > > > > > > >     tomorrow_ds_nodash. IMO the next_* and previous_*
> > > > > > > > > > >     variables are useful since these require complex logic
> > > > > > > > > > >     to compute the next execution date, however I would
> > > > > > > > > > >     leave computing the yesterday* and tomorrow* variables
> > > > > > > > > > >     up to the user since they are simple one-liners and
> > > > > > > > > > >     don't relate to the DAG interval.
> > > > > > > > > > >
> > > > > > > > > > > -   Removed tables. This is a field in params, and is thus
> > > > > > > > > > >     also accessible by the user ({{ params.tables }}). Also,
> > > > > > > > > > >     it was undocumented.
> > > > > > > > > > >
> > > > > > > > > > > -   Removed latest_date. It's the same as ds and was also
> > > > > > > > > > >     undocumented.
> > > > > > > > > > >
> > > > > > > > > > > -   Removed inlets and outlets. Also undocumented, and have
> > > > > > > > > > >     the inlets/outlets ever worked/ever been used by
> > > > > > > > > > >     anybody?
> > > > > > > > > > >
> > > > > > > > > > > -   Removed end_date and END_DATE. Both have the same value,
> > > > > > > > > > >     so it doesn't make sense to have both variables. Also,
> > > > > > > > > > >     the value is ds, which contains the start date of the
> > > > > > > > > > >     interval, so the naming didn't make sense to me.
> > > > > > > > > > >     However, if anybody argues in favour of adding
> > > > > > > > > > >     "start_date" and "end_date" to provide the start and end
> > > > > > > > > > >     datetime of task instance intervals, I'd be happy to add
> > > > > > > > > > >     them.
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Bas
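
The workaround Damian Shaw describes earlier in the thread (convert
execution_date into the DAG's timezone, then add one day to get the calendar
date the business users see) can be sketched without Airflow as follows. This
is a minimal sketch under assumptions that are not in the original message:
local_calendar_date is a hypothetical name, the DAG is assumed to run daily,
and a fixed UTC+8 offset stands in for the business timezone, so daylight
saving changes are ignored.

```python
from datetime import datetime, timedelta, timezone


def local_calendar_date(execution_date, dag_timezone):
    """Map an execution_date to the local calendar date of the actual run.

    Hypothetical helper mirroring the workaround quoted in the thread.
    Assumes a daily schedule, so the run happens one day after the
    execution_date (the start of the interval).
    """
    local_execution_date = execution_date.astimezone(dag_timezone)
    return (local_execution_date + timedelta(days=1)).date()


# Assumed example: a daily DAG scheduled for 06:00 in a UTC+8 timezone.
business_tz = timezone(timedelta(hours=8))

# 22:00 UTC on April 7 is 06:00 on April 8 in UTC+8; the run for that
# interval actually happens at 06:00 local time on April 9.
execution_date = datetime(2019, 4, 7, 22, 0, tzinfo=timezone.utc)
print(local_calendar_date(execution_date, business_tz))
```

Adding a fixed one-day offset only works for daily schedules, which is the
same limitation raised in the thread about the yesterday*/tomorrow*
variables.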
