Strictly tying execution start to interval end doesn't work for some
workflows (my guess, 1-5% of them?):

- You need to start performing tasks before the interval is over
- You have tasks that reference a single interval, but can't be completed
until several intervals later (due to data latency)
- The frequency at which you need to run the task is different from the
frequency of the interval you need to process (like processing all records
from the last five days, every day)

Airflow doesn't handle any of these situations gracefully, and I've seen
people attempt all sorts of workarounds for them. Probably even more people
would try if we provided decent idioms for these cases rather than leaving
them to workarounds.
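
For the third case, the workaround I usually see is a daily DAG whose task
just widens its own window by hand (a rough sketch; the DAG and command
names here are made up):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Runs once a day, but each run deliberately reads back five days of
    # records, because the processing window is wider than the schedule's
    # one-day interval.
    dag = DAG(
        dag_id="reprocess_last_five_days",   # made-up name
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    reprocess = BashOperator(
        task_id="reprocess",
        bash_command=(
            "process_records "               # made-up command
            "--from {{ macros.ds_add(ds, -5) }} "
            "--to {{ ds }}"
        ),
        dag=dag,
    )

It works, but nothing in the DAG declares that the real dependency window
is five days, which is exactly the kind of idiom gap I mean.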

On Wed, Apr 10, 2019 at 9:30 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:

> I see what you mean. I don't really like the `period_{start,end}` name, but
> something such as `interval_{start,end}` might do it for me.
>
> Personally, I think running the job after the interval closes (since then
> you have all the data over the interval) makes complete sense for ETL
> jobs. I agree it requires some time to get used to. Maybe we're lacking
> documentation here.
>
> Cheers, Fokko
>
> On Wed, Apr 10, 2019 at 10:08 AM Flo Rance <troura...@gmail.com> wrote:
>
> > I didn't expect to participate in any debate on that software, as I'm a
> > complete newcomer. But I'm almost forced to, as I am the target
> > audience, too.
> >
> > To answer your initial question, after reading a lot of documentation I
> > find the term execution_date really counterintuitive, so yes, maybe
> > period_start and period_end might be a better naming to help understand
> > how all the initial scheduling works. Because even after reading the
> > scheduling section of the doc and the FAQ, it was still not clear in my
> > mind. Btw, I find some ideas exposed by James Meickle in the [DISCUSS]
> > AIRFLOW-4192 thread very interesting and I share his opinion that
> > there's still room for improvement.
> > But a mode to change from "run at end of period, I need all the data
> > available for this period" (the current) to "run at _this_ time on the
> > schedule_interval" would be awesome.
> >
> > Regards,
> > Flo
> >
> > On Tue, Apr 9, 2019 at 4:41 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> > > Yeah, that's the other thing that has been talked about from time to
> > > time, which is a mode to change from "run at end of period, I need
> > > all the data available for this period" (the current) to "run at
> > > _this_ time on the schedule_interval, don't wait for the period to
> > > end".
> > >
> > > (No such flag exists right now, before you go looking.)
> > >
> > > > On 9 Apr 2019, at 15:31, Shaw, Damian P. <damian.sha...@credit-suisse.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I'm new to this Airflow Dev mailing list so I wasn't expecting to
> > > > reply to anything, but I feel I am the target audience for this
> > > > question. I am quite new to Airflow and have been setting up an
> > > > Airflow environment for my business this last month.
> > > >
> > > > I find the current "execution_date" a small technical burden and a
> > > > large cognitive burden. Our workflow is based on DAGs running at a
> > > > specified time in a specified timezone using the same date as the
> > > > current calendar date.
> > > >
> > > > I have worked around this by creating my own macro and context
> > > > variables, with the logic looking like this:
> > > >        airflow_execution_date = context['execution_date']
> > > >        dag_timezone = context['dag'].timezone
> > > >        local_execution_date = dag_timezone.convert(airflow_execution_date)
> > > >        local_cal_date = local_execution_date + datetime.timedelta(days=1)
> > > >
> > > > As you can see this isn't a lot of technical effort, but having a
> > > > date that 1) is in the timezone the business users are working in,
> > > > and 2) is the same calendar date the business users are working
> > > > with, significantly reduces the cognitive effort required to set up
> > > > tasks. Of course this doesn't help with cron-format scheduling,
> > > > which I just let the business give me the requirements for and set
> > > > up myself, as the date logic there is still confusing: it doesn't
> > > > work like real cron scheduling, which everyone is familiar with.
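> > > >
> > > > For completeness, a rough sketch of how a helper along these lines
> > > > might be wired into a DAG via user_defined_macros (the helper and
> > > > DAG names below are made up):
> > > >
> > > >     import datetime
> > > >
> > > >     import pendulum
> > > >     from airflow import DAG
> > > >
> > > >     def local_cal_date(execution_date, dag):
> > > >         # Same logic as above: convert to the DAG's timezone, then
> > > >         # shift forward one day to match the business calendar date.
> > > >         local_execution_date = dag.timezone.convert(execution_date)
> > > >         return local_execution_date + datetime.timedelta(days=1)
> > > >
> > > >     dag = DAG(
> > > >         dag_id="business_calendar_example",  # made-up name
> > > >         start_date=datetime.datetime(
> > > >             2019, 1, 1, tzinfo=pendulum.timezone("America/New_York")
> > > >         ),
> > > >         schedule_interval="@daily",
> > > >         # Exposes {{ local_cal_date(execution_date, dag) }} in templates.
> > > >         user_defined_macros={"local_cal_date": local_cal_date},
> > > >     )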
> > > >
> > > > Maybe "period_start" and "period_end" might help people on Day 0 of
> > > understanding Airflow get that the dates you are dealing with are not
> > what
> > > you expect, but Day 1+ there's still a lot of cognitive overhead if you
> > > don't have the exact same model as AirBnb for running DAGs and tasks.
> > > >
> > > > My 2 cents anyway,
> > > > Damian Shaw
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > > > Sent: Tuesday, April 09, 2019 10:08 AM
> > > > To: dev@airflow.apache.org
> > > > Subject: [DISCUSS] period_start/period_end instead of
> > > > execution_date/next_execution_date
> > > >
> > > > (trying to break this out into another thread)
> > > >
> > > > The ML doesn't allow images, but I can guess that it is the deps
> > > > section of a task instance details screen?
> > > >
> > > > I'm not saying it's not clear once you know to look there, but I'm
> > > > trying to remove/reduce the confusion in the first place. And I
> > > > think we as committers aren't best placed to know what makes sense,
> > > > as we have internalised how Airflow works :)
> > > >
> > > > So I guess this is a question to the newest people on the list:
> > > > Would `period_start` and `period_end` be more or less confusing for
> > > > you when you were first getting started with Airflow?
> > > >
> > > > -ash
> > > >
> > > >> On 9 Apr 2019, at 14:47, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> > > >>
> > > >> Ash,
> > > >>
> > > >> Personally, I think this is quite clear; there is a list of
> > > >> reasons why the job isn't being scheduled:
> > > >>
> > > >>
> > > >> Coming back to the question of Bas, I believe that yesterday_ds
> > > >> does not make sense since we cannot assume that the schedule is
> > > >> daily. I don't see any usage of this variable. Personally, I do use
> > > >> next_execution_date quite extensively, for example when you have a
> > > >> job that runs daily but you want to change it to an hourly job. In
> > > >> such a case you don't want to change
> > > >> {{ (execution_date + macros.timedelta(days=1)) }} to
> > > >> {{ (execution_date + macros.timedelta(hours=1)) }} everywhere.
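> > > >>
> > > >> For example (a rough sketch; the command name is made up), a task
> > > >> templated against the interval boundaries needs no change at all
> > > >> when the schedule_interval switches from daily to hourly:
> > > >>
> > > >>     bash_command = (
> > > >>         "load_events "                      # made-up command
> > > >>         "--from '{{ execution_date }}' "    # start of the interval
> > > >>         "--to '{{ next_execution_date }}'"  # end, whatever the length
> > > >>     )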
> > > >>
> > > >> I'm just not sure if the aggressive deprecation of these variables
> > > >> is really worth it. I don't see too much harm in letting them stay.
> > > >>
> > > >> Cheers, Fokko
> > > >>
> > > >> On Tue, Apr 9, 2019 at 12:17 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > > >> To (slightly) hijack this thread:
> > > >>
> > > >> On the subject of execution_date: as I'm sure we're all aware, the
> > > >> concept of execution_date is confusing to newcomers to Airflow
> > > >> (there are many questions about "Why hasn't my DAG run yet?", "Why
> > > >> is my DAG a day behind?" etc.) and although we mention this in the
> > > >> docs it's a confusing concept.
> > > >>
> > > >> What do people think about adding two new parameters,
> > > >> `period_start` and `period_end`, and making these the preferred
> > > >> terms in place of execution_date and next_execution_date?
> > > >>
> > > >> This hopefully avoids any ambiguous terms like "execution" or
> > > >> "run", which are understandably easy to conflate with the time the
> > > >> task is being run (i.e. `now()`).
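> > > >>
> > > >> As a purely illustrative sketch (the query is made up, and
> > > >> period_start/period_end exist only as a proposal here), the rename
> > > >> would turn a template like
> > > >>
> > > >>     sql = ("SELECT * FROM events "
> > > >>            "WHERE ts >= '{{ execution_date }}' "
> > > >>            "AND ts < '{{ next_execution_date }}'")
> > > >>
> > > >> into
> > > >>
> > > >>     sql = ("SELECT * FROM events "
> > > >>            "WHERE ts >= '{{ period_start }}' "
> > > >>            "AND ts < '{{ period_end }}'")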
> > > >>
> > > >> If people think this naming is better and less confusing, I would
> > > >> suggest we update all the docs and examples to use these terms (but
> > > >> still mention the old names somewhere, probably in the macros docs).
> > > >>
> > > >> Thoughts?
> > > >>
> > > >> -ash
> > > >>
> > > >>
> > > >>> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wied...@gmail.com> wrote:
> > > >>>
> > > >>> Hi Bas,
> > > >>>
> > > >>> 1) I am aware of a few places where those parameters are used in
> > > >>> production in a few hundred jobs. I highly recommend we don't
> > > >>> deprecate them unless we do it in a major version.
> > > >>>
> > > >>> 2) As James mentioned, inlets and outlets are a lineage annotation
> > > >>> feature which is still under development. Let's leave them in, but
> > > >>> we can guard them behind a feature flag if you prefer.
> > > >>>
> > > >>> 3) The yesterday*/tomorrow* params are convenience ones if you run
> > > >>> a daily ETL. I agree with you that they are simple to compute, but
> > > >>> not everyone using Apache Airflow is amazing with Python. Some
> > > >>> users are only trying to get a query scheduled and rely on a
> > > >>> couple of niceties like these to get by.
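> > > >>>
> > > >>> For instance (a rough sketch; the table name is made up), on a
> > > >>> daily schedule the convenience variable lets them write
> > > >>>
> > > >>>     sql = "DELETE FROM staging WHERE ds = '{{ yesterday_ds }}'"
> > > >>>
> > > >>> instead of the equivalent
> > > >>>
> > > >>>     sql = "DELETE FROM staging WHERE ds = '{{ macros.ds_add(ds, -1) }}'"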
> > > >>>
> > > >>> 4) latest_date, end_date (I feel like there used to be start_date,
> > > >>> but maybe it got lost) were a blend of things which were used by a
> > > >>> backfill framework used internally at Airbnb. latest_date was used
> > > >>> if you needed to join to a dimension for which you only wanted the
> > > >>> latest version of the attributes in your backfill. end_date was
> > > >>> used for time ranges where several days were processed together in
> > > >>> a range to save on compute. I don't see an issue with removing
> > > >>> them.
> > > >>>
> > > >>> Best regards,
> > > >>> Arthur
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <basharens...@godatadriven.com> wrote:
> > > >>>
> > > >>>> Hi all,
> > > >>>>
> > > >>>> Following Tao Feng’s question to discuss this PR
> > > >>>> (https://github.com/apache/airflow/pull/5010, AIRFLOW-4192:
> > > >>>> https://issues.apache.org/jira/browse/AIRFLOW-4192), please
> > > >>>> discuss here if you agree/disagree/would change.
> > > >>>>
> > > >>>> -----------
> > > >>>>
> > > >>>> The summary of the PR:
> > > >>>>
> > > >>>> I was confused by the task context values and suggest cleaning up
> > > >>>> and clarifying these variables. Some are derivations from other
> > > >>>> variables, some are undocumented and unused, some are wrong (the
> > > >>>> name doesn’t match the value). Please discuss what you think of
> > > >>>> the removal of these variables:
> > > >>>>
> > > >>>>
> > > >>>> *   Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> > > >>>> tomorrow_ds_nodash. IMO the next_* and previous_* variables are
> > > >>>> useful since these require complex logic to compute the next
> > > >>>> execution date; however, I would leave computing the yesterday*
> > > >>>> and tomorrow* variables up to the user since they are simple
> > > >>>> one-liners and don't relate to the DAG interval.
> > > >>>> *   Removed tables. This is a field in params, and is thus also
> > > >>>> accessible by the user ({{ params.tables }}). Also, it was
> > > >>>> undocumented.
> > > >>>> *   Removed latest_date. It's the same as ds and was also
> > > >>>> undocumented.
> > > >>>> *   Removed inlets and outlets. Also undocumented, and have the
> > > >>>> inlets/outlets ever worked or ever been used by anybody?
> > > >>>> *   Removed end_date and END_DATE. Both have the same value, so it
> > > >>>> doesn't make sense to have both variables. Also, the value is ds,
> > > >>>> which contains the start date of the interval, so the naming
> > > >>>> didn't make sense to me. However, if anybody argues in favour of
> > > >>>> adding "start_date" and "end_date" to provide the start and end
> > > >>>> datetime of task instance intervals, I'd be happy to add them.
> > > >>>>
> > > >>>> Cheers,
> > > >>>> Bas
> > > >>>>
> > > >>
> > > >
> > > >
> > > >
> > > >
> > >
> >
> > >
> > >
> > >
> >
>
