I think if the mission of Airflow is to be a generic Workflow engine, the current semantics of execution date aren't a good default. This might be an unpopular opinion given past threads on this topic :).
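To make the two behaviours concrete, here is a minimal sketch in plain Python. This is not Airflow code: the function names are made up for illustration, and the "cron-like" mode is the hypothetical flag under discussion, not something that exists today. It assumes an hourly schedule and compares when a run for a given execution_date would start.

```python
from datetime import datetime, timedelta


def run_time_current(execution_date: datetime, interval: timedelta) -> datetime:
    """Current behaviour: execution_date marks the start of the interval,
    but the run only starts once that interval has ended."""
    return execution_date + interval


def run_time_cron_like(execution_date: datetime, interval: timedelta) -> datetime:
    """Hypothetical cron-like behaviour: the run starts at execution_date
    itself and does not wait for the interval to end."""
    return execution_date


execution_date = datetime(2019, 4, 10, 15, 0)  # the "3pm" run of an hourly DAG
hourly = timedelta(hours=1)

print(run_time_current(execution_date, hourly))    # 2019-04-10 16:00:00
print(run_time_cron_like(execution_date, hourly))  # 2019-04-10 15:00:00
```

With the current default, the run labelled 3pm does not start until 4pm, and I think that gap is what trips people up.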
The "execute at the end of the interval" semantics make sense for the ETL use case, but not for other use cases. I think cron semantics are more intuitive to users, i.e. a run's start_date should match its execution_date (although I don't have data to back this up). This is especially prevalent in ML; it's almost a rite of passage for users to get confused by execution date semantics. I think a flag to support different execution date semantics makes sense, even though supporting both would be a headache and the added complexity could lead to bugs and trickier mailing list support.

On Wed, Apr 10, 2019 at 9:00 PM Gabriel Silk <gs...@dropbox.com.invalid> wrote:

> My two cents:
>
> "execution_date" is definitely confusing to newcomers, and it's partly the
> ambiguity of the wording, and partly the UI's fault. When I first saw
> execution date, I assumed it meant *the earliest time at which the task
> will execute*, which is wrong. I was confused when no tasks appeared for
> 3pm until 4pm.
>
> My proposal to fix that:
> 1) Always show the *next* task to be executed in the UI, but explain to the
> user that it's not running because its interval has not yet completed.
> Indicate this state visually, perhaps by using some transparency or another
> color.
> 2) Instead of just showing execution date in the UI, show the low/high
> range of the time period it covers (for periodic jobs).
>
> As for what we call the low/high timestamps, I like these two options:
> - low_ts, high_ts
> - interval_start, interval_end
>
> On Wed, Apr 10, 2019 at 6:43 AM James Meickle
> <jmeic...@quantopian.com.invalid> wrote:
>
> > Strictly tying execution start to interval end doesn't work for some
> > workflows (my guess, 1-5% of them?):
> >
> > - You need to start performing tasks before the interval is over
> > - You have tasks that reference a single interval, but can't be completed
> > until several intervals later (due to data latency)
> > - The frequency you need to run the task on is different than the frequency
> > of the interval you need to process (like processing all records from the
> > last five days, every day)
> >
> > Airflow doesn't handle any of these situations gracefully and I've seen
> > people attempt all sorts of workarounds for them. Probably even more people
> > would try, if we provided decent idioms for doing it rather than those
> > workarounds.
> >
> > On Wed, Apr 10, 2019 at 9:30 AM Driesprong, Fokko <fo...@driesprong.frl>
> > wrote:
> >
> > > I see what you mean. I don't really like the `period_{start,end}` name, but
> > > something such as `interval_{start,end}` might do it for me.
> > >
> > > Personally, I think running the job after the interval closes (since then
> > > you have all the data over the interval), makes complete sense for ETL
> > > jobs. I agree it requires some time to get used to. Maybe we're lacking on
> > > documentation here.
> > >
> > > Cheers, Fokko
> > >
> > > On Wed 10 Apr 2019 at 10:08, Flo Rance <troura...@gmail.com> wrote:
> > >
> > > > I didn't expect to participate in any debate on that software, as I'm a
> > > > complete newcomer. But I'm almost forced as I am the target audience,
> > > > too.
> > > >
> > > > To answer your initial question, after reading a lot of documentation I
> > > > find the term execution_date really counterintuitive, so yes maybe
> > > > period_start and period_end might be a better naming to help to understand
> > > > how all the initial scheduling works. Because even after reading the
> > > > scheduling section of the doc and the FAQ, it was still not clear in my
> > > > mind. Btw, I find some ideas exposed by James Meickle in the [DISCUSS]
> > > > AIRFLOW-4192 very interesting and I share his opinion that there's still
> > > > room for improvement.
> > > > But a mode to change from "run at end of period, I need all the data
> > > > available for this period" (the current) to "run at _this_ time on the
> > > > schedule_interval" would be awesome.
> > > >
> > > > Regards,
> > > > Flo
> > > >
> > > > On Tue, Apr 9, 2019 at 4:41 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > > >
> > > > > Yeah, that's the other thing that has been talked about from time-to-time,
> > > > > which is a mode to change from "run at end of period, I need all the data
> > > > > available for this period" (the current) to "run at _this_ time on the
> > > > > schedule_interval, don't wait for the period to end".
> > > > >
> > > > > (No such flag exists right now, before you go looking.)
> > > > >
> > > > > > On 9 Apr 2019, at 15:31, Shaw, Damian P.
> > > > > > <damian.sha...@credit-suisse.com> wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I'm new to this Airflow Dev mailing list so I wasn't expecting to reply
> > > > > > to anything but I feel I am the target audience for this question. I am
> > > > > > quite new to airflow and have been setting up an airflow environment for
> > > > > > my business this last month.
> > > > > >
> > > > > > I find the current "execution_date" a small technical burden and a large
> > > > > > cognitive burden.
> > > > > > Our workflow is based on DAGs running at a specified time
> > > > > > in a specified timezone using the same date as the current calendar date.
> > > > > >
> > > > > > I have worked around this by creating my own macro and context
> > > > > > variables, with the logic looking like this:
> > > > > >     airflow_execution_date = context['execution_date']
> > > > > >     dag_timezone = context['dag'].timezone
> > > > > >     local_execution_date = dag_timezone.convert(airflow_execution_date)
> > > > > >     local_cal_date = local_execution_date + datetime.timedelta(days=1)
> > > > > >
> > > > > > As you can see this isn't a lot of technical effort, but having a date
> > > > > > that 1) is in the timezone the business users are working in, and 2) is
> > > > > > the same calendar date the business users are working in significantly
> > > > > > reduces the cognitive effort required to set up tasks. Of course this
> > > > > > doesn't help with cron format scheduling, which I just let the business
> > > > > > give me the requirements for and I set it up myself, as the date logic
> > > > > > there is still confusing as it doesn't work like real cron scheduling,
> > > > > > which everyone is familiar with.
> > > > > >
> > > > > > Maybe "period_start" and "period_end" might help people on Day 0 of
> > > > > > understanding Airflow get that the dates you are dealing with are not what
> > > > > > you expect, but Day 1+ there's still a lot of cognitive overhead if you
> > > > > > don't have the exact same model as AirBnb for running DAGs and tasks.
> > > > > >
> > > > > > My 2 cents anyway,
> > > > > > Damian Shaw
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > > > > > Sent: Tuesday, April 09, 2019 10:08 AM
> > > > > > To: dev@airflow.apache.org
> > > > > > Subject: [DISCUSS] period_start/period_end instead of
> > > > > > execution_date/next_execution_date
> > > > > >
> > > > > > (trying to break this out into another thread)
> > > > > >
> > > > > > The ML doesn't allow images, but I can guess that it is the deps
> > > > > > section of a task instance details screen?
> > > > > >
> > > > > > I'm not saying it's not clear once you know to look there, but I'm
> > > > > > trying to remove/reduce the confusion in the first place. And I think we
> > > > > > as committers aren't best placed to know what makes sense as we have
> > > > > > internalised how Airflow works :)
> > > > > >
> > > > > > So I guess this is a question to the newest people on the list: Would
> > > > > > `period_start` and `period_end` be more or less confusing for you when
> > > > > > you were first getting started with Airflow?
> > > > > >
> > > > > > -ash
> > > > > >
> > > > > >> On 9 Apr 2019, at 14:47, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> > > > > >>
> > > > > >> Ash,
> > > > > >>
> > > > > >> Personally, I think this is quite clear, there is a list of reasons why
> > > > > >> the job isn't being scheduled:
> > > > > >>
> > > > > >>
> > > > > >> Coming back to the question of Bas, I believe that yesterday_ds does
> > > > > >> not make sense since we cannot assume that the schedule is daily. I don't
> > > > > >> see any usage of this variable. Personally, I do use next_execution_date
> > > > > >> quite extensively. Say you have a job that runs daily, but you want to
> > > > > >> change this to an hourly job.
> > > > > >> In such a case you don't want to change
> > > > > >> {{ (execution_date + macros.timedelta(days=1)) }} to
> > > > > >> {{ (execution_date + macros.timedelta(hours=1)) }} everywhere.
> > > > > >>
> > > > > >> I'm just not sure if the aggressive deprecation of these is really worth
> > > > > >> it. I don't see too much harm in letting them stay.
> > > > > >>
> > > > > >> Cheers, Fokko
> > > > > >>
> > > > > >> On Tue 9 Apr 2019 at 12:17, Ash Berlin-Taylor <a...@apache.org> wrote:
> > > > > >> To (slightly) hijack this thread:
> > > > > >>
> > > > > >> On the subject of execution_date: as I'm sure we're all aware the
> > > > > >> concept of execution_date is confusing to newcomers to Airflow (there are
> > > > > >> many questions about "why hasn't my DAG run yet"? "Why is my dag a day
> > > > > >> behind?" etc.) and although we mention this in the docs it's a confusing
> > > > > >> concept.
> > > > > >>
> > > > > >> What do people think about adding two new parameters: `period_start`
> > > > > >> and `period_end` and making these the preferred terms in place of
> > > > > >> execution_date and next_execution_date?
> > > > > >>
> > > > > >> This hopefully avoids any ambiguous terms like "execution" or "run"
> > > > > >> which are understandably easy to conflate with the time the task is being
> > > > > >> run (i.e. `now()`)
> > > > > >>
> > > > > >> If people think this naming is better and less confusing I would
> > > > > >> suggest we update all the docs and examples to use these terms (but still
> > > > > >> mention the old names somewhere, probably in the macros docs)
> > > > > >>
> > > > > >> Thoughts?
> > > > > >>
> > > > > >> -ash
> > > > > >>
> > > > > >>
> > > > > >>> On 8 Apr 2019, at 16:20, Arthur Wiedmer <arthur.wied...@gmail.com> wrote:
> > > > > >>>
> > > > > >>> Hi Bas,
> > > > > >>>
> > > > > >>> 1) I am aware of a few places where those parameters are used in production
> > > > > >>> in a few hundred jobs. I highly recommend we don't deprecate them unless we
> > > > > >>> do it in a major version.
> > > > > >>>
> > > > > >>> 2) As James mentioned, inlets and outlets are a lineage annotation feature
> > > > > >>> which is still under development. Let's leave them in, but we can guard
> > > > > >>> them behind a feature flag if you prefer.
> > > > > >>>
> > > > > >>> 3) The yesterday*/tomorrow* params are convenience ones if you use a daily
> > > > > >>> ETL. I agree with you that they are simple to compute, but not everyone
> > > > > >>> using Apache Airflow is amazing with Python. Some users are only trying to
> > > > > >>> get a query scheduled and rely on a couple of niceties like these to get by.
> > > > > >>>
> > > > > >>> 4) latest_date, end_date (I feel like there used to be start_date, but
> > > > > >>> maybe it got lost) were a blend of things which were used by a backfill
> > > > > >>> framework used internally at Airbnb. Latest date was used if you needed to
> > > > > >>> join to a dimension for which you only wanted the latest version of the
> > > > > >>> attributes in your backfill. end_date was used for time ranges where several
> > > > > >>> days were processed together in a range to save on compute. I don't see an
> > > > > >>> issue with removing them.
> > > > > >>>
> > > > > >>> Best regards,
> > > > > >>> Arthur
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak <basharens...@godatadriven.com>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> Hi all,
> > > > > >>>>
> > > > > >>>> Following Tao Feng’s question to discuss this PR
> > > > > >>>> <https://github.com/apache/airflow/pull/5010> (AIRFLOW-4192
> > > > > >>>> <https://issues.apache.org/jira/browse/AIRFLOW-4192>), please discuss here
> > > > > >>>> if you agree/disagree/would change.
> > > > > >>>>
> > > > > >>>> -----------
> > > > > >>>>
> > > > > >>>> The summary of the PR:
> > > > > >>>>
> > > > > >>>> I was confused by the task context values and suggest cleaning up and
> > > > > >>>> clarifying these variables. Some are derivations from other variables, some
> > > > > >>>> are undocumented and unused, some are wrong (name doesn’t match the value).
> > > > > >>>> Please discuss what you think of the removal of these variables:
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> * Removed yesterday_ds, yesterday_ds_nodash, tomorrow_ds,
> > > > > >>>> tomorrow_ds_nodash. IMO the next_* and previous_* variables are useful
> > > > > >>>> since these require complex logic to compute the next execution date,
> > > > > >>>> however I would leave computing the yesterday* and tomorrow* variables up
> > > > > >>>> to the user since they are simple one-liners and don't relate to the DAG
> > > > > >>>> interval.
> > > > > >>>> * Removed tables. This is a field in params, and is thus also
> > > > > >>>> accessible by the user ({{ params.tables }}). Also, it was undocumented.
> > > > > >>>> * Removed latest_date.
> > > > > >>>> It's the same as ds and was also undocumented.
> > > > > >>>> * Removed inlets and outlets. Also undocumented, and have the
> > > > > >>>> inlets/outlets ever worked/ever been used by anybody?
> > > > > >>>> * Removed end_date and END_DATE. Both have the same value, so it
> > > > > >>>> doesn't make sense to have both variables. Also, the value is ds which
> > > > > >>>> contains the start date of the interval, so the naming didn't make sense to
> > > > > >>>> me. However, if anybody argues in favour of adding "start_date" and
> > > > > >>>> "end_date" to provide the start and end datetime of task instance
> > > > > >>>> intervals, I'd be happy to add them.
> > > > > >>>>
> > > > > >>>> Cheers,
> > > > > >>>> Bas
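
P.S. On the yesterday*/tomorrow* variables in the quoted PR summary: below is a minimal sketch of the kind of one-liners a user would write instead. It is plain Python rather than an Airflow template, and the execution_date here is just a stand-in value I picked, not something read from a real task context. It also shows why a fixed one-day offset does not track the schedule for an hourly DAG.

```python
from datetime import datetime, timedelta

# Stand-in for the execution_date a task would receive in its context.
execution_date = datetime(2019, 4, 8)

# Fixed one-day offsets, independent of the DAG's schedule.
yesterday_ds = (execution_date - timedelta(days=1)).strftime("%Y-%m-%d")
tomorrow_ds = (execution_date + timedelta(days=1)).strftime("%Y-%m-%d")
tomorrow_ds_nodash = tomorrow_ds.replace("-", "")

# For an hourly DAG the next interval starts one hour later, so a hardcoded
# days=1 offset is not a substitute for next_execution_date.
next_start_hourly = execution_date + timedelta(hours=1)

print(yesterday_ds, tomorrow_ds, tomorrow_ds_nodash)  # 2019-04-07 2019-04-09 20190409
print(next_start_hourly)                              # 2019-04-08 01:00:00
```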