Personally, I would be very interested in working on a flexible schedule window/window projection patch. But it would be a big undertaking, so it doesn't make sense to start unless there's a lot of community buy-in to the idea that Airflow isn't just for day-after ETL systems.
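
To make the idea a bit more concrete, here is a rough sketch of the two notions being discussed. It is not Airflow code: `current_semantics` and `trailing_window` are made-up names, and the "projection" is only one possible reading of what a flexible window could look like.

```python
from datetime import datetime, timedelta
from typing import Tuple


def current_semantics(tick: datetime, period: timedelta) -> Tuple[datetime, datetime]:
    """Behaviour as described in this thread: the run that fires at `tick`
    covers the interval that just closed and is labelled with its start."""
    execution_date = tick - period
    return execution_date, tick  # (interval start, interval end)


def trailing_window(tick: datetime, lookback: timedelta) -> Tuple[datetime, datetime]:
    """Hypothetical projection: the run at `tick` processes a window of
    `lookback` ending at `tick`, independent of the schedule period."""
    return tick - lookback, tick


tick = datetime(2019, 4, 15, 16, 0)
# An hourly schedule: the run labelled 3pm only fires at 4pm.
print(current_semantics(tick, timedelta(hours=1)))
# Same tick, but the task wants the last five days of records.
print(trailing_window(tick, timedelta(days=5)))
```

The second function is the case where the run frequency differs from the size of the data window, which the current single-interval model does not express.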

On Mon, Apr 15, 2019 at 8:52 AM airflowuser <airflowu...@protonmail.com.invalid> wrote:

> To quote my user-experience professor from ages ago:
> "If too many people misuse something you wrote it means that YOU are
> doing something wrong".
>
> Something can be well documented, but if it's not intuitive it's likely
> that people will get it wrong.
>
> Say someone asks "When did you execute the code?" Your answer will be
> direct - the time the code started to run. This is why so many people
> misunderstand execution_date as Airflow uses the term. Airflow took a
> word that is well defined in our consciousness and replaced its meaning.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, April 15, 2019 3:35 PM, Dan Davydov
> <ddavy...@twitter.com.INVALID> wrote:
>
> > I think if the mission of Airflow is to be a generic Workflow engine,
> > the current semantics of execution date aren't a good default. This
> > might be an unpopular opinion given past threads on this topic :).
> >
> > The execution_date = end_date semantics make sense for the ETL use
> > case but not for other use cases. I think Cron syntax is more
> > intuitive to users, i.e. start_date should match execution_date
> > (although I don't have data to back this up). This is especially
> > prevalent in ML; it's almost a rite of passage for users to get
> > confused by execution date semantics. I think a flag to support
> > different execution date semantics makes sense, even at the cost of
> > being a headache to support both, and even though the complexity
> > increase could lead to bugs and trickier mailing list support.
> >
> > On Wed, Apr 10, 2019 at 9:00 PM Gabriel Silk gs...@dropbox.com.invalid
> > wrote:
> >
> > > My two cents:
> > > "execution_date" is definitely confusing to newcomers, and it's
> > > partly the ambiguity of the wording, and partly the UI's fault.
> > > When I first saw execution date, I assumed it meant the earliest
> > > time at which the task will execute, which is wrong.
> > > I was confused when no tasks appeared for 3pm until 4pm.
> > >
> > > My proposal to fix that:
> > >
> > > 1. Always show the next task to be executed in the UI, but explain
> > >    to the user that it's not running because its interval has not
> > >    yet completed. Indicate this state visually, perhaps by using
> > >    some transparency or another color.
> > >
> > > 2. Instead of just showing execution date in the UI, show the
> > >    low/high range of the time period it covers (for periodic jobs).
> > >
> > > As for what we call the low/high timestamps, I like these two
> > > options:
> > >
> > > - low_ts, high_ts
> > > - interval_start, interval_end
> > >
> > > On Wed, Apr 10, 2019 at 6:43 AM James Meickle
> > > jmeic...@quantopian.com.invalid wrote:
> > >
> > > > Strictly tying execution start to interval end doesn't work for
> > > > some workflows (my guess, 1-5% of them?):
> > > >
> > > > - You need to start performing tasks before the interval is over
> > > > - You have tasks that reference a single interval, but can't be
> > > >   completed until several intervals later (due to data latency)
> > > > - The frequency you need to run the task on is different than the
> > > >   frequency of the interval you need to process (like processing
> > > >   all records from the last five days, every day)
> > > >
> > > > Airflow doesn't handle any of these situations gracefully and
> > > > I've seen people attempt all sorts of workarounds for them.
> > > > Probably even more people would try, if we provided decent idioms
> > > > for doing it rather than those workarounds.
> > > >
> > > > On Wed, Apr 10, 2019 at 9:30 AM Driesprong, Fokko
> > > > fo...@driesprong.frl wrote:
> > > >
> > > > > I see what you mean. I don't really like the
> > > > > `period_{start,end}` name, but something such as
> > > > > `interval_{start,end}` might do it for me.
> > > > > Personally, I think running the job after the interval closes
> > > > > (since then you have all the data over the interval) makes
> > > > > complete sense for ETL jobs. I agree it requires some time to
> > > > > get used to. Maybe we're lacking on documentation here.
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > > On Wed, Apr 10, 2019 at 10:08, Flo Rance troura...@gmail.com
> > > > > wrote:
> > > > >
> > > > > > I didn't expect to participate in any debate on that
> > > > > > software, as I'm a complete newcomer. But I'm almost forced
> > > > > > to, as I am the target audience, too.
> > > > > >
> > > > > > To answer your initial question, after reading a lot of
> > > > > > documentation I find the term execution_date really
> > > > > > counterintuitive, so yes, maybe period_start and period_end
> > > > > > might be a better naming to help to understand how all the
> > > > > > initial scheduling works. Because even after reading the
> > > > > > scheduling section of the doc and the FAQ, it was still not
> > > > > > clear in my mind. Btw, I find some ideas exposed by James
> > > > > > Meickle in the [DISCUSS] AIRFLOW-4192 very interesting and I
> > > > > > share his opinion that there's still room for improvement.
> > > > > >
> > > > > > But a mode to change from "run at end of period, I need all
> > > > > > the data available for this period" (the current) to "run at
> > > > > > this time on the schedule_interval" would be awesome.
> > > > > >
> > > > > > Regards,
> > > > > > Flo
> > > > > >
> > > > > > On Tue, Apr 9, 2019 at 4:41 PM Ash Berlin-Taylor
> > > > > > a...@apache.org wrote:
> > > > > >
> > > > > > > Yeah, that's the other thing that has been talked about
> > > > > > > from time-to-time, which is a mode to change from "run at
> > > > > > > end of period, I need all the data available for this
> > > > > > > period" (the current) to "run at this time on the
> > > > > > > schedule_interval, don't wait for the period to end".
> > > > > > >
> > > > > > > (No such flag exists right now, before you go looking.)
> > > > > > >
> > > > > > > > On 9 Apr 2019, at 15:31, Shaw, Damian P.
> > > > > > > > <damian.sha...@credit-suisse.com> wrote:
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I'm new to this Airflow Dev mailing list so I wasn't
> > > > > > > > expecting to reply to anything but I feel I am the target
> > > > > > > > audience for this question. I am quite new to airflow and
> > > > > > > > have been setting up an airflow environment for my
> > > > > > > > business this last month.
> > > > > > > >
> > > > > > > > I find the current "execution_date" a small technical
> > > > > > > > burden and a large cognitive burden. Our workflow is
> > > > > > > > based on DAGs running at a specified time in a specified
> > > > > > > > timezone using the same date as the current calendar
> > > > > > > > date.
> > > > > > > >
> > > > > > > > I have worked around this by creating my own macro and
> > > > > > > > context variables, with the logic looking like this:
> > > > > > > >
> > > > > > > >     airflow_execution_date = context['execution_date']
> > > > > > > >     dag_timezone = context['dag'].timezone
> > > > > > > >     local_execution_date = dag_timezone.convert(airflow_execution_date)
> > > > > > > >     local_cal_date = local_execution_date + datetime.timedelta(days=1)
> > > > > > > >
> > > > > > > > As you can see this isn't a lot of technical effort, but
> > > > > > > > having a date that 1) is in the timezone the business
> > > > > > > > users are working in, and 2) is the same calendar date
> > > > > > > > the business users are working in significantly reduces
> > > > > > > > the cognitive effort required to set up tasks. Of course
> > > > > > > > this doesn't help with cron format scheduling: there I
> > > > > > > > just let the business give me the requirements and set it
> > > > > > > > up myself, as the date logic is still confusing because
> > > > > > > > it doesn't work like real cron scheduling, which everyone
> > > > > > > > is familiar with.
> > > > > > > >
> > > > > > > > Maybe "period_start" and "period_end" might help people
> > > > > > > > on Day 0 of understanding Airflow get that the dates you
> > > > > > > > are dealing with are not what you expect, but Day 1+
> > > > > > > > there's still a lot of cognitive overhead if you don't
> > > > > > > > have the exact same model as Airbnb for running DAGs and
> > > > > > > > tasks.
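
For anyone who wants to try Damian's workaround outside of a DAG, here is a minimal standalone sketch. It is an approximation: it uses only the standard library instead of the DAG's own timezone object, `local_calendar_date` is a made-up name, and the fixed UTC-4 offset is just an example value.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo


def local_calendar_date(execution_date: datetime, dag_tz: tzinfo) -> date:
    """Convert a timezone-aware execution_date to the DAG's timezone, then
    add one day to get the calendar date on which a daily run starts."""
    # execution_date must be timezone-aware; a naive value would be
    # interpreted in the machine's local time by astimezone().
    local_execution_date = execution_date.astimezone(dag_tz)
    return (local_execution_date + timedelta(days=1)).date()


# A daily DAG in a UTC-4 timezone, labelled 2019-04-08 18:00 local time.
execution_date = datetime(2019, 4, 8, 22, 0, tzinfo=timezone.utc)
business_tz = timezone(timedelta(hours=-4))  # example fixed offset
print(local_calendar_date(execution_date, business_tz))
```

Note that the one-day shift only matches the calendar date for daily schedules, which is the case Damian describes.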
> > > > > > > >
> > > > > > > > My 2 cents anyway,
> > > > > > > > Damian Shaw
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Ash Berlin-Taylor [mailto:a...@apache.org]
> > > > > > > > Sent: Tuesday, April 09, 2019 10:08 AM
> > > > > > > > To: dev@airflow.apache.org
> > > > > > > > Subject: [DISCUSS] period_start/period_end instead of
> > > > > > > > execution_date/next_execution_date
> > > > > > > >
> > > > > > > > (trying to break this out in to another thread)
> > > > > > > >
> > > > > > > > The ML doesn't allow images, but I can guess that it is
> > > > > > > > the deps section of a task instance details screen?
> > > > > > > >
> > > > > > > > I'm not saying it's not clear once you know to look
> > > > > > > > there, but I'm trying to remove/reduce the confusion in
> > > > > > > > the first place. And I think we as committers aren't best
> > > > > > > > placed to know what makes sense as we have internalised
> > > > > > > > how Airflow works :)
> > > > > > > >
> > > > > > > > So I guess this is a question to the newest people on the
> > > > > > > > list: Would `period_start` and `period_end` have been
> > > > > > > > more or less confusing for you when you were first
> > > > > > > > getting started with Airflow?
> > > > > > > >
> > > > > > > > -ash
> > > > > > > >
> > > > > > > > > On 9 Apr 2019, at 14:47, Driesprong, Fokko
> > > > > > > > > <fo...@driesprong.frl> wrote:
> > > > > > > > >
> > > > > > > > > Ash,
> > > > > > > > >
> > > > > > > > > Personally, I think this is quite clear, there is a
> > > > > > > > > list of reasons why the job isn't being scheduled:
> > > > > > > > >
> > > > > > > > > Coming back to the question of Bas, I believe that
> > > > > > > > > yesterday_ds does not make sense since we cannot assume
> > > > > > > > > that the schedule is daily. I don't see any usage of
> > > > > > > > > this variable.
> > > > > > > > > Personally, I do use next_execution_date quite
> > > > > > > > > extensively. Say you have a job that runs daily, but
> > > > > > > > > you want to change it to an hourly job. In such a case
> > > > > > > > > you don't want to change
> > > > > > > > > {{ (execution_date + macros.timedelta(days=1)) }} to
> > > > > > > > > {{ (execution_date - macros.timedelta(hours=1)) }}
> > > > > > > > > everywhere.
> > > > > > > > >
> > > > > > > > > I'm just not sure if the aggressive deprecation is
> > > > > > > > > really worth it. I don't see too much harm in letting
> > > > > > > > > them stay.
> > > > > > > > >
> > > > > > > > > Cheers, Fokko
> > > > > > > > >
> > > > > > > > > On Tue, Apr 9, 2019 at 12:17, Ash Berlin-Taylor
> > > > > > > > > <a...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > To (slightly) hijack this thread:
> > > > > > > > >
> > > > > > > > > On the subject of execution_date: as I'm sure we're all
> > > > > > > > > aware, the concept of execution_date is confusing to
> > > > > > > > > newcomers to Airflow (there are many questions about
> > > > > > > > > "why hasn't my DAG run yet?", "Why is my dag a day
> > > > > > > > > behind?" etc.) and although we mention this in the docs
> > > > > > > > > it's a confusing concept.
> > > > > > > > >
> > > > > > > > > What do people think about adding two new parameters,
> > > > > > > > > `period_start` and `period_end`, and making these the
> > > > > > > > > preferred terms in place of execution_date and
> > > > > > > > > next_execution_date?
> > > > > > > > >
> > > > > > > > > This hopefully avoids any ambiguous terms like
> > > > > > > > > "execution" or "run", which are understandably easy to
> > > > > > > > > conflate with the time the task is being run (i.e.
> > > > > > > > > `now()`).
> > > > > > > > >
> > > > > > > > > If people think this naming is better and less
> > > > > > > > > confusing I would suggest we update all the docs and
> > > > > > > > > examples to use these terms (but still mention the old
> > > > > > > > > names somewhere, probably in the macros docs).
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > -ash
> > > > > > > > >
> > > > > > > > > > On 8 Apr 2019, at 16:20, Arthur Wiedmer
> > > > > > > > > > <arthur.wied...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Bas,
> > > > > > > > > >
> > > > > > > > > > 1. I am aware of a few places where those parameters
> > > > > > > > > >    are used in production in a few hundred jobs. I
> > > > > > > > > >    highly recommend we don't deprecate them unless we
> > > > > > > > > >    do it in a major version.
> > > > > > > > > >
> > > > > > > > > > 2. As James mentioned, inlets and outlets are a
> > > > > > > > > >    lineage annotation feature which is still under
> > > > > > > > > >    development. Let's leave them in, but we can guard
> > > > > > > > > >    them behind a feature flag if you prefer.
> > > > > > > > > >
> > > > > > > > > > 3. The yesterday*/tomorrow* params are convenience
> > > > > > > > > >    ones if you use a daily ETL.
> > > > > > > > > >    I agree with you that they are simple to compute,
> > > > > > > > > >    but not everyone using Apache Airflow is amazing
> > > > > > > > > >    with Python. Some users are only trying to get a
> > > > > > > > > >    query scheduled and rely on a couple of niceties
> > > > > > > > > >    like these to get by.
> > > > > > > > > >
> > > > > > > > > > 4. latest_date, end_date (I feel like there used to
> > > > > > > > > >    be start_date, but maybe it got lost) were a blend
> > > > > > > > > >    of things which were used by a backfill framework
> > > > > > > > > >    used internally at Airbnb. Latest date was used if
> > > > > > > > > >    you needed to join to a dimension for which you
> > > > > > > > > >    only wanted the latest version of the attributes
> > > > > > > > > >    in your backfill. end_date was used for time
> > > > > > > > > >    ranges where several days were processed together
> > > > > > > > > >    in a range to save on compute. I don't see an
> > > > > > > > > >    issue with removing them.
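
For reference, the yesterday*/tomorrow* values from point 3 can be derived from ds roughly as follows. This is a minimal sketch, assuming ds is the usual YYYY-MM-DD string; `shift_ds` is a made-up helper name, not an Airflow function.

```python
from datetime import datetime, timedelta


def shift_ds(ds: str, days: int, nodash: bool = False) -> str:
    """Shift a YYYY-MM-DD date string by `days` and return it either in
    the same format or without dashes (YYYYMMDD)."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y%m%d" if nodash else "%Y-%m-%d")


ds = "2019-04-08"
yesterday_ds = shift_ds(ds, -1)                     # "2019-04-07"
tomorrow_ds_nodash = shift_ds(ds, 1, nodash=True)   # "20190409"
print(yesterday_ds, tomorrow_ds_nodash)
```

As noted elsewhere in the thread, these are calendar-day offsets and do not depend on the DAG's schedule interval.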
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Arthur
> > > > > > > > > >
> > > > > > > > > > On Mon, Apr 8, 2019 at 5:37 AM Bas Harenslak
> > > > > > > > > > <basharens...@godatadriven.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Following Tao Feng’s question to discuss this PR
> > > > > > > > > > > <https://github.com/apache/airflow/pull/5010>
> > > > > > > > > > > (AIRFLOW-4192
> > > > > > > > > > > <https://issues.apache.org/jira/browse/AIRFLOW-4192>),
> > > > > > > > > > > please discuss here if you agree/disagree/would
> > > > > > > > > > > change.
> > > > > > > > > > >
> > > > > > > > > > > The summary of the PR:
> > > > > > > > > > >
> > > > > > > > > > > I was confused by the task context values and
> > > > > > > > > > > suggest to clean up and clarify these variables.
> > > > > > > > > > > Some are derivations from other variables, some
> > > > > > > > > > > are undocumented and unused, some are wrong (name
> > > > > > > > > > > doesn’t match the value).
> > > > > > > > > > >
> > > > > > > > > > > Please discuss what you think of the removal of
> > > > > > > > > > > these variables:
> > > > > > > > > > >
> > > > > > > > > > > - Removed yesterday_ds, yesterday_ds_nodash,
> > > > > > > > > > >   tomorrow_ds, tomorrow_ds_nodash.
> > > > > > > > > > >   IMO the next_* and previous_* variables are
> > > > > > > > > > >   useful since these require complex logic to
> > > > > > > > > > >   compute the next execution date; however, I
> > > > > > > > > > >   would leave computing the yesterday* and
> > > > > > > > > > >   tomorrow* variables up to the user since they
> > > > > > > > > > >   are simple one-liners and don't relate to the
> > > > > > > > > > >   DAG interval.
> > > > > > > > > > >
> > > > > > > > > > > - Removed tables. This is a field in params, and
> > > > > > > > > > >   is thus also accessible by the user
> > > > > > > > > > >   ({{ params.tables }}). Also, it was
> > > > > > > > > > >   undocumented.
> > > > > > > > > > >
> > > > > > > > > > > - Removed latest_date. It's the same as ds and
> > > > > > > > > > >   was also undocumented.
> > > > > > > > > > >
> > > > > > > > > > > - Removed inlets and outlets. Also undocumented,
> > > > > > > > > > >   and have the inlets/outlets ever worked/ever
> > > > > > > > > > >   been used by anybody?
> > > > > > > > > > >
> > > > > > > > > > > - Removed end_date and END_DATE. Both have the
> > > > > > > > > > >   same value, so it doesn't make sense to have
> > > > > > > > > > >   both variables. Also, the value is ds, which
> > > > > > > > > > >   contains the start date of the interval, so the
> > > > > > > > > > >   naming didn't make sense to me.
> > > > > > > > > > >   However, if anybody argues in favour of adding
> > > > > > > > > > >   "start_date" and "end_date" to provide the
> > > > > > > > > > >   start and end datetime of task instance
> > > > > > > > > > >   intervals, I'd be happy to add them.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Bas