I would like to challenge the notion that "execution_date" is well documented. Looking at airflow.apache.org right now and searching for all references to "execution_date", I find that the only definition of execution_date is, "The execution date of the DAG". There are some other passing references that imply more but nothing explicit.
From the documentation, as currently published, it seems reasonable to expect some concurrence between "execution_date" and when a dag executes, especially given the heavy repetition of, "execution_date - The execution date of the DAG". Personally, I think the problem is the word "execution", not with which bound is used to label/define an interval. I think this is especially difficult for people coming to Airflow with a cron background who are not necessarily thinking about intervals.

On Thu, Sep 27, 2018 at 11:23 AM Brian Greene <br...@heisenbergwoodworking.com> wrote:

> Second use of “inane” on this subject. Brilliant, less combative response Chris.
>
> There’s another point.. left bound makes sense to some people, right bound to others.
>
> There’s no way to know or measure how “hard” this is to new users, so even if the change was made - new name, use right bound... how can you be sure you’re not actually confusing a LARGER number of new users from that point on.
>
> It’s like left handed versus right handed people, except there’s no statistical basis for your argument that one group is larger than the other, or that there would actually be a measurable uptick in understanding and usability across the ENTIRE user community.
>
> So your proposal 100% breaks backwards compatibility of code AND concept, on anecdotal evidence that it would somehow make usage magically easier?
>
> Airflow is like a bulldozer made out of scalpels that can fly (not well, but it’s possible). A slick dag can accomplish a staggering amount of work with the smallest little bit of elegant code. Learning to “think in airflow” though is so, so much more than understanding execution date. That’s barely table stakes in terms of concepts you’ll need to accept to be effective with airflow.
>
> Maybe somebody just has a thing against lefty’s? Some kind of left-bound-thinking conspiracy?
>
> Sent from a device with less than stellar autocorrect
>
> > On Sep 27, 2018, at 12:56 PM, Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > While taking a step back makes some sense, we also need to identify what the issue is. Simply saying 'execution_date behavior is confusing to new users' isn't good enough. What is confusing about it? Is it what it represents, or just the name itself?
> >
> > There are a number of different timestamps that might be of interest, including (but not limited to):
> >
> > *Identifying timestamp*
> > For any time interval, there are two natural choices of timestamps to represent that interval, the left and right bounds. For Airflow the left bound has been chosen, and is called execution_date. For various reasons, I think that makes a much better choice than the right bound.
> >
> > *Create/update/delete timestamps*
> > Timestamps representing when particular database records were created, updated and/or deleted. I don't believe that Airflow currently records these.
> >
> > *Runtime timestamps*
> > The timestamps that a task or other process started and stopped. Airflow records these for Tasks, but I think the implementation is maybe a little lacking for DagRuns.
> >
> >
> > So what's the confusion with execution_date? Is it what it represents or the name itself?
> >
> > I think part of the learning curve with Airflow is understanding that execution_date is the left bound of the interval. No matter what name you use for the identifying timestamp I think new users will need to learn what that choice means. Changing the name won't magically make all the confusion go away.
> >
> > While I don't think execution_date is the greatest name in the world, it's a lot better than the suggested alternative run_stamped. Tasks also have an identifying timestamp, and if I saw run_stamped on a Task I would have no idea what it means (stamped by what?).
> >
> > While there may be better names than execution_date, I don't think they are so much better that it is worth the effort to overhaul such an integral part of Airflow. Maybe some improvements to the documentation could be made, but nothing so drastic as to renaming such a core item.
> >
> >
> > As for the second suggestion to add "a new variable which indicated the actual datetime when the DAG run was generated. call it execution_start_date". It is very unclear what the desired outcome is with this.
> >
> > To me "generated" implies creation time, i.e. recorded in the database. However, creation of a DagRun record in the database is a distinct event from when Tasks associated with that DagRun start executing. Plus DagRuns themselves don't actually "run" - Tasks are the only thing that really gets run by Airflow.
> >
> > What is actually desired here?
> > - The right bound of the schedule interval?
> > - The time the DagRun was created?
> > - The time that any Tasks associated with a DagRun were first considered by the scheduler?
> > - The time that any Tasks associated with a DagRun were first scheduled?
> > - The time that any Tasks associated with a DagRun were actually started by a worker?
> >
> >
> > The lack of clarity and completeness around these suggestions, alongside inane declarations like "This name won't cause people to get confused" is hardly a good way to get people to take suggestions seriously.
> >
> > Chris
> >
> >
> > On Wed, Sep 26, 2018 at 7:37 PM George Leslie-Waksman <waks...@gmail.com> wrote:
> >
> >> This comes up a lot. I've seen it on this mailing list multiple times and it's something that I have to explicitly call out to every single person that I've helped train up on Airflow.
> >>
> >> If we take a moment to set aside why things are the way they are, what the documentation says, and how experienced users feel things should behave; there still remains the fact that a lot of new users get confused by how "execution_date" works.
> >>
> >> Whether it's a problem, whether we need to do something, and what we could do are all separate questions but I think it's important that we acknowledge and start from:
> >>
> >> A lot of new users get confused by how "execution_date" works.
> >>
> >> I recognize that some of this is a learning curve issue and some of this is a mindset issue but it begs the question: do enough users benefit from the current structure to justify the harm to new users?
> >>
> >> --George
> >>
> >> On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <br...@heisenbergwoodworking.com> wrote:
> >>
> >>> It took a minute to grok, but in the larger context of how af works it makes perfect sense the way it is. Changing something so fundamentally breaking to every dag in existence should bring a comparable benefit. Beyond the avoiding teaching a concept you disagree with, what benefits does the proposal bring to offset the cost of change?
> >>>
> >>> I’m gonna make a meme - “do you even airflow bro?”
> >>>
> >>> Sent from a device with less than stellar autocorrect
> >>>
> >>>> On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
> >>>>
> >>>> I think if you have a functional mindset (as in "functional data engineering <https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>") as opposed to a cron mindset, using the left bound of the time interval makes a lot of sense. Things like your daily table partition keys align with your Airflow execution_date.
> >>>>
> >>>> The main thing is that whatever we do we cannot break backwards compatibility. Offering both views (left bound/right bound), as it's been proposed before, either as an environment setting or a user personal preference is even more confusing to me personally. Users would have to switch context as they help each other or change environments.
> >>>>
> >>>> Also note that your intuition may differ from other people's intuition, and that "unlearning" something is way harder than learning something.
> >>>>
> >>>> My personal take on this is to make this a rite of passage. This is just one of the many things you have to learn when learning Airflow.
> >>>>
> >>>> Max
> >>>>
> >>>>> On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin <hussam.ela...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Bolke
> >>>>>
> >>>>> Speaking as a consultant who is constantly training other teams how to use airflow, I do frequently see this confusion.
> >>>>> Another one is how the batch_date is always batch_date + interval or as the docs make it quite clear
> >>>>>
> >>>>> "*Let’s Repeat That* The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period."
> >>>>>
> >>>>> Renaming it would make it simpler for newbies, but essentially they will need to understand how Airflow behaves, execution_date being the batch execution date not the run_date of the DAG
> >>>>>
> >>>>> I am actually in the process of writing a blog post <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/> about this which I could use people's feedback
> >>>>>
> >>>>> If it helps, I find that explaining how backfills work and why they are important will drive home what the execution_date is :)
> >>>>>
> >>>>>
> >>>>> Regards
> >>>>> Sam
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>>
> >>>>>> I don't think this makes sense and I don't think that anyone had a real issue with this. Execution date has been clearly documented and is part of the core principles of airflow. Renaming will create more confusion.
> >>>>>>
> >>>>>> Please note that I do think that as an anonymous user you cannot speak for any "new airflow user". That is a contradiction to me.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Bolke
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>>> On 26 Sep 2018, at 07:59, airflowuser <airflowu...@protonmail.com.INVALID> wrote:
> >>>>>>>
> >>>>>>> One of the most annoying, hard to understand and against all common sense is the execution_date behavior. I assume that any new Airflow user has been struggling with it.
> >>>>>>> The amount of questions with answers referring to: https://airflow.apache.org/scheduler.html?scheduling-triggers is uncountable.
> >>>>>>>
> >>>>>>> Most people mistakenly think that execution_date is the datetime which the DAG started to run.
> >>>>>>>
> >>>>>>> I suggest the following changes:
> >>>>>>> 1. Renaming the execution_date to something else like: run_stamped. This name won't cause people to get confused.
> >>>>>>> 2. Adding a new variable which indicated the actual datetime when the DAG run was generated. call it execution_start_date. People seem to want the information when the DAG actually started to be executed/run.
> >>>>>>>
> >>>>>>> This is only naming changes. No need to actually change the behavior - This will only make things simpler as when user encounter run_stamped he won't be confused by the name like execution_date
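
For anyone reading along who wants the convention in concrete terms, here is a minimal sketch of the left-bound labelling discussed in this thread: execution_date marks the start of the interval, and per the scheduler docs quoted above the run happens at the end of that period. It uses only the Python standard library. The helper name interval_bounds is hypothetical and is not an Airflow function, and the sketch assumes the schedule is a fixed timedelta rather than a cron expression.

```python
from datetime import datetime, timedelta


def interval_bounds(execution_date, schedule_interval):
    """Return (left_bound, right_bound) of the interval a run covers.

    Hypothetical helper for illustration only; not part of Airflow's API.
    Assumes schedule_interval is a fixed timedelta.
    """
    # The left bound is the timestamp the run is labelled with.
    left_bound = execution_date
    # The right bound is the end of the period, i.e. the earliest moment
    # the scheduler would start the run under the quoted docs.
    right_bound = execution_date + schedule_interval
    return left_bound, right_bound


daily = timedelta(days=1)
left, right = interval_bounds(datetime(2018, 9, 26), daily)
print("execution_date (left bound): ", left)
print("earliest start (right bound):", right)
```

With a daily schedule, the run labelled 2018-09-26 covers that whole day and cannot start before 2018-09-27, which is the gap that tends to surprise people coming from cron.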