I agree with Tomek. TBH, *Timetable* does seem like a complex concept to me, and I can't quite understand what it is at first sight.
I think *Schedule* conveys the message better. Consider the sentence: "*the Scheduler arranges jobs to run based on some _______.*" Here, *"schedules"* seems to fit better than *"timetables"*.

As for search results on schedule vs. scheduler, I wonder if reorganizing the docs to have `schedule` in the top section and `scheduler` under operation::architecture would help with the search results? (I don't know much about SEO.)

Nevertheless, the naming shouldn't be a blocker for this feature to move forward.

Best,
Bin

On Thu, Mar 11, 2021 at 4:31 PM Tomasz Urbaszek <[email protected]> wrote:

> Timetable is a synonym of 'Schedule'

I'm not a native speaker and I don't get it easily as a synonym.

To be honest, "timetable" sounds like a complex piece of software. Personally, I have experienced that people use schedule and schedule_interval interchangeably. Additionally, schedule being more closely linked to scheduler is, imho, an advantage because it suggests some connection between the two.

I feel that by introducing timetable we will add yet more complexity to the Airflow vocabulary. And I personally would treat it as yet another moving part of an already complex system.

I think we can move forward with this feature. We renamed "functional DAGs" to "Taskflow API", so naming is not a blocker. If we can't get consensus we can always ask the users - they will use the feature.

Best,
Tomek

On Fri, 12 Mar 2021 at 01:15, James Timmins <[email protected]> wrote:

*Timetable vs Schedule*
Re Timetable. I agree that if this were a greenfield project, it might make sense to use Schedule. But as it stands, we need to find the right balance between the most specific name and one sufficiently unique that it's easy to work with in code and, perhaps most importantly, easy to find when searching on Google and in the Airflow docs.

There are more than 10,000 references to `schedule*` in the Airflow codebase.
`schedule` and `scheduler` are also identical to most search engines/libraries, since they have the same stem, `schedule`. This means that when a user Googles `Airflow Schedule`, they will get back intermixed results for the Schedule class and the Scheduler.

Timetable is a synonym of 'Schedule', so it passes the accuracy test, won't ever be ambiguous in code, and is distinct in search results.

*Should "interval-less DAGs" have data_interval_start and end available in the context?*
I think they should be present so it's consistent across DAGs. Let's not make users think too hard about what values are available in what context. What if someone sets the interval to 0? What if sometimes the interval is 0, and sometimes it's 1 hour? Rather than changing the rules depending on usage, it's easiest to have one rule that the users can depend upon.

*Re set_context_variables()*
What context is being defined here? The code comment says "Update or set new context variables to become available in task templates and operators." The Timetable seems like the wrong place for variables that will get passed into task templates/operators, unless this is actually a way to pass Airflow macros into the Timetable context. In which case I fully support this. If not, we may want to add this functionality.

James

On Mar 11, 2021, 9:40 AM -0800, Kaxil Naik <[email protected]>, wrote:

Yup, I have no strong opinions on either, so I'm happy to keep it Timetable, or open to another suggestion.

Regards,
Kaxil

On Thu, Mar 11, 2021 at 5:00 PM James Timmins <[email protected]> wrote:

Respectfully, I strongly disagree with the renaming of Timetable to Schedule. Schedule and Scheduler aren't meaningfully different, which can lead to a lot of confusion.
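[Editorial aside: the stemming point above can be illustrated with a toy suffix-stripper. `naive_stem` is a made-up function for illustration only; real search engines use full stemming algorithms such as Porter's, but the effect on these two words is the same.]

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip a trailing agentive "r", then a final "e".

    A drastic simplification of what search-engine stemmers do,
    but enough to show why these two words collide in an index.
    """
    if word.endswith("r"):
        word = word[:-1]
    if word.endswith("e"):
        word = word[:-1]
    return word


# Both words collapse to the same index term, so a query for one
# matches documents containing the other.
print(naive_stem("schedule"))   # schedul
print(naive_stem("scheduler"))  # schedul
```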
Even as a native English speaker, and someone who works on Airflow full time, I routinely need to ask for clarification about which schedule-related concept someone is referring to. I foresee Schedule and Scheduler, as two distinct yet closely related concepts, becoming a major source of confusion.

If folks dislike Timetable, we could certainly change to something else, but let's not use something so similar to existing Airflow classes.

-James

On Thu, Mar 11, 2021 at 2:13 AM Ash Berlin-Taylor <[email protected]> wrote:

Summary of changes so far on the AIP:

My proposed rename of DagRun.execution_date is now DagRun.schedule_date (previously I had proposed run_date. Thanks dstandish!)

Timetable classes are renamed to Schedule classes (CronSchedule etc.); similarly, the DAG argument is now schedule (reminder: schedule_interval will not be removed or deprecated, and will still be the way to use "simple" expressions).

-ash

On Wed, 10 Mar, 2021 at 14:15, Ash Berlin-Taylor <[email protected]> wrote:

We could change Timetable to Schedule -- that would mean the DAG arg becomes `schedule=CronSchedule(...)` -- a bit close to the current `schedule_interval`, but I think a clear enough difference.

I do like the name, but my one worry with "schedule" is that Scheduler and Schedule are very similar, and might be confused with each other by non-native English speakers? (I defer to others' judgment here, as this is not something I can experience myself.)

@Kevin Yang <[email protected]> @Daniel Standish <[email protected]> any final input on this AIP?

On Tue, 9 Mar, 2021 at 16:59, Kaxil Naik <[email protected]> wrote:

Hi Ash and all,

> What do people think of this? Worth it? Too complex to reason about what context variables might exist as a result?
I think I wouldn't worry about it right now, or maybe not as part of this AIP. Currently, in one of the GitHub issues, a user mentioned that it is not straightforward to know what is inside the context dictionary: https://github.com/apache/airflow/issues/14396. So maybe we can tackle this issue separately once the AbstractTimetable is built.

> Should "interval-less DAGs" (ones using "CronTimetable" in my proposal vs "DataTimetable") have data_interval_start and end available in the context?

Hmm, I would say no, but then that contradicts my suggestion to remove the context dict from this AIP. If we are going to use it in the scheduler then yes, where data_interval_start = data_interval_end for CronTimetable.

> Does anyone have any better names than TimeDeltaTimetable, DataTimetable, and CronTimetable? (We can probably change these names right up until release, so it's not important to get this correct *now*.)

No strong opinion here. Just an alternate suggestion: TimeDeltaSchedule, DataSchedule and CronSchedule.

> Should I try to roll AIP-30 <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence> in to this, or should we make that a future addition? (My vote is for future addition)

I would vote for future addition too.

Regards,
Kaxil

On Sat, Mar 6, 2021 at 11:05 AM Ash Berlin-Taylor <[email protected]> wrote:

I think, yes, AIP-35 or something like it would happily co-exist with this proposal.
@Daniel <[email protected]> and I have been discussing this a bit on Slack, and one of the questions he asked was whether the concept of data_interval should be moved from DagRun, as James and I suggested, down to the individual task:

> suppose i have a new dag hitting 5 api endpoints and pulling data to s3. suppose that yesterday 4 of them succeeded but one failed. today, 4 of them should pull from yesterday. but the one that failed should pull from 2 days back. so even though these normally have the same interval, today they should not.

My view on this is twofold: one, this should primarily be handled by retries on the task, and secondly, having different TaskInstances in the same DagRun have different data intervals would be much harder to reason about/design the UI around, so for those reasons I still think interval should be a DagRun-level concept.

(He has a stalled AIP-30 where he proposed something to address this kind of "watermark" case, which we might pick up next after this AIP is complete.)

One thing we might want to do is extend the interface of AbstractTimetable to be able to add/update parameters in the context dict, so the interface could become this:

    class AbstractTimetable(ABC):
        @abstractmethod
        def next_dagrun_info(
            self,
            date_last_automated_dagrun: Optional[pendulum.DateTime],
            session: Session,
        ) -> Optional[DagRunInfo]:
            """
            Get information about the next DagRun of this dag after
            ``date_last_automated_dagrun`` -- the execution date, and
            the earliest it could be scheduled.

            :param date_last_automated_dagrun: The max(execution_date)
                of existing "automated" DagRuns for this dag (scheduled
                or backfill, but not manual)
            """

        @abstractmethod
        def set_context_variables(
            self, dagrun: DagRun, context: Dict[str, Any]
        ) -> None:
            """
            Update or set new context variables to become available in
            task templates and operators.
            """

What do people think of this? Worth it? Too complex to reason about what context variables might exist as a result?

*Outstanding questions*:

- Should "interval-less DAGs" (ones using "CronTimetable" in my proposal vs "DataTimetable") have data_interval_start and end available in the context?
- Does anyone have any better names than TimeDeltaTimetable, DataTimetable, and CronTimetable? (We can probably change these names right up until release, so it's not important to get this correct *now*.)
- Should I try to roll AIP-30 <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence> in to this, or should we make that a future addition? (My vote is for future addition)

I'd like to start voting on this AIP next week (probably on Tuesday), as I think this will be a powerful feature that eases confusion for new users.

-Ash

On Tue, 2 Mar, 2021 at 23:05, Alex Inhert <[email protected]> wrote:

Is this AIP going to co-exist with AIP-35 "Add Signal Based Scheduling To Airflow"? I think streaming was also discussed there (though it wasn't really the use case).

02.03.2021, 22:10, "Ash Berlin-Taylor" <[email protected]>:

Hi Kevin,

Interesting idea. My original idea was actually that "interval-less DAGs" (i.e. ones where it's just "run at this time") would not have data_interval_start or end, but (while drafting the AIP) we decided that it was probably "easier" if those values were always datetimes.
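[Editorial aside: to make the proposed interface concrete, here is a minimal self-contained sketch of one possible subclass. Everything here is illustrative, not from the AIP or the Airflow codebase: `FixedIntervalTimetable` and the `DagRunInfo` dataclass are invented stand-ins, stdlib `datetime` replaces `pendulum`, and the `session`/context-variable parts of the interface are omitted for brevity.]

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DagRunInfo:
    # Stand-in for the AIP's DagRunInfo: the run's data interval, plus
    # the earliest moment the run could actually be scheduled.
    data_interval_start: datetime
    data_interval_end: datetime
    run_after: datetime


class AbstractTimetable(ABC):
    @abstractmethod
    def next_dagrun_info(
        self, date_last_automated_dagrun: Optional[datetime]
    ) -> Optional[DagRunInfo]:
        """Return info about the run following the given one, or None."""


class FixedIntervalTimetable(AbstractTimetable):
    """Hypothetical timetable: one run per `interval`, from `start`."""

    def __init__(self, start: datetime, interval: timedelta) -> None:
        self.start = start
        self.interval = interval

    def next_dagrun_info(self, date_last_automated_dagrun):
        if date_last_automated_dagrun is None:
            interval_start = self.start  # first ever run
        else:
            interval_start = date_last_automated_dagrun + self.interval
        interval_end = interval_start + self.interval
        # A run covering [start, end) can only be scheduled once the
        # whole interval has elapsed.
        return DagRunInfo(interval_start, interval_end, run_after=interval_end)


tt = FixedIntervalTimetable(datetime(2021, 3, 1), timedelta(days=1))
first = tt.next_dagrun_info(None)
print(first.data_interval_start, first.run_after)
```

The scheduler would call `next_dagrun_info` with the previous automated run's date to walk the timetable forward; an interval-less "CronTimetable" under this shape would presumably just return `data_interval_start == data_interval_end`, as discussed above.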
That said, I think making those values nullable in the DB model would future-proof it without needing another migration to change it. Do you think this is worth doing now?

I haven't (yet! It's on my list) spent any significant time thinking about how to make Airflow play nicely with streaming jobs. If anyone else has ideas here, please share them.

-ash

On Sat, 27 Feb, 2021 at 16:09, Kevin Yang <[email protected]> wrote:

Hi Ash and James,

This is an exciting move. What do you think about using this opportunity to extend Airflow's support to streaming-like use cases? I.e. DAGs/tasks that want to run forever, like a service. For such use cases, a schedule interval might not be meaningful, so do we want to make the date interval param optional for DagRuns and task instances? That sounds like a pretty major change to the underlying model of Airflow, but this AIP is so far the best opportunity I've seen to level up Airflow's support for streaming/service use cases.

Cheers,
Kevin Y

On Fri, Feb 26, 2021 at 8:56 AM Daniel Standish <[email protected]> wrote:

Very excited to see this proposal come through, and I love the direction this has gone.

A couple of comments...

*Tree view / Data completeness view*

When you design your tasks with the canonical idempotence pattern, the tree view shows you both data completeness and task execution history (success/failure etc.).

When you don't use that pattern (which is my general preference), the tree view is only task execution history.

This change has the potential to unlock a data completeness view for canonical tasks. It's possible that the "data completeness view" can simply be the tree view, i.e. somehow it can use these new classes to know what data was successfully filled and what data wasn't.

To the extent we like the idea of either extending/plugging/modifying the tree view, or adding a distinct data completeness view, we might want to anticipate the needs of that in this change. Maybe no alteration to the proposal is needed, but I want to throw the idea out there.

*Watermark workflow / incremental processing*

A common pattern in data warehousing is pulling data incrementally from a source.

A standard way to achieve this is: at the start of the task, select the max `updated_at` in the source table and hold on to that value for a minute. This is your tentative new high watermark. Now it's time to pull your data. If your task ran before, grab the last high watermark. If not, use the initial load value. If successful, update the high watermark.

On my team we implemented this with a stateful tasks / stateful processes concept (there's a dormant draft AIP here <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>) and a WatermarkOperator that handled the boilerplate*.

Again, I don't have a specific suggestion at this moment. But I wanted to articulate this workflow because it is common, and it wasn't immediately obvious to me in reading AIP-39 how I would use it to implement this.

AIP-39 makes Airflow more data-aware. So if it can support this kind of workflow, that's great. @Ash Berlin-Taylor <[email protected]> do you have thoughts on how it might be compatible with this kind of thing as is?
---

* The base operator is designed so that subclasses only need to implement two methods:
- `get_high_watermark`: produce the tentative new high watermark
- `watermark_execute`: analogous to implementing poke in a sensor, this is where your work is done. `execute` is left to the base class, and it orchestrates (1) getting the last high watermark or initial load value and (2) updating the new high watermark if the job succeeded.
