Hey Bolke, > Are scheduler loop times a concern at all?
Yes, I strongly believe that they are. Especially as we add more DAGs/tasks. I am not a fan of (1). Caching is just going to create cache consistency issues, and be really annoying to manage, IMO. I agree that (2) seems more appealing. I can't comment on the feasibility of it, as I'm not well acquainted enough with the scheduler yet. Cheers, Chris On Fri, Jun 3, 2016 at 2:26 PM, Bolke de Bruin <bdbr...@gmail.com> wrote: > Hi, > > I am looking at speeding up the scheduler. Currently loop times increase > with the amount of tasks in a dag. This is due to > TaskInstance.are_depedencies_met executing several aggregation functions on > the database. These calls are expensive: between 0.05-0.15s per task and > for every scheduler loop this gets called twice. This call is where the > scheduler spends around 90% of its time when evaluating dags and is the > reason for people that have a large amount of tasks per dag to so quite > large loop times (north of 600s). > > I see 2 options to optimize the loop without going to a multiprocessing > approach which will just put the problem down the line (ie. the db or when > you don’t have enough cores anymore). > > 1. Cache the call to TI.are_dependencies_met by either caching in a > something like memcache or removing the need for the double call > (update_state and process_dag both make the call to > TI.are_dependencies_met). This would more or less cut the time in half. > > 2. Notify the downstream tasks of a state change of a upstream task. This > would remove the need for the aggregation as the task would just ‘know’. It > is a bit harder to implement correctly as you need to make sure you keep > being in a consistent state. Obviously you could still run a integrity > check once in a while. This option would make the aggregation event based > and significantly reduce the time spend here to around 1-5% of the current > scheduler. There is a slight overhead added at a state change of the > TaskInstance (managed by the TaskInstance itself). > > What do you think? My preferred option is #2. Am i missing any other > options? Are scheduler loop times a concern at all? > > Thanks > Bolke > > >