Hey Bolke,

> Are scheduler loop times a concern at all?

Yes, I strongly believe that they are, especially as we add more DAGs/tasks.

I am not a fan of (1). Caching is just going to create cache consistency
issues, and be really annoying to manage, IMO.

I agree that (2) seems more appealing. I can't comment on the feasibility
of it, as I'm not yet familiar enough with the scheduler.

Cheers,
Chris

On Fri, Jun 3, 2016 at 2:26 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi,
>
> I am looking at speeding up the scheduler. Currently loop times increase
> with the number of tasks in a dag. This is due to
> TaskInstance.are_dependencies_met executing several aggregation functions on
> the database. These calls are expensive: between 0.05 and 0.15s per task, and
> in every scheduler loop this gets called twice. This call is where the
> scheduler spends around 90% of its time when evaluating dags and is the
> reason people who have a large number of tasks per dag see quite
> large loop times (north of 600s).
>
> I see 2 options to optimize the loop without going to a multiprocessing
> approach, which would just push the problem down the line (i.e. to the db, or
> to when you don’t have enough cores anymore).
>
> 1. Cache the call to TI.are_dependencies_met by either caching in
> something like memcache or removing the need for the double call
> (update_state and process_dag both make the call to
> TI.are_dependencies_met). This would more or less cut the time in half.
>
> 2. Notify the downstream tasks of a state change of an upstream task. This
> would remove the need for the aggregation as the task would just ‘know’. It
> is a bit harder to implement correctly as you need to make sure you stay
> in a consistent state. Obviously you could still run an integrity
> check once in a while. This option would make the aggregation event based
> and significantly reduce the time spent here to around 1-5% of the current
> scheduler's. There is a slight overhead added at a state change of the
> TaskInstance (managed by the TaskInstance itself).
>
> What do you think? My preferred option is #2. Am I missing any other
> options? Are scheduler loop times a concern at all?
>
> Thanks
> Bolke
>
>
>
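
For illustration, the idea behind option (2) in the quoted message could be
sketched roughly as below. This is a minimal in-memory sketch, not Airflow
code: the Task class, the pending_upstream counter, and the method names are
all hypothetical, and it assumes the only dependency rule is "all upstream
tasks succeeded". Each task keeps a count of upstream tasks that have not yet
succeeded, a state change pushes an update to the downstream tasks, and an
occasional integrity check recomputes the count from scratch.

```python
SUCCESS = "success"


class Task:
    """Hypothetical stand-in for a task instance (not Airflow's TaskInstance)."""

    def __init__(self, name):
        self.name = name
        self.state = None
        self.upstream = []
        self.downstream = []
        # Number of upstream tasks that have not (yet) succeeded.
        self.pending_upstream = 0

    def set_downstream(self, other):
        """Register `other` as depending on this task."""
        self.downstream.append(other)
        other.upstream.append(self)
        if self.state != SUCCESS:
            other.pending_upstream += 1

    def set_state(self, new_state):
        """Change state and notify downstream tasks if success flipped."""
        was_done = self.state == SUCCESS
        now_done = new_state == SUCCESS
        self.state = new_state
        if was_done == now_done:
            return  # nothing the downstream tasks care about changed
        delta = -1 if now_done else 1
        for task in self.downstream:
            task.pending_upstream += delta

    def dependencies_met(self):
        """O(1) check: no aggregation over upstream tasks is needed."""
        return self.pending_upstream == 0

    def check_integrity(self):
        """Recompute the counter the slow way and compare with the cached one."""
        expected = sum(1 for task in self.upstream if task.state != SUCCESS)
        return expected == self.pending_upstream
```

In a real scheduler the counter would have to live in the database and be
updated in the same transaction as the state change, which is where the
consistency concern mentioned in the quoted message comes from.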
