[OpenStack-Infra] RFC: Zuul executor congestion control

Tobias.Henkel Tue, 16 Jan 2018 03:34:43 -0800

Hi zuulers,

the zuul-executor resource governor topic seems to be a recurring one now, and 
we might want to take the step and make it a bit smarter.


I think the current approach of a set of on/off governors based on the current 
conditions may not be sufficient. I have thought about this and would like to 
hear what you think.

TL;DR: I propose a congestion control algorithm that manages a congestion 
window, using a slow start, a generic sensor interface and weighted job 
costs.


Algorithm
--------------

The algorithm I propose would manage a congestion window of an abstract metric 
measured in points. This is intended to allow some (simple) weighting of jobs, 
as multi-node jobs, for example, probably take more resources than single-node 
jobs.

The algorithm consists of two threads. One managing the congestion window, one 
accepting the jobs.

Congestion window management:

  1.  Start with a window of size START_WINDOW_SIZE points
  2.  Get current used percentage of window
  3.  Ask sensors for green/red
  4.  If all green AND window-usage > INCREASE_WINDOW_THRESHOLD
     *   Increase window
  5.  If one red
     *   Decrease window below current usage
  6.  Loop back to step 2
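
One iteration of the window management steps above could be sketched roughly as 
follows. This is only an illustration: the constant values, the Window class, 
the fixed increase step, and the shrink factor are all assumptions made for the 
example, since the proposal leaves the actual numbers open.

```python
from dataclasses import dataclass

# All values below are placeholders, not numbers from the proposal.
START_WINDOW_SIZE = 100          # points
INCREASE_WINDOW_THRESHOLD = 0.8  # fraction of the window in use
INCREASE_STEP = 10               # points added per green iteration (assumed)
DECREASE_FACTOR = 0.9            # shrink to 90% of current usage (assumed)
MIN_WINDOW_SIZE = 10             # keep the window from collapsing to zero (assumed)


@dataclass
class Window:
    size: float = START_WINDOW_SIZE
    usage: float = 0.0  # points currently held by running jobs


def adjust_window(window, sensor_results):
    """Run steps 2-5 once; sensor_results is a list of bools (True = green)."""
    if not all(sensor_results):
        # Step 5: at least one sensor is red, so shrink below current usage.
        window.size = max(MIN_WINDOW_SIZE, window.usage * DECREASE_FACTOR)
    elif window.usage / window.size > INCREASE_WINDOW_THRESHOLD:
        # Step 4: all sensors green and the window is nearly full, so grow it.
        window.size += INCREASE_STEP
    return window.size
```

The managing thread would call adjust_window() in a loop with fresh sensor 
readings; the sleep interval between iterations is another open parameter.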

Job accepting:

  1.  Get current used window-percentage
  2.  If window-usage < window-size
     *   Register function if necessary
     *   Accept job
     *   Add the job's cost to window-usage (the job will update it 
asynchronously when finished)
  3.  Else
     *   Deregister function if necessary
  4.  Loop back to step 1
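
A single pass of the job accepting steps could look roughly like the sketch 
below. The JobAcceptor class and its method names are hypothetical, and the 
gearman function registration is reduced to a boolean flag for illustration. 
Note that, as in the steps above, a job is accepted whenever usage is below the 
window size, so the last accepted job may push usage over the window.

```python
import threading


class JobAcceptor:
    """Tracks window usage in points; registration is stubbed as a flag."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.usage = 0
        self.registered = False
        self._lock = threading.Lock()

    def try_accept(self, job_cost):
        """Run steps 1-3 once. Returns True if the job was accepted."""
        with self._lock:
            if self.usage < self.window_size:
                self.registered = True   # register function if necessary
                self.usage += job_cost   # add the job's cost to the usage
                return True
            self.registered = False      # deregister function if necessary
            return False

    def job_finished(self, job_cost):
        """Called asynchronously when a job ends; releases its points."""
        with self._lock:
            self.usage -= job_cost
```

The lock is there because usage is updated both by the accepting thread and by 
finishing jobs.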

The magic numbers used are subject to further discussion and algorithm 
tweaking.


Weighting of jobs
------------------------
Different jobs take different amounts of resources, so we would need some 
simple estimate of that. This could be tuned in the future. To start with, 
I’d propose something simple like this:

Cost_job = 5 + 5 * size of inventory

In the future this could be improved to estimate the costs based on historical 
data of the individual jobs.
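
As a small worked example of the formula above (the function name job_cost is 
only chosen for illustration): a single node job would cost 10 points and a 
three node job 20 points.

```python
def job_cost(inventory_size):
    """Cost in points: a fixed base of 5 plus 5 per node in the inventory."""
    if inventory_size < 0:
        raise ValueError("inventory size must not be negative")
    return 5 + 5 * inventory_size


print(job_cost(1))  # single node job
print(job_cost(3))  # three node job
```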


Sensors
----------

Furthermore, different ways of deployment will have different sensor needs. 
E.g. the load and RAM sensors, which use load1 and memfree, won’t work in a 
Kubernetes-based deployment, as they assume the executor is located 
exclusively on a VM. To mitigate this, I’d like to have a generic sensor 
interface into which we could also put a cgroups sensor that checks resource 
usage against the cgroup limit (which is what we need for a Kubernetes-hosted 
Zuul). We could also add a filesystem sensor that monitors whether there is 
enough local storage. For hooking this into the algorithm I think we could 
start with a single function

def isStatusOk() -> bool
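
To illustrate how such an interface might look, here is a minimal sketch with 
one concrete sensor. The class names Sensor and FilesystemSensor, the 
constructor parameters, and the all_green() helper are assumptions made for 
this example; only the isStatusOk() signature comes from the proposal.

```python
import shutil
from abc import ABC, abstractmethod


class Sensor(ABC):
    """Generic sensor: reports green (True) or red (False)."""

    @abstractmethod
    def isStatusOk(self) -> bool:
        ...


class FilesystemSensor(Sensor):
    """Green while the filesystem holding `path` has enough free space."""

    def __init__(self, path, min_free_bytes):
        self.path = path
        self.min_free_bytes = min_free_bytes

    def isStatusOk(self) -> bool:
        # shutil.disk_usage returns (total, used, free) in bytes.
        return shutil.disk_usage(self.path).free >= self.min_free_bytes


def all_green(sensors):
    """True if every configured sensor currently reports green."""
    return all(sensor.isStatusOk() for sensor in sensors)
```

A cgroups sensor or a load sensor would implement the same method, so the 
window management would not need to know which sensors a deployment uses.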


Exposing the data
-------------------------

The window-usage and window-size values could also be exported to statsd. This 
could enable autoscaling of the number of executors in deployments supporting 
that.
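
A rough sketch of what the export could look like, using the plain statsd text 
format for gauges ("name:value|g") over UDP. The metric names, the prefix, and 
both function names are made up for this example, and a real implementation 
would presumably reuse whatever statsd client the executor already has.

```python
import socket


def format_gauges(prefix, window_usage, window_size):
    """Build statsd gauge lines for the two window values."""
    return [
        f"{prefix}.window_usage:{window_usage}|g",
        f"{prefix}.window_size:{window_size}|g",
    ]


def send_gauges(lines, host="127.0.0.1", port=8125):
    """Send each gauge line as its own UDP datagram (8125 is the usual port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical prefix; each executor would report under its own name.
lines = format_gauges("zuul.executor.example", 45, 100)
```

An autoscaler could then compare window_usage against window_size across all 
executors to decide whether more executors are needed.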


What are your thoughts about that?

Kind regards
Tobi


--
BMW Car IT GmbH
Tobias Henkel
Spezialist Entwicklung
Moosacher Straße 86
80809 München

Tel.:  +49 89 189311-48
Fax:  +49 89 189311-20
Mail: [email protected]
Web: http://www.bmw-carit.de
-----------------------------------------------------------------------------
BMW Car IT GmbH
Geschäftsführer: Kai-Uwe Balszuweit
und Christian Salzmann
Sitz und Registergericht: München HRB 134810
-----------------------------------------------------------------------------
