On 14.04.2009 23:23, Jess Holle wrote:
> Jess Holle wrote:
>> Similarly, when retrying workers in various routines in
>> mod_proxy_balancer.c those workers' lbstatus is incremented.  If the
>> retry fails, however, the lbstatus is never reset.  This issue also
>> leads to an lbstatus that increases without bound.  Just because a
>> worker was dead for 8 hours does not mean it can handle all the
>> workload now.  It needs to start fresh -- not 8 hours in the hole.
>> This issue also creates an unduly huge impact when doing
>>
>>     mycandidate->s->lbstatus -= total_factor;
>>
> Actually I'm off base here.  total_factor places undue emphasis on any
> worker that satisfies a request when multiple dead workers are retried.
> For instance, if there are 7 dead workers, all being retried, 2 healthy
> workers, and all with an lbfactor of 1, the worker that gets the request
> has its lbstatus decremented by 9, whereas it really should only be
> decremented by 2 -- else the weighting gets thrown way off.  However, it
> is /not/ thrown off more by the huge lbstatus values that build up in
> dead workers.  That only becomes an issue when dead workers come back to life.
>>
>> We're seeing the load balancing be thrown dramatically off in this case.
>>
>> Does anyone have suggestions for how this should be fixed?  If not,
>> again I can take a swing at this, e.g. resetting lbstatus to 0 in
>> ap_proxy_retry_worker().
>>
>> It *seems* like both of these issues center on the handling of dead
>> workers, especially having multiple dead workers and/or workers that
>> are dead for long periods of time.
>>
>> I've not yet checked whether mod_jk (where I believe these basic
>> algorithms came from) has similar issues.

The same type of balancing decision algorithm was part of mod_jk between
1.2.7 and 1.2.15. I always had problems understanding how exactly it
behaves when some workers are out of order. The algorithm is
interesting, but I found it very hard to capture its mathematics in
formulas.

We finally decided to switch to something else. For request-, traffic- or
session-based balancing we count items (requests, bytes or new
sessions) and divide the counters by two once a minute. That way, load
that happened in the past counts less.

Furthermore, a worker that was dead or deactivated for some time gets the
biggest current load number when it is reactivated, so that it starts as
smoothly as possible.

I expect porting this to mod_proxy in trunk will be easy, but I'm not
sure what experience others have with the fairness of the balancing when
you add dynamics to the workers (errors and administrative downtimes).

Regards,

Rainer
