Re: mod_proxy/mod_proxy_balancer bug

2009-04-23 Thread Jim Jagielski


On Apr 23, 2009, at 8:45 AM, Jim Jagielski wrote:



+1... Maybe I'll branch off a 2.2-proxy branch as a sandbox to play
around in... Then we can front-port to trunk and use the sandbox as
the backport source :)



Just in case people didn't see it, I've created a branch
from 2.2.x as a place for us to try these proxy improvements,
including backports of some trunk-related things...

https://svn.apache.org/repos/asf/httpd/httpd/branches/httpd-2.2-proxy



Re: mod_proxy/mod_proxy_balancer bug

2009-04-23 Thread Jim Jagielski


On Apr 22, 2009, at 5:16 AM, jean-frederic clere wrote:


Rainer Jung wrote:

On 20.04.2009 15:57, Jim Jagielski wrote:

On Apr 17, 2009, at 4:28 PM, Rainer Jung wrote:
The same type of balancing decision algorithm was part of mod_jk  
between
1.2.7 and 1.2.15. I always had problems to understand, how it  
exactly

behaves in case some workers are out of order. The algorithm is
interesting, but I found it very hard to model its mathematics into
formulas.

We finally decided to switch to something else. For request,  
traffic or

session based balancing we do count items (requests, bytes or new
sessions), and divide the counters by two once a minute. That way  
load

that happened in the past does count less.

Furthermore a worker that was dead or deactivated some time gets  
the
biggest current load number when being reactivated, so that it  
starts a

smooth as possible.

I expect porting this to mod_proxy in trunk will be easy, but I'm  
not
sure what experience others have with the fairness of balancing  
in case
you add dynamics to the workers (errors and administrative  
downtimes).



I have some ideas on the "soft start" when an errored-out worker
returns (or when a new worker is added *hint* *hint*) that I've
been playing with. The main thing, for me at least, is low overhead,
even if it means sacrificing accuracy to the nth decimal place...
I used to think aging was not something we wanted to do in
mod_proxy, but mostly that was based on complex aging and the
overhead associated with it. But I have some ideas there as
well.

The main thing I've been working on is trying to do all these
things in trunk in a way that is "easily" backportable to 2.2...

So that makes my answer to JFC partially obsolete. Sorry, I only read
your post later.


Yep, I have also experimented in this area... We need to do something.



+1... Maybe I'll branch off a 2.2-proxy branch as a sandbox to play
around in... Then we can front-port to trunk and use the sandbox as
the backport source :)



Re: mod_proxy/mod_proxy_balancer bug

2009-04-22 Thread jean-frederic clere

Rainer Jung wrote:

On 20.04.2009 15:57, Jim Jagielski wrote:

On Apr 17, 2009, at 4:28 PM, Rainer Jung wrote:

The same type of balancing decision algorithm was part of mod_jk between
1.2.7 and 1.2.15. I always had trouble understanding exactly how it
behaves when some workers are out of order. The algorithm is
interesting, but I found it very hard to model its mathematics in
formulas.

We finally decided to switch to something else. For request-, traffic-
or session-based balancing we count items (requests, bytes or new
sessions) and divide the counters by two once a minute. That way load
that happened in the past counts less.

Furthermore, a worker that was dead or deactivated for some time gets
the biggest current load number when being reactivated, so that it
starts as smoothly as possible.

I expect porting this to mod_proxy in trunk will be easy, but I'm not
sure what experience others have with the fairness of the balancing
when you add dynamics to the workers (errors and administrative
downtimes).


I have some ideas on the "soft start" when an errored-out worker
returns (or when a new worker is added *hint* *hint*) that I've
been playing with. The main thing, for me at least, is low overhead,
even if it means sacrificing accuracy to the nth decimal place...
I used to think aging was not something we wanted to do in
mod_proxy, but mostly that was based on complex aging and the
overhead associated with it. But I have some ideas there as
well.

The main thing I've been working on is trying to do all these
things in trunk in a way that is "easily" backportable to 2.2...


So that makes my answer to JFC partially obsolete. Sorry, I only read
your post later.


Yep, I have also experimented in this area... We need to do something.

Cheers

Jean-Frederic



Regards,

Rainer





Re: mod_proxy/mod_proxy_balancer bug

2009-04-21 Thread Rainer Jung
On 20.04.2009 15:57, Jim Jagielski wrote:
> 
> On Apr 17, 2009, at 4:28 PM, Rainer Jung wrote:
>>
>> The same type of balancing decision algorithm was part of mod_jk between
>> 1.2.7 and 1.2.15. I always had trouble understanding exactly how it
>> behaves when some workers are out of order. The algorithm is
>> interesting, but I found it very hard to model its mathematics in
>> formulas.
>>
>> We finally decided to switch to something else. For request-, traffic- or
>> session-based balancing we count items (requests, bytes or new
>> sessions) and divide the counters by two once a minute. That way load
>> that happened in the past counts less.
>>
>> Furthermore, a worker that was dead or deactivated for some time gets the
>> biggest current load number when being reactivated, so that it starts as
>> smoothly as possible.
>>
>> I expect porting this to mod_proxy in trunk will be easy, but I'm not
>> sure what experience others have with the fairness of the balancing when
>> you add dynamics to the workers (errors and administrative downtimes).
>>
> 
> I have some ideas on the "soft start" when an errored-out worker
> returns (or when a new worker is added *hint* *hint*) that I've
> been playing with. The main thing, for me at least, is low overhead,
> even if it means sacrificing accuracy to the nth decimal place...
> I used to think aging was not something we wanted to do in
> mod_proxy, but mostly that was based on complex aging and the
> overhead associated with it. But I have some ideas there as
> well.
> 
> The main thing I've been working on is trying to do all these
> things in trunk in a way that is "easily" backportable to 2.2...

So that makes my answer to JFC partially obsolete. Sorry, I only read
your post later.

Regards,

Rainer


Re: mod_proxy/mod_proxy_balancer bug

2009-04-20 Thread Jim Jagielski


On Apr 17, 2009, at 4:28 PM, Rainer Jung wrote:


The same type of balancing decision algorithm was part of mod_jk between
1.2.7 and 1.2.15. I always had trouble understanding exactly how it
behaves when some workers are out of order. The algorithm is
interesting, but I found it very hard to model its mathematics in
formulas.

We finally decided to switch to something else. For request-, traffic-
or session-based balancing we count items (requests, bytes or new
sessions) and divide the counters by two once a minute. That way load
that happened in the past counts less.

Furthermore, a worker that was dead or deactivated for some time gets
the biggest current load number when being reactivated, so that it
starts as smoothly as possible.

I expect porting this to mod_proxy in trunk will be easy, but I'm not
sure what experience others have with the fairness of the balancing
when you add dynamics to the workers (errors and administrative
downtimes).



I have some ideas on the "soft start" when an errored-out worker
returns (or when a new worker is added *hint* *hint*) that I've
been playing with. The main thing, for me at least, is low overhead,
even if it means sacrificing accuracy to the nth decimal place...
I used to think aging was not something we wanted to do in
mod_proxy, but mostly that was based on complex aging and the
overhead associated with it. But I have some ideas there as
well.
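
Just to give a flavor of the kind of cheap soft start I mean (purely
illustrative, nothing is committed anywhere; RAMP_REQUESTS and all the
names below are made up), a reactivated or newly added worker could
simply advertise a reduced weight that ramps back up:

#define RAMP_REQUESTS 100

typedef struct {
    int lbfactor;        /* configured weight */
    int ramp_left;       /* selection passes left in the ramp; 0 = fully live */
} soft_worker;

static void soft_reactivate(soft_worker *w)
{
    w->ramp_left = RAMP_REQUESTS;
}

/* Called wherever the balancer reads the worker's weight. */
static int effective_lbfactor(soft_worker *w)
{
    int f;
    if (w->ramp_left == 0)
        return w->lbfactor;
    f = (w->lbfactor * (RAMP_REQUESTS - w->ramp_left)) / RAMP_REQUESTS;
    w->ramp_left--;
    return f > 0 ? f : 1;
}

Cheap to compute, and the accuracy it gives up is exactly the kind of
thing I don't mind trading away.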

The main thing I've been working on is trying to do all these
things in trunk in a way that is "easily" backportable to 2.2...



Re: mod_proxy/mod_proxy_balancer bug

2009-04-17 Thread Jess Holle

Rainer Jung wrote:

The same type of balancing decision algorithm was part of mod_jk between
1.2.7 and 1.2.15. I always had trouble understanding exactly how it
behaves when some workers are out of order. The algorithm is
interesting, but I found it very hard to model its mathematics in
formulas.

We finally decided to switch to something else. For request-, traffic-
or session-based balancing we count items (requests, bytes or new
sessions) and divide the counters by two once a minute. That way load
that happened in the past counts less.

Furthermore, a worker that was dead or deactivated for some time gets
the biggest current load number when being reactivated, so that it
starts as smoothly as possible.

I expect porting this to mod_proxy in trunk will be easy, but I'm not
sure what experience others have with the fairness of the balancing
when you add dynamics to the workers (errors and administrative
downtimes).
  
I'd be /very/ interested in such a port to mod_proxy_balancer -- in 
2.2.x in my case.  Any help/pointers/assistance would be appreciated.  I 
could apply such a change as a patch to just my version, but I'd be a 
lot more interested in getting 2.2.x as a whole to a better place and 
not having to maintain my own fork of things.


I get a strong impression that others haven't really pushed 
mod_proxy_balancer in this area.


Overall, having solid mod_proxy_balancer functionality obviously benefits 
more than just AJP, and I like the idea of mod_proxy_ajp.  That said, if 
mod_jk is going to move ahead and mod_proxy_ajp becomes a backwater at 
some point, I'll need to move back to mod_jk, though I'd really want the 
ability to gracefully throttle requests in mod_jk first.  [When mod_jk 
runs out of connections it returns a 503; mod_proxy can queue up requests 
instead.]


--
Jess Holle



Re: mod_proxy/mod_proxy_balancer bug

2009-04-17 Thread Rainer Jung
On 14.04.2009 23:23, Jess Holle wrote:
> Jess Holle wrote:
>> Similarly, when retrying workers in various routines in
>> mod_proxy_balancer.c, those workers' lbstatus is incremented.  If the
>> retry fails, however, the lbstatus is never reset.  This issue also
>> leads to an lbstatus that increases without bound.  Just because a
>> worker was dead for 8 hours does not mean it can handle all the
>> workload now.  It needs to start fresh -- not 8 hours in the hole.  This
>> issue also creates an unduly huge impact when doing
>>
>> mycandidate->s->lbstatus -= total_factor;
>>
> Actually I'm off base here.  total_factor places undue emphasis on any
> worker that satisfies a request when multiple dead workers are retried. 
> For instance, if there are 7 dead workers, all being retried, 2 healthy
> workers, and all with an lbfactor of 1, the worker that gets the request
> gets its lbstatus decremented by 9, whereas it really should only be
> decremented by 2 -- else the weighting gets thrown way off.  However, it
> is /not/ thrown off more due to the huge lbstatus values that build up
> in dead workers.  That only becomes an issue when dead workers come back to life.
>>
>> We're seeing the load balancing be thrown dramatically off in this case.
>>
>> Does anyone have suggestions for how this should be fixed?  If not,
>> again I can take a swing at this, e.g. resetting lbstatus to 0 in
>> ap_proxy_retry_worker().
>>
>> It *seems* like both of these issues center on the handling of dead
>> workers, especially having multiple dead workers and/or workers that
>> are dead for long periods of time.
>>
>> I've not yet checked whether mod_jk (where I believe these basic
>> algorithms came from) has similar issues.

The same type of balancing decision algorithm was part of mod_jk between
1.2.7 and 1.2.15. I always had trouble understanding exactly how it
behaves when some workers are out of order. The algorithm is
interesting, but I found it very hard to model its mathematics in
formulas.

We finally decided to switch to something else. For request-, traffic-
or session-based balancing we count items (requests, bytes or new
sessions) and divide the counters by two once a minute. That way load
that happened in the past counts less.

Furthermore, a worker that was dead or deactivated for some time gets
the biggest current load number when being reactivated, so that it
starts as smoothly as possible.
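
To make that a bit more concrete, here is a stripped-down sketch of the
scheme (illustrative only -- not the actual mod_jk code, and the
per-worker lbfactor weighting is left out):

/* Illustrative sketch of the scheme described above; "value" counts
 * requests, bytes or new sessions, depending on the balancing method. */
typedef struct {
    unsigned long value;   /* decayed load counter */
    int usable;            /* 0 while in error or disabled */
} lb_worker;

/* Run roughly once a minute: halve every counter, so load from the
 * past counts less and less. */
static void lb_age(lb_worker *workers, int n)
{
    int i;
    for (i = 0; i < n; i++)
        workers[i].value /= 2;
}

/* When a worker comes back, give it the largest current counter so it
 * is not flooded with requests right away. */
static void lb_reactivate(lb_worker *workers, int n, int idx)
{
    unsigned long max = 0;
    int i;
    for (i = 0; i < n; i++)
        if (workers[i].usable && workers[i].value > max)
            max = workers[i].value;
    workers[idx].value = max;
    workers[idx].usable = 1;
}

/* Per request: pick the usable worker with the smallest counter and
 * charge it one item. */
static int lb_pick(lb_worker *workers, int n)
{
    int i, best = -1;
    for (i = 0; i < n; i++) {
        if (!workers[i].usable)
            continue;
        if (best < 0 || workers[i].value < workers[best].value)
            best = i;
    }
    if (best >= 0)
        workers[best].value++;
    return best;
}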

I expect porting this to mod_proxy in trunk will be easy, but I'm not
sure what experience others have with the fairness of the balancing
when you add dynamics to the workers (errors and administrative downtimes).

Regards,

Rainer



Re: mod_proxy/mod_proxy_balancer bug

2009-04-14 Thread Jess Holle

Jess Holle wrote:
proxy_handler() calls ap_proxy_pre_request() inside a do loop over 
balanced workers.


This in turn calls proxy_balancer_pre_request() which does

(*worker)->s->busy++.

Correspondingly proxy_balancer_post_request() does:

if (worker && worker->s->busy)
    worker->s->busy--;

Unfortunately, proxy_handler() only calls proxy_run_post_request(), and 
thus proxy_balancer_post_request(), outside the do loop.  Thus the 
"busy" count of workers that currently cannot take requests (e.g. 
workers that are currently dead) increases without bound due to 
retries -- and is never reset.


Does anyone (i.e. someone more familiar with this code) have 
suggestions for how this should be fixed?  If not, I can take a swing 
at it.
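
To show the asymmetry I mean, here is a simplified, hypothetical sketch
(pick_worker() and forward_request() are made-up stand-ins for
ap_proxy_pre_request() and the scheme handler; this is not the actual
proxy_handler() code):

typedef struct { int busy; } worker_shared;
typedef struct { worker_shared *s; } proxy_worker_t;

static proxy_worker_t *pick_worker(void);       /* stand-in for ap_proxy_pre_request() */
static int forward_request(proxy_worker_t *w);  /* stand-in for the scheme handler */

static int handle_with_retries(int max_tries)
{
    proxy_worker_t *worker = NULL;
    int status = -1;

    while (max_tries-- > 0) {
        worker = pick_worker();
        if (!worker)
            break;
        worker->s->busy++;           /* what proxy_balancer_pre_request() does */
        status = forward_request(worker);
        if (status == 0)
            break;                   /* success: the single busy-- happens below */
        if (worker->s->busy)
            worker->s->busy--;       /* failed try: release this worker before retrying */
        worker = NULL;
    }

    if (worker && worker->s->busy)
        worker->s->busy--;           /* what proxy_balancer_post_request() does */
    return status;
}

Today only the final decrement exists, so every abandoned retry leaks
one "busy" on the worker it tried.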


Similarly, when retrying workers in various routines in 
mod_proxy_balancer.c, those workers' lbstatus is incremented.  If the 
retry fails, however, the lbstatus is never reset.  This issue also 
leads to an lbstatus that increases without bound.  Just because a 
worker was dead for 8 hours does not mean it can handle all the 
workload now.  It needs to start fresh -- not 8 hours in the hole.  This 
issue also creates an unduly huge impact when doing


mycandidate->s->lbstatus -= total_factor;

Actually I'm off base here.  total_factor places undue emphasis on any 
worker that satisfies a request when multiple dead workers are retried.  
For instance, if there are 7 dead workers, all being retried, 2 healthy 
workers, and all with an lbfactor of 1, the worker that gets the request 
gets its lbstatus decremented by 9, whereas it really should only be 
decremented by 2 -- else the weighting gets thrown way off.  However, it 
is /not/ thrown off more due to the huge lbstatus values that build up 
in dead workers.  That only becomes an issue when dead workers come back to life.
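
To make the arithmetic concrete, here is roughly what I believe the
byrequests selection does (a simplified sketch, not a copy of
mod_proxy_balancer.c; dead workers whose retry interval has elapsed are
treated as usable again):

typedef struct { int lbstatus, lbfactor, in_error; } worker_score;
typedef struct { worker_score *s; } worker_t;

static worker_t *find_best_byrequests_sketch(worker_t *workers, int n)
{
    int i, total_factor = 0;
    worker_t *mycandidate = NULL;

    for (i = 0; i < n; i++) {
        if (workers[i].s->in_error)      /* a retried "dead" worker has this cleared */
            continue;
        workers[i].s->lbstatus += workers[i].s->lbfactor;
        total_factor += workers[i].s->lbfactor;
        if (!mycandidate || workers[i].s->lbstatus > mycandidate->s->lbstatus)
            mycandidate = &workers[i];
    }
    if (mycandidate)
        mycandidate->s->lbstatus -= total_factor;   /* -9 in the 7+2 example, not -2 */
    return mycandidate;
}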


We're seeing the load balancing be thrown dramatically off in this case.

Does anyone have suggestions for how this should be fixed?  If not, 
again I can take a swing at this, e.g. resetting lbstatus to 0 in 
ap_proxy_retry_worker().
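
Roughly what I have in mind (a sketch only, with made-up field and
helper names -- the real ap_proxy_retry_worker() in proxy_util.c has a
different signature and more checks; the one substantive change is the
lbstatus reset):

typedef struct {
    int in_error;        /* stands in for the worker's error-state flag */
    long error_time;     /* when the worker was marked dead */
    long retry;          /* configured retry interval */
    int lbstatus;
} worker_state;

static int retry_worker_sketch(worker_state *w, long now)
{
    if (!w->in_error || now - w->error_time < w->retry)
        return 0;
    w->in_error = 0;     /* existing behaviour: put it back into the rotation */
    w->lbstatus = 0;     /* proposed: start fresh, not "8 hours in the hole" */
    return 1;
}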


It *seems* like both of these issues center on the handling of dead 
workers, especially having multiple dead workers and/or workers that 
are dead for long periods of time.


I've not yet checked whether mod_jk (where I believe these basic 
algorithms came from) has similar issues.


--
Jess Holle