Re: mod_proxy/mod_proxy_balancer bug
On Apr 23, 2009, at 8:45 AM, Jim Jagielski wrote:
> +1... Maybe I'll branch off a 2.2-proxy branch as a sandbox to play around in... Then we can front-port to trunk and use the sandbox as the backport source :)

Just in case people didn't see it, I've created a branch from 2.2.x as a place for us to try these proxy improvements, including backports of some related trunk changes:

https://svn.apache.org/repos/asf/httpd/httpd/branches/httpd-2.2-proxy
Re: mod_proxy/mod_proxy_balancer bug
On Apr 22, 2009, at 5:16 AM, jean-frederic clere wrote:
> Rainer Jung wrote:
>> [...]
>> So that makes my answer to JFC partially obsolete. Sorry I read your post later.
>
> Yep, I have also experimented in this area... We need to do something.

+1... Maybe I'll branch off a 2.2-proxy branch as a sandbox to play around in... Then we can front-port to trunk and use the sandbox as the backport source :)
Re: mod_proxy/mod_proxy_balancer bug
Rainer Jung wrote:
> On 20.04.2009 15:57, Jim Jagielski wrote:
>> [...]
>
> So that makes my answer to JFC partially obsolete. Sorry I read your post later.

Yep, I have also experimented in this area... We need to do something.

Cheers,
Jean-Frederic
Re: mod_proxy/mod_proxy_balancer bug
On 20.04.2009 15:57, Jim Jagielski wrote:
> On Apr 17, 2009, at 4:28 PM, Rainer Jung wrote:
>> [...]
>
> I have some ideas on the "soft start" when an errored-out worker returns (or when a new worker is added *hint* *hint*) that I've been playing with. The main thing, for me at least, is low overhead, even if it means sacrificing accuracy to the nth decimal place... I used to think aging was not something we wanted to do in mod_proxy, but mostly that was based on complex aging and the overhead associated with it. But I have some ideas there as well.
>
> The main thing I've been working on is trying to do all these things in trunk in a way that is "easily" backportable to 2.2...

So that makes my answer to JFC partially obsolete. Sorry I read your post later.

Regards,
Rainer
Re: mod_proxy/mod_proxy_balancer bug
On Apr 17, 2009, at 4:28 PM, Rainer Jung wrote:
> The same type of balancing decision algorithm was part of mod_jk between 1.2.7 and 1.2.15. I always had problems understanding exactly how it behaves when some workers are out of order. The algorithm is interesting, but I found it very hard to model its mathematics into formulas.
>
> We finally decided to switch to something else. For request-, traffic-, or session-based balancing we count items (requests, bytes, or new sessions) and divide the counters by two once a minute. That way, load that happened in the past counts less.
>
> Furthermore, a worker that was dead or deactivated for some time gets the biggest current load number when being reactivated, so that it starts as smoothly as possible.
>
> I expect porting this to mod_proxy in trunk will be easy, but I'm not sure what experience others have with the fairness of balancing once you add dynamics to the workers (errors and administrative downtimes).

I have some ideas on the "soft start" when an errored-out worker returns (or when a new worker is added *hint* *hint*) that I've been playing with. The main thing, for me at least, is low overhead, even if it means sacrificing accuracy to the nth decimal place... I used to think aging was not something we wanted to do in mod_proxy, but mostly that was based on complex aging and the overhead associated with it. But I have some ideas there as well.

The main thing I've been working on is trying to do all these things in trunk in a way that is "easily" backportable to 2.2...
Re: mod_proxy/mod_proxy_balancer bug
Rainer Jung wrote:
> [...] I expect porting this to mod_proxy in trunk will be easy, but I'm not sure what experience others have with the fairness of balancing once you add dynamics to the workers (errors and administrative downtimes).

I'd be _very_ interested in such a port to mod_proxy_balancer -- in 2.2.x in my case. Any help/pointers/assistance would be appreciated. I could apply such a change as a patch to just my version, but I'd be a lot more interested in getting 2.2.x as a whole to a better place rather than maintaining my own fork of things.

I get a strong impression that others haven't really pushed mod_proxy_balancer in this area. Overall, solid mod_proxy_balancer functionality obviously benefits more than just AJP, and I like the idea of mod_proxy_ajp. That said, if mod_jk moves ahead and mod_proxy_ajp becomes a backwater, at some point I'll need to move back to mod_jk, though I'd really want the ability to gracefully throttle requests in mod_jk first. [When mod_jk runs out of connections it returns a 503; mod_proxy can queue up requests instead.]

-- Jess Holle
Re: mod_proxy/mod_proxy_balancer bug
On 14.04.2009 23:23, Jess Holle wrote:
> [...]
> I've not yet checked whether mod_jk (where I believe these basic algorithms came from) has similar issues.

The same type of balancing decision algorithm was part of mod_jk between 1.2.7 and 1.2.15. I always had problems understanding exactly how it behaves when some workers are out of order. The algorithm is interesting, but I found it very hard to model its mathematics into formulas.

We finally decided to switch to something else. For request-, traffic-, or session-based balancing we count items (requests, bytes, or new sessions) and divide the counters by two once a minute. That way, load that happened in the past counts less.

Furthermore, a worker that was dead or deactivated for some time gets the biggest current load number when being reactivated, so that it starts as smoothly as possible.

I expect porting this to mod_proxy in trunk will be easy, but I'm not sure what experience others have with the fairness of balancing once you add dynamics to the workers (errors and administrative downtimes).

Regards,
Rainer
Re: mod_proxy/mod_proxy_balancer bug
Jess Holle wrote:
> proxy_handler() calls ap_proxy_pre_request() inside a do loop over balanced workers. This in turn calls proxy_balancer_pre_request(), which does (*worker)->s->busy++. Correspondingly, proxy_balancer_post_request() does:
>
>     if (worker && worker->s->busy)
>         worker->s->busy--;
>
> Unfortunately, proxy_handler() only calls proxy_run_post_request() -- and thus proxy_balancer_post_request() -- outside the do loop. Thus the "busy" count of workers which currently cannot take requests (e.g. that are currently dead) increases without bound due to retries -- and is never reset. Does anyone (i.e. who is more familiar with this code) have suggestions for how this should be fixed? If not, I can take a swing at it.
>
> Similarly, when retrying workers in various routines in mod_proxy_balancer.c, those workers' lbstatus is incremented. If the retry fails, however, the lbstatus is never reset. This issue also leads to an lbstatus that increases without bound. Just because a worker was dead for 8 hours does not mean it can handle all the work load now. It needs to start fresh -- not 8 hours in the hole. This issue also creates an unduly huge impact when doing
>
>     mycandidate->s->lbstatus -= total_factor;

Actually, I'm off base here. total_factor places undue emphasis on any worker that satisfies a request when multiple dead workers are retried. For instance, if there are 7 dead workers (all being retried) and 2 healthy workers, all with an lbfactor of 1, the worker that gets the request has its lbstatus decremented by 9, whereas it really should only be decremented by 2 -- else the weighting gets thrown way off. However, it is _not_ thrown off more by the huge lbstatus values that build up in dead workers; that only becomes an issue when dead workers come back to life.

> We're seeing the load balancing thrown dramatically off in this case.
>
> Does anyone have suggestions for how this should be fixed? If not, again I can take a swing at this, e.g. resetting lbstatus to 0 in ap_proxy_retry_worker().
>
> It *seems* like both of these issues center on handling of dead workers, especially having multiple dead workers and/or workers that are dead for long periods of time.
>
> I've not yet checked whether mod_jk (where I believe these basic algorithms came from) has similar issues.

-- Jess Holle