
At this moment I believe we should add flags and stop using the '0' value
in the config file.

Internally ( in the code ) - it doesn't matter, we can keep 0 or
use the flag ( I prefer the second ).
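
To make the idea concrete, here is a minimal sketch of selection driven
by an explicit flag instead of a special lbfactor of 0. Everything in it
is an assumption for illustration: worker_rec, lb_rec, is_local,
reject_if_local_down and select_worker are hypothetical names, not the
actual jk_lb_worker.c structures or functions. It also assumes that a
request whose session owner is down is handled like a request without a
session.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration; not the real jk structs. */
typedef struct {
    const char *name;      /* doubles as the session route */
    int         lb_factor; /* plain integer weight; 0 has no special meaning */
    int         lb_value;  /* running load counter, lower is preferred */
    int         is_local;  /* explicit flag replacing the special value 0 */
    int         in_error;  /* non-zero while the worker is marked failed */
} worker_rec;

typedef struct {
    worker_rec *workers;
    size_t      num_workers;
    int         reject_if_local_down; /* hypothetical: reject, don't fall back */
} lb_rec;

/* Return the healthy worker with the lowest lb_value, or NULL if none. */
static worker_rec *least_loaded(lb_rec *lb, int local_only)
{
    worker_rec *best = NULL;
    size_t i;

    for (i = 0; i < lb->num_workers; i++) {
        worker_rec *w = &lb->workers[i];
        if (w->in_error || (local_only && !w->is_local))
            continue;
        if (best == NULL || w->lb_value < best->lb_value)
            best = w;
    }
    return best;
}

/* Pick a worker; session_route is NULL for requests without a session. */
worker_rec *select_worker(lb_rec *lb, const char *session_route)
{
    worker_rec *w;
    size_t i;
    int has_local = 0;

    assert(lb != NULL);

    /* 1. Sticky session: use the owning worker if it is healthy. */
    if (session_route != NULL) {
        for (i = 0; i < lb->num_workers; i++) {
            w = &lb->workers[i];
            if (!w->in_error && strcmp(w->name, session_route) == 0)
                return w;
        }
    }

    /* 2. Otherwise prefer a healthy worker flagged as local. */
    w = least_loaded(lb, 1);
    if (w != NULL)
        return w;

    /* 3. Local workers are configured but all down: optionally reject. */
    for (i = 0; i < lb->num_workers; i++)
        if (lb->workers[i].is_local)
            has_local = 1;
    if (has_local && lb->reject_if_local_down)
        return NULL;

    /* 4. Fall back to any healthy worker. */
    return least_loaded(lb, 0);
}
```

A NULL return means the caller should fail the request, so that an
external load balancer can mark this Apache as down. Because is_local is
a per-worker flag, several local workers can be configured without
giving any lbfactor value a second meaning.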

I'm waiting for your patch - it seems there is another bug that must 
be fixed before we can tag - but I hope we can finish all changes in
the next few days.


On Mon, 6 May 2002, Bernd Koecke wrote:

> thanks for committing my patch :). After thinking about it, I found the same 
> problem as Mathias. It's a problem for my environment too. We have the same 
> problem with shutdown and recovery here. I'm in the process of looking into jk2. The 
> question for jk1 is: what do we want to do if the main worker fails because of an error?
> The normal intention of lb is to switch to another worker in such a case, 
> but for the special use of a main worker we don't want that (at least it is an 
> error in my environment here :) ). My suggestion is to add an additional flag to 
> the lb_worker struct where we hold the information that we have a main worker, 
> e.g. main_worker_mode. With this flag set, we send only requests with a session 
> id to one of the other workers. And we could change the behavior after an error 
> of another worker and check its state only if we get a request with its session 
> route. This would be easy if we put the main worker at the beginning of the 
> worker list and/or use the flag. But we need the flag if we want to use more than 
> one main worker.
> But what should happen if the main worker is in error state? In my patch some 
> weeks ago I added an additional flag which causes the module to reject a request 
> if it comes in without a session id and the main worker is down. If this flag 
> is not set, or is not set to reject, the module chooses one of the other workers. 
> For our environment here rejecting the request is ok, because if a request 
> without a session comes to a switched-off node, we have a problem with our 
> separate load balancer. This should never happen. We could make rejection 
> the default if we have a main worker, but with a separate flag it would be 
> more flexible.
> I will build a patch against cvs to make my intention clearer.
> Bernd
> > Hi Mathias,
> > 
> > I think we understand your use case, it is not very uncommon.
> > In fact, as I mentioned a few times, it is the 'main' use
> > case for Apache ( multi-process ) when using the JNI worker.
> > In this case Apache acts as a 'natural' load-balancer, with 
> > requests going to various processes ( more or less randomly ).
> > As in your case, requests without a session should always go
> > to the worker that is in the same process.
> > 
> > The main reason for using '0' for the "local" worker is that
> > in jk2 I want to switch from float to int - there is no reason
> > ( AFAIK ) to do all the float computation, even a short int
> > will be enough for the purpose of implementing a round-robin
> > with weights.
> > 
> > BTW, one extension I'm trying to make is support for multiple
> > local workers - I'm still thinking about how to do that. This will
> > cover the case of a few big boxes, each with several tomcat 
> > instances ( if you have many GB of RAM and many processors, sometimes
> > it is better to run more VMs instead of a single large process ). 
> > In this case you still want some remote tomcats, for failover,
> > but most load should go to the local workers.
> > 
> > For jk2 I already fixed the selection of the 'recovering' worker:
> > after the timeout the worker will go through normal selection instead
> > of being automatically chosen.
> > 
> > For jk1 - I'm waiting for patches :-) I wouldn't do a big change -
> > the current fix seemed like a good one. 
> > 
> > I agree that changing the meaning of 0 may be confusing ( is it
> > documented ? my says it should never be used ).
> > We can fix that by using an additional flag - and not using 
> > special values.
> > 
> > Another special note - jk2 will also support 'graceful shutdown',
> > which means your case ( replacing a webapp ) will be handled
> > in a different way. You should be able to add/remove workers
> > without restarting Apache ( and I hope mostly automated ). 
> > 
> > Let me know what you think - with patches if possible :-)
> > 
> > Costin
> > 
> > 
> >>The setup I use is the following, a load balancer (Alteon) is in front
> >>of several Apache servers, each hosted on a machine which also hosts a
> >>Tomcat.
> >>Let's call those Apache servers A1, A2 and A3 and the associated Tomcat
> >>servers T1, T2 and T3.
> >>
> >>I have been using Paul's patch which I modified so the lb_value field of
> >>fault tolerant workers would not be changed to a value other than INF.
> >>
> >>The basic setup is that Ai can talk to all Tj, but for requests not
> >>associated with a session, Ti will be used unless it is unavailable.
> >>Sessions belonging to Tk will be correctly routed. The load balancing
> >>worker definition is different for all three Ai, the lbfactor is set to
> >>0 for workers connecting to Tk for all k != i and set to 1.0 for the
> >>worker connecting to Ti.
> >>
> >>This setup allows sticky sessions independently of the Apache
> >>handling the request, which is a good thing since the Alteon cannot
> >>extract the ';jsessionid=.....' part from the URL in a way which allows
> >>the dispatching of the requests to the proper Ai (the cookie is dealt
> >>with correctly though).
> >>
> >>This works perfectly except when we roll out a new release of our
> >>webapps. In this case it would be ideal to be able to make the load
> >>balancer ignore one Apache server, deploy the new version of the webapp
> >>on this server, and switch this server back on and the other two off so
> >>the service interruption would be as short as possible for the
> >>customers. The immediate idea, if Ai/Ti is to be the first server to
> >>have the new webapp, is to stop Ti so Ai will not be selected by the
> >>load balancer. This does not work: with Paul's patch Ti is the
> >>preferred server, BUT if Ti fails then another Tk will be selected by Ai.
> >>Therefore the load balancer will never declare Ai failed (even though we
> >>managed to make it behave like this by specifying a test URL which
> >>includes a jvmroute to Ti, but this uses lots of slb groups on the
> >>Alteon) and it will continue to send requests to it.
> >>
> >>Bernd's patch allows Ai to reject requests if Ti is stopped, the load
> >>balancer will therefore quickly declare Ai inactive and will stop sending
> >>it requests, thus allowing to roll out the new webapp very easily, just
> >>set up the new webapp, restart Ti, restart Ai, and as soon as the load
> >>balancer sees Ai, shut down the other two Ak, the current sessions will
> >>still be routed to the old webapp, and the new sessions will see the new
> >>version. When there are no more sessions on the old version, shut down
> >>Tk (k != i) and deploy the new webapp.
> >>
> >>My remark concerning the possible selection of recovering workers prior
> >>to the local worker (one with lb_value set to 0) deals with the load
> >>balancer not being able in this case to declare Ai inactive.
> >>
> >>I hope I have been clear enough, and that everybody got the point, if
> >>not I'd be glad to explain more thoroughly.
> >>
> >>Mathias.
> >>
> >>Paul Frieden wrote:
> >>
> >>>Hello,
> >>>
> >>>I'm afraid that I am no longer subscribed to the devel list.  I would be
> >>>happy to add my advice for this issue, but I don't have time to keep up
> >>>with the entire devel list.  If there is anything I can do, please just
> >>>mail me directly.
> >>>
> >>>I chose to use the value 0 for a worker because it used the inverse of
> >>>the value specified.  The value 0 then resulted in essentially infinite
> >>>preference.  I used that approach purely because it was the smallest
> >>>change possible, and the least likely to change the expected behavior
> >>>for anybody else.  The path of least astonishment and whatnot.  I would
> >>>be concerned about changing the current behavior now, because people
> >>>probably want a drop in replacement.  If there is going to be a change
> >>>in the algorithm and behavior, a different approach may be better.
> >>>
> >>>I would also like to make a note of how we were using this code.  In our
> >>>environment, we have an external dedicated load balancer, and three web
> >>>servers.  The main problem that we ran into was with AOL users.  AOL
> >>>uses a proxy that randomizes the source IP of requests.  That means that
> >>>you can no longer count on the source IP to tell the load balancer which
> >>>server to send future requests to.  We used this code to allow sessions
> >>>that arrive on the wrong web server to be redirected to the tomcat on the
> >>>correct server.  This neatly side-steps the whole issue of changing IPs,
> >>>because apache is able to make the decision based on the session ID.
> >>>
> >>>The reliability issue was a nice side effect for us in that it caught a
> >>>failed server more quickly than the load balancer did, and prevented the
> >>>user from having a connection time out or seeing an error message.
> >>>
> >>>I hope this provides some insight into why I changed the code that I
> >>>did, and why that behavior worked well for us.
> >>>
> >>>Paul
> >>>
> >>>[EMAIL PROTECTED] wrote:
> >>>
> >>>
> >>>>Hi Mathias,
> >>>>
> >>>>I think it would be better to discuss this on tomcat-dev.
> >>>>
> >>>>The 'error' worker will not be chosen unless the
> >>>>timeout expires. When the timeout expires, we'll indeed
> >>>>select it ( in preference to the default ) - this is easy to fix
> >>>>if it creates problems, but I don't see why it would be a
> >>>>problem.
> >>>>
> >>>>If it is working, next request will be served normally by
> >>>>the default. If not, it'll go back to error state.
> >>>>
> >>>>In jk2 I removed that - error workers are no longer
> >>>>selected. But for jk1 I would rather leave the old
> >>>>behavior intact.
> >>>>
> >>>>Note that the reason for choosing 0 ( in jk2 ) as
> >>>>default is that I want to switch from float to ints,
> >>>>I'm not convinced floats are good for performance
> >>>>( or needed ).
> >>>>
> >>>>Again - I'm just learning and trying, if you have
> >>>>any ideas I would be happy to hear them, patches
> >>>>are more than welcome.
> >>>>
> >>>>Costin
> >>>>
> >>>>On Sat, 4 May 2002, Mathias Herberts wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>Hi, I just joined the Tomcat-dev list and saw your patch to
> >>>>>jk_lb_worker.c (making it version 1.9).
> >>>>>
> >>>>>If I understand your patch correctly, it offers the same behavior as Paul's
> >>>>>patch but with the opposite semantics for an lbfactor of 0.0 in the
> >>>>>worker's definition, i.e. a value of 0.0 now means ALWAYS USE THIS
> >>>>>FOR REQUESTS WITH NO SESSIONS. This seems fine to me.
> >>>>>
> >>>>>What disturbs me is what happens when one worker is in error
> >>>>>state and not yet recovering. In get_most_suitable_worker, such a
> >>>>>worker will be selected whatever its lb_value, meaning a recovering
> >>>>>worker will have priority over one with an lb_value of 0.0, and this
> >>>>>seems to break the behavior we had achieved with your patch.
> >>>>>
> >>>>>Did I miss something or is this really a problem?
> >>>>>
> >>>>>Mathias.
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> > 
> > 
