Re: mod_jk error detection

Scott McClanahan Wed, 25 Jul 2007 14:07:07 -0700

On Wed, 2007-07-25 at 22:40 +0200, Rainer Jung wrote:
> Scott McClanahan wrote:
> > Thanks so much! I'd like to continue this thread a bit more because of
> > how helpful I think it will be for everyone using mod_jk.
> > 
> >> That one, reply_timeout, is not really meant for high speed detection. 
> >> Usually you've got an app that every now and then needs 10 or 20 seconds 
> >> for an answer and you don't like to disable a worker automatically 
> >> because of those rare events. So normally one sets reply_timeout to 1, 2 
> >> or 3 minutes.
> > 
> > I don't understand what besides a timed out CPING/CPONG message would
> > render a backend tomcat disabled, especially in a default config since
> > reply_timeout is 0.
> 
> Default config: no CPing/CPong. But: after some time the TCP stack will 
> give up, when there is a network problem, or the backend is no longer 
> listening. So this case will even be handled in a default config, but 
> depending on the exact network situation, the error detection might take 
> a long time.
> 
> In case your backend simply eats your requests, but doesn't produce 
> answers, you will very fast eat up all connections and threads and the 
> whole system will hang - without configured timeouts.


I see your point.  I was thinking only within the context of mod_jk.
That is, what in mod_jk, other than CPING/CPONG message failures, would
cause a worker to go into error state?  You answered that.
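
For anyone finding this in the archive, the timeouts discussed above could
be sketched in workers.properties roughly as follows.  The worker name
node1, the host name and all values are illustrative assumptions, not
recommendations.  As I read the reference documentation, connect_timeout
and prepost_timeout are the directives that switch on CPing/CPong, and all
three timeouts here are given in milliseconds.

```properties
# Hypothetical worker "node1"; host, port and values are placeholders.
worker.list=node1
worker.node1.type=ajp13
worker.node1.host=tomcat1.example.com
worker.node1.port=8009

# CPing/CPong after connecting and before forwarding each request
# (milliseconds). Neither is set in a default config, so there is no
# CPing/CPong unless you add them.
worker.node1.connect_timeout=10000
worker.node1.prepost_timeout=10000

# Maximum wait for reply data from the backend (milliseconds).
# The default is 0 (no timeout). Two minutes follows the
# "1, 2 or 3 minutes" suggestion quoted above, so that a rare slow
# answer does not fail the worker.
worker.node1.reply_timeout=120000
```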

> 
> BTW: there is also a non-default config to make a worker fail on several 
> received HTTP status codes, "fail_on_status".
> 
> >> We have to clearly distinguish between retries of a non-lb worker 
> >> and of a load balancer worker. A normal worker has a simple retry 
> >> procedure, independent of whether it is used directly or as part of 
> >> an lb. If it detects an error it uses another pool connection and by 
> >> default tries once more.
> > 
> > If that happens does the real worker officially change to an error state
> > which would subsequently kick off the retry logic of the load balancer
> > worker?
> 
> Without an lb a worker does not have an error state. It will be 
> continuously reused. Only an lb uses error states and temporarily 
> disables a failed worker. Even an lb will continuously reuse a worker, 
> if there is no other worker to failover.

I finally understand this bit too.  It was a really good idea to
have the CPING/CPONG message timeout checks before individual requests
get forwarded to avoid several different problem scenarios here.  Good
thinking.
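
A minimal sketch of the load balancer case, again only under assumptions:
the names lb, node1 and node2, the host names and the values are all made
up for illustration.  The idea is that error states and temporary
disabling only come into play once the members sit behind an lb worker,
and that fail_on_status is an opt-in on the members.

```properties
# Hypothetical names: lb, node1, node2. Only the lb is exposed.
worker.list=lb

worker.node1.type=ajp13
worker.node1.host=tomcat1.example.com
worker.node1.port=8009

worker.node2.type=ajp13
worker.node2.host=tomcat2.example.com
worker.node2.port=8009

worker.lb.type=lb
worker.lb.balance_workers=node1,node2

# Seconds a member stays in error state before the lb tries it again
# (60 is, I believe, the default).
worker.lb.recover_time=60

# Non-default: treat this HTTP status from the backend as a failure of
# the member. 503 is only an example value.
worker.node1.fail_on_status=503
worker.node2.fail_on_status=503
```

With only one member in balance_workers, the lb would keep reusing that
member even after errors, as described above, because there is nothing to
fail over to.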

> 
> >> The maintenance uses a real request and handles it as if the backend 
> >> had not failed. If you enabled CPing/CPong this means that it 
> >> would detect a still broken backend early and transparently send the 
> >> request to another member. Because no part of the request (the CPing 
> >> doesn't count) has been sent yet, the failover to another member 
> >> happens independently of recovery_options (i.e. even with 
> >> recovery_options 3).
> > 
> > Is the request used to test the health of the backend tomcat whichever
> > one comes first after a global maintenance run even if it has been
> > previously serviced by another healthy tomcat?  Is this request attempt
> > to a once errant worker only to test its healthiness and not to actually
> > have it fulfill the request?  I would hope it is only to test the health
> > of the backend tomcat and even if it is now willing to accept
> > connections, the request goes to whatever tomcat has been previously and
> > successfully responding to the session.
> 
> No, the first new request accepted by the web server and mapped to the 
> lb will be used (at least if it is free to be routed to any worker. If 
> the request belongs to a session located on another backend and the 
> default config with sticky sessions is active, it will of course be sent 
> to its correct backend). It is a real user request. If the backend 
> works, OK. If it doesn't accept the request, we can still send it to 
> some other worker. If the backend accepts the request, but processing 
> fails, depending on recovery_options the user gets an error.

Sounds great too.
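
To tie the recovery behaviour to directives, here is a small fragment.  It
assumes an lb worker named lb with members node1 and node2 is already
defined (all three names are hypothetical), and the meaning of the
recovery_options bits is my reading of the reference documentation, so
please check it against the version you run.

```properties
# Interval of the global maintenance run in seconds
# (60 is, I believe, the default).
worker.maintain=60

# Sticky sessions are on by default; shown explicitly here. A request
# that belongs to a session goes to that session's backend and is not
# used to probe a recovering member.
worker.lb.sticky_session=true

# Bitmask on the members:
#   1 = no failover once the backend has received the request
#   2 = no failover once headers have been sent to the client
#   3 = both
# A failed CPing happens before any part of the request is sent, so
# failover is still possible even with 3.
worker.node1.recovery_options=3
worker.node2.recovery_options=3
```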

> 
> >> If you like to improve the page about load balancing or the timeouts 
> >> page, or you want to add some parts about retries and recovery: 
> >> contributions are welcome.
> > 
> > After we are done discussing, I might have some recommendations.  Again,
> > you've been great.
> 
> Thanks. At least we improve the knowledge inside the mailing list archive.

One obvious thing that confuses me and could be changed is the "Advanced
worker directives" table.  It mixes directives for load balancer workers
and real workers, but only marks a directive when it applies to a load
balancer worker.  Does that mean the unmarked directives are usable for
both real workers and load balancer workers, for real workers only, or
that it varies by directive?

I believe I know the answer to that, but the table is somewhat misleading.

> 
> Regards,
> 
> Rainer
> 
> ---------------------------------------------------------------------
> To start a new topic, e-mail: [email protected]
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

