> > On 26/10/2020 10:26, Mark Thomas wrote:
> > > On 24/10/2020 01:32, Eric Robinson wrote:
> > >
> > > <snip/>
> > >
> > >>>> -----Original Message-----
> > >>>> From: Mark Thomas <ma...@apache.org>
> > >
> > > <snip/>
> > >
> > >>>> The failed request:
> > >>>> - Completes in ~6ms
> > >>>
> > >>> I think we've seen the failed requests take as much as 50ms.
> > >
> > > Ack. That is still orders of magnitude smaller than the timeout and
> > > consistent with the generation time of some of the larger responses.
> > >
> > > I wouldn't say it confirms any of my previous conclusions, but it
> > > doesn't invalidate them either.
> > >
> > >>>> Follow-up questions:
> > >>>> - JVM
> > >>>>   - Vendor?
> > >>>>   - OS package or direct from Vendor?
> > >>>>
> > >>>
> > >>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.
> > >
> > > OK. That is post Java 8u202 so it should be a paid-for, commercially
> > > supported version of Java 8.
> > >
> > > The latest Java 8 release from Oracle is 8u271.
> > >
> > > The latest Java 8 release from AdoptOpenJDK is 8u272.
> > >
> > > I don't think we are quite at that point yet, but what is your view
> > > on updating to the latest Java 8 JDK (from either Oracle or
> > > AdoptOpenJDK)?
> > >
> > >>>> - Tomcat
> > >>>>   - OS package, 3rd-party package or direct from ASF?
> > >>>>
> > >>>
> > >>> tomcat.noarch  7.0.76-6.el7 from CentOS base repository
> > >>>
> > >>
> > >> Drat, slight correction. I now recall that although we initially
> > >> installed 7.0.76 from the CentOS repo, the application vendor made us
> > >> lower the version to 7.0.72, and I DO NOT know where we got that.
> > >> However, it has not changed since October-ish of 2018.
> > >
> > > I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
> > > relevant changes.
> > >
> > >>>> - Config
> > >>>>   - Any changes at all around the time the problems started? I'm
> > >>>>     thinking OS updates, VM restarted etc?
> > >>>>
> > >>>
> > >>> server.xml has not changed since 4/20/20, which was well before
> > >>> the problem manifested, and all the other files in the conf folder
> > >>> are even older than that. We're seeing this symptom on both
> > >>> production servers. One of them was rebooted a week ago, but the
> > >>> other has been up continuously for
> > >>> 258 days.
> > >
> > > OK. That rules a few things out, which is good, but it does make the
> > > trigger for this issue even more mysterious.
> > >
> > > Any changes in the Nginx configuration in the relevant timescale?
> > >
>
> The last change to the nginx config files was on 8/21. The first report of
> problems from the users in question was on 9/16. There is another set of
> users on a different Tomcat instance who reported issues around 8/26, 5
> days after the nginx config change, so it seems unlikely to be related.
> Also, I can't imagine what nginx could be sending that would induce the
> upstream Tomcat to behave this way.
>
> > > Any updates to the application in the relevant timescale?
> > >
>
> Their application was patched to a newer version on 6/5.
>
> > > Any features users started using that hadn't been used before in
> > > that timescale?
>
> That one I couldn't answer, as we are only the hosting facility and we are not
> in the loop when it comes to the users' workflow, but it seems unlikely given
> the nature of their business.
>
> > >
> > > <snip/>
> > >
> > >>>> Recommendations:
> > >>>> - Switch back to the BIO connector if you haven't already. It has fewer
> > >>>>   moving parts than NIO so it is simpler to debug.
> > >>>> - Add "%b" to the access log pattern for Tomcat's access log valve to
> > >>>>   record the number of body (excluding headers) bytes Tomcat believes
> > >>>>   it has written to the response.
> > >>>>
> > >>>>
> > >>>> Next steps:
> > >>>> - Wait for the issue to re-occur after the recommended changes above
> > >>>>   and depending on what is recorded in the access log for %b for a
> > >>>>   failed request, shift the focus accordingly.
> > >>>> - Answers to the additional questions would be nice but the access log
> > >>>>   %b value for a failed request is the key piece of information
> > >>>>   required at this point.
> > >>>>
> > >>>
> > >>> Good news! I enabled that parameter a few days ago and we have
> > >>> already caught some instances of the problem occurring.
> > >
> > > Excellent!
> > >
> > >>> Here is the logging format...
> > >>>
> > >>> <Valve className="org.apache.catalina.valves.AccessLogValve"
> > >>>        directory="logs" prefix="localhost_access." suffix=".log"
> > >>>        pattern="%h %l %D %u %t %{JSESSIONID}c %{cookie}i %r %s %b %S %q" />
> > >>>
> > >>> Due to some sensitive content in the HTTP requests below, I have
> > >>> globally replaced certain words and addresses with random-ish
> > >>> strings, but I don't think I've done anything to impact the issue.
> > >>>
> > >>> Following is an example from Wednesday.
> > >>>
> > >>> This is a request being sent from the nginx proxy to the first of
> > >>> 2 upstream servers, 10.51.14.46
> > >>>
> > >>> 2020/10/21 15:51:22 [error] 39268#39268: *842342531 upstream prematurely closed connection while reading response header from upstream, client: 99.88.77.66, server: redacted.domain.com, request: "GET /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846&timestamp=21102020155122.472656 HTTP/1.1", upstream: "http://10.51.14.46:3016/sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846&timestamp=21102020155122.472656", host: "redacted.domain.com"
> > >>>
> > >>> Here is the matching localhost_access log entry from that server….
> > >>>
> > >>> 10.51.14.133 - 144 - [21/Oct/2020:15:51:22 -0400] F405E25E49E3DCB81A36A87DED1FE573 JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573; srv_id=dea8d61a7d725e980a6093cb78d8ec73; JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573; srv_id=dea8d61a7d725e980a6093cb78d8ec73 GET /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846&timestamp=21102020155122.472656 HTTP/1.0 200 40423 F405E25E49E3DCB81A36A87DED1FE573 ?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846&timestamp=21102020155122.472656
> > >>>
> > >>> Tomcat appears to think it sent 40423 bytes. However, even though
> > >>> it shows an HTTP 200 response, Wireshark shows the response did
> > >>> not actually go out on the wire.
> > >
> > > That is very interesting. That narrows down where things might be
> > > going wrong.
> > >
> > > It confirms that the application is generating the response so we
> > > can rule out a bug in that process.
> > >
> > > By default there is some buffering here. The buffer size is 9000 bytes.
> > > So with a response of 40423 bytes I'd expect at least some of those
> > > bytes to be written to the network. However, Wireshark is reporting no
> > > bytes written.
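
For anyone following the thread, here is a minimal sketch of the buffering
behaviour being described. It is a hypothetical servlet (not the vendor's
application) showing that writes smaller than the response buffer stay in
memory and only become observable on the wire once the response is flushed,
the buffer overflows, or the request completes:

import java.io.IOException;
import java.io.PrintWriter;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet used only to illustrate response buffering.
@WebServlet("/buffer-demo")
public class BufferDemoServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("text/plain");
        PrintWriter out = resp.getWriter();

        // Writes below the buffer size are held in memory; nothing has been
        // sent to the socket yet, so isCommitted() is still false.
        out.println("Buffer size: " + resp.getBufferSize());
        out.println("Committed before flush: " + resp.isCommitted());

        // Flushing (or overflowing the buffer, or finishing the request)
        // commits the response: status line, headers and buffered body bytes
        // are written to the network at this point.
        resp.flushBuffer();
        out.println("Committed after flush: " + resp.isCommitted());
    }
}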
> > >
> > >
> > > At the moment, I see the following options:
> > >
> > > 1. Tomcat bug causing the writes to be made to the wrong response.
> > >
> > > 2. Application bug (retaining references to request/response/writer
> > >    etc.) causing the writes to be made to the wrong response.
> > >
> > > 3. JVM bug causing network writes to be dropped.
> > >
> > > 4. OS bug causing network writes to be dropped before they reach the
> > >    point where they are observable by Wireshark.
> > >
> > > To be frank, none of those options seem particularly likely.
> > >
> > > At this point I'd echo Konstantin's recommendation to add the
> > > following system property:
> > > org.apache.catalina.connector.RECYCLE_FACADES=true
> > >
> > > You'd normally do this in $CATALINA_HOME/bin/setenv.sh (creating
> > > that file if necessary) with a line like:
> > >
> > > CATALINA_OPTS="$CATALINA_OPTS
> > > -Dorg.apache.catalina.connector.RECYCLE_FACADES=true"
> > >
> > > You can confirm that the setting has been applied by looking in the
> > > log for the start-up. You should see something like:
> > >
> > > Oct 26, 2020 10:18:45 AM
> > > org.apache.catalina.startup.VersionLoggerListener log
> > > INFO: Command line argument:
> > > -Dorg.apache.catalina.connector.RECYCLE_FACADES=true
> > >
> > >
> > > That option reduces the re-use of request, response and related
> > > objects between requests and, if an application is retaining
> > > references it shouldn't, you usually see a bunch of
> > > NullPointerExceptions in the logs when the application tries to
> > > re-use those objects.
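
To make that concrete, here is a hedged, hypothetical sketch of the kind of
retained-reference bug that setting tends to expose (names are invented;
there is no evidence the vendor's application actually does this):

import java.io.IOException;
import java.io.PrintWriter;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LeakyServlet extends HttpServlet {

    // BUG: request/response objects and their writers belong to the
    // container and must not outlive the request that owns them.
    private static volatile PrintWriter lastWriter;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {

        if (lastWriter != null) {
            // With default settings this stale write can land in whichever
            // request is currently re-using the recycled objects. With
            // -Dorg.apache.catalina.connector.RECYCLE_FACADES=true it is far
            // more likely to fail fast with a NullPointerException instead.
            lastWriter.println("stale write from an earlier request");
        }

        resp.setContentType("text/plain");
        PrintWriter out = resp.getWriter();
        out.println("hello");

        // BUG: the writer is retained for use after this request completes.
        lastWriter = out;
    }
}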
> > >
> > > Meanwhile, I'm going to work on a custom patch for 7.0.72 to add
> > > some additional logging around the network writes.
> >
> > Patch files and instructions for use:
> >
> > http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v1/
> >
> > Mark
>
> Hi Mark,
>
> A couple of questions.
>
> 1. Now that you have provided this patch, should I still enable
> RECYCLE_FACADES=true?
>
> 2. The servers in question are multi-tenanted. There are 17 instances of
> Tomcat, each running on a different set of ports and out of a separate
> directory tree, but they are all started with essentially the same init script,
> which exports certain site-specific path variables and executes Tomcat from
> the specified folder structure. Can you think of any potential issues where
> making this change for one instance could have a negative effect on any of
> the other instances? Probably not, but just being careful. I will have these
> changes implemented during our nightly maintenance window and will begin
> to gather the relevant logs first thing tomorrow!
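
For reference, my understanding is that the property is scoped to the JVM it
is set in, so as long as each instance has its own CATALINA_BASE, a setenv.sh
under that instance's bin directory should only affect that instance. A rough
sketch of what I mean, with made-up paths rather than our real layout:

# Illustrative only: paths are hypothetical, not our actual directory tree.
# catalina.sh reads $CATALINA_BASE/bin/setenv.sh before falling back to
# $CATALINA_HOME/bin/setenv.sh, so the option stays scoped to one instance.
CATALINA_BASE=/opt/tomcat-instances/site07

cat > "$CATALINA_BASE/bin/setenv.sh" <<'EOF'
CATALINA_OPTS="$CATALINA_OPTS -Dorg.apache.catalina.connector.RECYCLE_FACADES=true"
EOF
chmod +x "$CATALINA_BASE/bin/setenv.sh"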
>
> --Eric

Mark, the changes have been made per your instructions and Tomcat has been 
restarted. The debug0.log and debug0.log.lck files were created in the 
directory, but they are both currently 0 bytes.

--Eric



