Hi Eric,
I have some follow-up questions in-line. I have also read the other
messages in this thread and added a couple of additional questions based
on what I read there.
On 26/05/2024 02:58, Eric Robinson wrote:
One of our hosting customers is a medical practice using a commercial EMR
running on tomcat+mysql. It has operated well for over a year, but users have
suddenly begun experiencing slowness for about an hour at the same time every
day.
What time does this problem start?
Does it occur every day of the week including weekends?
How does the slowness correlate to:
- request volume
- requests to any particular URL(s)?
- requests from any particular client IP?
- any other attribute of the request?
(I'm trying to see if there is something about the requests that
triggers the issue.)
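
If it helps with the correlation, here is a minimal sketch in Python that
groups access log entries by minute and by URL or client IP. It assumes the
AccessLogValve pattern is `%h %l %u %t "%r" %s %b %D` (with %D in
milliseconds, as in Tomcat 9). Your pattern may well differ, in which case
the regular expression needs adjusting. The function name is just
illustrative.

```python
import re
from collections import defaultdict
from datetime import datetime

# ASSUMPTION: AccessLogValve pattern is  %h %l %u %t "%r" %s %b %D
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ (?P<millis>\d+)$'
)


def summarize(lines, key="url"):
    """Group requests by minute and by `key` ('url' or 'ip').

    Returns {(minute, key_value): (request_count, mean_millis)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not fit the assumed pattern
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        minute = ts.strftime("%Y-%m-%d %H:%M")
        url = m.group("url").split("?", 1)[0]  # ignore the query string
        value = url if key == "url" else m.group("ip")
        bucket = totals[(minute, value)]
        bucket[0] += 1
        bucket[1] += int(m.group("millis"))
    return {k: (n, total / n) for k, (n, total) in totals.items()}
```

Comparing the output for the slow hour against the same hour on a normal
day should show whether a particular URL or client stands out.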
During the slow times, we've done all the usual troubleshooting to catch the
problem in the act. The servers have plenty of power and are not overworked.
There are no slow database queries. Network connectivity is solid. Tomcat has
plenty of memory. The numbers of database connections, threads, questions,
queries, etc., remain steady, without spikes. There is no unusual disk latency.
We have not found any maintenance tasks running during that timeframe.
I would usually suggest taking three thread dumps approximately 5s apart
and then diffing them to try and spot "slow moving" threads.
I see you have scripted a trigger for a thread dump when the slowness
hits. If you haven't already, please configure it to capture (at least)
3 dumps ~5 seconds apart.
(If we can spot the slow moving threads we might be able to identify
what it is that makes them slow moving.)
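
As a rough illustration of the diffing step, the Python sketch below
reports threads whose stack is identical in every dump. It assumes the
dumps are in the usual jstack / kill -3 text layout (a quoted thread name
on the header line, followed by "at ..." frames and a blank line), and
that thread names are unique within a dump. The function names are mine,
not part of any tool.

```python
def parse_dump(text):
    """Map each thread name to its stack frames (a tuple of 'at ...' lines).

    ASSUMPTION: jstack-style layout, thread names unique within one dump.
    """
    threads = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('"'):
            # Header such as: "http-nio-8080-exec-1" #25 daemon prio=5 ...
            end = line.find('"', 1)
            if end > 0:
                current = line[1:end]
                threads[current] = []
        elif line.startswith("at ") and current is not None:
            threads[current].append(line)
        elif not line:
            current = None  # a blank line ends the current thread's entry
    return {name: tuple(frames) for name, frames in threads.items()}


def unchanged_threads(dump_texts):
    """Return {thread_name: frames} for threads whose non-empty stack
    is the same in every dump."""
    parsed = [parse_dump(t) for t in dump_texts]
    result = {}
    for name, frames in parsed[0].items():
        if frames and all(p.get(name) == frames for p in parsed[1:]):
            result[name] = frames
    return result
```

Idle pool threads that are simply parked waiting for work will also show
up as unchanged, so those need to be filtered out by eye. The interesting
ones are request threads sitting in the same application or driver frame
across all three dumps.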
The customer has another load-balanced tomcat instance on a different physical
server, and the problem happens on that one, too. The servers were upgraded
with a new kernel and packages on 4/5/24, but the issue did not appear until
5/6/24. The vendor enabled a new feature in the customer's software, and the
problem appeared the next day, but they subsequently disabled the feature, and
(reportedly) the problem did not go away.
Have you confirmed that the feature really is disabled? Or was it just
hidden?
Has this feature been enabled for any other customers? If yes, have they
experienced similar issues?
(It is suspicious that the issue persisted after the feature was
disabled. I wonder if some elements of that change (e.g. a database
change) are still in place and causing issues.)
It is worth mentioning that the servers are multi-tenanted, with other
customers running the same medical application, but the others do not
experience the slowdowns, even though they are on the same servers.
How does this customer compare, in terms of volume of requests, to other
customers that are not experiencing this issue?
Is there anything unique or special about the customer experiencing the
issue? Do they have some custom settings no-one else uses?
(I am trying to figure out if the issue is load related, customer
specific or something else).
There are no unusual errors in the tomcat or database server logs, EXCEPT this
one: java.sql.DriverManager.getConnection
Can we see the full stack trace, please?
During the periods of slowness, we see lots of those errors along with a large
spike in the number of stuck tomcat threads (from 1 or 2 to as high as 100). It
seems obvious that the threads are stuck because tomcat is waiting on a
connection to the database. However, tcpdump shows that connectivity to the
database is perfect at the network and application layers. There are no
unanswered SYNs, no retransmissions, no half-open connections, no failures to
allocate TCP ports, no conntrack messages, and no other indications of system
resource exhaustion. Every time tomcat requests a connection to the DB, it
completes in less than 1 ms. Ten thousand connection attempts completed
successfully in about 15 seconds, with zero failures.
It sounds like things might be getting stuck somewhere in or near the
JDBC driver.
Can you provide the exact version of the JDBC driver you are using?
Can you provide the full database configuration from context.xml (or
wherever it is configured)? Please redact sensitive information such as
passwords.
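
In case it is useful, here is a small Python sketch for the redaction. It
assumes the secrets appear either as XML attributes whose name contains
"password" or as a password= parameter inside the JDBC URL. Other
sensitive values (user names, host names) would need adding to `names` or
removing by hand, so please check the result before posting.

```python
import re


def redact(xml_text, names=("password",)):
    """Replace sensitive attribute values and URL parameters with REDACTED.

    ASSUMPTION: secrets are attributes or URL parameters whose name
    contains one of `names` (case-insensitive).
    """
    alternation = "|".join(re.escape(n) for n in names)
    # Quoted attribute values, e.g. password="..." or connectionPassword='...'
    attr_re = re.compile(
        r'(%s)(\s*=\s*)(["\'])(.*?)\3' % alternation,
        re.IGNORECASE | re.DOTALL,
    )
    out = attr_re.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3) + "REDACTED" + m.group(3),
        xml_text,
    )
    # Unquoted parameters embedded in a URL, e.g. ...&amp;password=secret
    param_re = re.compile(r'(%s)=([^&"\';\s]+)' % alternation, re.IGNORECASE)
    return param_re.sub(lambda m: m.group(1) + "=REDACTED", out)
```

It works on the raw text rather than parsing the XML so that comments and
formatting in the file are preserved.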
We are forced to conclude that some database connection requests are being
initiated but are not being sent on the wire. The problem seems to be in the
interaction between tomcat and the database driver, or in the driver itself.
I agree.
Unfortunately, the application vendor is taking the "it's your infrastructure"
position without providing any evidence or offering suggestions for configuration changes,
I'm sorry to hear that. We'll do what we can to help.
other than to deploy more tomcat instances, which is just shooting in the dark.
They don't know why the software is throwing
java.sql.DriverManager.getConnection errors (even though it's their code), and
they've relegated the investigation to us.
I'd have to say that the evidence is pointing towards some sort of
application issue at this point. That said, just because the questions
are currently heading in that direction we aren't blind to the
possibility that the root cause might be in Tomcat. If the evidence
starts pointing that way then that is where we will look.
When we have answers to the questions above, we might have enough
evidence to start asking more pointed questions of the application vendor.
Any advice from the community would be greatly appreciated.
RHEL 8.9, kernel 4.18.0-513.18.1.el8_9.x86_64
Apache Tomcat/9.0.80, JVM 1.8.0_372-b07
Is that Tomcat 9.0.80 as provided by the ASF? If so, there are a number
of known security vulnerabilities you should be (and probably are) aware
of. There are steps you can take to mitigate those without an upgrade -
just wanted to make sure they are on your radar.
(The tomcat and JVM versions are the ones recommended by the vendor.)
We're standing by to provide whatever other information the community may need.
Finally, if you consider any of the debugging information too sensitive
to share on the public list, I am happy for you to send it directly to
me and I can share it with any interested Tomcat committers. If you do
need to do that, I'd encourage you to share a redacted version with the
list if you can. There are lots of very experienced folks on the users
list who can help who aren't Tomcat committers.
Mark
Thanks tons!
-Eric