Hi,

Our current setup is Tomcat 9.0.8 running on SUSE Linux Enterprise. The server hosts a dozen web applications built with Struts 1.3.8, with some newer Spring applications on the horizon. There is a large user base, and some applications see heavy usage. The applications currently use Java 1.7 and 1.8.
We were originally running Tomcat 7.x but had issues with PermGen filling up very quickly. The cause was never fully identified, but it was possibly related to a buggy third-party "enterprise-grade" Java reporting library, and we had to restart the server nightly to keep PermGen from maxing out. Part of the reason was that this library spawned immortal threads that prevented an application from being unloaded and garbage collected when a newer build of it was deployed (the developers behind it never expected the library to be run on a server with multiple applications...). So we upgraded Tomcat to 8.5.x first and then, recently, to 9.x. This fixed the PermGen issue.

Our current issue is that, for some unknown reason and after seemingly random lengths of time, an application gets into a state in which page loads fail or pages do not load correctly. According to the network tab in Chrome's developer console, a random set of static resources (JavaScript, CSS, images) return 500 errors and are not served. Whether the page loads depends on exactly which resources were not returned. Every time you access any page in that application, a different random set of resources returns 500 errors. Nothing in any of Tomcat's log files indicates that an application is in this state. The application stays in this unusable state until it or the server is restarted.

We have once again resorted to scheduling a nightly server restart. This has cut down on how often the problem occurs, which hints that it is related to usage, but it still happens once a week and sometimes more. The applications that experience it most are, I believe, the more heavily used ones.
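The thread leak described above, where a library's threads outlive a redeploy and keep the old application's class loader reachable, can be made visible from inside the JVM. The following is only a minimal sketch under stated assumptions: the class `ThreadInventory` and its methods are hypothetical names, it uses nothing but the standard `java.lang.Thread` API, and it assumes it can be run in the same JVM as Tomcat (for example, called from a temporary diagnostic servlet). It groups live threads by their context class loader. It does not diagnose the 500 errors themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper; not part of Tomcat or Struts.
public class ThreadInventory {

    /** Groups the names of all live threads by their context class loader. */
    public static Map<String, List<String>> threadsByContextLoader() {
        Map<String, List<String>> byLoader = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            ClassLoader cl = t.getContextClassLoader();
            // Identify a loader by class name plus identity hash so that two
            // instances of the same loader class are listed separately.
            String key = (cl == null)
                    ? "<no context class loader>"
                    : cl.getClass().getName() + "@"
                        + Integer.toHexString(System.identityHashCode(cl));
            String entry = t.getName() + (t.isDaemon() ? " (daemon)" : " (non-daemon)");
            byLoader.computeIfAbsent(key, k -> new ArrayList<>()).add(entry);
        }
        return byLoader;
    }

    /** Renders the grouping as plain text, one loader per section. */
    public static String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : threadsByContextLoader().entrySet()) {
            sb.append(e.getKey()).append(" : ")
              .append(e.getValue().size()).append(" thread(s)\n");
            for (String name : e.getValue()) {
                sb.append("    ").append(name).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

If, after a redeploy, the report still lists threads under a web application class loader instance that belongs to the old deployment, that would be consistent with the library's threads pinning the old application in memory.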
No Spring application has experienced this issue on our other servers, which leads me to tentatively say that Spring is not affected and/or is not a cause of the issue, but upgrading all applications to Spring is not feasible at the moment. We tried upgrading Struts to 1.3.10 in the most frequently affected applications, but it did not solve the issue and actually introduced another one stemming from a bug in that Struts version, so we had to go back to 1.3.8.

I spoke with a couple of people in Tomcat's IRC channel, and they seemed to think it was a third-party library or a problem/race condition between the Struts and Tomcat servlets. While this may be important information, I have no idea what to do with it. I'm not sure debugging is a possibility because it's a remote server, and I wouldn't even know what to look for. I also can't allow a production application to remain in this state for very long. I can't file a bug report because I can't reproduce the problem at will and I am unable to provide thread or heap dumps.

I have a suspicion it may be caused by that third-party library, although I don't see how that library would affect Tomcat's serving of static resources. This issue has never happened on our test server or our local instances of Tomcat. Since I suspect it's related to usage, this is not surprising.

Any help would be greatly appreciated.

Chris