Note: I kept dev@roller on CC, which was added by Dave.
Update inline.
On 15.07.2014 10:45, Rainer Jung wrote:
1) Trivial: Add response duration to Tomcat access log
Add "%D" at end of pattern string in server.xml access log configuration.
Done; also added "%I" (thread name) and "%S" (session id) to the access
log pattern in server.xml. The thread name is useful to correlate requests
with e.g. thread dumps and log statements (if the log config also logs the
thread name).
File committed to svn.
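For reference, with those additions the valve pattern would look roughly
like this (directory/prefix/suffix shown with typical defaults, the
committed server.xml may differ in detail):
<!-- illustrative values, only the pattern additions matter here -->
<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
       prefix="localhost_access_log." suffix=".txt"
       pattern="%h %l %u %t &quot;%r&quot; %s %b %D %I %S" />
%D is the request duration in milliseconds, %I the name of the processing
thread and %S the session id.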
2) Medium: Add GC log
To rule out GC problems (we don't expect them, but I would like to make
sure), add a couple of startup parameters that write sensible GC data to
a log.
Note that GC logs have no nice way of rotation. Even the new JVM flags
allowing GC log rotation will overwrite GC log files when the process
starts. So the theoretical risk is writing big log files, but since
roller is restarted relatively frequently, getting rid of the log by
removing the new startup parameters is easy.
Done by adding bin/setenv.sh to CATALINA_BASE, which is automatically
read by the start script catalina.sh, which in turn is used by the Ubuntu
platform Tomcat underlying our roller installation.
Config used:
CATALINA_OPTS="
-verbose:gc
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-Xloggc:$CATALINA_BASE/logs/gc.log_`date +%Y%m%d_%H%M%S`
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=50M"
Committed to svn.
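Once some data has accumulated, the largest stop-the-world pauses (in
seconds) can be eyeballed with something like the following one-liner
(illustration only, not committed anywhere):
# print the ten longest recorded safepoint pauses
grep -h "Total time for which application threads were stopped" \
  $CATALINA_BASE/logs/gc.log_* \
  | sed 's/.*stopped: \([0-9.]*\) seconds.*/\1/' | sort -rn | head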
3) Medium: Add analysis code
I'd like to deploy an additional app inside the roller Tomcat that
tracks long running requests and writes out stack information only for
the threads that have long running requests on them. The app is not open
source but my own IP, so there is no licensing issue here. The output is
written to a rotating log file that can then be investigated.
The app is packaged as a normal webapp (war) but it doesn't actually
provide a useful web interface. Packaging as a war is just an easy way
of getting it deployed inside Tomcat.
I have had that code running in important and high-load applications
for a few years, so I see no risk of it negatively influencing stability
or performance. It does not do any tracing; it only collects snapshots
of the threads that are working on long running requests, at regular
intervals such as once per minute.
Added as CATALINA_BASE/conf/perfdiag.properties and
CATALINA_BASE/webapps/kpdtperfdiag.war. Writes logs to
CATALINA_BASE/logs/*perfdiag* with daily rotation. But see below.
Config: check once every 30 seconds for requests that have been in
processing for longer than 10 seconds, and if any are found, write out
their current stacks.
Deliberately not added to svn.
4) Trivial: Turn off the https connector
The Tomcat https connector is currently configured only partially and
complains on startup. We don't currently use it, because the ssl
endpoint is on erebus-ssl, which forwards to the normal http port.
Done in server.xml, committed to svn.
5) Medium: Enabling Tomcat Manager Webapp with role manager-jmx
Enable the Tomcat Manager webapp with a user in the manager-jmx role
and a hashed password in tomcat-users.xml.
This gives me the opportunity to collect JMX information on the fly via
simple lightweight scripts (see the example below).
Done by
- adding the tomcat7-admin package to the system
- disabling the host-manager by renaming host-manager.xml (add trailing
underscore)
- committed manager.xml to svn
- added a user plus password for access to the manager
- changed realm config in server.xml to use hashed passwords
Committed to svn.
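For illustration, the kind of lightweight query this enables, assuming
the manager stays under its default /manager path (user, password and
port are placeholders):
# heap and non-heap memory usage
curl -s -u jmxuser:secret \
  'http://localhost:8080/manager/jmxproxy/?qry=java.lang:type=Memory'
# connector thread pool state
curl -s -u jmxuser:secret \
  'http://localhost:8080/manager/jmxproxy/?qry=Catalina:type=ThreadPool,*'
The output is plain text and easy to parse from small cron scripts.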
6) More complex: Caching
Since roller serves only a small amount of content, caching would be
technically cheap. Unfortunately the response cache headers currently do
not help caching. So if we want to go down this road, we would have to
configure or fix roller to set useful caching headers (or override them
in Tomcat). Plus we would need to set up the cache, be it inside the
Apache reverse proxy in front of roller or a separate component.
I'd like to execute quickly on 1-5 and investigate the collected data.
Note that the db connection pooling was recently turned off again,
because problems were observed.
Not done.
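Should we pick up the caching idea (6) later, one possible direction for
the header part would be to override the headers on the Apache reverse
proxy instead of inside roller, e.g. (hypothetical sketch, assuming
mod_headers is loaded there; path pattern and max-age purely for
illustration):
# illustrative only: mark static-ish resources cacheable for 5 minutes
<LocationMatch "\.(css|js|png|gif|jpe?g)$">
    Header set Cache-Control "public, max-age=300"
</LocationMatch>
The actual cache could then sit in the proxy or in a separate component,
as described above.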
My current impressions:
- After full startup the instance is performing well. There were no long
running requests (>10 seconds) during the first few hours except during
startup. As long as this is unchanged, the added perfdiag will not log
info about long running requests, because there are none.
- there were only very few requests above 2 seconds (a few per hour),
and not many above 1 second.
- request load is about 2-3 requests per second
- heap size is a bit small. Under current load GC still behaves
acceptably, but there is very little headroom. Unfortunately the heap
settings for this platform Tomcat are in the rc script. I changed the
value there from 128MB to 256MB for the next restart. That is still not
much, but will increase our headroom a lot.
- the current configuration does not use a DB pool. Instead whenever a
DB connection is needed it is created on the fly and then thrown away
after use. I'm astonished how well this works. It does produce about 500
db connections in TIME_WAIT state (after they have been successfully
closed), which is not a problem per se. It would become a problem if
this number goes above roughly 5000, because then some TCP actions can
get slow.
- I still think the config should switch to a db pool, but it is said
that problems occurred after that was done. For instance there was an
observation that after some time roller tried to open db connections to
a wrong host. I didn't have a chance to observe it myself.
- after I prepared the above changes, we actually had a situation where
our monitoring checks independently fired an alarm for roller. Humbedooh
did two thread dumps which are in the catalina.out file.
Those dumps show the following picture (db pool was off):
- 57 threads in the stack:
at com.mysql.jdbc.ResultSetImpl.getInstance(ResultSetImpl.java:355)
at com.mysql.jdbc.MysqlIO.buildResultSetWithUpdates(MysqlIO.java:2673)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2561)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1757)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2171)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2562)
...
- 23 threads in the stack:
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:410)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2581)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1757)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2171)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2562 or 2568)
- locked <0xHEXADDR> (a java.lang.Object)
...
- Another 16 threads in the stack:
at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:307)
at com.mysql.jdbc.ConnectionImpl.connectWithRetries(ConnectionImpl.java:2168)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2127)
- locked <0xHEXADDR> (a java.lang.Object)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:774)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:49)
...
- 12 threads in the stack:
at com.mysql.jdbc.ConnectionImpl.createStatement(ConnectionImpl.java:2429)
at com.mysql.jdbc.ConnectionImpl.createStatement(ConnectionImpl.java:2409)
at com.mysql.jdbc.ConnectionImpl.getMetadataSafeStatement(ConnectionImpl.java:3011)
at com.mysql.jdbc.ConnectionImpl.loadServerVariables(ConnectionImpl.java:3773)
at com.mysql.jdbc.ConnectionImpl.initializePropsFromServer(ConnectionImpl.java:3365)
at com.mysql.jdbc.ConnectionImpl.connectWithRetries(ConnectionImpl.java:2189)
...
and another 29 threads were in different phases of request handling, of
which 12 were again caught in various places inside the MySQL JDBC
driver. That sums up to a total of about 120 threads being somewhere
inside JDBC code.
Whether the root cause is a communication problem with the db, or a db
related config issue, or whether the root cause lies outside, e.g. GC
problems, should be decidable the next time it happens.
The current jdbc config is based on the following connection string:
jdbc:mysql://***:***/***?autoReconnect=true&useUnicode=true&characterEncoding=utf-8&mysqlEncoding=utf8&InitialCapacity=10&MaxCapacity=25
I could find no indication that "InitialCapacity=10&MaxCapacity=25" has
any functional meaning here. It looks like we should drop those
parameters to reduce confusion.
As for autoReconnect, it is said to be deprecated and is not really
recommended.
I would suggest switching back to a connection pool with the following
config:
<Resource name="jdbc/rollerdb" auth="Container"
type="javax.sql.DataSource"
driverClassName="com.mysql.jdbc.Driver"
url="jdbc:mysql://***:***/***?useUnicode=true&characterEncoding=utf-8&mysqlEncoding=utf8"
username="***"
password="***"
initialSize="10"
maxActive="50"
maxIdle="5"
testWhileIdle="true"
testOnBorrow="true"
testOnReturn="false"
validationQuery="SELECT 1"
removeAbandoned="true"
removeAbandonedTimeout="60"
logAbandoned="true"
maxWait="3000"
/>
I have dropped the eviction config, because I have not yet checked how
old the bundled dbcp+pool version is, and the eviction and abandoned
features had locking problems for a long time (no longer in recent
versions). We would like to keep the abandoned handling, but eviction is
not that important.
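For reference, the eviction part I left out would be something along
these lines (standard DBCP attribute names, values only as an example):
    timeBetweenEvictionRunsMillis="30000"
    minEvictableIdleTimeMillis="60000"
    numTestsPerEvictionRun="3"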
I guess we also need to adjust properties in lib/planet-custom.properties
and lib/roller-custom.properties if we want to switch to a pool defined
in the context.xml (ROOT.xml) once more.
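If my memory of the Roller install guide is right, the corresponding
switch in lib/roller-custom.properties would be roughly (property names
to be verified against the guide):
# assumption: Roller's JNDI datasource configuration
database.configurationType=jndi
database.jndi.name=jdbc/rollerdb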
I will add a few scripts during the next day, which will track some
data, e.g. the number of open db connections (to detect a possible
leak). I will also check whether the service doc needs an update w.r.t.
my recent changes.
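A minimal sketch of the connection tracking part (assuming MySQL on its
default port 3306; the log path and CATALINA_BASE handling are just
placeholders):
#!/bin/sh
# Append a timestamped count of established and TIME_WAIT
# connections to the MySQL port, for later correlation.
TS=$(date '+%Y-%m-%d %H:%M:%S')
EST=$(netstat -tan | awk '$5 ~ /:3306$/ && /ESTABLISHED/' | wc -l)
TW=$(netstat -tan | awk '$5 ~ /:3306$/ && /TIME_WAIT/' | wc -l)
echo "$TS established=$EST time_wait=$TW" >> $CATALINA_BASE/logs/dbconn.log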
The update of the roller version proposed by Dave does not need to wait
for any of the additional actions I proposed above. Whatever we are
ready to do should be acted upon.
Regards,
Rainer