Aurora Developers and Users,

I would like to share a failure case I experienced recently. In a modestly
sized production cluster with high read load, snapshot creation began to
fail. The logs showed the following:

````
W0923 00:23:55.528 [LogStorage-0, LogStorage:473]
### Error rolling back transaction.  Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
### Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
org.apache.ibatis.exceptions.PersistenceException:
### Error rolling back transaction.  Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
### Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
    at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:216)
    at org.apache.ibatis.session.SqlSessionManager.rollback(SqlSessionManager.java:299)
    at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:116)
    at org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$0(DbStorage.java:175)
    at org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
    at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:173)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.storage.log.LogStorage.doInTransaction(LogStorage.java:521)
    at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:551)
    at org.apache.aurora.scheduler.storage.log.LogStorage.doSnapshot(LogStorage.java:489)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.storage.log.LogStorage.snapshot(LogStorage.java:565)
    at org.apache.aurora.scheduler.storage.log.LogStorage.lambda$scheduleSnapshots$20(LogStorage.java:468)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
    at org.apache.ibatis.datasource.pooled.PooledConnection.checkConnection(PooledConnection.java:254)
    at org.apache.ibatis.datasource.pooled.PooledConnection.invoke(PooledConnection.java:243)
    at com.sun.proxy.$Proxy135.getAutoCommit(Unknown Source)
    at org.apache.ibatis.transaction.jdbc.JdbcTransaction.rollback(JdbcTransaction.java:79)
    at org.apache.ibatis.executor.BaseExecutor.rollback(BaseExecutor.java:249)
    at org.apache.ibatis.executor.CachingExecutor.rollback(CachingExecutor.java:119)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:213)
    ... 19 common frames omitted
````

This failure is silent and can be observed only through the
`scheduler_log_snapshots` metric: if it is not increasing, snapshots are
not being created. In this cluster, no snapshot had been taken for about 4
days.
For those unfamiliar with Aurora's replicated log storage system, snapshot
creation is important because it allows us to truncate the replicated log,
collapsing its many entries into a single large entry. This matters because
log recovery time is proportional to the number of entries in the log.
Operators can observe how long log recovery takes at startup via the
`scheduler_log_recover_nanos_total` metric.
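As a toy illustration of why entry count drives recovery time (this is not
Aurora's actual code; the class and method names below are made up):
restoring from the latest snapshot means replaying only the entries
appended after it, so fewer un-snapshotted entries means faster recovery.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of replicated-log recovery: load the latest snapshot, then
// replay only the entries appended after it. Recovery work is proportional
// to the number of entries not yet covered by a snapshot.
public class RecoverySketch {
    // Returns the number of entries that must be replayed when the latest
    // snapshot covers the log up to (and including) snapshotIndex.
    static int replay(int snapshotIndex, List<String> log) {
        int applied = 0;
        for (int i = snapshotIndex + 1; i < log.size(); i++) {
            applied++; // apply(log.get(i)) in a real system
        }
        return applied;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < 1000; i++) log.add("op-" + i);
        System.out.println(replay(-1, log));  // no snapshot: replay all 1000
        System.out.println(replay(949, log)); // recent snapshot: replay only 50
    }
}
```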

The largest value observed for `scheduler_log_recover_nanos_total` during
this period was 8 minutes, meaning recovery from a failover would take at
least 8 minutes. For reference, a system targeting 99.99% uptime can
sustain only about 4 minutes of downtime per month.
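The downtime budget above comes from straightforward arithmetic (using a
30-day month for simplicity):

```java
// Downtime budget for a given availability target.
public class DowntimeBudget {
    public static void main(String[] args) {
        double availability = 0.9999;            // 99.99% uptime target
        int minutesPerMonth = 30 * 24 * 60;      // 43,200 minutes in a 30-day month
        double allowed = minutesPerMonth * (1 - availability);
        System.out.println(Math.round(allowed * 10) / 10.0); // prints 4.3
    }
}
```

So a single 8-minute recovery already blows roughly two months' worth of
the budget.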

The root cause can be found in the exception at the bottom of the stack
trace:
`Caused by: java.sql.SQLException: Error accessing PooledConnection.
Connection is invalid.`
This originates from the MyBatis connection pool used to communicate with
the in-memory SQL store. To create a snapshot, we run a `SCRIPT` query to
dump the entire database into the replicated log [1].

This exception is thrown because we use a connection pool to communicate
with the H2 SQL database. By default the pool has the following properties:
* Maximum of 10 active connections
* Maximum checkout time of 20s, after which a connection is considered for
eviction
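For reference, these defaults correspond to MyBatis's `PooledDataSource`
settings. In a standalone MyBatis setup they would appear in the
`dataSource` block of the configuration file like this (a generic
mybatis-config.xml sketch, not Aurora's actual wiring):

```xml
<dataSource type="POOLED">
  <!-- Default: at most 10 connections checked out at once -->
  <property name="poolMaximumActiveConnections" value="10"/>
  <!-- Default: a connection checked out longer than 20,000 ms may be
       forcibly reclaimed by the pool -->
  <property name="poolMaximumCheckoutTime" value="20000"/>
</dataSource>
```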

Under high read load there can be many SQL queries waiting for a
connection. If a single connection is held for more than 20s, it will
likely be evicted. In this case, a `SCRIPT` query was taking more than 20s
while many queries were pending, which caused MyBatis to evict the
connection backing the `SCRIPT` query and snapshot creation to fail.
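The failure mode can be sketched with a toy model (illustrative only, not
MyBatis's actual implementation): a connection held past the checkout
deadline gets reclaimed out from under its holder, so the long-running
`SCRIPT` query's next use of it fails.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of checkout-time-based reclamation in a connection pool.
// Not MyBatis internals; names and structure are made up for illustration.
public class CheckoutEviction {
    static final long MAX_CHECKOUT_MS = 20_000; // mirrors the 20s default

    // connection id -> time it was checked out
    static Map<Integer, Long> checkedOut = new HashMap<>();

    // Returns true if the connection was reclaimed, leaving its original
    // holder with an invalid connection.
    static boolean maybeReclaim(int conn, long nowMs) {
        Long since = checkedOut.get(conn);
        if (since != null && nowMs - since > MAX_CHECKOUT_MS) {
            checkedOut.remove(conn); // handed to a waiting request in a real pool
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        checkedOut.put(1, 0L); // long-running query checks out connection 1 at t=0
        System.out.println(maybeReclaim(1, 5_000));  // 5s in: still valid -> false
        System.out.println(maybeReclaim(1, 25_000)); // 25s in: reclaimed -> true
    }
}
```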

To fix this issue, operators used the `-db_max_active_connection_count`
flag to increase the maximum number of active MyBatis connections to 100.
Once the scheduler was able to serve requests, operators ran `aurora_admin
scheduler_snapshot` to force a snapshot. A scheduler failover was then
induced, and recovery time dropped to about 40 seconds.

Today the cluster still runs with this flag and value so it can keep
serving a high read load.

I would like to raise three questions:
* Should we add a flag to tune the maximum connection time for MyBatis?
* Should a snapshot creation failure be fatal?
* Should we change the default maximum connection time and maximum number
of active connections?

[1]:
https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L107-L127

--
Zameer Manji
