Hi All,
I was alerted to this problem recently, and since it affects developers I
want to bring it up. It is a design principle in CloudStack that we do not make
agent calls within database transactions. The reason is that when you make a
call to an external system, there is no guarantee on how long the call takes or
even whether the call returns. When a call takes a long time, several bad
things can happen:
- The MySQL DB connection held open by the DB transaction goes
idle. Eventually, a timeout in MySQL hits, the connection gets severed,
and the transaction is rolled back. By default, this timeout is 45 seconds but
can be changed via a parameter in my.cnf. So the problem is that the agent call
completes just fine but the DB transaction rolls back and changes are undone.
- The rows locked in that transaction before the remote agent call
could be holding up foreign key checks against the table from other
transactions. MySQL runs foreign key checks inside transactions to make sure
the data modification and the checks are done atomically. Therefore, these
checks must wait for the transaction holding the locks to complete. Hence, an
agent call that takes some time can severely slow down the system,
particularly at scale.
We have two solutions to this:
- Drive agent interactions with states. There are many examples of
this in VM, Volume, etc.
- When the above cannot be done, acquire a lock in the lock table via a
DAO method call. Locks do not hold DB transactions open and therefore will not
run into this problem. However, you are responsible for releasing locks. It
used to be that if you forgot to release a lock, the @DB annotation
automatically released it once it went out of scope and asserted to alert
the developer. However, the @DB annotation has been removed in the Spring work,
so I'm not sure if that is still done.
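To illustrate the first option, here is a minimal sketch of a state-driven
agent interaction. It is not CloudStack code: the class, the State enum, the
in-memory VM_TABLE map (standing in for a database table), and the agentCall
parameter are all hypothetical names made up for this example. The point it
tries to show is that each state change is its own short transaction, and the
agent call happens between them with no transaction open.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

public class StateDrivenAgentCall {
    // Hypothetical states, for illustration only.
    enum State { ALLOCATED, STARTING, RUNNING, ERROR }

    // Assumption: an in-memory map stands in for the VM table in the database.
    static final Map<Long, State> VM_TABLE = new ConcurrentHashMap<>();

    // Stands in for one short DB transaction: an atomic compare-and-set of the
    // state column. Returns false if the row was not in the expected state.
    static boolean transition(long vmId, State from, State to) {
        return VM_TABLE.replace(vmId, from, to);
    }

    static State start(long vmId, BooleanSupplier agentCall) {
        // Short transaction 1: record the intent, then "commit".
        if (!transition(vmId, State.ALLOCATED, State.STARTING)) {
            throw new IllegalStateException("VM " + vmId + " is not in ALLOCATED state");
        }

        // No transaction is open here, so a slow or hung agent holds no
        // DB connection and no row locks.
        boolean succeeded;
        try {
            succeeded = agentCall.getAsBoolean();
        } catch (RuntimeException e) {
            succeeded = false;
        }

        // Short transaction 2: record the outcome.
        transition(vmId, State.STARTING, succeeded ? State.RUNNING : State.ERROR);
        return VM_TABLE.get(vmId);
    }

    public static void main(String[] args) {
        VM_TABLE.put(1L, State.ALLOCATED);
        System.out.println(start(1L, () -> true));
    }
}
```

A second caller that tries to start the same VM while it is in STARTING is
rejected by the first transition instead of waiting on a row lock. For the
lock-based option, the usual Java way to make sure the release always happens
is to put it in a finally block.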
This is a tough problem to solve because:
1. It usually works just fine during functional testing. During scale
testing, the problem surfaces, often in unexpected places, due to the
foreign key check problem.
2. It is difficult for developers to know whether a method that
they're calling within a transaction ends up making an agent call.
There is an assert in AgentManager to ensure that there are no DB transactions
open before making an agent call. Apparently, since the conversion to Maven, no
one actually runs with asserts on any more. Because of that, this design
principle has been lost in CloudStack and we're finding more and more calls
being made in DB transactions. To counter that, I decided to add a global
parameter that turns the assert into an actual exception. All developers are
advised to set this global parameter, check.txn.before.sending.agent.commands,
during their own testing to make sure their code doesn't make agent calls in
transactions.
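For anyone wondering what the difference amounts to, here is a rough sketch of
the idea. It is not the actual AgentManager code: the class name, the
thread-local transaction counter, and the send method are hypothetical, and
only the parameter name comes from the description above. A Java assert does
nothing unless the JVM is started with -ea, whereas an explicit check that
throws runs regardless.

```java
public class AgentCallGuard {
    // Assumption: a per-thread counter stands in for real transaction tracking.
    static final ThreadLocal<Integer> OPEN_TXNS = ThreadLocal.withInitial(() -> 0);

    // Hypothetical field mirroring check.txn.before.sending.agent.commands.
    static volatile boolean checkTxnBeforeSending = true;

    static void beginTxn() { OPEN_TXNS.set(OPEN_TXNS.get() + 1); }

    static void commitTxn() { OPEN_TXNS.set(OPEN_TXNS.get() - 1); }

    // Stand-in for sending a command to an agent.
    static String send(String command) {
        boolean inTxn = OPEN_TXNS.get() > 0;
        if (checkTxnBeforeSending) {
            // Always enforced, whether or not the JVM runs with -ea.
            if (inTxn) {
                throw new IllegalStateException(
                    "Agent command '" + command + "' sent inside a DB transaction");
            }
        } else {
            // Only enforced when assertions are enabled (-ea).
            assert !inTxn : "Agent command sent inside a DB transaction";
        }
        return "sent:" + command;
    }

    public static void main(String[] args) {
        System.out.println(send("ping"));   // no transaction open, goes through
        beginTxn();
        try {
            send("ping");                   // transaction open, rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        } finally {
            commitTxn();
        }
    }
}
```

With the flag on, the offending call fails immediately at the place it is
made, which is much easier to track down than a slowdown that only shows up
under scale.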
--Alex