Hi all. There is a blueprint (https://blueprints.launchpad.net/nova/+spec/db-reconnect) by Devananda van der Veen whose goal is to implement reconnection to a database and retrying of the last operation if a DB connection fails. I’m working on the implementation of this BP in oslo-incubator (https://review.openstack.org/#/c/33831/).
The function _raise_if_db_connection_lost() was added to the _wrap_db_error() decorator defined in openstack/common/db/sqlalchemy/session.py. It catches sqlalchemy.exc.OperationalError and extracts the database error code from the exception. If that code is in the list of `database has gone away` error codes, the function raises a DBConnectionError exception.

A decorator for db.api methods was added to openstack/common/db/api.py. We can apply this decorator to methods in db.sqlalchemy.api (not to individual queries). It catches the DBConnectionError exception and retries the decorated method in a loop until it succeeds or until the timeout is reached. The timeout value is configurable with min, max, and increment options.

We assume that each db.api method is executed inside a single transaction, so retrying the whole method when a connection is lost should be safe.

I would really like to receive comments on the following points:

1. I can’t imagine a situation in which we lose the connection to an SQLite DB. Also, as far as I know, SQLite is not used in production at the moment, so we don't handle this case.

2. Please leave some comments about the lists of `database has gone away` error codes for MySQL and PostgreSQL.

3. Johannes Erdfelt pointed out that “retrying the whole method, even if it's in a transaction, is only safe if the entire method is idempotent. A method could execute successfully in the database, but the connection could be dropped before the final status is sent to the client.” I agree that this situation can cause data corruption in a database (e.g., if we insert something into a database), but I’m not sure how RDBMSs handle this. Also, I haven't managed to create a functional test case that reproduces the described situation easily.

Thanks,
Victor
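P.S. For illustration, the retry decorator described above could look roughly like the sketch below. This is not the code under review: the names retry_on_connection_error, min_interval, max_interval, inc_interval and the injectable sleep parameter are hypothetical, DBConnectionError is redefined locally as a stand-in, and giving up once the interval exceeds the maximum is only one possible reading of how the min, max, and increment options define the timeout.

```python
import functools
import time


class DBConnectionError(Exception):
    """Local stand-in for the exception raised when the DB has gone away."""


def retry_on_connection_error(min_interval=1, max_interval=10,
                              inc_interval=1, sleep=time.sleep):
    """Retry the wrapped db.api-style method while the connection is lost.

    Hypothetical sketch: the wait between attempts starts at min_interval
    and grows by inc_interval after each failure. Once the next wait would
    exceed max_interval, the last DBConnectionError is re-raised.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            interval = min_interval
            while True:
                try:
                    # Re-run the whole method, not a single query.
                    return func(*args, **kwargs)
                except DBConnectionError:
                    if interval > max_interval:
                        # Timeout reached: let the caller see the error.
                        raise
                    sleep(interval)
                    interval += inc_interval
        return wrapper
    return decorator
```

Note that this only retries blindly, so it is subject to the idempotency concern in point 3: if the commit succeeded on the server but the connection dropped before the reply arrived, the retried method would run a second time.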
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
