hi all - I’ve been researching this cinder issue https://bugs.launchpad.net/cinder/+bug/1417018 and I’ve found something concerning.
Basically there's a race condition here, occurring because a single MySQL connection is being shared between two green threads. It happens while Cinder is doing its startup work in cinder/volume/manager.py -> init_host(), while at the same time a request comes in from a separate service call that also seems to be part of startup. The log output at http://paste.openstack.org/show/175571/ shows this happening. Breaking it down:

1. A big query for volumes runs as part of "self.db.volume_get_all_by_host(ctxt, self.host)". To reproduce the error more regularly I've placed that call into a loop of 100 iterations. We can see the thread id is 68089648 and the MySQL connection is the one at 3a9c5a0 (the tid / connection prefix in these log lines is extra tracing; a rough sketch of one way to emit it is at the end of this mail):

2015-02-16 19:32:47.236 INFO sqlalchemy.engine.base.Engine [req-ed3c0248-6ee5-4063-80b5-77c5c9a23c81 None None] tid: 68089648, connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt SELECT volumes.created_at AS
2015-02-16 19:32:47.237 INFO sqlalchemy.engine.base.Engine [req-ed3c0248-6ee5-4063-80b5-77c5c9a23c81 None None] ('localhost.localdomain@ceph', 'localhost.localdomain@ceph#%')

2. A "ping" query comes in from a different API call - different thread id, *same* connection:

2015-02-16 19:32:47.276 INFO sqlalchemy.engine.base.Engine [req-600ef638-cb45-4a34-a3ab-6d22d83cfd00 None None] tid: 68081456, connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt SELECT 1
2015-02-16 19:32:47.279 INFO sqlalchemy.engine.base.Engine [req-600ef638-cb45-4a34-a3ab-6d22d83cfd00 None None] ()

3. The first statement is still in the middle of its invocation, so we get a failure - either a mismatch of the statement to the cursor, or a MySQL lost connection. The stack trace begins:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
  … more stack trace

4. Yet another thread id appears, again on the *same* connection, interleaved with the remainder of the traceback:

2015-02-16 19:32:47.290 INFO sqlalchemy.engine.base.Engine [req-f980de7c-151d-4fed-b45e-d12b133859a6 None None] tid: 61238160, connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt SELECT 1
    rows = [process[0](row, None) for row in fetch]
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 363, in _instance
    tuple([row[column] for column in pk_cols])
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 331, in _key_fallback
    expression._string_or_unprintable(key))
NoSuchColumnError: "Could not locate column in row for column 'volumes.id'"
2015-02-16 19:32:47.293 ERROR cinder.openstack.common.threadgroup [-] "Could not locate column in row for column 'volumes.id'"

So I'm not really sure what's going on here. Cinder has some greenlet/thread-group code of its own in cinder/openstack/common/threadgroup.py, and I don't know what that code's purpose is. SQLAlchemy's connection pool has been exercised a great deal with eventlet / gevent and this has never been reported before. This is a very severe race, and I'd think it would be happening all over the place. Current status is that I'm continuing to try to determine why it happens here, and seemingly nowhere else. To illustrate the failure mode itself I've also included a small standalone sketch at the end of this mail showing why two green threads sharing one DBAPI connection blows up.
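For reference, here is a rough sketch (not the exact code I have locally) of one way to emit the kind of tid / connection tracing shown above, using SQLAlchemy's before_cursor_execute event hook; the logger name and the use of threading.current_thread().ident are just assumptions for the illustration:

    import logging
    import threading

    from sqlalchemy import event
    from sqlalchemy.engine import Engine

    log = logging.getLogger("connection_trace")

    @event.listens_for(Engine, "before_cursor_execute")
    def _trace(conn, cursor, statement, parameters, context, executemany):
        # conn.connection is the pool's ConnectionFairy; its .connection
        # attribute is the raw DBAPI (_mysql) connection underneath it.
        raw_dbapi_conn = conn.connection.connection
        log.info("tid: %s, connection: %r, stmt %s",
                 threading.current_thread().ident, raw_dbapi_conn, statement)

With this hook registered, every statement logs which thread-ish identity and which raw connection it is running on, which is how the "same connection, different tid" pattern above becomes visible.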
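And here is the promised standalone sketch of the failure mode. This is hypothetical illustration code, not Cinder code; it assumes a local MySQL server with a 'test' database and the credentials shown, and it's timing-dependent, so rather than always reproducing the exact NoSuchColumnError above, one of the two green threads will typically fail with a "commands out of sync" or lost-connection style error:

    import eventlet
    import MySQLdb
    import MySQLdb.cursors

    # one DBAPI connection deliberately shared between two green threads
    conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="", db="test")

    def big_query():
        # unbuffered cursor: rows stay pending on the connection
        cur = conn.cursor(MySQLdb.cursors.SSCursor)
        cur.execute("SELECT table_name, table_rows FROM information_schema.tables")
        eventlet.sleep(0)   # yield to the hub mid-result
        return cur.fetchall()

    def ping():
        cur = conn.cursor()
        cur.execute("SELECT 1")   # second statement on the *same* connection
        return cur.fetchall()

    for gt in [eventlet.spawn(big_query), eventlet.spawn(ping)]:
        try:
            print(gt.wait())
        except Exception as exc:
            print("greenthread failed: %s" % exc)

The eventlet.sleep(0) stands in for the implicit yields that can occur while result rows are being processed in Python; take it out and the two statements no longer interleave, and the script runs cleanly.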