On Mon, Nov 13, 2017 at 2:14 AM, JinRong Cai <caij...@gmail.com> wrote:
> Hi Michael,
>
> Thanks for your detailed answers.
>
> 1. We can see several transactions hang during the failover, including:
> "select 1"
> "select xxx from instances"
> and we even see a transaction hang after "update services set xxxx".
> So I am not sure whether the problem is related to the connection
> invalidation in the pool.
>
> 2. I have tried to disable the connection pool by using NullPool.
> With that, the failover is fine and there are no long transactions left
> on the DB side afterwards.
> But, in my understanding, if we disable the connection pool in this way,
> the DB will suffer poor performance.
>
> 3. For the nova workers, I did not see in the logs that the failed
> connection object was shared between other nova processes (PIDs).
>
> BTW, on the nova-conductor side we can see the socket receive queue
> growing, so the response from the DB is not being consumed by
> nova-conductor:
>
> Proto Recv-Q Send-Q Local Address            Foreign Address  State        PID/Program name
> tcp       69      0 nova-conductor_ip:60132  pg_ip:5432       established  36365/python2.7-nova-conductor
>
> So now I suspect the long transaction, which is never committed or rolled
> back, is caused by the connection pool.
> I am wondering what situation could cause the nova client to not receive
> the response, and what information I need to collect in order to dig into
> this further.
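As an illustration of point 2 above: "disabling the connection pool" in SQLAlchemy is normally done with NullPool, which opens a fresh DBAPI connection on every checkout and closes it again on return. A minimal sketch, with a hypothetical connection URL; as noted later in the thread, oslo.db does not expose this option directly at this time:

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# hypothetical connection URL, for illustration only
engine = create_engine(
    "postgresql+psycopg2://nova:secret@db-vip/nova",
    poolclass=NullPool,  # no pooling: connect on each checkout, close on return
)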
this is really more of an OpenStack integration issue and should be taken
up with the Nova team first, as it would involve being able to reproduce
your problem and digging into the architecture to see where the engine
might be being misused.  Unfortunately you might find it difficult to get
support because this is PostgreSQL: even though it has historically been
"supported", I'm sure you're aware they've been trying for some time to
find a way for PostgreSQL to not really be supported, and that may be the
case today, although IIUC they've still left the door open for alternative
backends (because there is always pressure to support other backends).  If
you're truly looking for a stable and performant OpenStack without being
deeply involved in fixing issues, I'd switch to MySQL, because that's
really the only backend that has widespread testing and support.  OTOH, if
you're involved with development of Nova against PostgreSQL, then a
launchpad issue with steps to reproduce would be a good start (e.g. the
rally job that you're running, something like that).

>
> Thanks for your help again.
> --Jinrong.
>
> On Monday, November 13, 2017 at 11:42:13 AM UTC+8, Mike Bayer wrote:
>>
>> On Sun, Nov 12, 2017 at 9:44 PM, JinRong Cai <cai...@gmail.com> wrote:
>> > Hi Michael,
>> >
>> > I am using OpenStack with PostgreSQL, via the sqlalchemy and oslo_db
>> > modules, and there are some problems after my pg database is switched
>> > over.
>> >
>> > Here is my switchover process:
>> > 1. nova-conductor (a Python application) is running with its DB
>> >    connection string pointing to a VIP, which is on the primary
>> >    site (A) of pg.
>> > 2. Switch the VIP from the primary (A) to the new primary (B).
>> > 3. Switch over pg: shut down the primary (A), promote the standby (B)
>> >    to be the new primary.
>> > 4. nova-conductor keeps running during the whole process.
>> >
>> > After some seconds, I found some nova-conductor processes hung with
>> > status futex_wait_queue_me, and the status of the query in the DB is
>> > "idle in transaction"; the transaction was never committed or rolled
>> > back!
>> > I think disconnection is handled in oslo_db, which sends a ping
>> > ("select 1") to the DB.
>> >
>> > If the DB is switched over, the connection in the pool should be marked
>> > invalid and reconnect on the next checkout.
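The "ping on checkout" behavior described in the quoted paragraph above follows SQLAlchemy's pessimistic disconnect-handling recipe; a minimal sketch of that pattern (the connection URL is hypothetical, and oslo.db's real listener differs in its details):

from sqlalchemy import create_engine, event, exc
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    # runs each time a pooled connection is checked out
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        # mark this connection as dead; the pool discards it and retries
        # the checkout (up to three times) with a freshly opened connection
        raise exc.DisconnectionError()
    cursor.close()

# hypothetical connection URL, for illustration only
engine = create_engine("postgresql+psycopg2://nova:secret@db-vip/nova")

SQLAlchemy 1.2 and later expose roughly the same behavior built in via create_engine(..., pool_pre_ping=True).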
>> >
>> > ### error messages from nova-conductor
>> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
>> > [exc_filters.py:330 _raise_for_remaining_DBAPIError]
>> > [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from
>> > (psycopg2.ProgrammingError) execute cannot be used while an asynchronous
>> > query is underway [SQL: 'SELECT 1']
>> > 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
>> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>> > "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in
>> > _execute_context
>> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
>> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>> > "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in
>> > do_execute
>> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement,
>> > parameters)
>> > 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute
>> > cannot be used while an asynchronous query is underway
>> > 36365 ERROR oslo_db.sqlalchemy.exc_filters
>> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
>> > [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing
>> > cursor
>> > 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
>> > 36365 ERROR sqlalchemy.pool.QueuePool   File
>> > "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in
>> > _safe_close_cursor
>> > 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
>> > 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be
>> > used while an asynchronous query is underway
>> > 36365 ERROR sqlalchemy.pool.QueuePool
>> >
>> > ### ps status of nova-conductor
>> > POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
>> > [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
>> > [<ffffffff810e5799>] futex_wait+0x179/0x280
>> > [<ffffffff810e782e>] do_futex+0xfe/0x5b0
>> > [<ffffffff810e7d60>] SyS_futex+0x80/0x180
>> > [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
>> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> > ### stack of the nova-conductor process
>> > POD6-Mongodb03:/tmp # pstack 33316
>> > #0  0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
>> > #1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
>> > #2  0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
>> > #3  0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
>> > #4  0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
>> > #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>> > #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>> >
>> > psycopg2 is trying to close the cursor and acquire the mutex
>> > (pthread_mutex_lock), but it seems the cursor is being used by another
>> > session.
>> >
>> > Questions:
>> >
>> > 1. What does the error "ProgrammingError: close cannot be used while an
>> > asynchronous query is underway" mean?
>> > AFAIK, this is raised by psycopg2 and means an asynchronous query was
>> > already in progress on that connection.
>> > But I think sqlalchemy is thread safe, since it is patched by eventlet;
>> > see details in eventlet/support/psycopg2_patcher.py
>>
>> it's kind of news to me that postgresql/eventlet is even doing the
>> psyco-green monkeypatch, but this seems to have been the case for a
>> long time, since eventlet 0.9.8.  So what we're seeing is just a poor
>> failure mode, helped along by the awkward limitations of the async
>> connection, where the "SELECT 1" ping we do to test liveness of the
>> connection is not able to raise a proper error code.
>>
>> Right off I will note that "moving the VIP" is not adequate for a
>> clean HA switchover; we've had lots of problems doing things that way,
>> including specifically that a connection that runs over the VIP has a
>> very hard time detecting that it's no longer "open" when you switch
>> the VIP out from under it.  In the MySQL world we've had to modify the
>> socket with an additional parameter, "source_address"
>> (https://docs.python.org/2/library/socket.html#socket.create_connection),
>> to help with VIP switchovers; I'm not sure what options psycopg2 offers
>> in this regard.  But overall we don't switch VIPs around; we use
>> HAProxy to do smooth switchovers and load balancing.
>>
>> Another option here would be to not use connection pooling (which
>> unfortunately is not an option that oslo.db exposes at this time,
>> though it could/should be added), or you could set pool_recycle (the
>> maximum lifetime of a pooled connection) to be very low; assuming
>> these errors involve stale connections only, that would limit how long
>> they can persist.
>>
>> Finally, nova is moving off of eventlet to mod_wsgi, and I'd strongly
>> recommend running it that way.  I've been pushing for years to get
>> openstack off of eventlet entirely, as it's of no real use for the
>> architecture of openstack.
>>
>>
>> > And we can see different green thread numbers in the log, e.g.:
>> > [pid:36365] [MainThread] [tid:122397712]
>> > [pid:36365] [MainThread] [tid:122397815]
>> > So, I guess the connection pool within one process is safe.
>> >
>> > 2. nova-conductor is a multi-threaded Python service which forks
>> > several child processes:
>> > ps -elf|grep -i nova-conductor
>> > 30878  1 pool_s /usr/bin/nova-conductor
>> > 36364  1 ep_pol /usr/bin/nova-conductor
>> > 36365  1 futex_ /usr/bin/nova-conductor
>> > 36366  1 ep_pol /usr/bin/nova-conductor
>> > 36367  1 ep_pol /usr/bin/nova-conductor
>> > 36368  1 ep_pol /usr/bin/nova-conductor
>> >
>> > If nova-conductor is started with only one child, the problem does not
>> > happen.
>> > Does this mean the connection/engine CANNOT be shared between these
>> > child processes?
>>
>> nova starts a lot of workers, but they should not be starting up the
>> engine in the parent process.  If you can identify that nova is
>> sending a filehandle from the parent process to multiple child
>> processes, then please report a nova issue on launchpad; that has to
>> be fixed ASAP.
>>
>> > Thanks.
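On the engine-sharing point above: the usual SQLAlchemy guidance is that each forked worker ends up with its own engine and pool, either by creating the engine only after the fork or by calling engine.dispose() in the child so that pooled sockets inherited from the parent are never reused. A minimal sketch under those assumptions (the connection URL is hypothetical, and this is not Nova's actual worker code):

import os
from sqlalchemy import create_engine

# hypothetical connection URL, for illustration only
engine = create_engine("postgresql+psycopg2://nova:secret@db-vip/nova")

pid = os.fork()
if pid == 0:
    # child process: discard any pooled connections inherited from the
    # parent so their file descriptors are not shared across processes;
    # the engine transparently opens fresh connections afterwards
    engine.dispose()
    with engine.connect() as conn:
        conn.execute("SELECT 1")
    os._exit(0)
else:
    os.waitpid(pid, 0)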
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and
Verifiable Example.  See http://stackoverflow.com/help/mcve for a full
description.
---
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.