On Thu, Nov 16, 2017 at 11:02 PM, JinRong Cai <caij...@gmail.com> wrote:
> Hi Michael,
>
> Thanks for the reply.
> It is a pity that OpenStack supports MySQL only.
> As for this problem, I will dig into the code first.
> Will let you know if I find more.
Keep in mind that's based on contributors. If you are working on Nova /
Postgresql stability, then Openstack *does* support Postgresql :).
Openstack is huge and also has a really deep CI system that needs a lot of
maintenance; the issue is strictly one of resources / interest.

> Thanks again!
> --Jinrong
>
> On Monday, November 13, 2017 at 10:32:21 PM UTC+8, Mike Bayer wrote:
>>
>> On Mon, Nov 13, 2017 at 2:14 AM, JinRong Cai <cai...@gmail.com> wrote:
>> > Hi Michael,
>> >
>> > Thanks for your detailed answers.
>> >
>> > 1. We can see several transactions hang during the failover, including:
>> >      "select 1"
>> >      "select xxx from instances"
>> >    and we even see transactions hang after "update services set xxxx".
>> >    So I am not sure whether the problem is related to the connection
>> >    invalidation in the pool.
>> >
>> > 2. I have tried to disable the connection pool with NullPool.
>> >    And yes, the failover is then OK, with no long transactions on the
>> >    DB side after failover.
>> >    But in my understanding, if we disable the connection pool this way,
>> >    the DB will suffer poor performance.
>> >
>> > 3. For the nova workers, I did not see in the log that the failed
>> >    connection object was shared between other nova processes (pids).
>> >
>> > BTW, from nova-conductor we can see that the socket receive queue is
>> > increasing, and the response from the DB is not consumed by
>> > nova-conductor:
>> >
>> > Proto Recv-Q Send-Q Local Address            Foreign Address  State        PID/Program name
>> > tcp       69      0 nova-conductor_ip:60132  pg_ip:5432       established  36365/python2.7-nova-conductor
>> >
>> > So now I suspect the long transaction, which is never committed or
>> > rolled back, was caused by the connection pool.
>> > I am wondering which situation would cause the nova client not to
>> > receive the response?
>> > And what information do I need to dig further?
>>
>> this is really more of an openstack integration issue and it should be
>> taken up by the Nova team first, as it would involve being able to
>> reproduce your problem and digging into the architecture to see where
>> the engine might be being misused. Unfortunately you might find it
>> difficult to get support because this is Postgresql; even though it's
>> historically been "supported", I'm sure you're aware they've been
>> trying for some time to find a way for Postgresql to not really be
>> supported, and that may be the case today, even though IIUC they've
>> still left the door open for alternative backends (because there is
>> always pressure to support other backends).
>>
>> If you're truly looking for a stable and performant Openstack without
>> being deeply involved in fixing issues, I'd switch to MySQL, because
>> that's really the only backend that has widespread testing and
>> support. OTOH, if you're involved with development of Nova against
>> Postgresql, then a launchpad issue with steps to reproduce would be a
>> good start (e.g. the rally job that you're running, something like
>> that).
>>
>> > Thanks for your help again.
>> > --Jinrong.
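>>
>> As a side note on point 2: turning off pooling at the SQLAlchemy level
>> just means selecting NullPool. A minimal sketch, with a made-up
>> connection URL (oslo.db does not expose this as an option today):
>>
>>     from sqlalchemy import create_engine
>>     from sqlalchemy.pool import NullPool
>>
>>     # Every checkout opens a brand-new connection and every checkin
>>     # closes it, so a failover cannot strand stale pooled connections --
>>     # at the cost of one connect per checkout.
>>     engine = create_engine(
>>         "postgresql+psycopg2://nova:secret@db-vip/nova",  # hypothetical URL
>>         poolclass=NullPool,
>>     )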
>> > On Monday, November 13, 2017 at 11:42:13 AM UTC+8, Mike Bayer wrote:
>> >>
>> >> On Sun, Nov 12, 2017 at 9:44 PM, JinRong Cai <cai...@gmail.com> wrote:
>> >> > Hi Michael,
>> >> >
>> >> > I am using openstack with postgresql, where the sqlalchemy and
>> >> > oslo_db modules are used, and there are some problems after my pg
>> >> > database switched over.
>> >> >
>> >> > Here is my switch over process:
>> >> > 1. nova-conductor (python application) is running with DB connection
>> >> >    strings pointing to the VIP, which is in the primary site (A) of pg.
>> >> > 2. Switch the VIP from the primary (A) to the new primary (B).
>> >> > 3. Switch over pg: shut down the primary (A), promote the standby (B)
>> >> >    to new primary.
>> >> > 4. nova-conductor keeps running during the whole process.
>> >> >
>> >> > After some seconds, I found some nova-conductor processes hang with
>> >> > status futex_wait_queue_me, and the status of the query in the DB is
>> >> > "idle in transaction" -- the transaction was never committed or
>> >> > rolled back!
>> >> > I think disconnection is handled in oslo_db, which sends a ping
>> >> > ("select 1") to the DB.
>> >> >
>> >> > If the DB is switched over, the connections in the pool should be
>> >> > marked invalid and reconnect on the next checkout.
>> >> >
>> >> > ### error messages from nova-conductor
>> >> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
>> >> > [exc_filters.py:330 _raise_for_remaining_DBAPIError]
>> >> > [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception
>> >> > wrapped from (psycopg2.ProgrammingError) execute cannot be used while
>> >> > an asynchronous query is underway [SQL: 'SELECT 1']
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute cannot be used while an asynchronous query is underway
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters
>> >> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
>> >> > [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error
>> >> > closing cursor
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be used while an asynchronous query is underway
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool
>> >> >
>> >> > ### ps status of nova-conductor
>> >> > POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
>> >> > [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
>> >> > [<ffffffff810e5799>] futex_wait+0x179/0x280
>> >> > [<ffffffff810e782e>] do_futex+0xfe/0x5b0
>> >> > [<ffffffff810e7d60>] SyS_futex+0x80/0x180
>> >> > [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
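>> >>
>> >> (For reference, the "select 1" liveness ping is essentially
>> >> SQLAlchemy's pessimistic-disconnect pattern: test each connection at
>> >> checkout and let the pool recycle it if the ping fails. A rough sketch
>> >> of the idea in 1.x-era API -- the URL is made up, and oslo.db's actual
>> >> listener differs in detail:
>> >>
>> >>     from sqlalchemy import create_engine, event, exc, select
>> >>
>> >>     engine = create_engine("postgresql+psycopg2://nova:secret@db-vip/nova")
>> >>
>> >>     @event.listens_for(engine, "engine_connect")
>> >>     def ping_connection(connection, branch):
>> >>         if branch:
>> >>             return  # sub-connection of an already-pinged connection
>> >>         try:
>> >>             connection.scalar(select([1]))
>> >>         except exc.DBAPIError as err:
>> >>             if err.connection_invalidated:
>> >>                 # the pool has discarded the dead connection; retry
>> >>                 # the ping once on a freshly opened one
>> >>                 connection.scalar(select([1]))
>> >>             else:
>> >>                 raise
>> >>
>> >> The failure mode in the logs above is exactly this ping raising an
>> >> error that cannot be recognized as a disconnect, because the async
>> >> cursor is stuck.)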
>> >> > ### stack of the nova-conductor process
>> >> > POD6-Mongodb03:/tmp # pstack 33316
>> >> > #0  0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
>> >> > #1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
>> >> > #2  0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
>> >> > #3  0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
>> >> > #4  0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
>> >> > #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>> >> > #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>> >> >
>> >> > psycopg2 was trying to close the cursor and to take the mutex with
>> >> > pthread_mutex_lock, but it seems the cursor was being used by another
>> >> > session.
>> >> >
>> >> > Questions:
>> >> >
>> >> > 1. What does the error "ProgrammingError: close cannot be used while
>> >> >    an asynchronous query is underway" mean?
>> >> >    AFAIK, this comes from psycopg2 and means an asynchronous query
>> >> >    was executed on a connection that was already busy with another.
>> >> >    But I think sqlalchemy is thread safe here, since it is patched by
>> >> >    eventlet; see details in eventlet/support/psycopg2_patcher.py.
>> >>
>> >> it's kind of news to me that postgresql/eventlet is even doing the
>> >> psyco-green monkeypatch, but this seems to have been the case for a
>> >> long time, since eventlet 0.9.8. So what we're seeing is just a poor
>> >> failure mode, helped along by the awkward limitations of the async
>> >> connection, where the "SELECT 1" ping we do to test liveness of the
>> >> connection is not able to raise a proper error code.
>> >>
>> >> Right off, I will note that "moving the VIP" is not adequate for a
>> >> clean HA switchover; we've had lots of problems doing things that way,
>> >> including specifically that a connection that goes over the VIP has a
>> >> very hard time detecting that it's no longer "open" when you switch
>> >> the VIP under it. In the MySQL world we've had to modify the socket
>> >> with an additional parameter, "source_address"
>> >> (https://docs.python.org/2/library/socket.html#socket.create_connection),
>> >> to help with VIP switchovers; not sure what options psycopg2 offers in
>> >> this regard. But overall we don't switch VIPs around; we use
>> >> HAProxy to do smooth switchovers and load balancing.
>> >>
>> >> Another option here would be to not use connection pooling at all
>> >> (which unfortunately is not an option that oslo.db exposes at this
>> >> time, though it could/should be added), or you could set pool_recycle
>> >> (the lifetime of a pooled connection) to be very low; assuming these
>> >> errors involve stale connections only, that would limit how long they
>> >> persist.
>> >>
>> >> Finally, nova is moving off of eventlet to mod_wsgi, and I'd strongly
>> >> recommend running it that way. I've been pushing for years to get
>> >> openstack off of eventlet entirely, as it's entirely useless for the
>> >> architecture of openstack.
>> >>
>> >> > And we can see different green thread numbers in the log, e.g.:
>> >> > [pid:36365] [MainThread] [tid:122397712]
>> >> > [pid:36365] [MainThread] [tid:122397815]
>> >> > So, I guess the connection pool within one process is safe.
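>> >>
>> >> (On the "hard time detecting it's no longer open" point: with libpq /
>> >> psycopg2 you can at least turn on aggressive TCP keepalives, which are
>> >> real libpq connection parameters, and cap the age of pooled
>> >> connections. A sketch only -- the URL and the numbers here are made
>> >> up, not tested advice:
>> >>
>> >>     from sqlalchemy import create_engine
>> >>
>> >>     engine = create_engine(
>> >>         "postgresql+psycopg2://nova:secret@db-vip/nova",  # hypothetical
>> >>         pool_recycle=30,   # discard pooled connections older than 30s
>> >>         connect_args={
>> >>             "keepalives": 1,           # enable TCP keepalives
>> >>             "keepalives_idle": 5,      # idle seconds before first probe
>> >>             "keepalives_interval": 2,  # seconds between probes
>> >>             "keepalives_count": 2,     # failed probes => peer is dead
>> >>         },
>> >>     )
>> >>
>> >> Keepalives only help a client notice a dead peer; they don't make the
>> >> VIP switchover itself clean, so the HAProxy advice above still stands.)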
>> >> > 2. The nova-conductor is a multi-threaded python client, which forks
>> >> >    several child processes:
>> >> >
>> >> > ps -elf | grep -i nova-conductor
>> >> > 30878     1 pool_s /usr/bin/nova-conductor
>> >> > 36364     1 ep_pol /usr/bin/nova-conductor
>> >> > 36365     1 futex_ /usr/bin/nova-conductor
>> >> > 36366     1 ep_pol /usr/bin/nova-conductor
>> >> > 36367     1 ep_pol /usr/bin/nova-conductor
>> >> > 36368     1 ep_pol /usr/bin/nova-conductor
>> >> >
>> >> > If nova-conductor is started with only one child, the problem does
>> >> > not happen.
>> >> > Does this mean the connection/engine CANNOT be shared among these
>> >> > child processes?
>> >>
>> >> nova starts a lot of workers, but they should not be starting up the
>> >> engine in the parent process. if you can identify that nova is
>> >> sending a filehandle from the parent process to multiple child
>> >> processes, then please report a nova issue on launchpad; that has to
>> >> be fixed ASAP.
>> >>
>> >> > Thanks.
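>> >>
>> >> (To the question above: the usual SQLAlchemy guidance is that an
>> >> engine, and any pooled connection inside it, must not be carried
>> >> across a fork. A minimal sketch of the fork-safe pattern, using a bare
>> >> os.fork() for illustration rather than nova's actual worker startup:
>> >>
>> >>     import os
>> >>     from sqlalchemy import create_engine
>> >>
>> >>     engine = create_engine("postgresql+psycopg2://nova:secret@db-vip/nova")
>> >>
>> >>     if os.fork() == 0:
>> >>         # child: throw away any pooled connections inherited from the
>> >>         # parent, so this process opens its own sockets instead of
>> >>         # sharing the parent's file descriptors
>> >>         engine.dispose()
>> >>
>> >> Better still is to create the engine only after the fork, so there is
>> >> nothing to inherit in the first place.)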