On Thu, Nov 16, 2017 at 11:02 PM, JinRong Cai <caij...@gmail.com> wrote:
> Hi Michael,
>
> Thanks for the reply.
> It is a pity that OpenStack supports MySQL only.
> As for this problem, I will dig into the code first.
> Will let you know if I find more.
Keep in mind that's based on contributors. If you are working on Nova /
Postgresql stability, then Openstack *does* support Postgresql :).
Openstack is huge and also has a really deep CI system that needs a lot of
maintenance; the issue is strictly one of resources / interest.

> Thanks again!
> --Jinrong
>
> On Monday, November 13, 2017 at 10:32:21 PM UTC+8, Mike Bayer wrote:
>>
>> On Mon, Nov 13, 2017 at 2:14 AM, JinRong Cai <cai...@gmail.com> wrote:
>> > Hi Michael,
>> >
>> > Thanks for your detailed answers.
>> >
>> > 1. We can see several transactions hang during the failover, including:
>> >      "select 1"
>> >      "select xxx from instances"
>> >    and we even see transactions hang after "update services set xxxx".
>> >    So I am not sure whether the problem is related to the connection
>> >    invalidation in the pool.
>> >
>> > 2. I have tried to disable the connection pool with NullPool.
>> >    And yes, the failover is then OK, with no long transactions on the
>> >    DB side after failover.
>> >    But in my understanding, if we disable the connection pool this way,
>> >    the DB will suffer poor performance.
>> >
>> > 3. For the nova workers, I did not see in the log that the failed
>> >    connection object was shared between other nova processes (pids).
>> >
>> > BTW, from nova-conductor we can see that the socket receive queue is
>> > increasing, and the response from the DB is not consumed by
>> > nova-conductor:
>> >
>> > Proto Recv-Q Send-Q Local Address            Foreign Address  State        PID/Program name
>> > tcp       69      0 nova-conductor_ip:60132  pg_ip:5432       established  36365/python2.7-nova-conductor
>> >
>> > So now I suspect the long transaction, which is never committed or
>> > rolled back, was caused by the connection pool.
>> > I am wondering which situation would cause the nova client not to
>> > receive the response?
>> > And what information do I need to dig further?
>>
>> this is really more of an openstack integration issue and it should be
>> taken up by the Nova team first, as it would involve being able to
>> reproduce your problem and digging into the architecture to see where
>> the engine might be being misused. Unfortunately you might find it
>> difficult to get support because this is Postgresql; even though it's
>> historically been "supported", I'm sure you're aware they've been
>> trying for some time to find a way for Postgresql to not really be
>> supported, and that may be the case today, even though IIUC they've
>> still left the door open for alternative backends (because there is
>> always pressure to support other backends).
>>
>> If you're truly looking for a stable and performant Openstack without
>> being deeply involved in fixing issues, I'd switch to MySQL, because
>> that's really the only backend that has widespread testing and
>> support. OTOH, if you're involved with development of Nova against
>> Postgresql, then a launchpad issue with steps to reproduce would be a
>> good start (e.g. the rally job that you're running, something like
>> that).
>>
>> > Thanks for your help again.
>> > --Jinrong.
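>>
>> As a side note on point 2: turning off pooling at the SQLAlchemy level
>> just means selecting NullPool. A minimal sketch, with a made-up
>> connection URL (oslo.db does not expose this as an option today):
>>
>>     from sqlalchemy import create_engine
>>     from sqlalchemy.pool import NullPool
>>
>>     # Every checkout opens a brand-new connection and every checkin
>>     # closes it, so a failover cannot strand stale pooled connections --
>>     # at the cost of one connect per checkout.
>>     engine = create_engine(
>>         "postgresql+psycopg2://nova:secret@db-vip/nova",  # hypothetical URL
>>         poolclass=NullPool,
>>     )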
>> > On Monday, November 13, 2017 at 11:42:13 AM UTC+8, Mike Bayer wrote:
>> >>
>> >> On Sun, Nov 12, 2017 at 9:44 PM, JinRong Cai <cai...@gmail.com> wrote:
>> >> > Hi Michael,
>> >> >
>> >> > I am using openstack with postgresql, where the sqlalchemy and
>> >> > oslo_db modules are used, and there are some problems after my pg
>> >> > database switched over.
>> >> >
>> >> > Here is my switch over process:
>> >> > 1. nova-conductor (python application) is running with DB connection
>> >> >    strings pointing to the VIP, which is in the primary site (A) of pg.
>> >> > 2. Switch the VIP from the primary (A) to the new primary (B).
>> >> > 3. Switch over pg: shut down the primary (A), promote the standby (B)
>> >> >    to new primary.
>> >> > 4. nova-conductor keeps running during the whole process.
>> >> >
>> >> > After some seconds, I found some nova-conductor processes hang with
>> >> > status futex_wait_queue_me, and the status of the query in the DB is
>> >> > "idle in transaction" -- the transaction was never committed or
>> >> > rolled back!
>> >> > I think disconnection is handled in oslo_db, which sends a ping
>> >> > ("select 1") to the DB.
>> >> >
>> >> > If the DB is switched over, the connections in the pool should be
>> >> > marked invalid and reconnect on the next checkout.
>> >> >
>> >> > ### error messages from nova-conductor
>> >> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
>> >> > [exc_filters.py:330 _raise_for_remaining_DBAPIError]
>> >> > [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception
>> >> > wrapped from (psycopg2.ProgrammingError) execute cannot be used while
>> >> > an asynchronous query is underway [SQL: 'SELECT 1']
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute cannot be used while an asynchronous query is underway
>> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters
>> >> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
>> >> > [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error
>> >> > closing cursor
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be used while an asynchronous query is underway
>> >> > 36365 ERROR sqlalchemy.pool.QueuePool
>> >> >
>> >> > ### ps status of nova-conductor
>> >> > POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
>> >> > [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
>> >> > [<ffffffff810e5799>] futex_wait+0x179/0x280
>> >> > [<ffffffff810e782e>] do_futex+0xfe/0x5b0
>> >> > [<ffffffff810e7d60>] SyS_futex+0x80/0x180
>> >> > [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
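>> >>
>> >> (For reference, the "select 1" liveness ping is essentially
>> >> SQLAlchemy's pessimistic-disconnect pattern: test each connection at
>> >> checkout and let the pool recycle it if the ping fails. A rough sketch
>> >> of the idea in 1.x-era API -- the URL is made up, and oslo.db's actual
>> >> listener differs in detail:
>> >>
>> >>     from sqlalchemy import create_engine, event, exc, select
>> >>
>> >>     engine = create_engine("postgresql+psycopg2://nova:secret@db-vip/nova")
>> >>
>> >>     @event.listens_for(engine, "engine_connect")
>> >>     def ping_connection(connection, branch):
>> >>         if branch:
>> >>             return  # sub-connection of an already-pinged connection
>> >>         try:
>> >>             connection.scalar(select([1]))
>> >>         except exc.DBAPIError as err:
>> >>             if err.connection_invalidated:
>> >>                 # the pool has discarded the dead connection; retry
>> >>                 # the ping once on a freshly opened one
>> >>                 connection.scalar(select([1]))
>> >>             else:
>> >>                 raise
>> >>
>> >> The failure mode in the logs above is exactly this ping raising an
>> >> error that cannot be recognized as a disconnect, because the async
>> >> cursor is stuck.)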
>> >> > ### stack of the nova-conductor process
>> >> > POD6-Mongodb03:/tmp # pstack 33316
>> >> > #0  0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
>> >> > #1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
>> >> > #2  0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
>> >> > #3  0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
>> >> > #4  0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
>> >> > #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>> >> > #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>> >> >
>> >> > psycopg2 was trying to close the cursor and to take the mutex with
>> >> > pthread_mutex_lock, but it seems the cursor was being used by another
>> >> > session.
>> >> >
>> >> > Questions:
>> >> >
>> >> > 1. What does the error "ProgrammingError: close cannot be used while
>> >> >    an asynchronous query is underway" mean?
>> >> >    AFAIK, this comes from psycopg2 and means an asynchronous query
>> >> >    was executed on a connection that was already busy with another.
>> >> >    But I think sqlalchemy is thread safe here, since it is patched by
>> >> >    eventlet; see details in eventlet/support/psycopg2_patcher.py.
>> >>
>> >> it's kind of news to me that postgresql/eventlet is even doing the
>> >> psyco-green monkeypatch, but this seems to have been the case for a
>> >> long time, since eventlet 0.9.8. So what we're seeing is just a poor
>> >> failure mode, helped along by the awkward limitations of the async
>> >> connection, where the "SELECT 1" ping we do to test liveness of the
>> >> connection is not able to raise a proper error code.
>> >>
>> >> Right off, I will note that "moving the VIP" is not adequate for a
>> >> clean HA switchover; we've had lots of problems doing things that way,
>> >> including specifically that a connection that goes over the VIP has a
>> >> very hard time detecting that it's no longer "open" when you switch
>> >> the VIP under it. In the MySQL world we've had to modify the socket
>> >> with an additional parameter, "source_address"
>> >> (https://docs.python.org/2/library/socket.html#socket.create_connection),
>> >> to help with VIP switchovers; not sure what options psycopg2 offers in
>> >> this regard. But overall we don't switch VIPs around; we use
>> >> HAProxy to do smooth switchovers and load balancing.
>> >>
>> >> Another option here would be to not use connection pooling at all
>> >> (which unfortunately is not an option that oslo.db exposes at this
>> >> time, though it could/should be added), or you could set pool_recycle
>> >> (the lifetime of a pooled connection) to be very low; assuming these
>> >> errors involve stale connections only, that would limit how long they
>> >> persist.
>> >>
>> >> Finally, nova is moving off of eventlet to mod_wsgi, and I'd strongly
>> >> recommend running it that way. I've been pushing for years to get
>> >> openstack off of eventlet entirely, as it's entirely useless for the
>> >> architecture of openstack.
>> >>
>> >> > And we can see different green thread numbers in the log, e.g.:
>> >> > [pid:36365] [MainThread] [tid:122397712]
>> >> > [pid:36365] [MainThread] [tid:122397815]
>> >> > So, I guess the connection pool within one process is safe.
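>> >>
>> >> (On the "hard time detecting it's no longer open" point: with libpq /
>> >> psycopg2 you can at least turn on aggressive TCP keepalives, which are
>> >> real libpq connection parameters, and cap the age of pooled
>> >> connections. A sketch only -- the URL and the numbers here are made
>> >> up, not tested advice:
>> >>
>> >>     from sqlalchemy import create_engine
>> >>
>> >>     engine = create_engine(
>> >>         "postgresql+psycopg2://nova:secret@db-vip/nova",  # hypothetical
>> >>         pool_recycle=30,   # discard pooled connections older than 30s
>> >>         connect_args={
>> >>             "keepalives": 1,           # enable TCP keepalives
>> >>             "keepalives_idle": 5,      # idle seconds before first probe
>> >>             "keepalives_interval": 2,  # seconds between probes
>> >>             "keepalives_count": 2,     # failed probes => peer is dead
>> >>         },
>> >>     )
>> >>
>> >> Keepalives only help a client notice a dead peer; they don't make the
>> >> VIP switchover itself clean, so the HAProxy advice above still stands.)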
>> >> > 2. The nova-conductor is a multi-threaded python client, which forks
>> >> >    several child processes:
>> >> >
>> >> > ps -elf | grep -i nova-conductor
>> >> > 30878     1 pool_s /usr/bin/nova-conductor
>> >> > 36364     1 ep_pol /usr/bin/nova-conductor
>> >> > 36365     1 futex_ /usr/bin/nova-conductor
>> >> > 36366     1 ep_pol /usr/bin/nova-conductor
>> >> > 36367     1 ep_pol /usr/bin/nova-conductor
>> >> > 36368     1 ep_pol /usr/bin/nova-conductor
>> >> >
>> >> > If nova-conductor is started with only one child, the problem does
>> >> > not happen.
>> >> > Does this mean the connection/engine CANNOT be shared among these
>> >> > child processes?
>> >>
>> >> nova starts a lot of workers, but they should not be starting up the
>> >> engine in the parent process. if you can identify that nova is
>> >> sending a filehandle from the parent process to multiple child
>> >> processes, then please report a nova issue on launchpad; that has to
>> >> be fixed ASAP.
>> >>
>> >> > Thanks.
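>> >>
>> >> (To the question above: the usual SQLAlchemy guidance is that an
>> >> engine, and any pooled connection inside it, must not be carried
>> >> across a fork. A minimal sketch of the fork-safe pattern, using a bare
>> >> os.fork() for illustration rather than nova's actual worker startup:
>> >>
>> >>     import os
>> >>     from sqlalchemy import create_engine
>> >>
>> >>     engine = create_engine("postgresql+psycopg2://nova:secret@db-vip/nova")
>> >>
>> >>     if os.fork() == 0:
>> >>         # child: throw away any pooled connections inherited from the
>> >>         # parent, so this process opens its own sockets instead of
>> >>         # sharing the parent's file descriptors
>> >>         engine.dispose()
>> >>
>> >> Better still is to create the engine only after the fork, so there is
>> >> nothing to inherit in the first place.)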