It's not a problem with the lock. I tried acquiring _pool.mutex in create_connection, but that didn't help.
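The hypothesis below ("created, checked in, and used at the same time") can be sketched with a toy pool. This is hypothetical illustration code, not SQLAlchemy's actual pool; queue.Queue stands in for the real pool queue:

```python
import queue

# Toy pool, NOT SQLAlchemy's implementation: it only shows how a
# connection that is both handed to the caller AND left visible in
# the pool queue ends up shared by two users.
class ToyPool:
    def __init__(self):
        self._q = queue.Queue()

    def create_connection(self):
        return object()  # stands in for a DBAPI connection

    def get(self):
        return self._q.get_nowait()  # protected by the Queue's mutex

def buggy_checkout(pool):
    # suspected bug: the new connection is checked in to the pool...
    conn = pool.create_connection()
    pool._q.put(conn)
    # ...but also handed straight back to the caller
    return conn

pool = ToyPool()
a = buggy_checkout(pool)  # "coroutine A" creates and starts using conn
b = pool.get()            # "coroutine B" pulls the same conn from the pool
assert a is b             # two coroutines now share one connection
```

If checkout behaved like this, the logs below (same connection object 0x903f8c0 seen by two tids, with "Connections in pool" going 1 -> 0) would be exactly what you'd expect.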
Maybe we create the connection and put it into the pool, but then use it without checking it out?

On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, jiajun...@gmail.com wrote:
>
> I added logs in sqlalchemy/pool.py, in QueuePool.checkout:
>
>     @classmethod
>     def checkout(cls, pool):
>         rec = pool._do_get()
>         try:
>             dbapi_connection = rec.get_connection()
>             pool.logger.debug("jjs get conn success %s", dbapi_connection)
>         except:
>             pool.logger.debug("jjs get error in get_connection")
>             with util.safe_reraise():
>                 rec.checkin()
>         echo = pool._should_log_debug()
>         fairy = _ConnectionFairy(dbapi_connection, rec, echo)
>         rec.fairy_ref = weakref.ref(
>             fairy,
>             lambda ref: _finalize_fairy and
>             _finalize_fairy(
>                 dbapi_connection,
>                 rec, pool, ref, echo)
>         )
>         _refs.add(rec)
>         if echo:
>             pool.logger.debug("jjs conn %s, pool status %s",
>                               dbapi_connection, pool.status())
>             pool.logger.debug("Connection %r checked out from pool",
>                               dbapi_connection)
>         return fairy
>
> Then I got the logs below.
>
> Coroutine A:
>
> 2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO [pid:55528] [MainThread] [tid:197123760] [log.py:109 info] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, timeout 30
> 2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection <connection object at 0x903f8c0; closed: 0>
> 2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success <connection object at 0x903f8c0; closed: 0>
> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn <connection object at 0x903f8c0; closed: 0>, pool status Pool size: 10  Connections in pool: 1  Current Overflow: 36  Max Overflow: 40  Current Checked out connections: 45
> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection <connection object at 0x903f8c0; closed: 0> checked out from pool
>
> Coroutine B:
>
> 2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO [pid:55528] [MainThread] [tid:151519312] [log.py:109 info] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, timeout 30
> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success <connection object at 0x903f8c0; closed: 0>
> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn <connection object at 0x903f8c0; closed: 0>, pool status Pool size: 10  Connections in pool: 0  Current Overflow: 36  Max Overflow: 40  Current Checked out connections: 46
> 2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection <connection object at 0x903f8c0; closed: 0> checked out from pool
>
> It seems that when we create_connection in QueuePool._do_get, we may
> not be holding the lock _pool.mutex (~Queue.mutex).
>
> On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote:
>>
>> On Mon, May 7, 2018, 7:01 AM <jiajun...@gmail.com> wrote:
>>
>>> We added a coroutine_id in psycopg2, and found that two coroutines
>>> use the same connection before the pthread_mutex_lock is released.
>>>
>>> Maybe something in the connection pool goes wrong?
>>> [pid] [coroutine_id] msg
>>
>> This seems unlikely. OpenStack uses the SQLAlchemy connection pool
>> with eventlet patching in conjunction with the MySQL drivers, and
>> nothing like that happens.
>> The biggest issue SQLAlchemy had with greenlets is that a timed-out
>> greenlet would throw GreenletExit and cause a dirty connection to go
>> back into the pool; that was fixed in 1.2.
>
>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320
>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2
>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>>
>>> Below are the logs grepped by coroutine id:
>>>
>>> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
>>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320
>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>> [49174] [0xa5db730]conn_poll: poll writing
>>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>>>
>>> -----
>>>
>>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at 0x8de2d30
>>> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650
>>> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = 0x8de2d30
>>> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt = 1
>>> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1
>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2
>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>>
>>> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, jiajun...@gmail.com wrote:
>>>>
>>>> We reproduced this problem, added logs in psycopg2, found a
>>>> confusing thing, and reported it to psycopg2 [1].
>>>>
>>>> It seems the sqlalchemy & psycopg2 call flow is:
>>>> 1. from sqlalchemy to psycopg2: pq_execute/pq_commit
>>>> 2. from psycopg2 to sqlalchemy: PyWeakref_NewRef(conn)
>>>> 3. an exception is raised in sqlalchemy, which calls do_rollback
>>>> 4. from sqlalchemy to psycopg2: pq_abort
>>>> 5. psycopg2 deadlocks on pthread_mutex_lock
>>>>
>>>> What we did to reproduce it: stop the master pg-server and promote
>>>> the slave pg-server to master, moving the FIP from the old master
>>>> to the slave. At the same time, let nova-conductor issue quite a
>>>> lot of DB query requests.
>>>>
>>>> [1] https://github.com/psycopg/psycopg2/issues/703
>>>>
>>>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>>>>>
>>>>> On Mon, Apr 23, 2018 at 9:03 AM, <jiajun...@gmail.com> wrote:
>>>>> > Sorry for replying on this topic.
>>>>> >
>>>>> > We recently hit the same problem in our production environment.
>>>>> >
>>>>> > I found a patch in another lib [1], where they added
>>>>> > conn.close() in the exception handler for
>>>>> > psycopg2.ProgrammingError.
>>>>> >
>>>>> > Shall we do the same in [2]?
>>>>>
>>>>> SQLAlchemy does things much more carefully than that: we parse the
>>>>> error message for specific ones that correspond to "connection is
>>>>> no longer usable". We call these "is_disconnect", but it can be
>>>>> any invalidating condition.
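For illustration, the message-based detection Mike describes can be sketched like this. The pattern list and function name below are made up for the example; this is not SQLAlchemy's actual is_disconnect list:

```python
import re

# Illustrative only: a few error-message patterns that typically mean
# "this connection is no longer usable" (not SQLAlchemy's real list).
_DISCONNECT_PATTERNS = [
    re.compile(r"connection already closed"),
    re.compile(r"server closed the connection unexpectedly"),
    re.compile(r"asynchronous query is underway"),
]

def looks_like_disconnect(message: str) -> bool:
    """Return True if the DBAPI error message indicates the connection
    should be invalidated rather than reused."""
    return any(p.search(message) for p in _DISCONNECT_PATTERNS)

print(looks_like_disconnect(
    "execute cannot be used while an asynchronous query is underway"))  # True
print(looks_like_disconnect("syntax error at or near SELECT"))          # False
```

A filter like this is only as good as its pattern list, which is why SQLAlchemy and oslo.db maintain the real lists per-driver.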
>>>>> You can make these yourself, and they can also be made part of
>>>>> oslo.db, using the handle_error event:
>>>>>
>>>>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>>>>>
>>>>> Within oslo.db you would want to propose a change here:
>>>>>
>>>>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
>>>>>
>>>>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split
>>>>> >
>>>>> > [2] https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>>>>> >
>>>>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:
>>>>> >>
>>>>> >> Hi Michael,
>>>>> >>
>>>>> >> I am using OpenStack with PostgreSQL, with the sqlalchemy and
>>>>> >> oslo_db modules, and there are some problems after my pg
>>>>> >> database switched over.
>>>>> >>
>>>>> >> Here is my switchover process:
>>>>> >> 1. nova-conductor (a Python application) is running with DB
>>>>> >>    connection strings pointing to the VIP, which is in the
>>>>> >>    primary site (A) of pg.
>>>>> >> 2. Switch the VIP from the primary (A) to the new primary (B).
>>>>> >> 3. Switch over pg: shut down the primary (A), promote the
>>>>> >>    standby (B) to new primary.
>>>>> >> 4. nova-conductor keeps running through the whole process.
>>>>> >>
>>>>> >> After some seconds, I found some nova-conductor processes hang
>>>>> >> with status futex_wait_queue_me, and the status of the query in
>>>>> >> the DB is "idle in transaction": the transaction was neither
>>>>> >> committed nor rolled back!
>>>>> >> I think disconnection is handled in oslo_db, which sends a
>>>>> >> ping (SELECT 1) to the DB. If the DB was switched over, the
>>>>> >> connection in the pool would be marked invalid and reconnect
>>>>> >> on the next checkout.
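That ping-and-invalidate behavior can be sketched roughly as below. This is a toy, not oslo.db's actual code: sqlite3 stands in for psycopg2, and the function names are made up for the example:

```python
import sqlite3

def make_conn():
    # stands in for a real DBAPI connect(), e.g. psycopg2.connect()
    return sqlite3.connect(":memory:")

def checkout(conn):
    # Before handing a pooled connection out, ping it with SELECT 1;
    # if the ping fails, invalidate the connection and reconnect.
    try:
        conn.execute("SELECT 1")
        return conn
    except sqlite3.Error:
        try:
            conn.close()  # discard the invalidated connection
        except sqlite3.Error:
            pass
        return make_conn()  # reconnect with a fresh one

healthy = make_conn()
assert checkout(healthy) is healthy   # live connection passes the ping

dead = make_conn()
dead.close()                          # simulate a dropped connection
assert checkout(dead) is not dead     # dead one is replaced on checkout
```

The hang described in this thread happens when the SELECT 1 ping itself lands on a connection whose async state is corrupted, so even the health check raises "execute cannot be used while an asynchronous query is underway."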
>>>>> >> ### Error messages from nova-conductor
>>>>> >>
>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] [exc_filters.py:330 _raise_for_remaining_DBAPIError] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from (psycopg2.ProgrammingError) execute cannot be used while an asynchronous query is underway [SQL: 'SELECT 1']
>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute cannot be used while an asynchronous query is underway
>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters
>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing cursor
>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be used while an asynchronous query is underway
>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool
>>>>> >>
>>>>> >> ### ps status of nova-conductor
>>>>> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
>>>>> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
>>>>> >> [<ffffffff810e5799>] futex_wait+0x179/0x280
>>>>> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0
>>>>> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180
>>>>> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
>>>>> >> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>> >>
>>>>> >> ### Stack of the nova-conductor process
>>>>> >>
>>>>> >> POD6-Mongodb03:/tmp # pstack 33316
>>>>> >> #0 0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
>>>>> >> #1 0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
>>>>> >> #2 0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
>>>>> >> #3 0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
>>>>> >> #4 0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
>>>>> >> #5 0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>>>>> >> #6 0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>>>>> >>
>>>>> >> psycopg2 was trying to close the cursor and acquire the mutex
>>>>> >> via pthread_mutex_lock, but it seems the cursor was in use by
>>>>> >> another session.
>>>>> >>
>>>>> >> Questions:
>>>>> >>
>>>>> >> 1. What does the error "ProgrammingError: close cannot be used
>>>>> >> while an asynchronous query is underway" mean?
>>>>> >> AFAIK, this is raised by psycopg2, and it means an asynchronous
>>>>> >> query was already executing on the connection.
>>>>> >> But I think sqlalchemy is thread safe here, since it was
>>>>> >> patched by eventlet; see details in
>>>>> >> eventlet/support/psycopg2_patcher.py.
>>>>> >> And we can see different green thread numbers in the log, e.g.:
>>>>> >> [pid:36365] [MainThread] [tid:122397712]
>>>>> >> [pid:36365] [MainThread] [tid:122397815]
>>>>> >> So I guess the connection pool within one process is safe.
>>>>> >>
>>>>> >> 2. nova-conductor is a Python service that forks several child
>>>>> >> processes:
>>>>> >> ps -elf | grep -i nova-conductor
>>>>> >> 30878 1 pool_s /usr/bin/nova-conductor
>>>>> >> 36364 1 ep_pol /usr/bin/nova-conductor
>>>>> >> 36365 1 futex_ /usr/bin/nova-conductor
>>>>> >> 36366 1 ep_pol /usr/bin/nova-conductor
>>>>> >> 36367 1 ep_pol /usr/bin/nova-conductor
>>>>> >> 36368 1 ep_pol /usr/bin/nova-conductor
>>>>> >>
>>>>> >> If nova-conductor is started with only one child, the problem
>>>>> >> does not happen. Does this mean the connection/engine CANNOT be
>>>>> >> shared in these child processes?
>>>>> >>
>>>>> >> Thanks.
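On question 2 above: DBAPI connections generally cannot be shared across a fork. The usual pattern, sketched here with SQLite purely so the example runs anywhere (the real deployment would be postgresql+psycopg2), is for each forked child to call engine.dispose() so it builds its own pool instead of reusing the parent's sockets. This is a sketch of the general pattern, not nova-conductor's actual code:

```python
import os
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))   # parent warms up the pool

pid = os.fork()
if pid == 0:
    # Child: never reuse pooled DBAPI connections inherited from the
    # parent; dispose the pool so it reconnects lazily, per-process.
    engine.dispose()
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    os._exit(0)

_, status = os.waitpid(pid, 0)
assert os.WEXITSTATUS(status) == 0   # child completed on its own pool
```

With psycopg2, a connection shared across processes means two processes touching the same socket and the same internal connection mutex, which is consistent with the pthread_mutex_lock hang in the pstack output above.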
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.