Hi all,

Our production system is experiencing very strange behavior. At peak 
load, some of our Python processes get stuck in *<built-in 
method __enter__ of _thread.lock object at remote 0x7f8e021156c0>* at *with 
self._exec_once_mutex* in sqlalchemy/event/attr.py (here is the full stack 
trace <https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31> that 
I got via gdb with the Python extension).

Here is some high-level information about the production setup.

Python:            3.7.9
SQLAlchemy:        1.3.6
PyMySQL:           1.0.2
Connection string: mysql+pymysql://HOST:PORT/DB?connect_timeout=30&charset=utf8
Docker container where the Python processes run: Ubuntu 18.04.6
VM host where Docker runs: amzn2

When creating the engine, we use NullPool as the poolclass:

    engine_args['poolclass'] = NullPool
    engine = create_engine(SQL_ALCHEMY_CONN, **engine_args)
    setup_event_handlers(engine)
    Session = scoped_session(
        sessionmaker(autocommit=False,
                     autoflush=False,
                     bind=engine,
                     expire_on_commit=False))

Here is the setup_event_handlers code: 
https://github.com/apache/airflow/blob/1.10.4/airflow/utils/sqlalchemy.py#L42-L136
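
For reference, those handlers are registered through SQLAlchemy's event API. 
A simplified sketch of that kind of registration (not the exact Airflow code; 
the handler names and bodies below are illustrative only):

    from sqlalchemy import event

    def setup_event_handlers(engine):
        # Illustrative only: attach engine/pool-level listeners similar in
        # spirit to what the linked Airflow helper does.
        @event.listens_for(engine, "connect")
        def on_connect(dbapi_connection, connection_record):
            # runs once for each new DBAPI connection
            pass

        @event.listens_for(engine, "checkout")
        def on_checkout(dbapi_connection, connection_record, connection_proxy):
            # runs every time a connection is checked out of the pool
            pass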

There are 5 processes running in the container; please see this link 
<https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31?permalink_comment_id=4451451#gistcomment-4451451> 
for the Python process model running in the Docker container. The stuck Python 
process is created by its parent process (which is also Python) via 
`subprocess.Popen`; right after `subprocess.Popen`, a thread is created to 
read the log of the subprocess (please see the same link for more details).
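
In other words, the parent does something along these lines (a minimal 
sketch of the pattern described above; the child command and helper name are 
made up, not our real code):

    import subprocess
    import threading

    def _read_logs(proc):
        # drain the child's stdout in a dedicated thread so the pipe
        # never fills up and blocks the child
        for line in proc.stdout:
            print(line.decode().rstrip())

    proc = subprocess.Popen(
        ["python", "-c", "print('hello from the child process')"],  # stand-in for the real child
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    threading.Thread(target=_read_logs, args=(proc,), daemon=True).start()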
 
Reading the source code 
<https://github.com/sqlalchemy/sqlalchemy/blob/rel_1_3_6/lib/sqlalchemy/event/attr.py#L277> 
behind that stack trace, it looks like it tries to grab a lock, but I am not 
sure why it would fail to grab the lock, given that only one process uses the 
session.
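
As far as I can tell, that code is a double-checked locking pattern, roughly 
equivalent to the following (a simplified paraphrase of the linked 1.3.6 
source; the class and helper names here are illustrative, not the real ones):

    import threading

    class _ExecOnceStub:
        def __init__(self):
            self._exec_once_mutex = threading.Lock()
            self._exec_once = False

        def exec_once(self, *args, **kw):
            if not self._exec_once:
                # the stuck frame in our stack trace is blocked on
                # acquiring this mutex
                with self._exec_once_mutex:
                    if not self._exec_once:
                        try:
                            self._run_listeners(*args, **kw)
                        finally:
                            self._exec_once = True

        def _run_listeners(self, *args, **kw):
            pass  # stands in for dispatching the registered listeners once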

At peak load, there are around 200 Docker containers running on a 
single amzn2 host. From the database-side metrics, we do see some 
connections dropped at the peak. I am not sure whether this could cause 
the threading.Lock issue, but the two correlate. *Or is it possible that the 
event_handlers 
<https://github.com/apache/airflow/blob/1.10.4/airflow/utils/sqlalchemy.py#L42-L136> 
of the engine and the session query in the stack trace try to grab this lock 
at the same time?*

Potential related issues that I have checked: 

   - Race condition in dialect initialization 
   <https://github.com/sqlalchemy/sqlalchemy/issues/6337> 
   - A possible race condition between dialect initialization and query 
   execution <https://github.com/sqlalchemy/sqlalchemy/issues/6337>

However, I could not figure out why it happens in our system. We also tried 
to reproduce it in our stress test environment, but could not.

Could you please share some insights? Any suggestions would be really 
appreciated.

Thanks,

Ping

