Hi all,

Our production system is experiencing some very strange behavior. At peak load, we see some of our Python processes get stuck in:

    <built-in method __enter__ of _thread.lock object at remote 0x7f8e021156c0>

i.e. at the "with self._exec_once_mutex:" statement in sqlalchemy/event/attr.py (here is the full stack trace <https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31> that I obtained via gdb with the Python extension).
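For context, the code at that spot in sqlalchemy/event/attr.py is a double-checked locking pattern. The sketch below is my paraphrase of the rel_1_3_6 source (class and attribute names simplified), not a verbatim copy:

    import threading

    class ExecOnce:
        """Paraphrase of the exec_once pattern in sqlalchemy/event/attr.py."""

        def __init__(self, fn):
            self._fn = fn
            self._exec_once_mutex = threading.Lock()
            self._exec_once = False

        def exec_once(self, *args, **kw):
            # Listeners run at most once; a caller that races past the first
            # check still has to acquire the mutex before re-checking.
            if not self._exec_once:
                with self._exec_once_mutex:  # <- the frame our gdb trace is stuck in
                    if not self._exec_once:
                        self._fn(*args, **kw)
                        self._exec_once = True

As far as I understand, a process can only block there if the mutex is already held by another thread inside that same process, which is part of why this is so confusing.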
Here is some high-level information about the production setup:

- Python: 3.7.9
- SQLAlchemy: 1.3.6
- PyMySQL: 1.0.2
- Connection string: mysql+pymysql://HOST:PORT/DB?connect_timeout=30&charset=utf8
- Docker container where the Python processes run: Ubuntu 18.04.6
- VM host where Docker runs: amzn2

When creating the engine, we use NullPool as the pool class:

    engine_args['poolclass'] = NullPool
    engine = create_engine(SQL_ALCHEMY_CONN, **engine_args)
    setup_event_handlers(engine)
    Session = scoped_session(
        sessionmaker(autocommit=False, autoflush=False,
                     bind=engine, expire_on_commit=False))

Here is the setup_event_handlers code: <https://github.com/apache/airflow/blob/1.10.4/airflow/utils/sqlalchemy.py#L42-L136>.

There are 5 processes running in the container; please see this link <https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31?permalink_comment_id=4451451#gistcomment-4451451> for the Python process model inside the Docker container. The stuck Python process is created by its parent process (which is also Python) via subprocess.Popen; right after the Popen call, a thread is created to read the subprocess's log output (please see here for more info <https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31?permalink_comment_id=4451451#gistcomment-4451451>; there is also a short sketch at the end of this post).

Reading the source code <https://github.com/sqlalchemy/sqlalchemy/blob/rel_1_3_6/lib/sqlalchemy/event/attr.py#L277> behind that stack trace, it looks like it is trying to grab a lock, but I am not sure why it would fail to acquire it, given that only one process uses the session.

At peak load, there are around 200 Docker containers running on a single amzn2 host, and from the database-side metrics we do see some connections being dropped at the peak. I am not sure whether this could cause the threading.Lock issue, but the two correlate. *Or is it possible that the event handlers <https://github.com/apache/airflow/blob/1.10.4/airflow/utils/sqlalchemy.py#L42-L136> of the engine and the session query in the stack trace are trying to grab this lock at the same time?*

Potentially related issues that I have checked:

- Race condition in dialect initialization <https://github.com/sqlalchemy/sqlalchemy/issues/6337>
- A possible race condition between dialect initialization and query execution <https://github.com/sqlalchemy/sqlalchemy/issues/6337>

but I could not figure out why it would happen in our system. We also tried to reproduce the problem in our stress-test environment, but could not.

Could you please share some insights? Any suggestions would be really appreciated.

Thanks,
Ping
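PS: in case it helps, here is a minimal sketch of the process model described above, as I understand it. The command line and the read_subprocess_logs helper are placeholders of mine; the real code is in the linked gist comment:

    import subprocess
    import threading

    def read_subprocess_logs(proc):
        # Drain the child's stdout continuously so its pipe never fills up.
        for line in iter(proc.stdout.readline, b""):
            print(line.decode(), end="")

    # The parent spawns the worker, then immediately starts a thread to
    # stream its log output.
    proc = subprocess.Popen(
        ["python", "worker.py"],  # placeholder; the real command is in the gist
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    threading.Thread(target=read_subprocess_logs, args=(proc,), daemon=True).start()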