Thanks, Mike, for your kind response. Understood that to look for deadlocks we would need to see tracebacks for all threads.
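To make sure I apply the fork-safety advice correctly, here is a minimal sketch of what I plan to try: initialize the engine once in the parent, then reset it in each child before it does any work. This is only a sketch; `"sqlite://"` stands in for our real `mysql+pymysql` URL so it is self-contained, and `child_work` is a placeholder name for the real subprocess entry point:

```python
import multiprocessing

from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# "sqlite://" stands in for our real mysql+pymysql URL, just to keep
# this sketch self-contained; NullPool matches our production setting.
engine = create_engine("sqlite://", poolclass=NullPool)

# Run the one-time dialect/event initialization in the parent, before
# any subprocess exists, so the _exec_once_mutex path is never first
# entered concurrently across processes.
engine.connect().close()


def child_work():
    # child_work is a placeholder for the real subprocess entry point.
    # On SQLAlchemy 1.3 (our version), dispose() takes no arguments;
    # newer releases offer dispose(close=False) to forget connections
    # inherited from the parent without closing them on its behalf.
    engine.dispose()
    with engine.connect() as conn:
        assert conn.execute(text("SELECT 1")).scalar() == 1


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=child_work) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    assert all(p.exitcode == 0 for p in procs)
```

One thing I notice while writing this: since our stuck worker is created via `subprocess.Popen` (fork + exec of a fresh interpreter) rather than `multiprocessing`, the child does not inherit the parent's engine object at all and builds its own, so the dispose step would only matter for workers created by plain forking.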
Please see this link <https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31?permalink_comment_id=4453633#gistcomment-4453633>: there are 48 threads, many of them `blas_thread_server`, which I am not sure where they come from. For the engine management, it looks like we should be fine since we are using NullPool <https://docs.sqlalchemy.org/en/13/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork>; I will double-check that as well. Regarding Airflow, it has started to support SQLAlchemy 1.4 and above.

Thanks,
Ping

On Sat, Jan 28, 2023 at 8:23 AM Mike Bayer <mike_not_on_goo...@zzzcomputing.com> wrote:

> that particular routine runs exactly once, when the engine is used the very
> first time. I only see one traceback here, and it does not seem to indicate
> the lock is being called in a re-entrant fashion, so to look for deadlocks
> we would need to see tracebacks for all threads.
>
> if there is some unusual effect of creating processes here, ensure this
> routine runs in the parent process exactly once. if you have a
> create_engine() that is then being spun out into separate processes, do
> this init *before* you make any of those processes:
>
> engine = create_engine(...)
> engine.connect().close()  # <--- engine is initialized
>
> spawn_new_processes()  # now spawn subprocesses
>
> next, within each child process, set up the engine so that all new
> connections start fresh before the child process does its work, by
> calling .dispose(close=False); within the child process, any connections
> inherited from the parent are then forgotten:
>
> def child_process():
>     engine.dispose(close=False)
>
> https://docs.sqlalchemy.org/en/20/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork
>
> next, w/ airflow: I have not used airflow, but the code you point to there
> looks like the old "pre-ping" recipe, which is no longer needed, as
> SQLAlchemy now includes the pre-ping option. Report an issue to apache
> airflow that they should support SQLAlchemy 1.4 and above and remove this
> old event handler.
>
> Finally, when the lock does actually occur, you'd also want to look at what
> the actual database and/or proxy servers are up to; if these servers are
> blocking new connections from proceeding, that would also be a potential
> cause of what you are seeing.
>
> On Sat, Jan 28, 2023, at 1:02 AM, Ping Zhang wrote:
>
> Hi all,
>
> Our production system is experiencing a very strange behavior. At peak
> load, we see that some of our python processes were stuck on *<built-in
> method __enter__ of _thread.lock object at remote 0x7f8e021156c0>* in
> *with self._exec_once_mutex* in sqlalchemy/event/attr.py (here is the full
> stack trace
> <https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31> that I
> got via gdb with the python extension).
>
> Here is some high-level information about the production setup.
> Python: 3.7.9
> SQLAlchemy: 1.3.6
> PyMySQL: 1.0.2
> Connection string: mysql+pymysql://HOST:PORT/DB?connect_timeout=30&charset=utf8
> Docker container where the python processes are running: Ubuntu 18.04.6
> VM host version where docker runs: amzn2
>
> When creating the engine, we use NullPool as the pool class:
>
> engine_args['poolclass'] = NullPool
> engine = create_engine(SQL_ALCHEMY_CONN, **engine_args)
> setup_event_handlers(engine)
> Session = scoped_session(
>     sessionmaker(autocommit=False,
>                  autoflush=False,
>                  bind=engine,
>                  expire_on_commit=False))
>
> # here is the setup_event_handlers code
> <https://github.com/apache/airflow/blob/1.10.4/airflow/utils/sqlalchemy.py#L42-L136>.
>
> There are 5 processes running in the container. Please see this link
> <https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31?permalink_comment_id=4451451#gistcomment-4451451>
> for the python process model running in the docker container. The stuck
> python process is created by its parent process (which is also python) via
> `subprocess.Popen`; right after `subprocess.Popen`, a thread is created to
> read the log of the subprocess (please see here for more info
> <https://gist.github.com/pingzh/8c98afa5421f03dc82af7de8910d4a31?permalink_comment_id=4451451#gistcomment-4451451>).
> Reading the source code
> <https://github.com/sqlalchemy/sqlalchemy/blob/rel_1_3_6/lib/sqlalchemy/event/attr.py#L277>
> of that stack trace, it looks like it tries to grab a lock, but I am not
> sure why it would fail to grab the lock, given that only one process uses
> the session.
>
> At peak load, there are around ~200 docker containers running on a
> single amzn2 host. From the metrics on the database side, we do see some
> connections dropped at the peak. I am not sure whether this would
> cause that threading.lock issue, but they correlate.
> *Or is it possible that the event handlers
> <https://github.com/apache/airflow/blob/1.10.4/airflow/utils/sqlalchemy.py#L42-L136>
> of the engine and the session query in the stack trace try to grab this
> lock at the same time?*
>
> Potential related issues that I have checked:
>
> - Race condition in dialect initialization
>   <https://github.com/sqlalchemy/sqlalchemy/issues/6337>
> - A possible race condition between dialect initialization and query
>   execution <https://github.com/sqlalchemy/sqlalchemy/issues/6337>
>
> but I could not figure out why it happens in our system. We also tried to
> reproduce it in our stress-test env; however, we could not reproduce it.
>
> Could you please share some insights? Any suggestions would be really
> appreciated.
>
> Thanks,
>
> Ping
>
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full
> description.
> ---
> You received this message because you are subscribed to the Google Groups
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to sqlalchemy+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/sqlalchemy/2979ec01-e3be-4c5d-b4aa-50a6f4fceee4n%40googlegroups.com
> <https://groups.google.com/d/msgid/sqlalchemy/2979ec01-e3be-4c5d-b4aa-50a6f4fceee4n%40googlegroups.com?utm_medium=email&utm_source=footer>.