gr8web commented on issue #36920: URL: https://github.com/apache/airflow/issues/36920#issuecomment-1981817654
Hello people. Sorry, it looks like I was wrong. I just saw it again, connections getting stuck in `idle in transaction` state in the database and jobs not progressing. This is the query I used to lookup the connections in the db: ``` select pid, state, usename, query, query_start, client_addr, client_port, wait_event, wait_event_type from pg_stat_activity where pid in ( select pid from pg_locks l join pg_class t on l.relation = t.oid and t.relkind = 'r' where t.relname = 'task_instance' --and state = 'idle in transaction' ); ``` ![idle_conn](https://github.com/apache/airflow/assets/1122741/ddd2b9cc-1d07-44dd-af9d-c29bd0aea300) We did not see the issue for some time but now its just back, so my assumption before that was that we had a misconfigured pgbouncer. Since its back now I dont have an idea what could cause the problem, it doesnt seem to be related to the db/pgbouncer settings at all. we currently have 3 scheduler instances 32 workers 1 webserver ``` worker_concurrency = 16 parallelism = 64 max_active_runs_per_dag = 16 pgbouncer: [databases] * = host=dbhost port=5432 auth_user=user [pgbouncer] pool_mode = transaction listen_port = 5432 listen_addr = * auth_type = md5 auth_file = /etc/pgbouncer/auth_file.txt admin_users = airflow stats_users_prefix = robot_ auth_query = SELECT * FROM pooler.user_lookup($1) logfile = /var/log/pgbouncer/pgbouncer.log pidfile = /var/run/pgbouncer/pgbouncer.pid server_tls_sslmode = require server_tls_ca_file = /etc/ssl/certs/pgbouncer.crt server_tls_protocols = secure client_tls_sslmode = require client_tls_key_file = /etc/ssl/certs/pgbouncer.key client_tls_cert_file = /etc/ssl/certs/pgbouncer.crt log_connections = 0 log_disconnections = 0 # How many server connections to allow per user/database pair. default_pool_size = 15 # How many additional connections to allow to a pool reserve_pool_size = 7 # Maximum number of client connections allowed. max_client_conn = 10000 # Do not allow more than this many connections per database (regardless of # pool, i.e. user) max_db_connections = 30 # If a client has been in "idle in transaction" state longer, it will be # disconnected. [seconds] idle_transaction_timeout = 600 # If login failed, because of failure from connect() or authentication that # pooler waits this much before retrying to connect. Default is 15. [seconds] server_login_retry = 5 # To ignore extra parameter in startup packet. By default only 'database' and # 'user' are allowed, all others raise error. This is needed to tolerate # overenthusiastic JDBC wanting to unconditionally set 'extra_float_digits=2' # in startup packet. ignore_startup_parameters = extra_float_digits,options ``` Almost everything are the defaults from the helm chart. Worker and scheduler connect to the pgbouncer. The only thing what happened to the jobs is that the tasks were cleared for few days in the past. I dont really have experience to debug things like that but maybe I can still try to help somehow. Let me know if I maybe missed something or you need to know something else. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org