gr8web commented on issue #36920:
URL: https://github.com/apache/airflow/issues/36920#issuecomment-1981817654

   Hello people.
   
   Sorry, it looks like I was wrong. I just saw it again, connections getting 
stuck in `idle in transaction` state in the database 
   and jobs not progressing.
   
   This is the query I used to lookup the connections in the db:
   
   ```
   
   select pid, state, usename, query, query_start, client_addr, client_port, 
wait_event, wait_event_type
   from pg_stat_activity
   where pid in (
     select pid from pg_locks l
     join pg_class t on l.relation = t.oid
     and t.relkind = 'r'
     where t.relname = 'task_instance'
     --and state = 'idle in transaction'
   );
   ```
   
![idle_conn](https://github.com/apache/airflow/assets/1122741/ddd2b9cc-1d07-44dd-af9d-c29bd0aea300)
   
   
   We did not see the issue for some time but now its just back, so my 
assumption before that was that we had a misconfigured pgbouncer. 
   Since its back now I dont have an idea what could cause the problem, it 
doesnt seem to be related to the db/pgbouncer settings at all.
   
   we currently have 
   3 scheduler instances
   32 workers
   1 webserver
   ```
   worker_concurrency = 16
   parallelism = 64
   max_active_runs_per_dag = 16
   
   pgbouncer:
   
   [databases]
   * = host=dbhost port=5432 auth_user=user
   
   [pgbouncer]
   pool_mode = transaction
   listen_port = 5432
   listen_addr = *
   auth_type = md5
   auth_file = /etc/pgbouncer/auth_file.txt
   admin_users = airflow
   stats_users_prefix = robot_
   auth_query = SELECT * FROM pooler.user_lookup($1)
   logfile = /var/log/pgbouncer/pgbouncer.log
   pidfile = /var/run/pgbouncer/pgbouncer.pid
   
   server_tls_sslmode = require
   server_tls_ca_file = /etc/ssl/certs/pgbouncer.crt
   server_tls_protocols = secure
   client_tls_sslmode = require
   client_tls_key_file = /etc/ssl/certs/pgbouncer.key
   client_tls_cert_file = /etc/ssl/certs/pgbouncer.crt
   
   log_connections = 0
   log_disconnections = 0
   
   # How many server connections to allow per user/database pair.
   default_pool_size = 15
   
   # How many additional connections to allow to a pool
   reserve_pool_size = 7
   
   # Maximum number of client connections allowed.
   max_client_conn = 10000
   
   # Do not allow more than this many connections per database (regardless of
   # pool, i.e. user)
   max_db_connections = 30
   
   # If a client has been in "idle in transaction" state longer, it will be
   # disconnected. [seconds]
   idle_transaction_timeout = 600
   
   # If login failed, because of failure from connect() or authentication that
   # pooler waits this much before retrying to connect. Default is 15. [seconds]
   server_login_retry = 5
   
   # To ignore extra parameter in startup packet. By default only 'database' and
   # 'user' are allowed, all others raise error.  This is needed to tolerate
   # overenthusiastic JDBC wanting to unconditionally set 'extra_float_digits=2'
   # in startup packet.
   ignore_startup_parameters = extra_float_digits,options
   
   ```
   Almost everything are the defaults from the helm chart.
   Worker and scheduler connect to the pgbouncer.
   
   The only thing what happened to the jobs is that the tasks were cleared for 
few days in the past.
   
   I dont really have experience to debug things like that but maybe I can 
still try to help somehow.
   Let me know if I maybe missed something or you need to know something else.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to