gr8web commented on issue #36920:
URL: https://github.com/apache/airflow/issues/36920#issuecomment-1981817654
Hello people.
Sorry, it looks like I was wrong. I just saw it again, connections getting
stuck in `idle in transaction` state in the database
and jobs not progressing.
This is the query I used to lookup the connections in the db:
```
select pid, state, usename, query, query_start, client_addr, client_port,
wait_event, wait_event_type
from pg_stat_activity
where pid in (
select pid from pg_locks l
join pg_class t on l.relation = t.oid
and t.relkind = 'r'
where t.relname = 'task_instance'
--and state = 'idle in transaction'
);
```

We did not see the issue for some time but now its just back, so my
assumption before that was that we had a misconfigured pgbouncer.
Since its back now I dont have an idea what could cause the problem, it
doesnt seem to be related to the db/pgbouncer settings at all.
we currently have
3 scheduler instances
32 workers
1 webserver
```
worker_concurrency = 16
parallelism = 64
max_active_runs_per_dag = 16
pgbouncer:
[databases]
* = host=dbhost port=5432 auth_user=user
[pgbouncer]
pool_mode = transaction
listen_port = 5432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/auth_file.txt
admin_users = airflow
stats_users_prefix = robot_
auth_query = SELECT * FROM pooler.user_lookup($1)
logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /var/run/pgbouncer/pgbouncer.pid
server_tls_sslmode = require
server_tls_ca_file = /etc/ssl/certs/pgbouncer.crt
server_tls_protocols = secure
client_tls_sslmode = require
client_tls_key_file = /etc/ssl/certs/pgbouncer.key
client_tls_cert_file = /etc/ssl/certs/pgbouncer.crt
log_connections = 0
log_disconnections = 0
# How many server connections to allow per user/database pair.
default_pool_size = 15
# How many additional connections to allow to a pool
reserve_pool_size = 7
# Maximum number of client connections allowed.
max_client_conn = 10000
# Do not allow more than this many connections per database (regardless of
# pool, i.e. user)
max_db_connections = 30
# If a client has been in "idle in transaction" state longer, it will be
# disconnected. [seconds]
idle_transaction_timeout = 600
# If login failed, because of failure from connect() or authentication that
# pooler waits this much before retrying to connect. Default is 15. [seconds]
server_login_retry = 5
# To ignore extra parameter in startup packet. By default only 'database' and
# 'user' are allowed, all others raise error. This is needed to tolerate
# overenthusiastic JDBC wanting to unconditionally set 'extra_float_digits=2'
# in startup packet.
ignore_startup_parameters = extra_float_digits,options
```
Almost everything are the defaults from the helm chart.
Worker and scheduler connect to the pgbouncer.
The only thing what happened to the jobs is that the tasks were cleared for
few days in the past.
I dont really have experience to debug things like that but maybe I can
still try to help somehow.
Let me know if I maybe missed something or you need to know something else.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]