Hi there,

Issue:
Would love to get pointers on an issue we have been seeing after we upgraded 
our Airflow installation from 1.8.0 to 1.10.1. The configuration we use is the 
same across both versions, but we now see task failures because the available 
DB connections get used up. The failures happen mainly when the scheduler tries 
to build a new DAG. The exceptions we see are (sample stack trace attached):

- psycopg2.OperationalError: FATAL: too many connections for role xxx
- sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: remaining 
connection slots are reserved for non-replication superuser connections
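To see which component is actually holding the connections when these errors fire, we have been running the query below against the Postgres `pg_stat_activity` view (standard Postgres; shown here as a generic sketch, not Airflow-specific):

```sql
-- Count open connections per role, client host, and state, to see
-- which Airflow component (scheduler, webserver, workers) holds them.
SELECT usename, client_addr, state, count(*)
FROM pg_stat_activity
GROUP BY usename, client_addr, state
ORDER BY count(*) DESC;
```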

Info:
Below are the settings that seem relevant to this behavior (our config file is 
also attached):
--------------------
sql_alchemy_pool_size = 5
sql_alchemy_pool_recycle = 3600
sql_alchemy_reconnect_timeout = 300
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = True
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 16
workers = 4
scheduler_zombie_task_threshold = 300
--------------------

Setup:
We use Postgres as the DB backend, and the connection limit for the Airflow 
user has been set to 100. Below is how the Airflow components are set up:

Node 1: Worker(8), webserver, scheduler 
Node 2: Worker(8), webserver
Node 3: Worker(8)
Node 4: Worker(8)
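To sanity-check whether the configuration alone can explain the exhaustion, here is a rough back-of-the-envelope sketch. It assumes each long-lived Airflow process keeps its own SQLAlchemy pool of sql_alchemy_pool_size connections, that each pool can temporarily grow by SQLAlchemy's default max_overflow of 10, and that the per-node process counts below match our reading of the setup above; all of these are assumptions, not measurements:

```python
# Rough estimate of DB connections, assuming each long-lived Airflow
# process maintains its own SQLAlchemy connection pool.
POOL_SIZE = 5      # sql_alchemy_pool_size from our config
MAX_OVERFLOW = 10  # SQLAlchemy QueuePool default (assumption)

# Process counts (assumption, from the node layout above):
schedulers = 1
gunicorn_workers = 2 * 4   # 2 webservers x workers = 4
celery_workers = 4 * 8     # 4 nodes x Worker(8)
processes = schedulers + gunicorn_workers + celery_workers

steady_state = processes * POOL_SIZE
worst_case = processes * (POOL_SIZE + MAX_OVERFLOW)
print(processes, steady_state, worst_case)  # 41 205 615
```

If the per-process-pool assumption holds, even the steady-state estimate exceeds our limit of 100, which is partly why we would like to understand how to estimate this properly.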

We could not find anything in the commits, JIRA, or the dev mailing list that 
would explain why Airflow 1.10.1 would use more connections than Airflow 
1.8.0. The only commit that seemed related, in 1.10.2, is 
https://github.com/apache/airflow/commit/959dd619d19223db3709fa4abcf52e8ee98bc079.
Since we don't know the root cause of this behavior, we are not sure whether 
upgrading to 1.10.2 will help. Is there a way to estimate the number of 
connections that will be used, given the configuration and setup? Or, failing 
that, to identify which settings most significantly affect it? Any help is 
greatly appreciated.
Regards,
Kiran
