anandsummer28 commented on issue #57470:
URL: https://github.com/apache/airflow/issues/57470#issuecomment-3909028849
> Is there a way to reproduce the error instead of waiting to reappear after
few days ?
Posting this in case it helps others hit by the same error. We saw
`PendingRollbackError` on login/UI refresh/logout with **Airflow 3.1.1** (API
server). Below: how we reproduced it and what fixed it.
**Environment:** Airflow 3.1.1, API server on Kubernetes (same behavior
should apply to non-K8s if you can set pool recycle and terminate DB
connections).
---
## How to reproduce
1. **Shorten the DB pool recycle** so connections age quickly (replace
`default` with your namespace if needed):
```bash
kubectl set env deployment/airflow-api-server \
AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE=30 \
-n default
```
Wait for the rollout to finish.
2. **Generate load and "idle in transaction" connections** by calling the
auth token API in a loop (replace `YOUR_AIRFLOW_URL` with your API base URL):
```bash
AIRFLOW_URL="YOUR_AIRFLOW_URL" # e.g. https://your-elb.amazonaws.com
for i in $(seq 1 80); do
CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST
"$AIRFLOW_URL/auth/token" \
-H "Content-Type: application/json" \
-d '{"username": "admin", "password": "YOURPASSWORD"}')
echo "$i $CODE"
sleep 0.5
done
```
You’ll typically see `201` for most requests.
3. **Connect to the Airflow metadata database** (Postgres) and list
connections:
```sql
SELECT pid, usename, application_name, state, state_change
FROM pg_stat_activity
WHERE datname = 'airflow';
```
You should see some `idle` and/or `idle in transaction` rows from the app.
4. **Terminate both idle and idle-in-transaction backends** (this leaves
your current psql session alone):
```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'airflow'
AND state IN ('idle', 'idle in transaction')
AND pid <> pg_backend_pid();
```
5. **Trigger the error:**
- Run the token loop again and/or
- In the Airflow UI (while already logged in), **refresh the page** or
click **Logout**.
You should see **500 Internal Server Error** and in the API server logs:
`PendingRollbackError` and stack traces in `airflow.providers.fab.www.app` /
`security_manager` / `get_user_by_id`. The UI can stay broken until the pod is
restarted.
**Note:** Terminating only `idle in transaction` was not enough for us;
terminating **both** `idle` and `idle in transaction` made the failure
reproducible.
---
## Fix that worked: FAB 3.0.2 + pool settings
1. **Pin the FAB provider** to `apache-airflow-providers-fab==3.0.2` and
ensure it is used at API server (and webserver, if used) startup.
2. **Set DB pool options:**
`AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING=True` and
`AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE=1800`
After applying this, the same reproduction steps either no longer hit the
error or, when a 500 occurred, the UI recovered on the next request instead of
staying stuck.
**Example for the official Airflow Helm chart** (override `apiServer.args` /
`apiServer.env` and same for `webserver` if you use it):
```yaml
airflow:
config:
fab:
enable_proxy_fix: "True"
session_lifetime_minutes: "1440"
apiServer:
args:
- bash
- "-c"
- |
pip install --no-cache-dir "apache-airflow-providers-fab==3.0.2" \
&& exec airflow api-server
env:
- name: AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING
value: "True"
- name: AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE
value: "1800"
webserver: # if you use the webserver
args:
- bash
- "-c"
- |
pip install --no-cache-dir "apache-airflow-providers-fab==3.0.2" \
&& exec airflow webserver
env:
- name: AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING
value: "True"
- name: AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE
value: "1800"
```
The session-cleanup fixes for this issue are in
**apache-airflow-providers-fab 3.1.2+**. If problems persist on 3.0.2, try
`>=3.1.2` (e.g. `3.1.2` or `3.2.0`) with the same pool settings.
---
## Summary
| Step | Action |
|------|--------|
| **Reproduce** | Set `POOL_RECYCLE=30`, hammer `/auth/token`, then in
Postgres terminate **idle** and **idle in transaction** backends, then refresh
UI or logout. |
| **Fix** | Use `apache-airflow-providers-fab==3.0.2` (or `>=3.1.2`) at
startup + `PRE_PING=True` and `POOL_RECYCLE=1800`. |
Hope this helps anyone else hitting the same error.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]