mwisnicki opened a new issue, #68699:
URL: https://github.com/apache/airflow/issues/68699
### Under which category would you file this issue?
Airflow Core
### Apache Airflow version
3.2.2
### What happened and how to reproduce it?
Apologies for AI slop but the issue is clearly reproducible.
`<🤖> `
---
Sending multiple concurrent `POST /api/v2/backfills` requests for the same
DAG results in HTTP 500 with empty response body, but those failed requests
partially commit `backfill_dag_run` rows to SQLite. The partially-created
backfill then blocks future creation with "already running backfill".
Root cause visible in API server logs:
sqlite3.OperationalError: database is locked
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is
locked
`_create_runs_non_partitioned` inserts rows in a loop using `begin_nested()`
savepoints. When SQLite's busy timeout fires mid-loop the exception propagates
as HTTP 500, but rows flushed by earlier savepoints survive — leaving a
partially-initialised backfill.
My mise.toml
```tomls
[tools]
python = "3.12"
[env]
_.path = [".venv/bin"]
AIRFLOW_HOME = "{{config_root}}"
AIRFLOW__API__PORT = "48001"
AIRFLOW__CORE__EXECUTION_API_SERVER_URL = "http://localhost:48001/execution/"
AIRFLOW__CORE__LOAD_EXAMPLES = "false"
AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL = "1"
AIRFLOW__LOGGING__WORKER_LOG_SERVER_HOST = "127.0.0.1"
AIRFLOW__SCHEDULER__MAX_DAGRUNS_TO_CREATE_PER_LOOP = "15"
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG = "15"
AIRFLOW__CORE__PARALLELISM = "50"
[tasks.setup]
description = "Install Airflow into venv"
run = """
pip install "apache-airflow==3.2.2" \
--constraint
"https://raw.githubusercontent.com/apache/airflow/constraints-3.2.2/constraints-no-providers-3.12.txt"
"""
```
**To reproduce:**
1. Create a minimal DAG (`test_backfill_bug.py`):
from airflow.sdk import dag, task
from datetime import datetime
@dag(
dag_id="test_backfill_bug",
schedule="@daily",
start_date=datetime(2020, 1, 1),
end_date=datetime(2022, 12, 31),
catchup=False,
)
def test_backfill_bug():
@task
def noop():
pass
noop()
test_backfill_bug()
2. Fire 5 concurrent backfill creation requests:
TOKEN=$(curl -s -X POST "http://localhost:8080/auth/token" \
-H "Content-Type: application/json" \
-d '{"username": "admin", "password": "admin"}' \
| python3 -c "import sys,json;
print(json.load(sys.stdin)['access_token'])")
for i in 1 2 3 4 5; do
curl -s -o /dev/null -w "[%{http_code}] " \
-X POST "http://localhost:8080/api/v2/backfills" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d
'{"dag_id":"test_backfill_bug","from_date":"2020-01-01T00:00:00Z","to_date":"2022-12-31T00:00:00Z","max_active_runs":1,"reprocess_behavior":"none"}'
&
done
wait
3. Check SQLite for partial data:
python3 -c "
import sqlite3, os
conn = sqlite3.connect(os.path.expanduser('~/airflow/airflow.db'))
cur = conn.cursor()
cur.execute('SELECT id, created_at FROM backfill WHERE
dag_id=\"test_backfill_bug\" ORDER BY id')
for b_id, created_at in cur.fetchall():
cur.execute('SELECT COUNT(*) FROM backfill_dag_run WHERE
backfill_id=?', (b_id,))
print(f'backfill {b_id}: {cur.fetchone()[0]}/1096 runs
created_at={created_at}')
conn.close()
"
Observed:
[201] [500] [500] [409] [409]
backfill 1: 1096/1096 runs created_at=...
backfill 2: 2/1096 runs created_at=... <- HTTP 500 but partial
data committed
### What you think should happen instead?
Either full success (201) or a clean failure with zero rows written. A
request that returns HTTP 500 should never leave partial data in the database.
Currently the partially-created backfill cannot be retried (409 "already
running") and must be manually cleaned up from SQLite.
### Operating System
macOS
### Deployment
Virtualenv installation
### Apache Airflow Provider(s)
_No response_
### Versions of Apache Airflow Providers
_No response_
### Official Helm Chart version
Not Applicable
### Kubernetes Version
_No response_
### Helm Chart configuration
_No response_
### Docker Image customizations
_No response_
### Anything else?
Distinct from #61375 (premature completion, fixed in 3.2.2). This is a data
integrity issue in the creation path — `_create_runs_non_partitioned` in
`airflow/models/backfill.py` has no protection against partial commit on SQLite
lock timeout.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]