mwisnicki opened a new issue, #68699:
URL: https://github.com/apache/airflow/issues/68699

   ### Under which category would you file this issue?
   
   Airflow Core
   
   ### Apache Airflow version
   
   3.2.2
   
   ### What happened and how to reproduce it?
   
   Apologies for AI slop but the issue is clearly reproducible.
   
   `<🤖> `
   ---
   
   
   Sending multiple concurrent `POST /api/v2/backfills` requests for the same 
DAG results in HTTP 500 with empty response body, but those failed requests 
partially commit `backfill_dag_run` rows to SQLite. The partially-created 
backfill then blocks future creation with "already running backfill".
   
   Root cause visible in API server logs:
   
       sqlite3.OperationalError: database is locked
       sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is 
locked
   
   `_create_runs_non_partitioned` inserts rows in a loop using `begin_nested()` 
savepoints. When SQLite's busy timeout fires mid-loop the exception propagates 
as HTTP 500, but rows flushed by earlier savepoints survive — leaving a 
partially-initialised backfill.
   
   My mise.toml
   
   ```tomls
   [tools]
   python = "3.12"
   
   [env]
   _.path = [".venv/bin"]
   AIRFLOW_HOME = "{{config_root}}"
   AIRFLOW__API__PORT = "48001"
   AIRFLOW__CORE__EXECUTION_API_SERVER_URL = "http://localhost:48001/execution/";
   AIRFLOW__CORE__LOAD_EXAMPLES = "false"
   AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL = "1"
   AIRFLOW__LOGGING__WORKER_LOG_SERVER_HOST = "127.0.0.1"
   AIRFLOW__SCHEDULER__MAX_DAGRUNS_TO_CREATE_PER_LOOP = "15"
   AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG = "15"
   AIRFLOW__CORE__PARALLELISM = "50"
   
   [tasks.setup]
   description = "Install Airflow into venv"
   run = """
     pip install "apache-airflow==3.2.2" \
       --constraint 
"https://raw.githubusercontent.com/apache/airflow/constraints-3.2.2/constraints-no-providers-3.12.txt";
   """
   ```
   
   
   **To reproduce:**
   
   1. Create a minimal DAG (`test_backfill_bug.py`):
   
           from airflow.sdk import dag, task
           from datetime import datetime
   
           @dag(
               dag_id="test_backfill_bug",
               schedule="@daily",
               start_date=datetime(2020, 1, 1),
               end_date=datetime(2022, 12, 31),
               catchup=False,
           )
           def test_backfill_bug():
               @task
               def noop():
                   pass
               noop()
   
           test_backfill_bug()
   
   2. Fire 5 concurrent backfill creation requests:
   
           TOKEN=$(curl -s -X POST "http://localhost:8080/auth/token"; \
             -H "Content-Type: application/json" \
             -d '{"username": "admin", "password": "admin"}' \
             | python3 -c "import sys,json; 
print(json.load(sys.stdin)['access_token'])")
   
           for i in 1 2 3 4 5; do
             curl -s -o /dev/null -w "[%{http_code}] " \
               -X POST "http://localhost:8080/api/v2/backfills"; \
               -H "Content-Type: application/json" \
               -H "Authorization: Bearer $TOKEN" \
               -d 
'{"dag_id":"test_backfill_bug","from_date":"2020-01-01T00:00:00Z","to_date":"2022-12-31T00:00:00Z","max_active_runs":1,"reprocess_behavior":"none"}'
 &
           done
           wait
   
   3. Check SQLite for partial data:
   
           python3 -c "
           import sqlite3, os
           conn = sqlite3.connect(os.path.expanduser('~/airflow/airflow.db'))
           cur = conn.cursor()
           cur.execute('SELECT id, created_at FROM backfill WHERE 
dag_id=\"test_backfill_bug\" ORDER BY id')
           for b_id, created_at in cur.fetchall():
               cur.execute('SELECT COUNT(*) FROM backfill_dag_run WHERE 
backfill_id=?', (b_id,))
               print(f'backfill {b_id}: {cur.fetchone()[0]}/1096 runs  
created_at={created_at}')
           conn.close()
           "
   
   Observed:
   
       [201] [500] [500] [409] [409]
       backfill 1: 1096/1096 runs  created_at=...
       backfill 2:    2/1096 runs  created_at=...   <- HTTP 500 but partial 
data committed
   
   
   ### What you think should happen instead?
   
   Either full success (201) or a clean failure with zero rows written. A 
request that returns HTTP 500 should never leave partial data in the database. 
Currently the partially-created backfill cannot be retried (409 "already 
running") and must be manually cleaned up from SQLite.
   
   ### Operating System
   
   macOS
   
   ### Deployment
   
   Virtualenv installation
   
   ### Apache Airflow Provider(s)
   
   _No response_
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Official Helm Chart version
   
   Not Applicable
   
   ### Kubernetes Version
   
   _No response_
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   Distinct from #61375 (premature completion, fixed in 3.2.2). This is a data 
integrity issue in the creation path — `_create_runs_non_partitioned` in 
`airflow/models/backfill.py` has no protection against partial commit on SQLite 
lock timeout.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to