SavingFrame opened a new issue, #57344:
URL: https://github.com/apache/airflow/issues/57344

   ### Apache Airflow version
   
   3.1.0
   
   ### If "Other Airflow 2/3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   The DagRuns page takes 20-30 seconds to load when there are approximately 7 
million dag runs in the database. The performance bottleneck is caused by 
ordering results by start_date, which lacks a database index.
   
   ### What you think should happen instead?
   
   The DagRuns page should load in a reasonable time (under a few seconds) even 
with millions of records.
   
   ### How to reproduce
   
   1. Have a database with ~7 million dag runs
   2. Navigate to the DagRuns page in the Airflow UI
   3. The same performance problem can also be seen on other pages in the Airflow UI
   
   ### Operating System
   
   Debian
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Virtualenv installation
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   This is an example of a query that we use to get all DagRuns (I removed 
filtering by dag_id to make the SQL shorter; it doesn't affect performance):
   ```sql
   SELECT dag_run.state,
          dag_run.id,
          dag_run.dag_id,
          dag_run.queued_at,
          dag_run.logical_date,
          dag_run.start_date,
          dag_run.end_date,
          dag_run.run_id,
          dag_run.creating_job_id,
          dag_run.run_type,
          dag_run.triggered_by,
          dag_run.triggering_user_name,
          dag_run.conf,
          dag_run.data_interval_start,
          dag_run.data_interval_end,
          dag_run.run_after,
          dag_run.last_scheduling_decision,
          dag_run.log_template_id,
          dag_run.updated_at,
          dag_run.clear_number,
          dag_run.backfill_id,
          dag_run.bundle_version,
          dag_run.scheduled_by_job_id,
          dag_run.context_carrier,
          dag_run.span_status,
          dag_run.created_dag_version_id
   FROM dag_run
   ORDER BY CASE WHEN (dag_run.start_date IS NOT NULL) THEN 0 ELSE 1 END, 
dag_run.start_date DESC, dag_run.id DESC
   LIMIT 50 OFFSET 0
   ```
   Explain analyze:
   ```
   Limit  (cost=525453.24..525459.07 rows=50 width=1960) (actual 
time=11889.455..11907.444 rows=50 loops=1)
     ->  Gather Merge  (cost=525453.24..1196094.70 rows=5747954 width=1960) 
(actual time=11889.454..11907.439 rows=50 loops=1)
           Workers Planned: 2
           Workers Launched: 2
           ->  Sort  (cost=524453.22..531638.16 rows=2873977 width=1960) 
(actual time=11855.090..11855.095 rows=50 loops=3)
                  Sort Key: (CASE WHEN (start_date IS NOT NULL) THEN 0 ELSE 1 END), start_date DESC, id DESC
                 Sort Method: top-N heapsort  Memory: 62kB
                 Worker 0:  Sort Method: top-N heapsort  Memory: 56kB
                 Worker 1:  Sort Method: top-N heapsort  Memory: 50kB
                 ->  Parallel Seq Scan on dag_run  (cost=0.00..428981.77 
rows=2873977 width=1960) (actual time=0.023..7671.922 rows=2290684 loops=3)
   Planning Time: 0.118 ms
   Execution Time: 11907.490 ms
   ```
   It takes 12 seconds to retrieve 50 records.
   
   Let's remove ordering by start_date and compare the result:
   Query:
   ```sql
   EXPLAIN ANALYSE
   (SELECT dag_run.state,
           dag_run.id,
           dag_run.dag_id,
           dag_run.queued_at,
           dag_run.logical_date,
           dag_run.start_date,
           dag_run.end_date,
           dag_run.run_id,
           dag_run.creating_job_id,
           dag_run.run_type,
           dag_run.triggered_by,
           dag_run.triggering_user_name,
           dag_run.conf,
           dag_run.data_interval_start,
           dag_run.data_interval_end,
           dag_run.run_after,
           dag_run.last_scheduling_decision,
           dag_run.log_template_id,
           dag_run.updated_at,
           dag_run.clear_number,
           dag_run.backfill_id,
           dag_run.bundle_version,
           dag_run.scheduled_by_job_id,
           dag_run.context_carrier,
           dag_run.span_status,
           dag_run.created_dag_version_id
    FROM dag_run
    ORDER BY dag_run.id DESC
    LIMIT 50 OFFSET 0)
   ```
    Explain analyze:
   ```
   Limit  (cost=0.43..4.40 rows=50 width=1956) (actual time=13.976..14.019 
rows=50 loops=1)
     ->  Index Scan Backward using dag_run_pkey on dag_run  
(cost=0.43..547347.15 rows=6897544 width=1956) (actual time=13.975..14.013 
rows=50 loops=1)
   Planning Time: 0.088 ms
   Execution Time: 14.043 ms
   ```
   Execution time dropped from ~12 seconds to ~14 ms.
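   For context, the `CASE WHEN ... THEN 0 ELSE 1 END` in the ORDER BY is just a hand-written "nulls last" sort; on PostgreSQL the same ordering can be expressed as:
   ```sql
   -- Equivalent ordering on PostgreSQL: NULL start_dates last, newest runs first
   ORDER BY dag_run.start_date DESC NULLS LAST, dag_run.id DESC
   ```
   This spelling also matters for indexing: a plain btree index on start_date can satisfy the `NULLS LAST` form directly, while the CASE expression would need a matching expression index.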
   
   The problem here is that we're ordering by a column that doesn't have an 
   index.
   Also, I think the home page and the DAG page order/filter by start_date as 
   well; all of these pages are quite slow when there are many dag runs.
   
   Should we add an index on the start_date column? I don't see any other 
   column that we could use for ordering.
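   If we go that route, a sketch of what the index could look like (assuming PostgreSQL; the index name here is hypothetical, and the query's ORDER BY would need to use the equivalent `start_date DESC NULLS LAST, id DESC` form rather than the CASE expression so the planner can use the index):
   ```sql
   -- Hypothetical index name; CONCURRENTLY avoids blocking writes during creation
   CREATE INDEX CONCURRENTLY idx_dag_run_start_date_id
       ON dag_run (start_date DESC NULLS LAST, id DESC);
   ```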
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
