SavingFrame opened a new issue, #57344:
URL: https://github.com/apache/airflow/issues/57344
### Apache Airflow version
3.1.0
### If "Other Airflow 2/3 version" selected, which one?
_No response_
### What happened?
The DagRuns page takes 20-30 seconds to load when there are approximately 7
million dag runs in the database. The performance bottleneck is caused by
ordering results by start_date, which lacks a database index.
### What you think should happen instead?
The DagRuns page should load in a reasonable time (under a few seconds) even
with millions of records.
### How to reproduce
1. Have a database with ~7 million dag runs
2. Navigate to the DagRuns page in the Airflow UI
3. Also you can see performance problem with other pages in the Airflow UI
### Operating System
Debian
### Versions of Apache Airflow Providers
_No response_
### Deployment
Virtualenv installation
### Deployment details
_No response_
### Anything else?
This is an example of a query that we use to get all DagRuns (I removed
filtering by dag_id to make the SQL shorter; it doesn't affect performance):
```sql
SELECT dag_run.state,
dag_run.id,
dag_run.dag_id,
dag_run.queued_at,
dag_run.logical_date,
dag_run.start_date,
dag_run.end_date,
dag_run.run_id,
dag_run.creating_job_id,
dag_run.run_type,
dag_run.triggered_by,
dag_run.triggering_user_name,
dag_run.conf,
dag_run.data_interval_start,
dag_run.data_interval_end,
dag_run.run_after,
dag_run.last_scheduling_decision,
dag_run.log_template_id,
dag_run.updated_at,
dag_run.clear_number,
dag_run.backfill_id,
dag_run.bundle_version,
dag_run.scheduled_by_job_id,
dag_run.context_carrier,
dag_run.span_status,
dag_run.created_dag_version_id
FROM dag_run
ORDER BY CASE WHEN (dag_run.start_date IS NOT NULL) THEN 0 ELSE 1 END,
dag_run.start_date DESC, dag_run.id DESC
LIMIT 50 OFFSET 0
```
Explain analyze:
```
Limit (cost=525453.24..525459.07 rows=50 width=1960) (actual
time=11889.455..11907.444 rows=50 loops=1)
-> Gather Merge (cost=525453.24..1196094.70 rows=5747954 width=1960)
(actual time=11889.454..11907.439 rows=50 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=524453.22..531638.16 rows=2873977 width=1960)
(actual time=11855.090..11855.095 rows=50 loops=3)
" Sort Key: (CASE WHEN (start_date IS NOT NULL) THEN 0 ELSE 1
END), start_date DESC, id DESC"
Sort Method: top-N heapsort Memory: 62kB
Worker 0: Sort Method: top-N heapsort Memory: 56kB
Worker 1: Sort Method: top-N heapsort Memory: 50kB
-> Parallel Seq Scan on dag_run (cost=0.00..428981.77
rows=2873977 width=1960) (actual time=0.023..7671.922 rows=2290684 loops=3)
Planning Time: 0.118 ms
Execution Time: 11907.490 ms
```
It takes 12 seconds to retrieve 50 records.
Let's remove ordering by start_date and compare the result:
Query:
```sql
EXPLAIN ANALYSE
(SELECT dag_run.state,
dag_run.id,
dag_run.dag_id,
dag_run.queued_at,
dag_run.logical_date,
dag_run.start_date,
dag_run.end_date,
dag_run.run_id,
dag_run.creating_job_id,
dag_run.run_type,
dag_run.triggered_by,
dag_run.triggering_user_name,
dag_run.conf,
dag_run.data_interval_start,
dag_run.data_interval_end,
dag_run.run_after,
dag_run.last_scheduling_decision,
dag_run.log_template_id,
dag_run.updated_at,
dag_run.clear_number,
dag_run.backfill_id,
dag_run.bundle_version,
dag_run.scheduled_by_job_id,
dag_run.context_carrier,
dag_run.span_status,
dag_run.created_dag_version_id
FROM dag_run
ORDER BY dag_run.id DESC
LIMIT 50 OFFSET 0)
```
Explain analyze:
```
Limit (cost=0.43..4.40 rows=50 width=1956) (actual time=13.976..14.019
rows=50 loops=1)
-> Index Scan Backward using dag_run_pkey on dag_run
(cost=0.43..547347.15 rows=6897544 width=1956) (actual time=13.975..14.013
rows=50 loops=1)
Planning Time: 0.088 ms
Execution Time: 14.043 ms
```
Execution time reduced from 12 seconds to 14ms.
The problem here is that we're ordering by a column that doesn't have an
index.
Also, I think the main and dag page use ordering/filtration by start_date as
well. All these pages are quite slow in case if you have many dagruns.
Should we add an index for the single start_date column? I don't see any
other column that we can use for ordering.
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]