henry3260 commented on PR #67549:
URL: https://github.com/apache/airflow/pull/67549#issuecomment-4553244404

   > Thanks for the review @henry3260!
   > 
   > Re: _misleading percentages when capped_ — The per-state counts are still 
capped at 1000, and the frontend already shows "N+" when a count reaches that 
cap (see `state_count_limit` in the response). The uncapped total just provides 
a better denominator so that non-capped states get accurate percentages. Capped 
states will show an understated percentage, but the "1000+" indicator signals 
the uncertainty. This is strictly better than the current behavior where all 
percentages are wrong (even for non-capped states) because the denominator 
itself is wrong.
   > 
   > Re: _expensive COUNT(_) queries* — Both `dag_run_total_count` and 
`task_instance_total_count` are simple `COUNT(*)` queries that hit the same 
indexed filters (`dag_id`, `start_date`, `end_date`) already used by the capped 
counting subqueries. In practice these are index-only scans that execute in 
single-digit milliseconds, adding negligible overhead to the overall endpoint.
   > 
   > An alternative would be to remove the caps entirely, but that was 
intentionally avoided to keep the endpoint fast for deployments with millions 
of task instances. The current approach is a minimal, pragmatic improvement.
   
   IMO, we cannot assume these queries are cheap just because they are simple 
`COUNT(*)` queries. Am I missing something, or do we not have a composite index 
that matches this filter shape, such as `(dag_id, start_date, end_date)`?
   
   My concern is that the existing capped queries are bounded by 
`STATE_COUNT_CAP`, while these new total count queries need to count every 
matching Dag run / task instance. For example, if there are 100000 matching Dag 
runs in the selected time range, and each Dag run has 50 task instances, then 
`ti_total` may need to count 5000000 task instance rows.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to