JWShuff opened a new issue, #8726:
URL: https://github.com/apache/incubator-devlake/issues/8726
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## What and why to refactor
The current `CalculateChangeLeadTime` implementation in
`backend/plugins/dora/tasks/change_lead_time_calculator.go` suffers from an N+1
query problem. For every pull request in a project's scope, three sequential
database queries are executed:
1. `getFirstCommit(pr.Id, db)` fetches the earliest commit for the PR
1. `getFirstReview(pr.Id, pr.AuthorId, db)` fetches the first non-author
review comment
1. `getDeploymentCommit(pr.MergeCommitSha, projectName, db)` fetches the
deployment associated with the merge commit
For a project with N pull requests, this results in 3N + 1 database
round-trips. In our environment, this drives change lead time calculation into
the 15+ hour range across our applications, making the DORA metrics pipeline a
significant bottleneck.
## Describe the solution you'd like
Replace the per-PR queries with three upfront batch queries that return maps
keyed by PR ID (or merge commit SHA for deployments), enabling O(1) lookups
during the per-PR iteration loop. Specifically:
1. `batchFetchFirstCommits(projectName, db)` Uses a subquery with
MIN(commit_authored_date) grouped by `pull_request_id`, joined back to
`pull_request_commits` to retrieve full commit records. Scoped to the project
via project_mapping. Returns `map[string]*code.PullRequestCommit`.
1. `batchFetchFirstReviews(projectName, db)` Uses a subquery with
MIN(created_date) grouped by `pull_request_id` (excluding the PR author's own
comments), joined back to `pull_request_comments`. Returns
`map[string]*code.PullRequestComment`.
1. `batchFetchDeployments(projectName, db)` Queries
`cicd_deployment_commits` joined with `commits_diffs` to map merge commit SHAs
to their first successful production deployment. Returns
`map[string]*devops.CicdDeploymentCommit`.
This looks like:
```go
// Before (3 queries per PR):
firstCommit, err := getFirstCommit(pr.Id, db)
firstReview, err := getFirstReview(pr.Id, pr.AuthorId, db)
deployment, err := getDeploymentCommit(pr.MergeCommitSha, projectName, db)
// After (map lookups):
firstCommit := firstCommitsMap[pr.Id]
firstReview := firstReviewsMap[pr.Id]
deployment := deploymentsMap[pr.MergeCommitSha]
```
This reduces the query pattern from 3N + 1 to a constant 4 queries
regardless of PR count. The existing calculation logic (coding time, pickup
time, review time, deploy time) is unchanged; only the data fetching strategy
changes. I'm not as familiar with the testing coverage in this part of the
application, but that the tests still pass with this implementation gives me
hope.
The implementation also adds timing logs for observability, we can remove if
it is preferred.
## Related issues
Issue #8361 from previous discoveries as we've extended the usage of this
tool.
## Additional context
Our implementation is available at [PR
#8714](https://github.com/apache/incubator-devlake/pull/8714). The change is
isolated to a single file (change_lead_time_calculator.go) with no changes to
the data model, API surface, or calculation logic. The batch query SQL mirrors
the existing per-PR query logic exactly; it's the same filtering and join
conditions, just grouped and executed once rather than per-row.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]