JWShuff opened a new issue, #8726:
URL: https://github.com/apache/incubator-devlake/issues/8726

   <!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
   -->
   
   ## What and why to refactor
   The current `CalculateChangeLeadTime` implementation in 
`backend/plugins/dora/tasks/change_lead_time_calculator.go` suffers from an N+1 
query problem. For every pull request in a project's scope, three sequential 
database queries are executed:
   1. `getFirstCommit(pr.Id, db)` fetches the earliest commit for the PR
   1. `getFirstReview(pr.Id, pr.AuthorId, db)` fetches the first non-author 
review comment
   1. `getDeploymentCommit(pr.MergeCommitSha, projectName, db)` fetches the 
deployment associated with the merge commit
   
   For a project with N pull requests, this results in 3N + 1 database 
round-trips. In our environment, this drives change lead time calculation into 
the 15+ hour range across our applications, making the DORA metrics pipeline a 
significant bottleneck.
   
   ## Describe the solution you'd like
   Replace the per-PR queries with three upfront batch queries that return maps 
keyed by PR ID (or merge commit SHA for deployments), enabling O(1) lookups 
during the per-PR iteration loop. Specifically:
   1. `batchFetchFirstCommits(projectName, db)` Uses a subquery with 
MIN(commit_authored_date) grouped by `pull_request_id`, joined back to 
`pull_request_commits` to retrieve full commit records. Scoped to the project 
via project_mapping. Returns `map[string]*code.PullRequestCommit`.
   1. `batchFetchFirstReviews(projectName, db)` Uses a subquery with 
MIN(created_date) grouped by `pull_request_id` (excluding the PR author's own 
comments), joined back to `pull_request_comments`. Returns 
`map[string]*code.PullRequestComment`.
   1. `batchFetchDeployments(projectName, db)` Queries 
`cicd_deployment_commits` joined with `commits_diffs` to map merge commit SHAs 
to their first successful production deployment. Returns 
`map[string]*devops.CicdDeploymentCommit`.
   
   This looks like:
   ```go
   // Before (3 queries per PR):
   firstCommit, err := getFirstCommit(pr.Id, db)
   firstReview, err := getFirstReview(pr.Id, pr.AuthorId, db)
   deployment, err := getDeploymentCommit(pr.MergeCommitSha, projectName, db)
   
   // After (map lookups):
   firstCommit := firstCommitsMap[pr.Id]
   firstReview := firstReviewsMap[pr.Id]
   deployment := deploymentsMap[pr.MergeCommitSha]
   ```
   
   This reduces the query pattern from 3N + 1 to a constant 4 queries 
regardless of PR count. The existing calculation logic (coding time, pickup 
time, review time, deploy time) is unchanged; only the data fetching strategy 
changes. I'm not as familiar with the testing coverage in this part of the 
application, but that the tests still pass with this implementation gives me 
hope.
   
   The implementation also adds timing logs for observability, we can remove if 
it is preferred.
   
   ## Related issues
   Issue #8361 from previous discoveries as we've extended the usage of this 
tool.
   
   ## Additional context
   Our implementation is available at [PR 
#8714](https://github.com/apache/incubator-devlake/pull/8714). The change is 
isolated to a single file (change_lead_time_calculator.go) with no changes to 
the data model, API surface, or calculation logic. The batch query SQL mirrors 
the existing per-PR query logic exactly; it's the same filtering and join 
conditions, just grouped and executed once rather than per-row.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to