Nickcw6 opened a new pull request, #7986: URL: https://github.com/apache/incubator-devlake/pull/7986
### Summary
What does this PR do?

Adds incremental data collection for the CircleCI plugin. I haven't switched the extractor or converter to incremental mode, as the collector change alone is enough to make a substantial difference in pipeline duration, but it is something to consider.

### Does this close any open issues?
N/A

### Screenshots
Full data collection:
<img width="1273" alt="Screenshot 2024-08-29 at 18 16 16" src="https://github.com/user-attachments/assets/f5ba528a-ee8d-42f6-976e-71e3dd444217">

Incremental data collection, run afterwards:
<img width="1253" alt="Screenshot 2024-08-29 at 18 28 47" src="https://github.com/user-attachments/assets/421aac1c-eeba-415d-bce5-062e7826b5a1">

### Other Information
I had to make some adjustments to the existing `NewStatefulApiCollectorForFinalizableEntity` function, namely:
- The original implementation assumed that all results on a page were either invalid (i.e. created before `createdAfter`) or valid (created after `createdAfter`). I extended this to check each item in turn where required, to prevent duplicate records.
- The CircleCI jobs API has no `created_at` property, so we have to use `started_at`, which may be null. I therefore also added checks for `time.IsZero()`. We still collect the data if this is zero.

There are some awkward design choices around the jobs collection worth noting. We use the `workflow/{id}/jobs` [endpoint](https://circleci.com/docs/api/v2/index.html#operation/listWorkflowJobs) to collect all jobs for a workflow in a single request. As mentioned above, this lacks a definitive `created_at` property, so we use `started_at`, which may be null. There are a couple of ways round this:

1. We swap to the `job/{id}` [endpoint](https://circleci.com/docs/api/v2/index.html#operation/getJobDetails).
   - This returns different fields, so it would be more work to change the extractor etc.
   - This returns a 404 for some statuses, so we'd never collect that data.
   - It would drastically increase the duration of what is already a slow collection (e.g. instead of making 1 request for 5 jobs, we'd need to make 5 separate requests).
2. If `started_at` is null for a selection of statuses we care about (e.g. `running`), we give the record a temporary timestamp of `time.Now()`.
   - Records would be recollected and corrected on the next run.
   - We'd be interfering with raw data in the DB, which is dirty and could confuse users.
3. We check for `IsZero()` and collect these records anyway.
   - Data is collected exactly as per the API.
   - Some of the jobs we collect have minimal value (e.g. we'd collect `blocked` or `unauthorized` jobs).
   - There are potentially unintended side effects for other plugins.
4. We use the `created_at` of the workflow (`created_date` on `_tool_circleci_workflows`) and extend `_tool_circleci_jobs` with a `workflow_created_date` column.
   - As we collect all jobs for a workflow simultaneously, we would only need to consider this timestamp.
   - Data is collected as per the API.

I couldn't get 4) to work nicely in the `GetCreated` function, so I have opted for 3), but I'm open to other ideas or suggestions on how this could be implemented.
