[PR] feat(circleci-plugin): incremental data collection [incubator-devlake]

via GitHub Thu, 29 Aug 2024 11:19:18 -0700


Nickcw6 opened a new pull request, #7986:
URL: https://github.com/apache/incubator-devlake/pull/7986


   ### Summary
   What does this PR do?
   
   Incremental data collection for the CircleCI plugin. 
   
   The extractor and converter have not been switched to incremental mode. 
Making the collectors incremental is enough for now to substantially reduce 
pipeline duration, but it is something to consider later.
   
   ### Does this close any open issues?
   N/A
   
   ### Screenshots
   Full data collection:
   <img width="1273" alt="Screenshot 2024-08-29 at 18 16 16" src="https://github.com/user-attachments/assets/f5ba528a-ee8d-42f6-976e-71e3dd444217">
   
   Incremental data collection ran afterwards:
   <img width="1253" alt="Screenshot 2024-08-29 at 18 28 47" src="https://github.com/user-attachments/assets/421aac1c-eeba-415d-bce5-062e7826b5a1">
   
   
   ### Other Information
   I had to make some adjustments to the existing 
`NewStatefulApiCollectorForFinalizableEntity` function, namely:
   - The original implementation assumed that all results on a page were either 
invalid (i.e. created before `createdAfter`) or valid (created after 
`createdAfter`). I extended this to check each item in turn where required, to 
prevent duplicate records.
   - The CircleCI jobs API has no dedicated `created_at` property, so we have to 
use `started_at`, which may be null. I therefore also added `IsZero()` checks 
on the parsed `time.Time`; we still collect the record if the timestamp is zero.
   
   There are some awkward design trade-offs around the jobs collection worth 
noting. We use the `workflow/{id}/jobs` 
[endpoint](https://circleci.com/docs/api/v2/index.html#operation/listWorkflowJobs) 
to collect all jobs for a workflow in a single request. As mentioned above, 
this lacks a definitive `created_at` property, so we use `started_at`, which may 
be null. There are a few ways round this:
   1. We swap to the `job/{id}` 
[endpoint](https://circleci.com/docs/api/v2/index.html#operation/getJobDetails).
       - This returns different fields, so the extractor and other downstream 
code would need more work to change.
       - This returns a 404 for some statuses, so we'd never collect that data.
       - It would drastically increase the duration of an already slow 
collection (e.g. instead of making 1 request for 5 jobs, we'd need to make 5 
separate requests).
   2. If `started_at` is null for a selection of statuses we care about (e.g. 
`running`), we assign a temporary timestamp of `time.Now()`.
       - Records would be recollected and corrected on the next run.
       - We'd be altering raw data in the DB, which is messy and could 
confuse users.
   3. We check for `IsZero()` and collect these records anyway.
       - Data is collected exactly as per the API.
       - Some of the jobs we collect have minimal value (e.g. we'd collect 
`blocked` or `unauthorized` jobs).
       - There are potentially unintended side effects for other plugins.
   4. We use the `created_at` of the workflow (`created_date` on 
`_tool_circleci_workflows`) and extend `_tool_circleci_jobs` with a 
`workflow_created_date` column.
       - As we collect all jobs for a workflow simultaneously, we would only 
need to consider this timestamp.
       - Data is collected as per the API.
   
   
   I couldn't get 4) to work nicely in the `GetCreated` function, so I have 
opted for 3), but I'm open to other ideas or suggestions on how this could be 
implemented.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
