[PR] fix: add pagination to job_collector task [incubator-devlake]

via GitHub Thu, 12 Dec 2024 05:01:25 -0800


ClaudioMascaro opened a new pull request, #8240:
URL: https://github.com/apache/incubator-devlake/pull/8240


   <!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
   -->
   ### ⚠️ Pre Checklist
   
   > Please complete _ALL_ items in this checklist, and remove before submitting
   
   - [x] I have read through the [Contributing 
Documentation](https://devlake.apache.org/community/).
   - [ ] I have added relevant tests.
   - [ ] I have added relevant documentation.
   - [x] I will add labels to the PR, such as `pr-type/bug-fix`, 
`pr-type/feature-development`, etc.
   
   <!--
   Thanks for submitting a pull request!
   
   We appreciate you spending the time to work on these changes.
   Please fill out as many sections below as possible.
   -->
   
   ### Summary
   Add pagination to the job collector task of the github_graphql plugin.
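
For readers unfamiliar with the technique, here is a minimal sketch of cursor-based pagination, the general pattern GraphQL connections expose through `pageInfo { endCursor hasNextPage }`. It is not the code from this PR: `Job`, `JobPage`, `FetchJobsPage`, `CollectAllJobs` and `NewFakeFetcher` are hypothetical names, and the fetcher is an in-memory stand-in for the real API call.

```go
package main

import (
	"errors"
	"strconv"
)

// Job is a simplified, hypothetical stand-in for one job run of a workflow run.
type Job struct {
	ID   int
	Name string
}

// PageInfo mirrors the cursor fields that GraphQL connections expose.
type PageInfo struct {
	EndCursor   string
	HasNextPage bool
}

// JobPage is one page of jobs plus the cursor needed to request the next page.
type JobPage struct {
	Jobs     []Job
	PageInfo PageInfo
}

// FetchJobsPage is a hypothetical callback that requests up to pageSize jobs
// of one workflow run, starting after the given cursor ("" means the start).
type FetchJobsPage func(runID int, pageSize int, after string) (JobPage, error)

// CollectAllJobs requests page after page for a single workflow run until the
// server reports that no further page exists, so no single query has to
// return every job at once.
func CollectAllJobs(fetch FetchJobsPage, runID int, pageSize int) ([]Job, error) {
	if pageSize <= 0 {
		return nil, errors.New("pageSize must be positive")
	}
	var all []Job
	cursor := ""
	for {
		page, err := fetch(runID, pageSize, cursor)
		if err != nil {
			return nil, err
		}
		all = append(all, page.Jobs...)
		if !page.PageInfo.HasNextPage {
			return all, nil
		}
		cursor = page.PageInfo.EndCursor
	}
}

// NewFakeFetcher returns an in-memory FetchJobsPage that serves `total` jobs
// and counts requests in *calls. The cursor is simply the number of jobs
// already returned; a real API uses an opaque cursor instead.
func NewFakeFetcher(total int, calls *int) FetchJobsPage {
	return func(runID int, pageSize int, after string) (JobPage, error) {
		*calls++
		start := 0
		if after != "" {
			n, err := strconv.Atoi(after)
			if err != nil {
				return JobPage{}, err
			}
			start = n
		}
		if start > total {
			start = total
		}
		end := start + pageSize
		if end > total {
			end = total
		}
		jobs := make([]Job, 0, end-start)
		for i := start; i < end; i++ {
			jobs = append(jobs, Job{ID: i + 1, Name: "job-" + strconv.Itoa(i+1)})
		}
		info := PageInfo{EndCursor: strconv.Itoa(end), HasNextPage: end < total}
		return JobPage{Jobs: jobs, PageInfo: info}, nil
	}
}
```

Calling `CollectAllJobs(NewFakeFetcher(250, &calls), 1, 100)` walks three pages (100, 100 and 50 jobs). Smaller pages mean more round trips but bounded work per request, which is the trade-off that matters when a single workflow run has hundreds of jobs.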
   
   ### Does this close any open issues?
   Closes https://github.com/apache/incubator-devlake/issues/8028
   
   ### Screenshots
   
   We needed to extract data from a large and complex repository that has 
over 30000 workflow runs, some of which have more than 200 job runs.
   
   Whenever the Collect Job task started, the query simply wouldn't finish 
in time, and the task entered the retry flow:
   
   ![Screenshot from 2024-12-09 
16-56-00](https://github.com/user-attachments/assets/62bf4cbb-a8e5-40a2-96bd-3f88b9c8bc4f)
   
   We tried reducing the API timeout, but that only increased the number 
of unsuccessful retries:
   
   ![Screenshot from 2024-12-06 
15-38-14](https://github.com/user-attachments/assets/3cf91526-cf45-44ee-8bb8-b72fe5a94786)
   
   With pagination implemented, the task succeeded in our case: after 17 
hours, it had collected all the data:
   
   
![image](https://github.com/user-attachments/assets/1c73626d-4884-406c-a24f-d233cc677f82)
   (don't mind the log I added locally to debug)
   
   
![image](https://github.com/user-attachments/assets/405e4f24-6cc3-4f0d-8e06-3a9d9ab1a117)
   
   
   For comparison, we extracted data from a much less complex repository. 
Before the change, the pipeline took the following time:
   
   ![Screenshot from 2024-12-11 
08-11-06](https://github.com/user-attachments/assets/d2af3d2a-9185-4ef3-a2e1-4073ade65612)
   
   After applying the change (and, of course, rerunning in hard refresh 
mode), there was no change in overall pipeline time:
   
   ![Screenshot from 2024-12-11 
08-18-53](https://github.com/user-attachments/assets/28ad236b-1dc1-49ed-b752-e572f89bef86)
   
   
   
   ### Other Information
   
   Already merged into the main branch; reopening as requested: 
https://github.com/apache/incubator-devlake/pull/8233#pullrequestreview-2497839137

