This is an automated email from the ASF dual-hosted git repository.

simbit18 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nuttx.git


The following commit(s) were added to refs/heads/master by this push:
     new 12e8f92a282 CI: Retry build upon failure
12e8f92a282 is described below

commit 12e8f92a282fac58e0dfff587ea3d9502e4804c0
Author: Lup Yuen Lee <[email protected]>
AuthorDate: Sat Apr 4 17:33:05 2026 +0800

    CI: Retry build upon failure
    
    In Jan-Feb 2026, NuttX CI hit a 
[record high usage of GitHub Runners](https://github.com/apache/nuttx/issues/17914), 
exceeding the limit enforced by the ASF Infrastructure Team. We analysed the 
PRs and discovered that most GitHub Runners were wasted on __(1) Failure to 
Download the Build Dependencies__ for DTC Device Tree, OpenAMP Messaging, 
MicroADB Debugger, MCUBoot Bootloader, NimBLE Bluetooth, etc., and __(2) 
Resubmitting PR Commits__:
    
    - [Video: Analysing the Most Expensive PR](https://youtu.be/swFaxaTCEQg)
    - [Video: Second Most Expensive PR](https://youtu.be/uSpQkzBogEw)
    - [Video: Third Most Expensive PR](https://youtu.be/J7w1gyjwZ1w)
    - [Video: Most Expensive Apps PR](https://youtu.be/182h8cRpfvI)
    - [Spreadsheet: Most Expensive PRs](https://docs.google.com/spreadsheets/d/1HY7fIZzd_fs3QPyA0TX7vsYOjL86m1fNOf1Wls93luI/edit?gid=70515654#gid=70515654)
    
    Why would __Download Failures__ waste GitHub Runners? That's because a 
Download Failure will terminate the Entire CI Build (across All CI Jobs), 
requiring a restart of the CI Build. And the CI Build isn't terminated 
immediately upon failure: NuttX CI waits for the CI Job to complete (e.g. 
arm-01), before terminating the CI Build. As a result, a CI Build can get 
terminated 2.5 hours in, wasting 2.5 elapsed hours x [7.4 
parallel processes](https://lupyuen.org/articles/c [...]
    
    This PR proposes to __Retry the Build for Each CI Target__. NuttX CI shall 
rebuild each CI Target (e.g. `sim:nsh`) upon failure, up to 3 times (4 builds 
in total). Each rebuild will be attempted after a Randomised Delay with 
Exponential Backoff: the delay is chosen at random between 1 second and a 
ceiling that starts at 60 seconds, then doubles to 120 seconds and 240 seconds. 
The rebuilds will mitigate the effects of Intermittent Download Failures that 
occur in GitHub Actions. (And eliminate developer frustration)
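    
    The delay schedule described above can be sketched in isolation. This is 
a minimal illustration, not the code in `tools/testbuild.sh`: the function 
names `random_delay` and `backoff_ceilings` are hypothetical, and the sketch 
only prints the delays instead of sleeping.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of tools/testbuild.sh.

# Print a random delay between 1 and the given ceiling, in seconds.
random_delay() {
  local ceiling=$1
  echo $(( (RANDOM % ceiling) + 1 ))
}

# Print the backoff ceiling used before each retry.
# The ceiling starts at 60 seconds and doubles after every retry.
backoff_ceilings() {
  local retries=$1
  local backoff=60
  local n=1
  while [ "$n" -le "$retries" ]; do
    echo "$backoff"
    backoff=$((backoff * 2))
    n=$((n + 1))
  done
}

# Show one possible schedule for 3 retries (delays differ on every run).
for ceiling in $(backoff_ceilings 3); do
  echo "ceiling=${ceiling}s delay=$(random_delay "$ceiling")s"
done
```

    Drawing the delay at random below the ceiling, rather than sleeping for 
the full ceiling, spreads out the retries of CI Jobs that failed at the same 
moment.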
    
    If the build fails after 3 retries: Subsequent CI Targets will __not be 
allowed to rebuild__ upon failure. This is to prevent cascading build failures 
from overloading GitHub Actions, and consuming too many GitHub Runners.
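    
    The effect of this rule can be sketched with a stubbed build. The names 
below (`build_target`, `build_with_budget`, `FAILING`, `attempts_log`) are 
hypothetical and exist only for this illustration; the stub stands in for the 
real build, and the backoff sleep is left out so that the sketch runs 
instantly.

```shell
#!/usr/bin/env bash
# Hypothetical names throughout; the stub build stands in for a real build.

max_builds=4     # 1 initial build + 3 retries
attempts_log=""  # Records every build attempt, for inspection

# Stub build: fails for targets listed in FAILING, succeeds otherwise.
build_target() {
  case " $FAILING " in
    *" $1 "*) return 1 ;;
    *) return 0 ;;
  esac
}

# Build one target, retrying within the current budget. Once a target uses
# up the whole budget, shrink the budget to 1 so later targets get no retries.
build_with_budget() {
  local target=$1
  local attempt
  for ((attempt = 1; attempt <= max_builds; attempt++)); do
    attempts_log="$attempts_log $target#$attempt"
    if build_target "$target"; then
      return 0
    fi
    if [ "$attempt" -eq "$max_builds" ]; then
      max_builds=1
      return 1
    fi
    # A real script would wait for the backoff delay here before retrying.
  done
}

FAILING="sim:nsh sim:ostest"
for target in sim:nsh sim:ostest sim:rpserver; do
  build_with_budget "$target" || echo "FAILED: $target"
done
echo "Attempts:$attempts_log"
```

    In this sketch the first failing target is built 4 times, while the 
second failing target is built only once, because the budget was already 
exhausted.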
    
    Note that NuttX CI shall retry the build for __Any Kind of Build Failure__, 
including Download Failures, Compile Errors and Config Errors. We designed it 
simplistically due to our current constraints: (1) Lack of CI Expertise (2) 
NuttX CI is Mission Critical (3) Legacy CI Scripts are Highly Complex. To 
prevent Compile Errors and Config Errors: We expect NuttX Devs to [Build and 
Test PRs in Our Own Repos](https://github.com/apache/nuttx/issues/18568), 
before submitting to NuttX.
    
    What about __Resubmitting PR Commits__ and its wastage of GitHub Runners? 
We also require NuttX Devs to [Build and Test PRs in Our Own 
Repos](https://github.com/apache/nuttx/issues/18568), before resubmitting to 
NuttX. GitHub Runners will then be charged to the developer's quota, without 
affecting the GitHub Runners quota for Apache NuttX Project. We plan to [Kill 
All CI Jobs](https://youtu.be/182h8cRpfvI?si=MmAuwLISZPPMoqDq&t=1479) for PRs 
that have been switched to Draft Mode. We'll [...]
    
    Modified Files:
    
    `tools/testbuild.sh`: We introduce a New Wrapper Function `retrytest` that 
will call the Existing Function `dotest`, to build the CI Target and retry on 
error.
    
    `Documentation/components/tools/testbuild.rst`: Updated the `testbuild.sh` 
doc with the Retry Logic.
    
    Signed-off-by: Lup Yuen Lee <[email protected]>
---
 Documentation/components/tools/testbuild.rst | 11 ++++++-
 tools/testbuild.sh                           | 48 ++++++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/Documentation/components/tools/testbuild.rst b/Documentation/components/tools/testbuild.rst
index ea568ced6db..ee40c7f0edf 100644
--- a/Documentation/components/tools/testbuild.rst
+++ b/Documentation/components/tools/testbuild.rst
@@ -23,7 +23,7 @@ option shows the usage:
      -a <appsdir> provides the relative path to the apps/ directory.  Default ../apps
      -t <topdir> provides the absolute path to top nuttx/ directory.  Default ../nuttx
      -p only print the list of configs without running any builds
-     -A store the build executable artifact in ARTIFACTDIR (defaults to ../buildartifacts
+     -A store the build executable artifact in ARTIFACTDIR (defaults to ../buildartifacts)
      -C Skip tree cleanness check.
      -G Use "git clean -xfdq" instead of "make distclean" to clean the tree.
         This option may speed up the builds. However, note that:
@@ -73,3 +73,12 @@ The prefix ``-`` can be used to skip a configuration::
 or skip a configuration on a specific host(e.g. Darwin)::
 
   -Darwin,sim:rpserver
+
+This script will rebuild each configuration, upon failure, up to 3 times.
+Each rebuild will be attempted after a randomised delay with exponential
+backoff, initially set to 60 seconds. The rebuilds will mitigate the
+effects of intermittent download failures that occur in GitHub Actions.
+
+If the build fails after 3 retries, subsequent configurations will not
+be allowed to rebuild upon failure.  This is to prevent cascading build
+failures from overloading GitHub Actions.
diff --git a/tools/testbuild.sh b/tools/testbuild.sh
index 6d80903155b..16bbeeae8ee 100755
--- a/tools/testbuild.sh
+++ b/tools/testbuild.sh
@@ -24,6 +24,7 @@ nuttx=$WD/../nuttx
 
 progname=$0
 fail=0
+maxbuilds=4  # Retry 3 times on failure
 APPSDIR=$WD/../apps
 if [ -z $ARTIFACTDIR ]; then
   ARTIFACTDIR=$WD/../buildartifacts
@@ -580,6 +581,49 @@ function dotest {
   fi
 }
 
+# Build one entry from the test list file. Retry on failure.
+function retrytest {
+  # Remember the Fail Status and clear it for each build
+  local line=$1
+  local prevfail=$fail
+  local backoff=60  # Initial Exponential Backoff, in seconds
+
+  # Build and retry on failure, with Random Exponential Backoff
+  for ((i = 1; i <= $maxbuilds; i++)); do
+    echo "Build Attempt $i of $maxbuilds"
+    fail=0
+    dotest $line
+
+    # Don't retry if the build succeeded
+    if [ ${fail} -eq 0 ]; then
+      break
+    else
+      # Build Failed: Clean up any corrupted downloads, don't reuse
+      git -C $nuttx clean -fd
+      git -C $APPSDIR clean -fd
+      pushd $nuttx ; git status ; popd
+      pushd $APPSDIR ; git status ; popd
+    fi
+
+    # If this is Final Retry: Don't retry subsequent builds
+    if [ $i -eq $maxbuilds ]; then
+      maxbuilds=1
+      break
+    fi
+
+    # Wait for Random Exponential Backoff, then retry
+    delay=$(( (RANDOM % $backoff) + 1 ))
+    echo "Wait $delay seconds ($backoff backoff)"
+    backoff=$(($backoff * 2))
+    sleep $delay
+  done
+
+  # Return the Previous Fail Status, unless this build failed
+  if [ ${fail} -eq 0 ]; then
+    fail=$prevfail
+  fi
+}
+
 # Perform the build test for each entry in the test list file
 
 for line in $testlist; do
@@ -588,10 +632,10 @@ for line in $testlist; do
     dir=`echo $line | cut -d',' -f1`
     list=`find boards$dir -name defconfig | cut -d'/' -f4,6`
     for i in ${list}; do
-      dotest $i${line/"$dir"/}
+      retrytest $i${line/"$dir"/}
     done
   else
-    dotest $line
+    retrytest $line
   fi
 done
 
