This is an automated email from the ASF dual-hosted git repository.
simbit18 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nuttx.git
The following commit(s) were added to refs/heads/master by this push:
new 12e8f92a282 CI: Retry build upon failure
12e8f92a282 is described below
commit 12e8f92a282fac58e0dfff587ea3d9502e4804c0
Author: Lup Yuen Lee <[email protected]>
AuthorDate: Sat Apr 4 17:33:05 2026 +0800
CI: Retry build upon failure
In Jan-Feb 2026: NuttX CI hit a [record high usage of GitHub
Runners](https://github.com/apache/nuttx/issues/17914), exceeding the limit
enforced by the ASF Infrastructure Team. We analysed the PRs and discovered that
most GitHub Runners were wasted on __(1) Failure to Download the Build
Dependencies__ for DTC Device Tree, OpenAMP Messaging, MicroADB Debugger,
MCUBoot Bootloader, NimBLE Bluetooth, etc. and __(2) Resubmitting PR Commits__:
- [Video: Analysing the Most Expensive PR](https://youtu.be/swFaxaTCEQg)
- [Video: Second Most Expensive PR](https://youtu.be/uSpQkzBogEw)
- [Video: Third Most Expensive PR](https://youtu.be/J7w1gyjwZ1w)
- [Video: Most Expensive Apps PR](https://youtu.be/182h8cRpfvI)
- [Spreadsheet: Most Expensive
PRs](https://docs.google.com/spreadsheets/d/1HY7fIZzd_fs3QPyA0TX7vsYOjL86m1fNOf1Wls93luI/edit?gid=70515654#gid=70515654)
Why would __Download Failures__ waste GitHub Runners? That's because a
Download Failure will terminate the Entire CI Build (across All CI Jobs),
requiring a restart of the CI Build. And the CI Build isn't terminated
immediately upon failure: NuttX CI waits for the CI Job (e.g. arm-01) to
complete before terminating the CI Build. This means that a CI Build can be
terminated 2.5 hours in, wasting 2.5 elapsed hours x [7.4
parallel processes](https://lupyuen.org/articles/c [...]
This PR proposes to __Retry the Build for Each CI Target__. NuttX CI shall
rebuild each CI Target (e.g. `sim:nsh`) upon failure, up to 3 times (total 4
builds). Each rebuild will be attempted after a Randomised Delay with
Exponential Backoff: a random delay of up to 60 seconds before the first retry,
then up to 120 seconds, then up to 240 seconds. The rebuilds will mitigate the
effects of Intermittent Download Failures that occur in GitHub Actions. (And
reduce developer frustration.)
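A rough illustration of the delay schedule: this is a minimal bash sketch that assumes the same arithmetic as `retrytest` in the diff below (delay picked between 1 and the current backoff, backoff doubled after each retry). The helper `print_backoff_schedule` is hypothetical, and it prints the chosen delays instead of calling `sleep`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the randomised delay chosen before each retry,
# using the same arithmetic as retrytest (delay in 1..backoff, backoff doubles).
print_backoff_schedule() {
  local maxbuilds=$1      # Total builds, e.g. 4 = 1 build + 3 retries
  local backoff=60        # Initial backoff cap, in seconds
  local attempt delay
  for ((attempt = 1; attempt < maxbuilds; attempt++)); do
    delay=$(( (RANDOM % backoff) + 1 ))   # Random delay between 1 and backoff seconds
    echo "retry $attempt: wait $delay seconds (cap $backoff)"
    backoff=$(( backoff * 2 ))            # Double the cap: 60, 120, 240
  done
}

print_backoff_schedule 4
```

Randomising the delay (rather than always waiting the full 60, 120 or 240 seconds) spreads out the retries of CI Jobs that failed at the same moment, so they don't all hit the download server again at once.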
If the build still fails after 3 retries: Subsequent CI Targets will __not be
allowed to rebuild__ upon failure. This prevents cascading build failures
from overloading GitHub Actions and consuming too many GitHub Runners.
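To make the cascading-failure rule concrete, here is a minimal bash sketch. It assumes, like `retrytest`, a single global `maxbuilds` shared by all CI Targets; `fake_build` and `retry_target` are hypothetical stand-ins for `dotest` and `retrytest`, with the backoff delay and git cleanup left out:

```bash
#!/usr/bin/env bash
# Sketch of the "stop retrying after exhausted retries" rule.
# fake_build and retry_target are hypothetical stand-ins for dotest/retrytest.

maxbuilds=4   # 1 build + 3 retries, shared across all CI Targets
attempts=0    # Counts every build attempt, for illustration only

# Hypothetical build step that always fails (returns non-zero)
fake_build() {
  attempts=$((attempts + 1))
  return 1
}

retry_target() {
  local target=$1 i
  for ((i = 1; i <= maxbuilds; i++)); do
    echo "$target: attempt $i of $maxbuilds"
    if fake_build "$target"; then
      return 0                       # Build succeeded: no retry needed
    fi
    if [ "$i" -eq "$maxbuilds" ]; then
      maxbuilds=1                    # Retries exhausted: later targets get 1 build only
      return 1
    fi
  done
  return 1
}

retry_target sim:nsh || echo "sim:nsh failed"       # 4 attempts
retry_target sim:ostest || echo "sim:ostest failed" # only 1 attempt
echo "total attempts: $attempts"
```

With `sim:nsh` exhausting its 4 builds, `maxbuilds` drops to 1, so `sim:ostest` gets a single attempt (5 attempts in total instead of 8).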
Note that NuttX CI shall retry the build for __Any Kind of Build Failure__,
including Download Failures, Compile Errors and Config Errors. We kept the
design simple because of our current constraints: (1) Lack of CI Expertise,
(2) NuttX CI is Mission Critical, (3) Legacy CI Scripts are Highly Complex. To
prevent Compile Errors and Config Errors, we expect NuttX Devs to [Build and
Test PRs in Our Own Repos](https://github.com/apache/nuttx/issues/18568)
before submitting to NuttX.
What about __Resubmitting PR Commits__ and the GitHub Runners it wastes?
We also require NuttX Devs to [Build and Test PRs in Our Own
Repos](https://github.com/apache/nuttx/issues/18568) before resubmitting to
NuttX. GitHub Runners will then be charged to the developer's quota, without
affecting the GitHub Runners quota for the Apache NuttX Project. We plan to [Kill
All CI Jobs](https://youtu.be/182h8cRpfvI?si=MmAuwLISZPPMoqDq&t=1479) for PRs
that have been switched to Draft Mode. We'll [...]
Modified Files:
`tools/testbuild.sh`: We introduce a New Wrapper Function `retrytest` that
calls the Existing Function `dotest` to build the CI Target, retrying on
error.
`Documentation/components/tools/testbuild.rst`: Updated the `testbuild.sh`
doc with the Retry Logic.
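The wrapper keeps the overall `fail` status sticky: each attempt starts with `fail=0`, and a successful build restores whatever status earlier CI Targets left behind. A minimal bash sketch of that pattern, assuming (as in the diff) that the build function reports failure by setting the global `fail=1`; `dotest_stub` and `retrytest_sketch` are hypothetical names, and the backoff delay and git cleanup are omitted:

```bash
#!/usr/bin/env bash
# Sketch of the wrapper pattern: retrytest_sketch calls an existing build
# function and keeps the overall fail status sticky across targets.
# dotest_stub is a hypothetical stand-in for dotest: it reports failure
# by setting the global variable fail=1.

fail=0
maxbuilds=4
calls_flaky=0

dotest_stub() {
  case $1 in
    flaky)
      calls_flaky=$((calls_flaky + 1))
      [ "$calls_flaky" -ge 2 ] || fail=1 ;;  # Fails only on the first call
    broken) fail=1 ;;                         # Always fails
    *) ;;                                     # Always succeeds
  esac
}

retrytest_sketch() {
  local target=$1 prevfail=$fail i
  for ((i = 1; i <= maxbuilds; i++)); do
    fail=0                      # Judge each attempt on its own
    dotest_stub "$target"
    [ "$fail" -eq 0 ] && break  # Success: stop retrying
    if [ "$i" -eq "$maxbuilds" ]; then
      maxbuilds=1               # Retries exhausted: no retries for later targets
      break
    fi
  done
  # A successful build must not clear an earlier failure
  if [ "$fail" -eq 0 ]; then
    fail=$prevfail
  fi
}

retrytest_sketch flaky   # Fails once, succeeds on retry: fail stays 0
echo "after flaky: fail=$fail"
retrytest_sketch broken  # Fails 4 times: fail=1 and maxbuilds=1
echo "after broken: fail=$fail"
retrytest_sketch good    # Succeeds, but the earlier failure is kept: fail=1
echo "after good: fail=$fail"
```

Restoring `prevfail` only on success means a later passing CI Target can't mask an earlier failed one, so the overall run still reports failure.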
Signed-off-by: Lup Yuen Lee <[email protected]>
---
Documentation/components/tools/testbuild.rst | 11 ++++++-
tools/testbuild.sh | 48 ++++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/Documentation/components/tools/testbuild.rst b/Documentation/components/tools/testbuild.rst
index ea568ced6db..ee40c7f0edf 100644
--- a/Documentation/components/tools/testbuild.rst
+++ b/Documentation/components/tools/testbuild.rst
@@ -23,7 +23,7 @@ option shows the usage:
-a <appsdir> provides the relative path to the apps/ directory. Default ../apps
-t <topdir> provides the absolute path to top nuttx/ directory. Default ../nuttx
-p only print the list of configs without running any builds
- -A store the build executable artifact in ARTIFACTDIR (defaults to ../buildartifacts
+ -A store the build executable artifact in ARTIFACTDIR (defaults to ../buildartifacts)
-C Skip tree cleanness check.
-G Use "git clean -xfdq" instead of "make distclean" to clean the tree.
This option may speed up the builds. However, note that:
@@ -73,3 +73,12 @@ The prefix ``-`` can be used to skip a configuration::
or skip a configuration on a specific host(e.g. Darwin)::
-Darwin,sim:rpserver
+
+This script will rebuild each configuration, upon failure, up to 3 times.
+Each rebuild will be attempted after a randomised delay with exponential
+backoff, initially set to 60 seconds. The rebuilds will mitigate the
+effects of intermittent download failures that occur in GitHub Actions.
+
+If the build fails after 3 retries, subsequent configurations will not
+be allowed to rebuild upon failure. This is to prevent cascading build
+failures from overloading GitHub Actions.
diff --git a/tools/testbuild.sh b/tools/testbuild.sh
index 6d80903155b..16bbeeae8ee 100755
--- a/tools/testbuild.sh
+++ b/tools/testbuild.sh
@@ -24,6 +24,7 @@ nuttx=$WD/../nuttx
progname=$0
fail=0
+maxbuilds=4 # Retry 3 times on failure
APPSDIR=$WD/../apps
if [ -z $ARTIFACTDIR ]; then
ARTIFACTDIR=$WD/../buildartifacts
@@ -580,6 +581,49 @@ function dotest {
fi
}
+# Build one entry from the test list file. Retry on failure.
+function retrytest {
+  # Remember the Fail Status and clear it for each build
+  local line=$1
+  local prevfail=$fail
+  local backoff=60 # Initial Exponential Backoff, in seconds
+
+  # Build and retry on failure, with Random Exponential Backoff
+  for ((i = 1; i <= $maxbuilds; i++)); do
+    echo "Build Attempt $i of $maxbuilds"
+    fail=0
+    dotest $line
+
+    # Don't retry if the build succeeded
+    if [ ${fail} -eq 0 ]; then
+      break
+    else
+      # Build Failed: Clean up any corrupted downloads, don't reuse
+      git -C $nuttx clean -fd
+      git -C $APPSDIR clean -fd
+      pushd $nuttx ; git status ; popd
+      pushd $APPSDIR ; git status ; popd
+    fi
+
+    # If this is Final Retry: Don't retry subsequent builds
+    if [ $i -eq $maxbuilds ]; then
+      maxbuilds=1
+      break
+    fi
+
+    # Wait for Random Exponential Backoff, then retry
+    delay=$(( (RANDOM % $backoff) + 1 ))
+    echo "Wait $delay seconds ($backoff backoff)"
+    backoff=$(($backoff * 2))
+    sleep $delay
+  done
+
+  # Return the Previous Fail Status, unless this build failed
+  if [ ${fail} -eq 0 ]; then
+    fail=$prevfail
+  fi
+}
+
# Perform the build test for each entry in the test list file
for line in $testlist; do
@@ -588,10 +632,10 @@ for line in $testlist; do
dir=`echo $line | cut -d',' -f1`
list=`find boards$dir -name defconfig | cut -d'/' -f4,6`
for i in ${list}; do
- dotest $i${line/"$dir"/}
+ retrytest $i${line/"$dir"/}
done
else
- dotest $line
+ retrytest $line
fi
done