[gem5-dev] [gem5 on GitHub] Explaining our new testing infrastructure: GitHub Actions, Runners, and current limitations

Bobby Bruce via gem5-dev Thu, 08 Jun 2023 16:03:51 -0700

Dear all,

As you are likely aware, we will soon be migrating from Gerrit to GitHub.
Some have reached out to me with questions about the testing infrastructure on
GitHub and how it may be used and improved. I shall address these in this email.

GitHub allows for testing via its “Actions” infrastructure. GitHub Actions
consists of “Workflows”, each of which specifies a set of jobs to be run when a
specific event occurs.
GitHub Actions supports a wide range of these events, from pull request
creation, through to specific keywords appearing in comments, simple scheduled
runs and many more. The jobs a workflow specifies run commands in a series of
steps, which makes the system flexible enough for a large number of automation
tasks. Some repos automatically moderate discussion threads, and others use it
to build and deploy their service. Right now we are just using it to run tests.
GitHub looks for workflows specified in yaml files in the “.github/workflows”
directory. Our current workflow yaml files can be found here:
https://github.com/gem5/gem5/blob/ad0a2d1beaa043c03c0e43406078b3a09a3861ac/.github/workflows/.
There are many resources online explaining how to write yaml files that
specify jobs and trigger them on specific events, so I won’t go into further
detail than this high-level description.
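To make the event/job/step structure concrete, here is a minimal sketch of a workflow written as a Python dict with the same nesting a file in .github/workflows/ would have. The workflow name, job names, commands and runner labels are hypothetical and are not gem5's real configuration; the helper functions are also illustrative and ignore `if:` conditions and branch/path filters.

```python
# A hypothetical workflow, written as a Python dict with the same nesting a
# workflow file in .github/workflows/ would have. The job names, commands and
# runner labels are illustrative assumptions, not gem5's real configuration.
WORKFLOW = {
    "name": "ci-tests",
    "on": {
        "pull_request": {},
        "schedule": [{"cron": "0 0 * * 0"}],  # weekly, Sundays at 00:00 UTC
    },
    "jobs": {
        "build-gem5": {
            "runs-on": ["self-hosted", "build"],
            "steps": [
                {"uses": "actions/checkout@v3"},
                {"run": "scons build/X86/gem5.opt -j4"},
            ],
        },
        "quick-tests": {
            "needs": "build-gem5",  # run only after build-gem5 succeeds
            "runs-on": ["self-hosted", "run"],
            "steps": [
                {"uses": "actions/checkout@v3"},
                {"run": "cd tests && ./main.py run"},
            ],
        },
    },
}


def triggers(workflow: dict) -> list:
    """Event names the workflow listens for (`on` may be a str, list or dict)."""
    on = workflow["on"]
    if isinstance(on, str):
        return [on]
    return list(on)  # list items, or the keys of a dict


def jobs_for_event(workflow: dict, event: str) -> list:
    """Names of the jobs queued when `event` fires (filters are ignored)."""
    if event not in triggers(workflow):
        return []
    return list(workflow["jobs"])


def steps_of(workflow: dict, job: str) -> list:
    """Each step is either a shell command (`run`) or a reusable action (`uses`)."""
    return [s.get("run") or s.get("uses") for s in workflow["jobs"][job]["steps"]]


if __name__ == "__main__":
    for job in jobs_for_event(WORKFLOW, "pull_request"):
        labels = WORKFLOW["jobs"][job]["runs-on"]
        print(job, labels, steps_of(WORKFLOW, job))
```

A pull request would queue both jobs, while a push would queue nothing, because this hypothetical workflow only listens for pull requests and the weekly schedule.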

There is one small limitation with GitHub Actions which will require us to change
some procedures. GitHub only reads the yaml files on the repository’s default
branch, which for gem5 is the stable branch. This means if we want to update
which tests are run, or how they are run, we need to update the stable branch.
After some discussion we believe the best policy will be to permit patches that
change these yaml files to be submitted to the stable branch between releases.
Since these files do not affect the compilation or running of gem5, the stable
branch is still “stable” with respect to the end user's interaction with gem5.

Jobs run on “runners”. A runner is just a server which accepts GitHub jobs to
run, one job at a time. Typically you would pay GitHub to use their hosted
runners; as most actions complete in a matter of seconds, they incur little cost.
That won’t work for us as some of our tests take days to complete. Fortunately
GitHub allows for “self-hosted runners”. With tooling provided by GitHub you
can set up a runner on any machine you want and point it towards the git
repository it is to accept job requests from. There is one big problem with
this: a self-hosted runner is not secure. With the right job specification you
can execute whatever you want on the host hardware. A smaller annoyance is that
GitHub makes it hard, but not impossible, to run more than one runner per
machine, which is frustrating when you want several runners to be executing
jobs in parallel on machines that can handle them.
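Because each runner takes only one job at a time, the wall-clock time for a batch of tests depends directly on how many runners are available. The sketch below models this as simple FIFO list scheduling: the first runner to become idle takes the next queued job. The job durations are made up for illustration, and GitHub's real dispatcher is not claimed to behave exactly like this.

```python
import heapq


def makespan(job_hours: list, n_runners: int) -> float:
    """Hours until the last job finishes, when each runner executes one job at
    a time and the first runner to become free takes the next queued job."""
    if n_runners < 1:
        raise ValueError("need at least one runner")
    free_at = [0.0] * n_runners  # time at which each runner is next idle
    heapq.heapify(free_at)
    finish = 0.0
    for hours in job_hours:             # jobs leave the queue in FIFO order
        start = heapq.heappop(free_at)  # earliest-available runner
        end = start + hours
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


if __name__ == "__main__":
    jobs = [2.0] * 8  # hypothetical: eight two-hour test jobs
    for n in (1, 2, 4, 8):
        print(f"{n} runner(s): {makespan(jobs, n)} h")
```

With eight two-hour jobs this gives 16 hours on a single runner but 2 hours on eight, which is why packing several runners onto each capable machine matters.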

Our solution to this is to set up runners in virtual machines. We attempted to
use Kubernetes for this but found it is more tailored towards large
cloud-based clusters, whereas we want to utilize the smaller number of servers
at our disposal. After some trial and error we decided it wasn’t the right tool
for the job. Moving on from this, we opted to use Vagrant to create VMs to host
the runners. I have documented all the scripts I used to do this here:
https://gem5-review.googlesource.com/c/public/gem5/+/71098. You can consult the
“README.md” for the procedure to set up your own runners. Though I have created
some scripts to semi-automate the process, it’s still quite manual. It would be
nice if there were a more “push button” way to deploy runners. In a similar
vein, if they break we have to manually go in and restart them. There’s room
for improvement here.
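As one possible direction for the manual-restart problem, here is a minimal watchdog sketch. It is not part of the scripts in the change linked above. It parses the output of `vagrant status --machine-readable` (comma-separated `timestamp,target,type,data` lines) and runs `vagrant up <name>` for any VM that is not running. It assumes each VM starts its GitHub runner on boot, and the directory path is hypothetical.

```python
import subprocess


def parse_vagrant_states(output: str) -> dict:
    """Map machine name -> state from `vagrant status --machine-readable`.

    Each line has the form timestamp,target,type,data...; the lines whose
    type is "state" carry the provider's state, e.g. "running" or "poweroff".
    """
    states = {}
    for line in output.splitlines():
        fields = line.split(",")
        if len(fields) >= 4 and fields[2] == "state":
            states[fields[1]] = fields[3]
    return states


def machines_to_restart(states: dict) -> list:
    """Every machine that is not currently running, in name order."""
    return sorted(name for name, state in states.items() if state != "running")


def restart_stopped(vagrant_dir: str) -> list:
    """Run `vagrant up` for each VM in vagrant_dir that is not running."""
    result = subprocess.run(
        ["vagrant", "status", "--machine-readable"],
        cwd=vagrant_dir, capture_output=True, text=True, check=True,
    )
    stopped = machines_to_restart(parse_vagrant_states(result.stdout))
    for name in stopped:
        subprocess.run(["vagrant", "up", name], cwd=vagrant_dir, check=True)
    return stopped


if __name__ == "__main__":
    # Hypothetical path; point this at the directory holding the Vagrantfile.
    print(restart_stopped("./runners"))
```

Run periodically (for example from cron), something like this would cover the "restart them" case, though not the deployment of new VMs.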

Right now we have two types of VMs: “builders” and “runners”. Builders are
4-core, 16GB VMs whose primary purpose is to build gem5. Runners are
single-core, 6GB VMs whose purpose is to run instances of gem5. Aiming
for a rough 6 to 1 ratio, we have 26 runners and 4 builders spread over 3
machines, though this is very lopsided as 1 of our machines hosts 20 runners.
In the yaml files, jobs are distributed to either a runner or a builder based
on the “runs-on” field.
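The arithmetic behind this setup, and the way `runs-on` routes a job, can be sketched as follows. The counts and VM sizes come from the paragraph above; the label names (`build`, `run`, `kvm`) are hypothetical. The matching rule, that a self-hosted runner can only take a job if it carries every label listed in `runs-on`, is how GitHub assigns jobs to self-hosted runners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VMType:
    name: str
    count: int         # number of VMs of this type
    cores: int         # cores per VM
    mem_gb: int        # memory per VM, in GB
    labels: frozenset  # runner labels (hypothetical; see lead-in)


# Counts and sizes are taken from the description above; labels are assumed.
BUILDER = VMType("builder", 4, 4, 16, frozenset({"self-hosted", "build"}))
RUNNER = VMType("runner", 26, 1, 6, frozenset({"self-hosted", "run"}))
FLEET = [BUILDER, RUNNER]


def totals(fleet):
    """Total (cores, GB of memory) allocated across all VMs."""
    cores = sum(v.count * v.cores for v in fleet)
    mem_gb = sum(v.count * v.mem_gb for v in fleet)
    return cores, mem_gb


def eligible(runs_on, fleet):
    """VM types that can take a job: a self-hosted runner must carry every
    label listed in the job's runs-on field."""
    wanted = set(runs_on)
    return [v.name for v in fleet if wanted <= v.labels]


if __name__ == "__main__":
    print("runner:builder ratio =", RUNNER.count / BUILDER.count)
    print("total (cores, GB) =", totals(FLEET))
    print(eligible(["self-hosted", "build"], FLEET))
    print(eligible(["self-hosted", "kvm"], FLEET))  # no match: job stays queued
```

This gives a 6.5:1 runner-to-builder ratio and 42 virtual cores and 220GB of memory allocated in total. A job asking for a label no VM carries, such as the hypothetical `kvm` label, has no eligible runner and simply waits in the queue.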

Though this setup is currently functional, it does have some restrictions and
pain-points. Of note:

- We do not have a runner which can run KVM tests. For the meantime these are
skipped. We’re not sure how feasible it is to put a runner in a VM that allows
KVM.
- The weekly GPU tests need a special Docker container to be built as part of
the tests, and we need more time to figure out how to do this. At present we
get errors, but we are working on finding a solution.
- We do not have good tools to orchestrate these VMs. If they go down and need
to be restarted, or new VMs need to be created, it requires manual effort.
- 20 of our runners are on a single machine. It’d be much better to have a more
distributed set of runners.
- All our machines are x86. It may be of value to have some ARM hosts too,
particularly to run ARM KVM tests.
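The KVM and ARM points are related: KVM can only accelerate a guest whose ISA matches the host's, and inside a VM the `/dev/kvm` device is normally only available if the hypervisor exposes nested virtualization. The sketch below is a hypothetical pre-check a runner could perform before deciding to run or skip KVM tests. The function names and the ISA-to-architecture table are illustrative assumptions, not part of gem5's test harness.

```python
import os
import platform

# KVM can only accelerate a guest whose ISA matches the host's, so each gem5
# target ISA maps to the Linux host architecture names that can run it.
KVM_HOST_ARCH = {
    "X86": {"x86_64"},
    "ARM": {"aarch64"},
}


def can_run_kvm(target_isa: str, host_arch: str, kvm_usable: bool) -> bool:
    """Pure decision: KVM must be usable and the host ISA must match."""
    return kvm_usable and host_arch in KVM_HOST_ARCH.get(target_isa, set())


def host_can_run_kvm(target_isa: str) -> bool:
    """Check this machine. Inside a VM, /dev/kvm is usually only present if
    the hypervisor exposes nested virtualization."""
    kvm_usable = os.access("/dev/kvm", os.R_OK | os.W_OK)
    return can_run_kvm(target_isa, platform.machine(), kvm_usable)


if __name__ == "__main__":
    for isa in ("X86", "ARM"):
        status = "run" if host_can_run_kvm(isa) else "skip"
        print(f"{isa} KVM tests: {status}")
```

On any of our current x86 machines this check could at best enable the X86 KVM tests; the ARM KVM tests would need an aarch64 host.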

If anyone reading this wants to help with development of this infrastructure,
then I’d be happy to accommodate their input. I realize many of the parts
explained here can be improved. Using the scripts I provide here:
https://gem5-review.googlesource.com/c/public/gem5/+/71098 you can set up your
own runner and test out different setups on your own forks of the gem5
repository. We’d also welcome improvements to our yaml scripts to better
utilize what we have and run better tests.

Kind regards,
Bobby
--
Dr. Bobby R. Bruce
Room 3050,
Kemper Hall, UC Davis
Davis,
CA, 95616

web: https://www.bobbybruce.net
