Summary: We currently have 5 CI systems: Travis, Azure, Gitlab, AppVeyor, and DrDr. I explain what I have done so far in Gitlab and propose unifying them into a single solution in the future. I welcome concerns, suggestions and comments.
Long version:

A few years ago, GitHub CI through Travis was lacking important features (for a start, it only supported two OS images and one architecture), and many other systems provided much better solutions. I had worked in industry with many of them, including Gitlab[1], buildbot[2] and Jenkins[3]. Pre-2 Jenkins was pretty buggy: I went through the pain of setting up a large CI system for internal development tools at a previous client and it scarred me for life. Jenkins 2 had just come out by then and looked better and shinier, but I never gave it a go. Buildbot, OTOH, is not a CI system per se but more of a framework from which you create your own system in Python. I wrote a prototype of a buildbot-based CI system for GCC around 2017[3], and many other projects have been using buildbot, like llvm[4], webkit[5] and gdb[6], to name just a few.

A little later, Gitlab integrated a CI solution and I started to use it with Racket in a Gitlab fork. Later, when Gitlab released CI for GitHub projects, I worked with SamTH to get a Gitlab solution for the official Racket tree. Nowadays you can see it working under gitlab.com/racket/racket, with its configuration in https://github.com/racket/racket/blob/master/.gitlab-ci.yml.

Here's what it does at this point (currently all configurations are running on Linux):

On x86_64:
1. Builds RacketCGC -> Racket3M -> RacketCS
2. Runs tests on all three variants (similar to those run by Travis)
3. Builds all the variants once again and tests them with --enable-ubsan, keeping a record of the runtime errors (for example: https://gitlab.com/racket/racket/-/jobs/350271624/artifacts/file/runtime-errors.log)
4. Builds all the variants once again with the llvm static analyser and keeps a record of the failures (for example: https://gitlab.com/racket/racket/-/jobs/350271564/artifacts/file/scan-report_mmm/2019-11-14-032030-5552-1/index.html). This step requires us to build LLVM with Z3 enabled so we can use the work from ICSE'19[7].
On armv7l (arm32 with hard float):
1. Builds RacketCGC -> Racket3M
2. Runs tests

The above pipeline takes 1h10m[8] to run through on my machines.

Every night, in addition to the above, we extend the pipeline with:
1. Emulated builds of CGC and 3M on arm64, armel (32 bits, little endian, soft floating point), armhf (32 bits, little endian, hard floating point), i386, mips (32 bits, big endian), mips64el (64 bits, little endian) and mipsel (32 bits, little endian)
2. Same as above but configured with --enable-generations=no (since https://github.com/racket/racket/commit/7c3a207f36dc25baaac4afdf7ecedc18bf9ff49c)

The nightly pipeline takes 4h to run since it also compiles QEmu 3.1.0 beforehand (the QEmu in the Debian container is too old). Both this build and the LLVM build are cached, so they only really get built once until the versions change. QEmu 4.1.0 shows good results[9] so I will upgrade to it soon.

The biggest problem with the Gitlab pipeline is that it worked _really_ well until I started wanting to optimize it. For example, I wanted a stage-less pipeline where a job only waits for the few jobs it depends on in previous stages instead of waiting for all of them. Gitlab is finally catching up on this with the `needs` keyword, but the interface becomes a mess. Over all the time I have spent with Gitlab CI, I spent almost as much time figuring out why some things break or don't work[10, 11] as I spent configuring Racket. Gitlab CI is great for simple projects but it just gets harder and harder as the pipeline complexity grows.

I have slowly been gathering machines to test Racket and other work-related projects, so I now have quite a few machines/boards of varied architectures (arm32, arm64, x86_64, mipsel). Two months ago I also got a Windows machine to test Racket on Windows 10 (also coming soon[12]). To run Gitlab CI for Racket on your own machine, you simply need to install `gitlab-runner` and connect it to the project appropriately.
However, I just got a rpi4 with 4GB of RAM and found out that I cannot use it because gitlab-runner doesn't run on arm64 yet (Go apparently doesn't support arm64 yet). So that's another bummer.

Lately I noticed that Gitlab CI for GitHub projects (what we use for Racket) doesn't support, afaict, running the pipelines on PRs. And if it did, it probably wouldn't support running a special faster pipeline so that the PR author can tell whether it's breaking something.

All in all, we have outgrown Gitlab CI and I would like to spend more of my free time working on an improved GC for RacketCS than on fixing Gitlab CI or working around it. I also think it is a waste of resources to run so many CI systems simultaneously, sometimes doing the same thing.

My next CI project was going to be supporting benchmarking and developing a Racket Dashboard webapp that displays the important CI results in a visually appealing way that's easy to understand. I take some time every day to look at the Gitlab CI pipeline and make sure that all the yellows (expected failures) are the ones we already knew were going to fail, rather than some new failure that ends up being categorized in the same way. Again, the interface doesn't help when you have 20+ jobs.

My proposal is to rewrite the current Gitlab CI pipeline using Buildbot and take it from there. This means writing Python, but maybe with some luck parts of it can be written in Racket and interfaced with Python if necessary (can Pycket help here?). Buildbot runs on all the architectures Python runs on, which includes all the ones we are interested in, and deploying it is as easy as it is with Gitlab. Granted, the code won't look like a yaml file anymore, but I am pretty sure that by now the Python code would be more readable than the current ~1300 line yaml file we use to configure our pipeline. Once Buildbot has the same features as Gitlab CI, I will extend it to ensure that the architectures tested with Azure, Travis and AppVeyor are covered.
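
As a rough illustration of how the pipeline described above might read when expressed in Python, here is a minimal sketch. It is plain standard-library Python (3.9+ for `graphlib`) and does not use the Buildbot API; the names `Job`, `BUILT_FROM`, `make_jobs`, `run_order` and `ready_now`, as well as the job naming scheme, are hypothetical and only meant to show the idea of jobs that wait for the specific jobs they need instead of for a whole stage.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical model, not Buildbot API: each variant names the variant
# it is built from, mirroring CGC -> 3M -> CS.
BUILT_FROM = {"cgc": None, "3m": "cgc", "cs": "3m"}


@dataclass(frozen=True)
class Job:
    name: str
    needs: tuple = ()  # names of the jobs this one waits for


def make_jobs(arch, variants):
    """Return the build and test jobs for one architecture."""
    jobs = []
    for variant in variants:
        build = f"build:{variant}:{arch}"
        parent = BUILT_FROM[variant]
        needs = (f"build:{parent}:{arch}",) if parent else ()
        jobs.append(Job(build, needs))
        # A test job waits only for its own build, not for a whole stage.
        jobs.append(Job(f"test:{variant}:{arch}", (build,)))
    return jobs


def as_graph(jobs):
    """Map each job name to the set of job names it depends on."""
    return {job.name: set(job.needs) for job in jobs}


def run_order(jobs):
    """One valid execution order that respects every dependency."""
    return list(TopologicalSorter(as_graph(jobs)).static_order())


def ready_now(jobs):
    """Jobs that can start immediately because they depend on nothing."""
    sorter = TopologicalSorter(as_graph(jobs))
    sorter.prepare()
    return sorted(sorter.get_ready())


jobs = make_jobs("x86_64", ["cgc", "3m", "cs"]) + make_jobs("armv7l", ["cgc", "3m"])
order = run_order(jobs)
print(ready_now(jobs))  # ['build:cgc:armv7l', 'build:cgc:x86_64']
```

Adding another architecture or variant is then one more `make_jobs` call rather than another block of yaml. A real Buildbot configuration would look different; this only models the dependency structure.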
Once that coverage is in place, we could potentially switch off the other systems. DrDr seems to be a different beast, much harder to replace, so before I go there I will sync with the rest of the team. I still think that having a unified system and interface could be the way to go.

If you are happy with my proposal, I will go ahead and start a new project on GitHub: racket-buildbot. Once we get this to a stable point, we could merge it into the racket tree and remove .gitlab, etc.

At this point, I welcome any comments and suggestions. Having good CI means that in the long term we can ensure that Racket keeps running on all supported platforms (and, once benchmarking is done, we can track how Racket's performance changes over time). So having good CI is important. However, it is only relevant if it is useful to the Racket team and contributors. It would be great if everyone involved could chime in with what they would like to have/see. Feel free to request whatever you want; I cannot promise to implement all of it but I can make a list.

Refs:
[1] https://gitlab.com
[2] https://buildbot.net
[3] https://jenkins.io
[4] http://lab.llvm.org:8011/
[5] https://build.webkit.org/
[6] https://gdb-buildbot.osci.io/#/
[7] https://dl.acm.org/citation.cfm?id=3339673
[8] https://gitlab.com/racket/racket/pipelines/95803835
[9] https://github.com/LinkiTools/racket/tree/pmatos-qemu-410
[10] https://gitlab.com/gitlab-org/gitlab/issues?scope=all&utf8=%E2%9C%93&state=opened&author_username=pmatos
[11] https://gitlab.com/gitlab-org/gitlab-runner/issues?scope=all&utf8=%E2%9C%93&state=opened&author_username=pmatos
[12] https://github.com/LinkiTools/racket/tree/pmatos-ci-win10

Thanks for reading this,

--
Paulo Matos
