Re: Codecov and CyberShadow failure

Vladimir Panteleev via Digitalmars-d Sun, 12 Feb 2017 07:37:32 -0800

On Thursday, 9 February 2017 at 17:41:09 UTC, Jack Stouffer wrote:

On Wednesday, 8 February 2017 at 21:05:45 UTC, Jack Stouffer wrote:
...
Still can't find the root cause. I'm also unable to recreate the problem locally using the same commands as the doc builder.
We currently have nine PRs in the pipe ready to be merged once this error is nailed down. If anyone could lend a hand here, it would be very helpful.

Apologies for that. I made the documentation tester mandatory a while ago, so extended downtime like this is unacceptable.

In the interest of public disclosure, here is the timeline and problems encountered:

- In response to some complaints about forum performance, I investigated sources of high I/O on the server, and identified the documentation tester as a major culprit. On 2017-02-06, I moved the working directory to a tmpfs (/dev/shm), which resulted in a dramatic improvement of I/O operations: https://dump.thecybershadow.net/d41c095b6a0dcdb7b827499a487b7c65/16%3A42%3A10-upload.png

- I then began receiving reports of the autotester malfunctioning. In the process of debugging this problem, I discovered a second problem: some files on the tmpfs would periodically disappear. This is what caused the intermittent "file not found" errors.

- After some trial and error, I identified the source of the second problem (an unusual systemd behaviour). I adjusted the server configuration on 2017-02-09 to disable that behaviour.

- However, the first problem persisted (it manifested as compilation errors in the 2.073.0 version of Phobos). Finally, yesterday (2017-02-11), with some experimentation I discovered that the root problem was a latent DMD bug which manifested only when the Phobos source files were passed to it in a certain order, which happened to be the file iteration order on tmpfs. Details in the pull request: https://github.com/dlang/dlang.org/pull/1568


- Now that the PR is merged, master and stable are green again.
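
For anyone who wants to guard their own build scripts against this class of problem (directory iteration order differing between filesystems), here is a minimal sketch in Python. It is not what the pull request above does, and it does not fix the underlying DMD bug; it only illustrates making the file list independent of the filesystem. The function name list_sources and the ".d" extension filter are my own choices for illustration.

```python
import os

def list_sources(root, ext=".d"):
    """Return all files under `root` ending in `ext`, in sorted order.

    os.walk yields entries in whatever order the filesystem returns
    them, which can differ between e.g. ext4 and tmpfs. Sorting makes
    the resulting list the same everywhere.
    (Hypothetical helper, not part of any dlang.org tooling.)
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place also fixes the order in which os.walk
        # descends into subdirectories.
        dirnames.sort()
        for name in filenames:
            if name.endswith(ext):
                found.append(os.path.join(dirpath, name))
    # Final sort over full paths, so the order does not depend on
    # traversal details at all.
    return sorted(found)
```

The sorted list can then be passed to the compiler, so that the same command line is produced regardless of where the working directory lives.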

I accept that this shouldn't have taken a week to fix, and the initial change in question (tmpfs move) would have been better done in a test environment. FWIW, in parallel I've been working on a full-disk backup strategy to prepare for having one of the server's HDDs replaced. (We already have backups of critical data, but rebuilding from backups and reinstalling the system would result in downtime that can be avoided. The HDDs are already in RAID1 configuration, so the full disk backup is a precaution.)
