Dear All,

I would like to do a quick postmortem for yesterday’s Jenkins weekly release 
outage, which lasted about 6h30. The weekly release Jenkins 2.263 was listed 
on www.jenkins.io as available but could not be downloaded.

Since April 2020, the weekly release has been fully automated and is triggered 
every Tuesday by this job 
<https://github.com/jenkins-infra/release/blob/8a3db509247948cf8effc6925fd7872b8e77fa40/Jenkinsfile.d/core/weekly#L13>

It runs two Jenkins jobs from a specific Jenkins instance:
1) Build the Maven artifacts, then publish them to repo.jenkins-ci.org 
<https://repo.jenkins-ci.org/releases/org/jenkins-ci/main/jenkins-war/>
2) Build the distribution packages using the jenkins.war from the Maven 
repository, then update our mirror infrastructure

Yesterday, the second stage failed on the Windows packaging step, so no 
distribution packages were published at all.

But because a new version had been published to our Maven repository by the 
first job, every Jenkins instance was notified that a new weekly version was 
available. And because we had not updated our mirror infrastructure, nobody was 
able to fetch the update. It took us 6h30 to fix it. Fortunately, the second 
stage is pretty quick, about 15 min versus the 2h needed for the first stage, 
so we reran the job without Windows packaging.

Remark: At the moment, the Windows package is still not published due to a 
Windows issue in the infrastructure.

This outage reminded us that we still have work to do and help is definitely 
more than welcome :)

*Issues*

* [INFRA-2538 <https://issues.jenkins-ci.org/browse/INFRA-2538>] -> To fix the 
Windows packaging issue
* We wrote a Python script to detect the latest version from 
maven-metadata.xml. For some reason, the metadata file 
<https://repo.jenkins-ci.org/releases/org/jenkins-ci/main/jenkins-war/maven-metadata.xml>
we rely on still references the previous weekly release 2.262, while all 
the other maven-metadata.xml files are correct. :/
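
To illustrate the kind of check involved, here is a minimal sketch (not our actual script) that reads a maven-metadata.xml and compares the declared <latest> value with the highest weekly version in the <versions> list. The sample XML is made up for illustration, the function names are hypothetical, and the sketch assumes that weekly versions have exactly two numeric components (LTS versions such as 2.249.3 are skipped).

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration only; this is NOT the real file content.
SAMPLE_METADATA = """<metadata>
  <groupId>org.jenkins-ci.main</groupId>
  <artifactId>jenkins-war</artifactId>
  <versioning>
    <latest>2.262</latest>
    <release>2.262</release>
    <versions>
      <version>2.249.3</version>
      <version>2.262</version>
      <version>2.263</version>
    </versions>
  </versioning>
</metadata>"""


def weekly_key(version):
    """Return a numeric sort key for a weekly version, or None otherwise.

    Assumption: weekly versions look like "2.263" (two numeric parts).
    """
    parts = version.split(".")
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return tuple(int(p) for p in parts)
    return None


def latest_versions(xml_text):
    """Return (declared_latest, highest_listed_weekly) from the metadata."""
    root = ET.fromstring(xml_text)
    declared = (root.findtext("versioning/latest") or "").strip() or None
    listed = [
        (node.text or "").strip()
        for node in root.findall("versioning/versions/version")
    ]
    weekly = [v for v in listed if weekly_key(v) is not None]
    # Compare numerically so that 2.100 sorts after 2.99.
    highest = max(weekly, key=weekly_key) if weekly else None
    return declared, highest


def is_stale(xml_text):
    """True when <latest> lags behind the highest listed weekly version."""
    declared, highest = latest_versions(xml_text)
    return highest is not None and declared != highest
```

Whether the stale value in our case sits in <latest> or in the version list itself is not something this sketch assumes; it only shows how the two can be compared.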

*Monitoring*

6h30 is way too long to detect such an issue. Habituation and fatigue are a 
thing, and we must detect when something goes wrong as fast as possible.
[INFRA-2027] -> I started working on a Python script that we could use with 
Datadog, but I haven't had the time to finish it yet.
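
As a rough idea of what such a check could look like (this is a sketch, not the unfinished script mentioned above), the code below takes a version and a list of download URL templates and reports which packages are missing. The URL templates and function names are hypothetical, and the Datadog integration is left out entirely; the result would still have to be reported to whatever monitoring we choose.

```python
import urllib.error
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request to url succeeds (2xx or 3xx)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # HTTP errors (404, 500, ...), DNS failures and timeouts all count
        # as "not available" for the purpose of this check.
        return False


def missing_packages(version, url_templates, exists=url_exists):
    """Return the package URLs that are not downloadable for this version.

    url_templates are hypothetical strings containing "{version}".
    The exists parameter is injectable so the check can be tested offline.
    """
    missing = []
    for template in url_templates:
        url = template.format(version=version)
        if not exists(url):
            missing.append(url)
    return missing
```

Run every few minutes against the version announced by the Maven repository, a non-empty result would mean that a release is visible to users but not yet downloadable.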

*Artifact Promotion*
 
While it would not have solved the current problem, we could have published the 
Maven release to a temporary Maven repository and only promoted the artifacts 
once every distribution package was available.
People would then not have been notified. Considering that we mainly rely on 
people for monitoring, this would probably have delayed the release even more. 
We already have that logic in place, as it's needed for security releases 
anyway; we just have to agree on a staging repository.
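
The ordering described above can be sketched as follows. This is only an illustration of the idea, not our release code: stage_artifacts, build_package and promote are hypothetical callables standing in for the real jobs.

```python
def run_release(stage_artifacts, build_package, promote, package_kinds):
    """Stage Maven artifacts, build every package, promote only if all succeed.

    All three callables are hypothetical placeholders:
      stage_artifacts() -> publishes to a staging repository, returns a handle
      build_package(kind, staged) -> builds one distribution package
      promote(staged) -> copies the staged artifacts to the public repository
    """
    staged = stage_artifacts()
    failed = []
    for kind in package_kinds:
        try:
            build_package(kind, staged)
        except Exception as error:
            # Keep going so that every failing package is reported at once.
            failed.append((kind, str(error)))
    if failed:
        # Nothing is promoted, so no Jenkins instance is notified.
        return {"promoted": False, "failed": failed}
    promote(staged)
    return {"promoted": True, "failed": []}
```

With this ordering, a failure like yesterday's would have left the release unpromoted instead of announced but not downloadable.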

We’ll be working on those improvements and will share our progress as the 
improvements become available.

Cheers,

Olblak

-- 
You received this message because you are subscribed to the Google Groups 
"Jenkins Developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/jenkinsci-dev/4c4cc1b0-ee5e-4faf-ac7e-677e50dd36f5%40www.fastmail.com.
