Thanks Ed for the detailed timeline. I can also confirm that (from a simple
committer's POV) the outage was not over on Aug 2 (maybe 'core services'
for me are just different from those from the infra POV) but lasted until
Aug 4, when I could finally continue working on my issues.
So for me the summary "The outage was extensive, and for core services,
lasted for approximately 18 hours. Non-core services were degraded for
an additional 12 hours." does not feel quite right. As said before, I
can't 'prove' that; it's just that I was actually only able to resume my
work on Aug 4 (120hrs later!), and at least not until the tycho-ci server
was restarted ...
So it seems to me that a check "are all build servers running and have
executors" is missing from the status page.
On 13.08.21 at 15:22, Ed Willink wrote:
Hi
Thank you all for tackling problems quite quickly once you were engaged.
Perhaps this 'bystander's' perspective may help to understand the need
to communicate better.
I first became aware of the problem after receiving notification a
little after 2:42 EDT 1-Aug that a weekly OCL rebuild had failed.
Investigation of the log pointed a finger at the GIT repo and
eclipsestatus.io indicated that a major outage was in progress with an
'investigating' tweet. Clearly someone was on the case and so the
bystander effect took over and I didn't raise any reports or emails to
distract.
The 'investigating' status advanced to 'fix-in-progress' after an hour.
But then nothing for a further 5 hours, at which point we got 'it will
take 13 hours'. On twitter someone asked when the 13 hours started; one
might have hoped that it would be from the 'fix-in-progress' time. This
tweet and an 'ETA?' tweet were never answered.
17 hours later we got 'most websites' back, which might be true, but with
important services down, it was misleading. It took perhaps a further 4
hours for https://download.eclipse.org/tools/orbit/downloads/latest-I
to return, and 50 hours before projects-storage.eclipse.org was back, and
another couple of hours to get /shared/common/apache-ant-latest/bin/ant back.
IMHO the outage lasted until at least the restoration of
projects-storage.eclipse.org at Aug 4 8:50, and so one of the issues to
be addressed by the postmortem must be why the status page still reports
no incidents or outage on the whole of the 3rd Aug when, for committers
at least, there was no usable service all day.
I must thank the team again for their hard work with a very difficult
problem, but must also stress that the communication was very poor. So
much so that at 3:07 EDT on 4th Aug I sent a private email to Ed Merks
speculating that:
"The total silence from the team is now way beyond
incompetence/discourtesy/embarrassment; there must be another reason.
Paranoia sets in.
Is some government / hostile agency intervening to prevent
communication?
Are the team voluntarily maintaining silence to contain a security
issue?"
Please ensure that whenever possible the status updates are much more
informative.
Regards
Ed Willink
On 09/08/2021 21:45, Denis Roy wrote:
I very much appreciate the sympathy and the support. In the end, the
Infra team can do better than this. We'll lick our wounds and go back
to the drawing board to make sure we don't make the same mistakes twice.
The postmortem is written and pending review with my team.
Denis
_______________________________________________
cross-project-issues-dev mailing list
cross-project-issues-dev@eclipse.org
To unsubscribe from this list, visit
https://www.eclipse.org/mailman/listinfo/cross-project-issues-dev