Hi all,
This is not a final RFO (reason for outage) as we need to continue
observations, but we have solved a major problem and hg is looking
healthy now.
= tl;dr =
Since July 17, hg.mozilla.org has been experiencing a series of outages,
resulting in sheriffs needing to close the trees. This issue has been
tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=1040308.
The outages became more frequent and harder to recover from over the
past month.
We have now identified a reason for, and at least temporarily resolved,
the issue.
In the past, responsibility for hg and related systems was poorly
owned and passed like a hot potato between different parts of Mozilla.
This work has now been consolidated into a single team. The investigation
detailed below is something we didn't have the resources to do in
the past: we've run into this sort of downtime before, reset try, and
moved on without learning how to prevent it in the future. People
could not debug the issues in depth due to time constraints,
organizational priorities, and so on. In this instance, I pulled together
resources from various teams inside the broader Engineering Operations
group to finally diagnose the problem.
= Background =
The way that we use the hg try repository involves unbounded growth in
the number of heads, and we have periodically solved this with a try
reset. Try resets are undesirable to developers, as they involve a loss
of history.
Until this week, the correlation between the number of heads and poor
performance had been anecdotal.
Over the last two weeks, we added a lot of instrumentation to hg. We knew
outages were signified by a spike in CPU utilization (to 100% on all the
hg webheads) for each request. We isolated the cause to spinning hg
processes. When yesterday's outage occurred, we were able to attach to
the running processes and see what they were doing: hg was building its
branch cache.
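For reference, here is a hedged sketch of how one can capture both the C-level and Python-level stacks from a spinning process of this kind. It assumes gdb with the CPython gdb helpers (python-gdb.py) is available on the webhead, and the `pgrep` pattern is a placeholder, not the actual process name:

```shell
# Find a spinning hg web process (the 'hgweb' pattern is hypothetical).
PID=$(pgrep -f 'hgweb' | head -n 1)

# Attach non-interactively and dump both views of the stack:
#   bt    -> C-level frames
#   py-bt -> Python-level frames (requires CPython's gdb helpers loaded)
gdb -p "$PID" -batch \
    -ex 'bt' \
    -ex 'py-bt'
```

Seeing the same frames (e.g. cache-rebuild code) in repeated samples is what localizes a spin like this one.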
hg updates its caches when new commits are added or accessed for the first
time. When an hg repository has a very large number of heads, certain
cache operations, especially full rebuilds, become intractable and will
not complete.
We solved this last night by doing a try reset as we have done in the
past, but now we know *why that works*.
We have also preserved try history this time, in two ways:
1. Using the revision number, you can access the change sets at an
experimental repository:
http://hg.stage.mozaws.net/mirrors/generaldelta/try/
Note: this repository is not scaled for heavy usage.
2. A tarball of old-try is available here:
http://people.mozilla.org/~hwine/try_history/try-reset-2014-08-13-1826.tar.bz2
(3.6 GB)
The Dev Services team (Kendall Libby and Ben Kero) and Hal Wine from
Release Engineering worked on getting better instrumentation. This
showed us what was happening when CPUs spiked: a spinning hg process.
Erik Rose from Web Engineering burrowed into the spinning processes to
extract both C and Python tracebacks and profiles, localizing the
problem. Greg Szorc from the Stability team, who is an hg contributor,
could then reason out the higher-level hg flow that was leading to the
spins, giving us our solution.
This approach is new: looking at hg internals rather than treating hg as
a black box, and having developers and operations people work closely
together. It's part of a strategy to prioritize Developer
Services and give these critical parts of our infrastructure the
attention they deserve. I want to thank all of the individuals named
above for working hard and well together to solve the problem.
= Next steps =
A postmortem to go over the outage and plan our strategies going forward
will be held in the next few days. Please contact me if you would like
to be invited to the postmortem, or would like to receive notes after
the meeting.
We also have plans to improve the architecture of hg to support Mozilla
as we continue to grow, and to solve these issues more permanently.
Let me know if you have questions or concerns.
Best,
Laura Thomson
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform