Hi all,

This is not a final RFO (reason for outage) as we need to continue observations, but we have solved a major problem and hg is looking healthy now.

= tl;dr =
Since July 17, hg.mozilla.org has been experiencing a series of outages, resulting in sheriffs needing to close the trees. This issue has been tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=1040308.

The outages became increasingly frequent and harder to recover from over the last month. We have now identified a reason for, and at least temporarily resolved, the issue. In the past, responsibility for hg and related systems has been poorly owned and passed like a hot potato between different parts of Mozilla. This work has now been consolidated into a single team. The investigation detailed below is something that we didn't have the resources to do in the past. We've run into this sort of downtime before, reset try, and moved on without learning how to prevent it in the future. People could not debug the issues in depth due to time constraints, organizational priorities, and so on. In this instance, I pulled together resources from various teams inside the broader Engineering Operations group to finally diagnose the problem.

= Background =
The way that we use the hg try repository involves unbounded growth in the number of heads, and we have periodically solved this with a try reset. Try resets are undesirable to developers, as they involve a loss of history.

Until this week, the correlation between the number of heads and poor performance had been anecdotal.

Over the last two weeks, we added a lot of instrumentation to hg. We knew outages were signified by a spike in CPU utilization (to 100% on all the hg webheads) for each request. We isolated the cause to spinning hg processes. When yesterday's outage occurred, we were able to attach to the running processes and discover what was occurring: hg was building its branchcache.

hg updates its caches when new commits are added or accessed for the first time. When an hg repository has a very large number of heads, certain cache operations, especially full rebuilds, become intractable and will not complete.
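As a rough illustration of why this blows up (a toy model, not hg's actual branchcache code), consider a naive rebuild that walks the ancestry of every head; the total work grows with the number of heads multiplied by the depth of history:

```python
from collections import deque

def rebuild_branchcache(parents, heads):
    """Toy branchcache rebuild: visit every ancestor of every head.

    `parents` maps rev -> list of parent revs; `heads` lists head revs.
    Returns (cache, revisions_visited) so the rebuild cost is observable.
    """
    visited_total = 0
    cache = {}
    for head in heads:
        seen = set()
        queue = deque([head])
        while queue:
            rev = queue.popleft()
            if rev in seen:
                continue
            seen.add(rev)
            visited_total += 1
            queue.extend(parents.get(rev, []))
        cache[head] = len(seen)  # ancestors reachable from this head
    return cache, visited_total

# Linear trunk of 1000 revs (rev 0 is the root), plus N one-commit
# branches off the trunk tip -- one extra head per branch:
trunk = {i: [i - 1] for i in range(1, 1000)}

def with_heads(n):
    parents = dict(trunk)
    heads = []
    for j in range(n):
        rev = 1000 + j
        parents[rev] = [999]   # branch off the trunk tip
        heads.append(rev)
    return parents, heads

for n in (1, 10, 100):
    _, cost = rebuild_branchcache(*with_heads(n))
    print(n, cost)  # cost grows as n * trunk length
```

With try accumulating tens of thousands of heads, even a fast per-head walk turns into millions of revision visits, which is in line with the pegged-CPU spins described above.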

We solved this last night by doing a try reset as we have done in the past, but now we know *why that works*.

We have also preserved try history this time, in two ways:
1. Using the revision number, you can access the changesets at an experimental repository:
http://hg.stage.mozaws.net/mirrors/generaldelta/try/
Note: this repository is not scaled for heavy usage.

2. A tarball of old-try is available here: http://people.mozilla.org/~hwine/try_history/try-reset-2014-08-13-1826.tar.bz2 (3.6 GB)
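For example (a sketch, not a supported workflow — REV below is a placeholder for the 40-character changeset hash you want to recover), you could pull a single old try push from the experimental mirror into a local clone:

```shell
# Pull one changeset (and its ancestors) from the experimental mirror.
# REV is a placeholder for the changeset hash you want to recover.
hg pull -r REV http://hg.stage.mozaws.net/mirrors/generaldelta/try/
```

Passing -r limits the pull to the requested revision and its ancestry, which keeps load on the (unscaled) mirror low.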

The Dev Services team (Kendall Libby and Ben Kero) and Hal Wine from Release Engineering worked on getting better instrumentation. This showed us what was happening when CPUs spiked: a spinning hg process. Erik Rose from Web Engineering burrowed into the spinning processes to extract both C and Python tracebacks and profiles, localizing the problem. Greg Szorc from the Stability team, who is an hg contributor, could then reason out the higher-level hg flow that was leading to the spins, giving us our solution.

This approach--both looking at hg internals rather than treating it as a black box, and having developers and operations people work closely together--is new. It's part of a strategy to prioritize Developer Services and give these critical parts of our infrastructure the attention they deserve. I want to thank all of the individuals named above for working hard and well together to solve the problem.

= Next steps =
A postmortem to go over the outage and plan our strategies going forward will be held in the next few days. Please contact me if you would like to be invited to the postmortem, or would like to receive notes after the meeting.

We also have plans to improve the architecture of hg to support Mozilla as we continue to grow, and solve these issues in a more permanent fashion.

Let me know if you have questions or concerns.

Best,

Laura Thomson






_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
