Hi all - I've run into a problem that I'm looking for some help with. Let me describe the situation, and then share some thoughts.

The company I work for uses git. We use GitHub Enterprise as a frontend for our primary git server. We're using Chef solo to manage a fleet of upwards of 10,000 hosts, which regularly pull from our Chef git repository so as to converge.

A few years ago, in order to reduce the load on the primary server, the firm set up a fleet of replication servers. This allowed the majority of our infrastructure to target the replication servers for all pull-based activity rather than the primary git server.

The actual replication process works as follows:

1. The primary git server receives a push and sends a webhook with the details of the push (repo, ref, sha, some metadata) to a "publisher" box.
2. The publisher enqueues the details of the webhook into a queue.
3. A fleet of "subscriber" (replica) boxes each reads the payload of the enqueued message. Each of these then either clones the repository if it doesn't already have it, or runs `git fetch`.

During the course of either of these operations we use a repository-level lockfile. If another request comes in while the repo is locked, we re-enqueue it. When that request comes in later, if the push event time is earlier than the most recent successful fetch, we don't do any further work.

---

The problem we've been running into is that as we've continued to add developers, the CPU load on our primary instance has gone through the roof. We've upgraded the machine to some of the punchiest hardware we can get and it's regularly exceeding 70% CPU load. We know that the overwhelming driver of this CPU load is near-constant git operations on some of our larger and more active repositories.

Migrating off of Chef solo is essentially a non-starter at this point in time, and we've also added a considerable number of non-Chef git dependencies such that getting rid of the replication service would be a massive undertaking.
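
To make the subscriber-side flow described above concrete, here is a rough Python sketch of what each replica does with one queued push event. It is a simplification under stated assumptions, not our actual code: `PushEvent`, `handle_event`, the `last_fetch` dictionary, the `requeue` callback, and the `<repo>.lock` file naming are all hypothetical names chosen for illustration.

```python
import subprocess
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PushEvent:
    """Webhook payload as seen by a subscriber (field names are hypothetical)."""
    repo_url: str     # clone URL on the primary
    repo_dir: Path    # where this replica keeps its copy
    ref: str
    sha: str
    pushed_at: float  # push time, seconds since the epoch


def handle_event(event, last_fetch, requeue, run=subprocess.run, now=time.time):
    """Process one event; returns 'skipped', 'requeued', 'cloned' or 'fetched'."""
    # A fetch that succeeded after this push has already picked it up.
    if last_fetch.get(event.repo_dir, 0.0) >= event.pushed_at:
        return "skipped"

    lock = event.repo_dir.with_suffix(".lock")
    try:
        # Mode "x" is exclusive creation: it fails if the lockfile already exists.
        lock.open("x").close()
    except FileExistsError:
        requeue(event)
        return "requeued"

    try:
        # Assumption: record the start time, so a push that lands while the
        # fetch is running is never skipped by mistake.
        started = now()
        if event.repo_dir.exists():
            run(["git", "-C", str(event.repo_dir), "fetch"], check=True)
            action = "fetched"
        else:
            run(["git", "clone", event.repo_url, str(event.repo_dir)], check=True)
            action = "cloned"
        last_fetch[event.repo_dir] = started
        return action
    finally:
        # Release the repo-level lock whether or not git succeeded.
        lock.unlink()
```

The `run` and `now` parameters exist only so the sketch can be exercised without touching a real repository. Note that the whole clone or fetch sits inside one repository-wide lock, which is the coarse-grained behavior questioned in point 2 below.
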
Even if we were to kill the replication fleet, we'd have to figure out where to point that traffic in such a way that we didn't overwhelm the primary git server anyways.

So I'm looking for some help in figuring out what to do next. At the moment, I have two thoughts:

1. We currently run a blanket `git fetch` rather than specifically fetching the ref that was pushed. My understanding from poking around the git source code is that this causes the replication server to send a list of all of its ref tips to the primary server, and the primary server then has to verify and compare each of these tips to the ref tips residing on the server.

That might not be a ton of work to do on an individual fetch operation, but some of our repositories have over 5,000 branches and are pushed to 1,000 times a day. Algorithmically, this would suggest that the daily cost of fetching goes up with both N (branches) and M (pushes), so we're talking about a cost on the order of N*M, and both N and M will increase when we hire developers. If both grow roughly in proportion to headcount, this implies roughly quadratic growth in load as we add engineers, which is...not good.

Also worth noting - if we assume that the replication server should be reasonably up-to-date (see lockfile logic description, above), we're talking about typically packing objects for one ref, and at most a small number (<5).

My hypothesis is that moving to fetching the specific branch rather than doing a blanket fetch would have a significant and material impact on server load. If we do go down this route, we'll potentially need to do some refactoring around how we handle "failed fetches", which relates both to our locking logic and to the actual potential failure of a git fetch. Discussion of the locking mechanism follows:

2. Our current locking mechanism seems to me to be unnecessary.
I'm aware that git uses a few different locking mechanisms internally, and the use of a repo-level lockfile would seem to only guarantee that we're using coarser-grained locking than git actually supports. But I don't know if git supports specific fetch-level ref locking that would permit concurrent ref fetch operations. If it does, our current architecture would seem to prevent git from being able to take advantage of those mechanisms.

In other words, let's imagine a world in which we ditch our current repo-level locking mechanism entirely. Let's also presume we move to fetching specific refs rather than using blanket fetches. Does that mean that if a fetch for ref A and a fetch for ref B are issued at roughly the same time, the two will be able to execute concurrently without running into some git-internal locking mechanism at a granularity coarser than the ref? i.e. are fetch A and fetch B going to be blocked on the other's completion in any way? (Let's presume that neither ref A nor ref B is an ancestor of the other.)

---

I'm neither a contributor nor an expert in git, so my inferences thus far are based purely on what I would describe as "stumbling around the source code like a drunken baby". The ultimate goal for us is just figuring out how we can best reduce the CPU load on the primary instance so that we don't find ourselves in a situation where we're not able to run basic git operations anymore.

If I'm barking up the wrong tree, or if there are other optimizations we should be considering, I'd be eager to learn about those as well. Of course, if what I'm describing sounds about right, I'd like confirmation of that from some people who actually know what they're talking about (i.e., not me :) ).

Thanks.

- Venantius

--
============
venanti.us
203.918.2328
============
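
As a footnote to point 1 above, here is a small Python sketch of the back-of-the-envelope cost model and of the targeted fetch being proposed. It encodes only the mental model from this message (each blanket fetch makes the server consider every branch tip), which may well be wrong. The function names are hypothetical, and the refspec assumes the replica is a bare or mirror-style repository where a local ref may be force-updated to the same name as on the primary.

```python
def ref_comparisons_per_day(branches, pushes_per_day):
    """Ref tips considered per day on ONE replica if every push triggers a blanket fetch."""
    return branches * pushes_per_day


def targeted_fetch_cmd(repo_dir, ref, remote="origin"):
    """Build a `git fetch` that names only the pushed ref.

    The leading '+' in the refspec allows non-fast-forward updates, so a
    force-push on the primary is mirrored instead of being rejected.
    """
    refspec = f"+{ref}:{ref}"
    return ["git", "-C", repo_dir, "fetch", remote, refspec]


# Figures quoted above: 5,000 branches and 1,000 pushes a day.
today = ref_comparisons_per_day(5_000, 1_000)

# If doubling headcount doubles both branches and pushes, the product quadruples.
doubled = ref_comparisons_per_day(10_000, 2_000)
growth_factor = doubled // today
```

With the figures quoted above, the model gives five million tip comparisons per day per replica for a single busy repository, and four times that if both branch count and push rate double.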