Francis J. Lacoste wrote:
> ---------- Forwarded Message ----------
>
> Subject: Results of Loggerhead Bug Investigation
> Date: November 19, 2009
> From: "Max Kanat-Alexander" <[email protected]>
> To: "Francis J. Lacoste" <[email protected]>
>
> Hey Francis.
>
> So, I investigated the memory leak and the codebrowse hanging problem.
>
> The memory leak is just some part of the code leaking a tiny amount of
> memory when a specific type of page is requested (I'm not sure which
> page yet). The tiny leak grows over days until the process is very
> large. I can reproduce the leak locally. The rest of the work involved
> in this would be tracking down where the leak occurs and patching
> it -- I suspect this will not be a major architectural change, just a
> fix to loggerhead or perhaps Paste. However, I think the task of
> initial analysis is complete.
This sounds sane.

> The more significant issue is the hangs. The hang is, in a sense, two
> separate issues:
>
> 1) When a user loads multiple revisions of a very large branch
> (launchpad itself, bzr itself, or mysql) that doesn't have a revision
> graph yet, building the revision graph takes an enormous amount of
> time and causes the rest of loggerhead to slow to a crawl, making it
> appear hung for three to five minutes.

As suspected then, but it sounds worse than I'd guessed.

> 2) Loggerhead (or perhaps just a single loggerhead instance) doesn't
> scale very well across many branches with many users, partially
> because of how the revision graph is currently built and partially (I
> suspect) because any given Python process is going to be limited by
> the Global Interpreter Lock on how many concurrent requests it can
> honestly handle.

Yeah.

> So the question for this issue is -- what level would you like me to
> address it on? If you'd like me to simply work on the revision graph
> issue, I could do that within the current architecture of loggerhead
> and devise a fix. Probably the simplest would be to just place a mutex
> around building a revision graph for any one branch.

That's probably a good fix for loggerhead, but maybe not sufficient for
Launchpad.

> However, that may not fix the actual *performance* problems seen with
> codebrowse; it just might make hangs less likely. A more general
> approach to loggerhead's scalability would result in a fix for this
> and also for any performance issues that loggerhead sees in the
> Launchpad environment. A quick search for "python paste scale" in
> Google turns up http://pypi.python.org/pypi/Spawning/ which (after
> sufficient vetting) might be a reasonable solution.

Another team at Canonical tried Spawning and had to give up and go back
to Paste.
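For what it's worth, the "mutex around building a revision graph for
any one branch" idea could look something like the sketch below: one
lock per branch, so concurrent requests for the same big branch wait
for a single build instead of piling up, while other branches proceed.
The names (get_revision_graph, build_revision_graph, the cache layout)
are illustrative, not loggerhead's actual API.

```python
import threading

_graph_cache = {}                  # branch_id -> built revision graph
_branch_locks = {}                 # branch_id -> per-branch lock
_locks_guard = threading.Lock()    # protects _branch_locks itself

def get_revision_graph(branch_id, build_revision_graph):
    # Briefly take the global guard only to find/create this branch's
    # lock; the expensive build never runs under the global guard.
    with _locks_guard:
        lock = _branch_locks.setdefault(branch_id, threading.Lock())
    with lock:
        # Double-checked: a request that waited on the lock finds the
        # graph already built and skips the expensive work.
        if branch_id not in _graph_cache:
            _graph_cache[branch_id] = build_revision_graph(branch_id)
    return _graph_cache[branch_id]
```

As noted, this mitigates the hangs rather than the underlying cost: the
first request for a cold branch still pays the full build time.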
So let's learn from their misfortune :)

> Then once we have a better single-server solution, making it scale
> out to multiple servers (by having a central store for the revision
> graph cache and making sure that loggerhead plays well under
> load balancing) would be the next step.

As Rob pointed out in the bug report, if we can have the load balancer
always direct requests for the same branch to the same loggerhead
backend, we don't need to worry too much about the central store part.

Speaking more generally, the problem is the revision graph cache -- can
we make it go away, or at least handle it better? I always forget why
we actually need it, so let's try to recap:

1. Going from revid -> revno. Loggerhead does this a lot.

2. Going from revno -> revid. Probably done ~once per page.

3. In History.get_revids_from(). This gets into behaviour territory.
   Basically it "mainline-izes" a bunch of revisions. It can probably
   touch quite a lot of the graph.

4. get_merge_point_list(). I can't remember what this does :(

5. get_short_revision_history_by_fileid(). Just uses it to get the set
   of all revids in the branch.

Y'see, one of the problems with a central graph store is that graphs
are big, and any central store implies IPC, which implies
serialization, and serializing and deserializing something as big as
Launchpad's revision graph cache is annoyingly slow. So one idea would
be to have this central store not serve up entire graphs, but instead
be able to answer the questions above. There would be many problems
with this approach, of course -- for example, you probably don't want
to make a cross-process call for every revid -> revno translation
loggerhead does, and gathering all the revids you'd want to translate
before you start rendering would be painful.
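To make the "answer the questions, don't ship the graph" idea concrete,
here's a toy sketch of such a query service, with batched revid -> revno
translation so rendering a page could cost one round trip rather than
one call per revision. The class and method names are hypothetical, and
a real store would of course sit behind IPC rather than in-process:

```python
class RevisionGraphService:
    """Toy in-process stand-in for a central graph store that answers
    queries instead of serving whole serialized graphs."""

    def __init__(self, mainline):
        # mainline: list of revids, oldest first; revnos are 1-based.
        self._revno_by_revid = {r: i + 1 for i, r in enumerate(mainline)}
        self._revid_by_revno = {i + 1: r for i, r in enumerate(mainline)}

    def revnos(self, revids):
        """Batched revid -> revno translation (question 1); unknown
        revids map to None."""
        return {r: self._revno_by_revid.get(r) for r in revids}

    def revid(self, revno):
        """revno -> revid, roughly once per page (question 2)."""
        return self._revid_by_revno.get(revno)

    def all_revids(self):
        """The full set of revids in the branch (question 5)."""
        return set(self._revno_by_revid)
```

The batching is the point: callers would have to gather the revids they
need up front, which is exactly the "painful before you start
rendering" problem mentioned above.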
On the more serious end, it might be worth pushing the generation of
the cache into the store, though, and then it can compute the caches in
subprocesses or whatever to maximize CPU utilization and to maintain
the performance of the loggerhead process(es).

Another, probably more tractable, problem would be to be able to
incrementally generate revision caches in the common case of revisions
merely being added to the branch.

If the graph store stored the graphs as more than just a lump, you
could probably reuse parts of the graph for mainline when building the
graph for a derived branch. I think John Arbash Meinel might have some
code for this...

In the mean time, if someone can tease out what:

    self.simplify_merge_point_list(self.get_merge_point_list(revid))

actually does, I'm all ears.

> Perhaps the best thing would be to come up with a "quick patch" to
> save the LOSAs from having to constantly restart codebrowse, and then
> once we have that situation at least mitigated, we could go on to
> work on the actual underlying scalability issue.

I'm not sure what the "quick patch" would be -- the mutex around
revision graph cache building?

Cheers,
mwh
_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp
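The incremental case ("revisions merely being added") could be as
simple as diffing the new tip history against what we cached and
extending it, falling back to a full rebuild only when history was
rewritten. Purely illustrative -- loggerhead's real cache holds much
more than a revid -> revno map, and extend_revno_cache is a made-up
name:

```python
def extend_revno_cache(cached, old_mainline, new_mainline):
    """Update cached (a revid -> revno map for old_mainline) in place
    when new_mainline is old_mainline plus appended revisions.
    Returns the number of newly processed revisions; raises if the
    old history is not a prefix of the new one (rebase, uncommit, ...),
    in which case a full rebuild is needed."""
    if new_mainline[:len(old_mainline)] != old_mainline:
        raise ValueError("history rewritten; full rebuild required")
    added = new_mainline[len(old_mainline):]
    # Revnos are 1-based, so the first appended revision gets
    # len(old_mainline) + 1.
    for revno, revid in enumerate(added, start=len(old_mainline) + 1):
        cached[revid] = revno
    return len(added)
```

For the common push of a few new revisions this touches only the tail
of the branch, instead of re-walking the whole graph.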

