We want low timeouts to prevent cascading server lockups: it only takes a few long concurrent CPU-bound queries for other users to experience huge delays in their work. Low timeouts also give users a faster signal when they do something that's unusually expensive. A snappy system is easier to keep fast and snappy: less lock contention, less I/O contention, and so on.
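To make the backstop idea concrete, here's a minimal sketch of enforcing a hard timeout on request work in Python. Everything here is illustrative, not our actual code: `handle_request`, `run_query`, and `HARD_TIMEOUT` are hypothetical names, and a real server would use the timeout machinery of its framework or database layer.

```python
# Hypothetical sketch: a hard timeout backstop around expensive work.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

HARD_TIMEOUT = 5.0  # seconds; low so one slow query can't tie up the system

executor = ThreadPoolExecutor(max_workers=4)

def run_query():
    # Stand-in for a potentially expensive DB query.
    return sum(range(1000))

def handle_request():
    future = executor.submit(run_query)
    try:
        return future.result(timeout=HARD_TIMEOUT)
    except TimeoutError:
        # Fail fast: the user gets a quick error signal instead of a hung
        # page. Note that with threads the underlying work keeps running;
        # a real backstop also needs server-side query cancellation.
        return None

print(handle_request())
```

The point of the sketch is the policy, not the mechanism: the timeout is deliberately low, and a timeout is treated as a normal, fast failure rather than something to wait out.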
Most, if not all, of our timeouts are caused because we *have a successful product*. Starting with the simplest thing and iterating is a fantastic principle, but the corollary is that we have to iterate: the hard timeout limit is our backstop for when something we're iterating on is made slower by users. So every time we make the system better we can simply expect more users and more pressure on the DB, more mails to send, more bugs to index, etc.

Thus, it seems to me that we should only consider a timeout problem fixed when we *have enough headroom* to tolerate growth for a reasonable time. Concretely, something that is timing out on 5% of calls and routinely taking (say) 15 seconds on the server today is something we should bring down to completing in (say) 3 seconds on the 15-second dataset before switching context. Otherwise we'll be coming back to it almost immediately, and progress will feel slow.

In bzr, once we got status really seriously under control, we stopped getting hammered with performance issues in that part of the code base, stopped reanalysing the same problem, and were able to forget about it for a couple of years because it was under control.

There are some infrastructural issues that will make this hard: I'm going to unblock anyone who wants to work on such things (like rabbit) in any way I can; as time permits I'll be working directly on those enablers too.

Cheers,
Rob

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp

