Re: Increase in mozilla-inbound bustage due to people not using Try
On 16/08/2012 4:10 PM, Mike Hommey wrote:
... Something I noticed recently is that we spend more than 5 minutes (!) during windows clobber builds to do the clobber (rm -rf). All try builds are clobbers.
IME, rd /s/q is usually much faster than rm -rf - using cmd /c rd /s/q obj-xxx might be worth investigating...
Mark
___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
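One way to check the rd vs. rm -rf claim on a given machine is a small timing wrapper. The following is only a sketch, assuming Python is available on the builder; the `clobber` name is made up for illustration, and the non-Windows branch uses `shutil.rmtree` purely as a stand-in so the function runs anywhere.

```python
import shutil
import subprocess
import sys
import time


def clobber(path):
    """Delete a directory tree and return the elapsed time in seconds."""
    start = time.perf_counter()
    if sys.platform == "win32":
        # rd is a cmd.exe builtin, so it has to be run through cmd /c.
        subprocess.run(["cmd", "/c", "rd", "/s", "/q", path], check=True)
    else:
        # Stand-in for platforms without cmd.exe.
        shutil.rmtree(path)
    return time.perf_counter() - start
```

Running this against a copy of an objdir, and comparing against the time rm -rf takes on the same copy, would give a number for a specific machine rather than an impression.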
Re: Increase in mozilla-inbound bustage due to people not using Try
On 8/15/12 11:10 PM, Mike Hommey wrote:
Something I noticed recently is that we spend more than 5 minutes (!) during windows clobber builds to do the clobber (rm -rf). All try builds are clobbers. A lot of time is wasted on mercurial cloning, too. What is interesting is that the corresponding times are in the order of seconds on linux and osx. We're just hitting the fact that windows sucks at I/O.
That is an over-generalization. I/O on Windows itself does not suck. I/O on Windows sucks when you are using the POSIX APIs instead of the Win32 ones. And, I'm willing to bet that rm (along with most of the GNU tools in our MozillaBuild environment) is using the POSIX APIs or is at least not using the most optimal Win32 API for the desired task.
A few months back, John Ford wrote a standalone win32 executable that used the proper APIs to delete an entire directory. I think he said that it deleted the object directory 5-10x faster or something. No clue what happened with that.
Re: Increase in mozilla-inbound bustage due to people not using Try
On 08/16/2012 12:03 AM, Nicholas Nethercote wrote:
On Wed, Aug 15, 2012 at 11:41 PM, Mike Hommey m...@glandium.org wrote:
A few months back, John Ford wrote a standalone win32 executable that used the proper APIs to delete an entire directory. I think he said that it deleted the object directory 5-10x faster or something. No clue what happened with that.
I wish this were true, but I seriously doubt it. I can buy that it's faster, but not 5-10 times so.
http://blog.johnford.org/writting-a-native-rm-program-for-windows/ says that it deleted a mozilla-central clone 3x faster.
And renaming the directory (then deleting it in parallel with the build, or later) ought to be some power of ten faster than that, at least from the build-time perspective. At least if you don't do anything expensive like our nsIFile NTFS renaming goopage (that traverses the directory tree making sure NTFS ACLs are preserved for all files), which most versions of 'rm' aren't going to do, I'd guess.
Jason
Re: Increase in mozilla-inbound bustage due to people not using Try
On 08/16/12 02:10 AM, Mike Hommey wrote:
But maybe we can work around this. At least for rm -rf, instead of rm -rf'ing before the build, we could move the objdir away so that a fresh new one is created. The older one could be removed much later.
I don't think this would be any more than a one-time win until the disk fills up. At the start of each job we ensure there's enough space to do the current job. By moving the objdir away we'd avoid doing any cleanup until we need more space than is available. After that, each job would still end up cleaning up roughly one objdir to free enough space to run.
A common technique for dealing with this on Windows is to have a dedicated partition for the builds, and to format it on start-up rather than delete things, because a quick format is much quicker than deleting. I don't think it's something RelEng could implement quickly, but it might be worthwhile looking at in the longer term.
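The "ensure there's enough space at the start of each job" step described above might look roughly like the following. This is not RelEng's actual purge code; `purge_until_free`, its parameters, and the injected `get_free` callable are all hypothetical names chosen for the sketch.

```python
import os
import shutil


def purge_until_free(stale_dirs, required_bytes, get_free):
    """Delete the oldest stale directories until enough space is free.

    stale_dirs: paths of moved-aside objdirs (hypothetical bookkeeping).
    get_free: callable returning the currently free bytes on the build disk.
    Returns the list of directories that were removed, oldest first.
    """
    removed = []
    for path in sorted(stale_dirs, key=os.path.getmtime):
        if get_free() >= required_bytes:
            break  # enough room for the current job; leave the rest alone
        shutil.rmtree(path, ignore_errors=True)
        removed.append(path)
    return removed
```

In practice `get_free` could be something like `lambda: shutil.disk_usage(build_root).free`, with `build_root` being whatever directory the builds live in. The sketch also shows the point made above: once the disk is full, every job pays for roughly one deletion anyway, just at a different moment.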
Re: Increase in mozilla-inbound bustage due to people not using Try
On Thu, Aug 16, 2012 at 4:18 PM, Ben Hearsum bhear...@mozilla.com wrote:
I don't think this would be any more than a one-time win until the disk fills up. At the start of each job we ensure there's enough space to do the current job. By moving the objdir away we'd avoid doing any cleanup until we need more space than is available. After that, each job would still end up cleaning up roughly one objdir to free enough space to run.
Why can't you move it, then spawn a background thread to remove it at minimum priority? IIUC, Vista and later support I/O prioritization, and the lowest priority will throttle down to two I/Os a second if other I/O is happening. Or are build slaves already I/O-saturated?
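A rough sketch of the "move it, then remove it in the background" idea, in Python. The function name and the `.trash-` naming scheme are invented for illustration, and the sketch does not lower the thread's I/O priority; on Windows that would need additional Win32 calls, which are not shown here.

```python
import os
import shutil
import threading
import uuid


def clobber_in_background(objdir):
    """Move objdir aside, recreate it empty, and delete the old tree in a thread."""
    parent = os.path.dirname(os.path.abspath(objdir))
    # Hypothetical naming scheme for the moved-aside tree.
    trash = os.path.join(parent, ".trash-" + uuid.uuid4().hex)
    # A rename on the same volume is cheap compared to a recursive delete.
    os.rename(objdir, trash)
    # Fresh, empty objdir that the build can start using right away.
    os.makedirs(objdir)
    worker = threading.Thread(
        target=shutil.rmtree, args=(trash,), kwargs={"ignore_errors": True}
    )
    worker.start()
    return worker, trash
```

The caller would keep the returned thread around and `join()` it before the job ends, so that a stale tree is not left behind if the process exits early.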
Re: Increase in mozilla-inbound bustage due to people not using Try
On Thu, Aug 16, 2012 at 09:18:11AM -0400, Ben Hearsum wrote:
On 08/16/12 02:10 AM, Mike Hommey wrote:
But maybe we can work around this. At least for rm -rf, instead of rm -rf'ing before the build, we could move the objdir away so that a fresh new one is created. The older one could be removed much later.
I don't think this would be any more than a one-time win until the disk fills up. At the start of each job we ensure there's enough space to do the current job. By moving the objdir away we'd avoid doing any cleanup until we need more space than is available. After that, each job would still end up cleaning up roughly one objdir to free enough space to run.
If the cleanup happened at the end of the build, rather than at the beginning, tests could start earlier.
Mike
Re: Increase in mozilla-inbound bustage due to people not using Try
On 08/16/12 09:23 AM, Aryeh Gregor wrote:
On Thu, Aug 16, 2012 at 4:18 PM, Ben Hearsum bhear...@mozilla.com wrote:
I don't think this would be any more than a one-time win until the disk fills up. At the start of each job we ensure there's enough space to do the current job. By moving the objdir away we'd avoid doing any cleanup until we need more space than is available. After that, each job would still end up cleaning up roughly one objdir to free enough space to run.
Why can't you move it, then spawn a background thread to remove it at minimum priority? IIUC, Vista and later support I/O prioritization, and the lowest priority will throttle down to two I/Os a second if other I/O is happening. Or are build slaves already I/O-saturated?
I hadn't considered using a background thread to remove it. During pulling/update we're I/O-saturated; I'm not sure about during compile. Implementing this would be very tricky though: the way the build works is by executing commands serially, so I'm not sure how we'd do this in parallel with compilation. There's probably a way, but we'd have to be reasonably sure it's useful to do before diving deeper, I think.
Re: Increase in mozilla-inbound bustage due to people not using Try
Gregory Szorc wrote:
On 8/15/12 11:10 PM, Mike Hommey wrote:
What is interesting is that the corresponding times are in the order of seconds on linux and osx. We're just hitting the fact that windows sucks at I/O.
That is an over-generalization. I/O on Windows itself does not suck. I/O on Windows sucks when you are using the POSIX APIs instead of the Win32 ones.
From all I have heard so far, the truth is in the middle of your and Mike's positions. I/O on Windows sucks, but it sucks even more when you are using POSIX APIs on top of it. An interesting data point is that the Wine team found that tests involving file/disk I/O run significantly slower on native Windows than on Wine-on-Linux on the same hardware. This implies that Windows I/O really sucks already by itself (and I know from my own experience how painful it is even with native Windows applications to delete larger trees, even more so when they are on VMs, which we have eliminated from our build pools nowadays, though). Emulating POSIX upon that already slow I/O makes it even worse, though.
Robert Kaiser
Re: Increase in mozilla-inbound bustage due to people not using Try
On 08/15/2012 07:08 PM, Gregory Szorc wrote:
When I was working on this project last year, I designed a build charts view to help visualize which parts were taking the longest (you can see implicit dependencies between build/test tasks by seeing when certain jobs run), which proved very helpful to determine which areas we needed to optimize: http://brasstacks.mozilla.com/gofaster/#/buildcharts
Very nice. If you are accepting feature requests, I think the most helpful would be checkboxes to filter hardware platforms. It's kind of hard sorting through everything when all the platforms are mixed together.
We have a bugzilla component for filing these sorts of things (though note that AFAIK no one's actively working on the dashboard atm): https://bugzilla.mozilla.org/enter_bug.cgi?component=GoFaster&product=Testing
I do agree that more filtering options would be useful. I think the first thing to do would be to confirm the data in these charts is valid though.
I would also like to see hardware utilization in this chart somehow. If a build step is consuming all local hardware resources (mainly CPU and I/O), that is a completely different optimization strategy from one where we are not fully utilizing local capacity or are waiting on external resources, such as those on a network.
I'm not sure if this works at all anymore, but it used to be that you could click on a particular build to get the breakdown of the amount of time spent on any particular step. We could certainly do a similar thing with hardware utilization -- it's just a matter of getting the information available somewhere we can access it (we used elastic search for the build steps).
Will
Re: Increase in mozilla-inbound bustage due to people not using Try
On 08/16/2012 06:23 AM, Aryeh Gregor wrote:
On Thu, Aug 16, 2012 at 4:18 PM, Ben Hearsum bhear...@mozilla.com wrote:
I don't think this would be any more than a one-time win until the disk fills up. At the start of each job we ensure there's enough space to do the current job. By moving the objdir away we'd avoid doing any cleanup until we need more space than is available. After that, each job would still end up cleaning up roughly one objdir to free enough space to run.
Why can't you move it, then spawn a background thread to remove it at minimum priority? IIUC, Vista and later support I/O prioritization,
Brian Bondy just added I/O prioritization to our code that removes corrupt HTTP caches, in bug 773518, in case that code helps.
Jason
Re: Increase in mozilla-inbound bustage due to people not using Try
On Tue, Aug 14, 2012 at 10:47 PM, Gregory Szorc g...@mozilla.com wrote:
Is there a tracking bug for areas where we could gain efficiency? We all know the build phase is full of clownshoes. But, I believe we also do silly things like execute some tests serially, only taking advantage of 1/N CPU cores in the process. This is just wasting resources. See [1] for a concrete example.
Don't we execute *all* tests serially? Many of our tests require focus, so you can't do two runs in parallel on the same desktop.
In theory we could specially flag the ones that don't need focus, and make sure to always run them without focus -- that would probably be most of the tests. Then those could be run in parallel. They could also be run in the background on developer machines, which would be nice. This would require a bunch of developer work.
Alternatively, the test machines could be set up with multiple desktops with independent focus. At least Windows and Linux should support this, AFAIK -- it's necessary if you want to allow a thin-client setup in corporate environments. This would require a bunch of IT work.
(I don't think xvfb-run is a good solution, because it's not exactly the same as a normal X session. In my experience, a small fraction of tests unexpectedly fail using xvfb-run. By the same token, I'm guessing some will incorrectly pass. It doesn't seem like a good idea to use a different environment for test machines than users will use, if we can avoid it.)
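The flagging idea above could be sketched like this in Python. Nothing here reflects how the existing harnesses are structured; the `needs_focus` flag, `run_suite`, and the `run_one` callable are hypothetical, and in a real harness `run_one` would presumably launch a separate browser process per test.

```python
from concurrent.futures import ThreadPoolExecutor


def run_suite(tests, run_one, max_workers=4):
    """Run focus-free tests in parallel, then focus-dependent tests one at a time.

    tests: iterable of (name, needs_focus) pairs; needs_focus is a hypothetical flag.
    run_one: callable taking a test name and returning its result.
    """
    tests = list(tests)
    parallel = [name for name, needs_focus in tests if not needs_focus]
    serial = [name for name, needs_focus in tests if needs_focus]

    results = {}
    # Tests that never take focus can share the desktop safely.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, result in zip(parallel, pool.map(run_one, parallel)):
            results[name] = result
    # Tests that need focus still run strictly one after another.
    for name in serial:
        results[name] = run_one(name)
    return results
```

The hard part is not this scheduling but establishing, test by test, which ones really do not need focus.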
Re: Increase in mozilla-inbound bustage due to people not using Try
On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar wrote:
Is there a plan to mitigate the coalescing on m-i? It seems like that is a big part of the problem.
Reducing the amount of coalescing permitted would just mean we end up with a backlog of pending tests on the repo tip - which would result in tree closures regardless. So other than bug 690672 making sheriffs' lives easier, we just need more machines in the test pool - since it's simply a case of demand exceeding capacity.
The situation is made worse now that we're adding new platforms (OS X 10.7, B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8 metro) faster than we're EOLing them - and we're pushing more changes per day than ever before [1]. From what I understand, Apple's aggressive hardware cycle is also making it difficult to expand the test pool [2].
On a more positive note, at the end of this cycle we should be able to turn off Android XUL on trunk trees [3], which will at least help improve the wait on that platform :-)
[1] http://oduinn.com/blog/2012/08/04/infrastructure-load-for-july-2012/
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=772458#c3
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=777037#c4
Re: Increase in mozilla-inbound bustage due to people not using Try
Is there a plan to mitigate the coalescing on m-i? It seems like that is a big part of the problem.
it's simply a case of demand exceeding capacity.
Understood. But I think my question still stands: Is there a plan to address the fact that we do not have capacity to run all the tests we need to run? It sounds like [2] the answer is no, for at least the medium term, because releng is busy deploying Mac 10.8 and Windows 8. I do not think we can afford to wait on these large projects before deploying more hardware.
I'd like to see data, but it seems to me that we've hugely regressed tryserver turnaround times in the past few months. Unless we're able to add more machines to the pool, there is no end in sight.
It seems that we need a concrete promise from releng / IT to keep end-to-end tryserver times (push to final test finished) below X hours at the 90th percentile, and to coalesce fewer than Y% of pushes to m-i/m-c (measured during the busiest Z hours of each day). Then there's no need to guess about whether the pool is unacceptably backed up, or whether fixing the pile-up should be a priority.
-Justin
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=772458#c3
On Tue, Aug 14, 2012 at 3:14 PM, Ed Morley bmo.takethis...@edmorley.co.uk wrote:
On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar wrote:
Is there a plan to mitigate the coalescing on m-i? It seems like that is a big part of the problem.
Reducing the amount of coalescing permitted would just mean we end up with a backlog of pending tests on the repo tip - which would result in tree closures regardless. So other than bug 690672 making sheriffs' lives easier, we just need more machines in the test pool - since it's simply a case of demand exceeding capacity.
The situation is made worse now that we're adding new platforms (OS X 10.7, B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8 metro) faster than we're EOLing them - and we're pushing more changes per day than ever before [1]. From what I understand, Apple's aggressive hardware cycle is also making it difficult to expand the test pool [2].
On a more positive note, at the end of this cycle we should be able to turn off Android XUL on trunk trees [3], which will at least help improve the wait on that platform :-)
[1] http://oduinn.com/blog/2012/08/04/infrastructure-load-for-july-2012/
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=772458#c3
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=777037#c4
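To make the proposed targets concrete, checking them could be as simple as the sketch below. It assumes someone can export end-to-end turnaround times and coalescing counts for the busiest hours; the function names, the nearest-rank percentile definition, and the sample numbers in the usage note are all illustrative, not real data.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of n, in integers
    return ordered[rank - 1]


def meets_targets(turnaround_hours, pushes, coalesced, max_hours, max_coalesced_pct):
    """Check the two proposed targets.

    max_hours plays the role of X and max_coalesced_pct the role of Y above.
    """
    p90 = percentile(turnaround_hours, 90)
    coalesced_pct = 100.0 * coalesced / pushes
    return p90 < max_hours and coalesced_pct < max_coalesced_pct
```

For example, with made-up turnaround times of `[2, 3, 3, 4, 5, 6, 7, 8, 9, 12]` hours, the 90th percentile under this definition is 9 hours, so a target of X = 10 would be met and a target of X = 8 would not.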
Re: Increase in mozilla-inbound bustage due to people not using Try
But, I believe we also do silly things like execute some tests serially, only taking advantage of 1/N CPU cores in the process. This is just wasting resources. See [1] for a concrete example.
It would be very cool if we could run mochitests inside xvfb on Linux (and maybe Mac?). But it is another point of failure -- for example, on my machine, xvfb-run causes mochitest to randomly segfault. (I think it's Firefox, not xvfb, that's dying, although I'm not positive.)
Of course, investigating and implementing this would require resources, which would require us to acknowledge that we're failing by some metric, which would require us to agree on specific goals, which would require us first to agree that we should have such goals in the first place! :)
On Tue, Aug 14, 2012 at 3:47 PM, Gregory Szorc g...@mozilla.com wrote:
On 8/14/12 12:14 PM, Ed Morley wrote:
On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar wrote:
Is there a plan to mitigate the coalescing on m-i? It seems like that is a big part of the problem.
Reducing the amount of coalescing permitted would just mean we end up with a backlog of pending tests on the repo tip - which would result in tree closures regardless. So other than bug 690672 making sheriffs' lives easier, we just need more machines in the test pool - since it's simply a case of demand exceeding capacity.
The situation is made worse now that we're adding new platforms (OS X 10.7, B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8 metro) faster than we're EOLing them - and we're pushing more changes per day than ever before [1]. From what I understand, Apple's aggressive hardware cycle is also making it difficult to expand the test pool [2].
Is there a tracking bug for areas where we could gain efficiency? We all know the build phase is full of clownshoes. But, I believe we also do silly things like execute some tests serially, only taking advantage of 1/N CPU cores in the process. This is just wasting resources. See [1] for a concrete example.
Do we have data on the actual hardware load for the test runners? If we are throwing away significant CPU cycles, etc, we could probably alleviate a lot of the problems just with software changes.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=686240
Re: Increase in mozilla-inbound bustage due to people not using Try
Justin Lebar wrote:
In addition, please bear in mind that landing bustage on trunk trees actually makes the Try wait times worse (since the trunk backouts/retriggers take test job priority over Try) - leading to others not bothering to use Try either, and so the situation cascades.
I thought tryserver used a different pool of machines isolated from all the other trees, because we treated the tryserver machines as pwned. Is that not or no longer the case?
Yes and no: the build machines are completely different; the test machines, not so much. The testers are shared. Testers have a completely different set of passwords, as well as other mitigations. The idea here is that our test machines have no permissions to upload anyway, nor any way to leak/get sekrets. And all machines are in a restricted network environment overall.
So load on inbound affects *test* load on try, yes.
-- ~Justin Wood (Callek)