Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-17 Thread Mark Hammond

On 16/08/2012 4:10 PM, Mike Hommey wrote:
...
 Something I noticed recently is that we spend more than 5 minutes (!)

during windows clobber builds to do the clobber (rm -rf). All try builds
are clobbers.


IME, "rd /s /q" is usually much faster than "rm -rf" - using "cmd /c rd 
/s /q obj-xxx" might be worth investigating...
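As a rough sketch of the idea (the helper names are mine, not anything that exists in the tree), a build script could pick the native deleter per platform, shelling out to cmd's built-in rd on Windows and rm elsewhere:

```python
import os
import subprocess

def clobber_command(objdir):
    """Return the platform-appropriate command to delete objdir.

    On Windows, cmd's built-in `rd /s /q` avoids the per-file POSIX
    emulation overhead of MSYS `rm -rf`; elsewhere `rm -rf` is fine.
    """
    if os.name == "nt":
        return ["cmd", "/c", "rd", "/s", "/q", objdir]
    return ["rm", "-rf", objdir]

def clobber(objdir):
    # check=False: the deleter returns nonzero if objdir is already gone,
    # which is not an error for a clobber step.
    subprocess.run(clobber_command(objdir), check=False)
```

Whether this actually wins the 5 minutes back would need measuring on the build slaves themselves.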


Mark

___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread Gregory Szorc

On 8/15/12 11:10 PM, Mike Hommey wrote:

Something I noticed recently is that we spend more than 5 minutes (!)
during windows clobber builds to do the clobber (rm -rf). All try builds
are clobbers. A lot of time is wasted on mercurial cloning, too.

What is interesting is that the corresponding times are in the order of
seconds on linux and osx. We're just hitting the fact that windows sucks
at I/O.


That is an over-generalization. I/O on Windows itself does not suck. I/O 
on Windows sucks when you are using the POSIX APIs instead of the Win32 
ones.


And I'm willing to bet that rm (along with most of the GNU tools in our 
MozillaBuild environment) is using the POSIX APIs, or at least is not 
using the optimal Win32 API for the desired task.


A few months back, John Ford wrote a standalone win32 executable that 
used the proper APIs to delete an entire directory. I think he said that 
it deleted the object directory 5-10x faster or something. No clue what 
happened with that.




Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread Jason Duell

On 08/16/2012 12:03 AM, Nicholas Nethercote wrote:

On Wed, Aug 15, 2012 at 11:41 PM, Mike Hommey m...@glandium.org wrote:

A few months back, John Ford wrote a standalone win32 executable
that used the proper APIs to delete an entire directory. I think he
said that it deleted the object directory 5-10x faster or something.
No clue what happened with that.

I wish this were true, but I seriously doubt it. I can buy that it's
faster, but not 5-10 times so.

http://blog.johnford.org/writting-a-native-rm-program-for-windows/
says that it deleted a mozilla-central clone 3x faster.


And renaming the directory (then deleting it in parallel with the build, 
or later) ought to be some power of ten faster than that, at least from 
the build-time perspective. At least if you don't do anything expensive 
like our nsIFile NTFS renaming goopage (which traverses the directory 
tree making sure NTFS ACLs are preserved for all files) - which most 
versions of 'rm' aren't going to do, I'd guess.
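A minimal sketch of the rename-then-delete-later trick (function and naming scheme are illustrative, not code anyone has written for the build system):

```python
import os
import shutil
import threading
import uuid

def clobber_async(objdir):
    """Rename objdir aside (a single cheap metadata operation on the same
    filesystem), then reclaim the space on a background thread so the
    build can start immediately instead of waiting on the delete."""
    if not os.path.exists(objdir):
        return None
    doomed = "%s.doomed-%s" % (objdir, uuid.uuid4().hex)
    os.rename(objdir, doomed)
    t = threading.Thread(target=shutil.rmtree, args=(doomed,),
                        kwargs={"ignore_errors": True})
    t.start()
    return t
```

Note shutil.rmtree here still pays the per-file deletion cost eventually; the win is purely that the build no longer waits for it.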


Jason


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread Ben Hearsum
On 08/16/12 02:10 AM, Mike Hommey wrote:
 But maybe we can work around this. At least for rm -rf, instead of
 rm -rf'ing before the build, we could move the objdir away so that a
 fresh new one is created. The older one could be removed much later.

I don't think this would be any more than a one-time win until the disk
fills up. At the start of each job we ensure there's enough space to do
the current job. By moving the objdir away we'd avoid doing any clean
up until we need more space than is available. After that, each job
would still end up cleaning up roughly one objdir to clean up enough
space to run.

A common technique for dealing with this on Windows is to have a
dedicated partition for the builds, and to format it on start-up rather
than delete things, because a quick format is much faster than
deleting. I don't think it's something RelEng could implement quickly,
but it might be worth looking at in the longer term.


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread Aryeh Gregor
On Thu, Aug 16, 2012 at 4:18 PM, Ben Hearsum bhear...@mozilla.com wrote:
 I don't think this would be any more than a one-time win until the disk
 fills up. At the start of each job we ensure there's enough space to do
 the current job. By moving the objdir away we'd avoid doing any clean
 up until we need more space than is available. After that, each job
 would still end up cleaning up roughly one objdir to clean up enough
 space to run.

Why can't you move it, then spawn a background thread to remove it at
minimum priority?  IIUC, Vista and later support I/O prioritization,
and the lowest priority will throttle down to two I/Os a second if
other I/O is happening.  Or are build slaves already I/O-saturated?


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread Mike Hommey
On Thu, Aug 16, 2012 at 09:18:11AM -0400, Ben Hearsum wrote:
 On 08/16/12 02:10 AM, Mike Hommey wrote:
  But maybe we can work around this. At least for rm -rf, instead of
  rm -rf'ing before the build, we could move the objdir away so that a
  fresh new one is created. The older one could be removed much later.
 
 I don't think this would be any more than a one-time win until the disk
 fills up. At the start of each job we ensure there's enough space to do
  the current job. By moving the objdir away we'd avoid doing any clean
 up until we need more space than is available. After that, each job
 would still end up cleaning up roughly one objdir to clean up enough
 space to run.

If the cleanup happened at the end of the build, rather than at the
beginning, tests could start earlier.

Mike


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread Ben Hearsum
On 08/16/12 09:23 AM, Aryeh Gregor wrote:
 On Thu, Aug 16, 2012 at 4:18 PM, Ben Hearsum bhear...@mozilla.com wrote:
 I don't think this would be any more than a one-time win until the disk
 fills up. At the start of each job we ensure there's enough space to do
  the current job. By moving the objdir away we'd avoid doing any clean
 up until we need more space than is available. After that, each job
 would still end up cleaning up roughly one objdir to clean up enough
 space to run.
 
 Why can't you move it, then spawn a background thread to remove it at
 minimum priority?  IIUC, Vista and later support I/O prioritization,
 and the lowest priority will throttle down to two I/Os a second if
 other I/O is happening.  Or are build slaves already I/O-saturated?
 

I hadn't considered using a background thread to remove it. During
pulling/updating we're I/O-saturated; I'm not sure about during compile.
Implementing this would be very tricky, though - the way the build works
is by executing commands serially, so I'm not sure how we'd do this in
parallel with compilation. There's probably a way, but we'd have to be
reasonably sure it's useful before diving deeper, I think.


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread Robert Kaiser

Gregory Szorc schrieb:

On 8/15/12 11:10 PM, Mike Hommey wrote:

What is interesting is that the corresponding times are in the order of
seconds on linux and osx. We're just hitting the fact that windows sucks
at I/O.


That is an over-generalization. I/O on Windows itself does not suck. I/O
on Windows sucks when you are using the POSIX APIs instead of the Win32
ones.


From all I've heard so far, the truth is somewhere between your position 
and Mike's. I/O on Windows sucks, but it sucks even more when you are 
using POSIX APIs on top of it.


An interesting data point is that the Wine team found that tests 
involving file/disk I/O run significantly slower on native Windows 
than on Wine-on-Linux on the same hardware. This implies that Windows 
I/O really is slow by itself (and I know from my own experience 
how painful it is even with native Windows applications to delete larger 
trees, even more so when they are VMs, which we have since eliminated 
from our build pools). Emulating POSIX on top of that already slow 
I/O makes it even worse.


Robert Kaiser



Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread William Lachance

On 08/15/2012 07:08 PM, Gregory Szorc wrote:


When I was working on this project last year, I designed a build charts
view to help visualize which parts were taking the longest (you can see
implicit dependencies between build/test tasks by seeing when certain
jobs run), which proved very helpful to determine which areas we needed
to optimize:

http://brasstacks.mozilla.com/gofaster/#/buildcharts


Very nice. If you are accepting feature requests, I think the most
helpful would be checkboxes to filter hardware platforms. It's kind of
hard sorting through everything when all the platforms are mixed together.


We have a bugzilla component for filing these sorts of things (though 
note that AFAIK no one's actively working on the dashboard atm):


https://bugzilla.mozilla.org/enter_bug.cgi?component=GoFaster&product=Testing

I do agree that more filtering options would be useful. I think the 
first thing to do would be to confirm the data in these charts is valid 
though.



I would also like to see hardware utilization in this chart somehow. If
a build step is consuming all local hardware resources (mainly CPU and
I/O), that is a completely different optimization strategy from one
where we are not fully utilizing local capacity or are waiting on
external resources, such as those on a network.


I'm not sure if this works at all anymore, but it used to be that you 
could click on a particular build to get the breakdown of the amount of 
time spent on any particular step. We could certainly do a similar thing 
with hardware utilization -- just a matter of getting the information 
available somewhere we can access it (we used Elasticsearch for the 
build steps).


Will


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-16 Thread Jason Duell

On 08/16/2012 06:23 AM, Aryeh Gregor wrote:

On Thu, Aug 16, 2012 at 4:18 PM, Ben Hearsum bhear...@mozilla.com wrote:

I don't think this would be any more than a one-time win until the disk
fills up. At the start of each job we ensure there's enough space to do
 the current job. By moving the objdir away we'd avoid doing any clean
up until we need more space than is available. After that, each job
would still end up cleaning up roughly one objdir to clean up enough
space to run.

Why can't you move it, then spawn a background thread to remove it at
minimum priority?  IIUC, Vista and later support I/O prioritization,


Brian Bondy just added I/O prioritization to our code that removes 
corrupt HTTP caches, in bug 773518, in case that code helps.


Jason




Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-15 Thread Aryeh Gregor
On Tue, Aug 14, 2012 at 10:47 PM, Gregory Szorc g...@mozilla.com wrote:
 Is there a tracking bug for areas where we could gain efficiency? We all
 know the build phase is full of clownshoes. But, I believe we also do silly
 things like execute some tests serially, only taking advantage of 1/N CPU
 cores in the process. This is just wasting resources. See [1] for a concrete
 example.

Don't we execute *all* tests serially?  Many of our tests require
focus, so you can't do two runs in parallel on the same desktop.  In
theory we could specially flag the ones that don't need focus, and
make sure to always run them without focus -- that would probably be
most of the tests.  Then those could be run in parallel.  They could
also be run in the background on developer machines, which would be
nice.  This would require a bunch of developer work.
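The split described above could look something like this sketch (the chunk runner is a placeholder, and the focus flag is hypothetical - nothing like it exists in the harnesses today):

```python
import concurrent.futures
import os

def run_chunk(chunk):
    # Placeholder: a real runner would launch a browser against this
    # chunk of tests. Returns (chunk, passed).
    return (chunk, True)

def run_tests(chunks, needs_focus):
    """Run focus-free chunks in parallel across cores; chunks flagged as
    needing window focus still run one at a time afterwards."""
    results = []
    parallel = [c for c in chunks if c not in needs_focus]
    with concurrent.futures.ThreadPoolExecutor(
            max_workers=os.cpu_count() or 1) as pool:
        results.extend(pool.map(run_chunk, parallel))
    for c in chunks:
        if c in needs_focus:  # focus-requiring tests can't overlap
            results.append(run_chunk(c))
    return results
```

The real work, of course, is in auditing which tests genuinely don't need focus.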

Alternatively, the test machines could be set up with multiple
desktops with independent focus.  At least Windows and Linux should
support this, AFAIK -- it's necessary if you want to allow a
thin-client setup in corporate environments.  This would require a
bunch of IT work.

(I don't think xvfb-run is a good solution, because it's not exactly
the same as a normal X session.  In my experience, a small fraction of
tests unexpectedly fail using xvfb-run.  By the same token, I'm
guessing some will incorrectly pass.  It doesn't seem like a good idea
to use a different environment for test machines than users will use,
if we can avoid it.)
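(For reference, the xvfb-run approach being discussed amounts to wrapping the test command, roughly like this - the wrapped command is illustrative:)

```python
import subprocess

def xvfb_wrap(cmd):
    """Prefix a test command with xvfb-run; -a picks a free display
    number so parallel invocations don't collide."""
    return ["xvfb-run", "-a"] + list(cmd)

def run_headless(cmd):
    # Only useful on a machine with xvfb-run installed (Linux).
    return subprocess.call(xvfb_wrap(cmd))
```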


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-14 Thread Ed Morley
On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar  wrote:
 Is there a plan to mitigate the coalescing on m-i?  It seems like that
 is a big part of the problem.

Reducing the amount of coalescing permitted would just mean we end up with a 
backlog of pending tests on the repo tip - which would result in tree closures 
regardless. So other than bug 690672 making sheriffs' lives easier, we just 
need more machines in the test pool - since it's simply a case of demand 
exceeding capacity. 

The situation is made worse now that we're adding new platforms (OS X 10.7, B2G 
GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8 metro) faster 
than we're EOLing them - and we're pushing more changes per day than ever 
before [1]. From what I understand, Apple's aggressive hardware cycle is also 
making it difficult to expand the test pool [2]. 

On a more positive note, at the end of this cycle we should be able to turn off 
Android XUL on trunk trees [3], which will at least help improve the wait on 
that platform :-)


[1] http://oduinn.com/blog/2012/08/04/infrastructure-load-for-july-2012/
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=772458#c3
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=777037#c4


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-14 Thread Justin Lebar
 Is there a plan to mitigate the coalescing on m-i?  It seems like that
 is a big part of the problem.

 it's simply a case of demand exceeding capacity.

Understood.

But I think my question still stands: Is there a plan to address the
fact that we do not have capacity to run all the tests we need to run?

It sounds like [2] the answer is no, for at least the medium-term,
because releng is busy deploying Mac 10.8 and Windows 8.

I do not think we can afford to wait on these large projects before
deploying more hardware.  I'd like to see data, but it seems to me
that we've hugely regressed tryserver turnaround times in the past few
months.  Unless we're able to add more machines to the pool, there is
no end in sight.

It seems that we need a concrete promise from RelEng/IT to keep
end-to-end tryserver times (push to final test finished) below X hours
at the 90th percentile, and to coalesce fewer than Y% of pushes to
m-i/m-c (measured during the busiest Z hours of each day).  Then
there's no need to guess about whether the pool is unacceptably backed
up, or whether fixing the pile-up should be a priority.
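The proposed metric is easy to compute from per-push timing data; a sketch (the 4-hour budget is illustrative, not a number from this thread):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value with at least
    p percent of samples at or below it."""
    s = sorted(samples)
    rank = -(-len(s) * p // 100)  # ceil(n * p / 100) without math.ceil
    return s[max(rank - 1, 0)]

def within_sla(turnaround_hours, max_hours=4.0, p=90):
    """True if the p-th percentile push-to-last-test-finished time for
    these pushes is under the agreed budget."""
    return percentile(turnaround_hours, p) <= max_hours
```

With something like this run daily over the busiest hours, "is the pool unacceptably backed up?" stops being a matter of opinion.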

-Justin

[2] https://bugzilla.mozilla.org/show_bug.cgi?id=772458#c3

On Tue, Aug 14, 2012 at 3:14 PM, Ed Morley
bmo.takethis...@edmorley.co.uk wrote:
 On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar  wrote:
 Is there a plan to mitigate the coalescing on m-i?  It seems like that
 is a big part of the problem.

 Reducing the amount of coalescing permitted would just mean we end up with a 
 backlog of pending tests on the repo tip - which would result in tree 
 closures regardless. So other than bug 690672 making sheriffs' lives easier, 
 we just need more machines in the test pool - since it's simply a case of 
 demand exceeding capacity.

 The situation is made worse now that we're adding new platforms (OS X 10.7, 
 B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8 metro) 
 faster than we're EOLing them - and we're pushing more changes per day than 
 ever before [1]. From what I understand, Apple's aggressive hardware cycle is 
 also making it difficult to expand the test pool [2].

 On a more positive note, at the end of this cycle we should be able to turn 
 off Android XUL on trunk trees [3], which will at least help improve the wait 
 on that platform :-)


 [1] http://oduinn.com/blog/2012/08/04/infrastructure-load-for-july-2012/
 [2] https://bugzilla.mozilla.org/show_bug.cgi?id=772458#c3
 [3] https://bugzilla.mozilla.org/show_bug.cgi?id=777037#c4


Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-14 Thread Justin Lebar
 But, I believe we also do silly
 things like execute some tests serially, only taking advantage of 1/N CPU
 cores in the process. This is just wasting resources. See [1] for a concrete
 example.

It would be very cool if we could run mochitests inside xvfb on Linux
(and maybe Mac?).  But it is another point of failure -- for example,
on my machine, xvfb-run causes mochitest to randomly segfault.  (I
think it's Firefox, not xvfb, that's dying, although I'm not
positive.)

Of course, investigating and implementing this would require
resources, which would require us to acknowledge that we're failing by
some metric, which would require us to agree on specific goals, which
would require us first to agree that we should have such goals in the
first place!  :)

On Tue, Aug 14, 2012 at 3:47 PM, Gregory Szorc g...@mozilla.com wrote:
 On 8/14/12 12:14 PM, Ed Morley wrote:

 On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar  wrote:

 Is there a plan to mitigate the coalescing on m-i?  It seems like that
 is a big part of the problem.


 Reducing the amount of coalescing permitted would just mean we end up with
 a backlog of pending tests on the repo tip - which would result in tree
 closures regardless. So other than bug 690672 making sheriffs' lives easier,
 we just need more machines in the test pool - since it's simply a case of
 demand exceeding capacity.

 The situation is made worse now that we're adding new platforms (OS X
 10.7, B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8
 metro) faster than we're EOLing them - and we're pushing more changes per
 day than ever before [1]. From what I understand, Apple's aggressive
 hardware cycle is also making it difficult to expand the test pool [2].


 Is there a tracking bug for areas where we could gain efficiency? We all
 know the build phase is full of clownshoes. But, I believe we also do silly
 things like execute some tests serially, only taking advantage of 1/N CPU
 cores in the process. This is just wasting resources. See [1] for a concrete
 example.

 Do we have data on the actual hardware load for the test runners? If we are
 throwing away significant CPU cycles, etc, we could probably alleviate a lot
 of the problems just with software changes.

 [1] https://bugzilla.mozilla.org/show_bug.cgi?id=686240



Re: Increase in mozilla-inbound bustage due to people not using Try

2012-08-09 Thread Justin Wood (Callek)

Justin Lebar wrote:

In addition, please bear in mind that landing bustage on trunk trees actually
makes the Try wait times worse (since the trunk backouts/retriggers take test
job priority over Try) - leading to others not bothering to use Try either, and 
so
the situation cascades.


I thought tryserver used a different pool of machines isolated from
all the other trees, because we treated the tryserver machines as
pwned.  Is that not or no longer the case?



Yes and no: the build machines are completely separate, but the test 
machines -- not so much.


The testers, however, are shared. Testers have a completely different 
password set, as well as other mitigations. The idea here is that our 
test machines have no permissions to upload anyway, nor any way to 
leak/get sekrets. And all machines are in a restricted network 
environment overall anyway.


So load on inbound affects *test* load on try, yes.

--
~Justin Wood (Callek)

