Re: How to reduce the time m-i is closed?

2013-11-21 Thread Robert Kaiser

Philip Chee wrote:

> I thought there was a plan to pre-allocate some memory at startup
> for the minidump/crash reporter?


For one thing, I'm not sure how far that went; for another, we are
calling a Windows function to generate the minidump, and I'm not sure
whether we can reasonably reserve the memory it needs beforehand.
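
For context, the Windows function in question is MiniDumpWriteDump from
dbghelp.dll. A minimal sketch of how an in-process handler calls it -
the file name, dump type, and error handling are illustrative, not how
our crash reporter is actually wired up:

#include <windows.h>
#include <dbghelp.h>  // link against dbghelp.lib

// Called from an unhandled-exception filter. Note that writing the
// dump itself needs working memory - exactly what is scarce in an
// OOM crash.
bool WriteMinidump(EXCEPTION_POINTERS* exceptionInfo) {
  HANDLE file = CreateFileW(L"firefox.dmp", GENERIC_WRITE, 0, nullptr,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
  if (file == INVALID_HANDLE_VALUE)
    return false;

  MINIDUMP_EXCEPTION_INFORMATION mei;
  mei.ThreadId = GetCurrentThreadId();
  mei.ExceptionPointers = exceptionInfo;
  mei.ClientPointers = FALSE;

  BOOL ok = MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(),
                              file, MiniDumpNormal, &mei,
                              nullptr, nullptr);
  CloseHandle(file);
  return ok == TRUE;
}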


KaiRo


Re: How to reduce the time m-i is closed?

2013-11-21 Thread Benjamin Smedberg

On 11/21/2013 1:11 PM, Robert Kaiser wrote:

> Philip Chee wrote:
>
>> I thought there was a plan to pre-allocate some memory at startup
>> for the minidump/crash reporter?
>
> For one thing, I'm not sure how far that went; for another, we are
> calling a Windows function to generate the minidump, and I'm not sure
> whether we can reasonably reserve the memory it needs beforehand.

We did this in bug 837835. We currently reserve 12MB of address space
for the crash reporter. This is apparently either not enough or doesn't
work for many crashes; it doesn't appear to have made a noticeable
impact in converting empty-dump crashes into useful minidumps.
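
The mechanism, roughly: reserve address space at startup, then release
it at the top of the exception handler so the dump writer has room to
work. A simplified sketch of that approach - function names here are
illustrative, and the actual integration lives in Breakpad:

#include <windows.h>

// Reserved early, while address space is still plentiful. MEM_RESERVE
// claims virtual addresses without committing physical pages.
static void* gReservedSpace = nullptr;
static const SIZE_T kReservedSize = 12 * 1024 * 1024;  // 12MB

void ReserveCrashTimeMemory() {
  gReservedSpace = VirtualAlloc(nullptr, kReservedSize,
                                MEM_RESERVE, PAGE_NOACCESS);
}

// Call this first inside the exception handler: releasing the
// reservation frees address space for MiniDumpWriteDump to use.
void ReleaseCrashTimeMemory() {
  if (gReservedSpace) {
    VirtualFree(gReservedSpace, 0, MEM_RELEASE);
    gReservedSpace = nullptr;
  }
}

One plausible reading of the weak results: the released range only
helps if the dump writer's allocations can actually be satisfied from
it, and a process that is deeply fragmented or out of address space
elsewhere can still fail.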


--BDS



Re: How to reduce the time m-i is closed?

2013-11-20 Thread Robert Kaiser

Nicholas Nethercote wrote:

> It also assumes that we can back out stuff to fix
> the problem; we tried that to some extent with the first OOM closure
> -- it is the standard response to test failure, of course -- but it
> didn't work.


Yes, the OOM issues that caused this closure are probably just a
symptom of a larger problem.


We've seen a steady rise in OOM issues over quite some time now, most
visibly as an increase in crashes with empty dumps. I called attention
to that in bug 837835, but we couldn't track down a decent regression
range (we mostly know in which 6-week cycle we had regressions, and we
can make some assumptions to narrow things down a bit further on trunk,
but not nearly well enough to get to candidate checkins). Because of
that, this has been lingering without any real attempts to fix things,
and from what I saw in the data, things have even gotten worse recently
- and that's on the release channel, so whatever might have increased
the trouble on trunk around this closure comes on top of that.


In a lot of the cases we're seeing, there's apparently too little
memory available for Windows to even create a minidump, so we have very
little info about those issues - but we do have the additional
annotations we send along with the crash report, and those suggest that
in many cases we're running out of virtual memory space but not
necessarily out of physical memory. As I'm told, that can happen, for
example, with VM fragmentation as well as with bugs that map the same
physical page over and over into virtual memory. We're not sure if
that's all in our code or if system code or (graphics?) driver code
exposes issues to us there.
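
To illustrate the second failure mode named above - exhausting virtual
address space while physical memory stays available - here is a
contrived sketch of a mapping leak using the Win32 file-mapping APIs
(real bugs are of course not this obvious):

#include <windows.h>
#include <cstdio>

int main() {
  // One 1MB pagefile-backed section; the physical pages behind it
  // exist only once.
  HANDLE mapping = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                      PAGE_READWRITE, 0, 1 << 20,
                                      nullptr);
  if (!mapping)
    return 1;

  // Mapping the same section again and again consumes a fresh range
  // of virtual addresses each time, but no new physical memory. A
  // 32-bit process exhausts its ~2GB address space after a couple of
  // thousand iterations.
  int views = 0;
  while (MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0)) {
    ++views;  // a real bug would simply be a missing UnmapViewOfFile
  }

  MEMORYSTATUSEX status = {};
  status.dwLength = sizeof(status);
  GlobalMemoryStatusEx(&status);
  printf("views mapped: %d\n", views);
  printf("avail virtual: %llu MB, avail physical: %llu MB\n",
         status.ullAvailVirtual / (1024 * 1024),
         status.ullAvailPhys / (1024 * 1024));
  return 0;
}

Run as a 32-bit build, this hits the address-space wall while
GlobalMemoryStatusEx still reports plenty of physical memory free -
the same signature the crash annotations show.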


From what I know, bsmedberg and dmajor are looking into those issues
more closely, both because of the tree closure and because this has
been a lingering stability issue for months. I'm sure any help with
those efforts is appreciated, as these are tough issues, and there may
be multiple problems that each contribute a share to the overall
picture.


Making us more memory-efficient sounds like a worthwhile goal overall
anyhow (even though the bullet of running out of VM space can be dodged
by switching to Win64 and/or by e10s giving us multiple processes that
each have their own 32-bit virtual address space - though I'm not sure
those should or will be our primary solutions).


I think in other cases, where a clear cause of the tree-closing issues
is easy to identify, a backout-based process can work better, but with
these OOM issues there's no clear patch or patch set to point to. IMHO,
we should work on the overall cluster of OOM issues, though.


KaiRo


Re: How to reduce the time m-i is closed?

2013-11-20 Thread Philip Chee
On 21/11/2013 00:20, Robert Kaiser wrote:

> In a lot of the cases we're seeing, there's apparently too little
> memory available for Windows to even create a minidump, so we have very
> little info about those issues - but we do have the additional
> annotations we send along with the crash report, and those suggest that
> in many cases we're running out of virtual memory space but not
> necessarily out of physical memory. As I'm told, that can happen, for
> example, with VM fragmentation as well as with bugs that map the same
> physical page over and over into virtual memory. We're not sure if
> that's all in our code or if system code or (graphics?) driver code
> exposes issues to us there.

I thought there was a plan to pre-allocate some memory at startup
for the minidump/crash reporter?

> KaiRo

Phil
-- 
Philip Chee phi...@aleytys.pc.my, philip.c...@gmail.com
http://flashblock.mozdev.org/ http://xsidebar.mozdev.org
Guard us from the she-wolf and the wolf, and guard us from the thief,
oh Night, and so be good for us to pass.


Re: How to reduce the time m-i is closed?

2013-11-18 Thread Ehsan Akhgari

On 2013-11-18 7:17 AM, Ed Morley wrote:

> On 16/11/2013 15:17, smaug wrote:
>
>> the recent OOM cases have been really annoying. They have slowed down
>> development, even for those who haven't been dealing with the actual
>> issue(s).
>>
>> Could we handle this kind of case differently? Perhaps clone the bad
>> state of m-i to some other repository that we track with tbpl, back
>> out stuff from m-i to a state where we can run it, re-open it, and do
>> the fixes in the clone.
>
> Unfortunately, as Nick mentioned, this wasn't possible; otherwise we
> would just have performed a backout similar to those performed several
> times a day when something breaks the tree in a more 'normal' way.
>
> The closure was due to a seemingly chronic issue that had only been
> highlighted by recent landings (and no one particular landing, since
> the one backout that was performed still didn't make the failures
> disappear entirely). Even if we had just reverted the last week's
> worth of changes, it would not have fixed the root cause - which was
> that any single patch could potentially tip us over the edge into OOM
> again.


But we still reopened without the root cause being fixed, didn't we? 
What am I missing?


Cheers,
Ehsan



Re: How to reduce the time m-i is closed?

2013-11-17 Thread Nicholas Nethercote
On Sun, Nov 17, 2013 at 2:17 AM, smaug sm...@welho.com wrote:

> the recent OOM cases have been really annoying. They have slowed down
> development, even for those who haven't been dealing with the actual
> issue(s).
>
> Could we handle this kind of case differently? Perhaps clone the bad
> state of m-i to some other repository that we track with tbpl, back
> out stuff from m-i to a state where we can run it, re-open it, and do
> the fixes in the clone.
> And then, say in a week, merge the clone back to m-i. If the state is
> still bad (no one has stepped up to fix the issues), then keep m-i
> closed until the issues have been fixed.

Sounds complicated. It also assumes that we can back out stuff to fix
the problem; we tried that to some extent with the first OOM closure
-- it is the standard response to test failure, of course -- but it
didn't work.

More generally, I don't like the idea of making this kind of breakage
normal. I'd prefer to see effort go towards preventing it rather than
tolerating it.

Nick


How to reduce the time m-i is closed?

2013-11-16 Thread smaug

Hi all,


the recent OOM cases have been really annoying. They have slowed down
development, even for those who haven't been dealing with the actual
issue(s).

Could we handle this kind of case differently? Perhaps clone the bad
state of m-i to some other repository that we track with tbpl, back out
stuff from m-i to a state where we can run it, re-open it, and do the
fixes in the clone.
And then, say in a week, merge the clone back to m-i. If the state is
still bad (no one has stepped up to fix the issues), then keep m-i
closed until the issues have been fixed.


thoughts?


-Olli