On 2014-12-19 4:40 PM, Nils Ohlmeier wrote:

On Dec 19, 2014, at 6:56 AM, Ehsan Akhgari <ehsan.akhg...@gmail.com> wrote:

Let me try to rephrase the problem in different terms, to hopefully make it 
clearer why using timers like this is a bad idea.

setTimeout(foo, 1000) may seem to suggest "run foo after 1 second", but that 
is *not* what that function does, at all.  What it actually does is run foo 
after *at least* 1 second, at whatever later point the event loop gets around 
to it, which could even be *minutes* later.
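
To make that concrete, here is a minimal sketch (illustrative only, using 
dump() as it is available in mochitests):

  // Illustrative sketch only: the 1000 ms argument is a *minimum* delay,
  // not a promise about when the callback actually runs.
  var requested = 1000;
  var start = Date.now();
  setTimeout(function foo() {
    var elapsed = Date.now() - start;
    // On a starved test slave, elapsed can be arbitrarily larger than
    // requested (seconds, even minutes), because the callback only runs
    // once the event loop gets around to it.
    dump("requested " + requested + "ms, actually waited " + elapsed + "ms\n");
  }, requested);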

A lot of people assume that just because their local timings pan out, and 
they've added a, let's say, 10x error margin to their timers, their use of 
timers is fine.  It is not, as years of intermittent oranges have 
demonstrated.  We have no way of knowing whether those calculations hold 
under various kinds of machine load, on new platforms, or on much lower 
powered machines (think of running tests whose timings were measured on a 
beefy desktop on a low-end phone).  And increasing the error margin won't 
save us; it will just move the problem further into the future.

Ok. That's fine by me.
First of all I assume that the slowness of the machine also affects 
mochitest's overall timer, right?

That's correct.

> If not, then we are in big trouble.
Secondly, remember that I'm NOT doing anything useful when my 
"missing-event-timer" pops. I'm just logging a message that the event is 
missing and then I'm trying to exit the test early (earlier than mochitest's 
general test timeout).

Your approach is _still_ faulty in that you are assuming that section of your test will reasonably finish in 60s, but if the test slave is starved, the 60s timer may fire *before* the event you are waiting for has a chance to be dispatched. Based on that, your approach is prone to intermittent failures.

Of course, the exact level of problem depends on what you mean by "log" above. If you mean something that is *actually* side-effect free such as dump() or info(), then the worst thing that can happen is that you may see misleading log messages sometimes. But if you do *anything* which can have side-effects, you should stop using setTimeout ASAP. That includes things such as ok(false, "the event didn't arrive"), SimpleTest.finish(), etc.
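
To illustrate the distinction (the "connected" event name below is made up):

  // Merely misleading on a starved slave: the worst outcome is a confusing
  // log line, because info()/dump() have no effect on the test result.
  setTimeout(function () {
    info("still waiting for the 'connected' event after 60s");
  }, 60000);

  // Prone to intermittent failures: if the slave is starved, this can fire
  // before the event has had a chance to be dispatched, and fails the test.
  setTimeout(function () {
    ok(false, "the 'connected' event didn't arrive");
    SimpleTest.finish();
  }, 60000);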

At this point though, you should be asking yourself, why go through the pain of ensuring those timer callbacks are side-effect-free and that they remain so for the years to come? Really, the problem that you are trying to protect yourself against doesn't exist. So, please just don't use setTimeouts like that, ever! :-) That will keep things much simpler, and less footgun-ish.

The only problem I can see with that approach is: if the "missing-event-timer" 
actually pops more or less precisely after the requested time, BUT the actual 
work is running in super slow motion for some reason, then the event would 
have arrived, but only after the event timeout.
So if my "missing-event-timers" are popping on time, because for some bizarre 
reason they are not affected by the overall slowness of the machine, then we 
have a problem with this. But in that case we should have a problem with the 
overall test timeout of mochitest as well, shouldn't we?

The default mochitest timeout is 5 minutes. And you're right, if your test takes longer than that, it will fail. If it does so intermittently, then it will fail intermittently. And if that happens, the solution is either to use SimpleTest.requestLongerTimeout() to request a longer timeout, or to split up the test. Years of experience have shown that the default 5 minute timeout is sufficient for all of our platforms in practice. And if in the future that proves to be too short, we can change it globally with a one-line change (in TestRunner.js). If individual test authors roll their own clones of the harness timeout mechanism, updating them in that situation will become very, very hard, if possible at all.
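
For reference, requesting a longer harness timeout is a one-liner at the top 
of the test (the factor here is just an example):

  // Multiply the default 5 minute harness timeout by 2 for this test only.
  SimpleTest.requestLongerTimeout(2);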

Also, please note that letting the test time out if it doesn't succeed is a 
perfectly valid failure scenario, we have numerous tests that do things like 
fire an async event, setup a listener and wait for the listener to fire and 
SimpleTest.finish() there, without any timers to catch the case where the event 
does not fire.
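
A sketch of that pattern (the event name and startAsyncOperation() are 
placeholders):

  // Sketch of the timer-free pattern described above; "someevent" and
  // startAsyncOperation() are made-up names.
  SimpleTest.waitForExplicitFinish();

  target.addEventListener("someevent", function listener() {
    target.removeEventListener("someevent", listener);
    ok(true, "someevent fired");
    SimpleTest.finish();
  });

  // Kick off the async work that should eventually fire the event.
  startAsyncOperation();

  // If the event never fires, the harness timeout fails the test for us;
  // no hand-rolled setTimeout is needed.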

My tests are not about testing this one timer. Doing that would be relatively 
easy.
In WebRTC I need to orchestrate two participants to establish a real-time 
audio/video call. Lots of the tests need to establish a full call before 
assessing whether the call got properly established under the given criteria. 
The most common scenario for intermittent failures is that the call does not 
even get established. From start to a full call I need to verify several 
state machines and lots of events firing, and all of that twice for the two 
participants of each call.
So yes, I could simply try to establish a call (without worrying about all 
the intermediary steps and events), then do the verification/assessment, and 
let call failures be caught by the generic mochitest timeout. But then I 
would spend hours and hours staring at pretty useless log files (see below) 
every time a call got stuck.

Can you please point me to the code in question?

  Logging sufficiently is almost always enough to not have to use these timers, 
as those tests have demonstrated in practice.


I disagree. Especially as the logging of mochitest is really poor:
- the timestamps of log messages are lost, as described in bug 1020516
- SimpleTest only has one log level: info(), which I think is the main reason 
the log files get big, because test writers can't fine-tune the logging
- log messages from mochitest are not printed together with the stdout/stderr 
of the browser (so correlating the two is almost impossible)
- logging behaves differently on your local machine than on the build servers

Everybody (yours truly included!) has their favorite list of logging issues, and honestly I've not been paying close attention to the recent changes such as structured logs, so I'm not able to help you on any of these specifically. Some of these issues are possible to work around in your own code (for example you can have your own logging levels and whatnot; based on a quick grep there's even some prior art, for example see Logger in accessible/jsat/Utils.jsm). Some of these (such as the latter two) seem to be issues that you'd also have with your own custom logging in mochitest, so I'm not sure how the timer based logging is not affected by them.
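
For example, a per-test log-level wrapper layered on top of info() could look 
something like this (names and levels are made up for illustration):

  // Made-up per-test log levels on top of SimpleTest's info().
  var LOG_LEVEL = 2; // 0 = error, 1 = warn, 2 = info, 3 = debug

  function log(level, msg) {
    if (level <= LOG_LEVEL) {
      info("[" + ["ERROR", "WARN", "INFO", "DEBUG"][level] + "] " +
           new Date().toISOString() + " " + msg);
    }
  }

  log(3, "ICE candidate gathered");                     // suppressed by default
  log(1, "call setup is taking longer than expected");  // logged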

BTW I understand that WebRTC has pretty special and probably unique 
requirements compared to what the rest of the mochitest users do with it. 
That's why I was trying to make the case for adding a special convenience 
function for setting up these special "missing-event-timers", which BTW would 
also allow us to increase or decrease the timeout values based on the 
platform/machine the test currently gets executed on.

If there were a way to implement your idea of "missing event timer" in a way that is not prone to intermittent failures, I'd say go for it. But I don't understand how you can do that.

But thinking of this more, do you find it useful to be able to register a callback with the harness to be called when the harness decides to kill your test because it has timed out? (Obviously the callback would be informational, and would not be able to prevent the harness from killing the test and moving on.)
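
Purely as a sketch of that idea (registerTimeoutCallback() does not exist in 
SimpleTest today, and pc1/pc2 are made-up names):

  // Hypothetical API sketch only; nothing like this exists yet.
  SimpleTest.registerTimeoutCallback(function () {
    // Informational only: dump the current state of both participants so
    // the harness-timeout log is actually useful for debugging.
    dump("pc1 signaling state: " + pc1.signalingState + "\n");
    dump("pc2 signaling state: " + pc2.signalingState + "\n");
    // The harness still kills the test and moves on after this returns.
  });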

But as I don’t seem to be able to make my case here I’ll simply live with 
requesting flaky timeouts for all WebRTC tests.

Well, that would be a terrible conclusion to this thread. :( That would mean that WebRTC tests will be prone to all sorts of intermittent test failures caused by setTimeout (both the scenario above, and others -- as requestFlakyTimeout is a blanket pass) forever.

Please note that intermittent test failures are a real problem. Besides all of the time wasted by sheriffs/developers/machines to deal with them, they reduce our ability to tell whether a given checkin is "good", and *most importantly* they will result in the tests getting disabled. Tests getting disabled means that areas of our code will be untested, so we won't be able to detect the regressions that the tests would have otherwise caught, and we'd end up shipping those bugs to our users.

I'm really trying hard to suggest something that is not flaky for your needs. Perhaps the above idea would help. If not, maybe I can come up with something by looking at the code. But simply calling requestFlakyTimeout in *all* WebRTC tests and moving on is not really an outcome that I look forward to seeing.

Cheers,
Ehsan

On 2014-12-19 1:07 AM, Boris Zbarsky wrote:
On 12/18/14, 2:28 PM, Nils Ohlmeier wrote:
Well there is an event listener waiting for the event to fire.
But how else than through a timeout, obviously with a high timeout value
like 30 or 60 seconds

We've had quite a number of random oranges from precisely this scenario.
It seems that it's not uncommon for the test VMs to completely lose the
timeslice for 60 seconds or more.

If the test is in fact expecting the event to fire, the right thing to
do is to just have the test time out per the harness timeout (which can
be globally adjusted as needed depending on the characteristics of the
test environment) if the event doesn't fire.  Rolling your own shorter
timer just means contributing to random orange.

Sure I can print an info message that my test is now waiting for an
event to pop,

Sure, and another one when the event comes in.  Then you can check the
log to see whether the timeout was for this reason in the event of a
harness timeout on this test.
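
Something like this (the event name is made up):

  // "iceconnected" is a placeholder event name.
  info("waiting for 'iceconnected' to fire");
  target.addEventListener("iceconnected", function listener() {
    target.removeEventListener("iceconnected", listener);
    info("'iceconnected' fired");
  });

  // A harness timeout plus a "waiting for" line with no matching "fired"
  // line tells you exactly where the test got stuck.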

But as I tried to describe in my other email, having a long-living timer
which pops complaining that event X is missing is, I think, a legit use
case for setTimeout in tests.

Given the number of random oranges we've had from _precisely_ this use
of setTimeout, I don't think it's legit.

-Boris