testHighPriorityOverridesWaitTime intermittent failures

Ben Coman Sat, 30 Jan 2016 01:45:45 -0800

Case 17373 had an erroneous failure from
ProcessTest>>testHighPriorityOverridesWaitTime
and several others seem also to have suffered...
* https://pharo.fogbugz.com/default.asp?17373
* https://pharo.fogbugz.com/default.asp?17496
* https://pharo.fogbugz.com/default.asp?17445
* https://pharo.fogbugz.com/default.asp?7444


These have comments that the problem can't be reproduced, which is
common of occasional race conditions.  So rather than just try to
solve this on the issue tracker, I thought I'd share a tip how to
isolate and squash such intermittent failures, essentially by
stressing that single test a large number of times like this...

    1 to: 100 do: [  :n |  Transcript crShow: 'Try ' , n printString.
          ProcessTest debug: #testHighPriorityOverridesWaitTime.  ]

Since the debugger only focusses on a point in time snapshot of the
running system, some Transcript instrumentation was useful to overview
the dynamic behaviour, like this...

    ProcessTest>>testHighPriorityOverridesWaitTime
        | lowerPriorityWaitingLonger higherPriorityWaitingLess
nextReadyProcess |
        lowerPriorityWaitingLonger := [ ] forkAt: 11
        higherPriorityWaitingLess :=   [ ] forkAt: 12.
        nextReadyProcess := Processor nextReadyProcess.
        Transcript cr;
           tab;show: (lowerPriorityWaitingLonger priority ->
lowerPriorityWaitingLonger identityHash); cr;
           tab;show: (higherPriorityWaitingLess priority ->
higherPriorityWaitingLess identityHash); cr;
           tab;show: (nextReadyProcess priority -> nextReadyProcess
identityHash).
        self assert: nextReadyProcess equals: higherPriorityWaitingLess.

which produces...
  Try 1
    11->343137024
    12->40659200
    12->40659200
  Try 2
    11->461794304
    12->374996992
    12->40659200

Observe on the second try nextReadyProcess refers to same process as
from the first try.  This can be cleaned up just prior to the asset
check by adding ...
   lowerPriorityWaitingLonger terminate.
   higherPriorityWaitingLess terminate.
which produces...
  Try 1
    11->1005688064
    12->833273600
    12->833273600
  ...
  Try 100
    11->764450816
    12->807973888
    12->807973888

However this is still a little fragile, since some process with
priority greater than the running test could slip in before the call
to nextReadyProcess, as can be seen by pushing the #to:do: loop out to
100000, which produces...
  Try 32044
    11->784361216
    12->431928576
    40->397533952

or wrapping it by [   ]forkAt:39, which produces...
  Try 1
    11->431897088
    12->400972288
    30->338363648

After several awkward failed attempts to fix this, it seemed best to
observe actual behaviour rather than peeking into the dynamic data
structures. Thus I propose...

  ProcessTest>>testHighPriorityOverridesWaitTime
    "The first process to resume will pass straight through the gate
    while the other waits for the assert to whichRan."

    | gate checkAssert priorityRan |
    gate := Semaphore new signal.
    checkAssert := Semaphore new.

    [ gate wait. priorityRan := 11. checkAssert signal ] forkAt: 11.
    [ gate wait. priorityRan := 12. checkAssert signal ] forkAt: 12.

    checkAssert wait.
    self assert: priorityRan=12 description: 'Second scheduled but
higher priority should run first'.
    gate signal.

    checkAssert wait.
    self assert: whichRan=11 description: 'First scheduled but lower
priority should run after'.

https://pharo.fogbugz.com/default.asp?17501

[Pharo-dev] ProcessTest>>testHighPriorityOverridesWaitTime intermittent failures

Reply via email to