Re: [Pharo-dev] Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Ben Coman Sun, 17 Jul 2016 02:48:07 -0700

On Wed, Jul 13, 2016 at 12:30 AM, Mariano Martinez Peck
<marianop...@gmail.com> wrote:
>
>
> On Wed, May 18, 2016 at 11:39 PM, Ben Coman <b...@openinworld.com> wrote:
>>
>> On Thu, May 19, 2016 at 8:49 AM, Martin McClure <mar...@hand2mouse.com>
>> wrote:
>> > On 05/18/2016 03:17 PM, Martin McClure wrote:
>> >>
>> >> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>> >>> explain how this could happened but it does, and it happened to me a
>> >>> couple
>> >>> of times (but not fully reproducible).
>> >>>
>> >>
>> >> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> >> doubled. It's not clear how that can happen, but I'll look a little
>> >> more.
>> >>
>> >
>> > Mario, is there any chance that you might be saving the image during one
>> > of
>> > these Delays?
>> >
>> >
>> > This one smells like a race condition, and I think I see something that
>> > *might* explain it. But I don't have any more time to spend on this one,
>> > so
>> > I'll leave the rest to someone else. I hope this is helpful:
>> >
>> > The only way I immediately see for the schedulerResumptionTime to become
>> > approximately doubled is if the Delay's resumption time is adjusted by
>> > #restoreResumptionTimes without previously having been adjusted by
>> > #saveResumptionTimes.
>> >
>> > The only time either of those are sent, that I can see, is on saving the
>> > image. Both are normally sent, (save before the snapshot, restore
>> > afterwards), but there may be a hole there.
>> >
>> > #saveResumptionTimes is only sent (by this scheduler class) when the
>> > accessProtect semaphore is held, but #handleTimerEvent: is executed in
>> > the
>> > timing Process *without* the protection of accessProtect, in the case of
>> > the
>> > VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>> > #handleTimerEvent: could run in the middle of #saveResumptionTimes. If
>> > some
>> > Delay expires because of that timer event, our Delay could move from
>> > being
>> > the first suspended delay to being the active delay. If that happens
>> > after
>> > we've adjusted the active delay, but before we've processed the
>> > suspended
>> > delays, that Delay will not get adjusted, and will show the symptoms
>> > that
>> > Mariano is seeing.
>>
>> A quick experiment to test this might be in shutDown/#startUp trying...
>>     [ self saveResumptionTimes ] valueAt: Processor timingPriority
>>     [ self resumeResumptionTimes ] valueAt: Processor timingPriority
>>
>> >
>> > Also, I'm not sure how the Heap that holds the suspendedDelays will
>> > react to
>> > being modified in the middle of an enumeration. That might open a larger
>> > window for the problem.
>> >
>> > Regards,
>> >
>> > -Martin
>> >
>>
>> Even if not directly related to Mariano's problem, I agree with your
>> general assessment.  I'm not comfortable with the way that
>> #save/#restoreResumptionTimes (which manipulate suspendedDelays) are
>> called from user priority code via #shutDown/#startUp.  Per the
>> original code**, accessProtect can't be used inside the timing
>> priority #handleTimerEvent since accessProtect is held by the user
>> priority #schedule when it uses "timingSemaphore signal" to invoke
>> invokes #handleTimerEvent.  accessProtect never protected
>> timingPriority manipulation of suspendedDelays by #handleTimerEvent,
>> nor expired delays waking up.  But ahhh... the disabling of
>> accessProtect previously prevented new delays being scheduled between
>> a #save and #restore. If a new delay is scheduled after the #save,
>> when it is #restore'd its resumptionTime would be wrong.
>>
>> Waiting in the wings for Pharo 6 I have changes that should help:
>> * have #save/#restoreResumptionTimes *only* called from timing
>> priority event loop (i.e. #handleTimerEvent)
>> * shutDown/startUp suspends/resumes the timing priority event loop,
>> instead of trying to block signals to timingSemaphore
>>
>> I haven't touched it for a few months so I'll need to chase it up to
>> provide a preview.
>>
>>
>>
>> Mariano, can you try DelayMillisecondScheduler (which however is
>> missing some fixes for other issues).
>>
>
>
> Ben,
>
> Did you find a chance to do something else with this? I am still finding
> lookups every in a while even with the DelayMillisecondScheduler  :(


Hi Mariano,

Sorry for the delay attending to this.  I got a bit twisted up on
whether I'm complicating things too much with my DelayScheduler
refactoring for Pharo 6 (I should just put it out for early review),
but then I had to get through a Netflix addiction to the Star Trek
prequel Enterprise, and family time over my kids school holidays.  I'm
back looking at it now - and taking a step back to see what is the
minimum change required to fix this.

Do you have some test code or a test image I could run to reproduce
the problem?

Now I have one thing for you to try.  In the move of DelaySchedulerXxx
from milliseconds to microseconds, the code for dealing with clock
wrap-around was removed, but I wonder if it was also covering clock
jitter as a side effect.  Could you try again the SpinScheduler but
restore this code...

>From handletTimerEvent...
  "Check for clock wrap-around."
  millisecondNowTick < activeDelayStartTime ifTrue: [
    "clock wrapped"
    self saveResumptionTimes.
    self restoreResumptionTimes ].
  activeDelayStartTime := millisecondNowTick.

>From runTimerEventLoop...
  Time millisecondClockValue < millisecondNowTick
    ifTrue:[ timingSemaphore signal ]. "retry"

>From startup...
   activeDelayStartTime := Time millisecondClockValue.

cheers -ben

Re: [Pharo-dev] Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Reply via email to