Hello all,

Here is the scheduler assert patch, test cases, and integration with the
test suite combined into a single patch! After a bit more testing with
code that used a lot of threads, I ended up looking at how make-thread
uses ##sys#make-thread and added some missing code to the patch's usage
of ##sys#make-thread.

Best, Jonathan Chan

> On Jul 9, 2015, at 4:29 PM, Jonathan Chan <j...@fastmail.fm> wrote:
>
> ... and here is hopefully the final version of the patch for review.
> In the end I think hijacking a thread to run the signal handler is too
> much prone to error. Although the previous patches try to cope with
> this behavior, as soon as the signal handler tries to do something
> like lock a mutex via srfi-18 or yield, very strange things will
> happen depending on what thread was hijacked, which threads were
> blocking, etc. Blocking
> ##sys#schedule from running is a partial fix, but doesn't help when it
> would have been expected for ##sys#schedule not to return immediately,
> e.g. if a signal handler tries to block on a socket. For example,
>      seems to make the the tcp unit go into a constant loop while
>      trying to recv.
>
> In this newest patch, the ##sys#interrupt-handler in scheduler.scm
> will instead save the old handler in ##sys#signal-vector, change the
> handler to one that makes a thread with ##sys#make-thread to call the
> old handler and then schedules that thread to run, wrap state so that
> the original ##sys#interrupt-handler will reset the handler and then
> finally call the old hook so everything starts going.
>
> In addition to the new patch, I have attached a revised version of the
> old test case that should be more likely to trigger the assert failure
> in unpatched versions of the scheduler as well as a new test case that
> 1) shows problems with the previous patches 2) ensures that the signal
>    handler can properly call blocking code and 3) possibly
>    demonstrates a bug with multiple threads blocking on one fd that I
>    am not certain about.
>
> Instructions for scheduler-fdset.scm:
>
> 1. Run with csi or csc.
> 2. Send SIGINT (e.g. with Ctrl-C)
> 3. The test should write "locking... unlocking... unlocked" and then
>    exit without error.
>
> Instructions for scheduler-fdset-2.scm: You will need to compile
> sink.c with gcc sink.c -o sink
>
> 1. Run the test
> 2. Wait a short bit for the child processes to start
> 3. Start netcat in a new terminal with `nc -l 8080`
> 4. Send SIGINT (e.g. with ^C)
> 5. Verify that the process is not using 100% CPU
> 6. Type `hi` (or anything) in the terminal with netcat and press
>    Return
> 7. The test should exit without error
>
> Sorry for the many versions and emails! This at least lets me get back
> to the work that needed the scheduler in the first place. I hope that
> this finally kills the bug...
>
> All the best,
> --
> Jonathan Chan j...@fastmail.fm
>
> On Thu, Jul 9, 2015, at 03:18 AM, Jonathan Chan wrote:
>> I realized that if I am able to set a flag about whether or not we
>> are in the interrupt hook, I might as well save the previous block
>> object of the thread and then restore it instead of possibly causing
>> spurious wakeups. As an aside, it seems as if force-primordial
>> doesn't currently do this, and relies on the code that asked the
>> scheduler to block on i/o to retry on its own. As I stated in the
>> previous email, this seems to be what the code that uses the
>> scheduler's blocking facilities does anyway.
>>
>> In any case, this patch also passes my tests (as the previous one
>> did) but now should more cleanly avoid "spurious wakeups" of threads
>> blocking on I/O.
>>
>> Also, I realized that the test case lacked instructions; you should
>> just be able to start it, wait a second, then send SIGINT (e.g. with
>> Ctrl-C) and the unpatched scheduler should cause the program to exit
>> with an assert error while with the patched scheduler you should just
>> see a nice:
>>
>> [jyc@jc-avm-arch tests]$ csi -s scheduler-fdset.scm   ^Clocking...
>> unlocking...   unlocked...   done!
>>
>> Best regards,
>> --
>> Jonathan Chan j...@fastmail.fm
>>
>> On Wed, Jul 8, 2015, at 11:34 PM, Jonathan Chan wrote:
>>> Hi again,
>>>
>>> It turns out that changing the current thread's state to 'running
>>> doesn't actually work in all cases because a) unblocking from
>>> condition variables will skip a thread if its state is running b)
>>> the call to change the state back was never called because the call
>>> to the old hook would never return. I found this by doing some more
>>> testing with the program that caused me to run into the original
>>> assert bug.
>>>
>>> Here is a revised patch which instead sets a variable to indicate
>>> whether or not we are in the interrupt hook and skips the scheduler
>>> if so. I've also included an updated test case that catches the
>>> problem with the previous patch.
>>>
>>> Best regards,
>>> --
>>> Jonathan Chan j...@fastmail.fm
>>>
>>> On Wed, Jul 8, 2015, at 08:12 PM, Jonathan Chan wrote:
>>>> Hi all,
>>>>
>>>> I ended up having some free time before the weekend and went back
>>>> to the scheduler assert problem. I wrote a test case for the bug
>>>> and revised the scheduler assert patch after some further testing.
>>>>
>>>> In the previous patch, the thread in which the interrupt handler
>>>> ran would just be removed from ##sys#fd-list if the block object
>>>> was changed from a file. This didn't seem to handle some cases
>>>> where the interrupt handler ended up unblocking. In one of these
>>>> cases, when mutex-lock! and then mutex-unlock! were called in
>>>> succession, it also turned out that the "current thread"'s state
>>>> needed to be saved, set to 'running during the interrupt handler,
>>>> then restored in order to avoid mutex-unlock! from calling the
>>>> scheduler and causing the program to hang/do strange things.
>>>>
>>>> I've attached the the test case for the bug and the revised patch.
>>>> Again, if someone experienced with the scheduler could review it
>>>> and make sure it doesn't cause strange things to happen, it would
>>>> be very much appreciated.
>>>>
>>>> Best regards,
>>>> --
>>>> Jonathan Chan j...@fastmail.fm
>>>>
>>>> On Sat, Jul 4, 2015, at 07:49 PM, Jonathan Chan wrote:
>>>>> Hello all,
>>>>>
>>>>> I was in the IRC channel earlier trying to find out why my program
>>>>> was failing due to a scheduler assertion about the fdset for poll.
>>>>> This would happen consistently when I sent SIGINT.
>>>>>
>>>>> The error looks like:
>>>>>
>>>>> scheduler.c:51: C_fd_ready: Assertion `fd == C_fdset_set[pos].fd'
>>>>> failed
>>>>>
>>>>> After looking at the scheduler code, it looks like this happened
>>>>> because the SIGINT was causing the interrupt handler to be called,
>>>>> which ended up unblocking some thread to run a signal handler,
>>>>> which caused a condition variable broadcast. In the end, this made
>>>>> a thread that was previously blocking on a file descriptor to now
>>>>> block on a mutex.
>>>>>
>>>>> When create-fdset procedure iterated over the ##sys#fd-list, it
>>>>> skipped threads in the list who were not blocking on fds while it
>>>>> added those who were to the fdset using fdset-set. However, it did
>>>>> not remove these threads from ##sys#fd-list, which resulted in the
>>>>> scheduler throwing an error because it expected a fd to be in the
>>>>> fdset when it was not there.
>>>>>
>>>>> I have attached a patch which modifies create-fdset to remove
>>>>> threads from ##sys#fd-list who are not blocking on an fd, rather
>>>>> than only skip them. This seems to fix my problem and has not
>>>>> introduced any bizarre behvaior in the few tests I have run, but
>>>>> hopefully someone who is more familiar with the scheduler could
>>>>> look it over and see if it fixes the issue.
>>>>>
>>>>> Best regards,
>>>>> --
>>>>> Jonathan Chan  j...@fastmail.fm
>>>>> _______________________________________________
>>>>> Chicken-hackers mailing list Chicken-hackers@nongnu.org
>>>>> https://lists.nongnu.org/mailman/listinfo/chicken-hackers Email
>>>>> had 1 attachment:
>>>>> + scheduler-assert.diff 2k (text/plain)
>>>> _______________________________________________
>>>> Chicken-hackers mailing list Chicken-hackers@nongnu.org
>>>> https://lists.nongnu.org/mailman/listinfo/chicken-hackers Email had
>>>> 2 attachments:
>>>> + scheduler-assert-fix.diff 5k (application/octet-stream)
>>>> + scheduler-fdset.scm 2k (application/octet-stream)
>>> _______________________________________________
>>> Chicken-hackers mailing list Chicken-hackers@nongnu.org
>>> https://lists.nongnu.org/mailman/listinfo/chicken-hackers Email had
>>> 3 attachments:
>>> + scheduler-fdset.scm 2k (application/octet-stream)
>>> + sink.c 1k (text/plain)
>>> + scheduler-assert-fix.diff 14k (application/octet-stream)
>> _______________________________________________
>> Chicken-hackers mailing list Chicken-hackers@nongnu.org
>> https://lists.nongnu.org/mailman/listinfo/chicken-hackers Email had 1
>> attachment:
>> + scheduler-assert-fix.diff 14k (application/octet-stream)
> <scheduler-assert-new.diff><scheduler-fdset-2.scm><scheduler-
> fdset.scm><sink.c>_______________________________________________ Chicken-
> hackers mailing list Chicken-hackers@nongnu.org
> https://lists.nongnu.org/mailman/listinfo/chicken-hackers

Attachment: scheduler-assert-final.diff
Description: Binary data

_______________________________________________
Chicken-hackers mailing list
Chicken-hackers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-hackers

Reply via email to