[Felix-language] pthread issues

john skaller Wed, 01 Feb 2012 05:21:22 -0800

Pthreads are not working right. tools/launch crashes,
and this program fails to do what I expect:


// pthread test

noinline proc mkfibre(p:int, f:int) {
  spawn_fthread {
    for k in 1 upto 10 do
      eprint$ "Thr " + str p + " fibre " + str f + " step " + str k + "\n";
      //Faio::sleep(sys_clock, 0.01 + f.double / 5.0);
    done
    eprint$ "Thr " + str p + " fibre " + str f + " DEAD " + "="*20+"\n";
  };
}

noinline proc mkthread (x:int) {
  spawn_pthread {
    for var j in 1 upto 10 do
      eprint$ "Thr " + str x + " step " + str j + "\n";
      //Faio::sleep(sys_clock, 0.1 + x.double / 2.0);
      for var f in 0 upto 10 do
        mkfibre(x, f);
      done
    done
    print$ "Thread " + str x + " done" + "*"*20+"\n";
  };
}

for var k in 1 upto 10 do
   mkthread k;
done;


I will start by covering two issues that can be dealt with.

1) Due to the way the optimiser works, it doesn't recognize threads:
neither p nor f threads. It will happily inline the procedures above
without the "noinline" keyword, and this causes the fibres
to refer to the current values of their parameters in the main procedure,
instead of a private copy of each one. "noinline" stops this, so each
fibre/pthread is bound to a distinct stack frame object.

This is NOT a bug in the optimiser; in fact exactly the same thing
happens if the spawned thread is replaced by a closure which is
returned and executed later. The closure is bound to its parent
by a pointer, and if that object is inlined into some other object,
the binding is to that object, instead of a separate object.

"noinline" prevents this. The difficulty is in the semantic specification.
Currently both behaviours are correct unless noinline is specified,
which means inlining actually has semantics: it isn't just an optimisation
hint.
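
The closure analogy can be illustrated outside Felix. The sketch below uses Python, not Felix, and only models the effect: a closure created directly in a loop shares the enclosing frame (roughly the inlined case), while a closure created through a separate function call gets its own frame (roughly the "noinline" case). The function names are made up for the illustration.

```python
def spawn_all_inlined():
    # Models the inlined case: every closure is bound to the same frame,
    # so each one sees the current value of k in the enclosing procedure.
    closures = []
    for k in range(1, 4):
        closures.append(lambda: k)
    return [c() for c in closures]


def make_closure(x):
    # Models the "noinline" case: each call creates a fresh frame that
    # holds a private copy of the parameter.
    return lambda: x


def spawn_all_noinline():
    closures = []
    for k in range(1, 4):
        closures.append(make_closure(k))
    return [c() for c in closures]
```

Here spawn_all_inlined() yields the final loop value three times, while spawn_all_noinline() yields 1, 2, 3.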

This is no different to the parameter passing rule, where "val" parameters
can be evaluated either eagerly or lazily unless you specify one or
the other with a "var" or closure type respectively.

The design is misleading but intentional, to permit optimisations
which otherwise would not be possible, at least without much more
difficult analysis.

2) Pchannels don't work as I expected. They block the pthread.
This isn't wrong, it just isn't what I expected: I thought they would
block the containing fthread only, just like other async ops such as
sleeping or waiting on a socket.

3) So now the REAL problem to be addressed here:

If you run the program above, some of the loop instances of the
inner fthread just don't run at all. In fact, the pthreads seem
to die prematurely.

To explain, I will back up a bit!

Fibres work by a flat (stackless) piece of code creating a new fibre
as a heap object which is added to a scheduler wait list.
The current fibre can yield, which it does by storing the "next"
address in a variable and returning. When it is resumed,
a jump to the "next" address is done so control continues
"where it left off before".

Normally, fibres just don't yield like this. What they can do is
read or write an schannel. When a write is first done,
a pointer to the writing fibre is stored in the schannel
and then the next fibre on the schedule list is run.
The writing fibre is NOT added to the wait list.

When a fibre reads, it grabs data from the writer fibre and the
writer is unlinked from the schannel and put on the wait list.
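
A rough model of the above, in Python rather than Felix. Generators stand in for fibres (the generator object remembers where to resume, like the stored "next" address). The names SChannel and run are invented here, and the case where a reader arrives before any writer is my assumption: the reader is parked on the channel symmetrically.

```python
from collections import deque


class SChannel:
    def __init__(self):
        self.waiting_writers = deque()  # (fibre, value) pairs parked on the channel
        self.waiting_readers = deque()  # fibres parked waiting for data (assumed)


def run(fibres):
    ready = deque((f, None) for f in fibres)  # the scheduler's list
    while ready:
        fibre, value_in = ready.popleft()
        try:
            op = fibre.send(value_in)  # resume the fibre where it left off
        except StopIteration:
            continue  # fibre finished
        if op[0] == "write":
            _, chan, value = op
            if chan.waiting_readers:
                ready.append((chan.waiting_readers.popleft(), value))
                ready.append((fibre, None))
            else:
                # Writer hangs off the channel; it is NOT on the ready list.
                chan.waiting_writers.append((fibre, value))
        else:  # "read"
            _, chan = op
            if chan.waiting_writers:
                writer, value = chan.waiting_writers.popleft()
                ready.append((writer, None))      # writer goes back on the list
                ready.appendleft((fibre, value))  # reader continues with the data
            else:
                chan.waiting_readers.append(fibre)


def writer(chan, log):
    for v in (1, 2):
        yield ("write", chan, v)
        log.append(("wrote", v))


def reader(chan, log):
    for _ in range(2):
        v = yield ("read", chan)
        log.append(("read", v))


log = []
ch = SChannel()
run([writer(ch, log), reader(ch, log)])
```

Note that if a writer is run with no reader, run() simply returns with the writer still parked on the channel: nothing on the ready list refers to it any more.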

The waiting fibre is not garbage collected if it's in the wait list.
If it's attached to a channel, then the channel will usually be stored
in a variable in some reachable fibre, and so the channel is reachable
and thus the waiting writer is also reachable. If there are no "owners"
of the channel other than the writer, the channel and writer aren't
reachable and they get reaped by the collector. If there's a "deadlock"
on two channels, two fibres both become unreachable and suicide,
eliminating the deadlock: fibres cannot deadlock, they just commit suicide.

Ok, that's the SIMPLE case. Here's the harder one: when a fibre does a
supported "blocking" operation (socket I/O or sleeping on a Faio timer clock),
the fibre is made a GC root to ensure it isn't reaped by the collector, and
is moved into an async sleep queue. This is distinct from the synchronous
scheduling queue.
When the system "poll" thread finds an event that the fibre is waiting on, it
wakes the fibre up by unrooting it and putting it back on the synchronous
scheduling queue.
For sockets, it does the I/O first, so the woken-up fibre will resume with its
request serviced.

OK, so now the hard part! Termination! Felix (should) terminate
a thread when

(a) there are no fibres on the synchronous schedule queue
(b) there are no fibres on the asynchronous sleep queue

and should terminate the program when all pthreads have completed.
If you run a bunch of fthreads that wait on the clock a bit, you'll find
the program will not exit until all the fthreads are completed.
If you run several pthreads you'll find that the program will not terminate
until all the pthreads have completed. 

The "main" thread is special, because it 

(a) creates the garbage collector
(b) creates the "thread frame"
(c) starts the code with standard file parameters, etc
(d) won't complete until all child pthreads finish.

There is special code for this main thread. It sucks a bit.
The asymmetry is ugly. 

Now you must also note there is only ONE thread doing the timer
and socket servicing! All the pthreads use a single asynchronous
event monitor thread. Access to the event queue therefore has
to be serialised. The requests going to this thread pass a pointer
to the object to be woken up; it is always going to be rescheduled
in its original thread (which is never the thread that reschedules it,
that's the event service thread).
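
To make the arrangement concrete, here is a small Python model (not the Felix RTL). One service thread takes requests from a thread-safe queue, which is what serialises access, and puts each woken fibre back on the queue of the thread that owns it. The names event_thread, requests, home_a and home_b are invented, and real events are replaced by immediate wake-ups.

```python
import queue
import threading


def event_thread(requests):
    # The single service thread: handles one request at a time and
    # reschedules each fibre onto the queue of its original thread.
    while True:
        item = requests.get()
        if item is None:  # shutdown marker for this sketch
            break
        fibre, home_queue = item
        home_queue.put(fibre)


requests = queue.Queue()  # thread-safe, so access is serialised
home_a, home_b = queue.Queue(), queue.Queue()  # per-pthread schedule queues

t = threading.Thread(target=event_thread, args=(requests,))
t.start()
requests.put(("fibre-1", home_a))  # each request carries the fibre and its home queue
requests.put(("fibre-2", home_b))
requests.put(None)
t.join()
```

The point to notice is that the service thread holds a reference to the home queue. If the owning thread has already gone away, that reference is all that is left.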

NOW: what I think is happening is: the spawned pthreads
are returning whilst there are fibres waiting on that thread's
sleep queue. They're just objects. When the pthread returns
the sleep queues for that thread just get lost. In the program
above this doesn't cause a crash, but in the launch program ..
well the event handler is rescheduling a fibre from a pthread
queue that is deleted, onto a synchronous wait queue that is
also deleted.

In other words .. the pthread is being killed when
it hits the "return" statement at the end of its procedure.
Unlike the main thread, it isn't waiting for asynch events.
It should.
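
The difference between the two behaviours can be modelled in a few lines of Python. This is a toy model of the theory, not the actual RTL code: run_pthread and its wait_for_async flag are hypothetical, and the event thread is replaced by popping the earliest sleeper directly.

```python
import heapq
from collections import deque


def run_pthread(sync_fibres, sleepers, wait_for_async):
    """Model one pthread's scheduler.

    sync_fibres: names of fibres on the synchronous schedule queue.
    sleepers: (wake_time, name) pairs on the asynchronous sleep queue.
    Returns (ran, lost): fibres that ran, and fibres still asleep at exit.
    """
    sync_queue = deque(sync_fibres)
    sleep_queue = list(sleepers)
    heapq.heapify(sleep_queue)
    ran = []
    while True:
        while sync_queue:
            ran.append(sync_queue.popleft())
        # Terminate only when both queues are empty, unless the thread
        # (wrongly, per the theory) ignores its sleepers.
        if not sleep_queue or not wait_for_async:
            break
        # Stand-in for the event thread waking the earliest sleeper.
        _, name = heapq.heappop(sleep_queue)
        sync_queue.append(name)
    lost = sorted(name for _, name in sleep_queue)
    return ran, lost
```

With wait_for_async=True every fibre runs before the thread exits. With wait_for_async=False the thread returns as soon as its synchronous queue is empty and the sleeping fibres are left behind, which is what the program above appears to be doing.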

That is my THEORY: roughly "spawn_pthread" is not calling
the right RTL routine.


--
john skaller
skal...@users.sourceforge.net




