Re: Events and JIT

2004-01-19 Thread Leopold Toetsch
Jeff Clites [EMAIL PROTECTED] wrote:

 But if the event dispatch thread is setting some flag for the target
 thread to detect, it's going to need to lock (or something similar) to
 make sure that the value of this flag is visible to other threads.

Yep, that's true. The event thread puts an event into the interpreter's
task queue. This is done by locking the task queue, inserting the entry,
and unlocking it. This sequence ensures that the queue entry, with the
event hanging off it, will be visible to the target interpreter.
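Roughly, in C (a sketch only -- hypothetical queue types, not Parrot's
actual structures):

  #include <pthread.h>
  #include <stdlib.h>

  typedef struct queue_entry {
      struct queue_entry *next;
      void               *event;
  } queue_entry;

  typedef struct {
      pthread_mutex_t  lock;
      queue_entry     *head, *tail;
  } task_queue;

  void enqueue_event(task_queue *q, void *event)
  {
      queue_entry *e = malloc(sizeof *e);
      e->event = event;
      e->next  = NULL;
      pthread_mutex_lock(&q->lock);     /* lock the task queue */
      if (q->tail)
          q->tail->next = e;
      else
          q->head = e;
      q->tail = e;
      pthread_mutex_unlock(&q->lock);   /* the unlock is the memory
                                           barrier: the entry, and the
                                           event hanging off it, are now
                                           visible to whoever locks the
                                           queue next */
  }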

 ... So that could mean a
 lock inside of every invoke

No, not AFAIK. Inserting the event into the task queue should already be
enough to make it visible to the target interpreter.

  So
 now we are probably quadratic in the size of the segment. (Patching N
 locations leads to N occasions on which we un-patch all N locations.)

I would say O(2*N) max, i.e. still linear - not quadratic.

Yet another idea: the event thread could set some piece of memory (e.g.
the JITed instruction stream) to PROT_NONE and catch the SIGSEGV. Some
read/write barrier schemes for incremental GC use this approach.
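A sketch of that idea (made-up globals; note that calling mprotect()
from a signal handler isn't strictly async-signal-safe, though it works
on common platforms):

  #include <signal.h>
  #include <sys/mman.h>

  static void  *jit_code;    /* page-aligned JITed code block */
  static size_t jit_size;
  static volatile sig_atomic_t event_pending;

  static void segv_handler(int sig, siginfo_t *si, void *ctx)
  {
      /* the target thread tripped over the protected pages:
         restore access and note that an event is pending */
      mprotect(jit_code, jit_size, PROT_READ | PROT_EXEC);
      event_pending = 1;
  }

  void arm_event_barrier(void)    /* called by the event thread */
  {
      struct sigaction sa;
      sa.sa_sigaction = segv_handler;
      sa.sa_flags     = SA_SIGINFO;
      sigemptyset(&sa.sa_mask);
      sigaction(SIGSEGV, &sa, NULL);

      /* revoke all access: the target thread's next instruction
         fetch from the JITed code faults into the handler */
      mprotect(jit_code, jit_size, PROT_NONE);
  }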

 JEff

leo


Re: Events and JIT

2004-01-18 Thread Leopold Toetsch
Gordon Henriksen [EMAIL PROTECTED] wrote:
 On Saturday, January 17, 2004, at 12:47 , Leopold Toetsch wrote:
 Ops that can leave the code segment have to explicitly check for
 events.

 Then no need to patch invoke*.

Yep, that's true. I'm waiting for the ops-file hints; then that can be
cleaned up.

 If the target thread handles the event and unpatches the bytecode
 before the source thread finishes patching?

Then we have kind of a race ;)
But AFAIK it does no harm (except that it slows down the interpreter a
bit). If the target thread hits a patched instruction again, it checks
the task_queue, finds it empty, patches the opcode stream again with the
original code, and continues running until the next real event is
scheduled.

And the target thread could communicate with the event-handler thread
through a volatile patch_opcode_t pointer. This doesn't have to be exact
(so no locking), but it would allow the event thread to stop patching.
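Something like this (all names invented; best effort only, since plain C
volatile guarantees nothing about cross-thread visibility):

  typedef long opcode_t;               /* stand-in type */

  static volatile int stop_patching;   /* set by the target thread */

  void patch_backward_branches(opcode_t *code, size_t n_ops)
  {
      size_t i;
      for (i = 0; i < n_ops; i++) {
          if (stop_patching)                 /* may lag; acceptable */
              return;
          if (is_backward_branch(code[i]))   /* hypothetical helper */
              code[i] = CHECK_EVENT_OP;      /* hypothetical opcode */
      }
  }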

But if events arrive faster than the time needed to patch the
opcode stream twice, then there is a real problem :)

[ BTW, does anyone have a spare dual-Opteron machine? I'd really like to
  test threads on an MP machine ]

 Gordon Henriksen

 --Apple-Mail-7--579835501--

Can you switch that to send ASCII mails?

leo


Re: Events and JIT

2004-01-18 Thread Leopold Toetsch
Jeff Clites [EMAIL PROTECTED] wrote:
 On Jan 17, 2004, at 2:51 AM, Leopold Toetsch wrote:

 What I meant was, _every_ backward branch (etc.) in the bytecode, or
 just one location that we thought would execute soon. (But you
 implicitly answered this, below.)

Every backward branch. Guessing the PC and patching just ahead of it
can't work.

 But we can of course measure how long it takes to patch a certain
 amount of code twice.

 It just seems bad that the efficiency of event delivery would be
 proportional to the size of the bytecode (or segment). So it's slower
 to deliver events in large programs. That seems like a bad omen.

It depends on how bad it really is:
- it should be possible to patch only the current subroutine, which
  might or might not help (a Sub can still be huge). The program flow
  can't have left the Sub in the meantime, because:
- a bunch of opcodes (sleep, invoke*) will check events explicitly
  (this is the original strategy)
- the problem is of course loops that don't have any explicit
  check-event opcode inside. For these there are 2 strategies:
  a) the event handler thread patches these backward branches, or
  b) the backward branch op always checks for events

Both have some costs: a) means slower event delivery, b) means slower
execution in the absence of events. If patching takes too long, we can't
do it and have to use b); that's it.

The negative performance impact of b) can sometimes be reduced by
partial loop unrolling.
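In C terms (standing in for what the JIT would emit; do_work,
handle_events and event_pending are made-up names):

  void run_loop(long n)   /* sketch */
  {
      long i;
      /* b) with 4x partial unrolling: one event check per four
         iterations instead of one per iteration */
      for (i = 0; i + 4 <= n; i += 4) {
          do_work(i); do_work(i + 1); do_work(i + 2); do_work(i + 3);
          if (event_pending)
              handle_events();
      }
      for (; i < n; i++) {              /* leftover iterations */
          do_work(i);
          if (event_pending)
              handle_events();
      }
  }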

As said, we have to benchmark all.

 JEff

leo


Re: Events and JIT

2004-01-18 Thread Jeff Clites
On Jan 17, 2004, at 12:33 PM, Leopold Toetsch wrote:

 Gordon Henriksen [EMAIL PROTECTED] wrote:
  On Saturday, January 17, 2004, at 12:47 , Leopold Toetsch wrote:
   Ops that can leave the code segment have to explicitly check for
   events.
But if the event dispatch thread is setting some flag for the target
thread to detect, it's going to need to lock (or something similar) to
make sure that the value of this flag is visible to other threads. (If
not, then the target thread could branch to a new code segment after
the dispatch thread thinks it has set this flag, but before it's
actually become visible to the target thread.) So that could mean a
lock inside of every invoke

  If the target thread handles the event and unpatches the bytecode
  before the source thread finishes patching?

 Then we have kind of a race ;)
 But AFAIK it does no harm (except that it slows down the interpreter a
 bit). If the target thread hits a patched instruction again, it checks
 the task_queue, finds it empty, patches the opcode stream again with
 the original code, and continues running until the next real event is
 scheduled.
So the really bad case here will be if the patching thread is working
just ahead of the running thread--if it has to patch 1000 locations,
then we could end up hitting each of those (and checking for an event,
unpatching everything...) before the patching thread has finished. So
now we are probably quadratic in the size of the segment. (Patching N
locations leads to N occasions on which we un-patch all N locations.)

 And the target thread could communicate with the event-handler thread
 through a volatile patch_opcode_t pointer. This doesn't have to be
 exact (so no locking), but it would allow the event thread to stop
 patching.
As above, it may be ineffective if we don't lock--the other thread may 
not see the value change. (The volatile keyword in Java is magic in 
terms of inter-thread visibility, but not in C.)
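(To get Java-volatile-like visibility in C you'd wrap the flag in a
lock, so the unlock/lock pair supplies the memory barrier -- a sketch,
with invented names:)

  #include <pthread.h>

  static pthread_mutex_t flag_lock = PTHREAD_MUTEX_INITIALIZER;
  static int stop_patching;               /* shared flag */

  void signal_stop(void)                  /* target thread */
  {
      pthread_mutex_lock(&flag_lock);
      stop_patching = 1;
      pthread_mutex_unlock(&flag_lock);   /* release: publish the write */
  }

  int should_stop(void)                   /* patching thread */
  {
      int v;
      pthread_mutex_lock(&flag_lock);     /* acquire: see a fresh value */
      v = stop_patching;
      pthread_mutex_unlock(&flag_lock);
      return v;
  }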

JEff



Re: Events and JIT

2004-01-17 Thread Leopold Toetsch
Jeff Clites [EMAIL PROTECTED] wrote:

 Where in the stream would we patch? If not in a loop, you may never hit
 the single patched location again, but the program still might not end
 for a very long time

Same as with prederefed run cores: backward branches, invoke*

 If we are patching all locations with branches and such, in large
 bytecode this could take a long time, and the executing thread might
 outrun the patching thread.

We of course have to benchmark this scheme against other approaches.
Though finding/writing such benchmarks and measuring responsiveness will
be some fun ;)
But we can of course measure how long it takes to patch a certain
amount of code twice.

 ... Also, once the handler is entered we'd
 have to fix all of patched locations, which again could be
 time-consuming for large bytecode.

Please remember: only the current code segment has to be patched, which
theoretically could be very large, but that's unlikely.

 JEff

leo


Re: Events and JIT

2004-01-17 Thread Leopold Toetsch
Gordon Henriksen [EMAIL PROTECTED] wrote:

 Other threads than the target could be executing the same chunk
 of JITted code at the same time.

No. JITed (and prederefed) code is thread-specific, because register
addresses are converted to absolute memory locations.
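Roughly (field names and addresses made up, just to show the idea):

  /* interpreted / relocatable: registers reached via the interpreter */
  interp->int_regs[5] += interp->int_regs[6];

  /* JITed for one particular interpreter: that interpreter's register
     addresses are baked into the code as absolute immediates */
  *(long *)0x0812af40 += *(long *)0x0812af48;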

leo


Re: Events and JIT

2004-01-17 Thread Gordon Henriksen
On Saturday, January 17, 2004, at 05:53 , Leopold Toetsch wrote:

 Gordon Henriksen [EMAIL PROTECTED] wrote:

  Other threads than the target could be executing the same chunk of
  JITted code at the same time.

 No. JITed (and prederefed) code is thread-specific, because register
 addresses are converted to absolute memory locations.

Ack! Might want that to be configurable for the JIT, to reduce the
overhead of starting a thread...



Gordon Henriksen
[EMAIL PROTECTED]


Re: Events and JIT

2004-01-17 Thread Gordon Henriksen
On Saturday, January 17, 2004, at 05:51 , Leopold Toetsch wrote:

 Jeff Clites [EMAIL PROTECTED] wrote:

  ... Also, once the handler is entered we'd have to fix all of the
  patched locations, which again could be time-consuming for large
  bytecode.

 Please remember: only the current code segment has to be patched, which
 theoretically could be very large, but that's unlikely.

What if control leaves the current code segment before you finish
patching it? If an event is being delivered from another thread, this is
very likely to occur if the event delivery thread is preempted by the
event receiver.



Gordon Henriksen
[EMAIL PROTECTED]


Re: Events and JIT

2004-01-17 Thread Leopold Toetsch
Gordon Henriksen [EMAIL PROTECTED] wrote:

 What if control leaves the current code segment before you finish
 patching it? If an event is being delivered from another thread, this
 is very likely to occur if the event delivery thread is preempted by
 the event receiver.

Ops that can leave the code segment have to explicitly check for
events.

 Gordon Henriksen

leo


Re: Events and JIT

2004-01-17 Thread Gordon Henriksen
On Saturday, January 17, 2004, at 12:47 , Leopold Toetsch wrote:

 Gordon Henriksen [EMAIL PROTECTED] wrote:

  What if control leaves the current code segment before you finish
  patching it?

 Ops that can leave the code segment have to explicitly check for
 events.

Then no need to patch invoke*.

If the target thread handles the event and unpatches the bytecode before
the source thread finishes patching?



Gordon Henriksen
[EMAIL PROTECTED]


Re: Events and JIT

2004-01-17 Thread Jeff Clites
On Jan 17, 2004, at 2:51 AM, Leopold Toetsch wrote:

 Jeff Clites [EMAIL PROTECTED] wrote:

  Where in the stream would we patch? If not in a loop, you may never
  hit the single patched location again, but the program still might
  not end for a very long time

 Same as with prederefed run cores: backward branches, invoke*

What I meant was, _every_ backward branch (etc.) in the bytecode, or
just one location that we thought would execute soon. (But you
implicitly answered this, below.)

  If we are patching all locations with branches and such, in large
  bytecode this could take a long time, and the executing thread might
  outrun the patching thread.

 We of course have to benchmark this scheme against other approaches.
 Though finding/writing such benchmarks and measuring responsiveness
 will be some fun ;)
 But we can of course measure how long it takes to patch a certain
 amount of code twice.

It just seems bad that the efficiency of event delivery would be
proportional to the size of the bytecode (or segment). So it's slower
to deliver events in large programs. That seems like a bad omen.

  ... Also, once the handler is entered we'd have to fix all of the
  patched locations, which again could be time-consuming for large
  bytecode.

 Please remember: only the current code segment has to be patched, which
 theoretically could be very large, but that's unlikely.

I don't think that's correct--the thread to which we are trying to
deliver the event could have branched to a new segment before we
finished patching.

JEff



Events and JIT

2004-01-16 Thread Leopold Toetsch
Event handling currently works for all run cores[1] except JIT.

The JIT core can't use the schemes described below, but we could:
1) explicitly insert checks for whether events are to be handled,
1a) everywhere, or
1b) in places like those described below under [1] c)
2) patch the native opcodes at these places with e.g. an int3 (SIG_TRAP,
debugger hook) CPU instruction and catch the trap. Running the event
handler (sub) from there should be safe, as we are in a consistent state
in the run loop.
3) more ideas?

1) of course slows down execution of all JIT code; 2) is
platform/architecture dependent, but JIT code is that anyway.
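A sketch of 2) for i386 (helper names invented; a real handler would
also restore the patched byte and adjust the saved instruction pointer,
which is glossed over here):

  #include <signal.h>

  static volatile sig_atomic_t event_pending;

  static void trap_handler(int sig)
  {
      event_pending = 1;   /* the run loop sees this and runs the
                              event handler sub in a consistent state */
  }

  void install_trap_handler(void)
  {
      signal(SIGTRAP, trap_handler);
  }

  /* event thread: arm a trap inside the JITed code */
  void patch_trap(unsigned char *where, unsigned char *saved)
  {
      *saved = *where;   /* remember the original byte */
      *where = 0xCC;     /* int3 -- a single-byte, hence atomic, write */
  }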

Comments welcome,
leo
[1]
a) Run cores with an opcode dispatch table get a new dispatch table,
where all entries point to the event handling code.
b) The switch core checks at the beginning of the switch statement.
c) Prederefed run cores get the opcode stream patched, where backward
branches and invoke or such[2] are replaced with event-check opcodes.

While a) and c) are run async from the event thread, that shouldn't
cause problems, because (assuming atomic word access) either the old
function table/opcode pointer is visible or already the new one; there
is no inconsistent state.
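For a), something like (hypothetical types, assuming the core re-loads
the table pointer on every dispatch):

  typedef void *(*op_func_t)(void *cur_op, void *interp);

  extern op_func_t normal_table[];  /* one slot per opcode            */
  extern op_func_t event_table[];   /* every slot -> event-check code */

  /* the run core dispatches through this pointer on each op */
  op_func_t *volatile dispatch_table = normal_table;

  void schedule_event_check(void)   /* event thread */
  {
      /* a single word-sized store: the core sees either the old
         table or the new one, never an inconsistent state */
      dispatch_table = event_table;
  }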

Events using a) or b) are handled instantly, while c) events get handled
some (probably very short) time after they were scheduled.

[2] Explicit hints from the ops-files about where to check events would
simplify that. E.g.

op event invoke()
op event_after sleep(..)


Re: Events and JIT

2004-01-16 Thread Michal Wallace
On Fri, 16 Jan 2004, Dan Sugalski wrote:

   2) Those that explicitly check for events
...
 Ops like spin_in_event_loop (or whatever we call it) or checkevent is
 in category two. They check events because, well, that's what they're
 supposed to do. Compilers should emit these with some frequency,
 though it's arguable how frequent they ought to be.

I don't understand that part. Why the compiler?
If the high-level code doesn't do anything with
the events, then there's no point in checking.
If it does use the events, then shouldn't
developers call the event loop explicitly?

Sincerely,

Michal J Wallace
Sabren Enterprises, Inc.
-
contact: [EMAIL PROTECTED]
hosting: http://www.cornerhost.com/
my site: http://www.withoutane.com/
--



RE: Events and JIT

2004-01-16 Thread Gordon Henriksen
Leopold Toetsch wrote:

 Event handling currently works for all run cores[1] except JIT.
 
 The JIT core can't use the schemes described below, but we could:
 2) patch the native opcodes at these places with e.g. an int3 (SIG_TRAP,
 debugger hook) CPU instruction and catch the trap. Running the event
 handler (sub) from there should be safe, as we are in a consistent
 state in the run loop.

I don't think that bytecode-modifying versions should fly; they're not
threadsafe, and it would be nice to write-protect the instruction stream
to avert that attack vector.


 1) explicitly insert checks for whether events are to be handled,
 1a) everywhere, or
 1b) in places like those described below under [1] c)

I like this (1b). With the JIT, an event check could be inlined to 1
load and 1 conditional branch to the event dispatcher, yes? (So long as
interp is already in a register.) If that's done before blocking and at
upward branches, the hit probably won't be a killer for most code. For
REALLY tight loops (i.e., w/o branches or jumps, and w/ op count less
than a particular threshold), maybe unroll the loop a few times and then
still check on the upward branch.
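In C terms, what the JIT would emit amounts to (field and function
names made up):

  static inline void event_check(Interp *interp)
  {
      if (interp->event_pending)    /* 1 load; the flag stays in cache */
          dispatch_events(interp);  /* 1 conditional branch, rarely taken */
  }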

Those branches will almost always fall straight through, so while there
will be load on the platform's branch prediction cache and a bit of
bloat, there shouldn't be much overhead in terms of pipeline bubbles.
The event-ready word (in the interpreter, presumably) will stay in the
L1 or L2 cache, avoiding stalls.

No, it's not zero-overhead, but it's simple and easy enough to do
portably. Crazy platform-specific zero-overhead schemes can come later
as optimizations.

-- 

Gordon Henriksen
IT Manager
ICLUBcentral Inc.
[EMAIL PROTECTED]



Re: Events and JIT

2004-01-16 Thread Michal Wallace
On Fri, 16 Jan 2004, Dan Sugalski wrote:

 I don't understand that part. Why the compiler?

 Because we don't have the sort of control of the async environment
 that hardware does to deal with interrupts.

 And, realistically, all code has to deal with the possibility of
 interrupts. Even if they aren't doing any IO at all they're still
 potentially watching for keyboard (or other OS-initiated)
 breaks/interrupts.

I see your point. In Python if you press ^C, it should
raise a KeyboardInterrupt exception.

But rather than always monitoring for that, I'd want
to register a listener and then have Parrot handle the
polling for me. Maybe by default, Parrot does the check
every N instructions... And then you could turn that off
if you wanted more speed.
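Something like this run loop sketch, I mean (all names hypothetical):

  #define EVENT_CHECK_INTERVAL 1000   /* the "N" */

  void run(Interp *interp, opcode_t *pc)
  {
      int countdown = EVENT_CHECK_INTERVAL;
      while (pc) {
          pc = (op_table[*pc])(pc, interp);    /* execute one op */
          if (--countdown == 0) {
              countdown = EVENT_CHECK_INTERVAL;
              if (!task_queue_empty(interp))   /* poll for events */
                  dispatch_events(interp);
          }
      }
  }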

 Well... There's the issue of signals, which is the big one. If we
 could skip signals, that'd be great, but we can't, even on systems
 that don't do them in the true Unix-y sense. Windows programs should
 respond to breaks from the keyboard (or close-window requests in a
 terminal-esque environment if we build one) and have a chance to shut
 down cleanly, so... that's an event.

This is probably a dumb question but: what if signals
threw exceptions instead?

I mean, they're pretty rare, aren't they? They seem like
a completely different kind of thing than watching a
mouse or socket... Different because signals have nothing
to do with the program itself but come entirely from the
outside. (Whereas with the regular events, the data
comes from outside but the choice to listen for the
data was made inside the program.)

Sincerely,

Michal J Wallace
Sabren Enterprises, Inc.
-
contact: [EMAIL PROTECTED]
hosting: http://www.cornerhost.com/
my site: http://www.withoutane.com/
--



RE: Events and JIT

2004-01-16 Thread Gordon Henriksen
Michal,

 But rather than always monitoring for that, I'd want
 to register a listener and then have parrot handle the
 polling for me.

This is precisely what's being discussed.


 This is probably a dumb question but: what if signals
 threw exceptions instead?

I'd hope that the event handler for a signal event could
elect to throw an exception; it could even be the default.
But the exception has to get into the thread somehow--
exceptions don't happen autonomously, and they require
considerable cooperation from the thread on which the
exception occurs. High-priority events are the mechanism
through which the code that will throw the exception can
interrupt normal program execution.

-- 

Gordon Henriksen
IT Manager
ICLUBcentral Inc.
[EMAIL PROTECTED]



Re: Events and JIT

2004-01-16 Thread Leopold Toetsch
Dan Sugalski [EMAIL PROTECTED] wrote:
 At 11:38 AM +0100 1/16/04, Leopold Toetsch wrote:
  Event handling currently works for all run cores[1] except JIT.

 What I'd planned for with events is a bit less responsive than the
 system you've put together for the non-JIT case, and I think it'll be
 OK generally speaking.

 Ops fall into three categories:

   1) Those that don't check for events
   2) Those that explicitly check for events
   3) Those that implicitly check for events

Yep, those are the cases. I think I have boiled down that scheme to no
cost for non-JIT run cores[1]; that is, in the absence of events there
is no overhead for event checking. Event delivery (which I consider rare
in terms of CPU cycles) takes a bit more instead - but not much.

But the JIT core has to deal with event delivery too. So we have to
decide which JITted ops are 3). Case 2) is no problem - the explicit
check op is already available - but we need hints for 3).

 Ops in the third category are a bit trickier. Anything that sleeps or
 waits should spin on the event queue

Ok, the latter is the simple part - all IO or event related ops. But the
problem remains:

What about the loop[2] of mops.pasm? Only integers in registers, running
at one Parrot op per CPU cycle.

 The big thing to ponder is which ops ought go in category three. I
 can see the various invoke ops doing it, but beyond that I'm up in
 the air.

Yes. First: do we guarantee timely event handling in highly optimized
loops like in mops.pasm? Can we use schemes like my proposal of using
the int3 x86 instruction...

leo

[1] the switched core currently checks after the switch statement, but
it's not simple to optimize that

[2]
  jit_func+116:   sub    %edi,%ebx
  jit_func+118:   jne    0x81c73a4 <jit_func+116>


Re: Events and JIT

2004-01-16 Thread Leopold Toetsch
Michal Wallace [EMAIL PROTECTED] wrote:
 On Fri, 16 Jan 2004, Dan Sugalski wrote:
 interrupts. Even if they aren't doing any IO at all they're still
 potentially watching for keyboard (or other OS-initiated)
 breaks/interrupts.

 I see your point. In Python if you press ^C, it should
 raise a KeyboardInterrupt exception.

 But rather than always monitoring for that, I'd want
 to register a listener

Ehem: that's what we are talking about. There is a listener already
running: that's the event thread. It currently doesn't do much, but it
listens e.g. for a timer event, that is, it waits on a condition. This
doesn't take any CPU time while waiting, except for a bit of overhead
when the kernel wakes that thread and it executes (for a very short
time).

So this event thread sees: Oh, the kernel just notified me that the
cookie for interpreter #5 is ready, so I'll tell that interpreter.
That's what the event thread is doing.

Now interpreter #5 - currently busy running in a tight loop - doesn't
see that its cookie is ready. *If* this interpreter checked at every
step whether there is a cookie, it would be a big slowdown.

To minimize the overhead, the event thread throws a big piece of
wood in front of the fast-running interpreter #5, which finally, after a
bit of stumbling, realizes: Oh, my cookie has arrived.

 This is probably a dumb question but: what if signals
 threw exceptions instead?

We will (AFAIK) convert signals to events, which are then dispatched
further.

 I mean, they're pretty rare aren't they?

Async is the problem.

 Michal J Wallace

leo


Re: Events and JIT

2004-01-16 Thread Leopold Toetsch
Gordon Henriksen [EMAIL PROTECTED] wrote:
 Leopold Toetsch wrote:

 2) Patch the native opcodes at these places with e.g. int3

 I don't think that bytecode-modifying versions should fly; they're not
 threadsafe,

Why? The bytecode is patched by a different thread *if* an event is due
(which in CPU cycles is rare). And I don't see a thread-safety problem.
The (possibly different) CPU reads an opcode and runs it. Somewhere in
the meantime, the opcode at that memory position changes to the byte
sequence 0xCC (on Intel: int3). One byte changes, and the CPU executes
the trap or not (of course, changing that memory position is assumed to
be atomic, which AFAIK works on i386) - but next time through the loop
the trap is honored.

 ... and it would be nice to write-protect the instruction stream
 to avert that attack vector.

We did protect it, so we can un- and re-protect it; that's not the
problem.

  1b) in places like those described below under [1] c)

 I like this (1b). With the JIT, an event check could be inlined to 1
 load and 1 conditional branch to the event dispatcher, yes?

Yep. That's the plain, average, slower case :) It's a fallback, if there
are no better and faster solutions.

 (So long as
 interp is already in a register.)

Arghh, damned i386 with *zero* registers, where zero is around 4
(usable, general...) ;)
So no interpreter in a register here - no.

It's at least 3(?) cycles + branch prediction overhead, so a lot
compared to zero overhead...

 ... If that's done before blocking and at
 upward branches, the hit probably won't be a killer for most code. For
 REALLY tight loops (i.e., w/o branches or jumps, and w/ op count less
 than a particular threshold), maybe unroll the loop a few times and then
 still check on the upward branch.

Yep, loop unrolling would definitely help; that's currently the most
likely solution in my head.

 Those branches will almost always fall straight through, so while there
 will be load on the platform's branch prediction cache and a bit of
 bloat, there shouldn't be much overhead in terms of pipeline bubbles.
 The event ready word (in the interpreter, presumably) will stay in the
 L1 or L2 cache, avoiding stalls.

Yep. Still, I like these numbers:
$ parrot -j examples/assembly/mops.pasm
M op/s: 790.105001   # on an AMD 800

 No, it's not zero-overhead, but it's simple and easy enough to do
 portably. Crazy platform-specific zero-overhead schemes can come later
 as optimizations.

s/Crazy/Reasonable/, but later is ok :)

leo


RE: Events and JIT

2004-01-16 Thread Gordon Henriksen
Leopold Toetsch wrote:

 Why? The bytecode is patched by a different thread *if* an
 event is due (which in CPU cycles is rare). And I don't see a
 thread-safety problem. The (possibly different) CPU reads an
 opcode and runs it. Somewhere in the meantime, the opcode at
 that memory position changes to the byte sequence 0xCC (on
 Intel: int3). One byte changes, and the CPU executes the trap or
 not (of course, changing that memory position is assumed to be
 atomic, which AFAIK works on i386) - but next time through the
 loop the trap is honored.

Other threads than the target could be executing the same chunk
of JITted code at the same time.

-- 

Gordon Henriksen
IT Manager
ICLUBcentral Inc.
[EMAIL PROTECTED]



Re: Events and JIT

2004-01-16 Thread Jeff Clites
On Jan 16, 2004, at 1:20 PM, Leopold Toetsch wrote:

 Gordon Henriksen [EMAIL PROTECTED] wrote:
  Leopold Toetsch wrote:

   2) Patch the native opcodes at these places with e.g. int3

  I don't think that bytecode-modifying versions should fly; they're not
  threadsafe,

 Why? The bytecode is patched by a different thread *if* an event is due
 (which in CPU cycles is rare). And I don't see a thread-safety problem.
 The (possibly different) CPU reads an opcode and runs it. Somewhere in
 the meantime, the opcode at that memory position changes to the byte
 sequence 0xCC (on Intel: int3). One byte changes, and the CPU executes
 the trap or not (of course, changing that memory position is assumed to
 be atomic, which AFAIK works on i386) - but next time through the loop
 the trap is honored.
Where in the stream would we patch? If not in a loop, you may never hit
the single patched location again, but the program still might not end
for a very long time

If we are patching all locations with branches and such, in large
bytecode this could take a long time, and the executing thread might
outrun the patching thread. Also, once the handler is entered we'd
have to fix all of the patched locations, which again could be
time-consuming for large bytecode.

It could work, but seems problematic.

JEff