Re: Events and JIT
Jeff Clites [EMAIL PROTECTED] wrote:

> But if the event dispatch thread is setting some flag for the target thread to detect, it's going to need to lock (or something similar) to make sure that the value of this flag is visible to other threads.

Yep, that's true. The event thread puts an event into the interpreter's task queue. This is done by locking the task queue, inserting the entry, and unlocking it. This sequence ensures that the queue entry, with the event hanging off it, will be visible to the target interpreter.

> ... So that could mean a lock inside of every invoke

No, not AFAIK. Inserting the event into the task queue should already be enough to make it visible for the target interpreter.

> So now we are probably quadratic with the size of the segment. (Patching N locations leads to N times where we un-patch all N locations.)

I would say O(2*N) max.

Yet another idea: the event thread could set some piece of memory (e.g. the JITed instruction stream) to PROT_NONE and catch the SIGSEGV. Some read/write barrier schemes for incremental GC use this approach.

leo
Re: Events and JIT
Gordon Henriksen [EMAIL PROTECTED] wrote:

> On Saturday, January 17, 2004, at 12:47 , Leopold Toetsch wrote:
>> Ops that can leave the code segment have to explicitly check for events.
>
> Then no need to patch invoke*.

Yep, that's true. I'm waiting for the ops-file hints, then that can be cleaned up.

> If the target thread handles the event and unpatches the bytecode before the source thread finishes patching?

Then we have a kind of race ;) But AFAIK it doesn't harm (except it slows down the interpreter a bit): if the target thread hits a patched instruction again, it checks the task_queue, finds it empty, patches the opcode stream again with the original code, and continues running until the next real event is scheduled.

And the target thread could communicate with the event-handler thread via a volatile *patch_opcode_t. This doesn't have to be exact (so no locking), but would allow the event thread to stop patching.

But if events arrive faster than the time needed to patch the opcode stream twice, then there is a real problem :)

[ BTW does anyone have a spare dual-Opteron machine, I'd really like to test threads on an MP machine ]

Can you switch that to send ASCII mails?

leo
Re: Events and JIT
Jeff Clites [EMAIL PROTECTED] wrote:

> What I meant was, _every_ backward branch (etc.) in the bytecode, or just one location that we thought would execute soon. (But you implicitly answered this, below.)

Every backward branch. Guessing the PC and patching in front of that can't work.

> It just seems bad that the efficiency of event delivery would be proportional to the size of the bytecode (or segment). So it's slower to deliver events in large programs. That seems like a bad omen.

But we can of course measure how long it takes to patch a certain amount of code twice. It depends how bad it really is:

- it should be possible to patch only the current subroutine, which might or might not help (a Sub can still be huge). The program flow can't have left the Sub in the meantime, because:
- a bunch of opcodes (sleep, invoke*) will check events explicitly (this is the original strategy)
- the problem is of course loops that don't have any explicit check-event opcode inside. For these there are two strategies: a) the event-handler thread patches these backward branches, or b) the backward-branch op always checks for events.

Both have some costs: a) means slower event delivery, b) means slower execution in the absence of events. If patching takes too long, we can't do it and have to use b), that's it. The negative performance impact of b) can sometimes be reduced by partial loop unrolling.

As said, we have to benchmark all.

leo
Re: Events and JIT
On Jan 17, 2004, at 12:33 PM, Leopold Toetsch wrote:

> Gordon Henriksen [EMAIL PROTECTED] wrote:
>> On Saturday, January 17, 2004, at 12:47 , Leopold Toetsch wrote:
>>> Ops that can leave the code segment have to explicitly check for events.

But if the event dispatch thread is setting some flag for the target thread to detect, it's going to need to lock (or something similar) to make sure that the value of this flag is visible to other threads. (If not, then the target thread could branch to a new code segment after the dispatch thread thinks it has set this flag, but before it's actually become visible to the target thread.) So that could mean a lock inside of every invoke.

> If the target thread handles the event and unpatches the bytecode before the source thread finishes patching? Then we have a kind of race ;) But AFAIK it doesn't harm (except it slows down the interpreter a bit): if the target thread hits a patched instruction again, it checks the task_queue, finds it empty, patches the opcode stream again with the original code, and continues running until the next real event is scheduled.

So the really bad case here will be if the patching thread is working just ahead of the running thread--if it has to patch 1000 locations, then we could end up hitting each of those (and checking for an event, unpatching everything...) before the patching thread has finished. So now we are probably quadratic with the size of the segment. (Patching N locations leads to N times where we un-patch all N locations.)

> And the target thread could communicate with the event-handler thread via a volatile *patch_opcode_t. This doesn't have to be exact (so no locking), but would allow the event thread to stop patching.

As above, it may be ineffective if we don't lock--the other thread may not see the value change. (The volatile keyword in Java is magic in terms of inter-thread visibility, but not in C.)

JEff
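Jeff's point in code: a plain (even volatile) flag carries no cross-thread visibility guarantee in C, but guarding it with a mutex does, because the unlock in one thread and the lock in the other pair up as a memory barrier. This is a minimal sketch with invented names, not Parrot's actual API.

```c
#include <pthread.h>

/* A "stop patching" flag shared between the target thread and the
 * event-handler thread, made reliably visible via a mutex. */
typedef struct {
    pthread_mutex_t lock;
    int             stop_patching;  /* set by the target thread */
} patch_flag;

/* Target thread: announce that it is un-patching the stream. */
void set_stop(patch_flag *f)
{
    pthread_mutex_lock(&f->lock);
    f->stop_patching = 1;
    pthread_mutex_unlock(&f->lock);
}

/* Event thread: check before each patch whether to stop. */
int should_stop(patch_flag *f)
{
    int v;
    pthread_mutex_lock(&f->lock);
    v = f->stop_patching;
    pthread_mutex_unlock(&f->lock);
    return v;
}
```

The same visibility argument is why inserting the event into a locked task queue (as Leo describes) is already enough: the queue's lock/unlock pair publishes the entry.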
Re: Events and JIT
Jeff Clites [EMAIL PROTECTED] wrote:

> Where in the stream would we patch? If not in a loop, you may never hit the single patched location again, but still might not end for a very long time

Same as with prederefed run cores: backward branches, invoke*.

> If we are patching all locations with branches and such, in large bytecode this could take a long time, and the executing thread might outrun the patching thread.

We have of course to benchmark this scheme against other approaches. Albeit finding/writing such benchmarks and measuring responsiveness will be some fun ;) But we can of course measure how long it takes to patch a certain amount of code twice.

> ... Also, once the handler is entered we'd have to fix all of the patched locations, which again could be time-consuming for large bytecode.

Please remember: only the current code segment has to be patched, which theoretically could be very large, but that's unlikely.

leo
Re: Events and JIT
Gordon Henriksen [EMAIL PROTECTED] wrote:

> Other threads than the target could be executing the same chunk of JITted code at the same time.

No. JITed (and prederefed) code is thread-specific, because register addresses are converted to absolute memory locations.

leo
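The reason prederefed/JITed code is thread-specific might be sketched like this: register numbers get resolved to absolute addresses inside one particular interpreter's register file, so another interpreter (thread) can't share the resulting stream. Types and names here are illustrative stand-ins, not Parrot's real structures.

```c
/* A toy interpreter with an integer register file. */
typedef struct {
    long int_reg[32];
} Interp;

/* "Predereferencing" a register operand: the register index is turned
 * into an absolute pointer into *this* interpreter's register file.
 * Code built from such pointers is only valid for that interpreter. */
long *prederef_int(Interp *interp, int regno)
{
    return &interp->int_reg[regno];
}
```

Two interpreters prederef the same register number to two different addresses, which is why the patched stream can safely be modified for one target thread without other threads ever executing it.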
Re: Events and JIT
On Saturday, January 17, 2004, at 05:53 , Leopold Toetsch wrote:

> Gordon Henriksen [EMAIL PROTECTED] wrote:
>> Other threads than the target could be executing the same chunk of JITted code at the same time.
>
> No. JITed (and prederefed) code is thread-specific, because register addresses are converted to absolute memory locations.

Ack! Might want that to be configurable for the JIT, to reduce the overhead of starting a thread...

Gordon Henriksen [EMAIL PROTECTED]
Re: Events and JIT
On Saturday, January 17, 2004, at 05:51 , Leopold Toetsch wrote:

> Jeff Clites [EMAIL PROTECTED] wrote:
>> ... Also, once the handler is entered we'd have to fix all of the patched locations, which again could be time-consuming for large bytecode.
>
> Please remember: only the current code segment has to be patched, which theoretically could be very large, but that's unlikely.

What if control leaves the current code segment before you finish patching it? If an event is being delivered from another thread, this is very likely to occur if the event delivery thread is preempted by the event receiver.

Gordon Henriksen [EMAIL PROTECTED]
Re: Events and JIT
Gordon Henriksen [EMAIL PROTECTED] wrote:

> What if control leaves the current code segment before you finish patching it? If an event is being delivered from another thread, this is ...

Ops that can leave the code segment have to explicitly check for events.

leo
Re: Events and JIT
On Saturday, January 17, 2004, at 12:47 , Leopold Toetsch wrote:

> Gordon Henriksen [EMAIL PROTECTED] wrote:
>> What if control leaves the current code segment before you finish patching it?
>
> Ops that can leave the code segment have to explicitly check for events.

Then no need to patch invoke*.

If the target thread handles the event and unpatches the bytecode before the source thread finishes patching?

Gordon Henriksen [EMAIL PROTECTED]
Re: Events and JIT
On Jan 17, 2004, at 2:51 AM, Leopold Toetsch wrote:

> Jeff Clites [EMAIL PROTECTED] wrote:
>> Where in the stream would we patch? If not in a loop, you may never hit the single patched location again, but still might not end for a very long time
>
> Same as with prederefed run cores: backward branches, invoke*

What I meant was, _every_ backward branch (etc.) in the bytecode, or just one location that we thought would execute soon. (But you implicitly answered this, below.)

>> If we are patching all locations with branches and such, in large bytecode this could take a long time, and the executing thread might outrun the patching thread.
>
> We have of course to benchmark this scheme against other approaches. Albeit finding/writing such benchmarks and measuring responsiveness will be some fun ;) But we can of course measure how long it takes to patch a certain amount of code twice.

It just seems bad that the efficiency of event delivery would be proportional to the size of the bytecode (or segment). So it's slower to deliver events in large programs. That seems like a bad omen.

>> ... Also, once the handler is entered we'd have to fix all of the patched locations, which again could be time-consuming for large bytecode.
>
> Please remember: only the current code segment has to be patched, which theoretically could be very large, but that's unlikely.

I don't think that's correct--the thread to which we are trying to deliver the event could have branched to a new segment before we finished patching.

JEff
Events and JIT
Event handling currently works for all run cores[1] except JIT. The JIT core can't use the schemes described below, but we could:

1) explicitly insert checks if events are to be handled,
   1a) everywhere, or
   1b) in places like described below under [1] c)

2) patch the native opcodes at these places with e.g. an int3 (SIG_TRAP, debugger hook) CPU instruction and catch the trap. Running the event handler (sub) from there should be safe, as we are in a consistent state in the run loop.

3) more ideas?

1) of course slows down execution of all JIT code; 2) is platform/architecture dependent, but JIT code is that anyway.

Comments welcome,
leo

[1]
a) Run cores with an opcode dispatch table get a new dispatch table, where all entries point to the event-handling code.
b) The switch core checks at the beginning of the switch statement.
c) Prederefed run cores get the opcode stream patched: backward branches and invoke or such[2] are replaced with event-check opcodes.

While a) and c) are run async from the event thread, it shouldn't cause problems, because (assuming atomic word access) either the old function table/opcode pointer is visible or already the new one; there is no inconsistent state. Events using a) or b) are handled instantly, while c) events get handled some (probably very short) time after they were scheduled.

[2] Explicit hints from the ops-files, where to check events, would simplify that. E.g.

  op event invoke()
  op event_after sleep(..)
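The atomic-word-access argument in [1] a)/c) can be sketched as follows: the event thread swaps a function pointer with a single aligned store (assumed atomic on the target CPU), so the run loop sees either the old op or the event-check op, never a torn value. All names here are illustrative, not Parrot's.

```c
/* A prederefed "opcode stream": an array of op function pointers. */
typedef int (*op_func)(void);

static int branch_op(void)      { return 0; }  /* a normal backward branch */
static int event_check_op(void) { return 1; }  /* would run handlers, then restore */

static op_func stream[] = { branch_op, branch_op };

/* Event thread: patch the backward branch (slot 1) in one aligned
 * pointer store, assumed atomic. */
void schedule_event(void) { stream[1] = event_check_op; }

/* Target thread: restore the original op after handling the event. */
void unpatch(void)        { stream[1] = branch_op; }
```

Either value of the pointer yields a consistent state, which is why no lock is needed around the store itself; only the visibility of the event data (via the locked task queue) matters.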
Re: Events and JIT
On Fri, 16 Jan 2004, Dan Sugalski wrote:

> 2) Those that explicitly check for events
> ...
> Ops like spin_in_event_loop (or whatever we call it) or checkevent is in category two. They check events because, well, that's what they're supposed to do. Compilers should emit these with some frequency, though it's arguable how frequent they ought to be.

I don't understand that part. Why the compiler? If the high-level code doesn't do anything with the events, then there's no point in checking. If it does use the events, then shouldn't developers call the event loop explicitly?

Sincerely,

Michal J Wallace
Sabren Enterprises, Inc.
contact: [EMAIL PROTECTED]
hosting: http://www.cornerhost.com/
my site: http://www.withoutane.com/
RE: Events and JIT
Leopold Toetsch wrote:

> Event handling currently works for all run cores[1] except JIT. The JIT core can't use the schemes described below, but we could:
>
> 2) Patch the native opcodes at these places with e.g. an int3 (SIG_TRAP, debugger hook) CPU instruction and catch the trap. Running the event handler (sub) from there should be safe, as we are in a consistent state in the run loop.

I don't think that bytecode-modifying versions should fly; they're not threadsafe, and it would be nice to write-protect the instruction stream to avert that attack vector.

> 1) explicitly insert checks if events are to be handled,
>    1a) everywhere, or
>    1b) in places like described below under [1] c)

I like this (1b). With the JIT, an event check could be inlined to 1 load and 1 conditional branch to the event dispatcher, yes? (So long as interp is already in a register.) If that's done before blocking and at upward branches, the hit probably won't be a killer for most code.

For REALLY tight loops (i.e., w/o branches or jumps, and w/ op count less than a particular threshold), maybe unroll the loop a few times and then still check on the upward branch. Those branches will almost always fall straight through, so while there will be load in the platform's branch prediction cache and a bit of bloat, there shouldn't be much overhead in terms of pipeline bubbles. The event-ready word (in the interpreter, presumably) will stay in the L1 or L2 cache, avoiding stalls.

No, it's not zero-overhead, but it's simple and easy enough to do portably. Crazy platform-specific zero-overhead schemes can come later as optimizations.

--
Gordon Henriksen
IT Manager
ICLUBcentral Inc.
[EMAIL PROTECTED]
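Expressed in C, the inlined check Gordon describes is one load of an event-ready word plus one almost-always-untaken conditional branch at each upward branch. The names (event_ready, dispatch_events) are hypothetical stand-ins for whatever the JIT would actually emit.

```c
/* Event-ready word, set by the event thread; the dispatcher is the
 * out-of-line slow path the conditional branch targets. */
static volatile int event_ready = 0;
static int events_seen = 0;

static void dispatch_events(void)
{
    events_seen++;      /* stand-in for running the event handlers */
    event_ready = 0;
}

/* A tight loop with the check at the upward branch: 1 load + 1
 * (usually untaken) branch per iteration. */
long sum_with_checks(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {
        sum += i;           /* the loop's real work */
        if (event_ready)    /* the inlined event check */
            dispatch_events();
    }
    return sum;
}
```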
Re: Events and JIT
On Fri, 16 Jan 2004, Dan Sugalski wrote:

>> I don't understand that part. Why the compiler?
>
> Because we don't have the sort of control of the async environment that hardware does to deal with interrupts. And, realistically, all code has to deal with the possibility of interrupts. Even if they aren't doing any IO at all, they're still potentially watching for keyboard (or other OS-initiated) breaks/interrupts.

I see your point. In python if you press ^C, it should raise a KeyboardInterrupt exception. But rather than always monitoring for that, I'd want to register a listener and then have parrot handle the polling for me. Maybe by default, parrot does the check every N instructions... And then you could turn that off if you wanted more speed.

> Well... There's the issue of signals, which is the big one. If we could skip signals, that'd be great, but we can't, even on systems that don't do them in the true Unix-y sense. Windows programs should respond to breaks from the keyboard (or close-window requests in a terminal-esque environment if we build one) and have a chance to shut down cleanly, so... that's an event.

This is probably a dumb question but: what if signals threw exceptions instead? I mean, they're pretty rare, aren't they? They seem like a completely different kind of thing than watching a mouse or socket... Different because signals have nothing to do with the program itself but come entirely from the outside. (Whereas with the regular events, the data comes from outside but the choice to listen for the data was made inside the program.)

Sincerely,

Michal J Wallace
Sabren Enterprises, Inc.
contact: [EMAIL PROTECTED]
hosting: http://www.cornerhost.com/
my site: http://www.withoutane.com/
RE: Events and JIT
Michal,

> But rather than always monitoring for that, I'd want to register a listener and then have parrot handle the polling for me.

This is precisely what's being discussed.

> This is probably a dumb question but: what if signals threw exceptions instead?

I'd hope that the event handler for a signal event could elect to throw an exception; it could even be the default. But the exception has to get into the thread somehow--exceptions don't autonomously happen, and they require considerable cooperation from the thread on which the exception occurs. High-priority events are the mechanism through which the code that will throw the exception can interrupt normal program execution.

--
Gordon Henriksen
IT Manager
ICLUBcentral Inc.
[EMAIL PROTECTED]
Re: Events and JIT
Dan Sugalski [EMAIL PROTECTED] wrote:

> At 11:38 AM +0100 1/16/04, Leopold Toetsch wrote:
>> Event handling currently works for all run cores[1] except JIT.
>
> What I'd planned for with events is a bit less responsive than the system you've put together for the non-JIT case, and I think it'll be OK generally speaking. Ops fall into three categories:
> 1) Those that don't check for events
> 2) Those that explicitly check for events
> 3) Those that implicitly check for events

Yep, those are the cases. I think I have boiled down that scheme to no cost for non-JIT run cores[1], that is, in the absence of events there is no overhead for event checking. Event delivery (which I consider rare in terms of CPU cycles) takes a bit more instead - but not much.

But the JIT core has to deal with event delivery too. So we have to decide which JITted ops are 3) - the explicit check op (case 2) is already available, that's no problem - we need hints for 3).

> Ops in the third category are a bit trickier. Anything that sleeps or waits should spin on the event queue

Ok, the latter is the simple part - all IO or event related ops. But the problem remains: what about the loop[2] of mops.pasm? Only integers in registers, running at one Parrot op per CPU cycle.

> The big thing to ponder is which ops ought go in category three. I can see the various invoke ops doing it, but beyond that I'm up in the air.

Yes. First: do we guarantee timely event handling in highly optimized loops like in mops.pasm? Can we use schemes like my proposal of using the int3 x86 instruction...

leo

[1] the switched core currently checks after the switch statement, but it's not simple to optimize that
[2]
  jit_func+116: sub %edi,%ebx
  jit_func+118: jne 0x81c73a4 <jit_func+116>
Re: Events and JIT
Michal Wallace [EMAIL PROTECTED] wrote:

> On Fri, 16 Jan 2004, Dan Sugalski wrote:
>> interrupts. Even if they aren't doing any IO at all they're still potentially watching for keyboard (or other OS-initiated) breaks/interrupts.
>
> I see your point. In python if you press ^C, it should raise a KeyboardInterrupt exception. But rather than always monitoring for that, I'd want to register a listener

Ehem: that's what we are talking about. There is a listener already running: that's the event thread. It currently doesn't do much, but it listens e.g. for a timer event, that is, it waits on a condition. This doesn't take any CPU time during waiting, except for a bit of overhead when the kernel awakens that thread and it executes (for a very short time). So this event thread sees: "Oh, the kernel just notified me that the cookie for interpreter #5 is ready, so I'll tell that interpreter." That's what the event thread is doing.

Now interpreter #5 - currently busy running in a tight loop - doesn't see that his cookie is ready. *If* this interpreter would check at every pace whether there is a cookie, it would be a big slowdown. To minimize the overhead, the event thread throws a big piece of wood in front of the fast-running interpreter #5, which finally, after a bit of stumbling, realizes: "Oh, my cookie has arrived."

> This is probably a dumb question but: what if signals threw exceptions instead?

We will (AFAIK) convert signals to events, which dispatch further.

> I mean, they're pretty rare aren't they?

Async is the problem.

leo
Re: Events and JIT
Gordon Henriksen [EMAIL PROTECTED] wrote:

> Leopold Toetsch wrote:
>> 2) Patch the native opcodes at these places with e.g. int3
>
> I don't think that bytecode-modifying versions should fly; they're not threadsafe,

Why? The bytecode is patched by a different thread *if* an event is due (which in CPU cycles is rare). And I don't see a thread safety problem. The (possibly different) CPU reads an opcode and runs it. Somewhere in the meantime, the opcode at that memory position changes to the byte sequence 0xCC (on intel: int3). One byte changes; the CPU executes the trap or not (of course, changing that memory position is assumed to be atomic, which AFAIK works on i386) - but next time in the loop the trap is honored.

> ... and it would be nice to write-protect the instruction stream to avert that attack vector.

We did protect it, so we can un- and reprotect it, that's not the problem.

>> 1b) in places like described below under [1] c)
>
> I like this (1b). With the JIT, an event check could be inlined to 1 load and 1 conditional branch to the event dispatcher, yes?

Yep. That's the plain average slower case :) It's a fallback, if there are no better and faster solutions.

> (So long as interp is already in a register.)

Arghh, damned i386 with *zero* registers, where zero is around 4 (usable, general...) ;) So no interpreter in registers here - no. It's at least 3? cycles + branch prediction overhead, so a lot compared to zero overhead...

> ... If that's done before blocking and at upward branches, the hit probably won't be a killer for most code. For REALLY tight loops (i.e., w/o branches or jumps, and w/ op count less than a particular threshold), maybe unroll the loop a few times and then still check on the upward branch.

Yep, loop unrolling would definitely help; that was the currently very likely working solution in my head.

> Those branches will almost always fall straight through, so while there will be load in the platform's branch prediction cache and a bit of bloat, there shouldn't be much overhead in terms of pipeline bubbles. The event-ready word (in the interpreter, presumably) will stay in the L1 or L2 cache, avoiding stalls.

Yep. Still I like these numbers:

  $ parrot -j examples/assembly/mops.pasm
  M op/s: 790.105001    # on AMD 800

> No, it's not zero-overhead, but it's simple and easy enough to do portably. Crazy platform-specific zero-overhead schemes can come later as optimizations.

s(Crazy)(Reasonable) but later is ok :)

leo
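The partial unrolling both sides agree on could look like the following sketch: do the event check once per group of four unrolled iterations instead of at every backward branch, quartering the per-op cost of the check. event_pending and check_events are hypothetical names, not Parrot's.

```c
/* Stand-in for the interpreter's event-ready word and dispatcher. */
static volatile int event_pending = 0;
static void check_events(void) { event_pending = 0; /* dispatch here */ }

/* Four-way unrolled loop body; for brevity this assumes n is a
 * multiple of 4 (a real JIT would emit a remainder loop). */
long sum_unrolled(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i += 4) {
        sum += i;
        sum += i + 1;
        sum += i + 2;
        sum += i + 3;
        if (event_pending)   /* one check per unrolled group */
            check_events();
    }
    return sum;
}
```

The trade-off is exactly the one discussed above: slightly delayed event delivery (up to four ops late here) in exchange for less per-iteration overhead.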
RE: Events and JIT
Leopold Toetsch wrote:

> Why? The bytecode is patched by a different thread *if* an event is due (which in CPU cycles is rare). And I don't see a thread safety problem. The (possibly different) CPU reads an opcode and runs it. Somewhere in the meantime, the opcode at that memory position changes to the byte sequence 0xCC (on intel: int3). One byte changes; the CPU executes the trap or not (of course, changing that memory position is assumed to be atomic, which AFAIK works on i386) - but next time in the loop the trap is honored.

Other threads than the target could be executing the same chunk of JITted code at the same time.

--
Gordon Henriksen
IT Manager
ICLUBcentral Inc.
[EMAIL PROTECTED]
Re: Events and JIT
On Jan 16, 2004, at 1:20 PM, Leopold Toetsch wrote:

> Gordon Henriksen [EMAIL PROTECTED] wrote:
>> I don't think that bytecode-modifying versions should fly; they're not threadsafe,
>
> Why? The bytecode is patched by a different thread *if* an event is due (which in CPU cycles is rare). And I don't see a thread safety problem. The (possibly different) CPU reads an opcode and runs it. Somewhere in the meantime, the opcode at that memory position changes to the byte sequence 0xCC (on intel: int3). One byte changes; the CPU executes the trap or not (of course, changing that memory position is assumed to be atomic, which AFAIK works on i386) - but next time in the loop the trap is honored.

Where in the stream would we patch? If not in a loop, you may never hit the single patched location again, but the program still might not end for a very long time. If we are patching all locations with branches and such, in large bytecode this could take a long time, and the executing thread might outrun the patching thread. Also, once the handler is entered we'd have to fix all of the patched locations, which again could be time-consuming for large bytecode.

It could work, but seems problematic.

JEff