Re: Using imcc as JIT optimizer
Leopold Toetsch wrote:

Phil Hassey wrote: But with a processor with > 16 registers (do such things exist?), Parrot would be overflowing registers that it could have been using in the JIT.

RISC processors have a lot of them. But before there are unused processor registers, we will allocate P and S registers too. When a CPU has more than 4*32 free registers, we will look again.

Like IA64? AFAIK it has 128 integer registers and 128 fp registers...
Re: Using imcc as JIT optimizer
Nicholas Clark <[EMAIL PROTECTED]> writes:
> On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:
> [snip lots of good stuff]
>> All this is obviously machine dependent: the code generated should
>> only run in the machine it was compiled for. So we should always keep
>> the original imc code in case we copy the pbc file to another
>> machine.
>
> Er, but doesn't that mean that imc code has now usurped the role of parrot
> byte code?
>
> I'm not sure what is a good answer here. But I thought that the intent of
> parrot's bytecode was to be the same bytecode that runs everywhere. Which
> is slightly incompatible with compiling perl code to something that runs
> as fast as possible on the machine that you're both compiling and running
> on. (These two being the same machine most of the time).
>
> Maybe we're starting to get to the point of having imcc deliver parrot
> bytecode if you want to be portable, and something approaching native
> machine code if you want speed. Or maybe if you want the latter we save
> "fat" bytecode files, that contain IMC code, bytecode and JIT-food for
> one or more processors.

Aren't there safety implications with 'fat' code? One could envisage a malicious fat PBC where the IMC code and the bytecode did different things...

-- Piers
[CVS ci] Using imcc as JIT optimizer #3
This concludes the experiment for now. It works, but to do it right, it should go in the direction Angel Faus has mentioned. Calling conventions also have to be done first, to get the data flow right.

With the -Oj option a minimal CFG section is created in the packfile, which parrot's JIT code uses to get sections and register mappings. This is significantly faster than the current JIT optimizer, which has a relatively high impact on program load times. The JIT loader looks at the packfile now, and uses either method to generate the information needed for actually producing code.

Further included:
- some CFG hacks to figure out info about subroutines
- implemented the optimization mentioned in the comment in the register interference code
- implement read/write semantics of pushx/popx/clearx/saveall/restoreall
- some bugfixes WRT memory handling of SymRegs/life_info
- improved default_dump for pdump
- removed unused warnings in jit.c, all -O3 uninitialized warnings in imcc

leo

PS
$ imcc -O1j primes.pasm
Elapsed time: 3.485836
$ ./primes # -O3 gcc 2.95.2
Elapsed time: 3.643756
$ imcc -O1 -j primes.pasm
Elapsed time: 3.884460

$ make test IMCC="languages/imcc/imcc -O1j" succeeds, except for t/op/interp_2, where the trace output is different due to inserted register load/store ops. For the nci stuff -Oj gets disabled internally.
Re: Using imcc as JIT optimizer
Phil Hassey wrote:

... The current bytecode from parrot already has potential for slowing things down, and that's what worries me here.

I don't see that.

My understanding is that PBC has a limit of 16 (32?) integer registers. When a code block needs more than 16 registers, they are overflowed into a PMC.

There are 32 registers per type. When life analysis can't allocate all used temporary vars to parrot registers, the overflowing vars get spilled into a PerlArray. This is different from just "a block needs more than ...":

set $I0, 10
add $I1, $I0, 2
print $I1
add $I2, $I0, 3
print $I2

only needs two registers: $I1 and $I2 get the same parrot register, because their usage doesn't overlap.

But with a processor with > 16 registers (do such things exist?), Parrot would be overflowing registers that it could have been using in the JIT.

RISC processors have a lot of them. But before there are unused processor registers, we will allocate P and S registers too. When a CPU has more than 4*32 free registers, we will look again.

Thanks, Phil

leo
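Leo's point that $I1 and $I2 can share one parrot register follows from live-range analysis: their first-use/last-use intervals are disjoint. A minimal sketch (hypothetical Python, not imcc's actual allocator, which works per basic block on real CFG data):

```python
# Toy live-range computation over the five-line example above.
# A register's live range is approximated as (first use, last use);
# two virtual registers can share one parrot register when their
# ranges don't overlap.

def live_ranges(code):
    """Map each $-prefixed virtual register to (first_use, last_use)."""
    ranges = {}
    for i, line in enumerate(code):
        for tok in line.split()[1:]:
            reg = tok.rstrip(",")
            if reg.startswith("$"):
                lo, hi = ranges.get(reg, (i, i))
                ranges[reg] = (min(lo, i), max(hi, i))
    return ranges

def can_share(r1, r2, ranges):
    (a, b), (c, d) = ranges[r1], ranges[r2]
    return b < c or d < a   # disjoint intervals -> one register suffices

code = [
    "set $I0, 10",
    "add $I1, $I0, 2",
    "print $I1",
    "add $I2, $I0, 3",
    "print $I2",
]
r = live_ranges(code)
```

Here `can_share("$I1", "$I2", r)` is true while `can_share("$I0", "$I1", r)` is false, matching Leo's "only needs two registers".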
Re: Using imcc as JIT optimizer
> > Although it might be nice if IMC were binary at this stage (for some
> > feel-good-reason?).
>
> You mean, that a HL like perl6 should produce a binary equivalent to
> the current .imc file? Yep - this was discussed already, albeit there
> was no discussion of how this should look. And the lexer in imcc is
> pretty fast.
>
> > ... The current bytecode from parrot already has potential
> > for slowing things down, and that's what worries me here.
>
> I don't see that.

My post was more a "wish-list" of what I was hoping parrot would be like in terms of imc/pbc/jit/whatever. Since I don't completely understand how parrot works, my comment above was actually more of a guess. But I'll try to explain what I meant, in the off-chance it was right.

My understanding is that PBC has a limit of 16 (32?) integer registers. When a code block needs more than 16 registers, they are overflowed into a PMC. With a processor with < 16 registers, I guess this would work, although the JIT would have to overflow more than what was originally planned in the PBC. (Or does it just switch back and forth between the VM and the JIT? I don't know.) But with a processor with > 16 registers (do such things exist?), Parrot would be overflowing registers that it could have been using in the JIT. My guess is that this would slow things down.

Anyway, before I strut my ignorance of VMs and JITs and processors any more, I think I will end this message. :)

Thanks, Phil
Re: Using imcc as JIT optimizer
Angel Faus wrote:

(1) First, do a register allocation for machine registers, assuming that there are N machine registers and infinite parrot registers.

This likewise uses the top N used registers for processor regs. The "spilling" for (1) is loading/moving them to parrot registers/temp registers - only the load/store part of what spilling code normally makes out of those. Then you still have 32 parrot registers per kind to allocate.

But it is not as easy as it reads: we have non-preserved registers too, which can be mapped but are not preserved over function calls. So, when mapped and used, they must be stored to parrot's regs and reloaded after external function calls, if used again in that block or after. Albeit load/stores of this kind can be optimized, depending on register usage.

For example, code generated by (1) would look like:

set m3, 1      # m3 is the 3rd machine register
add m3, m3, 1
print m3
set $I1, m3    # $I1 is a parrot virtual register

Not exactly: print is an external function. Assuming ri0 - ri3 are mapped, and ri3 is not callee-saved:

set ri0, 1
add ri0, 1
set $I0, ri0       # save for print $I0
set $I1, ri3       # save/preserve the register, when used
print $I0          # external function
set ri3, $I1       # load
add ri3, ri1, ri2  # do something

(For debugging, mapped registers are printed ri0..x or rn0..y by imcc.)

Hope that it now makes more sense,

More, yes. This would give us 32 + N - (0..x) registers, where x is the number of non-callee-saved registers in the worst case, or 0 most of the time. The $I1 above can always be a new temp, which would then have a very limited life range inside one basic block.

-angel

leo
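The save/reload of ri3 around the external print call can be sketched as a small pass. Everything here is hypothetical illustration (the op sets, the `$save_ri3` temp name, and the "first operand is the destination" convention are assumptions, not imcc's real representation):

```python
# Toy pass: around each external call, store any mapped-but-non-preserved
# register to a parrot temp and reload it afterwards - but only if its
# value is still live (read before being rewritten) after the call.

EXTERNAL = {"print"}        # ops that leave JITted code (assumption)
NON_PRESERVED = {"ri3"}     # mapped regs a callee may clobber (assumption)

def live_across(code, i, reg):
    """Does reg's current value matter after the call at index i?
    Toy convention: the first operand of an op is its destination."""
    for op, *operands in code[i + 1:]:
        if reg in operands[1:]:              # read as a source -> live
            return True
        if operands and operands[0] == reg:  # rewritten first -> old value dead
            return False
    return False

def insert_saves(code):
    out = []
    for i, ins in enumerate(code):
        if ins[0] in EXTERNAL:
            keep = [r for r in NON_PRESERVED if live_across(code, i, r)]
            for r in keep:
                out.append(("set", "$save_" + r, r))   # store before the call
            out.append(ins)
            for r in keep:
                out.append(("set", r, "$save_" + r))   # reload after the call
        else:
            out.append(ins)
    return out

code = [
    ("set", "ri3", "1"),
    ("print", "ri0"),             # external call may clobber ri3
    ("add", "ri0", "ri3", "1"),   # ri3 is read afterwards, so it must survive
]
fixed = insert_saves(code)
```

This also shows where Leo's "can be optimized, depending on register usage" comes in: if ri3 were dead after the call, `live_across` returns False and no store/reload pair is emitted.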
Re: Using imcc as JIT optimizer
Phil Hassey wrote:

[snip]

Although it might be nice if IMC were binary at this stage (for some feel-good-reason?).

You mean, that a HL like perl6 should produce a binary equivalent to the current .imc file? Yep - this was discussed already, albeit there was no discussion of how this should look. And the lexer in imcc is pretty fast.

... The current bytecode from parrot already has potential for slowing things down, and that's what worries me here.

I don't see that.

3. He can hand out a platform specific .jit (which would require the target to be able to run it.) I suspect most end users would be able to use #1 or #2. However for use on embedded systems where size is an issue, having #3 as an option would be useful, as I suspect it would shrink the footprint of parrot somewhat.

The JIT-PBC for #3 has a somewhat larger size than plain PBC, due to register load/store ops and an additional CFG/register usage PBC section. But running it does require less memory, because the JIT optimizer doesn't have to create all the internal bookkeeping tables.

Cheers, Phil

leo
Re: Using imcc as JIT optimizer
> [ you seem to be living some hours ahead in time ]

Yep, sorry about that.

> The problem stays the same: spilling processors to parrot's or
> parrots to array.

Thinking a bit more about it, now I believe that the best way to do it would be:

(1) First, do a register allocation for machine registers, assuming that there are N machine registers and infinite parrot registers.

(2) Second, do a register allocation for parrot registers, using an array as spill area.

The first step assures us that we generate code that always puts data in the available machine registers, and tries to minimize moves between registers and physical memory. The second step tries to put all the data in parrot registers, and if it is not able to do that, to put it in the parrot spilling area (currently a PerlArray).

For example, code generated by (1) would look like:

set m3, 1      # m3 is the 3rd machine register
add m3, m3, 1
print m3
set $I1, m3    # $I1 is a parrot virtual register

etc...

Then we would do register allocation for the virtual $I registers, hoping to be able to put them all in the 32 parrot registers.

I believe this would be the optimal way to do it, because it actually models our priorities: first put all data in physical registers, otherwise try to do it in parrot registers. This is better than reserving the machine registers for the most used parrot registers (your original proposal) or doing a physical register allocation and assuming that we have an infinite number of parrot registers (my original proposal).

Hope that it now makes more sense,

-angel
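Angel's two-pass priority ("machine registers first, parrot registers second, spill area last") can be sketched with a deliberately naive greedy pass. Assumption throughout: "most used first" stands in for a real allocator; the `m`/`I` name prefixes and register counts are illustrative only:

```python
# Toy two-pass allocation: pass 1 hands the N machine registers to the
# most-used vars; whatever overflows becomes a parrot virtual. Pass 2
# hands parrot registers to those virtuals; whatever still overflows
# goes to the spill area (the PerlArray in the discussion).

from collections import Counter

def assign(ranked_vars, slots, prefix):
    """Greedily give the first `slots` vars a register; rest overflow."""
    placed, overflow = {}, []
    for i, v in enumerate(ranked_vars):
        if i < slots:
            placed[v] = "%s%d" % (prefix, i)
        else:
            overflow.append(v)
    return placed, overflow

def two_pass(uses, machine_regs, parrot_regs):
    ranked = [v for v, _ in Counter(uses).most_common()]
    in_machine, rest = assign(ranked, machine_regs, "m")
    in_parrot, spilled = assign(rest, parrot_regs, "I")
    return in_machine, in_parrot, spilled

# Five vars, 3 machine registers, and (artificially) 1 parrot register
# so the spill path is visible.
uses = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
m, p, s = two_pass(uses, machine_regs=3, parrot_regs=1)
```

The three hottest vars land in machine registers, the next in a parrot register, and only the coldest one is spilled - the priority order Angel argues for.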
Re: Using imcc as JIT optimizer
[snip]

> > Maybe we're starting to get to the point of having imcc deliver parrot
> > bytecode if you want to be portable, and something approaching native
> > machine code if you want speed.
>
> IMHO yes, the normal options produce a plain PBC file, more or less
> optimized at PASM level. The -Oj option is definitely a machine
> optimization option, which can run or will create a PBC that runs only
> on a machine with equally many or fewer mapped registers and the same
> external (non-JITted) instructions, i.e. on the same $arch.
> But the normal case is, that I compile the source for my machine and run
> it here - with all possible optimizations.
> I never did do any cross compilation here. Shipping the source is
> enough. Plain PBC is still like an unoptimized executable running
> everywhere - not a machine specific cross compiled EXE.
>
> > ... Or maybe if you want the latter we save "fat" bytecode
> > files, that contain IMC code, bytecode and JIT-food for one or more
> > processors.
>
> There is really no need for a fat PBC. Though - as already stated - I
> could imagine some cross compile capabilities for -Oj PBCs.

Seems to me it would be good if

- mycode.pl -- my original code

would be compiled into

- mycode.pbc/imc -- platform neutral parrot bytecode with (as I sort of suggested a day ago) no limitations on what registers there are, no spilling code, as that comes next...

In some ways, this is what IMC code is right now. Although it might be nice if IMC were binary at this stage (for some feel-good-reason?). The current bytecode from parrot already has potential for slowing things down, and that's what worries me here.

which when run on any system would generate

- mycode.jit -- a platform specific thing with native compiled code

And as a worst case, if a system didn't have a jit module it would just run the mycode.pbc, albeit not very speedily.

This gives the developer several choices:

1. He can hand out his original source (which would require the target to be able to compile, jit)
2. He can hand out a platform neutral pbc/imc of compiled code that can be compiled to full speed (which would require the target to be able to either jit or just run it.)
3. He can hand out a platform specific .jit (which would require the target to be able to run it.)

I suspect most end users would be able to use #1 or #2. However for use on embedded systems where size is an issue, having #3 as an option would be useful, as I suspect it would shrink the footprint of parrot somewhat.

Just the thoughts of a future parrot user :) Hope they benefit someone.

Cheers, Phil
Re: Using imcc as JIT optimizer
Nicholas Clark wrote:

Well, I think that proper IO would be useful. But I don't think it affects the innards of the execution system greatly

No, though we will need some more ops - or not. The current io also defines a more or less dummy io PMC (e.g. io.ops:open). This could be a full PMC, with an io_vtable (which could reflect the io stack). The most used operations would be separate opcodes; others could be methods of this io_pmc.

... - is there any reason why parrot (or at least PBC) can't conceptually treat IO in the same way that C treats it - just another standard library?

Some time ago, I posted: "[RfC] a scheme for core.ops extending" :)

"Z-code interpreter" is obfuscated shorthand for "dynamic opcode libraries" and "reading foreign bytecode". I regard the first as important, the second as "would be nice". I think Dan rates "reading foreign bytecode" more important than I do.

AFAIK we are not able to execute Z-code directly by just loading a different opcode library. The Z-ops have parameters encoded in them. So we can only load a Z-code interpreter/compiler which then reads the Z-code program, which is then simply data, not bytecode. Though it might help to have some specialized Z-ops for execution, but this falls under the above "extending".

Nicholas Clark

leo
Re: Using imcc as JIT optimizer
On Tue, Feb 25, 2003 at 11:58:41PM +0100, Leopold Toetsch wrote:
> Nicholas Clark wrote:

[thanks for the explanation]

> > And is this all premature optimisation, given that we haven't got objects,
> > exceptions, IO or a Z-code interpreter yet?
> And yes: We don't have exceptions and threads yet. The other items
> don't matter (IMHO).

Well, I think that proper IO would be useful. But I don't think it affects the innards of the execution system greatly - is there any reason why parrot (or at least PBC) can't conceptually treat IO in the same way that C treats it - just another standard library?

"Z-code interpreter" is obfuscated shorthand for "dynamic opcode libraries" and "reading foreign bytecode". I regard the first as important, the second as "would be nice". I think Dan rates "reading foreign bytecode" more important than I do.

Nicholas Clark
Re: Using imcc as JIT optimizer
[ you seem to be living some hours ahead in time ]

Angel Faus wrote:

I explained very badly. The issue is not spilling (at the parrot level)

The problem stays the same: spilling processor registers to parrot's, or parrot's to the array.

[ ... ]

set I3, 1
add I3, I3, 1
print I3
fast_save I3, 1
set I3, 1

The "fast_save" above is spilling at the parrot register level; moving regs to parrot registers is the same thing at the processor register level. Actual machine code could be:

mov 1, %eax    # first write to a parrot register
inc %eax       # add I3, I3, 1 => (*) add I3, 1 => inc I3
mov %eax, I3   # store reg to parrot register's mem
print I3       # print is external

*) already done now

The above sequence of code wouldn't consume any mapped register - for the whole sequence originally shown. So the final goal could be to emit these load/stores too, which could then be optimized to avoid duplicate loading/storing.

An even better goal would be to have imcc know how many temporaries every JITed op requires, and use this information during register allocation.

As shown above, yep.

All this is obviously machine dependent: the code generated should only run in the machine it was compiled for. So we should always keep the original imc code in case we copy the pbc file to another machine.

I'll answer this part in the reply to Nicholas's reply.

-angel

leo
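The "optimized to avoid duplicate loading/storing" step Leo hints at is a classic peephole pass. A minimal sketch (hypothetical `store`/`load` pseudo-ops, not parrot's actual JIT representation): once the load/stores between scratch registers and parrot register memory are explicit, a load that directly follows a store of the same register/memory pair can be dropped, because the value is still sitting in the register.

```python
# Toy peephole: drop a load that immediately follows a store of the
# same (register, memory slot) pair - the register already holds it.

def peephole(code):
    out = []
    for ins in code:
        if (out and ins[0] == "load" and out[-1][0] == "store"
                and ins[1:] == out[-1][1:]):
            continue            # value is still in the register
        out.append(ins)
    return out

code = [
    ("store", "%eax", "I3"),    # spill %eax into parrot register I3
    ("load",  "%eax", "I3"),    # redundant: I3 is still in %eax
    ("inc",   "%eax"),
]

# If anything intervenes that may change %eax, the load must stay:
code2 = [
    ("store", "%eax", "I3"),
    ("inc",   "%eax"),
    ("load",  "%eax", "I3"),    # needed: %eax was modified in between
]
```

`peephole(code)` removes the second op; `peephole(code2)` keeps all three.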
Re: Using imcc as JIT optimizer
Nicholas Clark wrote:

On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:

[snip lots of good stuff]

All this is obviously machine dependent: the code generated should only run in the machine it was compiled for. So we should always keep the original imc code in case we copy the pbc file to another machine.

Er, but doesn't that mean that imc code has now usurped the role of parrot byte code?

No. It's like another runtime option. Run "imcc -Oj the.pasm" and you get what you want, a differently optimized piece of JIT code, that might run faster than "imcc -j the.pasm". And saying "imcc -Oj -o the.pbc the.pasm" should spit out the fastest bytecode possible, for your very machine.

I'm not sure what is a good answer here. But I thought that the intent of parrot's bytecode was to be the same bytecode that runs everywhere.

Yep ...

Which is slightly incompatible with compiling perl code to something that runs as fast as possible on the machine that you're both compiling and running on. (These two being the same machine most of the time).

At PBC level, imcc already has "-Op" which does parrot register renumbering (modulo NCI and such, where fixed registers are needed, and this is -- hmmm suboptimal then :) and imcc can write out CFG information in some machine independent form, i.e. at basic block level. But no processor specific load/store instructions and such. This can help the JIT optimizer do the job faster, though it isn't that easy, because there are non-JITed code sequences interspersed.

I think some difficulties arise when looking at what imcc now is: It's the assemble.pl generating PBC files. But it's also parrot, it can run PBC files - and it's both - it can run PASM (or IMC) files - immediately. And the latter one can always be as fast as the $arch allows. Generating PBC doesn't have to use the same compile options - as you wouldn't use, when running "gcc -b machine".

Maybe we're starting to get to the point of having imcc deliver parrot bytecode if you want to be portable, and something approaching native machine code if you want speed.

IMHO yes, the normal options produce a plain PBC file, more or less optimized at PASM level. The -Oj option is definitely a machine optimization option, which can run or will create a PBC that runs only on a machine with equally many or fewer mapped registers and the same external (non-JITted) instructions, i.e. on the same $arch. But the normal case is, that I compile the source for my machine and run it here - with all possible optimizations. I never did do any cross compilation here. Shipping the source is enough. Plain PBC is still like an unoptimized executable running everywhere - not a machine specific cross compiled EXE.

... Or maybe if you want the latter we save "fat" bytecode files, that contain IMC code, bytecode and JIT-food for one or more processors.

There is really no need for a fat PBC. Though - as already stated - I could imagine some cross compile capabilities for -Oj PBCs.

And is this all premature optimisation, given that we haven't got objects, exceptions, IO or a Z-code interpreter yet?

It is a different approach to JIT register allocation. The current optimizer allocates registers per JITed section, with no chance (IMHO) to reuse registers after a branch, because the optimizer lacks the information to know that this branch target will only be reached from here, and that the registers are the same - and so finally, that saving/loading processor registers to memory could be avoided. OTOH imcc already has almost all this info at hand (coming out of the CFG/life information needed for allocating parrot regs from $temps). So the chance for generating faster code is there, IMHO.

Premature optimization - partly of course yes/no: My copy here runs now all parrot tests except op/interp_2 (obvious, this compares traced instructions, where -Oj inserted some register load/saves) and the pmc/nci tests, where just the fixed parameter/return result registers are messed up - the "imcc calling conventions" thread has a proposal for this. And yes: We don't have exceptions and threads yet. The other items don't matter (IMHO). But we will come to a point where, for certain languages, we will optimize P-registers, or mix them with I-regs, reusing the same processor regs. :-)

Nicholas Clark

leo
Re: Using imcc as JIT optimizer
On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:

[snip lots of good stuff]

> All this is obviously machine dependent: the code generated should
> only run in the machine it was compiled for. So we should always keep
> the original imc code in case we copy the pbc file to another
> machine.

Er, but doesn't that mean that imc code has now usurped the role of parrot byte code?

I'm not sure what is a good answer here. But I thought that the intent of parrot's bytecode was to be the same bytecode that runs everywhere. Which is slightly incompatible with compiling perl code to something that runs as fast as possible on the machine that you're both compiling and running on. (These two being the same machine most of the time).

Maybe we're starting to get to the point of having imcc deliver parrot bytecode if you want to be portable, and something approaching native machine code if you want speed. Or maybe if you want the latter we save "fat" bytecode files, that contain IMC code, bytecode and JIT-food for one or more processors.

And is this all premature optimisation, given that we haven't got objects, exceptions, IO or a Z-code interpreter yet?

Nicholas Clark
Re: Using imcc as JIT optimizer
I explained very badly. The issue is not spilling (at the parrot level). The problem is: if you only pick the highest priority parrot registers and put them in real registers, you are losing opportunities where copying the data once will save you from copying it many times. You are, in some sense, underspilling.

Let's see an example. Imagine you are compiling this imc, to be run on a machine which has 3 registers free (after temporaries):

set $I1, 1
add $I1, $I1, 1
print $I1
set $I2, 1
add $I2, $I2, 1
print $I2
set $I3, 1
add $I3, $I3, 1
print $I3
set $I4, 1
add $I4, $I4, 1
print $I4
set $I5, 1
add $I5, $I5, 1
print $I5
print $I1
print $I2
print $I3
print $I4
print $I5

Very silly code indeed, but you get the idea. Since we have only 5 vars, imcc would turn this into:

set I1, 1
add I1, I1, 1
print I1
set I2, 1
add I2, I2, 1
print I2
set I3, 1
add I3, I3, 1
print I3
set I4, 1
add I4, I4, 1
print I4
set I5, 1
add I5, I5, 1
print I5
print I1
print I2
print I3
print I4
print I5

Now, assuming you put registers I1-I3 in real registers, what would it take to execute this code in JIT? It would have to move the values of I4 and I5 between memory and registers a total of 10 times (4 saves and 6 restores, if you assume the JIT is smart).

[This particular example could be improved by making the jit look if the same parrot register is going to be used in the next op, but that's not the point]

But, if IMCC knew that there were really only 3 registers in the machine, it would generate:

set I1, 1
add I1, I1, 1
print I1
set I2, 1
add I2, I2, 1
print I2
set I3, 1
add I3, I3, 1
print I3
fast_save I3, 1
set I3, 1
add I3, I3, 1
print I3
fast_save I3, 2
set I3, 1
add I3, I3, 1
print I3
fast_save I3, 3
print I1
print I2
fast_restore I3, 3
print I3
fast_restore I3, 2
print I3
fast_restore I3, 1
print I3

When running this code in the JIT, it would only require 6 moves (3 saves, 3 restores): exactly the ones generated by imcc.

In reality this would be even better, because as you have the guarantee of having the data already in real registers, you need fewer temporaries and so have more machine registers free.

> So the final goal could be, to emit these load/stores too, which
> then could be optimized to avoid duplicate loading/storing. Or imcc
> could emit a register move, if in the next instruction the parrot
> register is used again.

Yes, that's the idea: making imcc generate the loads/stores, using the info about how many registers are actually available in the real machine _and_ its own knowledge of the program flow.

An even better goal would be to have imcc know how many temporaries every JITed op requires, and use this information during register allocation.

All this is obviously machine dependent: the code generated should only run in the machine it was compiled for. So we should always keep the original imc code in case we copy the pbc file to another machine.

-angel
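Where does Angel's "10 moves" figure come from? A toy tally makes it visible. Assumption: under the top-N mapping, every read of an unmapped register costs one restore (load) and every write costs one save (store); constants are ignored. This simplified model is mine, not a description of the actual JIT:

```python
# Count memory moves for the example program when only I1-I3 are mapped.
# Each instruction is modeled as (dest, [sources]); dest=None means the
# op only reads (like print).

def memory_moves(ops, mapped):
    loads = stores = 0
    for dest, srcs in ops:
        loads += sum(1 for s in srcs if s not in mapped)
        if dest and dest not in mapped:
            stores += 1
    return loads, stores

prog = []
for v in ["I1", "I2", "I3", "I4", "I5"]:
    prog += [(v, []),        # set v, 1
             (v, [v]),       # add v, v, 1
             (None, [v])]    # print v
prog += [(None, [v]) for v in ["I1", "I2", "I3", "I4", "I5"]]  # final prints

loads, stores = memory_moves(prog, mapped={"I1", "I2", "I3"})
```

With I1-I3 mapped, I4 and I5 each cost 2 stores (set, add) and 3 loads (add, print, final print): 4 saves plus 6 restores, i.e. the 10 moves in the email, versus 6 for the fast_save/fast_restore version.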
Re: Using imcc as JIT optimizer
Phil Hassey wrote:

Not knowing much about virtual machine design... Here's a question -- Why do we have a set number of registers? Particularly since JITed code ends up setting the register constraints again, I'm not sure why parrot should set up register limit constraints first. Couldn't each code block say "I need 12 registers for this block" and then the JIT system would go on to do its appropriate spilling magic with the system registers...

This is somehow the approach the current optimizer in jit.c takes. The optimizer looks at a section (a JITed part of a basic block), checks register usage and then assigns the top N registers to processor registers. This has 2 disadvantages:

- it's done at runtime - always. It's pretty fast, but could have non-trivial overhead for big programs
- as each section, and therefore each basic block, has its own set of mapped registers, on almost every boundary of a basic block and when calling out to non-JITed code, processor registers have to be saved to parrot's and restored back again. These memory accesses slow things down, so I want to avoid them where possible.

Phil

leo
Re: Using imcc as JIT optimizer
On Tuesday 25 February 2003 08:51, Leopold Toetsch wrote:
> Angel Faus wrote:
> > Saturday 22 February 2003 16:28, Leopold Toetsch wrote:
> >
> > With your approach there are three levels of parrot "registers":
> >
> > - The first N registers, which in JIT will be mapped to physical
> > registers.
> >
> > - The other 32 - N parrot registers, which will be in memory.
> >
> > - The "spilled" registers, which are also in memory, but will have to
> > be copied to a parrot register (which may be a memory location or a
> > physical register) before being used.
>
> Spilling is really rare, you have to work hard to get a test case :-)
> But when it comes to spilling, we should do some register renumbering
> (which is the case for processor registers too). The current allocation
> is per basic block. When we start spilling, new temp registers are
> created, so that the register life range is limited to the usage of the
> new temp register and the spill code.
> This is rather expensive, as for one spilled register, the whole life
> analysis has to be redone.

Not knowing much about virtual machine design... Here's a question -- Why do we have a set number of registers? Particularly since JITed code ends up setting the register constraints again, I'm not sure why parrot should set up register limit constraints first. Couldn't each code block say "I need 12 registers for this block" and then the JIT system would go on to do its appropriate spilling magic with the system registers...

I suspect the answer has something to do with optimized C and not making things hairy, but I had to ask anyway. :)

... Phil
Re: Using imcc as JIT optimizer
Angel Faus wrote:

Saturday 22 February 2003 16:28, Leopold Toetsch wrote:

With your approach there are three levels of parrot "registers":

- The first N registers, which in JIT will be mapped to physical registers.
- The other 32 - N parrot registers, which will be in memory.
- The "spilled" registers, which are also in memory, but will have to be copied to a parrot register (which may be a memory location or a physical register) before being used.

Spilling is really rare, you have to work hard to get a test case :-) But when it comes to spilling, we should do some register renumbering (which is the case for processor registers too). The current allocation is per basic block. When we start spilling, new temp registers are created, so that the register life range is limited to the usage of the new temp register and the spill code. This is rather expensive, as for one spilled register, the whole life analysis has to be redone.

I believe it would be smarter if we instructed IMCC to generate code that only uses N parrot registers (where N is the number of machine registers available). This way we avoid the risk of having to copy the data twice.

I don't think so. When we have all 3 levels of registers, using fewer parrot registers would just produce more spilled registers. Actually, I'm currently generating code that uses 32+N registers. The processor registers are numbered -1, -2 ... for the top used parrot registers 0, 1, ... But the processor registers are only fixed mirrors of the parrot registers.

This is also interesting because it gives the register allocation algorithm all the information about the actual structure of the machine we are going to run on. I am quite confident that code generated this way would run faster.

All the normal operations boil down basically to 2 different machine instruction types, e.g. for some binop:

_rm or _rr (i386)
_rrr (RISC arch)

These are surrounded by mov_rm / mov_mr to load/store non-mapped processor registers from/to parrot registers; the reg(s) are then some scratch registers, like %eax on i386 or r11/r12 on ppc. See e.g. jit/{i386,ppc}/core.jit.

So the final goal could be to emit these load/stores too, which could then be optimized to avoid duplicate loading/storing. Or imcc could emit a register move, if in the next instruction the parrot register is used again. Then processor specific hints could come in, like: shr_rr_i for i386 has to have the shift count in %ecx.

We also need to have a better procedure for saving and restoring spilled registers. Especially in the case of JIT compilation, where it could be translated to a machine save/restore.

I don't see much here. Where should the spilled registers be stored then?

What do you think about it?

I think, when it comes to spilling, we should divide the basic block, to get shorter life ranges, which would allow register renumbering then.

-angel

leo
Re: Using imcc as JIT optimizer
On Tue, Feb 25, 2003 at 07:18:11PM +0100, Angel Faus wrote: > I believe it would be smarter if we instructed IMCC to generate code > that only uses N parrot registers (where N is the number of machine > register available). This way we avoid the risk of having to copy > twice the data. It's not going to be very good if I compile code to pbc on an x86 where there are about 3 usable registers and try to run it on any other CPU with a lot more registers. -- Jason
Re: Using imcc as JIT optimizer
Saturday 22 February 2003 16:28, Leopold Toetsch wrote:
> Gopal V wrote:
> > If memory serves me right, Leopold Toetsch wrote:
> >
> > Ok .. well I sort of understood that the first N registers will
> > be the ones MAPped ?. So I thought re-ordering/sorting was the
> > operation performed.
>
> Yep. Register renumbering, so that the top N used (in terms of
> score) registers are I0, I1, ..In-1

With your approach there are three levels of parrot "registers":

- The first N registers, which in JIT will be mapped to physical registers.
- The other 32 - N parrot registers, which will be in memory.
- The "spilled" registers, which are also in memory, but will have to be copied to a parrot register (which may be a memory location or a physical register) before being used.

I believe it would be smarter if we instructed IMCC to generate code that only uses N parrot registers (where N is the number of machine registers available). This way we avoid the risk of having to copy the data twice.

This is also interesting because it gives the register allocation algorithm all the information about the actual structure of the machine we are going to run on. I am quite confident that code generated this way would run faster.

We also need to have a better procedure for saving and restoring spilled registers. Especially in the case of JIT compilation, where it could be translated to a machine save/restore.

What do you think about it?

-angel
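The renumbering in the quoted text - score each parrot register by usage and renumber so the top N become I0..I(N-1), the ones the JIT then maps to machine registers - can be sketched in a few lines. Assumption: plain use counts stand in for imcc's real score (which presumably also weights loop depth):

```python
# Toy register renumbering: the most-used register becomes I0, the
# next I1, and so on, so that "the first N registers" are exactly the
# hottest ones.

from collections import Counter

def renumber(uses):
    """uses: every register operand occurrence, in program order.
    Returns an old-name -> new-name mapping."""
    score = Counter(uses)
    return {old: "I%d" % i for i, (old, _) in enumerate(score.most_common())}

uses = ["I5", "I5", "I5", "I2", "I9", "I2"]
mapping = renumber(uses)
```

With N=2 mapped registers, I5 and I2 (as I0 and I1) would live in machine registers, while I9 (as I2) stays in memory - without imcc ever having to know N at this stage, which is Leo's counterpoint to allocating for exactly N registers up front.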
Re: Using imcc as JIT optimizer
Leopold Toetsch wrote:

> - do register allocation for JIT in imcc
> - use the first N registers as MAPped processor registers

I have committed the next bunch of changes and an updated jit.pod.

- it should now be platform independent, *but* other platforms have to define what they consider as preserved (callee-saved) registers and put these first in the mapped register lists.
- for testing enable JIT_IMCC_OJ in jit.c and for platforms != i386: copy the MAP macro at the bottom of jit/i386/jit_emit.h to your jit_emit.h
- run programs like so: imcc -Oj -d8 primes.pasm (-d8 shows generated ins)

It runs now ~95% of parrot tests on i386 but YMMV.

Have fun,
leo
Re: Using imcc as JIT optimizer
Dan Sugalski wrote:

> At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
> > Starting from the unbearable fact that optimized compiled C is still
> > faster than parrot -j (in primes.pasm), I did this experiment:
> > - do register allocation for JIT in imcc
> > - use the first N registers as MAPped processor registers
>
> This sounds pretty interesting, and I bet it could make things faster.

I have now checked in a first version for testing:
- the define JIT_IMCC_OJ in jit.c is disabled - so no impact
- jit2h.pl now defines a MAP macro, which makes jit_cpu.c more readable

Restrictions:
- no vtable ops
- no saving of non-preserved registers (%edx on i386)

So not much will run when experimenting with it. But I think the numbers are promising, so it's worth a further try.

To enable the whole fun, recompile with JIT_IMCC_OJ enabled, build imcc and use the -Oj switch (primes.pasm is from examples/benchmarks):

$ time imcc -j -Oj primes.pasm
N primes up to 5 is: 5133 last is: 4
Elapsed time: 3.523477
real 0m3.548s

$ ./primes # primes.c -O3 gcc 2.95.2
N primes up to 5 is: 5133 last is: 4
Elapsed time: 3.647063

$ time imcc -j -O1 primes.pasm # normal JIT
N primes up to 5 is: 5133 last is: 4
Elapsed time: 4.039121
real 0m4.065s

imcc/parrot was built without optimization, but this doesn't matter; no external code is called for jit/i386 in primes.pasm. The timings for imcc obviously include compiling too.

leo
Re: Using imcc as JIT optimizer
At 4:28 PM +0100 2/22/03, Leopold Toetsch wrote:
> Gopal V wrote:
> > Direct hardware maps (like using CX for loop count etc) will need to
> > be platform dependent ?. Or you could have a fixed reg that can be
> > used for loop count (and gets mapped on hardware appropriately).
>
> We currently don't have special registers, like %ecx for loops; they
> are not used in JIT either. My Pentium manual states that these ops are
> not the fastest. But in the long run we should have some hints, e.g.
> that i386 needs %ecx as shift count, or that div uses %edx. But
> probably i386 is the only weird architecture with such ugly
> restrictions - and with far too few registers.

I'm OK with adding in documentation that encourages using particular registers for particular purposes, or having some sort of metadata for the JIT that notes loop registers or something. As long as it's out of band and optional, that's cool.
-- 
Dan

--------------------------------------"it's like this"---------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: Using imcc as JIT optimizer
On Sat, Feb 22, 2003 at 09:27:04PM +0000, nick wrote:
> On Sat, Feb 22, 2003 at 08:44:12PM -0000, Rafael Garcia-Suarez wrote:
> > What undefined behaviour are you referring to exactly ? the shift
> > overrun ? AFAIK it's very predictable (given one int size). Cases of
>
> Will you accept a shortcut written in perl? The shift op uses C signed
> integers:

Oops. The logical shift uses *un*signed integers, except under use integer

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
0

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
1

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
0

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
1

So there's actually no difference in the numbers. But as I'm being a pedant I ought to get the facts right.

[I guess it's my fault for drinking Australian wine :-)]

Nicholas Clark
Re: Using imcc as JIT optimizer
On Sat, Feb 22, 2003 at 08:44:12PM -0000, Rafael Garcia-Suarez wrote:
> Nicholas Clark wrote in perl.perl6.internals :
> >
> >> > r->score = r->use_count + (r->lhs_use_count << 2);
> >> >
> >> >    r->score += 1 << (loop_depth * 3);
> [...]
> > I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc
> > that added trap code for all classes of undefined behaviour, and caused
> > code to abort (or something more colourfully "undefined") if anything
> > undefined gets executed. I realise that code would run very slowly, but it
> > would be a very very useful debugging tool.
>
> What undefined behaviour are you referring to exactly ? the shift
> overrun ? AFAIK it's very predictable (given one int size). Cases of

Will you accept a shortcut written in perl? The shift op uses C signed integers:

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
0

vs

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
1

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
1

vs

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
0

(all 4 are Debian GNU/Linux. And both architectures that give 0 for a shift of 32 happen to give 1 for a shift of 256. But I wouldn't count on it for all architectures.)

> potential undefined behavior can usually be detected at compile-time. I

In this specific case, maybe. In the general case, no. Signed integer arithmetic overflowing is undefined behavior.

> imagine that shift overrun detection can be enabled via an ugly macro
> and a cpp symbol.
>
> (what's a nasal demon ? can't find the nasald(8) manpage)

Demons flying out of your nose. One alleged consequence of undefined behaviour. Another is your computer turning into a butterfly. I guess a third is "Microsoft releasing a bug free program"

Nicholas Clark
Re: Using imcc as JIT optimizer
Nicholas Clark wrote:
> >    r->score += 1 << (loop_depth * 3);
>
> until variables in 11 deep loops go undefined?

Not undefined, but spilled. First *oops*, but second, of course this is all not final. I did change scoring several times from the code base that AFAIK Angel Faus implemented. And we don't currently have any code that goes near the complexity of such a deeply nested loop. There are probably a *lot* of such gotchas in the whole CFG code in imcc. I'm currently on some failing perl6 tests when using optimization, all in regex tests, which do a lot of branching.

> I'm not sure how to patch this specific instance - just trap loop depths
> over 10? Should score be unsigned?

A linear counting of loop_depth will do it, e.g.

   r->score += 100 * loop_depth;

Or score deeper nested loop vars always higher than outside ones, or ...

> More importantly, how do we trap these sort of things in the general case?

With a lot of tests.

> I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc
> that added trap code for all classes of undefined behaviour, and caused
> code to abort (or something more colourfully "undefined") if anything
> undefined gets executed. I realise that code would run very slowly, but it
> would be a very very useful debugging tool.

I'm currently adding asserts to e.g. the loop detection code. The last one (to be checked in) is:

   /* we could also take the depth of the first contained
    * block, but below is a check, that an inner loop is fully
    * contained in an outer loop */

This is a check that all blocks of a deeper nested loop are contained totally in the outer loop, so that there can't be basic blocks outside. But in regex code this seems not to be true - or a prior stage of optimization messes things up. These issues are hard to debug, as they are deeply buried in ~400 basic blocks with ~1000 edges connecting them.

perl6 $ ../imcc/imcc -O1 -d70 t/rx/basic_2.imc 2>&1 | less

leo
Re: Using imcc as JIT optimizer
Nicholas Clark wrote in perl.perl6.internals : > >> > r->score = r->use_count + (r->lhs_use_count << 2); >> > >> >r->score += 1 << (loop_depth * 3); [...] > I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc > that added trap code for all classes of undefined behaviour, and caused > code to abort (or something more colourfully "undefined") if anything > undefined gets executed. I realise that code would run very slowly, but it > would be a very very useful debugging tool. What undefined behaviour are you referring to exactly ? the shift overrun ? AFAIK it's very predictable (given one int size). Cases of potential undefined behavior can usually be detected at compile-time. I imagine that shift overrun detection can be enabled via an ugly macro and a cpp symbol. (what's a nasal demon ? can't find the nasald(8) manpage)
Re: Using imcc as JIT optimizer
Please don't take the following as a criticism of imcc - I'm sure I manage to write code with things like this all the time.

On Sat, Feb 22, 2003 at 08:13:59PM +0530, Gopal V wrote:
> If memory serves me right, Leopold Toetsch wrote:
> > r->score = r->use_count + (r->lhs_use_count << 2);
> >
> >    r->score += 1 << (loop_depth * 3);
>
> Ok ... deeper the loop the more important the var is .. cool.

until variables in 11 deep loops go undefined? (it appears to be a signed int)

I'm not sure how to patch this specific instance - just trap loop depths over 10? Should score be unsigned?

More importantly, how do we trap these sort of things in the general case?

I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc that added trap code for all classes of undefined behaviour, and caused code to abort (or something more colourfully "undefined") if anything undefined gets executed. I realise that code would run very slowly, but it would be a very very useful debugging tool.

Nicholas Clark
Re: Using imcc as JIT optimizer
Gopal V wrote:
> If memory serves me right, Leopold Toetsch wrote:
>
> Ok .. well I sort of understood that the first N registers will
> be the ones MAPped ?. So I thought re-ordering/sorting was the
> operation performed.

Yep. Register renumbering, so that the top N used (in terms of score) registers are I0, I1, ..In-1.

> Direct hardware maps (like using CX for loop count etc) will need to be
> platform dependent ?. Or you could have a fixed reg that can be used for
> loop count (and gets mapped on hardware appropriately).

We currently don't have special registers, like %ecx for loops; they are not used in JIT either. My Pentium manual states that these ops are not the fastest. But in the long run we should have some hints, e.g. that i386 needs %ecx as shift count, or that div uses %edx. But probably i386 is the only weird architecture with such ugly restrictions - and with far too few registers.

> Loop info
>
> Hmm.. this is what I said "sounds like a lot of work" ... which still
> remains true from my perspective :-)

There is still a lot of work, yes, but some things are already done:

        set I10, 10
x:
        if I10, ok
        branch y
ok:
        set I0, 1
        sub I10, I10, I0
        print I10
        print "\n"
        branch x
y:
        end

Ends up (with imcc -O2p) as:

        set I0, 10
        set I1, 1
x:
        unless I0, y
        sub I0, I1
        print I0
        print "\n"
        branch x
y:
        end

You can see:

   opt1 sub I10, I10, I0 => sub I10, I0
   if_branch if ... ok
   label ok deleted
   found invariant set I0, 1
   inserting it in blk 0 after set I10, 10

The latter one is working out from the most inner loop.

leo
Re: Using imcc as JIT optimizer
If memory serves me right, Leopold Toetsch wrote: > > I'm assuming that the temporaries are the things being moved around here ?. > > > It is not so much a matter of moving things around, but a matter of > allocating (and renumbering) parrot (or for JIT) processor registers. Ok .. well I sort of understood that the first N registers will be the ones MAPped ?. So I thought re-ordering/sorting was the operation performed. Direct hardware maps (like using CX for loop count etc) will need to be platform dependent ?. Or you could have a fixed reg that can be used for loop count (and gets mapped on hardware appropriately). > > does it. But that sounds like a lot of work identifying the loops and > > optimising accordingly. > Loop info > - > loop 0, depth 1, size 2, entry 0, contains blocks: > 1 2 Hmm.. this is what I said "sounds like a lot of work" ... which still remains true from my perspective :-) > r->score = r->use_count + (r->lhs_use_count << 2); > >r->score += 1 << (loop_depth * 3); Ok ... deeper the loop the more important the var is .. cool. Gopal -- The difference between insanity and genius is measured by success
Re: Using imcc as JIT optimizer
Gopal V wrote:
> I'm assuming that the temporaries are the things being moved around here ?.

It is not so much a matter of moving things around, but a matter of allocating (and renumbering) parrot (or, for JIT, processor) registers. These are of course mainly temporaries, but even when you have some find_lexical/do_something/store_lexical, imcc selects the best register for all involved ops; temps or "variables", it doesn't really matter.

> The only question I have here , how does imcc identify loops ?. I've been
> using "if goto" to loop around , which is exactly the way assembly does
> it. But that sounds like a lot of work identifying the loops and
> optimising accordingly.

Here are the basic blocks, the CFG and the loop info of:

   0    set I0, 10
   1 x:
   1    unless I0, y
   2    dec I0
   2    print I0
   2    print "\n"
   2    branch x
   3 y:
   3    end

Dumping the CFG:
---
0 (0) -> 1      <-
1 (1) -> 2 3    <- 2 0
2 (1) -> 1      <- 1
3 (0) ->        <- 1

Loop info
---------
loop 0, depth 1, size 2, entry 0, contains blocks: 1 2

> To make it more clear -- identifying tight loops and the usage weights
> correctly. 10 uses of $I0 outside the loop vs 1 use of $I1 inside a 100
> times loop. Which will come first ?.

This is basically the current score calculation used for register allocation:

   r->score = r->use_count + (r->lhs_use_count << 2);
   r->score += 1 << (loop_depth * 3);

leo
Re: Using imcc as JIT optimizer
If memory serves me right, Dan Sugalski wrote:
> This sounds pretty interesting, and I bet it could make things
> faster. The one thing to be careful of is that it's easy to get
> yourself into a position where you spend more time optimizing the
> code you're JITting than you win in the end.

I think that's not the case for ahead-of-time optimisations. As long as the JIT is not the optimiser, you could take your time optimising. The topic is really misleading ... or am I the one who's wrong?

> You also have to be very careful that you don't reorder things, since
> there's not enough info in the bytecode stream to know what can and
> can't be moved. (Which is something we need to deal with in IMCC as
> well)

I'm assuming that the temporaries are the things being moved around here ?. Since imcc already moves them around anyway and the programmer makes no assumptions about their positions -- this shouldn't be a problem ?.

The only question I have here, how does imcc identify loops ?. I've been using "if goto" to loop around, which is exactly the way assembly does it. But that sounds like a lot of work identifying the loops and optimising accordingly.

To make it more clear -- identifying tight loops and the usage weights correctly. 10 uses of $I0 outside the loop vs 1 use of $I1 inside a 100 times loop. Which will come first ?.

Gopal
-- 
The difference between insanity and genius is measured by success
Re: Using imcc as JIT optimizer
Dan Sugalski wrote:
> At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
> > Starting from the unbearable fact that optimized compiled C is still
> > faster than parrot -j (in primes.pasm), I did this experiment:
> > - do register allocation for JIT in imcc
> > - use the first N registers as MAPped processor registers
>
> This sounds pretty interesting, and I bet it could make things faster.
> The one thing to be careful of is that it's easy to get yourself into a
> position where you spend more time optimizing the code you're JITting
> than you win in the end.

I don't think so. Efficiency of JIT code depends very much on register save/restore instructions. imcc does a full parrot register life analysis, and knows when e.g. I17 is rewritten and thus can assign it the same register that some ins above, I5, had. Current JIT code looks at parrot registers and emits save/loads to get processor registers in sync, which is the opposite of the proposal: map the top N used parrot regs to physical processor registers. This means imcc emits instructions to get parrot registers up to date and not vice versa. The code is already in terms of processor regs.

> You also have to be very careful that you don't reorder things, since
> there's not enough info in the bytecode stream to know what can and
> can't be moved. (Which is something we need to deal with in IMCC as well)

Yep. So I'm trying to get *all* needed infos into the bytecode stream/into the op_info/or as a hack in imcc. See e.g. "[RFC] imcc calling conventions". Please remember the times when I started digging into parrot and core.ops: the in/out/inout definition of P-registers. These issues are *crucial* for a language *compiler*. If perl6 or any other language should run *efficiently*, imcc has to be a compiler with all needed info at hand and not a plain PASM assembler.

leo
Re: Using imcc as JIT optimizer
At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
> Starting from the unbearable fact that optimized compiled C is still
> faster than parrot -j (in primes.pasm), I did this experiment:
> - do register allocation for JIT in imcc
> - use the first N registers as MAPped processor registers

This sounds pretty interesting, and I bet it could make things faster. The one thing to be careful of is that it's easy to get yourself into a position where you spend more time optimizing the code you're JITting than you win in the end.

You also have to be very careful that you don't reorder things, since there's not enough info in the bytecode stream to know what can and can't be moved. (Which is something we need to deal with in IMCC as well)
-- 
Dan

--------------------------------------"it's like this"---------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: Using imcc as JIT optimizer
Leopold Toetsch wrote:
> - do register allocation for JIT in imcc
> - use the first N registers as MAPped processor registers

The "[RFC] imcc calling conventions" didn't get any response. Should I take this fact as an implicit "yep, fine"?

Here is again the relevant part, which has implications on register renumbering, used for JIT optimization:

=head1 Parrot calling conventions (NCI)

Proposed syntax:

   $P0 = load_lib "libname"
   $P1 = dlfunc $P0, "funcname", "signature"
   .nciarg z        # I5
   .nciarg y        # I6
   .nciarg x        # I7
   ncicall $P1      # r = funcname(x, y, z)
   .nciresult r

A code snippet like:

   set I5, I0
   dlfunc P0, P1, "func", "ii"
   invoke
   set I6, I5

now comes out as:

   set ri1, ri0
   dlfunc P0, P1, "func", "ii"
   invoke
   set ri0, ri1

which is clearly not what pdd03 is intending. For plain PASM, at least the .nciarg/.nciresult are necessary to mark these parrot registers as fixed and to give imcc some hint that dlfunc is actually using these registers.

So there are some possibilities:
- disable register renumbering for all compilation units where a B<dlfunc> is found
- do it right, i.e. implement the above (or a similar) syntax and rewrite existing code

leo
Re: Using imcc as JIT optimizer
On Thursday 20 February 2003 18:14, Leopold Toetsch wrote: > Tupshin Harper wrote: > > Leopold Toetsch wrote: > >> Starting from the unbearable fact, that optimized compiled C is still > >> faster then parrot -j (in primes.pasm) > > > > Lol...what are you going to do when somebody comes along with the > > unbearable example of primes.s(optimized x86 assembly), and you are > > forced to throw up your hands in defeat? ;-) > > It only may be equally fast, that's it :) Nahh, you know it can be faster... may be in a couple of years ;-D > > > Cool idea, if I understand correctly, and I am in awe of how fast the > > bloody thing is already. > > That's integer/float only. When it comes to objects, different things > matter. > > > -Tupshin > > leo
Re: Using imcc as JIT optimizer
Tupshin Harper wrote:
> Leopold Toetsch wrote:
> > Starting from the unbearable fact that optimized compiled C is still
> > faster than parrot -j (in primes.pasm)
>
> Lol...what are you going to do when somebody comes along with the
> unbearable example of primes.s (optimized x86 assembly), and you are
> forced to throw up your hands in defeat? ;-)

It only may be equally fast, that's it :)

> Cool idea, if I understand correctly, and I am in awe of how fast the
> bloody thing is already.

That's integer/float only. When it comes to objects, different things matter.

leo
Re: Using imcc as JIT optimizer
Leopold Toetsch wrote:
> Starting from the unbearable fact that optimized compiled C is still
> faster than parrot -j (in primes.pasm)

Lol...what are you going to do when somebody comes along with the unbearable example of primes.s (optimized x86 assembly), and you are forced to throw up your hands in defeat? ;-)

Cool idea, if I understand correctly, and I am in awe of how fast the bloody thing is already.

-Tupshin
Re: Using imcc as JIT optimizer
Sean O'Rourke wrote:
> On Thu, 20 Feb 2003, Leopold Toetsch wrote:
> > What do people think?
>
> Cool idea -- a lot of optimization-helpers could eventually be passed on
> to the jit (possibly in the metadata?). One thought -- the information
> imcc computes should be platform-independent. e.g. it could pass a
> control flow graph to the JIT, but it probably shouldn't do register
> allocation for a specific number of registers. How much worse do you
> think it would be to have IMCC just rank the Parrot registers in order
> of decreasing spill cost, then have the JIT take the top N, where N is
> the number of available architectural registers?

The registers are already in that order (with -Op or -Oj), so this wouldn't be a problem. Difficulties arise when it comes to the register load/save instructions, which get inserted by imcc in my scheme. These are definitely processor/$arch specific. They depend on the number of mappable (and non-preserved too) registers, and on the state of the op_jit function table.

Of course CFG and register life information could be passed on to the JIT, but this seems a little bit complicated, as JIT has its own sections, which match either a basic block from imcc or are a sequence of non-JITable instructions. But in the long run, it could be a way to go.

OTOH - PBC compatibility is not a big point here when JIT is involved: 99% of the time the code would run on the machine where it was generated. And it would AFAIK be easier to make some JIT crosscompiler. This would basically only need the number of mappable registers and the extcall bits from the jump table, read in from some config file.

leo
Re: Using imcc as JIT optimizer
On Thu, 20 Feb 2003, Leopold Toetsch wrote: > What do people think? Cool idea -- a lot of optimization-helpers could eventually be passed on to the jit (possibly in the metadata?). One thought -- the information imcc computes should be platform-independent. e.g. it could pass a control flow graph to the JIT, but it probably shouldn't do register allocation for a specific number of registers. How much worse do you think it would be to have IMCC just rank the Parrot registers in order of decreasing spill cost, then have the JIT take the top N, where N is the number of available architectural registers? /s
Using imcc as JIT optimizer
Starting from the unbearable fact that optimized compiled C is still faster than parrot -j (in primes.pasm), I did this experiment:

- do register allocation for JIT in imcc
- use the first N registers as MAPped processor registers

Here is the JIT optimized PASM output of

$ imcc -Oj -o p.pasm primes.pasm
$ cat p.pasm

        set ri2, 1
        set I5, 50
        set I4, 0
        print "N primes up to "
        print I5
        print " is: "
        time N1
        set rn1, N1     # load
REDO:
        set ri0, 2
        div ri3, ri2, 2
LOOP:
        cmod ri1, ri2, ri0
        if ri1, OK      # with -O1j unless ri1, NEXT
        branch NEXT     # deleted
OK:                     # deleted
        inc ri0
        le ri0, ri3, LOOP
        inc I4
        set I6, ri2
NEXT:
        inc ri2
        le ri2, I5, REDO
        time N0
        set rn0, N0     # load
        print I4
        print "\nlast is: "
        print I6
        print "\n"
        sub rn0, rn1
        set N0, rn0     # save
        print "Elapsed time: "
        print N0
        print "\n"
        end

The ri? and rn? are processor registers; the above is for intel (4 mapped int/float regs), you can translate the ri? to (%ebx, %edi, %esi, %edx). The processor regs are represented as (-1 - parrot_reg), i.e. %ebx == -1, %edi == -2 ...

The MAP macro in jit_emit.h would then be:

# define MAP(i) ((i) >= 0 ? 0 : ...map_branch[jit_info->op_i -1-(i)])

where the mappings are directly intval_map or floatval_map. JIT wouldn't need any further calculations.

The load/save instructions get inserted by looking at op_jit[].extcall, i.e. if the instruction reads or writes a register, it gets saved/loaded before/after and the parrot register is used instead. (Only the print and time ops are external on i386.)

I currently have the imcc part for some common cases, enough for the above output.

What do people think?

For reference: a similar idea: "Of mops and microops"

leo

PS: -O3 C 3.64s, JIT ~3.55.