Re: Using imcc as JIT optimizer
Leopold Toetsch wrote:

Phil Hassey wrote: But with a processor with > 16 registers (do such things exist?), Parrot would be overflowing registers that it could have been using in the JIT.

RISC processors have a lot of them. But before there are unused processor registers, we will allocate P and S registers too. When a CPU has more than 4*32 free registers, we will look again.

Like IA64? AFAIK it has 128 integer registers and 128 fp registers...
Re: Using imcc as JIT optimizer
Nicholas Clark <[EMAIL PROTECTED]> writes:
> On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:
> [snip lots of good stuff]
>> All this is obviously machine dependent: the code generated should
>> only run in the machine it was compiled for. So we should always keep
>> the original imc code in case we copy the pbc file to another
>> machine.
>
> Er, but doesn't that mean that imc code has now usurped the role of parrot
> byte code?
>
> I'm not sure what is a good answer here. But I thought that the intent of
> parrot's bytecode was to be the same bytecode that runs everywhere. Which
> is slightly incompatible with compiling perl code to something that runs
> as fast as possible on the machine that you're both compiling and running
> on. (These two being the same machine most of the time).
>
> Maybe we're starting to get to the point of having imcc deliver parrot
> bytecode if you want to be portable, and something approaching native
> machine code if you want speed. Or maybe if you want the latter we save
> "fat" bytecode files, that contain IMC code, bytecode and JIT-food for
> one or more processors.

Aren't there safety implications with 'fat' code? One could envisage a malicious fat PBC where the IMC code and the bytecode did different things...

-- Piers
[CVS ci] Using imcc as JIT optimizer #3
This concludes the experiment for now. It works, but to do it right, it should go in the direction Angel Faus has mentioned. Calling conventions also have to be done first, to get the data flow right.

With the -Oj option a minimal CFG section is created in the packfile, which parrot's JIT code uses to get sections and register mappings. This is significantly faster than the current JIT optimizer, which has a relatively high impact on program load times. The JIT loader looks at the packfile now, and uses either method to generate the information needed for actually producing code.

Further included:
- some CFG hacks to figure out info about subroutines
- implemented the optimization mentioned in the comment in the register interference code
- implement read/write semantics of pushx/popx/clearx/saveall/restoreall
- some bugfixes WRT memory handling of SymRegs/life_info
- improved default_dump for pdump
- removed unused warnings in jit.c, all -O3 uninitialized warnings in imcc

leo

PS
$ imcc -O1j primes.pasm
Elapsed time: 3.485836
$ ./primes # -O3 gcc 2.95.2
Elapsed time: 3.643756
$ imcc -O1 -j primes.pasm
Elapsed time: 3.884460

$ make test IMCC="languages/imcc/imcc -O1j" succeeds, except for t/op/interp_2, where the trace output is different due to inserted register load/store ops. For the nci stuff -Oj gets disabled internally.
Re: Using imcc as JIT optimizer
Phil Hassey wrote:

... The current bytecode from parrot already has potential for slowing things down, and that's what worries me here.

I don't see that.

My understanding is that PBC has a limit of 16 (32?) integer registers. When a code block needs more than 16 registers, they are overflowed into a PMC.

There are 32 registers per type. When life analysis can't allocate all used temporary vars to parrot registers, the overflowing vars get spilled into a PerlArray. This is different from just "a block needs more than ...":

set $I0, 10
add $I1, $I0, 2
print $I1
add $I2, $I0, 3
print $I2

only needs two registers: $I1 and $I2 get the same parrot register, because their usage doesn't overlap.

But with a processor with > 16 registers (do such things exist?), Parrot would be overflowing registers that it could have been using in the JIT.

RISC processors have a lot of them. But before there are unused processor registers, we will allocate P and S registers too. When a CPU has more than 4*32 free registers, we will look again.

Thanks, Phil

leo
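Leo's point that $I1 and $I2 can share one parrot register follows from live-range analysis: their first-use/last-use intervals are disjoint. A minimal sketch (hypothetical Python, not imcc's actual allocator, which works per basic block on real CFG data):

```python
# Toy live-range computation over the five-line example above.
# A register's live range is approximated as (first use, last use);
# two virtual registers can share one parrot register when their
# ranges don't overlap.

def live_ranges(code):
    """Map each $-prefixed virtual register to (first_use, last_use)."""
    ranges = {}
    for i, line in enumerate(code):
        for tok in line.split()[1:]:
            reg = tok.rstrip(",")
            if reg.startswith("$"):
                lo, hi = ranges.get(reg, (i, i))
                ranges[reg] = (min(lo, i), max(hi, i))
    return ranges

def can_share(r1, r2, ranges):
    (a, b), (c, d) = ranges[r1], ranges[r2]
    return b < c or d < a   # disjoint intervals -> one register suffices

code = [
    "set $I0, 10",
    "add $I1, $I0, 2",
    "print $I1",
    "add $I2, $I0, 3",
    "print $I2",
]
r = live_ranges(code)
```

Here `can_share("$I1", "$I2", r)` is true while `can_share("$I0", "$I1", r)` is false, matching Leo's "only needs two registers".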
Re: Using imcc as JIT optimizer
> > Although it might be nice if IMC were binary at this stage (for some
> > feel-good-reason?).
>
> You mean, that a HL like perl6 should produce a binary equivalent to
> the current .imc file? Yep - this was discussed already, albeit there
> was no discussion of how this should look. And the lexer in imcc is
> pretty fast.
>
> > ... The current bytecode from parrot already has potential
> > for slowing things down, and that's what worries me here.
>
> I don't see that.

My post was more a "wish-list" of what I was hoping parrot would be like in terms of imc/pbc/jit/whatever. Since I don't completely understand how parrot works, my comment above was actually more of a guess. But I'll try to explain what I meant, in the off-chance it was right.

My understanding is that PBC has a limit of 16 (32?) integer registers. When a code block needs more than 16 registers, they are overflowed into a PMC. With a processor with < 16 registers, I guess this would work, although the JIT would have to overflow more than what was originally planned in the PBC. (Or does it just switch back and forth between the VM and the JIT? I don't know.) But with a processor with > 16 registers (do such things exist?), Parrot would be overflowing registers that it could have been using in the JIT. My guess is that this would slow things down.

Anyway, before I strut my ignorance of VMs and JITs and processors any more, I think I will end this message. :)

Thanks, Phil
Re: Using imcc as JIT optimizer
Angel Faus wrote:

(1) First, do a register allocation for machine registers, assuming that there are N machine registers and infinite parrot registers.

This likewise uses the top N used registers for processor regs. The "spilling" for (1) is loading/moving them to parrot registers/temp registers - only the load/store part of what spilling code normally makes out of those. Then you still have 32 parrot registers per kind to allocate.

But it is not as easy as it reads: we have non-preserved registers too, which can be mapped but are not preserved over function calls. So, when mapped and used, they must be stored to parrot's regs and reloaded after external function calls, if used again in that block or after. Albeit load/stores of this kind can be optimized, depending on register usage.

For example, code generated by (1) would look like:

set m3, 1      # m3 is the 3rd machine register
add m3, m3, 1
print m3
set $I1, m3    # $I1 is a parrot virtual register

Not exactly: print is an external function. Assuming ri0 - ri3 are mapped, and ri3 is not callee-saved:

set ri0, 1
add ri0, 1
set $I0, ri0       # save for print $I0
set $I1, ri3       # save/preserve the register, when used
print $I0          # external function
set ri3, $I1       # load
add ri3, ri1, ri2  # do something

(For debugging, mapped registers are printed ri0..x or rn0..y by imcc.)

Hope that it now makes more sense,

More, yes. This would give us 32 + N - (0..x) registers, where x is the number of non-callee-saved registers in the worst case, or 0 most of the time. The $I1 above can always be a new temp, which would then have a very limited life range inside one basic block.

-angel

leo
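The save/reload of ri3 around the external print call can be sketched as a small pass. Everything here is hypothetical illustration (the op sets, the `$save_ri3` temp name, and the "first operand is the destination" convention are assumptions, not imcc's real representation):

```python
# Toy pass: around each external call, store any mapped-but-non-preserved
# register to a parrot temp and reload it afterwards - but only if its
# value is still live (read before being rewritten) after the call.

EXTERNAL = {"print"}        # ops that leave JITted code (assumption)
NON_PRESERVED = {"ri3"}     # mapped regs a callee may clobber (assumption)

def live_across(code, i, reg):
    """Does reg's current value matter after the call at index i?
    Toy convention: the first operand of an op is its destination."""
    for op, *operands in code[i + 1:]:
        if reg in operands[1:]:              # read as a source -> live
            return True
        if operands and operands[0] == reg:  # rewritten first -> old value dead
            return False
    return False

def insert_saves(code):
    out = []
    for i, ins in enumerate(code):
        if ins[0] in EXTERNAL:
            keep = [r for r in NON_PRESERVED if live_across(code, i, r)]
            for r in keep:
                out.append(("set", "$save_" + r, r))   # store before the call
            out.append(ins)
            for r in keep:
                out.append(("set", r, "$save_" + r))   # reload after the call
        else:
            out.append(ins)
    return out

code = [
    ("set", "ri3", "1"),
    ("print", "ri0"),             # external call may clobber ri3
    ("add", "ri0", "ri3", "1"),   # ri3 is read afterwards, so it must survive
]
fixed = insert_saves(code)
```

This also shows where Leo's "can be optimized, depending on register usage" comes in: if ri3 were dead after the call, `live_across` returns False and no store/reload pair is emitted.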
Re: Using imcc as JIT optimizer
Phil Hassey wrote:

[snip]

Although it might be nice if IMC were binary at this stage (for some feel-good-reason?).

You mean, that a HL like perl6 should produce a binary equivalent to the current .imc file? Yep - this was discussed already, albeit there was no discussion of how this should look. And the lexer in imcc is pretty fast.

... The current bytecode from parrot already has potential for slowing things down, and that's what worries me here.

I don't see that.

3. He can hand out a platform specific .jit (which would require the target to be able to run it.) I suspect most end users would be able to use #1 or #2. However for use on embedded systems where size is an issue, having #3 as an option would be useful, as I suspect it would shrink the footprint of parrot somewhat.

The JIT-PBC for #3 has a somewhat larger size than plain PBC, due to register load/store ops and an additional CFG/register usage PBC section. But running it does require less memory, because the JIT optimizer doesn't have to create all the internal bookkeeping tables.

Cheers, Phil

leo
Re: Using imcc as JIT optimizer
> [ you seem to be living some hours ahead in time ]

Yep, sorry about that.

> The problem stays the same: spilling processors to parrot's or
> parrots to array.

Thinking a bit more about it, now I believe that the best way to do it would be:

(1) First, do a register allocation for machine registers, assuming that there are N machine registers and infinite parrot registers.

(2) Second, do a register allocation for parrot registers, using an array as spill area.

The first step assures us that we generate code that always puts data in the available machine registers, and tries to minimize moves between registers and physical memory. The second step tries to put all the data in parrot registers, and if it is not able to do that, to put it in the parrot spilling area (currently a PerlArray).

For example, code generated by (1) would look like:

set m3, 1      # m3 is the 3rd machine register
add m3, m3, 1
print m3
set $I1, m3    # $I1 is a parrot virtual register

etc...

Then we would do register allocation for the virtual $I registers, hoping to be able to put them all in the 32 parrot registers.

I believe this would be the optimal way to do it, because it actually models our priorities: first put all data in physical registers, otherwise try to do it in parrot registers. This is better than reserving the machine registers for the most used parrot registers (your original proposal) or doing a physical register allocation and assuming that we have an infinite number of parrot registers (my original proposal).

Hope that it now makes more sense,

-angel
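Angel's two-pass priority ("machine registers first, parrot registers second, spill area last") can be sketched with a deliberately naive greedy pass. Assumption throughout: "most used first" stands in for a real allocator; the `m`/`I` name prefixes and register counts are illustrative only:

```python
# Toy two-pass allocation: pass 1 hands the N machine registers to the
# most-used vars; whatever overflows becomes a parrot virtual. Pass 2
# hands parrot registers to those virtuals; whatever still overflows
# goes to the spill area (the PerlArray in the discussion).

from collections import Counter

def assign(ranked_vars, slots, prefix):
    """Greedily give the first `slots` vars a register; rest overflow."""
    placed, overflow = {}, []
    for i, v in enumerate(ranked_vars):
        if i < slots:
            placed[v] = "%s%d" % (prefix, i)
        else:
            overflow.append(v)
    return placed, overflow

def two_pass(uses, machine_regs, parrot_regs):
    ranked = [v for v, _ in Counter(uses).most_common()]
    in_machine, rest = assign(ranked, machine_regs, "m")
    in_parrot, spilled = assign(rest, parrot_regs, "I")
    return in_machine, in_parrot, spilled

# Five vars, 3 machine registers, and (artificially) 1 parrot register
# so the spill path is visible.
uses = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
m, p, s = two_pass(uses, machine_regs=3, parrot_regs=1)
```

The three hottest vars land in machine registers, the next in a parrot register, and only the coldest one is spilled - the priority order Angel argues for.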
Re: Using imcc as JIT optimizer
[snip]

> > Maybe we're starting to get to the point of having imcc deliver parrot
> > bytecode if you want to be portable, and something approaching native
> > machine code if you want speed.
>
> IMHO yes, the normal options produce a plain PBC file, more or less
> optimized at PASM level. The -Oj option is definitely a machine
> optimization option, which can run or will create a PBC that runs only
> on a machine with equally many or fewer mapped registers and the same
> external (non-JITted) instructions, i.e. on the same $arch.
> But the normal case is, that I compile the source for my machine and run
> it here - with all possible optimizations.
> I never did do any cross compilation here. Shipping the source is
> enough. Plain PBC is still like an unoptimized executable running
> everywhere - not a machine specific cross compiled EXE.
>
> > ... Or maybe if you want the latter we save "fat" bytecode
> > files, that contain IMC code, bytecode and JIT-food for one or more
> > processors.
>
> There is really no need for a fat PBC. Though - as already stated - I
> could imagine some cross compile capabilities for -Oj PBCs.

Seems to me it would be good if

- mycode.pl -- my original code

would be compiled into

- mycode.pbc/imc -- platform neutral parrot bytecode with (as I sort of suggested a day ago) no limitations on what registers there are, no spilling code, as that comes next...

In some ways, this is what IMC code is right now. Although it might be nice if IMC were binary at this stage (for some feel-good-reason?). The current bytecode from parrot already has potential for slowing things down, and that's what worries me here.

which when run on any system would generate

- mycode.jit -- a platform specific thing with native compiled code

And as a worst case, if a system didn't have a jit module it would just run the mycode.pbc, albeit not very speedily.

This gives the developer several choices:

1. He can hand out his original source (which would require the target to be able to compile, jit)
2. He can hand out a platform neutral pbc/imc of compiled code that can be compiled to full speed (which would require the target to be able to either jit or just run it.)
3. He can hand out a platform specific .jit (which would require the target to be able to run it.)

I suspect most end users would be able to use #1 or #2. However for use on embedded systems where size is an issue, having #3 as an option would be useful, as I suspect it would shrink the footprint of parrot somewhat.

Just the thoughts of a future parrot user :) Hope they benefit someone.

Cheers, Phil
Re: Using imcc as JIT optimizer
Nicholas Clark wrote:

Well, I think that proper IO would be useful. But I don't think it affects the innards of the execution system greatly

No, though we will need some more ops - or not. The current io also defines a more or less dummy io PMC (e.g. io.ops:open). This could be a full PMC, with an io_vtable (which could reflect the io stack). The most used operations would be separate opcodes; others could be methods of this io_pmc.

... - is there any reason why parrot (or at least PBC) can't conceptually treat IO in the same way that C treats it - just another standard library?

Some time ago, I posted: "[RfC] a scheme for core.ops extending" :)

"Z-code interpreter" is obfuscated shorthand for "dynamic opcode libraries" and "reading foreign bytecode". I regard the first as important, the second as "would be nice". I think Dan rates "reading foreign bytecode" more important than I do.

AFAIK we are not able to execute Z-code directly by just loading a different opcode library. The Z-ops have parameters encoded in them. So we can only load a Z-code interpreter/compiler which then reads the Z-code program, which is then simply data, not bytecode. Though it might help to have some specialized Z-ops for execution, but this falls under the above "extending".

Nicholas Clark

leo
Re: Using imcc as JIT optimizer
On Tue, Feb 25, 2003 at 11:58:41PM +0100, Leopold Toetsch wrote:
> Nicholas Clark wrote:

[thanks for the explanation]

> > And is this all premature optimisation, given that we haven't got objects,
> > exceptions, IO or a Z-code interpreter yet?
> And yes: We don't have exceptions and threads yet. The other items
> don't matter (IMHO).

Well, I think that proper IO would be useful. But I don't think it affects the innards of the execution system greatly - is there any reason why parrot (or at least PBC) can't conceptually treat IO in the same way that C treats it - just another standard library?

"Z-code interpreter" is obfuscated shorthand for "dynamic opcode libraries" and "reading foreign bytecode". I regard the first as important, the second as "would be nice". I think Dan rates "reading foreign bytecode" more important than I do.

Nicholas Clark
Re: Using imcc as JIT optimizer
[ you seem to be living some hours ahead in time ]

Angel Faus wrote:

I explained very badly. The issue is not spilling (at the parrot level)

The problem stays the same: spilling processor registers to parrot's, or parrot's to the array.

[ ... ]

set I3, 1
add I3, I3, 1
print I3
fast_save I3, 1
set I3, 1

The "fast_save" above is spilling at the parrot register level; moving regs to parrot registers is the same thing at the processor register level. Actual machine code could be:

mov 1, %eax    # first write to a parrot register
inc %eax       # add I3, I3, 1 => (*) add I3, 1 => inc I3
mov %eax, I3   # store reg to parrot register's mem
print I3       # print is external

*) already done now

The above sequence of code wouldn't consume any mapped register - for the whole sequence originally shown. So the final goal could be to emit these load/stores too, which could then be optimized to avoid duplicate loading/storing.

An even better goal would be to have imcc know how many temporaries every JITed op requires, and use this information during register allocation.

As shown above, yep.

All this is obviously machine dependent: the code generated should only run in the machine it was compiled for. So we should always keep the original imc code in case we copy the pbc file to another machine.

I'll answer this part in the reply to Nicholas's reply.

-angel

leo
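The "optimized to avoid duplicate loading/storing" step Leo hints at is a classic peephole pass. A minimal sketch (hypothetical `store`/`load` pseudo-ops, not parrot's actual JIT representation): once the load/stores between scratch registers and parrot register memory are explicit, a load that directly follows a store of the same register/memory pair can be dropped, because the value is still sitting in the register.

```python
# Toy peephole: drop a load that immediately follows a store of the
# same (register, memory slot) pair - the register already holds it.

def peephole(code):
    out = []
    for ins in code:
        if (out and ins[0] == "load" and out[-1][0] == "store"
                and ins[1:] == out[-1][1:]):
            continue            # value is still in the register
        out.append(ins)
    return out

code = [
    ("store", "%eax", "I3"),    # spill %eax into parrot register I3
    ("load",  "%eax", "I3"),    # redundant: I3 is still in %eax
    ("inc",   "%eax"),
]

# If anything intervenes that may change %eax, the load must stay:
code2 = [
    ("store", "%eax", "I3"),
    ("inc",   "%eax"),
    ("load",  "%eax", "I3"),    # needed: %eax was modified in between
]
```

`peephole(code)` removes the second op; `peephole(code2)` keeps all three.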
Re: Using imcc as JIT optimizer
Nicholas Clark wrote:

On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:

[snip lots of good stuff]

All this is obviously machine dependent: the code generated should only run in the machine it was compiled for. So we should always keep the original imc code in case we copy the pbc file to another machine.

Er, but doesn't that mean that imc code has now usurped the role of parrot byte code?

No. It's like another runtime option. Run "imcc -Oj the.pasm" and you get what you want, a differently optimized piece of JIT code, that might run faster than "imcc -j the.pasm". And saying "imcc -Oj -o the.pbc the.pasm" should spit out the fastest bytecode possible, for your very machine.

I'm not sure what is a good answer here. But I thought that the intent of parrot's bytecode was to be the same bytecode that runs everywhere.

Yep ...

Which is slightly incompatible with compiling perl code to something that runs as fast as possible on the machine that you're both compiling and running on. (These two being the same machine most of the time).

At PBC level, imcc already has "-Op" which does parrot register renumbering (modulo NCI and such, where fixed registers are needed, and this is -- hmmm suboptimal then :) and imcc can write out CFG information in some machine independent form, i.e. at basic block level. But no processor specific load/store instructions and such. This can help the JIT optimizer do the job faster, though it isn't that easy, because there are non-JITed code sequences interspersed.

I think some difficulties arise when looking at what imcc now is: It's the assemble.pl generating PBC files. But it's also parrot, it can run PBC files - and it's both - it can run PASM (or IMC) files - immediately. And the latter one can always be as fast as the $arch allows. Generating PBC doesn't have to use the same compile options - as you wouldn't use, when running "gcc -b machine".

Maybe we're starting to get to the point of having imcc deliver parrot bytecode if you want to be portable, and something approaching native machine code if you want speed.

IMHO yes, the normal options produce a plain PBC file, more or less optimized at PASM level. The -Oj option is definitely a machine optimization option, which can run or will create a PBC that runs only on a machine with equally many or fewer mapped registers and the same external (non-JITted) instructions, i.e. on the same $arch. But the normal case is, that I compile the source for my machine and run it here - with all possible optimizations. I never did do any cross compilation here. Shipping the source is enough. Plain PBC is still like an unoptimized executable running everywhere - not a machine specific cross compiled EXE.

... Or maybe if you want the latter we save "fat" bytecode files, that contain IMC code, bytecode and JIT-food for one or more processors.

There is really no need for a fat PBC. Though - as already stated - I could imagine some cross compile capabilities for -Oj PBCs.

And is this all premature optimisation, given that we haven't got objects, exceptions, IO or a Z-code interpreter yet?

It is a different approach to JIT register allocation. The current optimizer allocates registers per JITed section, with no chance (IMHO) to reuse registers after a branch, because the optimizer lacks the information to know that this branch target will only be reached from here, and that the registers are the same - and so finally, that saving/loading processor registers to memory could be avoided. OTOH imcc already has almost all this info at hand (coming out of the CFG/life information needed for allocating parrot regs from $temps). So the chance for generating faster code is there, IMHO.

Premature optimization - partly of course yes/no: My copy here runs now all parrot tests except op/interp_2 (obvious, this compares traced instructions, where -Oj inserted some register load/saves) and the pmc/nci tests, where just the fixed parameter/return result registers are messed up - the "imcc calling conventions" thread has a proposal for this. And yes: We don't have exceptions and threads yet. The other items don't matter (IMHO). But we will come to a point where, for certain languages, we will optimize P-registers, or mix them with I-regs, reusing the same processor regs. :-)

Nicholas Clark

leo
Re: Using imcc as JIT optimizer
On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:

[snip lots of good stuff]

> All this is obviously machine dependent: the code generated should
> only run in the machine it was compiled for. So we should always keep
> the original imc code in case we copy the pbc file to another
> machine.

Er, but doesn't that mean that imc code has now usurped the role of parrot byte code?

I'm not sure what is a good answer here. But I thought that the intent of parrot's bytecode was to be the same bytecode that runs everywhere. Which is slightly incompatible with compiling perl code to something that runs as fast as possible on the machine that you're both compiling and running on. (These two being the same machine most of the time).

Maybe we're starting to get to the point of having imcc deliver parrot bytecode if you want to be portable, and something approaching native machine code if you want speed. Or maybe if you want the latter we save "fat" bytecode files, that contain IMC code, bytecode and JIT-food for one or more processors.

And is this all premature optimisation, given that we haven't got objects, exceptions, IO or a Z-code interpreter yet?

Nicholas Clark
Re: Using imcc as JIT optimizer
I explained very badly. The issue is not spilling (at the parrot level). The problem is: if you only pick the highest priority parrot registers and put them in real registers, you are losing opportunities where copying the data once will save you from copying it many times. You are, in some sense, underspilling.

Let's see an example. Imagine you are compiling this imc, to be run on a machine which has 3 registers free (after temporaries):

set $I1, 1
add $I1, $I1, 1
print $I1
set $I2, 1
add $I2, $I2, 1
print $I2
set $I3, 1
add $I3, $I3, 1
print $I3
set $I4, 1
add $I4, $I4, 1
print $I4
set $I5, 1
add $I5, $I5, 1
print $I5
print $I1
print $I2
print $I3
print $I4
print $I5

Very silly code indeed, but you get the idea. Since we have only 5 vars, imcc would turn this into:

set I1, 1
add I1, I1, 1
print I1
set I2, 1
add I2, I2, 1
print I2
set I3, 1
add I3, I3, 1
print I3
set I4, 1
add I4, I4, 1
print I4
set I5, 1
add I5, I5, 1
print I5
print I1
print I2
print I3
print I4
print I5

Now, assuming you put registers I1-I3 in real registers, what would it take to execute this code in JIT? It would have to move the values of I4 and I5 between memory and registers a total of 10 times (4 saves and 6 restores, if you assume the JIT is smart).

[This particular example could be improved by making the jit look if the same parrot register is going to be used in the next op, but that's not the point]

But, if IMCC knew that there were really only 3 registers in the machine, it would generate:

set I1, 1
add I1, I1, 1
print I1
set I2, 1
add I2, I2, 1
print I2
set I3, 1
add I3, I3, 1
print I3
fast_save I3, 1
set I3, 1
add I3, I3, 1
print I3
fast_save I3, 2
set I3, 1
add I3, I3, 1
print I3
fast_save I3, 3
print I1
print I2
fast_restore I3, 3
print I3
fast_restore I3, 2
print I3
fast_restore I3, 1
print I3

When running this code in the JIT, it would only require 6 moves (3 saves, 3 restores): exactly the ones generated by imcc.

In reality this would be even better, because as you have the guarantee of having the data already in real registers, you need fewer temporaries and so have more machine registers free.

> So the final goal could be, to emit these load/stores too, which
> then could be optimized to avoid duplicate loading/storing. Or imcc
> could emit a register move, if in the next instruction the parrot
> register is used again.

Yes, that's the idea: making imcc generate the loads/stores, using the info about how many registers are actually available in the real machine _and_ its own knowledge of the program flow.

An even better goal would be to have imcc know how many temporaries every JITed op requires, and use this information during register allocation.

All this is obviously machine dependent: the code generated should only run in the machine it was compiled for. So we should always keep the original imc code in case we copy the pbc file to another machine.

-angel
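Where does Angel's "10 moves" figure come from? A toy tally makes it visible. Assumption: under the top-N mapping, every read of an unmapped register costs one restore (load) and every write costs one save (store); constants are ignored. This simplified model is mine, not a description of the actual JIT:

```python
# Count memory moves for the example program when only I1-I3 are mapped.
# Each instruction is modeled as (dest, [sources]); dest=None means the
# op only reads (like print).

def memory_moves(ops, mapped):
    loads = stores = 0
    for dest, srcs in ops:
        loads += sum(1 for s in srcs if s not in mapped)
        if dest and dest not in mapped:
            stores += 1
    return loads, stores

prog = []
for v in ["I1", "I2", "I3", "I4", "I5"]:
    prog += [(v, []),        # set v, 1
             (v, [v]),       # add v, v, 1
             (None, [v])]    # print v
prog += [(None, [v]) for v in ["I1", "I2", "I3", "I4", "I5"]]  # final prints

loads, stores = memory_moves(prog, mapped={"I1", "I2", "I3"})
```

With I1-I3 mapped, I4 and I5 each cost 2 stores (set, add) and 3 loads (add, print, final print): 4 saves plus 6 restores, i.e. the 10 moves in the email, versus 6 for the fast_save/fast_restore version.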
Re: Using imcc as JIT optimizer
Phil Hassey wrote:

Not knowing much about virtual machine design... Here's a question -- Why do we have a set number of registers? Particularly since JITed code ends up setting the register constraints again, I'm not sure why parrot should set up register limit constraints first. Couldn't each code block say "I need 12 registers for this block" and then the JIT system would go on to do its appropriate spilling magic with the system registers...

This is somehow the approach the current optimizer in jit.c takes. The optimizer looks at a section (a JITed part of a basic block), checks register usage and then assigns the top N registers to processor registers. This has 2 disadvantages:

- it's done at runtime - always. It's pretty fast, but could have non-trivial overhead for big programs
- as each section, and therefore each basic block, has its own set of mapped registers, on almost every boundary of a basic block and when calling out to non-JITed code, processor registers have to be saved to parrot's and restored back again. These memory accesses slow things down, so I want to avoid them where possible.

Phil

leo
Re: Using imcc as JIT optimizer
On Tuesday 25 February 2003 08:51, Leopold Toetsch wrote:
> Angel Faus wrote:
> > Saturday 22 February 2003 16:28, Leopold Toetsch wrote:
> >
> > With your approach there are three levels of parrot "registers":
> >
> > - The first N registers, which in JIT will be mapped to physical
> > registers.
> >
> > - The other 32 - N parrot registers, which will be in memory.
> >
> > - The "spilled" registers, which are also in memory, but will have to
> > be copied to a parrot register (which may be a memory location or a
> > physical register) before being used.
>
> Spilling is really rare, you have to work hard to get a test case :-)
> But when it comes to spilling, we should do some register renumbering
> (which is the case for processor registers too). The current allocation
> is per basic block. When we start spilling, new temp registers are
> created, so that the register life range is limited to the usage of the
> new temp register and the spill code.
> This is rather expensive, as for one spilled register, the whole life
> analysis has to be redone.

Not knowing much about virtual machine design... Here's a question -- Why do we have a set number of registers? Particularly since JITed code ends up setting the register constraints again, I'm not sure why parrot should set up register limit constraints first. Couldn't each code block say "I need 12 registers for this block" and then the JIT system would go on to do its appropriate spilling magic with the system registers...

I suspect the answer has something to do with optimized C and not making things hairy, but I had to ask anyway. :)

... Phil
Re: Using imcc as JIT optimizer
Angel Faus wrote:

Saturday 22 February 2003 16:28, Leopold Toetsch wrote:

With your approach there are three levels of parrot "registers":

- The first N registers, which in JIT will be mapped to physical registers.
- The other 32 - N parrot registers, which will be in memory.
- The "spilled" registers, which are also in memory, but will have to be copied to a parrot register (which may be a memory location or a physical register) before being used.

Spilling is really rare, you have to work hard to get a test case :-) But when it comes to spilling, we should do some register renumbering (which is the case for processor registers too). The current allocation is per basic block. When we start spilling, new temp registers are created, so that the register life range is limited to the usage of the new temp register and the spill code. This is rather expensive, as for one spilled register, the whole life analysis has to be redone.

I believe it would be smarter if we instructed IMCC to generate code that only uses N parrot registers (where N is the number of machine registers available). This way we avoid the risk of having to copy the data twice.

I don't think so. When we have all 3 levels of registers, using fewer parrot registers would just produce more spilled registers. Actually, I'm currently generating code that uses 32+N registers. The processor registers are numbered -1, -2 ... for the top used parrot registers 0, 1, ... But the processor registers are only fixed mirrors of the parrot registers.

This is also interesting because it gives the register allocation algorithm all the information about the actual structure of the machine we are going to run on. I am quite confident that code generated this way would run faster.

All the normal operations boil down basically to 2 different machine instruction types, e.g. for some binop:

_rm or _rr (i386)
_rrr (RISC arch)

These are surrounded by mov_rm / mov_mr to load/store non-mapped processor registers from/to parrot registers; the reg(s) are then some scratch registers, like %eax on i386 or r11/r12 on ppc. See e.g. jit/{i386,ppc}/core.jit.

So the final goal could be to emit these load/stores too, which could then be optimized to avoid duplicate loading/storing. Or imcc could emit a register move, if in the next instruction the parrot register is used again. Then processor specific hints could come in, like: shr_rr_i for i386 has to have the shift count in %ecx.

We also need to have a better procedure for saving and restoring spilled registers. Especially in the case of JIT compilation, where it could be translated to a machine save/restore.

I don't see much here. Where should the spilled registers be stored then?

What do you think about it?

I think, when it comes to spilling, we should divide the basic block, to get shorter life ranges, which would allow register renumbering then.

-angel

leo
Re: Using imcc as JIT optimizer
On Tue, Feb 25, 2003 at 07:18:11PM +0100, Angel Faus wrote: > I believe it would be smarter if we instructed IMCC to generate code > that only uses N parrot registers (where N is the number of machine > register available). This way we avoid the risk of having to copy > twice the data. It's not going to be very good if I compile code to pbc on an x86 where there are about 3 usable registers and try to run it on any other CPU with a lot more registers. -- Jason
Re: Using imcc as JIT optimizer
Saturday 22 February 2003 16:28, Leopold Toetsch wrote:
> Gopal V wrote:
> > If memory serves me right, Leopold Toetsch wrote:
> >
> > Ok .. well I sort of understood that the first N registers will
> > be the ones MAPped ?. So I thought re-ordering/sorting was the
> > operation performed.
>
> Yep. Register renumbering, so that the top N used (in terms of
> score) registers are I0, I1, ..In-1

With your approach there are three levels of parrot "registers":

- The first N registers, which in JIT will be mapped to physical registers.
- The other 32 - N parrot registers, which will be in memory.
- The "spilled" registers, which are also in memory, but will have to be copied to a parrot register (which may be a memory location or a physical register) before being used.

I believe it would be smarter if we instructed IMCC to generate code that only uses N parrot registers (where N is the number of machine registers available). This way we avoid the risk of having to copy the data twice.

This is also interesting because it gives the register allocation algorithm all the information about the actual structure of the machine we are going to run on. I am quite confident that code generated this way would run faster.

We also need to have a better procedure for saving and restoring spilled registers. Especially in the case of JIT compilation, where it could be translated to a machine save/restore.

What do you think about it?

-angel
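The renumbering in the quoted text - score each parrot register by usage and renumber so the top N become I0..I(N-1), the ones the JIT then maps to machine registers - can be sketched in a few lines. Assumption: plain use counts stand in for imcc's real score (which presumably also weights loop depth):

```python
# Toy register renumbering: the most-used register becomes I0, the
# next I1, and so on, so that "the first N registers" are exactly the
# hottest ones.

from collections import Counter

def renumber(uses):
    """uses: every register operand occurrence, in program order.
    Returns an old-name -> new-name mapping."""
    score = Counter(uses)
    return {old: "I%d" % i for i, (old, _) in enumerate(score.most_common())}

uses = ["I5", "I5", "I5", "I2", "I9", "I2"]
mapping = renumber(uses)
```

With N=2 mapped registers, I5 and I2 (as I0 and I1) would live in machine registers, while I9 (as I2) stays in memory - without imcc ever having to know N at this stage, which is Leo's counterpoint to allocating for exactly N registers up front.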
Re: Using imcc as JIT optimizer
Leopold Toetsch wrote:

> - do register allocation for JIT in imcc
> - use the first N registers as MAPped processor registers

I have committed the next bunch of changes and an updated jit.pod.

- it should now be platform independent, *but* other platforms have to define what they consider as preserved (callee-saved) registers and put these first in the mapped register lists.
- for testing enable JIT_IMCC_OJ in jit.c and for platforms != i386: copy the MAP macro at the bottom of jit/i386/jit_emit.h to your jit_emit.h
- run programs like so: imcc -Oj -d8 primes.pasm (-d8 shows generated ins)

It runs now ~95% of parrot tests on i386 but YMMV.

Have fun,
leo
Re: Using imcc as JIT optimizer
Dan Sugalski wrote:

> At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
> > Starting from the unbearable fact that optimized compiled C is still
> > faster than parrot -j (in primes.pasm), I did this experiment:
> > - do register allocation for JIT in imcc
> > - use the first N registers as MAPped processor registers
>
> This sounds pretty interesting, and I bet it could make things faster.

I have now checked in a first version for testing:
- the define JIT_IMCC_OJ in jit.c is disabled - so no impact
- jit2h.pl now defines a MAP macro, which makes jit_cpu.c more readable

Restrictions:
- no vtable ops
- no saving of non-preserved registers (%edx on i386)

So not much will run when experimenting with it. But I think the numbers are promising, so it's worth a further try.

To enable the whole fun, recompile with JIT_IMCC_OJ enabled, build imcc and use the -Oj switch (primes.pasm is from examples/benchmarks):

$ time imcc -j -Oj primes.pasm
N primes up to 5 is: 5133 last is: 4
Elapsed time: 3.523477
real 0m3.548s

$ ./primes # primes.c -O3 gcc 2.95.2
N primes up to 5 is: 5133 last is: 4
Elapsed time: 3.647063

$ time imcc -j -O1 primes.pasm # normal JIT
N primes up to 5 is: 5133 last is: 4
Elapsed time: 4.039121
real 0m4.065s

imcc/parrot was built without optimization, but this doesn't matter; no external code is called for jit/i386 in primes.pasm. The timings for imcc obviously include compiling too.

leo
Re: Using imcc as JIT optimizer
At 4:28 PM +0100 2/22/03, Leopold Toetsch wrote:
> Gopal V wrote:
> > Direct hardware maps (like using CX for loop count etc) will need to
> > be platform dependent ?. Or you could have a fixed reg that can be
> > used for loop count (and gets mapped on hardware appropriately).
>
> We currently don't have special registers, like %ecx for loops; they
> are not used in JIT either. My Pentium manual states that these ops are
> not the fastest. But in the long run we should have some hints, e.g.
> that i386 needs %ecx as shift count, or that div uses %edx. But
> probably i386 is the only weird architecture with such ugly
> restrictions - and with far too few registers.

I'm OK with adding in documentation that encourages using particular registers for particular purposes, or having some sort of metadata for the JIT that notes loop registers or something. As long as it's out of band and optional, that's cool.
-- 
Dan

--------------------------------------"it's like this"---------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: Using imcc as JIT optimizer
On Sat, Feb 22, 2003 at 09:27:04PM +0000, nick wrote:
> On Sat, Feb 22, 2003 at 08:44:12PM -0000, Rafael Garcia-Suarez wrote:
> > What undefined behaviour are you referring to exactly ? the shift
> > overrun ? AFAIK it's very predictable (given one int size). Cases of
>
> Will you accept a shortcut written in perl? The shift op uses C signed
> integers:

Oops. The logical shift uses *un*signed integers, except under use integer

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
0

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
1

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
0

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
1

So there's actually no difference in the numbers. But as I'm being a pedant I ought to get the facts right.

[I guess it's my fault for drinking Australian wine :-)]

Nicholas Clark
Re: Using imcc as JIT optimizer
On Sat, Feb 22, 2003 at 08:44:12PM -0000, Rafael Garcia-Suarez wrote:
> Nicholas Clark wrote in perl.perl6.internals :
> >
> >> > r->score = r->use_count + (r->lhs_use_count << 2);
> >> >
> >> >    r->score += 1 << (loop_depth * 3);
> [...]
> > I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc
> > that added trap code for all classes of undefined behaviour, and caused
> > code to abort (or something more colourfully "undefined") if anything
> > undefined gets executed. I realise that code would run very slowly, but it
> > would be a very very useful debugging tool.
>
> What undefined behaviour are you referring to exactly ? the shift
> overrun ? AFAIK it's very predictable (given one int size). Cases of

Will you accept a shortcut written in perl? The shift op uses C signed integers:

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
0

vs

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
1

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
1

vs

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
0

(all 4 are Debian GNU/Linux. And both architectures that give 0 for a shift of 32 happen to give 1 for a shift of 256. But I wouldn't count on it for all architectures.)

> potential undefined behavior can usually be detected at compile-time. I

In this specific case, maybe. In the general case, no. Signed integer arithmetic overflowing is undefined behavior.

> imagine that shift overrun detection can be enabled via an ugly macro
> and a cpp symbol.
>
> (what's a nasal demon ? can't find the nasald(8) manpage)

Demons flying out of your nose. One alleged consequence of undefined behaviour. Another is your computer turning into a butterfly. I guess a third is "Microsoft releasing a bug free program"

Nicholas Clark
Re: Using imcc as JIT optimizer
Nicholas Clark wrote:
> >    r->score += 1 << (loop_depth * 3);
>
> until variables in 11 deep loops go undefined?

Not undefined, but spilled. First *oops*, but second, of course this is all not final. I did change scoring several times from the code base that AFAIK Angel Faus implemented. And we don't currently have any code that goes near the complexity of such a deeply nested loop. There are probably a *lot* of such gotchas in the whole CFG code in imcc. I'm currently on some failing perl6 tests when using optimization, all in regex tests, which do a lot of branching.

> I'm not sure how to patch this specific instance - just trap loop depths
> over 10? Should score be unsigned?

A linear counting of loop_depth will do it, e.g.

   r->score += 100 * loop_depth;

Or score deeper nested loop vars always higher than outside ones, or ...

> More importantly, how do we trap these sort of things in the general case?

With a lot of tests.

> I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc
> that added trap code for all classes of undefined behaviour, and caused
> code to abort (or something more colourfully "undefined") if anything
> undefined gets executed. I realise that code would run very slowly, but it
> would be a very very useful debugging tool.

I'm currently adding asserts to e.g. the loop detection code. The last one (to be checked in) is:

   /* we could also take the depth of the first contained
    * block, but below is a check, that an inner loop is fully
    * contained in an outer loop */

This is a check that all blocks of a deeper nested loop are contained totally in the outer loop, so that there can't be basic blocks outside. But in regex code this seems not to be true - or a prior stage of optimization messes things up. These issues are hard to debug, as they are deeply buried in ~400 basic blocks with ~1000 edges connecting them.

perl6 $ ../imcc/imcc -O1 -d70 t/rx/basic_2.imc 2>&1 | less

leo
Re: Using imcc as JIT optimizer
Nicholas Clark wrote in perl.perl6.internals : > >> > r->score = r->use_count + (r->lhs_use_count << 2); >> > >> >r->score += 1 << (loop_depth * 3); [...] > I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc > that added trap code for all classes of undefined behaviour, and caused > code to abort (or something more colourfully "undefined") if anything > undefined gets executed. I realise that code would run very slowly, but it > would be a very very useful debugging tool. What undefined behaviour are you referring to exactly ? the shift overrun ? AFAIK it's very predictable (given one int size). Cases of potential undefined behavior can usually be detected at compile-time. I imagine that shift overrun detection can be enabled via an ugly macro and a cpp symbol. (what's a nasal demon ? can't find the nasald(8) manpage)
Re: Using imcc as JIT optimizer
Please don't take the following as a criticism of imcc - I'm sure I manage to write code with things like this all the time.

On Sat, Feb 22, 2003 at 08:13:59PM +0530, Gopal V wrote:
> If memory serves me right, Leopold Toetsch wrote:
> > r->score = r->use_count + (r->lhs_use_count << 2);
> >
> >    r->score += 1 << (loop_depth * 3);
>
> Ok ... deeper the loop the more important the var is .. cool.

until variables in 11 deep loops go undefined? (it appears to be a signed int)

I'm not sure how to patch this specific instance - just trap loop depths over 10? Should score be unsigned?

More importantly, how do we trap these sort of things in the general case?

I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc that added trap code for all classes of undefined behaviour, and caused code to abort (or something more colourfully "undefined") if anything undefined gets executed. I realise that code would run very slowly, but it would be a very very useful debugging tool.

Nicholas Clark
Re: Using imcc as JIT optimizer
Gopal V wrote:
> If memory serves me right, Leopold Toetsch wrote:
>
> Ok .. well I sort of understood that the first N registers will
> be the ones MAPped ?. So I thought re-ordering/sorting was the
> operation performed.

Yep. Register renumbering, so that the top N used (in terms of score) registers are I0, I1, ..In-1.

> Direct hardware maps (like using CX for loop count etc) will need to be
> platform dependent ?. Or you could have a fixed reg that can be used for
> loop count (and gets mapped on hardware appropriately).

We currently don't have special registers, like %ecx for loops; they are not used in JIT either. My Pentium manual states that these ops are not the fastest. But in the long run we should have some hints, e.g. that i386 needs %ecx as shift count, or that div uses %edx. But probably i386 is the only weird architecture with such ugly restrictions - and with far too few registers.

> Loop info
>
> Hmm.. this is what I said "sounds like a lot of work" ... which still
> remains true from my perspective :-)

There is still a lot of work, yes, but some things are already done:

        set I10, 10
x:
        if I10, ok
        branch y
ok:
        set I0, 1
        sub I10, I10, I0
        print I10
        print "\n"
        branch x
y:
        end

Ends up (with imcc -O2p) as:

        set I0, 10
        set I1, 1
x:
        unless I0, y
        sub I0, I1
        print I0
        print "\n"
        branch x
y:
        end

You can see:

   opt1 sub I10, I10, I0 => sub I10, I0
   if_branch if ... ok
   label ok deleted
   found invariant set I0, 1
   inserting it in blk 0 after set I10, 10

The latter one is working out from the most inner loop.

leo
Re: Using imcc as JIT optimizer
If memory serves me right, Leopold Toetsch wrote: > > I'm assuming that the temporaries are the things being moved around here ?. > > > It is not so much a matter of moving things around, but a matter of > allocating (and renumbering) parrot (or for JIT) processor registers. Ok .. well I sort of understood that the first N registers will be the ones MAPped ?. So I thought re-ordering/sorting was the operation performed. Direct hardware maps (like using CX for loop count etc) will need to be platform dependent ?. Or you could have a fixed reg that can be used for loop count (and gets mapped on hardware appropriately). > > does it. But that sounds like a lot of work identifying the loops and > > optimising accordingly. > Loop info > - > loop 0, depth 1, size 2, entry 0, contains blocks: > 1 2 Hmm.. this is what I said "sounds like a lot of work" ... which still remains true from my perspective :-) > r->score = r->use_count + (r->lhs_use_count << 2); > >r->score += 1 << (loop_depth * 3); Ok ... deeper the loop the more important the var is .. cool. Gopal -- The difference between insanity and genius is measured by success
Re: Using imcc as JIT optimizer
Gopal V wrote:
> I'm assuming that the temporaries are the things being moved around here ?.

It is not so much a matter of moving things around, but a matter of allocating (and renumbering) parrot (or, for JIT, processor) registers. These are of course mainly temporaries, but even when you have some find_lexical/do_something/store_lexical, imcc selects the best register for all involved ops; temps or "variables", it doesn't really matter.

> The only question I have here , how does imcc identify loops ?. I've been
> using "if goto" to loop around , which is exactly the way assembly does
> it. But that sounds like a lot of work identifying the loops and
> optimising accordingly.

Here are the basic blocks, the CFG and the loop info of:

   0    set I0, 10
   1 x:
   1    unless I0, y
   2    dec I0
   2    print I0
   2    print "\n"
   2    branch x
   3 y:
   3    end

Dumping the CFG:
---
0 (0) -> 1      <-
1 (1) -> 2 3    <- 2 0
2 (1) -> 1      <- 1
3 (0) ->        <- 1

Loop info
---------
loop 0, depth 1, size 2, entry 0, contains blocks: 1 2

> To make it more clear -- identifying tight loops and the usage weights
> correctly. 10 uses of $I0 outside the loop vs 1 use of $I1 inside a 100
> times loop. Which will come first ?.

This is basically the current score calculation used for register allocation:

   r->score = r->use_count + (r->lhs_use_count << 2);
   r->score += 1 << (loop_depth * 3);

leo
Re: Using imcc as JIT optimizer
If memory serves me right, Dan Sugalski wrote:
> This sounds pretty interesting, and I bet it could make things
> faster. The one thing to be careful of is that it's easy to get
> yourself into a position where you spend more time optimizing the
> code you're JITting than you win in the end.

I think that's not the case for ahead-of-time optimisations. As long as the JIT is not the optimiser, you could take your time optimising. The topic is really misleading ... or am I the one who's wrong?

> You also have to be very careful that you don't reorder things, since
> there's not enough info in the bytecode stream to know what can and
> can't be moved. (Which is something we need to deal with in IMCC as
> well)

I'm assuming that the temporaries are the things being moved around here ?. Since imcc already moves them around anyway and the programmer makes no assumptions about their positions -- this shouldn't be a problem ?.

The only question I have here, how does imcc identify loops ?. I've been using "if goto" to loop around, which is exactly the way assembly does it. But that sounds like a lot of work identifying the loops and optimising accordingly.

To make it more clear -- identifying tight loops and the usage weights correctly. 10 uses of $I0 outside the loop vs 1 use of $I1 inside a 100 times loop. Which will come first ?.

Gopal
-- 
The difference between insanity and genius is measured by success
Re: Using imcc as JIT optimizer
Dan Sugalski wrote:
> At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
> > Starting from the unbearable fact that optimized compiled C is still
> > faster than parrot -j (in primes.pasm), I did this experiment:
> > - do register allocation for JIT in imcc
> > - use the first N registers as MAPped processor registers
>
> This sounds pretty interesting, and I bet it could make things faster.
> The one thing to be careful of is that it's easy to get yourself into a
> position where you spend more time optimizing the code you're JITting
> than you win in the end.

I don't think so. Efficiency of JIT code depends very much on register save/restore instructions. imcc does a full parrot register life analysis, and knows when e.g. I17 is rewritten and thus can assign it the same register that some ins above, I5, had. Current JIT code looks at parrot registers and emits save/loads to get processor registers in sync, which is the opposite of the proposal: map the top N used parrot regs to physical processor registers. This means imcc emits instructions to get parrot registers up to date and not vice versa. The code is already in terms of processor regs.

> You also have to be very careful that you don't reorder things, since
> there's not enough info in the bytecode stream to know what can and
> can't be moved. (Which is something we need to deal with in IMCC as well)

Yep. So I'm trying to get *all* needed infos into the bytecode stream/into the op_info/or as a hack in imcc. See e.g. "[RFC] imcc calling conventions". Please remember the times when I started digging into parrot and core.ops: the in/out/inout definition of P-registers. These issues are *crucial* for a language *compiler*. If perl6 or any other language should run *efficiently*, imcc has to be a compiler with all needed info at hand and not a plain PASM assembler.

leo
Re: Using imcc as JIT optimizer
At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
> Starting from the unbearable fact that optimized compiled C is still
> faster than parrot -j (in primes.pasm), I did this experiment:
> - do register allocation for JIT in imcc
> - use the first N registers as MAPped processor registers

This sounds pretty interesting, and I bet it could make things faster. The one thing to be careful of is that it's easy to get yourself into a position where you spend more time optimizing the code you're JITting than you win in the end.

You also have to be very careful that you don't reorder things, since there's not enough info in the bytecode stream to know what can and can't be moved. (Which is something we need to deal with in IMCC as well)
-- 
Dan

--------------------------------------"it's like this"---------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: Using imcc as JIT optimizer
Leopold Toetsch wrote:
> - do register allocation for JIT in imcc
> - use the first N registers as MAPped processor registers

The "[RFC] imcc calling conventions" didn't get any response. Should I take this fact as an implicit "yep, fine"?

Here is again the relevant part, which has implications on register renumbering, used for JIT optimization:

=head1 Parrot calling conventions (NCI)

Proposed syntax:

   $P0 = load_lib "libname"
   $P1 = dlfunc $P0, "funcname", "signature"
   .nciarg z        # I5
   .nciarg y        # I6
   .nciarg x        # I7
   ncicall $P1      # r = funcname(x, y, z)
   .nciresult r

A code snippet like:

   set I5, I0
   dlfunc P0, P1, "func", "ii"
   invoke
   set I6, I5

now comes out as:

   set ri1, ri0
   dlfunc P0, P1, "func", "ii"
   invoke
   set ri0, ri1

which is clearly not what pdd03 is intending. For plain PASM, at least the .nciarg/.nciresult are necessary to mark these parrot registers as fixed and to give imcc some hint that dlfunc is actually using these registers.

So there are some possibilities:
- disable register renumbering for all compilation units where a B<dlfunc> is found
- do it right, i.e. implement the above (or a similar) syntax and rewrite existing code

leo
Re: Using imcc as JIT optimizer
On Thursday 20 February 2003 18:14, Leopold Toetsch wrote: > Tupshin Harper wrote: > > Leopold Toetsch wrote: > >> Starting from the unbearable fact, that optimized compiled C is still > >> faster then parrot -j (in primes.pasm) > > > > Lol...what are you going to do when somebody comes along with the > > unbearable example of primes.s(optimized x86 assembly), and you are > > forced to throw up your hands in defeat? ;-) > > It only may be equally fast, that's it :) Nahh, you know it can be faster... may be in a couple of years ;-D > > > Cool idea, if I understand correctly, and I am in awe of how fast the > > bloody thing is already. > > That's integer/float only. When it comes to objects, different things > matter. > > > -Tupshin > > leo
Re: Using imcc as JIT optimizer
Tupshin Harper wrote:
> Leopold Toetsch wrote:
> > Starting from the unbearable fact that optimized compiled C is still
> > faster than parrot -j (in primes.pasm)
>
> Lol...what are you going to do when somebody comes along with the
> unbearable example of primes.s (optimized x86 assembly), and you are
> forced to throw up your hands in defeat? ;-)

It only may be equally fast, that's it :)

> Cool idea, if I understand correctly, and I am in awe of how fast the
> bloody thing is already.

That's integer/float only. When it comes to objects, different things matter.

leo
Re: Using imcc as JIT optimizer
Leopold Toetsch wrote:
> Starting from the unbearable fact that optimized compiled C is still
> faster than parrot -j (in primes.pasm)

Lol...what are you going to do when somebody comes along with the unbearable example of primes.s (optimized x86 assembly), and you are forced to throw up your hands in defeat? ;-)

Cool idea, if I understand correctly, and I am in awe of how fast the bloody thing is already.

-Tupshin
Re: Using imcc as JIT optimizer
Sean O'Rourke wrote:
> On Thu, 20 Feb 2003, Leopold Toetsch wrote:
> > What do people think?
>
> Cool idea -- a lot of optimization-helpers could eventually be passed on
> to the jit (possibly in the metadata?). One thought -- the information
> imcc computes should be platform-independent. e.g. it could pass a
> control flow graph to the JIT, but it probably shouldn't do register
> allocation for a specific number of registers. How much worse do you
> think it would be to have IMCC just rank the Parrot registers in order
> of decreasing spill cost, then have the JIT take the top N, where N is
> the number of available architectural registers?

The registers are already in that order (with -Op or -Oj), so this wouldn't be a problem. Difficulties arise when it comes to the register load/save instructions, which get inserted by imcc in my scheme. These are definitely processor/$arch specific. They depend on the number of mappable (and non-preserved too) registers, and on the state of the op_jit function table.

Of course CFG and register life information could be passed on to the JIT, but this seems a little bit complicated, as JIT has its own sections, which match either a basic block from imcc or are a sequence of non-JITable instructions. But in the long run, it could be a way to go.

OTOH - PBC compatibility is not a big point here when JIT is involved: 99% of the time the code would run on the machine where it was generated. And it would AFAIK be easier to make some JIT crosscompiler. This would basically only need the number of mappable registers and the extcall bits from the jump table, read in from some config file.

leo
Re: Using imcc as JIT optimizer
On Thu, 20 Feb 2003, Leopold Toetsch wrote: > What do people think? Cool idea -- a lot of optimization-helpers could eventually be passed on to the jit (possibly in the metadata?). One thought -- the information imcc computes should be platform-independent. e.g. it could pass a control flow graph to the JIT, but it probably shouldn't do register allocation for a specific number of registers. How much worse do you think it would be to have IMCC just rank the Parrot registers in order of decreasing spill cost, then have the JIT take the top N, where N is the number of available architectural registers? /s
Using imcc as JIT optimizer
Starting from the unbearable fact that optimized compiled C is still faster than parrot -j (in primes.pasm), I did this experiment:

- do register allocation for JIT in imcc
- use the first N registers as MAPped processor registers

Here is the JIT optimized PASM output of

$ imcc -Oj -o p.pasm primes.pasm
$ cat p.pasm

        set ri2, 1
        set I5, 50
        set I4, 0
        print "N primes up to "
        print I5
        print " is: "
        time N1
        set rn1, N1     # load
REDO:
        set ri0, 2
        div ri3, ri2, 2
LOOP:
        cmod ri1, ri2, ri0
        if ri1, OK      # with -O1j unless ri1, NEXT
        branch NEXT     # deleted
OK:                     # deleted
        inc ri0
        le ri0, ri3, LOOP
        inc I4
        set I6, ri2
NEXT:
        inc ri2
        le ri2, I5, REDO
        time N0
        set rn0, N0     # load
        print I4
        print "\nlast is: "
        print I6
        print "\n"
        sub rn0, rn1
        set N0, rn0     # save
        print "Elapsed time: "
        print N0
        print "\n"
        end

The ri? and rn? are processor registers; the above is for intel (4 mapped int/float regs), you can translate the ri? to (%ebx, %edi, %esi, %edx). The processor regs are represented as (-1 - parrot_reg), i.e. %ebx == -1, %edi == -2 ...

The MAP macro in jit_emit.h would then be:

# define MAP(i) ((i) >= 0 ? 0 : ...map_branch[jit_info->op_i -1-(i)])

where the mappings are directly intval_map or floatval_map. JIT wouldn't need any further calculations.

The load/save instructions get inserted by looking at op_jit[].extcall, i.e. if the instruction reads or writes a register, it gets saved/loaded before/after and the parrot register is used instead. (Only the print and time ops are external on i386.)

I currently have the imcc part for some common cases, enough for the above output.

What do people think?

For reference: a similar idea: "Of mops and microops"

leo

PS: -O3 C 3.64s, JIT ~3.55.