Re: [drlvm][jit][ia-32]register-based fast calling convention

2006-11-17 Thread Egor Pasko
On the 0x224 day of Apache Harmony Alex Astapchuk wrote:
 Hi Egor,
 
 Thanks for your reply. Please, find my answers inlined.
 
 Egor Pasko wrote:
  On the 0x222 day of Apache Harmony Alex Astapchuk wrote:
  Hi all,
 
  Among other things listed on the JIT Dev tasks, there is a need for
  calling convention (CC) fix-up for IA-32 [1].
 
  Current problems are:
 
  1. The calling convention(s) used are stack-based - this adds a memory
  access overhead on calls.
  2. The convention currently used for managed code neither allows passing
  floating-point values in XMM registers nor provides callee-saved XMM
  registers.
  3. The FPU stack is used to return float/double values.
 
 
  Both 2) and 3) hurt register allocation for floating-point values.
  Fixing even 1) alone looks promising for hot VM helpers like monitor
  enter/exit and resolve_interface_vtable.
 
  So, I'm going to implement register-based calling convention for IA-32.
 
  The current proposal is:
   - make it possible to switch between existing and new conventions
     for investigation and tuning purposes
   - implement 2 calling conventions:
     1. the well-known standard fastcall (first 2 params in ECX+EDX, the
        rest on the stack)
     2. a DRLVM-specific convention, which uses ECX, EDX (and maybe
        EAX) for integer parameter passing, uses XMMs for
        floating-point parameters, and provides callee-save XMMs.
 
  #1 may be used to call internal C-based helpers. It may also be used
  to call VM helpers where XMM callee-save regs may add unnecessary
  overhead on the helper itself. The example I can think of is the
  resolve_interface helper - preserving XMMs there looks like overkill.
 
  Alex, is there some mechanism to annotate helpers with calling
  conventions that you would prefer? Or are you going to hardcode
 
 Agh... Good question. And I don't have the right answer now.
 
 I'm going to make the switch between the old and new conventions
 controllable from the command line, but that's almost all I can do in
 the current environment.

That would be good. Need to keep versions of compiled native calls for
various calling conventions, heh?

 The helpers' info, such as signatures and calling conventions used,
 is a long-running headache.

:)

 What I'm going to implement is quite orthogonal to how the info may be
 passed between VM and JIT. I'm only going to support the possibility
 of calling convention usage.

...yes, this is orthogonal, but needs TBD for configurability and
completeness of the design solution. We can return to this as soon as
your performance experiments show up.

 The helpers' info may be related to Mikhail's work on helper inlining.
 I recall some discussions about Java-based annotations that may be
 used to describe helpers (not only for inlining, but in general),
 including the convention used, the library/module location, etc.

again, we should collect the approaches and decide

  #2 will help to speed up managed code, both call-intensive and (I hope)
  FP-intensive - together with register allocator tuning.
 
  I would REALLY love to see it implemented!! It is a long-awaited
  performance feature. FP performance of DRLVM is poor compared to
  HotSpot, and the most probable reason for that is problem 2 above.
  A microbenchmark would be great to have. I would also be happy to see
  the whole design proposal here on the mailing list. Is it possible?
 
 Sure, I'll do the micro benchmark.
 
 I don't have a design proposal, since I'm not going to change the design.
 I'm only extending the existing functionality a bit.

By design proposal I mean something you should not be afraid of. The
list of tuning parameters and that kind of stuff.

 Mostly, the requirements I'm going to meet are described in my answer
 to Rana - there are things that will be tunable there.

-- 
Egor Pasko



Re: [drlvm][jit][ia-32]register-based fast calling convention

2006-11-16 Thread Slava Shakin
Alex,

It's great you're going to do that. I like the proposal.

 - make it possible to switch between existing and new conventions
 for investigation and tuning purposes

I think such configurability is a very important feature as we lose nothing 
but acquire both more opportunities for tuning and some performance win even 
with the current optimization set. IMO it is important to expose as many 
options for tuning as possible because even if the proposal doesn't 
immediately bring considerable boost we might well expect more from synergy 
with future optimizations and non-default options of existing ones.

--
Thanks,
Slava Shakin.


 





Re: [drlvm][jit][ia-32]register-based fast calling convention

2006-11-16 Thread Pavel Ozhdikhin

Good proposal, Alex! Do you know whether other VMs use a register-based fast
calling convention, and what gain we can get from it? Can we see that
using micro-benchmarks?

Thanks,
Pavel











Re: [drlvm][jit][ia-32]register-based fast calling convention

2006-11-16 Thread Egor Pasko
On the 0x222 day of Apache Harmony Alex Astapchuk wrote:
 Hi all,
 
 Among other things listed on the JIT Dev tasks, there is a need for
 calling convention (CC) fix-up for IA-32 [1].
 
 Current problems are:
 
 1. The calling convention(s) used are stack-based - this adds a memory
 access overhead on calls.
 2. The convention currently used for managed code neither allows passing
 floating-point values in XMM registers nor provides callee-saved XMM
 registers.
 3. The FPU stack is used to return float/double values.
 
 
 Both 2) and 3) hurt register allocation for floating-point values.
 Fixing even 1) alone looks promising for hot VM helpers like monitor
 enter/exit and resolve_interface_vtable.
 
 So, I'm going to implement register-based calling convention for IA-32.
 
 The current proposal is:
  - make it possible to switch between existing and new conventions
    for investigation and tuning purposes
  - implement 2 calling conventions:
    1. the well-known standard fastcall (first 2 params in ECX+EDX, the
       rest on the stack)
    2. a DRLVM-specific convention, which uses ECX, EDX (and maybe
       EAX) for integer parameter passing, uses XMMs for
       floating-point parameters, and provides callee-save XMMs.
   
 #1 may be used to call internal C-based helpers. It may also be used
 to call VM helpers where XMM callee-save regs may add unnecessary
 overhead on the helper itself. The example I can think of is the
 resolve_interface helper - preserving XMMs there looks like overkill.

Alex, is there some mechanism to annotate helpers with calling
conventions that you would prefer? Or are you going to hardcode

 #2 will help to speed up managed code, both call-intensive and (I hope)
 FP-intensive - together with register allocator tuning.

I would REALLY love to see it implemented!! It is a long-awaited
performance feature. FP performance of DRLVM is poor if compared to
HotSpot, and the most probable reason for that is problem-2 above.
A microbenchmark would be great to have. I would be also happy to see
the whole design proposal here in the mailing list. Is it possible?

 
 Any comments are welcome.
 
 
 [1]
 http://wiki.apache.org/harmony/JIT_Development_Tasks#head-bffdfbc80108641ca9a8bc29ea871c67fb3b82b9
 
 
 -- 
 Thanks,
Alex
 
 

-- 
Egor Pasko



Re: [drlvm][jit][ia-32]register-based fast calling convention

2006-11-16 Thread Rana Dasgupta

Hi Alex,
   This is good, thanks. Please see below...


On 11/15/06, Alex Astapchuk [EMAIL PROTECTED]  wrote:


Hi all,

Among other things listed on the JIT Dev tasks, there is a need for
calling convention (CC) fix-up for IA-32 [1].

Current problems are:

1. The calling convention(s) used are stack-based - this adds a memory
access overhead on calls.
2. The convention currently used for managed code neither allows passing
floating-point values in XMM registers nor provides callee-saved XMM
registers.
3. The FPU stack is used to return float/double values.

So, I'm going to implement register-based calling convention for IA-32.

The current proposal is:
- make it possible to switch between existing and new conventions
   for investigation and tuning purposes



 So does this mean one specific convention, fastcall, for C helpers and a
second custom DRLVM convention for managed code?


   - implement 2 calling conventions:
   1. the well-known standard fastcall (first 2 params in ECX+EDX, the
   rest on the stack)
   2. a DRLVM-specific convention, which uses ECX, EDX (and maybe
   EAX) for integer parameter passing, uses XMMs for
   floating-point parameters, and provides callee-save XMMs.



Passing a bounded number of fp args using XMM sounds like a good idea, but
why callee-saved XMMs? My recollection is that the Intel Software
Development Manual recommends caller-saved SSE and SSE2 registers for
performance, primarily because there are all kinds of optimized move
instructions to and from XMM registers, like MOVAPS, MOVUPS, MOVAPD, MOVDQA,
etc., for packed/unpacked, single/double precision fp types. The callee does
not know the datatype in a register. The caller can save only what it wants
to preserve, using the best move. My recollection is that the unaligned move
penalties are high.

I did not fully understand your comment about the resolve_interface()
helper. In the custom convention (2), is the proposal for all XMM registers
to be saved by the callee, even if there are no fp operands in the method?

Thanks,
Rana


Re: [drlvm][jit][ia-32]register-based fast calling convention

2006-11-16 Thread Alex Astapchuk

Hi Rana,

Thank you for your comments. Please, find my answers inlined.

Rana Dasgupta wrote:

Hi Alex,
   This is good, thanks. Please see below...


On 11/15/06, Alex Astapchuk [EMAIL PROTECTED]  wrote:


Hi all,

Among other things listed on the JIT Dev tasks, there is a need for
calling convention (CC) fix-up for IA-32 [1].

Current problems are:

1. The calling convention(s) used are stack-based - this adds a memory
access overhead on calls.
2. The convention currently used for managed code neither allows passing
floating-point values in XMM registers nor provides callee-saved XMM
registers.
3. The FPU stack is used to return float/double values.

So, I'm going to implement register-based calling convention for IA-32.

The current proposal is:
- make it possible to switch between existing and new conventions
   for investigation and tuning purposes



 So does this mean one specific convention, fastcall, for C helpers and a
second custom DRLVM convention for managed code?


Right.
I'm going to implement both - the IA-32 fastcall and one more,
new convention.
The fastcall is indeed *primarily targeted* at C-based helpers - the
easiest way is to declare a function as '__fastcall' and let the compiler
do the rest of the job.
Despite its target, the fastcall can still be used for managed code
if we find it productive.


The reason behind the 'custom' convention is that I'm going to make it 
tunable - to see how it fits into different workloads.


The parameters that I'm going to make changeable are: number of GP 
registers for args, number of XMM registers for args, number of 
callee-save XMMs.
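
To make the three tunable parameters concrete, here is a minimal sketch of how such a convention could be described and applied to a signature. It is an illustration only: the names CallConv, assign_args, FASTCALL and CUSTOM are hypothetical and not taken from the DRLVM sources, the register counts in CUSTOM are just one possible setting, and arguments are simplified to two kinds, 'int' and 'fp'.

```python
from dataclasses import dataclass


# Hypothetical names throughout; nothing here is taken from the DRLVM sources.
@dataclass(frozen=True)
class CallConv:
    gp_arg_regs: tuple       # GP registers used for integer/reference args
    xmm_arg_regs: tuple      # XMM registers used for float/double args
    callee_save_xmm: tuple   # XMM registers the callee must preserve


def assign_args(conv, kinds):
    """Map each argument kind ('int' or 'fp') to a register name or 'stack'."""
    gp = list(conv.gp_arg_regs)
    xmm = list(conv.xmm_arg_regs)
    locations = []
    for kind in kinds:
        if kind not in ("int", "fp"):
            raise ValueError(f"unknown argument kind: {kind}")
        pool = gp if kind == "int" else xmm
        # Take the next free register of the matching class, else use the stack.
        locations.append(pool.pop(0) if pool else "stack")
    return locations


# Fastcall-like preset: two integer args in ECX and EDX, everything else on the stack.
FASTCALL = CallConv(("ecx", "edx"), (), ())

# One possible tuning of the custom convention (register counts are illustrative).
CUSTOM = CallConv(("ecx", "edx", "eax"),
                  ("xmm0", "xmm1", "xmm2", "xmm3"),
                  ("xmm6", "xmm7"))
```

With this model, switching conventions or changing a register count is only a change of data, which is the kind of tunability described above.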




   - implement 2 calling conventions:
   1. the well-known standard fastcall (first 2 params in ECX+EDX, the
   rest on the stack)
   2. a DRLVM-specific convention, which uses ECX, EDX (and maybe
   EAX) for integer parameter passing, uses XMMs for
   floating-point parameters, and provides callee-save XMMs.



Passing a bounded number of fp args using XMM sounds like a good idea, but
why callee-saved XMMs? My recollection is that the Intel Software
Development Manual recommends caller-saved SSE and SSE2 registers for
performance, primarily because there are all kinds of optimized move
instructions to and from XMM registers, like MOVAPS, MOVUPS, MOVAPD, MOVDQA,
etc., for packed/unpacked, single/double precision fp types. The callee
does not know the datatype in a register. The caller can save only what it
wants to preserve, using the best move. My recollection is that the
unaligned move penalties are high.


The optimization guide's recommendation addresses the very generic case.
In a program that mixes the full wealth of SSE/SSE2, the guide's
recommendations may be the best choice.


In our particular case, we completely control the managed code and its
behavior, so we may play with more fine-grained control.
For example, we currently neither use packed values nor do
anything with the full 128 bits. So we may relax the requirement to
preserving only the lower 64 bits - even a simple MOVQ should fit well.
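
The point about the lower 64 bits can be illustrated with a small model. This is only a sketch on byte strings, not JIT code: save_low64 and restore_low64 are hypothetical names that mimic what a MOVQ store and a MOVQ load do to a 128-bit register image (the store writes the low quadword, the load zeroes the upper one).

```python
import struct

XMM_BYTES = 16  # an XMM register is 128 bits wide


def save_low64(xmm_image: bytes) -> bytes:
    """Keep only the low 64 bits, as a MOVQ store of the register would."""
    assert len(xmm_image) == XMM_BYTES
    return xmm_image[:8]  # little-endian image: the low half comes first


def restore_low64(saved: bytes) -> bytes:
    """Rebuild a register image from the saved low half; the upper half is zeroed."""
    return saved + bytes(8)


# A scalar double lives in the low 64 bits; the upper bits here are don't-care junk.
before = struct.pack("<d", 3.5) + b"\xaa" * 8
after = restore_low64(save_low64(before))
```

The scalar double survives the round trip even though the upper half is lost, which is acceptable as long as the managed code never keeps packed or 128-bit values in these registers.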


The caller knows the type, but the callee knows whether it changes a 
particular register - the main reason to play with callee-save XMMs is 
*to avoid the need for saving at all*.


Currently, FP-intensive code must spill every XMM register in use
before a call, even if the XMM registers are not touched in the callee.


This is what we would like to avoid - the unnecessary spill code and 
memory accesses.
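
A rough way to compare the two schemes is to count register saves around a single call. The sketch below is a simplified model under stated assumptions: the function names are hypothetical, each save implies one matching restore, and a callee is assumed to save a callee-save register only if it actually writes it.

```python
def saves_caller_save(live_across_call):
    """Caller-save: every XMM register live across the call is spilled by the caller."""
    return len(set(live_across_call))


def saves_callee_save(live_across_call, callee_clobbers, callee_saved):
    """Mixed scheme: registers in callee_saved are preserved by the callee,
    and only when the callee writes them."""
    live = set(live_across_call)
    saved = set(callee_saved)
    caller_spills = live - saved                   # live values in caller-save registers
    callee_spills = set(callee_clobbers) & saved   # callee-save registers the callee writes
    return len(caller_spills) + len(callee_spills)


CALLEE_SAVED = ["xmm4", "xmm5", "xmm6", "xmm7"]   # illustrative choice
LIVE = ["xmm4", "xmm5", "xmm6", "xmm7"]           # caller keeps four FP values live

# A helper that never touches XMMs (the resolve_interface-like case):
no_fp_helper = saves_callee_save(LIVE, [], CALLEE_SAVED)

# An FP-heavy callee that writes xmm0, xmm4 and xmm5:
fp_callee = saves_callee_save(LIVE, ["xmm0", "xmm4", "xmm5"], CALLEE_SAVED)
```

In this model the caller-save scheme costs four saves for the call, while the callee-save scheme costs none for the helper that does not touch XMMs and two for the FP-heavy callee. The reverse case also exists: with nothing live in the caller and a callee that writes all four callee-save registers, callee-save costs four saves and caller-save costs none, which is why a tunable count is useful for measurement.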


Also, I'm going to make this parameter (the number of callee-save XMM
registers) tunable. If we find it hurts anything, we'll switch it off.



I  did not fully understand your comment about the resolve_interface()
helper. In the custom convention(2), is the proposal for all XMM registers
to be saved by the callee, even if there are no fp operands in the method?


Sorry for not being clear.
Actually, the proposal is exactly opposite. :-)

I mentioned resolve_interface() as the example of code where the XMMs 
[most likely] are not touched so there is no need to spill them.



--
Thanks,
  Alex



Re: [drlvm][jit][ia-32]register-based fast calling convention

2006-11-16 Thread Alex Astapchuk

Hi Pavel,

Thank you for your interest.

Pavel Ozhdikhin wrote:
Do you know if other VM use register-based fast calling convention and
what gain we can get from it? Can we see that using micro-benchmarks?

Well, I guess these are quite low-level implementation details, and I
have not heard much about them. Some sources [1][2] show that HotSpot uses
some kind of register-based convention, but without many details.


As for the gain - well, it's hard to predict, and with all the
standard disclaimers like YMMV :-), I would expect that eliminating the
PUSH/POP overhead may add about 20%.

Again, on some workloads and YMMV. :-)

This is why I'm going to make the things tunable - to measure the exact 
gain when it's implemented.



[1] HotSpot porting guide, though it's actually about ARM and CLDC.
http://java.sun.com/javame/reference/docs/cldc-hi-1.1.3-web/doc/porting/html/ARM-FP.html
[2] This one is about server HotSpot. Some notes in the global register
allocator section (12) lead me to think they use registers for passing
arguments.
http://www.usenix.org/events/jvm01/full_papers/paleczny/paleczny.pdf


--
Thanks,
  Alex


















Re: [drlvm][jit][ia-32]register-based fast calling convention

2006-11-16 Thread Rana Dasgupta

Thanks for the clarifications Alex.

On 11/16/06, Alex Astapchuk [EMAIL PROTECTED] wrote:


Rana Dasgupta wrote:

 So does this mean one specific convention, fastcall, for C helpers and a
 second custom DRLVM convention for managed code?

Right.
I'm going to implement both - the IA-32 fastcall and one more,
new convention.
The fastcall is indeed *primarily targeted* at C-based helpers - the
easiest way is to declare a function as '__fastcall' and let the compiler
do the rest of the job.



I see, so complete __fastcall support then... return in EDX:EAX, preserve
EDI, ESI, EBP, EBX, etc., and leave it to the compiler. Makes sense.


Despite its target, the fastcall can still be used for managed code
if we find it productive.



The reason behind the 'custom' convention is that I'm going to make it
tunable - to see how it fits into different workloads.

The parameters that I'm going to make changeable are: number of GP
registers for args, number of XMM registers for args, number of
callee-save XMMs.



Tunable only during experimentation, or expose a tunable knob and/or
multiple annotation choices? One thing to remember is that the existing
compiler calling conventions have been arrived at in almost exactly the same
way: trying various options across a broad range of applications and
choosing the best ones. So some of this work has been done upfront for us.
Another thing to note is that on x64 (at least on Windows) there is a
single __fastcall convention... almost identical to the ABI. A single,
efficient convention may sound limiting, but is great for debuggability, for
example.




 Passing a bounded number of fp args using XMM sounds like a good idea, but
 why callee-saved XMMs? My recollection is that the Intel Software
 Development Manual recommends caller-saved SSE and SSE2 registers for
 performance, primarily because there are all kinds of optimized move
 instructions to and from XMM registers, like MOVAPS, MOVUPS, MOVAPD, MOVDQA,
 etc., for packed/unpacked, single/double precision fp types. The callee
 does not know the datatype in a register. The caller can save only what it
 wants to preserve, using the best move. My recollection is that the
 unaligned move penalties are high.

The optimization guide recommends on the very generic case.
In a program that mixes all the wealth of SSE/SSE2 the guide
recommendations may be the best choice.

In our particular case, we completely control the managed code and its
behavior so we may play with more fine grained control.
For example, we currently neither use packed values nor do
anything with the full 128 bits. So we may relax the requirement to
preserving only the lower 64 bits - even a simple MOVQ should fit well.



We don't yet have a good grasp of all the application types we are dealing
with. Remember that codegen for some well known benchmarks may not provide
all the data. However, MOVQ for the lower 64 is reasonable to start with, I
agree.


The caller knows the type, but the callee knows whether it changes a
particular register - the main reason to play with callee-save XMMs is
*to avoid the need for saving at all*.

Currently, FP-intensive code must spill every XMM register in use
before a call, even if the XMM registers are not touched in the callee.

This is what we would like to avoid - the unnecessary spill code and
memory accesses.

Also, I'm going to make this parameter (the number of callee-save XMM
registers) tunable. If we find it hurts anything, we'll switch it off.



One could also come up with a reverse argument... the caller needs no state
to be preserved (it already saves the parameter XMM registers anyway) and
the callee does a lot of unnecessary work :-) But making this tunable is a
good idea till we know.


 I did not fully understand your comment about the resolve_interface()
 helper. In the custom convention (2), is the proposal for all XMM registers
 to be saved by the callee, even if there are no fp operands in the method?

Sorry for not being clear.
Actually, the proposal is exactly opposite. :-)



:-) Good, thanks.