Re[2]: factorial: let's get ahead of jhc! :)

2006-03-02 Thread Bulat Ziganshin
Hello Simon,

Friday, February 24, 2006, 7:18:30 PM, you wrote:

 1) add for/if/while to the C-- statement types list (data CmmStmt)

SM Please don't extend the C-- data type - it's very small and simple 
SM because that makes it easy to manipulate and reason about.

SM I don't see why you need to, either: you can already express for, if and
SM while in C-- using conditional branches.  I don't think gcc cares 
SM whether you write your code using labels and goto or while/for/if, it 
SM generates the same code either way.

i tested gcc (3.4.2, included in ghc 6.4.1) and my tests show that gcc
unroll loops only when a loop variable modified directly. This loop
unrolls:

loop: ...
  i=i-1;
  if(i1) goto loop;

and this loop don't unrolls:

loop: ...
  x=i-1;
  i=x;
  if(i1) goto loop;

but using of while/goto don't matters. i attached fac.c that shows
this - here only fac() and fac3() are unrolled


 2) implement recognizer for such simple STG functions (leaf unboxed
 procedures recognizer)
 
 3) implement the translation itself

SM By all means try this.  What you want is to compile recursive functions
SM like this:

SMf() {
SM x = arg1;
SM y = arg2;
SML:
SM ... body of f, with args mapped to x  y,
SM and recursive calls jumping to L passing args in x  y.
SM}

SM It's quite hard to do this as a C-- to C-- optimisation, at least 
SM without implementing a lot of other optimisations that we don't already
SM have.  I was hoping that gcc would do it for us, if we compile code like
SM this:

SMf() {
SML:
SM ... body of f ...
SM goto L;
SM}

SM but sadly it doesn't do the whole job.  (see cmmLoopifyForC in 
SM cmm/CmmOpt.hs, which I added today).

SM So you might try the hacky way of doing this transformation at a higher
SM level, when generating the C-- in the first place.  Putting the args in
SM temporaries is the easy bit; generating different code for the recursive
SM call will be more tricky.

i just downloaded fresh sources. cmmLoopifyForC may be not what we
need. the whole idea of transformation is to move f() parameters to
local variables, and then update contents of these local variables and
perform `goto` on each recursive call. why cmmLoopifyForC is not good
for this? only because before processing by this function the code
should look like this:

  f() {
   x = arg1;
   y = arg2;
  L:
   ... body of f, with args mapped to x  y,
   x = newarg1;
   y = newarg2
   jump f();
  }

what is just not proper code, especially if xy is local f()
variables. so to use this function we should start from generating
improper code and anyway we should change STG-C-- code generation


i think that the discussed optimization can be done in another way:

1) in function mkFunEntryCode add check that f() is a leaf unboxed
function

2) if it's true then replace call to bindArgsToStack with generation
of x=sp[0]; y=sp[4]; L: assignments and bind corresponding args to
xy (instead of binding them to sp[0]/sp[4])

3) in cgTailCall function check that it's a self-recursive call and
in this case generate sequence x=newarg1; y=newarg2; goto L instead
of jump f(). i think that this check will require adding info about
f() to the FCode monad. so, the whole story should look like this:

CgClosure.lhs:
-  ; bindArgsToStack stk_args
+  ; if isLeafUnboxedFunction cl_info body
+  then do local_args - makeNewLocals (length stk_args)  -- generate xy 
locals
+  generateLoads local_args stk_args-- generate x=sp[0]; 
y=sp[4] assignments
+  lbl - generateNewLabel  -- generate L:
+  bindStkArgstoLocals local_args stk_args  -- bind arguments to 
xy local variables
+  modifyFCodeState (Just cl_info) lbl  -- put in FCode monad 
state info about
+   --  now-processed leaf 
function and label
+   --  marking start of 
its self-recursive cycle
+  else do bindArgsToStack stk_args
+  modifyFCodeState Nothing undefined

CgTailCall.lhs:
-  = do  { fun_info - getCgIdInfo fun
+  = do  { fun_info - getCgIdInfo fun
+; (optimized_fun, lbl) - getFromFCodeState-- 
optimized_fun/=Nothing only if we
+   --  process leaf 
unboxed function

-  else -- Normal case, fun is boxed
+  else if (Just fun == optimized_fun) then
+   do generateAssignments-- generate x=newarg1; y=newarg2
+  generateGoto lbl   -- generate goto L
+  else -- Normal case, fun is boxed

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

fac.c
Description: Binary data
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: factorial: let's get ahead of jhc! :)

2006-03-01 Thread Simon Marlow
FWIW, here's the inner loop of the accumulating parameter factorial 
compiled with yesterday's HEAD on x86_64, via C:


Fac_zdwfac_info:
movq%rsi, %rax
testq   %rsi, %rsi
jne .L4
movq%rdi, %rbx
jmp *(%rbp)
.L4:
leaq-1(%rsi), %rsi
imulq   %rax, %rdi
jmp Fac_zdwfac_info

It's not perfect, but note the absence of memory operations.

The NCG version is similar, but has a couple of extra reg-to-reg moves 
(we need to beef up the NCG optimisations a bit).


Cheers,
Simon
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


factorial: let's get ahead of jhc! :)

2006-02-24 Thread Bulat Ziganshin
Hello glasgow-haskell-users,

i propose to do for start rather small change that will allow to
make things go several times faster and in particular outperform jhc
for the small leaf loops (i.e. loops that use only GHC primitives
and don't call other functions). that include factorial, a[]+=b[i]
loop that was discussed here several weeks ago, some shootout
examples, my own arrays serialization code and so on, so on

the idea is that the following STG code

f :: Int# - Double# - ... - Int#

(i.e. function use only unboxed values/results)

f x y z = code1 in
  case expr1 of
A - codeA
B - codeB
C - codeC in f x_c y_c z_c
D - codeD in f x_d y_d z_d
_ - code0 in f x0 y0 z0

can be compiled to the following C/C-- code:

f() {
  x = sp[0];
  y = sp[4];
  z = sp[8];
  sp+=12;

  code1;
  while (expr1!=A) {
if (expr1==B) then return codeB;
else if (expr1==C) then {x=x_c; y=y_c; z=z_c;}
else if (expr1==D) then {x=x_d; y=y_d; z=z_d;}
else {x=x0; y=y0; z=z0;}
code1;
  }
  return codeA;
}

this compilation don't require any changes in GHC memory model. all we
need:

1) add for/if/while to the C-- statement types list (data CmmStmt)

2) implement recognizer for such simple STG functions (leaf unboxed
procedures recognizer)

3) implement the translation itself

as i said, it's should be no more than one day of work. i even think
that it's one day for me and one hour for Simon Marlow :)  Simon, how
about this? i can even make the patches over current 6.6 sources and
you will apply them at morning and we will see whether it work :) i
yet never tried to rebuild the whole ghc :)

-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re[3]: factorial: let's get ahead of jhc! :)

2006-02-24 Thread Bulat Ziganshin
Hello Bulat,

Friday, February 24, 2006, 6:37:42 PM, you wrote:

SPJ Perhaps you may consider doing this transformation on the C-- data type
SPJ only, without involving the (already very complicated) STG - C-- code
SPJ generator?

i have found my investiations in this area. that is the C-- code
generated for fac worker:

Fac_zdwfac_entry() {
c1C0:
_s1BD = I32[Sp + 0];
if (_s1BD != 1) goto c1C4;
R1 = I32[Sp + 4];
Sp = Sp + 8;
jump (I32[Sp + 0]);
c1C4:
_s1BI = _s1BD * I32[Sp + 4];
_s1BF = _s1BD - 1;
I32[Sp + 4] = _s1BI;
I32[Sp + 0] = _s1BF;
jump c1C0;
}

first, we convert jump to the explicit loop:

_s1BD = I32[Sp + 0];
while (_s1BD != 1) {
_s1BI = _s1BD * I32[Sp + 4];
_s1BF = _s1BD - 1;
I32[Sp + 4] = _s1BI;
I32[Sp + 0] = _s1BF;
_s1BD = I32[Sp + 0];
}
R1 = I32[Sp + 4];
Sp = Sp + 8;
jump (I32[Sp + 0]);

then, we cache contents of sp[*] variables in the local ones:

sp4 = I32[Sp + 4];
sp0 = I32[Sp + 0];
_s1BD = sp0;
while (_s1BD != 1) {
_s1BI = _s1BD * sp4;
_s1BF = _s1BD - 1;
sp4 = _s1BI;
sp0 = _s1BF;
_s1BD = sp0;
}
I32[Sp + 4] = sp4;
I32[Sp + 0] = sp0;
R1 = I32[Sp + 4];
Sp = Sp + 8;
jump (I32[Sp + 0]);

and then we wipe out all the superfluous variables:

sp4 = I32[Sp + 4];
sp0 = I32[Sp + 0];
while (sp0 != 1) {
sp4 = sp0 * sp4;
sp0 = sp0 - 1;
}
R1 = sp4;
Sp = Sp + 8;
jump (I32[Sp + 0]);

and it is even for simple fac() function!!! instead of straightforward
generation of code that we need we should make all these nontrivial
studying and hard transformations on already generated code. on the
other side, we can just generate the following code right from the
STG:

int fac () {
sp4 = I32[Sp + 4];
sp0 = I32[Sp + 0];
while (sp0 != 1) {
sp4 = sp0 * sp4;
sp0 = sp0 - 1;
}
R1 = sp4;
Sp = Sp + 8;
jump (I32[Sp + 0]);
}

i think that my way is 100x simpler

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: factorial: let's get ahead of jhc! :)

2006-02-24 Thread Simon Marlow

Bulat Ziganshin wrote:


i propose to do for start rather small change that will allow to
make things go several times faster and in particular outperform jhc
for the small leaf loops (i.e. loops that use only GHC primitives
and don't call other functions). that include factorial, a[]+=b[i]
loop that was discussed here several weeks ago, some shootout
examples, my own arrays serialization code and so on, so on

the idea is that the following STG code

f :: Int# - Double# - ... - Int#

(i.e. function use only unboxed values/results)

f x y z = code1 in
  case expr1 of
A - codeA
B - codeB
C - codeC in f x_c y_c z_c
D - codeD in f x_d y_d z_d
_ - code0 in f x0 y0 z0

can be compiled to the following C/C-- code:

f() {
  x = sp[0];
  y = sp[4];
  z = sp[8];
  sp+=12;

  code1;
  while (expr1!=A) {
if (expr1==B) then return codeB;
else if (expr1==C) then {x=x_c; y=y_c; z=z_c;}
else if (expr1==D) then {x=x_d; y=y_d; z=z_d;}
else {x=x0; y=y0; z=z0;}
code1;
  }
  return codeA;
}

this compilation don't require any changes in GHC memory model. all we
need:

1) add for/if/while to the C-- statement types list (data CmmStmt)


Please don't extend the C-- data type - it's very small and simple 
because that makes it easy to manipulate and reason about.


I don't see why you need to, either: you can already express for, if and 
while in C-- using conditional branches.  I don't think gcc cares 
whether you write your code using labels and goto or while/for/if, it 
generates the same code either way.



2) implement recognizer for such simple STG functions (leaf unboxed
procedures recognizer)

3) implement the translation itself


By all means try this.  What you want is to compile recursive functions 
like this:


  f() {
   x = arg1;
   y = arg2;
  L:
   ... body of f, with args mapped to x  y,
   and recursive calls jumping to L passing args in x  y.
  }

It's quite hard to do this as a C-- to C-- optimisation, at least 
without implementing a lot of other optimisations that we don't already 
have.  I was hoping that gcc would do it for us, if we compile code like 
this:


  f() {
  L:
   ... body of f ...
   goto L;
  }

but sadly it doesn't do the whole job.  (see cmmLoopifyForC in 
cmm/CmmOpt.hs, which I added today).


So you might try the hacky way of doing this transformation at a higher 
level, when generating the C-- in the first place.  Putting the args in 
temporaries is the easy bit; generating different code for the recursive 
call will be more tricky.


Cheers,
Simon
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


RE: Re[3]: factorial: let's get ahead of jhc! :)

2006-02-24 Thread Simon Peyton-Jones

| i have found my investiations in this area. that is the C-- code
| generated for fac worker:
| 
| Fac_zdwfac_entry() {
| c1C0:
| _s1BD = I32[Sp + 0];
| if (_s1BD != 1) goto c1C4;
| R1 = I32[Sp + 4];
| Sp = Sp + 8;
| jump (I32[Sp + 0]);
| c1C4:
| _s1BI = _s1BD * I32[Sp + 4];
| _s1BF = _s1BD - 1;
| I32[Sp + 4] = _s1BI;
| I32[Sp + 0] = _s1BF;
| jump c1C0;
| }

Once we do the A/B split of the code generator that I referred to, we
should get something like this from the A part

fac(n,m) {
  if n!=1 goto L
  return(m)
L: jump fac( n-1, n*m ) 
}

The B part will introduce the Sp nonsense.

I think you'll find this a much easier optimisation target.

Simon
-users
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users