Re: inside the GHC code generator

2006-03-12 Thread Lemmih
On 2/24/06, Simon Peyton-Jones [EMAIL PROTECTED] wrote:
 | last days i studied GHC code generator, reading -ddumps of all sorts,
 | GHC sources, various papers and John's appeals :)
 |
 | what i carried from this investigation:
 |
 | GHC has great high-level optimization. moreover, GHC now allows one to
 | program STG almost directly. when i look at STG code i don't see
 | anything that can be done better - at least for these simple loops i
 | tried to compile. i see just unboxed variables, primitive operations
 | and simple loops represented as tail recursion. fine.
 |
 | then STG is compiled to simplified C (called abstract C earlier and
 | quasi C-- now), which can next be translated:
 |
 | * to real C and then compiled by gcc
 | * to assembler by the built-in simple C-- compiler
 | * to assembler by some external C-- compiler (at least this is
 | theoretically possible)

 Good stuff Bulat.  There's plenty of interesting stuff to be done here.

 However, let me strongly urge you *not* to focus attention primarily on
 the gcc route.  Compiling via C has received a lot of attention over the
 years, and there are many papers describing cool hacks for doing so.
 GHC does not do as well as it could.

 But there are serious obstacles.  That's not gcc's fault -- it wasn't
 designed for this.  Accurate GC is one of them, tail calls is another,
 and there are plenty more smaller things that bite you only after you've
 invested a lot of time.  This way lies madness.

 C-- was *designed* for this purpose.  GHC uses C-- as its intermediate
 language (just before emitting C).  So a good route is this:

 * Write C-- to C-- optimisations

 * Then, if you like, translate that code to C.  Already you will be
 doing better than GHC does today, because the C-- to C-- optimiser will
 let you generate better C

 * But you can also emit C--, or native code; both of these paths will
 directly benefit from your C-- optimisations.

 The point is that none of this relies on Quick C--; you can always use
 an alternative back end.



 You can almost do this today. GHC uses C-- as an intermediate language.
 But alas, GHC's code generator does not take advantage of C--'s native
 calls or parameter passing.  Instead, it pushes things on an auxiliary
 stack etc.  (Those Sp memory operations you see.)  This isn't necessary.
 We are planning to split GHC's code generator into two parts
 A) Generate C-- with native calls, with an implicit C-- stack
 B) Perform CPS conversion, to eliminate all calls in favour of
 jumps
 using an explicit stack
 The current code gen is A followed by B.  But A is a much more suitable
 optimisation platform, and gives more flexibility.

 Chris Thompson, an undergrad at Cambridge, is doing (B) as his
 undergrad project, although it remains to be seen whether he'll have
 enough complete to be usable.

 Another shortcoming is that the native code generator in GHC isn't
 capable of dealing with backward jumps to labels (because GHC hasn't
 needed that so far).  But if you did C-- optimisation, you'd probably
 generate such jumps.  It'd be great to beef up the native code gen to
 handle that.


 Many of the optimisations you describe (perhaps all) are readily
 expressible in the C-- intermediate language, and by working at that
 level you will be independent of whether the back end is gcc, a native code
 generator, or Quick C--, or some other C-- compiler.  Much better.

What optimizations are we talking about here? The loop optimizations
that Bulat implicitly proposed would only affect recursion over
unboxed arguments and, since that's fairly rare, wouldn't give Joe
Hacker any noticeable speed-up.
Are we at the end of what we can get without whole-program
optimization, or are there other optimizations that apply to C--
representing a lazy PL?

--
Friendly,
  Lemmih
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re[2]: inside the GHC code generator

2006-02-28 Thread Bulat Ziganshin
Hello Simon,

Friday, February 24, 2006, 12:52:36 PM, you wrote:
SM Wow!  You make a lot of good suggestions.  I think some of them are
SM unfortunately unworkable, some others have been tried before, but there
SM are definitely some to try.  I'll suggest a few more in this email.

can you comment on the unworkable/hard-to-implement ones? maybe GHC
users can propose solutions for the ones you skipped

SM First of all I should say that we don't *want* to use gcc as a code 
SM generator.  Everything we've been doing in the back end over the last 
SM few years has been aiming towards dropping the dependency on gcc and 
SM removing the Evil Mangler.  Now is not the time to be spending a lot of
SM effort in beefing up the C back end :-)  Our preference is to work on 
SM both the native and C-- back ends; the C-- back end has the potential to
SM produce the best code and be the most portable.  We also plan to keep 
SM the unregisterised C back end for the purposes of porting to a new 
SM platform, it's only the registerised path we want to lose.

i know this is the hardest part of the discussion, but sometimes we
need to revisit our decisions. i think it's better to compare the
efforts/results of the gcc and C-- routes starting from the current
state of ghc, without counting the previous efforts :(  on the c--
side:

1) time/efforts required to implement many low-level optimizations
2) more or less efficient code

on the gcc side:

1) time required to implement generation of native C code instead of
current portable asm
2) one of the best optimization possible

the gcc route already exists and works, while the C-- route is still
unfinished even in its basic, unoptimized state. moreover, the
unregisterised C route provides the maximum possible level of portability


SM Having said all that, if we can get better code out of the C back end
SM without too much effort, then that's certainly worthwhile.

moreover, many of the changes proposed in this thread are perfectly
implementable on the C-- route as well

 translated to something like this
 
 factorial:
 _s1BD = *(Sp + 0);
 if (_s1BD != 1) goto c1C4; // n!=1 ?
 R1 = *(Sp + 4);
 Sp = Sp + 8;
 return (*(Sp + 0))();  // return from factorial
 c1C4:
 _s1BI = _s1BD * (*(Sp + 4));   // n*r
 _s1BF = _s1BD - 1; // n-1
 *(Sp + 4) = _s1BI;
 *(Sp + 0) = _s1BF;
 goto factorial;

SM To be fair, this is only the translation on an architecture that has no
SM argument registers (x86, and currently x86_64 but hopefully not for long).

this just emphasizes the main problem of the portable asm code generated
by ghc - you can't optimize it as well as a C compiler optimizes idiomatic
C code. x86 has 7 registers, after all, and gcc generates great code
for this procedure - but only if it is written in an idiomatic manner.
you would need to develop your own optimization techniques, and that will
be a hard and long road compared to just generating more high-level C
code

just to compare - i've rewritten my binary serialization library two
times. the first version just used string concatenation (!); the
second was carefully optimized using my C experience.
nevertheless, it was still slow. the final, third version was optimized
according to ghc's own features, and the result was astonishing

portable asm looks great in theory, and C-- as a portable low-level
language is a great theoretical idea. but reality has its own laws. want
to know why my second serialization version, highly optimized according
to C experience, was so slow? it just returned tuples, and
constructing/analyzing these lazy tuples used 80% of the time

ghc generates not-so-fast code, and that has been known for many years.
you have developed the proper theoretical way to solve this problem, but
there is also a much easier backdoor :) at least, so it seems

SM The simple optimisation of changing the final tail call to factorial 
SM into a goto to a label at the beginning of the function (as suggested by
SM John Meacham I think) should improve the code generated by gcc for 
SM functions like this, and is easy to do.

the main detail is to work with local variables instead of Sp[xx]. gcc
can't optimize access to Sp[xx]

 1) because it just isn't asked to. you can enable gcc optimization by
 adding -optc-O6 to the cmdline, but this leads to compilation errors
 in some modules. it seems that gcc optimization is not compatible
 with the evil mangler, which is ghc's own optimization trick.

SM -O3 works, I haven't tried -O6.

i've attached optc-O2-bug.hs, which fails with both -optc-O2 and
-optc-O3

 * C++ calling stack should be used as much as possible
 * parameters are passed in C++ way: factorial(int n, int r)

SM This is unworkable, I'm afraid.  Go and read Simon's paper on the STG:

SMhttp://citeseer.ist.psu.edu/peytonjones92implementing.html

SM there are two main reasons we don't use the C stack: garbage collection
SM and tail calls.  What do you plan to do about GC?


Re: inside the GHC code generator

2006-02-28 Thread Simon Marlow

Bulat Ziganshin wrote:


* C++ calling stack should be used as much as possible
* parameters are passed in C++ way: factorial(int n, int r)



SM This is unworkable, I'm afraid.  Go and read Simon's paper on the STG:

SMhttp://citeseer.ist.psu.edu/peytonjones92implementing.html

SM there are two main reasons we don't use the C stack: garbage collection
SM and tail calls.  What do you plan to do about GC?

the minimalistic solution i see is to just use two stacks. unboxed values
are passed using the C conventions, and the C compiler will be able to
optimize the use of this stack. boxed values are passed using the
current Sp[] machinery, so the GC need not be changed


How do you propose to do thread switching?

Incidentally, GHC had two stacks a long time ago (before 4.00). I rewrote 
the runtime and code generator to replace them with one stack.  It was 
about 6 months' work, if I remember correctly.  I think you misused the 
words just and minimalistic :-)


I know gcc does (some) tail-call optimisation these days.  You don't get 
any *guarantee* that a call will be implemented as a tail-call, though, 
and the conditions are quite complex.  The gcc folks treat it as an 
optimisation, rather than a fundamental part of the operational 
semantics, which is what you need for Haskell.


SM This is called semi-tagging, it was tried a long time ago.  Certainly 
SM worth trying again, but I very much doubt the gains would be anything 
SM like you claim, and it increases code size.


we would replace 2 unpredictable jumps with one that will not even be
executed if the value is in WHNF. in any case, i'm very displeased by the
slowness of lazy value evaluation in GHC. for example, my binary
serialization package writes 60 mb/s working with unboxed arrays and only
20 mb/s working with lists


You need to *try* these things and measure the difference.  Quite often 
I'm surprised with the actual results I get from an optimisation that 
looks on paper to be worthwhile.  Today's processors with 3-level caches 
are complicated beasts.


Semi-tagging increases code size, so unless you can make some 
significant gains to make up for it, you've lost already.


Let me give you another example: GHC puts info tables next to the entry 
code for a closure.  On paper this looks like it might be a bad idea: it 
pollutes the instruction cache with data that is rarely accessed during 
execution.  It's hard to arrange (this is one thing the mangler does), 
so we'd like to avoid it if possible.  However, doing this eliminates an 
indirection when entering a closure, reduces the size of info tables by 
one word, and reduces code size.  These effects together are more 
beneficial than the code-cache-pollution effect - last time I measured 
it the difference was ~10%.  (turn off TABLES_NEXT_TO_CODE and try it 
yourself).



SM There are other schemes, including this one that we particularly like:
SM use a spare bit in a pointer to indicate that it points to an evaluated
SM value.  Since there are two spare bits, you can go further and use the 4
SM values to mean:


SM0: possibly unevaluated
SM1: evaluated, tag 0
SM2: evaluated, tag 1
SM3: evaluated, tag 2

SM I believe Robert Ennals tried this as part of his spec eval 
SM implementation, but IIRC he didn't get separate measurements for it (or
SM it didn't improve things much).  Note that this increases code size too.

to be exact, we had only 2^30-1 valid pointers, all other values are
ready to use! :)


True, but I'm assuming you often want to store a pointer in addition to 
the tag (eg. for a cons cell).  For enumerations, you're quite right 
that the whole word can be used for the tag as long as it is 
distinguished from a pointer.


Cheers,
Simon


Re[2]: inside the GHC code generator

2006-02-27 Thread Bulat Ziganshin
Hello Simon,

Friday, February 24, 2006, 12:30:22 PM, you wrote:

SPJ | GHC has great high-level optimization.

SPJ However, let me strongly urge you *not* to focus attention primarily on
SPJ the gcc route.  Compiling via C has received a lot of attention over the
SPJ years, and there are many papers describing cool hacks for doing so.
SPJ GHC does not do as well as it could.

btw, it's well known that Clean sometimes makes better code than GHC
and sometimes vice versa. now i'm almost sure that GHC outperforms
Clean in high-level optimizations but loses on low-level code
generation, so the overall result depends on which is more important for
a particular algorithm. on hand-coded programs Clean should almost
always outperform GHC

SPJ But there are serious obstacles.  That's not gcc's fault -- it wasn't
SPJ designed for this.  Accurate GC is one of them,

we can use two stacks - one for boxed and one for unboxed values

SPJ tail calls is another,

nowadays gcc optimizes tail calls

SPJ and there are plenty more smaller things that bite you only after you've
SPJ invested a lot of time.  This way lies madness.

but nevertheless ghc already compiles via gcc and it's the fastest
backend, with the help of the Evil Mangler. afaik, code generated by ghc
will work even w/o the Evil Mangler, so it's just an optimization hack.
can you please say what optimizations are done via the EM? i think part
of them can be implemented by changing the C generation, so maybe we can
even omit the EM at long last

SPJ C-- was *designed* for this purpose.  GHC uses C-- as its intermediate
SPJ language (just before emitting C).  So a good route is this:

SPJ * Write C-- to C-- optimisations

SPJ * Then, if you like, translate that code to C.  Already you will be
SPJ doing better than GHC does today, because the C-- to C-- optimiser will
SPJ let you generate better C

SPJ * But you can also emit C--, or native code; both of these paths will
SPJ directly benefit from your C-- optimisations.

SPJ The point is that none of this relies on Quick C--; you can always use
SPJ an alternative back end.

yes, C-- was designed to solve all our problems - provide world-class
optimization, code generation for different cpus and so on. but does
it fulfil its promises? no. and now you propose to write these
optimizations as part of the Haskell project. great idea! i started from
the same idea and even wrote a sequence of optimizations that would
transform the initial C-- code for fac into efficient code. but then i
realized that the core problem is that ghc generates low-level
code in a portable asm style. and optimizing low-level code that
tries to describe how to IMPLEMENT this algorithm, instead of just
describing the ALGORITHM itself, is a hard task. no one writes optimizers
for the ASM language, but you propose to do exactly this.

instead of coping with the current portable asm code generated from STG,
i propose to generate idiomatic C or C-- code and leave its
optimization to the appropriate compiler. the SEPARATE problem is that
gcc/icc can generate much better code than qc, so i propose to direct the
effort of generating idiomatic C[--] toward C compilers

SPJ You can almost do this today. GHC uses C-- as an intermediate language.
SPJ But alas, GHC's code generator does not take advantage of C--'s native
SPJ calls or parameter passing.  Instead, it pushes things on an auxiliary
SPJ stack etc.  (Those Sp memory operations you see.)  This isn't necessary.
SPJ We are planning to split GHC's code generator into two parts
SPJ A) Generate C-- with native calls, with an implicit C-- stack
SPJ B) Perform CPS conversion, to eliminate all calls in favour of
SPJ jumps
SPJ using an explicit stack
SPJ The current code gen is A followed by B.  But A is a much more suitable
SPJ optimisation platform, and gives more flexibility.

i propose to implement only A and leave B to the C[--] compiler
itself. that would require returning to the 2-stack model, but in return
we would get first-class optimization instead of reimplementing a rather
restricted subset of it ourselves. instead of permanently trying to catch
up with the C compilers' optimizations we can just jump on the common
bandwagon :)

SPJ Chris Thompson, and undergrad at Cambridge, is doing (B) as his
SPJ undergrad project, although it remains to be seen whether he'll have
SPJ enough complete to be usable.

SPJ Another shortcoming is that the native code generator in GHC isn't
SPJ capable of dealing with backward jumps to labels (because GHC hasn't
SPJ needed that so far).  But if you did C-- optimisation, you'd probably
SPJ generate such jumps.  It'd be great to beef up the native code gen to
SPJ handle that.

i propose to add if/for/while statements to the CmmStmt datatype to
allow generation of higher-level code. as John Meacham wrote several
months ago, gcc can't unroll loops written with gotos, so it's better
anyway to generate explicit loops. it should not be too hard to
generate asm code for these constructs, should it?

although anyway 

Re: inside the GHC code generator

2006-02-27 Thread Niklas Sorensson

Hello,


SPJ tail calls is another,

nowadays gcc optimizes tail calls


I found this thesis on the subject by Andreas Bauer:

  "Compilation of Functional Programming Languages using GCC -- Tail Calls"


Do you know if it is his work that was incorporated into GCC?

I browsed through some parts of the thesis, and he mentions GHC as one of 
the motivations for his work (the thesis also seems to be a nice 
tutorial on the internals of GCC). From this thread it seems that GHC 
cannot currently take advantage of his work. Is that correct?


/Niklas


Re: inside the GHC code generator

2006-02-25 Thread Christof Douma
 But there are serious obstacles.  That's not gcc's fault -- it wasn't
 designed for this.  Accurate GC is one of them, tail calls is another,
 and there are plenty more smaller things that bite you only after you've
 invested a lot of time.  This way lies madness.

What about llvm[1]? It was primarily designed to support GCC-based C and
C++ front-ends, but currently it also supports:
- Accurate GC
- tail calls
- output for x86, PowerPC, IA-64, Alpha, SPARC V9, and portable C.
- a lot of (program wide and/or profile-guided) analyses and
  optimizations

Are there arguments against llvm?

[1] llvm website: http://llvm.cs.uiuc.edu/

Regards,
Christof


Re: inside the GHC code generator

2006-02-25 Thread Ken Whitegoat

[EMAIL PROTECTED] wrote:

...

as you can see here, gcc hasn't unrolled the loop and has retained all the
stack-register moves. why? the short answer is that gcc just doesn't
know that the data pointed to by Sp is non-volatile - it can't
be changed in the middle of the program by some async computation or
exception and isn't required to be updated immediately. maybe there
is some way to tell gcc this? it would immediately SUBSTANTIALLY
improve all ghc code generation
  

Would the C99 restrict keyword help in this situation?



Re[2]: inside the GHC code generator

2006-02-24 Thread Bulat Ziganshin
Hello kyra,

Friday, February 24, 2006, 12:37:02 AM, you wrote:

 i prefer to see the asm code. this may be because of better high-level
 optimization strategies (reusing fib values). the scheme i'm talking
 about will combine the advantages of both worlds
k no strategies, plain exponential algorithm,

yes, the ocaml compiler works better with the stack. but i'm sure that in
most cases gcc will outperform ocaml because it has a large number of
optimizations which are not easy to implement (unrolling, instruction
scheduling and so on)

k also, Clean is *EXACTLY* in line with ocaml. This is interesting,
k because Clean is so much similar to Haskell.

clean differs from Haskell in its support for unique types and strictness
annotations. the latter is slowly migrating into GHC in the form of bang
patterns, but i think that is a half-solution. i mentioned in my
original letter proposals to add strictness annotations to
function type declarations and to declare strict data structures, such
as ![Int]

-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]



Re[2]: inside the GHC code generator

2006-02-24 Thread Bulat Ziganshin
Hello Ben,

Friday, February 24, 2006, 2:04:26 AM, you wrote:

 * multiple results can be returned via the C++ pair<a,b> template, if this is
 * efficiently implemented in gcc

BRG There's a -freg-struct-return option in 2.x, 3.x and 4.x. I think it's off
BRG by default on many architectures.

thank you! that's one problem down. moreover, we can use just plain C for
our translation instead of C++


 * recursive definitions translated into the for/while loops if possible

BRG I think recent versions of GCC do a good job of this. See

BRG  http://home.in.tum.de/~baueran/thesis/

there is no problem with proper handling of tail calls. the problem is
that these loops are not unrolled, generating significantly worse
code than an explicit loop. you can see this in the files i attached to
the original letter

BRG All of this efficiency stuff aside, there's a big problem you're
BRG neglecting: GARBAGE COLLECTION. For a language that allocates as much
BRG as Haskell I think a conservative collector is absolutely out of the
BRG question, because they can't compact the heap and so allocation becomes
BRG very expensive. A copying collector has to be able to walk the stack
BRG and find all the roots, and that interacts very badly with
BRG optimization. All the languages I've seen that compile via C either
BRG aren't garbage collected or use a conservative or reference-count
BRG collector.

as i said, we can just use 2 stacks - one, pointed to by the EBP register,
contains all boxed values; the second is the hardware stack, pointed to of
course by ESP, which contains unboxed values and is managed by gcc as for
any other C program. so the boxed parameters to functions go through the
EBP-pointed stack and unboxed values are passed via the usual C
conventions:

int fac(int n, int r)

currently, EBP is used for all data and ESP is just not used.
moreover, until 1999 the same two-stack scheme was used in GHC.
compared to the current one-stack scheme, we will need more space for
the stacks and may lose something because memory usage patterns will be
slightly less localized


-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]



Re[4]: inside the GHC code generator

2006-02-24 Thread Bulat Ziganshin
Hello Jean-Philippe,

Friday, February 24, 2006, 2:04:57 AM, you wrote:

JPB From my (limited) knowledge of the GHC backend, the difficult part of
JPB your plan is that STG is not suited to compilation to native C at all.
JPB You might need to do a quite advanced translation from STG to another
JPB intermediate language (such as GRIN), and then some more
JPB advanced analysis/optimization before C can be generated.

your knowledge is very limited, because STG at all times was compiled
to C and then compiled by gcc :)  moreover, it's very easy to
implement efficient compilation of fac() definition from STG to
natural C. it's one-hour task, maximum. the real problem is changing
the whole environment, including AST representing C code inside the GHC,
RTS and so on

JPB iirc, the tricky part is the handling of lazyness. At any point you
JPB may end up with a thunk (closure), which ghc handles easily by
JPB evaluating it: it's always a function pointer. (That's the tagless
JPB part) When in WHNF, it just calls a very simple function that fills
JPB registers with the evaluated data. Otherwise, the function call
JPB performs the reduction to WHNF.

STG attaches explicit typing information to every var. i propose:

1) compile code that works with unboxed values to equivalent natural
C code
2) for boxed values, MAXIMALLY simplify the test that they are already
in WHNF. the current solution leads to two jumps, which is very expensive
on modern cpus. i propose to make instead one check and one
conditional jump, which will be much cheaper. then, if the value is in
WHNF, we just load all the needed data ourselves. if it is not in WHNF,
we perform just the same operations as in current GHC. for example, a
wrapper that calls the int fac(int n, int r) worker could be
something like this:

if ((p = n->closure)) goto eval_n;
n_evaluated:
if ((p = r->closure)) goto eval_r;
r_evaluated:
return fac (n->value, r->value);

eval_n:
(*p) (); // sets n->closure=NULL; n->value to n's value
goto n_evaluated;

eval_r:
(*p) ();
goto r_evaluated;

JPB If you want to use the same trick, you'll end up with the same
JPB problems (bad C). So, you'll want to eliminate thunks statically,
JPB finally requiring a 'high level' analysis as I suggested above.

really? :)

JPB Also, allocation + garbage collection is extremely efficient in
JPB current GHC... C/C++ alloc doesn't even come close. It's entirely
JPB possible that with even a very fast backend, the better ghc allocator
JPB will be enough to swing the balance. (see jhc)
JPB It might be possible to re-use most of it however.

i don't propose to replace everything in the ghc backend, just the way
unboxed code is compiled and some things around it. i just don't know
enough about the other parts of the ghc backend/rts :)))

-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]



RE: inside the GHC code generator

2006-02-24 Thread Simon Peyton-Jones
| last days i studied GHC code generator, reading -ddumps of all sorts,
| GHC sources, various papers and John's appeals :)
| 
| what i carried from this investigation:
| 
| GHC has great high-level optimization. moreover, GHC now allows one to
| program STG almost directly. when i look at STG code i don't see
| anything that can be done better - at least for these simple loops i
| tried to compile. i see just unboxed variables, primitive operations
| and simple loops represented as tail recursion. fine.
| 
| then STG is compiled to simplified C (called abstract C earlier and
| quasi C-- now), which can next be translated:
| 
| * to real C and then compiled by gcc
| * to assembler by the built-in simple C-- compiler
| * to assembler by some external C-- compiler (at least this is
| theoretically possible)

Good stuff Bulat.  There's plenty of interesting stuff to be done here.

However, let me strongly urge you *not* to focus attention primarily on
the gcc route.  Compiling via C has received a lot of attention over the
years, and there are many papers describing cool hacks for doing so.
GHC does not do as well as it could. 

But there are serious obstacles.  That's not gcc's fault -- it wasn't
designed for this.  Accurate GC is one of them, tail calls is another,
and there are plenty more smaller things that bite you only after you've
invested a lot of time.  This way lies madness.

C-- was *designed* for this purpose.  GHC uses C-- as its intermediate
language (just before emitting C).  So a good route is this:

* Write C-- to C-- optimisations

* Then, if you like, translate that code to C.  Already you will be
doing better than GHC does today, because the C-- to C-- optimiser will
let you generate better C

* But you can also emit C--, or native code; both of these paths will
directly benefit from your C-- optimisations.

The point is that none of this relies on Quick C--; you can always use
an alternative back end.



You can almost do this today. GHC uses C-- as an intermediate language.
But alas, GHC's code generator does not take advantage of C--'s native
calls or parameter passing.  Instead, it pushes things on an auxiliary
stack etc.  (Those Sp memory operations you see.)  This isn't necessary.
We are planning to split GHC's code generator into two parts
A) Generate C-- with native calls, with an implicit C-- stack
B) Perform CPS conversion, to eliminate all calls in favour of
jumps
using an explicit stack
The current code gen is A followed by B.  But A is a much more suitable
optimisation platform, and gives more flexibility.

Chris Thompson, an undergrad at Cambridge, is doing (B) as his
undergrad project, although it remains to be seen whether he'll have
enough complete to be usable.  

Another shortcoming is that the native code generator in GHC isn't
capable of dealing with backward jumps to labels (because GHC hasn't
needed that so far).  But if you did C-- optimisation, you'd probably
generate such jumps.  It'd be great to beef up the native code gen to
handle that.


Many of the optimisations you describe (perhaps all) are readily
expressible in the C-- intermediate language, and by working at that
level you will be independent of whether the back end is gcc, a native code
generator, or Quick C--, or some other C-- compiler.  Much better.

Simon


Re: inside the GHC code generator

2006-02-24 Thread Simon Marlow

Hi Bulat,

Wow!  You make a lot of good suggestions.  I think some of them are 
unfortunately unworkable, some others have been tried before, but there 
are definitely some to try.  I'll suggest a few more in this email.


First of all I should say that we don't *want* to use gcc as a code 
generator.  Everything we've been doing in the back end over the last 
few years has been aiming towards dropping the dependency on gcc and 
removing the Evil Mangler.  Now is not the time to be spending a lot of 
effort in beefing up the C back end :-)  Our preference is to work on 
both the native and C-- back ends; the C-- back end has the potential to 
produce the best code and be the most portable.  We also plan to keep 
the unregisterised C back end for the purposes of porting to a new 
platform, it's only the registerised path we want to lose.


Having said all that, if we can get better code out of the C back end 
without too much effort, then that's certainly worthwhile.



translated to something like this

factorial:
_s1BD = *(Sp + 0);
if (_s1BD != 1) goto c1C4; // n!=1 ?
R1 = *(Sp + 4);
Sp = Sp + 8;
return (*(Sp + 0))();  // return from factorial
c1C4:
_s1BI = _s1BD * (*(Sp + 4));   // n*r
_s1BF = _s1BD - 1; // n-1
*(Sp + 4) = _s1BI;
*(Sp + 0) = _s1BF;
goto factorial;


To be fair, this is only the translation on an architecture that has no 
argument registers (x86, and currently x86_64 but hopefully not for long).


The simple optimisation of changing the final tail call to factorial 
into a goto to a label at the beginning of the function (as suggested by 
John Meacham I think) should improve the code generated by gcc for 
functions like this, and is easy to do.
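To make that transformation concrete, here is a minimal sketch (not actual GHC output) of what the suggested change would produce on the C side: the self tail call becomes a goto to a label at the top of the function, which gcc can then optimise as an ordinary loop. The function and variable names are illustrative.

```c
#include <assert.h>

/* Hypothetical result of the suggested optimisation: the final
   `goto factorial` tail call is replaced by a goto to an entry
   label, turning the tail recursion into a plain C loop. */
static long factorial(long n, long r)
{
entry:
    if (n == 1)
        return r;      /* base case: return the accumulated result */
    r = n * r;         /* n*r  */
    n = n - 1;         /* n-1  */
    goto entry;        /* was: tail call to factorial */
}
```

With the call turned into a backward jump, gcc's loop optimisations (register allocation of `n` and `r`, strength reduction, unrolling) apply directly.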



1) because it just not asked. you can enable gcc optimization by
adding -optc-O6 to the cmdline, but this leads to compilation errors
on part of modules. it seems that gcc optimization is not compatible
with evil mangler that is the ghc's own optimization trick.


-O3 works, I haven't tried -O6.


* C++ calling stack should be used as much as possible
* parameters are passed in C++ way: factorial(int n, int r)


This is unworkable, I'm afraid.  Go and read Simon's paper on the STG:

  http://citeseer.ist.psu.edu/peytonjones92implementing.html

there are two main reasons we don't use the C stack: garbage collection 
and tail calls.  What do you plan to do about GC?
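To make the GC obstacle concrete, here is a minimal sketch of the usual workaround for accurate GC when compiling via C: a hand-maintained "shadow stack" of roots, needed because the collector cannot reliably find or update heap pointers held in C locals and registers. Everything here (names, sizes) is illustrative, not GHC's actual scheme.

```c
#include <assert.h>
#include <stddef.h>

/* A shadow stack: every live heap pointer held in a C local must be
   registered here so an accurate collector can find (and update) it. */
#define MAX_ROOTS 256
static void *root_stack[MAX_ROOTS];
static size_t root_top = 0;

static void push_root(void *p)
{
    assert(root_top < MAX_ROOTS);
    root_stack[root_top++] = p;
}

static void pop_roots(size_t n)
{
    assert(n <= root_top);
    root_top -= n;   /* these roots die when the C frame exits */
}
```

The bookkeeping around every allocation point is exactly the kind of overhead being alluded to here: it also prevents gcc from keeping those pointers in registers across calls.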



* recursive definitions translated into the for/while loops if possible


Certainly a good idea.


* groups of mutually recursive definitions should be also naturally
translated as much as possible
* functions marked INLINE in Haskell code should be marked the same in
C++ code


an inline function will have been inlined, so not sure what you mean here!


* maximally speed up using of already evaluated values. instead of
unconditional jumping to the closure-computation function just test
this field and continue computation if it contains NULL (indicating
that value is already in WHNF). This change will make time penalty of
boxing data structures significantly, at least 10 times, smaller


This is called semi-tagging, it was tried a long time ago.  Certainly 
worth trying again, but I very much doubt the gains would be anything 
like you claim, and it increases code size.


There are other schemes, including this one that we particularly like: 
use a spare bit in a pointer to indicate "points to an evaluated value". 
Since there are two spare bits, you can go further and use the 4 
values to mean:


  0: possibly unevaluated
  1: evaluated, tag 0
  2: evaluated, tag 1
  3: evaluated, tag 2

I believe Robert Ennals tried this as part of his spec eval 
implementation, but IIRC he didn't get separate measurements for it (or 
it didn't improve things much).  Note that this increases code size too.
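A sketch of that encoding in C, assuming closures are at least 4-byte aligned so the low two bits of a pointer are spare; the macro and function names here are invented for this example, not GHC's.

```c
#include <assert.h>
#include <stdint.h>

/* Low two bits of an aligned closure pointer encode:
     0: possibly unevaluated
     1: evaluated, constructor tag 0
     2: evaluated, constructor tag 1
     3: evaluated, constructor tag 2 */
#define TAG_MASK ((uintptr_t)0x3)

static long heap_cell;   /* stands in for a word-aligned heap closure */

static uintptr_t mk_tagged(void *closure, unsigned tag)
{
    return (uintptr_t)closure | ((uintptr_t)tag & TAG_MASK);
}

static unsigned ptr_tag(uintptr_t p) { return (unsigned)(p & TAG_MASK); }
static void    *untag(uintptr_t p)   { return (void *)(p & ~TAG_MASK); }

/* The point of the scheme: the caller can branch on the tag without
   dereferencing or entering the closure at all. */
static int needs_eval(uintptr_t p)   { return ptr_tag(p) == 0; }
```

The code-size cost mentioned above comes from every case expression gaining an extra mask-and-branch on the tag before the fallback "enter the closure" path.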



* in addition to ability to define strict fields, allow to define
strict data structures (say, lists) and strict types/structures of
functions parameters and results. that is a long-standing goal, but
the ![!Int] -> !Int function can be used inside higher-order code a
lot faster than [Int] -> Int one. the ultimate goal is to allow
Haskell to be as fast as ML dialects in every situation and this means
that laziness should be optional in every place. this will also
allow to make efficient emulation of C++ virtual methods


Strict types are interesting, and I know several people (including me) 
who have put some thought into how they would work.  It turns out to be 
surprisingly difficult, though.


You might look into doing something simple: suppose you could declare a 
type to be unlifted:


  data !T = ..

The type T doesn't have a bottom element.  Its kind is !, not *, and it 
can't be used to instantiate a polymorphic type variable, just like 
GHC's primitive types.


This is quite restrictive, but it's simple and pretty easy to implement 
in GHC as it stands.  It gives you a nice 

Re: Re[2]: inside the GHC code generator

2006-02-24 Thread Lemmih
On 2/24/06, Bulat Ziganshin [EMAIL PROTECTED] wrote:
 Hello kyra,

 Friday, February 24, 2006, 12:37:02 AM, you wrote:

  i prefer to see the asm code. this may be because of better high-level
  optimization strategies (reusing fib values). the scheme about i say
  will combine advantages of both worlds
 k no strategies, plain exponential algorithm,

 yes, the ocaml compiler works better with stack. but i sure that in
 most cases gcc will outperform ocaml because it has large number of
 optimizations which is not easy to implement (unrolling, instruction
 scheduling and so on)

 k also, Clean is *EXACTLY* in line with ocaml. This is interesting,
 k because Clean is so much similar to Haskell.

 clean differs from Haskell in support of unique types and strictness
 annotations. the last is slowly migrates into GHC in form of shebang
 patterns, but i think that it is a half-solution. i mentioned in
 original letter my proposals to add strictness annotations to
 function types declarations and to declare strict datastructures, such
 as ![Int]

As I've understood it, Clean's strictness annotations are a bit of a
hack which only works on certain built-in types. Am I mistaken here?

--
Friendly,
  Lemmih


RE: inside the GHC code generator

2006-02-24 Thread Simon Peyton-Jones

| The closest thing I've seen to a solution is the technique used in
| Urban Boquist's thesis, which is to put a static table in the executable
| with information about register and stack usage indexed by procedure
| return point, and use that to unwind the stack and find roots.

Every accurate garbage collector (including GHC's) uses a technique like
this, but the solution is invariably tightly-coupled to the garbage
collector, compiler and run-time system.  

A major feature of C-- is that it squarely addresses this problem in a
*reusable* way.

Simon



Re: inside the GHC code generator

2006-02-24 Thread Ben Rudiak-Gould

Rene de Visser wrote:
Integer is about 30 times slower than it needs to be on GHC if you have 
over half the values between -2^31 and 2^31. On i386 you can basically 
test the low bit for free (it runs in parallel to the integer add 
instruction). This would allow only values outside this range to 
require calling the long integer code. Such an optimization is not 
easily done in C.


Currently GHC defines

data Integer = S# Int# | J# Int# ByteArray#

So it already avoids using GMP for small integers. There's a "will this 
multiplication overflow" primitive, but I'm not sure how it's implemented.


I think that changing this to use magic pointers would be pretty easy. You'd 
need to introduce a new primitive type, say Int31#, and then:


1. anytime you previously constructed a WHNF S# on the heap, make a
   magic pointer instead

2. anytime you dereference a pointer that might be an S#, check for a
   magic pointer first.

Even if a lot of code needs to be changed, it's straightforward because the 
changes are local. You're just changing the meaning of a pointer such that 
there's a statically allocated S# n at address 2n+1.
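A sketch of that magic-pointer convention in C: odd words encode a statically known S# n at "address" 2n+1, while even words remain ordinary aligned heap pointers (the J# case). All names here are illustrative, not GHC's actual Integer machinery.

```c
#include <assert.h>
#include <stdint.h>

/* An odd value 2n+1 plays the role of a statically allocated S# n;
   a real (even, word-aligned) pointer means a heap-allocated J#. */
static int is_magic(uintptr_t p)        { return (p & 1) != 0; }

static uintptr_t make_magic(intptr_t n) { return ((uintptr_t)n << 1) | 1; }

static intptr_t magic_value(uintptr_t p)
{
    /* arithmetic right shift recovers n, including negatives on the
       usual two's-complement targets */
    return (intptr_t)p >> 1;
}
```

Point 2 above then becomes: check `is_magic(p)` before any dereference, and only follow the pointer when the check fails.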


It would also be worth trying this for Char#, which is already a 31-bit 
type, to see if it would speed up text-processing code.


If only simple loops are optimized it will encourage people to always 
code loops in their haskell rather than perhaps using more appropriate 
constructs.


You could say that about any language. When your code needs to be fast it 
needs to be fast. I'd rather write ugly Haskell code than ugly C code, if it 
comes to that.


Also take the Maybe data type with Nothing and Just ... or any other 
datatypes with 0 and 1 variable constructors. Here these could be 
represent by special values for the 0 variable case and bit marking on 
the single constructor values.


I'm not sure this would help much. Nullary constructors are already 
allocated statically, so they're already represented by special values. You 
could check for those values instead of dereferencing the pointer, but in 
time-critical code the static object will presumably be in L1 cache anyway. 
I could be wrong of course, and it would be easy enough to extend the Int31# 
idea to nullary constructors (Int0#).


-- Ben



Re[4]: inside the GHC code generator

2006-02-24 Thread Bulat Ziganshin
Hello Lemmih,

Friday, February 24, 2006, 1:15:51 PM, you wrote:

 clean differs from Haskell in support of unique types and strictness
 annotations. the last is slowly migrates into GHC in form of shebang
 patterns, but i think that it is a half-solution. i mentioned in
 original letter my proposals to add strictness annotations to
 function types declarations and to declare strict datastructures, such
 as ![Int]

L As I've understood it, Clean's strictness annotations are a bit of a
L hack which only works on certain built-in types. Am I mistaking here?

i don't know Clean very well, although i've seen rumors that it
supports strict datastructures. after all, this doesn't need changes in
the language itself, it's just syntactic sugar:

data [a] = [] | a:[a]
x :: ![Int]

translates to the

data StrictList a = Nil | Cons a !(StrictList a)
x :: !(StrictList a)



... i've found the following in the 6 feb letter in cafe by Brian Hulley:

Bulat Ziganshin wrote:
 yes, i remember this SPJ's question :)  [!a] means that list
 elements are strict, it's the same as defining new list type with
 strict elements and using it here. ![a] means strict list, it is
 the same as defining list with next field strict:

 data List1 a = Nil1 | List1 !a (List1 a)
 data List2 a = Nil2 | List2 a !(List2 a)
 data List3 a = Nil3 | List3 !a !(List3 a)

Clean allows (AFAIK) several distinctions to be made:

1) ![a] means that the list of a's is a strict argument, just like writing 
!b

2) [!a] means that the list is head strict (List1 a)

3) [a!] means that the list is tail strict (List2 a)

4) [!a!] means that the list is head and tail strict (List3 a)

5) ![!a!] means that the head-and-tail-strict-list-argument is strict!!!

I think also (though I'm not entirely sure) that these distinctions are 
generalized for other data types by talking about element strictness and 
spine strictness.

One motivation seems to be that in the absence of whole program 
optimization, the strictness annotations on a function's type can allow the 
compiler to avoid creating thunks at the call site for cross-module calls 
whereas using seq in the function body itself means that the thunk still has 
to be created at the call site because the compiler can't possibly know that 
it's going to be immediately evaluated by seq.




and my own letter earlier on the 6 feb:

 foo :: !Int -> !Int

KM (Is the second ! actually meaningful?)

yes! it means that the function is strict in its result - i.e. it can't return
an undefined value when strict arguments are given. this sort of knowledge
should help a compiler to propagate strictness and figure out the
parts of the program that can be compiled as strict code. really, i think
ghc is able to figure out functions with strict results just like it is able
to figure out strict function arguments

KM Personally, I think this is much nicer than sprinkling seq's around, and
KM generally sufficient.  However, there could perhaps be ambiguities?

btw, it's just implemented in the GHC HEAD

KM Last time this came up, I think examples resembling these were brought
KM up:

KM   foo :: [!a] -> ![a] -> a

yes, i remember this SPJ's question :)  [!a] means that list
elements are strict, it's the same as defining new list type with
strict elements and using it here. ![a] means strict list, it is
the same as defining list with next field strict:

data List1 a = Nil1 | List1 !a (List1 a)
data List2 a = Nil2 | List2 a !(List2 a)
data List3 a = Nil3 | List3 !a !(List3 a)

the type List3 is a simple strict list, like in any strict programming
language.

foo :: [!a] -> ![a] -> ![!a] -> a

translates to

foo :: List1 a -> List2 a -> List3 a -> a



KM   foo' :: Map !Int String -> Int -> String

that means that keys in this map saved as strict values. for example,
the following definition

type Map a b = [(a,b)]

will be instantiated to

Map !Int String == [(!Int, String)]


KM Anyway, if a reasonable semantics can be formulated, I think
KM strictness type annotations would be a great, useful, and
KM relatively non-intrusive (AFAICT, entirely backwards compatible)
KM addition to Haskell'. 

such a proposal already exists and is supported by an implementation in GHC
HEAD


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]



Re: Re[4]: inside the GHC code generator

2006-02-24 Thread Lemmih
On 2/24/06, Bulat Ziganshin [EMAIL PROTECTED] wrote:
 Hello Lemmih,

 Friday, February 24, 2006, 1:15:51 PM, you wrote:

  clean differs from Haskell in support of unique types and strictness
  annotations. the last is slowly migrates into GHC in form of shebang
  patterns, but i think that it is a half-solution. i mentioned in
  original letter my proposals to add strictness annotations to
  function types declarations and to declare strict datastructures, such
  as ![Int]

 L As I've understood it, Clean's strictness annotations are a bit of a
 L hack which only works on certain built-in types. Am I mistaking here?

 i don't know Clean very well, although i've seen rumors that it
 supports strict datastructures. after all, this don't need changes in
 language itself, it's just a syntax sugar:

 data [a] = [] | a:[a]
 x :: ![Int]

 translates to the

 data StrictList a = Nil | Cons a !(StrictList a)
 x :: !(StrictList a)

Let's try this:

x :: ![Int] -> Int

It would translate to something like this:

mkStrictList :: [a] -> StrictList a
x = xStrict . mkStrictList
xStrict = ...

Wouldn't it be very expensive to strictify the list?

--
Friendly,
  Lemmih


Re[6]: inside the GHC code generator

2006-02-24 Thread Bulat Ziganshin
Hello Lemmih,

Friday, February 24, 2006, 8:55:37 PM, you wrote:

 x :: ![Int]

 translates to the

 data StrictList a = Nil | Cons a !(StrictList a)
 x :: !(StrictList a)

L Let's try this:

L x :: ![Int] -> Int

L It would translate to something like this:

L mkStrictList :: [a] -> StrictList a
L x = xStrict . mkStrictList
L xStrict = ...

L Wouldn't it be very expensive to strictify the list?

yes, it would. but evaluating the list as a lazy one on EACH repetition of
some loop will cost much more. just for example - i found that the
cycle

repeat 666 (putStr (replicate 15 ' '))

works several times slower than

repeat (10^8) (putChar ' ')

if the argument of putStr were evaluated only once, much of the
difference would be gone. of course, that is just a benchmark. but there
are cases when i prefer to work with strict datastructures and specialize
my functions accordingly. typically it's the cases where time/space
are really critical

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]



Re: inside the GHC code generator

2006-02-24 Thread Wolfgang Thaller

Another shortcoming is that the native code generator in GHC isn't
capable of dealing with backward jumps to labels (because GHC hasn't
needed that so far).  But if you did C-- optimisation, you'd probably
generate such jumps.  It'd be great to beef up the native code gen to
handle that.


I'm already working on that. It's basically done, I think I only need  
to get around to one more session with the code for final cleanup.


(Just to avoid any duplication of effort).

Cheers,

Wolfgang



Re: inside the GHC code generator

2006-02-24 Thread Ben Rudiak-Gould

Simon Peyton-Jones wrote:

| [...] put a static table in the executable with
| information about register and stack usage indexed by procedure return
| point, and use that to unwind the stack and find roots. 


Every accurate garbage collector (including GHC's) uses a technique like
this, but the solution is invariably tightly-coupled to the garbage
collector, compiler and run-time system.


Okay, I don't know what I was thinking when I wrote that no languages that 
compile via C use compacting collectors, since obviously lots of them do. 
But they do it by using various hacks to force local heap pointers into 
locations where the GC can find them. Most of this effort is wasted, because 
the local variables disappear before the GC runs. What I'm talking about is 
idiomatic C code which manipulates heap pointers like any other object, and 
which can be optimized as usual (e.g. holding heap pointers in callee-saved 
registers across function calls) without causing GC problems. There's no 
reason in principle that this couldn't be done.


-- Ben



inside the GHC code generator

2006-02-23 Thread Rene de Visser

The following link gives reasons for not generating via C

http://groups.google.com/groups?hl=enlr=ie=UTF-8selm=4zp8kn7xe.fsf_-_%40beta.franz.com

Naturally a number of these are common lisp specific, however I think that 
Haskell and GCC are quite semantically different, so using GCC might prevent 
a lot of optimizations that aren't possible in C but would be in GHC.


The OCAML-OPT backend is supposed to produce i386 assembly that is 
competitive with GCC. Maybe this could be ported to GHC?


Rene.




Re: inside the GHC code generator

2006-02-23 Thread Bulat Ziganshin
Hello Rene,

Thursday, February 23, 2006, 4:19:15 PM, you wrote:

RdV The following link gives reasons for not generating via C
RdV 
http://groups.google.com/groups?hl=enlr=ie=UTF-8selm=4zp8kn7xe.fsf_-_%40beta.franz.com

i read it now

RdV Naturally a number of these are common lisp specific, however I think that
RdV Haskell and GCC are quite semantically different, so using GCC might 
prevent
RdV a lot of optimizations, that aren't possible in C, but would be in GHC.

it seems that you don't understand the situation. ghc compiles Haskell to
a language called Core, does almost all optimizations at the level of this
language, then translates the final result to the STG language, from which
the C-- code is generated. changing the translation of STG can't
prevent ANY ghc optimization, although it is not so easy because ghc
code generation and the RTS are closely tied together

RdV The OCAML-OPT backend is supposed to produce i386 assembly that is 
RdV competitive with GCC. Maybe this could be ported to GHC?

can you please show the code for the factorial accumulating function:

factorial n r  =  if n <= 1  then r  else (factorial (n - 1) (n*r))

i think that ocaml can't generate code better than gcc and especially
icc (intel C/C++ compiler), but maybe i'm wrong? ;)

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]



Re: inside the GHC code generator

2006-02-23 Thread Rene de Visser

it seems that you don't understand the situation. ghc compiles Haskell to
a language called Core, does almost all optimizations at the level of this
language, then translates the final result to the STG language, from which
the C-- code is generated. changing the translation of STG can't
prevent ANY ghc optimization, although it is not so easy because ghc
code generation and the RTS are closely tied together


I should have been a bit clearer here. I meant that optimizations that are 
available from

STG -> Assembler, are better than STG -> C -> Assembler.

GHC currently doesn't do most of the optimizations I am thinking of.
-- Bit tagging to reduce pointer chasing, speed up pattern matching. Due to 
memory latency and speed it is quicker to do bit masking rather than memory 
reads
-- Parameter passing and register usage optimizations that rely on the 
structure of the RTS.

-- Multiple stacks with custom frame layout.
-- dynamic code optimization etc.
-- Taking advantage of special assember instructions and flags.

Though I have also seen comments that you can do a lot of these with GCC if 
you do your own stack and parameter management. i.e. don't use the C stack 
at all.


Though your suggestions are probably better than nothing, which is probably 
what the alternative is (for instance I have not sufficient time to work on 
these things).


Note that I didn't say that the assembly generation of OCAML was better than 
GCC, just that it was comparable.


Rene.




Re: inside the GHC code generator

2006-02-23 Thread kyra

Bulat Ziganshin wrote:

i think that ocaml can't generate code better than gcc and especially
icc (intel C/C++ compiler), but may be i'm wrong? ;)
  
didn't try factorial, but exponential fib in ocaml is *FASTER* than both 
gcc and intel c/c++ with highest optimization levels



Re: inside the GHC code generator

2006-02-23 Thread Claus Reinke

the long answer is: are you ever heard promises that gcc is best
cpu-independent assembler? no? and you know why? because gcc is not
cpu-independent assembler. gcc was strongly optimized to make
efficient asm from the code usually written by the C programmers. but
code generated by ghc has nothing common with it. so we are stay with
all these register-memory moves, non-unrolled loops and all other
results of naive compilation. gcc is just don't know how to
efficiently manage such code!


would there be any advantage to targeting gcc's backend directly?

I notice that Mercury seems to support this:
http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html
http://gcc.gnu.org/frontends.html

that is, does using C as assembler disable possible optimizations,
or is going through the C frontend enabling more optimizations than
going to the backend directly?

just curious,
claus


Re[2]: inside the GHC code generator

2006-02-23 Thread Bulat Ziganshin
Hello kyra,

Thursday, February 23, 2006, 5:38:54 PM, you wrote:

k Bulat Ziganshin wrote:
 i think that ocaml can't generate code better than gcc and especially
 icc (intel C/C++ compiler), but may be i'm wrong? ;)
   
k didn't try factorial, but exponential fib in ocaml is *FASTER* than both
k gcc and intel c/c++ with highest optimization levels

i prefer to see the asm code. this may be because of better high-level
optimization strategies (reusing fib values). the scheme about i say
will combine advantages of both worlds


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]



Re[2]: inside the GHC code generator

2006-02-23 Thread Bulat Ziganshin
Hello Rene,

Thursday, February 23, 2006, 5:32:21 PM, you wrote:

it seems that you don't understand the situation. ghc compiles Haskell to
a language called Core, does almost all optimizations at the level of this
language, then translates the final result to the STG language, from which
the C-- code is generated. changing the translation of STG can't
prevent ANY ghc optimization, although it is not so easy because ghc
code generation and the RTS are closely tied together

RdV I should have been a bit clearer here. I meant that optimizations that are
RdV available from STG -> Assembler, are better than STG -> C -> Assembler.

theoretically - yes. in practice, it is hard to compete with gcc. ghc
has waited for such code generation for 15 years (waiting for the mountain
to come in our direction :) and i think that the REAL way to reach gcc-level
optimization is to compile to idiomatic C instead of continuing to wait
for this theoretically possible optimization to arise

RdV GHC currently doesn't do most of the optimizations I am thinking of.
RdV -- Bit tagging to reduce pointer chasing, speed up pattern matching. Due to
RdV memory latency and speed it is quicker to do bit masking rather than memory
RdV reads
RdV -- Parameter passing and register usage optimizations that rely on the
RdV structure of the RTS.
RdV -- Multiple stacks with custom frame layout.
RdV -- dynamic code optimization etc.
RdV -- Taking advantage of special assember instructions and flags.

i think that all these things can be done in the idiomatic C way

RdV Though I have also seen comments that you can do a lot of these with GCC if
RdV you do your own stack and parameter management. i.e. don't use the C stack
RdV at all.

as i see in the current code generation, attempts to do our own
stack management lead to gcc just not understanding such code and
not optimizing it as well as possible. please compare the code generated
for the fac() function with all the other variants

RdV Though your suggestions are probably better than nothing, which is probably
RdV what the alternative is (for instance I have not sufficient time to work on
RdV these things).

moreover, i'm sure that you can't compete with gcc :)

RdV Note that I didn't say that the assembly generation of OCAML was better
RdV than GCC, just that it was comparable.

what does comparable mean? and what should we do to reuse this code
generation in ghc? at least i know what to do to reuse gcc's code
generation. can you propose a similar plan to reuse ocaml's code
generation?


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]



Re: inside the GHC code generator

2006-02-23 Thread Bulat Ziganshin
Hello Rene,

Thursday, February 23, 2006, 4:19:15 PM, you wrote:

RdV The following link gives reasons for not generating via C

RdV 
http://groups.google.com/groups?hl=enlr=ie=UTF-8selm=4zp8kn7xe.fsf_-_%40beta.franz.com

i've done reading. my question is - have YOU read this? the lisp problems
have almost nothing in common with haskell


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]



Re: inside the GHC code generator

2006-02-23 Thread Kevin Glynn
Claus Reinke writes:
   the long answer is: are you ever heard promises that gcc is best
   cpu-independent assembler? no? and you know why? because gcc is not
   cpu-independent assembler. gcc was strongly optimized to make
   efficient asm from the code usually written by the C programmers. but
   code generated by ghc has nothing common with it. so we are stay with
   all these register-memory moves, non-unrolled loops and all other
   results of naive compilation. gcc is just don't know how to
   efficiently manage such code!
  
  would there be any advantage to targetting gcc's backend directly?
  
  I notice that Mercury seems to support this:
  http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html
  http://gcc.gnu.org/frontends.html
  
  that is, does using C as assembler disable possible optimizations,
  or is going through the C frontend enabling more optimizations than
  going to the backend directly?
  

On a related point, Mercury has two C backends a low level one at the
level of GHC's and a high level one. Bulat might want to read this for
a description of the high level C implementation:

  http://www.cs.mu.oz.au/research/mercury/information/papers.html#hlc_cc

Also, ghc used to be faster than gcc for a naive, recursive factorial
function (once the cpr analysis and optimisation was added).  From
what Bulat wrote it seems that gcc got better ...

k



Re[2]: inside the GHC code generator

2006-02-23 Thread Bulat Ziganshin
Hello Claus,

Thursday, February 23, 2006, 8:56:57 PM, you wrote:

 the long answer is: are you ever heard promises that gcc is best
 cpu-independent assembler? no? and you know why? because gcc is not
 cpu-independent assembler. gcc was strongly optimized to make
 efficient asm from the code usually written by the C programmers. but
 code generated by ghc has nothing common with it. so we are stay with
 all these register-memory moves, non-unrolled loops and all other
 results of naive compilation. gcc is just don't know how to
 efficiently manage such code!

CR would there be any advantage to targetting gcc's backend directly?

CR I notice that Mercury seems to support this:
CR http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html
CR http://gcc.gnu.org/frontends.html

CR that is, does using C as assembler disable possible optimizations,
CR or is going through the C frontend enabling more optimizations than
CR going to the backend directly?

i will read this. but one disadvantage is obvious - we can't use other
C compilers, including the more efficient intel c/c++ (for my code it was
consistently 10% faster)

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]



Re[2]: inside the GHC code generator

2006-02-23 Thread Bulat Ziganshin
Hello Kevin,

Thursday, February 23, 2006, 9:06:25 PM, you wrote:

KG On a related point, Mercury has two C backends a low level one at the
KG level of GHC's and a high level one. Bulat might want to read this for
KG a description of the high level C implementation:

KG  
KG http://www.cs.mu.oz.au/research/mercury/information/papers.html#hlc_cc

quoting from this paper's abstract: "Many logic programming
implementations compile to C, but they compile to very low-level C,
and thus discard many of the advantages of compiling to a high-level
language." it's the same as i think

KG Also, ghc used to be faster than gcc for a naive, recursive factorial
KG function (once the cpr analysis and optimisation was added).  From
KG what Bulat wrote it seems that gcc got better ...

i don't say that we must compile recursive Haskell/STG functions
naively to recursive C ones (as jhc does). no - i say that we should
translate recursive Haskell definitions to explicit C loops, which is
NATURAL C PROGRAMMING STYLE and is therefore optimized much better, as
you can see in the files i attached

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]



Re[2]: inside the GHC code generator

2006-02-23 Thread Kevin Glynn

Bulat Ziganshin writes:
  Hello Kevin,
  

  
  KG Also, ghc used to be faster than gcc for a naive, recursive factorial
  KG function (once the cpr analysis and optimisation was added).  From
  KG what Bulat wrote it seems that gcc got better ...
  
  i don't say that we must compile recursive Haskell/STG functions
  naively to recursive C ones (as jhc does). no - i say that we should
  translate recursive Haskell definitions to explicit C loops, which is
  NATURAL C PROGRAMMING STYLE and is therefore optimized much better, as
  you can see in the files i attached
  
  

Ahhh, OK. I misunderstood.  I don't claim ghc beat a C loop.  ghc was
just more optimised at making function calls :-).

k


Re: inside the GHC code generator

2006-02-23 Thread Rene de Visser

From: Bulat Ziganshin [EMAIL PROTECTED]

Hello Rene,



i've done reading. my question is - have YOU read this? the lisp problems
have almost nothing in common with haskell


It is a long time since I read this, but some things come to mind. Listed below.


Maybe GHC should generate better C. I am just not sure whether this will 
bring the best global (as opposed to micro-optimization) performance. 
However, doing anything else might be difficult.


Generating better C might also be much more difficult for idioms that don't 
exist in C, and might encourage people to write Haskell that looks like C 
(in order to get C like performance).


I prefer idiomatic Haskell. It would be nice if this could be fast. But can 
idiomatic Haskell be translated to efficient C? It might not have many 
direct loops for example.


-- Other notes

Integer is about 30 times slower than it needs to be on GHC if over half 
the values lie between -2^31 and 2^31. On i386 you can basically test 
the low bit for free (it runs in parallel with the integer add instruction). 
This would allow only values outside this range to require calling the 
long-integer code. Such an optimization is not easily done in C.


This encourages Haskell programmers to always use Int, even if the value 
might get too big, because Integer is too slow.


Also, tail recursion is more general than looping. A general tail-call 
optimization will bring better returns than a loop optimization in GHC 
(though I could be wrong here). This requires special stack handling. Also 
not so easy in C.
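(One classic workaround for general, mutual tail calls in portable C is a trampoline: each step returns the next function to call instead of calling it, so the C stack never grows. This is a well-known technique, not something GHC does; all names below are illustrative.)

```c
#include <stddef.h>

struct step;                                  /* forward declaration */
typedef struct step (*fn_t)(long *n, long *acc);
struct step { fn_t next; };                   /* NULL next = done */

static struct step even_step(long *n, long *acc);
static struct step odd_step(long *n, long *acc);

static struct step even_step(long *n, long *acc)
{
    if (*n == 0) { *acc = 1; return (struct step){ NULL }; }
    --*n;
    return (struct step){ odd_step };         /* "tail call" */
}

static struct step odd_step(long *n, long *acc)
{
    if (*n == 0) { *acc = 0; return (struct step){ NULL }; }
    --*n;
    return (struct step){ even_step };        /* "tail call" */
}

/* Mutually recursive parity test in constant stack space. */
long is_even(long n)
{
    long acc = 0;
    struct step s = { even_step };
    while (s.next)                            /* the trampoline loop */
        s = s.next(&n, &acc);
    return acc;
}
```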


If only simple loops are optimized it will encourage people to always code 
loops in their haskell rather than perhaps using more appropriate 
constructs.


Also take the Maybe data type with Nothing and Just ..., or any other 
datatype with zero- and one-argument constructors. These could be represented 
by special values for the zero-argument case and bit marking on the single-
constructor values. This could lead to good optimizations on case 
expressions.

Also not so easy in C.

The lack of this feature encourages people to encode their datatypes as 
Int's to get speed. Also not good.
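(A sketch of the pointer-tagging idea being described, purely illustrative: heap pointers are word-aligned, so the low bit is free to mark the one-argument constructor, and the zero-argument constructor becomes a reserved word value. A case expression then becomes a cheap bit test instead of a load of a constructor tag.)

```c
#include <stdint.h>

/* Hypothetical unboxed representation of a Maybe-like type. */
typedef uintptr_t maybe_t;

#define NOTHING      ((maybe_t)0)                       /* reserved value  */
#define JUST(p)      (((maybe_t)(uintptr_t)(p)) | 1u)   /* tag the pointer */
#define IS_JUST(m)   ((m) & 1u)
#define FROM_JUST(m) ((void *)((m) & ~(uintptr_t)1))

static int boxed = 42;            /* stands in for a heap-allocated value */

/* "case m of Nothing -> dflt; Just x -> x" as a single bit test. */
int maybe_or_default(maybe_t m, int dflt)
{
    if (IS_JUST(m))
        return *(int *)FROM_JUST(m);
    return dflt;
}
```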


Whether we can communicate the aliasing and non-aliasing properties of GHC 
code to the C compiler, I am also not so sure.


Also in GHC, I am not sure whether stack-based locals are the best move. It 
might be better to have a fixed page as a register-spill area in some cases.


If you wish to pass and return unboxed tuples without reboxing you will 
probably require a special function interface with double entry points (I 
think this can be done in C, but it is a bit tricky).


Summary: If only C-like Haskell is fast, we end up with Haskell that looks 
like C. Yuck.


Rene.




Re[2]: inside the GHC code generator

2006-02-23 Thread Bulat Ziganshin
Hello Rene,

Thursday, February 23, 2006, 10:17:40 PM, you wrote:

RdV Maybe GHC should generate better C. I am just not sure whether this will
RdV bring the best global (as opposed to micro-optimizations) performance.

i answered in the original letter (search for Cray :)

RdV Generating better C might also be much more difficult for idioms that don't
RdV exist in C,

i will be glad to see concrete examples

RdV and might encourage people to write Haskell that looks like C
RdV (in order to get C like performance).

yes, they already do just that (see the shootout entries or my Streams
library) but still don't get C performance. so i think we will move to a
better position :)

RdV I prefer idiomatic Haskell. It would be nice if this could be fast. But can
RdV idiomatic Haskell be translated to efficient C? It might not have many
RdV direct loops for example.

sorry, i don't understand what you mean. in general, idiomatic Haskell
has a big efficiency problem, and this problem is laziness. strict
Haskell code, like ML code, can be compiled efficiently

RdV -- Other notes

RdV Integer is about 30 times slower than it needs to be on GHC if over half
RdV the values lie between -2^31 and 2^31. On i386 you can basically test
RdV the low bit for free (it runs in parallel with the integer add instruction).
RdV This would allow only values outside this range to require calling the
RdV long-integer code. Such an optimization is not easily done in C.

RdV This encourages Haskell programmers to always use Int, even if the value
RdV might get too big, because Integer is too slow.

this optimization is not implemented in ghc so far, and i think the
amount of work required to implement it is larger than the demand for
faster Integer processing. moreover, i hope that a good optimizing C
compiler can avoid the additional testing

RdV Also Tail recursion is more general than looping. A general tail call 
RdV optimization will bring better returns than a loop optimization in GHC
RdV (though I could be wrong here). This requires special stack handling. Also
RdV not so easy in C.

it seems you haven't seen the attached files. tail calls are optimized
by gcc

RdV If only simple loops are optimized it will encourage people to always code
RdV loops in their haskell rather than perhaps using more appropriate 
RdV constructs.

would you prefer that people have no option to make fast code?
:)  also i want to emphasize that C loops are just Haskell recursion. we
have always said that these two constructs are equivalent; now it's
time to generate code with the same speed

RdV Also take the Maybe data type with Nothing and Just ... or any other 
RdV datatypes with 0 and 1 variable constructors. Here these could be represent
RdV by special values for the 0 variable case and bit marking on the single
RdV constructor values. This could lead to good optimizations on case 
RdV expressions.
RdV Also not so easy in C.

1) it is far beyond current ghc optimization facilities
2) i don't see any problem with using if()

RdV The lack of this feature encourages people to encode their datatypes as
RdV Int's to get speed. Also not good.

RdV Whether we can communicate the non aliasing and aliasing properties of GHC
RdV to the C compiler, I am also not so sure.

concrete examples?

RdV Also in GHC, I am not sure whether stack base locals are the best move. It
RdV might be best to have a fixed page as register spill in some cases.

that's an optimization gcc can handle much better :)  just look at
the code generated for the fac() function

RdV If you wish to pass and return unboxed tuples without reboxing you will
RdV probably required a special function interface with double entry points (I
RdV think this can be done in C, but it is  a bit tricky).

as i said, pair<a,b> can probably be used, but i haven't tested this yet



-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]



RE: Re[2]: inside the GHC code generator

2006-02-23 Thread Rene de Visser




From: Bulat Ziganshin [EMAIL PROTECTED]

i answered in the original letter (search for Cray :)


Re-reading it, I see that you have well-defined goals that cover most of 
my points.


seems that you don't seen the attached files. tail calls are optimized
in gcc


No I don't see any attached files (on any of your emails).

Best of luck with your optimisations.
I will look forward to using them.

(Although I don't like the style of the shootout code,
I find it very useful in helping me speed up my own code,
and yes, I find it better to write Haskell in such a style than
to use the FFI and call C).

Rene.




Re: inside the GHC code generator

2006-02-23 Thread kyra

Bulat Ziganshin wrote:


i prefer to see the asm code. this may be because of better high-level
optimization strategies (reusing fib values). the scheme i am talking
about will combine the advantages of both worlds

no strategies, just the plain exponential algorithm:

ocaml:
<excerpt>
_camlFibo__fib_57:
        sub     esp, 8
L101:
        cmp     eax, 5
        jge     L100
        mov     eax, 3
        add     esp, 8
        ret
L100:
        mov     DWORD PTR 0[esp], eax
        add     eax, -4
        call    _camlFibo__fib_57
L102:
        mov     DWORD PTR 4[esp], eax
        mov     eax, DWORD PTR 0[esp]
        add     eax, -2
        call    _camlFibo__fib_57
L103:
        mov     ebx, DWORD PTR 4[esp]
        add     eax, ebx
        dec     eax
        add     esp, 8
        ret
</excerpt>

visual C++ 7.1 (next fastest):
<excerpt>
_fib    PROC NEAR
        push    esi
        mov     esi, DWORD PTR 8[esp]
        cmp     esi, 2
        jge     SHORT $L945
        mov     eax, 1
        pop     esi
        ret     0
$L945:
        lea     eax, DWORD PTR [esi-2]
        push    edi
        push    eax
        call    _fib
        dec     esi
        push    esi
        mov     edi, eax
        call    _fib
        add     esp, 8
        add     eax, edi
        pop     edi
        pop     esi
        ret     0
_fib    ENDP
</excerpt>

also, Clean is *EXACTLY* in line with ocaml. This is interesting, 
because Clean is so similar to Haskell.



Re: inside the GHC code generator

2006-02-23 Thread Ben Rudiak-Gould

[EMAIL PROTECTED] wrote:

* multiple results can be returned via the C++ pair<a,b> template, if this is
efficiently implemented in gcc


There's a -freg-struct-return option in 2.x, 3.x and 4.x. I think it's off 
by default on many architectures.
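(In plain C the same effect comes from returning a small struct by value; with -freg-struct-return, and by default on the common 64-bit ABIs, gcc returns it in registers rather than through a hidden out-pointer. A minimal sketch:)

```c
/* Two results returned by value. With -freg-struct-return (or on the
   common 64-bit ABIs by default) this struct comes back in registers,
   so no reboxing through memory is needed. */
struct divmod { long quot, rem; };

struct divmod divmod_long(long a, long b)
{
    struct divmod r = { a / b, a % b };
    return r;
}
```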



* recursive definitions translated into the for/while loops if possible


I think recent versions of GCC do a good job of this. See

http://home.in.tum.de/~baueran/thesis/

All of this efficiency stuff aside, there's a big problem you're neglecting: 
GARBAGE COLLECTION. For a language that allocates as much as Haskell I think 
a conservative collector is absolutely out of the question, because they 
can't compact the heap and so allocation becomes very expensive. A copying 
collector has to be able to walk the stack and find all the roots, and that 
interacts very badly with optimization. All the languages I've seen that 
compile via C either aren't garbage collected or use a conservative or 
reference-count collector.


The closest thing I've seen to a solution is the technique used in Urban 
Boquist's thesis, which is to put a static table in the executable with 
information about register and stack usage indexed by procedure return 
point, and use that to unwind the stack and find roots. Some C compilers 
(including gcc) support debugging of optimized code, so it might be possible 
to parse the debugging tables and extract this information. As far as I 
know, no one has ever looked into the feasibility of doing this, which is 
kind of surprising since root-finding is probably the single biggest 
obstacle to compiling functional languages via C.
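(A simpler, slower alternative to those static tables, sometimes used when compiling GC'd languages via C, is an explicit "shadow stack": each function registers the locals that hold heap pointers on a stack the collector can walk, so the C compiler's stack layout no longer matters. This is a known general technique, not something GHC uses; all names are illustrative.)

```c
#include <stddef.h>

#define MAX_ROOTS 1024
static void **shadow_stack[MAX_ROOTS];   /* addresses of pointer locals */
static size_t shadow_top;

static void push_root(void **slot) { shadow_stack[shadow_top++] = slot; }
static void pop_roots(size_t n)    { shadow_top -= n; }

static int obj_a, obj_b;                 /* stand-ins for heap objects */

/* What a compiled function body would look like: register the roots,
   do work that might trigger GC, then unregister them. A copying
   collector would walk shadow_stack[0..shadow_top) and may rewrite
   each *slot when it moves an object. */
int example_frame(void)
{
    void *x = &obj_a, *y = &obj_b;       /* "heap pointers" */
    push_root(&x);
    push_root(&y);
    /* ... allocation / possible GC here ... */
    pop_roots(2);
    return x == &obj_a && y == &obj_b;
}
```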


-- Ben



Re: Re[2]: inside the GHC code generator

2006-02-23 Thread Jean-Philippe Bernardy
Hello Bulat,

From my (limited) knowledge of the GHC backend, the difficult part of your
plan is that STG is not suited to compilation to native C at all. You
might need to do quite an advanced translation from STG to another
intermediate language (GRIN, for example), and then some more
advanced analysis/optimization before C can be generated.

iirc, the tricky part is the handling of laziness. At any point you
may end up with a thunk (closure), which ghc handles easily by
evaluating it: it's always a function pointer (that's the tagless
part). When the value is already in WHNF, it just calls a very simple
function that fills registers with the evaluated data. Otherwise, the
function call performs the reduction to WHNF.

If you want to use the same trick, you'll end up with the same
problems (bad C). So, you'll want to eliminate thunks statically,
finally requiring a 'high level' analysis as I suggested above.

Also, allocation + garbage collection is extremely efficient in
current GHC... C/C++ allocation doesn't even come close. It's entirely
possible that even with a very fast backend, the better ghc allocator
will be enough to swing the balance (see jhc).
It might be possible to re-use most of it, however.
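(The reason GHC-style allocation is so cheap is bump-pointer allocation: allocating is a pointer increment plus a limit check, and a copying collector later compacts the live data. A minimal sketch, with illustrative sizes and names:)

```c
#include <stddef.h>

#define HEAP_SIZE (1 << 20)
static unsigned char heap[HEAP_SIZE];
static size_t hp;                        /* the bump pointer */

/* Allocation is just "bump and check" -- no free lists, no metadata.
   A real runtime would trigger a GC instead of returning NULL. */
void *bump_alloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;            /* keep 8-byte granularity */
    if (hp + n > HEAP_SIZE)
        return NULL;                     /* heap exhausted: GC here */
    void *p = &heap[hp];
    hp += n;
    return p;
}
```

Compare this with malloc, which must search free lists and maintain per-block bookkeeping on every call.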

I certainly don't want to discourage you (we all want faster code :),
but there is no easy path to climb the mountain ;)

Cheers,
JP.