Re: Loop unrolling + fusion ?

2009-03-27 Thread Claus Reinke

I can't see any issues with this version of the spec.


Thanks. From the silence, we seemed to have lost the innocent 
bystanders? Anyway, for those who haven't noticed, there is now 
a feature request ticket (for that good feeling of closing it when this
is finally implemented;-) as well as a wiki page describing the issues, 
spec, and examples:


http://hackage.haskell.org/trac/ghc/ticket/3123
http://hackage.haskell.org/trac/ghc/wiki/Inlining


I think in the implementation it makes most sense to do this as a
core2core pass at an early stage in the pipeline, probably via plugins
(so will have to wait until I get those into HEAD). 


What are the plans for plugin support? I do think plugins will be
useful, but inlining is pretty central to the existing optimizer 
transformations,
isn't it? Would the transformation code differ much between in-GHC
and via-plugins? Perhaps the transformation pass could be implemented
now, and later moved out into a plugin, possibly along with other passes.

I have also been wondering about the relation between rewrite RULES
and plugins. Assuming we can find a more convenient syntax, aren't
plugin+syb-based rewrites going to be more expressive, with more
control than RULES? Or is the syntactic/compiletime overhead going 
to remain so high that both RULES and plugins will be kept in GHC?


(cf the recent thread on "optimization and rewrite rules questions"
http://www.haskell.org/pipermail/glasgow-haskell-users/2009-February/016702.html
 )


In the case of
PEEL, we don't want to identify all call sites directly and do the
substitution in the pass so we should just output some bindings which
will certainly be inlined into call sites later on. So, the
transformation should be bottom up on the Core syntax tree and when it
meets a recursive group of bindings we should do something like this:

{-# INLINE f g PEEL 3 UNROLL 2 #-}
f = ... g ... f ... h ...
g = ... g ... f ... h ...
h = ... g ... f ... h ...

=(my pass)=>

-- Temporary copies of f and g - dead code
f_old = ... g_old ... f_old ... h ...
g_old = ... g_old ... f_old ... h ...
-- H unchanged for now, might get PEELed stuff inlined later
h = ... g .. f ... h ...


You mean UNROLLed stuff (PEEL is only for entries into the group).


-- Top level unrolled definiiton - if we weren't doing peeling, these
would be the new f and g
f_unrolled = ... g_unrolled_1 ... f_unrolled_1 ... h ...
g_unrolled = ... g_unrolled_1 ... f_unrolled_1 ... h ...

-- Unrolled iteration. Will get inlined into f_unrolled / g_unrolled soon
{-# INLINE f_unrolled_1 g_unrolled_1 #-}
f_unrolled_1 = ... g_unrolled ... f_unrolled ... h ...
g_unrolled_1 = ... g_unrolled ... f_unrolled ... h ...


Ah, yes, we need to be unambiguous about the interpretation of the
counters:-) I was thinking of n+1 (adding n copies to the original), you 
are thinking of n (adding copies until there are n).



-- One level of peeling
{-# INLINE f_1 g_1 #-}
f_1 = ... g_unrolled ... f_unrolled ... h ...
g_1 = ... g_unrolled ... f_unrolled ... h ...

-- Second level of peeling
{-# INLINE f_2 g_2 #-}
f_2 = ... g_1 ... f_1 ... h ...
g_2 = ... g_1 ... f_1 ... h ...

-- Final level of peeling and new definitions for f and g. Inline pragmas
-- make sure all of this gets inlined at the call site
{-# INLINE f g #-}
f = ... g_2 ... f_2 ... h ...
g = ... g_2 ... f_2 ... h ...


Wait, now you are counting to n+1 for PEEL and to n for UNROLL?


=(after the simplifier has run - effectively - there are a few
harmless lies here)=>

-- NB: I haven't shown inlining of the new f and g here, but it /will/ happen
h = ... g .. f ... h ...


Since we are interpreting recursive groups as single entities, and there
is usually no inlining into definitions that will get inlined, we will have to
specify this carefully.


-- I've inlined the inner unrolled iteration at every /call site/
within the top level unrolled iteration, as per
-- the pragmas. Noone actualy calls this unrolled thing directly
though, since we used PEEL as well
f_unrolled = ... (... g_unrolled ... f_unrolled ... h ...) ... (...
g_unrolled ... f_unrolled ... h ...) ... h ...
g_unrolled = ... (... g_unrolled ... f_unrolled ... h ...) ... (...
g_unrolled ... f_unrolled ... h ...) ... h ...


Again, we have to make sure of this interpretation.


-- This huge chunk of code gets inlined at every call site, which in
turn call through to the unrolled bodies
{-# INLINE f g #-}
f = ... (... (... g_unrolled ... f_unrolled ... h ...) ... (...
g_unrolled ... f_unrolled ... h ...) ... h ...) ... (... (...
g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ...
f_unrolled ... h ...) ... h ...) ... h ...
g = ... (... (... g_unrolled ... f_unrolled ... h ...) ... (...
g_unrolled ... f_unrolled ... h ...) ... h ...) ... (... (...
g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ...
f_unrolled ... h ...) ... h ...) ... h ...


So this would be the result of inlining all the PEEL instances into 'f' and 'g'. 


By ensuring that f and g are tagged 

Re: Loop unrolling + fusion ?

2009-03-19 Thread Claus Reinke

Dear Simon*,

thanks for answering my concerns about -fvia-C replacement. Are these 
answers somewhere in the ghc wiki, or perhaps they'd make a good basis

for a useful ghc blog post?

So, -fasm will soon be up to speed with -fvia-C in all cases, new native
backends are not more difficult than more mangler branches, and tools
for generating wrappers for CPP-based APIs should be provided.

Just one additional point re:
We haven't had a single new registerised port of GHC in many years now. 


Interest in new platforms is increasing again, though. People have been
talking about PS3, internet tablets, multicore machines, .. Personally,
I'd like to be able to use GHC on, or at least for, coming smartphone 
generations, etc. (I don't see myself looking at native backends there,

but probably I wouldn't have braved the mangler, either; still, someone
else might prefer one over the other). And I don't understand how 
people can be happy with unregisterised GHC ports for long, given

how many optimizations GHC is not doing even in best form!-)

Thanks again,
Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-19 Thread Max Bolingbroke
2009/3/19 Claus Reinke :
> Recursion unfolding spec, 2nd attempt.
>
> 
>
> If this is an improvement on the first version, and after correcting any
> obvious issues, I should put it on the ghc trac wiki somewhere, and create a
> feature request ticket.

I can't see any issues with this version of the spec.

I think in the implementation it makes most sense to do this as a
core2core pass at an early stage in the pipeline, probably via plugins
(so will have to wait until I get those into HEAD). In the case of
PEEL, we don't want to identify all call sites directly and do the
substitution in the pass so we should just output some bindings which
will certainly be inlined into call sites later on. So, the
transformation should be bottom up on the Core syntax tree and when it
meets a recursive group of bindings we should do something like this:

{-# INLINE f g PEEL 3 UNROLL 2 #-}
f = ... g ... f ... h ...
g = ... g ... f ... h ...
h = ... g ... f ... h ...

=(my pass)=>

-- Temporary copies of f and g - dead code
f_old = ... g_old ... f_old ... h ...
g_old = ... g_old ... f_old ... h ...
-- H unchanged for now, might get PEELed stuff inlined later
h = ... g .. f ... h ...

-- Top level unrolled definiiton - if we weren't doing peeling, these
would be the new f and g
f_unrolled = ... g_unrolled_1 ... f_unrolled_1 ... h ...
g_unrolled = ... g_unrolled_1 ... f_unrolled_1 ... h ...

-- Unrolled iteration. Will get inlined into f_unrolled / g_unrolled soon
{-# INLINE f_unrolled_1 g_unrolled_1 #-}
f_unrolled_1 = ... g_unrolled ... f_unrolled ... h ...
g_unrolled_1 = ... g_unrolled ... f_unrolled ... h ...

-- One level of peeling
{-# INLINE f_1 g_1 #-}
f_1 = ... g_unrolled ... f_unrolled ... h ...
g_1 = ... g_unrolled ... f_unrolled ... h ...

-- Second level of peeling
{-# INLINE f_2 g_2 #-}
f_2 = ... g_1 ... f_1 ... h ...
g_2 = ... g_1 ... f_1 ... h ...

-- Final level of peeling and new definitions for f and g. Inline pragmas
-- make sure all of this gets inlined at the call site
{-# INLINE f g #-}
f = ... g_2 ... f_2 ... h ...
g = ... g_2 ... f_2 ... h ...

=(after the simplifier has run - effectively - there are a few
harmless lies here)=>

-- NB: I haven't shown inlining of the new f and g here, but it /will/ happen
h = ... g .. f ... h ...

-- I've inlined the inner unrolled iteration at every /call site/
within the top level unrolled iteration, as per
-- the pragmas. Noone actualy calls this unrolled thing directly
though, since we used PEEL as well
f_unrolled = ... (... g_unrolled ... f_unrolled ... h ...) ... (...
g_unrolled ... f_unrolled ... h ...) ... h ...
g_unrolled = ... (... g_unrolled ... f_unrolled ... h ...) ... (...
g_unrolled ... f_unrolled ... h ...) ... h ...

-- This huge chunk of code gets inlined at every call site, which in
turn call through to the unrolled bodies
{-# INLINE f g #-}
f = ... (... (... g_unrolled ... f_unrolled ... h ...) ... (...
g_unrolled ... f_unrolled ... h ...) ... h ...) ... (... (...
g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ...
f_unrolled ... h ...) ... h ...) ... h ...
g = ... (... (... g_unrolled ... f_unrolled ... h ...) ... (...
g_unrolled ... f_unrolled ... h ...) ... h ...) ... (... (...
g_unrolled ... f_unrolled ... h ...) ... (... g_unrolled ...
f_unrolled ... h ...) ... h ...) ... h ...


By ensuring that f and g are tagged INLINE we get the existing INLINE
restrictions automatically in later Core passes.

I think that this example transformation matches your spec - am I right?

Cheers,
Max
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-19 Thread Max Bolingbroke
2009/3/19 Claus Reinke :
> If the map, filter, fold, etc can be unrolled, then the unrolled
> definitions would be fused, right? So what is missing is fine
> control ("how much to unroll this particular call to map here").

The issues is that In stream fusion the combinators like "map" are all
non-recursive and so unrolling/peeling doesn't make any sense. In
fact, their being non-recursive is almost the whole point, because it
lets GHC inline them like crazy and hence build a nice efficient fused
pipeline of combinators eventually. The recursion is introduced purely
in one place - unstream - and even then it doesn't go through unstream
but through a locally recursive wrapper (so GHC can see the structure
of the stream).

So, it might be sufficient if:
1) You changed stream fusion so unstream was directly recursive, but
added an INLINE PEEL 1 annotation to it, so if the call site doesn't
do any unrollling at least you will still be able to spot the
structure of the stream
2) You could, at the call site, add an INLINE PEEL 1 UNROLL n
annotation that took the /original/ RHS for unstream and unrolled it
however many times the user specifies (you still need a PEEL 1 so you
can spot the stream structure in the unrolled loop)

Unfortunately, this all feels quite fragile :-(

Max
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-18 Thread Claus Reinke

{-# INLINE f PEEL n UNROLL m #-}


The problem here is that this only works for directly recursive  
functions which I, for instance, don't normally use in high- 
performance code. Most of my loops are pipelines of collective  
combinators like map, filter, fold etc. because these are the ones  
that can be fused automatically. Unless I'm misunderstanding  
something, this approach doesn't handle such cases.


If the map, filter, fold, etc can be unrolled, then the unrolled
definitions would be fused, right? So what is missing is fine
control ("how much to unroll this particular call to map here").

Would it help to allow {-# INLINE map PEEL n UNROLL m #-}
in the caller modules as well as the definition modules, with the latter
providing a general case/upper limit, and the former providing finer
control? If you wanted even finer control, one would need a way to
specify named copies of inlineable recursion combinators, with
PEEL/UNROLL attached to the copies..

I see how this would need addressing, but I don't yet see a good
way to specify call-site-specific PEEL/UNROLL for recursion
combinators. Unless you want to control it by adding combinators
for the purpose?-)

Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-18 Thread Claus Reinke
Recursion unfolding spec, 2nd attempt. 


The main difference is to look at groups of mutually recursive
definitions as a whole, rather than trying to think about individual
definitions. That step actually seems sufficient to address most of
the shortcomings raised so far, such as avoiding runaway INLINE
or using PEEL/UNROLL also for mutually recursive definitions. 
I've also interpreted Max's comments as most of the existing 
INLINE restriction still making sense for recursive INLINE,

with small clarifications.

In the following, let REC({f g ..}) denote the set of all identifiers
belonging to the recursion involving f, g, .. (f, g, .. in REC({f g ..})
or in {-# INLINE f g .. #-} are required to belong to the same 
recursion).


{-# NOINLINE f #-}
  as now: no unfolding of f

{-# INLINE f #-} 
  as now: for non-recursive f only, unfold definition of f at call

  sites of f (might in future be taken as go-ahead for analysis-based
  recursion unfolding)

{-# INLINE f g .. PEEL n #-}
  new: unfold definitions of the named identifiers at their call
  sites *outside* their recursion group REC({f g ..}). In other
  words, *entries into* REC({f g ..}) via f, g, .. are unfolded.
  
  (for the special case of loops this corresponds to loop peeling)


{-# INLINE f g .. UNROLL m #-}
  new: unfold definitions of the named identifiers at their call
  sites *inside* their recursion group REC({f g ..}). In other
  words, *cross-references inside* REC({f g ..}) via f, g, .. are
  unfolded.
  
  (for the special case of loops this corresponds to loop unrolling)


{-# INLINE f g .. PEEL n UNROLL m #-}
  combine the previous two

  The numeric parameters are to be interpreted as if each call to
  f, g, .. was annotated with both PEEL and UNROLL limits for the 
  whole recursion group REC({f g ..}), starting with the limits from

  the pragmas (write f_n_m for a call to f with PEEL limit n and
  UNROLL limit m), to be decreased for every PEEL or UNROLL 
  action, as follows (REC({f g}) = {f g h}, in these examples):


1. let {-# INLINE f g PEEL n UNROLL m #-}
  f .. = .. f_?_? .. g_?_? .. h_0_0 ..
  g .. = .. f_?_? .. g_?_? .. h_0_0 .. 
  h .. = .. f_?_? .. g_?_? .. h_0_0 .. 
  in ..|f_n_m|..


  --PEEL-->

  let {-# INLINE f g PEEL n UNROLL m #-}
  f .. = .. f_?_? .. g_?_? .. h_0_0 ..
  g .. = .. f_?_? .. g_?_? .. h_0_0 .. 
  h .. = .. f_?_? .. g_?_? .. h_0_0 .. 
  in ..|.. f_(n-1)_0 .. g_(n-1)_0 .. h_0_0 ..|..


  Notes: - unfolding produces copies of definition bodies
 - the PEEL limit at the call site decides the PEEL
   limit for all calls to REC({f g}) in the inlined
   copy; this limit decreases with each PEEL step
- since peeling unfolds code into call sites from outside
   the recursion, the UNROLL limits of calls to REC({f g})
   are effectively 0 in the inlined copy
 - only calls to identifiers named in the INLINE pragma
   can be peeled (f and g here), calls to other members of
   the same recursion remain unaffected (h here), having
   effective limits of 0

2. let {-# INLINE f g PEEL n UNROLL m #-}
  f .. = .. f_0_m .. g_?_? .. h_0_0 ..
  g .. = .. f_?_? .. g_?_? .. h_0_0 .. 
  h .. = .. f_?_? .. g_?_? .. h_0_0 .. 
  in ..


  --UNROLL-->

  let {-# INLINE f g PEEL n UNROLL m #-}
  f .. = .. .. f_0_(m-1) .. g_0_(m-1) .. h_0_0 .. .. g_?_? .. h_0_0 ..
  g .. = .. f_?_? .. g_?_? .. h_0_0 .. 
  h .. = .. f_?_? .. g_?_? .. h_0_0 .. 
  in ..


  Notes: - unfolding produces copies of definition bodies
 - the UNROLL limit at the call site decides the UNROLL
   limit for all calls to REC({f g}) in the inlined copy; this 
   limit decreases with each UNROLL step
 - peeling conceptually precedes unrolling (PEEL limit needs 
   to reach 0 before unrolling commences), to avoid peeling 
   unrolled definitions (this corresponds to an existing restriction
   of no inlining into definitions to be inlined; 
 - unrolling unfolds copies of the original definitions, not the

already unrolled ones, again corresponding to the existing
inlining restriction (TODO: how to specify this avoidance 
of unrolling unrolled defs in this form of local rule spec?)

 - only calls to identifiers named in the INLINE pragma
   can be unrolled (f and g here), calls to other members of
   the same recursion remain unaffected (h here), having
   effective limits of 0

  Peeling and unrolling stop when the respective count annotation has
  reached 0. Peeling precedes unrolling, to avoid ambiguities in the
  size of the peeled definitions. Note that mutual recursion is the
  domain of PEEL, while UNROLL only applies to (mutual) recursion.

  {-# INLINE f PEEL n #-}, for n>0, corresponds to worker/
  wrapper transforms (previously done manually) + inline wrapper,
  and should therefore also be take

RE: Loop unrolling + fusion ?

2009-03-11 Thread Simon Peyton-Jones
Claus, Max

| > My preferred spec would be roughly
| >
| > {-# NOINLINE f #-}
| >as now
| >
| > {-# INLINE f #-}
| >works as now, which is for non-recursive f only (might in future
| >be taken as go-ahead for analysis-based recursion unfolding)
| >
| > {-# INLINE f PEEL n #-}
| >inline calls *into* recursive f (called loop peeling for loops)
| >
| > {-# INLINE f UNROLL m #-}
| >inline recursive calls to f *inside* f (called loop unrolling for loops)
| >
| > {-# INLINE f PEEL n UNROLL m #-}
| >combine the previous two

Sounds as if you two are evolving a good design, thank you.  I am not following 
the details closely, but I have the advantage of being able to chat to Max 
directly.

Suggestion: if after discussion you think this is a valuable thing to do, write 
a GHC-Trac-Wiki page describing the design as precisely as possible (eg with 
examples; I find the above one-liners hard to grok). Along with any major 
design alternatives.  Ideally with a few indicative measurements gotten by 
by-hand transformations, that show there are real benefits to be had.

For implementation, there are two routes.  Either totally built-in, or using a 
Core-to-Core plug-in.  The thing I like about the latter is that it can be done 
without having GHC HQ in the critical path, because we (I) tend to slow things 
down, being a uniprocesor.  We don't have the plug-in capability yet, but I'm 
encouraging Max to polish it up so that we do.  I think it'd be a very valuable 
facility.

Simon
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


RE: Loop unrolling + fusion ?

2009-03-10 Thread Simon Peyton-Jones
| What I don't understand yet is the routemap for replacing -fvia-C

Good points, Claus.  I think the story is as follows:

* -fvia-C does not produce much better code, except in exceptionally tight 
loops, because GHC gives gcc very little scope for optimisation.  Simon 
mentioned something like 1% improvement.

* -fvia-C does not give substantially improved portability, because the Evil 
Mangler must have lots of new (Perl) code for each new platform.  (And each new 
version of gcc changes the details.)

* -fvia-C does impose maintenance costs, as this thread has rehearsed.

* -fasm has the potential for producing *better* code than gcc, because we can 
temporarily re-use registers that we must nail down as far as gcc is concerned.

| In other words, what is the plan wrt to backends, especially wrt
| recovering the optimizations and portability issues previously left to
| gcc?

I think you may be over-optimistic about the portability and optimisation 
benefits. As to other back end plans, it's a fairly active place.  Ben L is 
doing great stuff on refactoring the native code back end as part of his Sparc 
NCG.  And John and Norman and I are actively (albeit diverted recently by ICFP 
submissions) working on getting the refactored STG...flat C-- story into 
mainstream.

Simon
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-09 Thread Brandon S. Allbery KF8NH

On 2009 Mar 9, at 9:32, Claus Reinke wrote:
One way out would be to treat the whole mutual recursion as a single  
entity, either implicitly, as I indicated, or explicitly, as I  
interpret Brandon's somewhat ambiguous comment. In other words, the  
peel/unroll limits would apply to a whole group of mutually


Sorry, yes, I intended that the unrolling applied explicitly to a  
group of mutually recursive functions.  I'm not sure if the unroll/ 
peel counts should be multiplied by the number of functions, though.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon universityKF8NH




PGP.sig
Description: This is a digitally signed message part
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-09 Thread Max Bolingbroke
2009/3/9 Claus Reinke :
 But if you annotate all your unrolled and peeled new definitions as
> NOINLINE, do you still get the optimizations you want? There are probably a
> few GHC optimizations that can "look through" non-recursive
> lets, but RULES are not among those.

The benefit that comes immediately to mind is extra freedom for the
code generator. If we have several copies of the body of e.g. a loop
it may be able to schedule instructions much better. This is why GCC
unrolls loops, of course. Of course, Core may not be the best place to
do this sort of unrolling as Roman pointed out earlier in the thread.
But yeah, beyond this I don't /think/ that non-inlined duplications
would help GHC at all (it might be a different story if we did partial
inlining).

All the best,
Max
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-09 Thread Claus Reinke

let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..



Probably what you intend here is that you create one copy of the
definition every round rather than one per call site, is that right?


I don't think so - ultimately, the point of both peeling and unrolling is to
unfold a definition into a use site, to enable further optimizations, not
just to move from a recursive to a non-recursive definition. We could try to
do it in two steps, as you suggest, but that would expose us to the
heuristics of GHC inlining again (will or won't it inline the
new shared definition?), in the middle of a user-annotation-based unfolding.


Ah - I was thinking of something a bit different, where:

* PEEL / UNROLL pragmas duplicate the method body once per level of
peeling / unrolling and fix up the recursive calls as appropriate
* The user optionally adds an INLINE pragma to the function if he
additionally wants to be SURE that those duplicates get inlined at the
use sites


Ok, I suspected as much. You'd need to make the 'INLINE f' apply
to the generated 'fN', of course.


This means that PEEL / UNROLL represent nice logically-orthogonal bits
of functionality to INLINE-ing.


Usually, I'm all for orthogonality, and for more knobs to allow hand-tuning
of things that have no automatically reachable optimal solutions. In this case, 
however, I'm not sure anything would be gained. I recall that your hand-
unrolled code was written in a similar style, and assumed that it was a 
question of style, which GHC would inline into the same code.


But if you annotate all your unrolled and peeled new definitions as 
NOINLINE, do you still get the optimizations you want? There are 
probably a few GHC optimizations that can "look through" non-recursive

lets, but RULES are not among those.

For loop-style recursion, there'd be only one use per definition, so inlining
would be the default and there'd be no difference, but for non-loop-style
recursion, inlining might not happen, and so no further optimizations would
be enabled. Off the top of my head, I can't think of a case where that
would lead to improved code, but as I'm discovering, I'm not very familiar
with the details of what optimizations GHC is actually doing (though this
is quite helpful: 
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/HscMain )

so I might be missing something?



Furthermore, I'm not too keen on duplicating method bodies at call
sites willy-nilly because it may lead to increased allocations (of the
function closures) in inner loops. At least if you bind the duplicated
methods at the same level as the thing you are duplicating you only
increase the dynamic number of closures created by a constant factor!


Yes, every form of INLINE has its limits. But if users say they want 
inlining (or peeling or unrolling or any other form of unfolding), that's 
what they should get, including those worrysome duplications. The 
idea is to create lots of added code (in order to remove abstractions 
that might hide optimization opportunities), which will then be simplified 
to something smaller (or at least better performing) than what we 
started out with. Providing the means to fine tune the amount of 
duplications might be useful, but preventing them entirely is not an option.


Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-09 Thread Max Bolingbroke
2009/3/9 Claus Reinke :
>>> let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..
>
>> Probably what you intend here is that you create one copy of the
>> definition every round rather than one per call site, is that right?
>
> I don't think so - ultimately, the point of both peeling and unrolling is to
> unfold a definition into a use site, to enable further optimizations, not
> just to move from a recursive to a non-recursive definition. We could try to
> do it in two steps, as you suggest, but that would expose us to the
> heuristics of GHC inlining again (will or won't it inline the
> new shared definition?), in the middle of a user-annotation-based unfolding.

Ah - I was thinking of something a bit different, where:

* PEEL / UNROLL pragmas duplicate the method body once per level of
peeling / unrolling and fix up the recursive calls as appropriate
* The user optionally adds an INLINE pragma to the function if he
additionally wants to be SURE that those duplicates get inlined at the
use sites

This means that PEEL / UNROLL represent nice logically-orthogonal bits
of functionality to INLINE-ing.

Furthermore, I'm not too keen on duplicating method bodies at call
sites willy-nilly because it may lead to increased allocations (of the
function closures) in inner loops. At least if you bind the duplicated
methods at the same level as the thing you are duplicating you only
increase the dynamic number of closures created by a constant factor!
I've actually been thinking about using a different strategy for case
liberation (which duplicates method bodies at call sites) to make it
more constructor-specialisation like (which duplicates method bodies
at the definition site) partly for this reason.

Cheers,
Max
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-09 Thread Simon Marlow

Claus Reinke wrote:

That was one of my questions in the optimization and rewrite rules
thread: shouldn't -fvia-C be supported (as a non-default option)
for at least as long as the alternative isn't a clear win in all cases?


The trouble with supporting multiple backends is that the cost in 
terms of testing and maintenance is high.  And the registerised 
-fvia-C backend is particularly nasty, coming as it does with 
thousands of lines of Perl 4 that regularly get broken by new versions 
of gcc.


Yes, I can understand that you'd like to leave that part behind sometime
before yesterday:-) I assume that this very complexity means that the
-fvia-C route doesn't really get all the way to its most interesting
promises (easy portability, and full backend optimizations inherited
from gcc). And with that in mind, I can also understand that you don't 
want to put in any further work into trying to improve it, if that 
distracts from a better long-term solution.

What I don't understand yet is the routemap for replacing -fvia-C. We've
seen -fvia-C being demoted from default to backup (fine by me), we've
seen a feature supported only by -fvia-C removed completely, instead of 
seeing support for it added to the -fasm route (macro-based APIs

used to work with ffi, would now require a wrapper generator, which
doesn't exist yet).
Indications are that -fvia-C still tends to produce better code (even 
though it is not the best that ghc+gcc could produce) than -fasm (is 
that any better for the new backend?). And last, but not least, ghc has 
more limited resources than gcc, so how is ghc going to beat gcc at the 
portability and backend optimizations game while still making progress
in its core competencies (ie, higher-level improvements; there's also 
the interesting side-issue of how the two stages of optimizations are 
going to interact in ghc, if there is a barrier that can only be crossed 
in one direction)?


Ok, thanks for bringing these points up.  Hopefully I'll be able to lay 
your fears to rest:


1. Performance.

-fvia-c currently produces code that is on average about 1% faster than -fasm:

  http://www.cse.unsw.edu.au/~dons/nobench/x86_64/results.html

There's one notable exception: floating-point code on x86 (not x86_64) is 
terrible with -fasm, because our native code generator has a particularly 
simple/stupid implementation of the x87 instruction set.  So we need to 
make the SSE2 code generator in the x86_64 backend work for x86, too.


Having said that, the native backend has much more potential for generating 
faster code than we can with gcc.  Firstly, it can re-use fixed registers 
(e.g. argument registers) within a basic block, whereas gcc can't.  We 
don't do this currently because the C-- lacks the liveness information on 
jumps, but the new backend will be able to do it.  I bet this alone will be 
worth more than that 1%.  Secondly we have a much better handle on aliasing 
inside GHC than gcc does, and there's no good way to tell gcc what we know 
about aliasing.


On x86, gcc has a grand total of 2 spare registers, which means it has 
virtually no scope for generating good code.  There's also not much room 
for generating C that is more amenable to gcc's optimisations.  The obvious 
thing to do is to make recursive functions look like loops.  We've tried it 
(there's some experimental code in GHC to do it), IIRC it didn't buy very 
much.  The lack of registers, and the lack of knowledge about aliasing 
(heap doesn't alias with stack) meant that gcc didn't do some 
obvious-looking optimisations.  Trying to do better here is a dead end.


2. Portability.

We haven't had a single new registerised port of GHC in many years now. 
While the via-C backend seems at first glance to offer some portability 
benefits, in practice porting the mangler is still a pain unless your 
platform is very similar to an existing one (e.g. vanilla ELF).


The only C-only registerised port we had was Sparc, and thanks to Ben 
Lippmeier we now have a native backend for that too.  Dropping the C 
backend won't harm any of our existing ports, and it doesn't seem like 
people are making new ports of GHC this way either.


We'll still have the unregisterised porting route, whose only drawback is 
performance.  Still, lots of platforms are successfully using 
unregisterised GHC ports (via Debian).


One day maybe we'll have an LLVM backend, or similar.  My impression is 
that right now we can't make an LLVM backend with as good performance as 
our native backend, without changes in LLVM.  Maybe that will change. 
Nothing that we're doing now precludes adding an LLVM backend later, I believe.


3. Features.

This is a non-issue: -fvia-C vs. -fasm should not affect what programs 
work.  Up until 6.10.1 we had a bug whereby you could use -fvia-C to bind 
to CPP-based C APIs, but that bug was removed in 6.10.1.  Ok, I realise 
that some people considered this to be a feature and its removal to be a 
regression.  However, I bel

Re: Loop unrolling + fusion ?

2009-03-09 Thread Claus Reinke

{-# INLINE f PEEL n #-}
  inline calls *into* recursive f (called loop peeling for loops)
{-# INLINE f UNROLL m #-}
  inline recursive calls to f *inside* f (called loop unrolling for  
loops)


{-# INLINE f PEEL n UNROLL m #-}
  combine the previous two


The problem here is that this only works for directly recursive  
functions which I, for instance, don't normally use in high- 
performance code. Most of my loops are pipelines of collective  
combinators like map, filter, fold etc. because these are the ones  
that can be fused automatically. Unless I'm misunderstanding  
something, this approach doesn't handle such cases.


Actually, my first sketch had a problem in that it would work
only too well for mutually recursive functions, making it necessary
to use loop breakers in spite of the explicit limits (even if we limit
unroll to direct recursion, as I intended originally, peeling would 
then apply to the calls into other functions in the recursion). 

One way out would be to treat the whole mutual recursion as a 
single entity, either implicitly, as I indicated, or explicitly, as I 
interpret Brandon's somewhat ambiguous comment. In other 
words, the peel/unroll limits would apply to a whole group of 
mutually recursive definitions, ensuring termination of the inline 
process without additional loop breakers. If we do that, then

it might make sense to talk about peeling/unrolling wrt the whole
recursion group.

In any case, I need to refine my spec!-) But this discussion is
very helpful in finding the issues that need to be addressed and
clarified. Another issue that I ran into in manual unrolling is that
I sometimes want to unroll wrt a specific parameter of a multi-
parameter function, usually because that parameter can only
have a very small numer of possible values, or just because the
original function encodes multiple loops that I want to disentangle.

Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-09 Thread Claus Reinke

let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..



Probably what you intend here is that you create one copy of the
definition every round rather than one per call site, is that right?


I don't think so - ultimately, the point of both peeling and unrolling 
is to unfold a definition into a use site, to enable further optimizations, 
not just to move from a recursive to a non-recursive definition. We 
could try to do it in two steps, as you suggest, but that would expose 
us to the heuristics of GHC inlining again (will or won't it inline the
new shared definition?), in the middle of a user-annotation-based 
unfolding.


As for the remainder of your useful reply, I'll have to think more
about how to make a local-rule-based approach work properly
(without the hickups of my first sketch) before I can think about
the interactions. I still think it would be useful to have such a
rule-based description, even if a monolithic core2core pass may
be easier to implement: having two independent specs makes it
easier to spot inconsistencies, and if the rule-based form doesn't
get too complicated, it should be more suited for documentation.

Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-09 Thread Max Bolingbroke
2009/3/9 Roman Leshchinskiy :
> The problem here is that this only works for directly recursive functions
> which I, for instance, don't normally use in high-performance code. Most of
> my loops are pipelines of collective combinators like map, filter, fold etc.
> because these are the ones that can be fused automatically. Unless I'm
> misunderstanding something, this approach doesn't handle such cases.

Yep, I think this is an orthogonal piece of functionality. I believe
Claus is concerned with getting the compiler to perform some of the
transformations people currently might want to do for their directly
recursive functions. Of course, you could still UNROLL your unstream
definition, but that doesn't give the user any control over the amount
of unrolling that takes place, which as you have pointed out earlier
may not be a great idea!

Cheers,
Max
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-08 Thread Roman Leshchinskiy

On 07/03/2009, at 09:26, Claus Reinke wrote:


My preferred spec would be roughly

{-# NOINLINE f #-}
  as now

{-# INLINE f #-}works as now, which is for non-recursive f only  
(might in future

  be taken as go-ahead for analysis-based recursion unfolding)

{-# INLINE f PEEL n #-}
  inline calls *into* recursive f (called loop peeling for loops)
{-# INLINE f UNROLL m #-}
  inline recursive calls to f *inside* f (called loop unrolling for  
loops)


{-# INLINE f PEEL n UNROLL m #-}
  combine the previous two


The problem here is that this only works for directly recursive  
functions which I, for instance, don't normally use in high- 
performance code. Most of my loops are pipelines of collective  
combinators like map, filter, fold etc. because these are the ones  
that can be fused automatically. Unless I'm misunderstanding  
something, this approach doesn't handle such cases.


Roman


___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-07 Thread Max Bolingbroke
2009/3/7 Claus Reinke :
> hmm, "appropriate" is one of those words that shouldn't occur in specs,
> not even rough ones, so let's flesh this out a bit, by abstract example.
>
> let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..

Probably what you intend here is that you create one copy of the
definition every round rather than one per call site, is that right?
In the case of mutual recursion, I suppose something like this should
happen:

f = ... g ...
g = ... f ...

==>

f = ... g ...
g = ... f ...
f1 = ... g ...
g1 = ... f ...

i.e. after peeling f1 and g1 are free to be inlined into the use site
if GHC decides that is a good idea. Similarly for two rounds of
peeling you would get:

f = ... g ...
g = ... f ...
f1 = ... g ...
g1 = ... f ...
f2 = ... g1 ...
g2 = ... f1 ...

> let f = ..f{n,m}.. in .. -UNROLL-> let f = ..|..f{n,m-1}..|.. in ..

Similarly I suppose you intended that you get one copy of the body per
UNROLL, rather than per call-site? i.e:

f = ... f ...

==>

f = ... f1 ...
f1 = ... f2 ...
f2 = ... f ...

I'm not completely convinced that this doesn't make sense for mutual recursion:

f = ... g ...
g = ... f ...

==>

f = ... g ...
g = ... f1 ...
f1 = ... g1 ...
g1 = ... f ...

I'm not quite sure how to generalize that though :-)

>>   Non-supporting implementations should treat these as INLINE
>>   pragmas (same warning/ignore or automatic unfold behaviour).

Maybe they SHOULD do, but there are a lot of compilers out there in
the real world that won't :-). Making these entirely new pragmas feels
better to me.

I spoke to Simon PJ about these pragmas and he didn't sound terribly
enthusiatic - but he suggested they would be a nice use case for
compiler plugins :-). Plugins would only be capable of dealing with
UNROLL / PEEL as new pragmas. Of course, this kind of relies on us
getting plugins into the HEAD sometime...

>> - no functions inlined into f: should be subject to override by
>>   INLINE pragmas (even for the non-recursive case?)

If UNROLL / PEEL are seperate annotations we won't prevent inlining
into the UNROLLed/PEELed thing. But that might be bad! What if we
have:

x = BIG

{-# UNROLL f 3 #-}
f = ... x ... f ...

Now if we unconditionally inline x into f as the only use site we will
end up bloating up the code if we later run the unroller. However, if
we unroll first then the simplifier won't inline x and things will be
good. So perhaps you are right to say that this should be an extension
of INLINE.

As for the more general question about whether you should inline stuff
inside INLINEs at all - well, AFAIK the latest work on this by Simon
means that stuff /will/ be inlined inside them, but if that body is
subsequently inlined what gets inlined is the /original/ body as the
user wrote it in his source code. This improves performance when for
some reason a value doesn't get inlined.

>> - no float-in/float-out/cse: ??

The restriction on CSE is principally for NOINLINE things, to prevent
messing with RULEs by changing identifiers around. I'm not sure if
that is relevant here.

No float-in makes sure we don't increase the size of things we are
going to INLINE. This is important with UNROLL / PEEL for the same
reason as above - another argument for this being an extension of
INLINE so we inherit its semantics.

I don't actually know why the no-float-out restriction exists (after
all, it only makes the body smaller!) so I'm not sure what the right
thing to do would be there.

>> - no worker/wrapper transform in strictness analyser: we do get the   same
>> effect from INLINE PEEL, so this should be okay, right?

Maybe I don't understand what you mean, but I don't think this is
true. For example, w/w can unpack a strict argument of product type,
but I dont' think PEEL will let you achieve that.

This restriction exists to prevent losing INLINE pragmas:

"""
Note [Don't w/w inline things]
~~
It's very important to refrain from w/w-ing an INLINE function
If we do so by mistake we transform
f = __inline (\x -> E)
into
f = __inline (\x -> case x of (a,b) -> fw E)
fw = \ab -> (__inline (\x -> E)) (a,b)
and the original __inline now vanishes, so E is no longer
inside its __inline wrapper.  Death!  Disaster!
"""

So we might want to prevent w/wing UNROLL/PEEL for the same reasons..
but if we do the UNROLL/PEEL "pass" early enough (i.e. before
strictness - which is quite late in the pipeline) then this issue will
go away.

>> - loop breakers: PEEL/UNROLL have their own limits, creating
>>   an intrinsic loop breaker when the counters run out

> This might be easier to
> handle in your "unfolding as a separate core2core pass" scenario, where the
> pass might keep track of unfoldings already done (instead of trying to
> encode that information locally, in annotations).

I think that makes most sense. If we run it early enough we would be
reasonably sure our program was close to what the user intended and
hence could sidestep some of 

Re: Loop unrolling + fusion ?

2009-03-06 Thread Brandon S. Allbery KF8NH

On 2009 Mar 6, at 19:07, Claus Reinke wrote:
Loop breakers are still needed, in spite of the explicit limits.  
Consider


let {odd x = ..even{1,0}..; even x = ..odd{1,0}..} in odd{1,0} n


{-# INLINE odd even PEEL n #-} ?

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon universityKF8NH




PGP.sig
Description: This is a digitally signed message part
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-06 Thread Claus Reinke

My preferred spec would be roughly

{-# NOINLINE f #-}
   as now

{-# INLINE f #-} 
   works as now, which is for non-recursive f only (might in future

   be taken as go-ahead for analysis-based recursion unfolding)

{-# INLINE f PEEL n #-}
   inline calls *into* recursive f (called loop peeling for loops)

{-# INLINE f UNROLL m #-}
   inline recursive calls to f *inside* f (called loop unrolling for loops)

{-# INLINE f PEEL n UNROLL m #-}
   combine the previous two

   The numeric parameters are to be interpreted as if each call to
   f was annotated with both PEEL and UNROLL limits, to be
   decreased as appropriate for every PEEL or UNROLL action.


hmm, "appropriate" is one of those words that shouldn't occur in specs,
not even rough ones, so let's flesh this out a bit, by abstract example.

let f = ..f.. in f{n,m} -PEEL-> let f = ..f.. in ..f{n-1,m}..

let f = ..f{n,m}.. in .. -UNROLL-> let f = ..|..f{n,m-1}..|.. in ..

In words: the call being peeled/unrolled disappears, being replaced by
a copy of the definition, in which the decremented counts are applied to 
the calls of the same function created by unfolding. Is that specific enough?



   Peeling and unrolling stop when the respective count annotation
   has reached 0. Note that mutual recursion is the domain of PEEL,
   while UNROLL only applies to direct recursion.

   {-# INLINE f PEEL n #-}, for n>0, corresponds to worker/
   wrapper transforms (previously done manually) + inline wrapper,
   and should therefore also be taken as a hint for the compiler to 
   try the static argument transformation for f (the "worker").


   Non-supporting implementations should treat these as INLINE
   pragmas (same warning/ignore or automatic unfold behaviour).

Since we are talking about a refinement of the INLINE pragma, we
also need to look at that pragma's existing subtleties:-(

- no functions inlined into f: should be subject to override by
   INLINE pragmas (even for the non-recursive case?)
- no float-in/float-out/cse: ??
- no worker/wrapper transform in strictness analyser: we do get the 
   same effect from INLINE PEEL, so this should be okay, right?

- loop breakers: PEEL/UNROLL have their own limits, creating
   an intrinsic loop breaker when the counters run out


Loop breakers are still needed, in spite of the explicit limits. Consider

let {odd x = ..even{1,0}..; even x = ..odd{1,0}..} in odd{1,0} n

Peeling odd gives a call to even, peeling of which gives a fresh, not
decremented, call to odd! Unless one makes a copy of the whole
mutual recursion, with the odd calls adjusted. This might be easier 
to handle in your "unfolding as a separate core2core pass" scenario, 
where the pass might keep track of unfoldings already done (instead 
of trying to encode that information locally, in annotations).


Other issues?
Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-06 Thread Claus Reinke

The implementation I'm thinking of is basically trivial. You just add
the information gathered from the pragmas onto the Ids, then have a
dedicated core pass that looks at the pragmas and does it's
worker/wrapper thing. The technology to do peeling/unrolling is
trivial and there already examples in the codebase (in case liberation
and SAT). If someone can spec out what they actually want and GHC HQ
give it the thumbs up I would be happy to do the grunt work on
implementing this feature.


Yes, please!-)

My preferred spec would be roughly

{-# NOINLINE f #-}
   as now

{-# INLINE f #-} 
   works as now, which is for non-recursive f only (might in future

   be taken as go-ahead for analysis-based recursion unfolding)

{-# INLINE f PEEL n #-}
   inline calls *into* recursive f (called loop peeling for loops)

{-# INLINE f UNROLL m #-}
   inline recursive calls to f *inside* f (called loop unrolling for loops)

{-# INLINE f PEEL n UNROLL m #-}
   combine the previous two

   The numeric parameters are to be interpreted as if each call to
   f was annotated with both PEEL and UNROLL limits, to be
   decreased as appropriate for every PEEL or UNROLL action.
   Peeling and unrolling stop when the respective count annotation
   has reached 0. Note that mutual recursion is the domain of PEEL,
   while UNROLL only applies to direct recursion.

   {-# INLINE f PEEL n #-}, for n>0, corresponds to worker/
   wrapper transforms (previously done manually) + inline wrapper,
   and should therefore also be taken as a hint for the compiler to 
   try the static argument transformation for f (the "worker").


   Non-supporting implementations should treat these as INLINE
   pragmas (same warning/ignore or automatic unfold behaviour).

About the pragma name: as far as I can tell, Hugs simply ignores
INLINE pragmas, no matter what they say, other implementations
could just ignore the PEEL/UNROLL part (possibly with a warning)
- do any of them support INLINE on recursive definitions?

The only problem is that GHC itself fails with a parse error, which
would lead to version issues (perhaps GHC should have allowed
for additional information to otherwise syntactically complete pragmas,
or warnings instead of errors, but that hitch is out in the wild now).

Having separate PEEL/UNROLL pragmas would make ignoring
the default action, but would clutter the pragma name space as well
as the source code; it also wouldn't make explicit that we are indeed 
refining the INLINE pragma for the case of recursive functions (which 
GHC currently ignores or complains about), by detailing how we want 
the recursive definition to be inlined.


Since we are talking about a refinement of the INLINE pragma, we
also need to look at that pragma's existing subtleties:-(

- no functions inlined into f: should be subject to override by
   INLINE pragmas (even for the non-recursive case?)
- no float-in/float-out/cse: ??
- no worker/wrapper transform in strictness analyser: we do get the 
   same effect from INLINE PEEL, so this should be okay, right?

- loop breakers: PEEL/UNROLL have their own limits, creating
   an intrinsic loop breaker when the counters run out

Is that sufficient?
Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-06 Thread Claus Reinke

That was one of my questions in the optimization and rewrite rules
thread: shouldn't -fvia-C be supported (as a non-default option)
for at least as long as the alternative isn't a clear win in all cases?


The trouble with supporting multiple backends is that the cost in terms of 
testing and maintenance is high.  And the registerised -fvia-C backend is 
particularly nasty, coming as it does with thousands of lines of Perl 4 
that regularly get broken by new versions of gcc.


Yes, I can understand that you'd like to leave that part behind sometime
before yesterday:-) I assume that this very complexity means that the
-fvia-C route doesn't really get all the way to its most interesting
promises (easy portability, and full backend optimizations inherited
from gcc). And with that in mind, I can also understand that you don't 
want to put in any further work into trying to improve it, if that distracts 
from a better long-term solution. 


What I don't understand yet is the routemap for replacing -fvia-C. We've
seen -fvia-C being demoted from default to backup (fine by me), we've
seen a feature supported only by -fvia-C removed completely, instead 
of seeing support for it added to the -fasm route (macro-based APIs

used to work with ffi, would now require a wrapper generator, which
doesn't exist yet). 

Indications are that -fvia-C still tends to produce better code (even 
though it is not the best that ghc+gcc could produce) than -fasm (is that 
any better for the new backend?). And last, but not least, ghc has more 
limited resources than gcc, so how is ghc going to beat gcc at the 
portability and backend optimizations game while still making progress
in its core competencies (ie, higher-level improvements; there's also the 
interesting side-issue of how the two stages of optimizations are going to 
interact in ghc, if there is a barrier that can only be crossed in one direction)?


The registerised via-C backend should have been retired long ago.  It's 
time to take it round back and shoot it.  We should spend our efforts on 
finding a good long-term solution rather than patching this dead-end, IMHO.


No disagreement there (apart from the violent metaphor). I'm just worried 
about pragmatics, ie scuttling the ship before we've counted our life boats!-) 
And I suspect that for ghc trying to do everything itself on all platforms 
(rather than trying for very good -fasm on some platforms of interest, and 
good -fvia-C as a fallback everywhere else) is going to be anything but 
more work than patching that dead-end (though no doubt more interesting).


In other words, what is the plan wrt to backends, especially wrt 
recovering the optimizations and portability issues previously left to 
gcc? When will the fast via-C route be retired, what quality of

replacement will be in place at that time, how long to catch up
to where we are now, how to keep up, etc.?

Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-06 Thread Simon Marlow

Claus Reinke wrote:

its loop unroller on this guy haven't succeeded. -funroll-loops and
-funroll-all-loops doesn't  touch it,


That's because the C produced by GHC doesn't look like a loop to GCC.  
This can be fixed but given that we are moving away from -fvia-C  
anyway, it probably isn't worth doing.


That was one of my questions in the optimization and rewrite rules
thread: shouldn't -fvia-C be supported (as a non-default option)
for at least as long as the alternative isn't a clear win in all cases?

If there are small changes that could make GHC-generated code
more palatable to GCC's optimizer, wouldn't that be worth doing?
Once -fvia-C is allowed to bitrot to the level of unoptimized
bootstraps only, we might never get the good performance route
back, so why not keep it in good shape as long as it offers real benefits?


The trouble with supporting multiple backends is that the cost in terms of 
testing and maintenance is high.  And the registerised -fvia-C backend is 
particularly nasty, coming as it does with thousands of lines of Perl 4 
that regularly get broken by new versions of gcc.


The registerised via-C backend should have been retired long ago.  It's 
time to take it round back and shoot it.  We should spend our efforts on 
finding a good long-term solution rather than patching this dead-end, IMHO.


Cheers,
Simon

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-02 Thread Tyson Whitehead
On March 1, 2009 17:31:13 Max Bolingbroke wrote:
> I am no assembly guru and haven't seen that last form of leaq either,
> but I'm going to guess that:
>
>  leaq(%rsi,%rsi,4), %rax
>
> Says that rax is rsi * ((1 + 1) * 2 ^ 4) = rsi * 32
>
>  leaq0(,%rax,8), %rsi

If I recall correctly, leaq is load effective address (i.e., write the address 
into the destination register instead of the data at the address).

The address form is i(b,o,s) = i+b+o*s.  You have (%rsi,%rsi,4) = %rsi+%rsi*4 
into %rax followed by 0(,%rax,8) = rax*8 into %rsi, ultimately giving %rsi*40 
into %rsi (which is the multiplication you have in the ghc generated loop).

(the restrictions on the address form is that s must be one of 1, 2, 4, or 8)

Interesting discussion by the way.  : )

Cheers!  -Tyson



signature.asc
Description: This is a digitally signed message part.
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-01 Thread Max Bolingbroke
2009/3/1 Claus Reinke :
> What is the issue with concatMap?

ConcatMap doesn't usually fuse under stream fusion - see
http://www.cse.unsw.edu.au/~dons/papers/stream-fusion.pdf for the gory
details.

> It sounds like your specialization
> is based on the recursion equivalent to loop peeling (unrolling the
> "non-recursive" calls, the entry points into the recursion), a close variant
> of loop unrolling (unrolling the recursive calls, inside the loop
> body).

This sounds right - to get concatMap specialised I "unpeel" the
unstream loop (which has been slightly modified) 4 iterations, after
which unstream recurses back into itself in a tight loop. This lets
GHC specialise the first 3 iterations however it likes. This is
achieved by the spec4 combinator in the code I posted.

> If followed by the static argument transformation, that might cover
> the majority of hand-written worker-wrapper pairs (replacing manual by
> compiler optimization is always nice).

Right. Since GHC is so blind to recursion at the moment this could be
a substantial win (though my gut tells me that SAT alone is a large
part of the win here).

> So, instead of splitting recursive 'f' into 'fWorker' and 'fWrapper', with
> an INLINE pragma for 'fWrapper', one might in future be able just to say
> '{-# INLINE f PEEL 1 UNROLL 0 #-}' or, if unrolling
> is also desirable '{-# INLINE f PEEL 1 UNROLL 8 #-}'? And GHC would do the
> work, by unfolding the entry points once (the
> inlining of the wrapper), unfolding the recursive calls 8 times (the
> loop unrolling), and taking the INLINE PEEL pragma also as a hint to try the
> static argument transformation.

Right, and INLINE PEEL might be a nice interface for the user. Of
course, we'd probably want an automated system for working out when
this is a good idea as well - in the same way that we have INLINE
pragmas and a load of inlining heuristics.

> It seems that strength reduction could be seen as loop restructuring
> in the small: a multiplication, if seen as a repeated addition, can be
> unrolled, or the implicit adding loop can be fused with the explicit loop in
> which multiplication is called on (eg figure 7 in the ACM survey paper I
> mentioned).

I hadn't thought about it in quite those terms before - cute :-)

> That way, no separate framework
> would be needed for strength reduction. Btw, is that also what happened in
> the -fvia-C path of Don's initial example in this thread (I don't know how
> to read those leaqs, but the imulq is gone)?

I am no assembly guru and haven't seen that last form of leaq either,
but I'm going to guess that:

 leaq(%rsi,%rsi,4), %rax

Says that rax is rsi * ((1 + 1) * 2 ^ 4) = rsi * 32

 leaq0(,%rax,8), %rsi

And that this finishes it off by adding the final 8* to the mix. So it
makes the multiplication easier by breaking it into two
multiplications by powers of two. Smart, but you don't need any loop
unrolling tech to do it.

> But all these follow-on optimizations enabled by unfolding recursive
> definitions seem to require further thought and design, whereas
> user-controlled recursion unfolding (both peel and unroll) seems
> to offer immediate benefits. Is that part of your current work?

I hadn't actually considered a mechanism user-controlled peel/unroll
at all! I was totally focused on automatic transformations :-)

> Do you forsee any problems with the implementation, or with
> the API I suggested above (adding PEEL and UNROLL options
> to INLINE pragmas, to make them effective on recursive
> definitions as well)?

The implementation I'm thinking of is basically trivial. You just add
the information gathered from the pragmas onto the Ids, then have a
dedicated core pass that looks at the pragmas and does it's
worker/wrapper thing. The technology to do peeling/unrolling is
trivial and there already examples in the codebase (in case liberation
and SAT). If someone can spec out what they actually want and GHC HQ
give it the thumbs up I would be happy to do the grunt work on
implementing this feature.

I'm not so sure about the user interface - for the purposes of
compatibility with other compiler's notion of INLINE perhaps a
dedicated PEEL / UNROLL pragma is a good idea. This would be less
painful if we had a positional  notation for pragmas - which has been
mooted in the past wrt. the annotation system for compiler plugins
(which IS in HEAD). AFAIK the only reason we don't have this is that
we haven't had a discussion about how it should look. See "Future
work" on the page http://hackage.haskell.org/trac/ghc/wiki/Annotations

Incidentally, Simon PJ has just made GHC warn about INLINE pragmas on
recursive things (not something I totally sure is a good idea, since
the compiler can make things non-recursive behind your back) but which
you can justify by saying that /normally/ GHC won't INLINE recursive
things, so it's misleading to have INLINE pragmas on them accepted.
This can be taken as an argument against adding PEEL / UNROLL

Re: Loop unrolling + fusion ?

2009-03-01 Thread Claus Reinke

Yes - this is why my use of a kind of unrolling fixes concatMap for
streams, because GHC is able to specialise the "unrolled" function
body on a particular lambda abstraction. However, this is really a
somewhat seperate issue than plain unrolling, as we just want to be
able to /specialise/ recursive functions on particular arguments
rather than reduce loop overhead / reassociate arithmetic over several
iterations.


What is the issue with concatMap? It sounds like your specialization
is based on the recursion equivalent to loop peeling (unrolling the 
"non-recursive" calls, the entry points into the recursion), a close 
variant of loop unrolling (unrolling the recursive calls, inside the loop
body). If followed by the static argument transformation, that might 
cover the majority of hand-written worker-wrapper pairs (replacing 
manual by compiler optimization is always nice). 

So, instead of splitting recursive 'f' into 'fWorker' and 'fWrapper', 
with an INLINE pragma for 'fWrapper', one might in future be able 
just to say '{-# INLINE f PEEL 1 UNROLL 0 #-}' or, if unrolling
is also desirable '{-# INLINE f PEEL 1 UNROLL 8 #-}'? And 
GHC would do the work, by unfolding the entry points once (the

inlining of the wrapper), unfolding the recursive calls 8 times (the
loop unrolling), and taking the INLINE PEEL pragma also as a 
hint to try the static argument transformation.



This is why the static argument transformation is such a big win (as
I've mentioned before, 12% decrease in nofib runtime if you use it) -
because it finds instances of recursive definitions where it's a
REALLY GOOD idea to specialise on a particular argument (since that
argument is actually /invariant/) and gives GHC the opportunity to
specialise on it by creating a nonrecursive wrapper around the
recursive worker loop.

In general, the compiler wants to be able to determine the structure
of the argument of a loop body in a more fine grained way than just
"invariant vs non-invariant" as SAT does. A particularly tempting
example of an optimisation you could do if we dealt with recursive
functions better is strength reduction. This is part of what I'm
looking at implementing for GHC currently.


It seems that strength reduction could be seen as loop restructuring
in the small: a multiplication, if seen as a repeated addition, can be
unrolled, or the implicit adding loop can be fused with the explicit 
loop in which multiplication is called on (eg figure 7 in the ACM 
survey paper I mentioned). That way, no separate framework
would be needed for strength reduction. Btw, is that also what 
happened in the -fvia-C path of Don's initial example in this 
thread (I don't know how to read those leaqs, but the imulq is gone)?


But all these follow-on optimizations enabled by unfolding recursive
definitions seem to require further thought and design, whereas
user-controlled recursion unfolding (both peel and unroll) seems
to offer immediate benefits. Is that part of your current work?
Do you forsee any problems with the implementation, or with
the API I suggested above (adding PEEL and UNROLL options
to INLINE pragmas, to make them effective on recursive
definitions as well)?

Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-01 Thread Max Bolingbroke
2009/3/1 Claus Reinke :
> It might be useful to point out that the interaction goes both ways.
> Not only are fused loops candidates for unrolling, but unrolling
> can also enable fusion, giving one example of why Core-level
> unrolling (in addition to backend-level loop restructuring) would
> be useful.

Yes - this is why my use of a kind of unrolling fixes concatMap for
streams, because GHC is able to specialise the "unrolled" function
body on a particular lambda abstraction. However, this is really a
somewhat seperate issue than plain unrolling, as we just want to be
able to /specialise/ recursive functions on particular arguments
rather than reduce loop overhead / reassociate arithmetic over several
iterations.

This is why the static argument transformation is such a big win (as
I've mentioned before, 12% decrease in nofib runtime if you use it) -
because it finds instances of recursive definitions where it's a
REALLY GOOD idea to specialise on a particular argument (since that
argument is actually /invariant/) and gives GHC the opportunity to
specialise on it by creating a nonrecursive wrapper around the
recursive worker loop.

In general, the compiler wants to be able to determine the structure
of the argument of a loop body in a more fine grained way than just
"invariant vs non-invariant" as SAT does. A particularly tempting
example of an optimisation you could do if we dealt with recursive
functions better is strength reduction. This is part of what I'm
looking at implementing for GHC currently.

Cheers,
Max
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-01 Thread Claus Reinke

So now, since we've gone to such effort to produce a tiny loop like, this,
can't we unroll it just a little? 
it is worth unrolling this guy, so we get the win of both aggressive high level

fusion, and aggressive low level loop optimisations?


It might be useful to point out that the interaction goes both ways.
Not only are fused loops candidates for unrolling, but unrolling
can also enable fusion, giving one example of why Core-level
unrolling (in addition to backend-level loop restructuring) would
be useful. Consider this silly example (with Apply as before, in
the rewrite rules thread, just syntactically unrolling the loop, and
loop as before, but generalised to arbitrary accumulators, see below):


{-# LANGUAGE MagicHash #-}
{-# LANGUAGE CPP #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE BangPatterns #-}

import Data.Array.Vector
import Data.Bits
import Apply
import GHC.Prim
import GHC.Base

main = print $ loop 1 1000 body (toU [1,2,3,4,5::Int])

body i arr = mapU (42+) arr


Here, the refusal to partially unfold recursive definitions means
there are no opportunities for fusion, whereas unrolling enables
fusion (which wouldn't work if unrolling was done only in the
backend, after fusion).


{-# INLINE loop #-}
loop :: Int -> Int -> (Int -> acc -> acc) -> acc -> acc
loop i max body acc = loopW i acc
 where
#ifdef N
 loopW !i !acc | i+N<=max  = loopW (i+N) ($(apply (0::Int) N) (\j acc->body 
(i+j) acc) acc)
#endif
 loopW !i !acc | i<=max= loopW (i+1) (body i acc)
   | otherwise = acc


Compare the versions without and with unrolling, not just for
time, but for allocation (+RTS -s).

As usual, we'd like to reassociate the sums to enable constant
folding, but this rule

{-# RULES
-- "reassoc" forall a# b# c. ((I# a#) +# ((I# b#) +# c)) = ((I# a#) +# (I# b#)) 
+# c
 #-}

is rejected.

Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-03-01 Thread Claus Reinke

its loop unroller on this guy haven't succeeded. -funroll-loops and
-funroll-all-loops doesn't  touch it,


That's because the C produced by GHC doesn't look like a loop to GCC.  This can be fixed but given 
that we are moving away from -fvia-C  anyway, it probably isn't worth doing.


That was one of my questions in the optimization and rewrite rules
thread: shouldn't -fvia-C be supported (as a non-default option)
for at least as long as the alternative isn't a clear win in all cases?

If there are small changes that could make GHC-generated code
more palatable to GCC's optimizer, wouldn't that be worth doing?
Once -fvia-C is allowed to bitrot to the level of unoptimized
bootstraps only, we might never get the good performance route
back, so why not keep it in good shape as long as it offers real benefits?

The problem with low-level loop optimisations is that in general, they  should be done at a low 
level. Core is much too early for this. To  find out whether and how much to unroll a particular 
loop, you must  take things like register pressure and instruction scheduling into  account. IMO, 
the backend is the only reasonable place to do these  optimisations.


[1] is one example reference for this argument (late unrolling). And
since most compiler textbooks are oddly quiet about optimizations
and their interactions, the survey [2] might also be helpful (and [3]
has some pointers to more recent work).

However, I'd like to note that Core is rather different from
conventional language source-level code, so I would expect
benefits from source-level "unrolling", too: Core retains much
more of the high-level semantics, so both identifying loops and
applying library-specific optimizations after unrolling are much
easier here than at the basic block level in the backend.

After all, it is just the normal unfolding/inlining that forms the
starting point for so many of GHC's optimizations, which just
happens to be blind to recursive definitions at the moment.
Recursive definitions are quite widely used in Haskell code,
so this blindspot can't be good for the overall effectiveness
of GHC's optimizer. If one could mark recursive bindings
with a counter, to limit unfoldings according to a compiler
option, generalised loop unrolling would just be a consequence
of what GHC does anyway, right?

That doesn't change the point that, at the lower level, loop
unrolling interacts strongly with the target architecture, and
that some relevant information is not available at Core level.

But it might be better to do both Core-level unfolding (to
enable further Core2Core optimizations, independent of
platform, that might no longer be visible at backend level)
and backend-level unfolding and re-folding (to massage the
low-level flow graph into a shape suited for the target
architecture, about which the Core level has no information).

One might also expect that Core-level transformations
are affected by compiler flags which users select according
to their target architecture (common practice at the moment),
so Core2Core isn't entirely out of that loop, either;-)

It is worth noting that there is a third level of optimizations,
even after the backend, in modern hardware, as these notes
[4] for an Intel compiler's unrolling option document. And
since I'm collecting references, there's also Ian's TH [5] for
doing the unrolling even before Core.

Claus

[1] An Aggressive Approach to Loop Unrolling, 1995
   http://citeseer.ist.psu.edu/old/620489.html
[2] Compiler Transformations for High-Performance Computing,
   ACM Computing Surveys, 1994
   http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.4885
[3] http://en.wikipedia.org/wiki/Loop_transformation
[4] 
http://www.intel.com/software/products/compilers/flin/docs/main_for/mergedprojects/optaps_for/common/optaps_hlo_unrl.htm

[5] Unrolling and simplifying expressions with Template Haskell,
   Ian Lynagh, 2003
   http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.9813

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-02-28 Thread Roman Leshchinskiy

On 01/03/2009, at 04:49, Don Stewart wrote:

So now, since we've gone to such effort to produce a tiny loop like,  
this,
can't we unroll it just a little? Sadly, my attempts to get GCC to  
trigger

its loop unroller on this guy haven't succeeded. -funroll-loops and
-funroll-all-loops doesn't  touch it,


That's because the C produced by GHC doesn't look like a loop to GCC.  
This can be fixed but given that we are moving away from -fvia-C  
anyway, it probably isn't worth doing.


Anyone think of a way to apply Claus' TH unroller, or somehow  
convince GCC
it is worth unrolling this guy, so we get the win of both aggressive  
high level

fusion, and aggressive low level loop optimisations?


The problem with low-level loop optimisations is that in general, they  
should be done at a low level. Core is much too early for this. To  
find out whether and how much to unroll a particular loop, you must  
take things like register pressure and instruction scheduling into  
account. IMO, the backend is the only reasonable place to do these  
optimisations. Using an exisiting backend like LLVM would really help  
here.


Roman


___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-02-28 Thread Max Bolingbroke
2009/2/28 Don Stewart :
> So now, since we've gone to such effort to produce a tiny loop like, this,
> can't we unroll it just a little? Sadly, my attempts to get GCC to trigger
> its loop unroller on this guy haven't succeeded. -funroll-loops and
> -funroll-all-loops doesn't  touch it,
>
> Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC
> it is worth unrolling this guy, so we get the win of both aggressive high 
> level
> fusion, and aggressive low level loop optimisations?

For a couple of weeks, I have had a working solution for the concatMap
problem using a sort of loop unrolling. I have tweaked the approach
slightly to also unroll the worker loop to get the results you desire.

You can check out the (very rough) code with:
git clone http://www.cl.cam.ac.uk/~mb566/git/concatmap/.git/
$EDITOR concatmap/CallUnrollConcatMap.hs

Apologies if the code is somewhat cryptic, but you should be able to
get the general idea.

A sneak preview is in order. The following Core:

"""
Rec {
$wf1_s1bU [ALWAYS LoopBreaker Nothing] :: GHC.Prim.Int#
  -> GHC.Prim.Int#
[Arity 1
 Str: DmdType L]
$wf1_s1bU =
  \ (ww_s1bO :: GHC.Prim.Int#) ->
case GHC.Prim.<=# ww_s1bO 1
of wild_B1 [ALWAYS Dead Just A] {
  GHC.Bool.False -> 0;
  GHC.Bool.True ->
let {
  x_XMS [ALWAYS Just L] :: GHC.Prim.Int#
  [Str: DmdType]
  x_XMS = GHC.Prim.+# ww_s1bO 1 } in
case GHC.Prim.<=# x_XMS 1 of wild_Xx [ALWAYS Dead Just A] {
  GHC.Bool.False ->
GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2;
  GHC.Bool.True ->
let {
  x_XMX [ALWAYS Just L] :: GHC.Prim.Int#
  [Str: DmdType]
  x_XMX = GHC.Prim.+# x_XMS 1 } in
case GHC.Prim.<=# x_XMX 1 of wild_XE [ALWAYS Dead Just A] {
  GHC.Bool.False ->
GHC.Prim.+#
  (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2)
  (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2);
  GHC.Bool.True ->
let {
  x_XOf [ALWAYS Just L] :: GHC.Prim.Int#
  [Str: DmdType]
  x_XOf = GHC.Prim.+# x_XMX 1 } in
case GHC.Prim.<=# x_XOf 1 of wild_XM [ALWAYS
Dead Just A] {
  GHC.Bool.False ->
GHC.Prim.+#
  (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2)
  (GHC.Prim.+#
 (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2)
 (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMX 2) 2));
  GHC.Bool.True ->
case $wf1_s1bU (GHC.Prim.+# x_XOf 1) of ww_s1bS {
__DEFAULT ->
GHC.Prim.+#
  (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# ww_s1bO 2) 2)
  (GHC.Prim.+#
 (GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMS 2) 2)
 (GHC.Prim.+#
(GHC.Prim.*# (GHC.Prim.uncheckedIShiftL# x_XMX 2) 2)
(GHC.Prim.+#
   (GHC.Prim.*#
(GHC.Prim.uncheckedIShiftL# x_XOf 2) 2) ww_s1bS)))
}
}
}
}
}
end Rec }
"""

Is generated by this program:

"""
result = sumS . mapS (*2) . mapS (`shiftL` 2) $ enumFromToS 0 1
"""

Of course, my approach is far from perfect:

* Unrolling ALWAYS happens, and to a fixed depth
* RULEs aren't very good at exploiting properties of arithmetic, as
Claus has pointed out
* concatMap fuses with my library but has lingering issues with
allocation if join points don't get inlined and has some strictness
problems too (to see this in action, try compliing the program "sumS $
mapS (+10) $ concatMapS (\x -> enumFromToS x 20) $ enumFromToS 1 10"
from the same file). It also is only permitted up to a fixed depth as
defined by the level of unrolling specified in the "spec" combinator.

But it does get your unrolling with TODAYs GHC, transparently to the
user of the uvector library.

I am currently looking at other, smarter, ways that GHC can optimize
loops as part of my research - so with luck this sort of manual
unrolling hackery will become less relevant in the future.

All the best,
Max
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Loop unrolling + fusion ?

2009-02-28 Thread Claus Reinke

   import Data.Array.Vector
   import Data.Bits
   main = print . productU . mapU (*2) . mapU (`shiftL` 2) $ replicateU (1 :: Int) 
(5::Int)


and turns it into a loop like this:

   $wfold :: Int# -> Int# -> Int#
   $wfold =
 \ (ww_sWX :: Int#) (ww1_sX1 :: Int#) ->
   case ww1_sX1 of wild_B1 {
 __DEFAULT ->
   $wfold (*# ww_sWX 40) (+# wild_B1 1);
 1 -> ww_sWX
   }

..

So now, since we've gone to such effort to produce a tiny loop like, this,
can't we unroll it just a little?
Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC
it is worth unrolling this guy, so we get the win of both aggressive high level
fusion, and aggressive low level loop optimisations?


I'm not sure this is what you're after (been too long since I read assembler;-),
but it sounds as if you wanted to unroll the source of that fold, which seems
to be a local definition in foldS? Since unrolling is not always a good idea, it
would also be nice to have a way to control/initiate it from outside of the
uvector package (perhaps a RULE to redirect the call from foldS to a
foldSN, but foldS is hidden, and gets inlined away; but something
like that). If that works, you'd then run into the issue of wanting to
rearrange the *# and *# by variable and constant.

Claus

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users