Re: Availability of Coercible Instances at TH Runtime

2018-05-06 Thread David Kraeutmann
Tangentially related, but wasn't there a plan at some point to integrate 
TH more tightly with GHC?


On 5/7/2018 4:39 AM, Richard Eisenberg wrote:


No, it doesn't, but the Q monad is actually a wrapper around TcM, the 
type-checker monad. I don't think it would be all that hard to do 
this. Could be a suitable project for someone new to GHC hacking...



Re: Availability of Coercible Instances at TH Runtime

2018-05-06 Thread Richard Eisenberg


> On May 4, 2018, at 10:01 PM, Travis Whitaker  wrote:
> 
> I'm curious what it'd take to add a qReifyCoercible method; does the renamer 
> know anything about Coercible?

No, it doesn't, but the Q monad is actually a wrapper around TcM, the 
type-checker monad. I don't think it would be all that hard to do this. Could 
be a suitable project for someone new to GHC hacking...
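
A minimal sketch of what such a method could look like (hypothetical names; nothing below exists in template-haskell today), mirroring how the real Quasi class and Q newtype are shaped:

{-# LANGUAGE RankNTypes #-}
-- Hypothetical sketch only: qReifyCoercible does not exist in
-- template-haskell.  QuasiSketch/QSketch mirror the shape of the real
-- Quasi class and Q newtype; the real work would happen in GHC's TcM
-- instance, which can ask the constraint solver whether Coercible holds.
module ReifyCoercibleSketch where

import Language.Haskell.TH (Type)

class Monad m => QuasiSketch m where
  -- | Does Coercible t1 t2 hold at the splice site?
  qReifyCoercible :: Type -> Type -> m Bool

newtype QSketch a = QSketch (forall m. QuasiSketch m => m a)

reifyCoercible :: Type -> Type -> QSketch Bool
reifyCoercible t1 t2 = QSketch (qReifyCoercible t1 t2)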

Richard


Re: Basic Block Layout in the NCG

2018-05-06 Thread Andreas Klebinger

On branch probability:

I actually created a patch in the recent past that adds more
probabilities, including probabilities on CmmSwitches, although it made
little difference when only error branches were tagged.

It also partially ran into issues with code layout, which is why I have
put it on ice for now.


The full patch is here:
https://phabricator.haskell.org/D4327

I think this really has to be done back to front, as GHC currently throws
away all likelihood information before we get to block layout, which
makes it very hard to take advantage of this information.

Code Layout:

Those are exactly the kind of pointers I was looking for!
I do wonder how well some of them have aged; for example, [5] and [6]
consider at most a two-way associative cache.

But as you said, they should at least be a good starting point.

I will put your links into the ticket so they are easy to find once I
(or someone else!) have time to look deeper into this.

Cheers
Andreas




Re: Basic Block Layout in the NCG

2018-05-06 Thread Kavon Farvardin
> Does anyone have good hints for literature on basic block layout
> algorithms?

Here are some thoughts:

* Branch Probability *

Any good code layout algorithm should take branch probabilities into
account.  From what I've seen, we already have a few likely-branch
heuristics baked into the generation/transformation of Cmm, though
perhaps it's worth doing more to add probabilities as in [1,3].  The
richer type information in STG could come in handy.

I think the first step in leveraging branch probability information is
to use a Float to represent the magnitude of likeliness instead of a
Bool.
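
As a rough illustration of that representation change (a self-contained
sketch; the type and field names below are made up and do not match
GHC's actual Cmm data types):

-- Placeholders standing in for GHC's real expression and block id types.
type CmmExprSketch = ()
type BlockIdSketch = Int

-- Today a conditional branch carries at most a Bool hint:
-- Just True / Just False ("true branch likely/unlikely") or unknown.
data CondBranchBool = CondBranchBool
  { cbPred   :: CmmExprSketch
  , cbTrue   :: BlockIdSketch
  , cbFalse  :: BlockIdSketch
  , cbLikely :: Maybe Bool
  }

-- With a Float, later passes can rank edges by weight instead of only
-- flagging them:
data CondBranchProb = CondBranchProb
  { cpPred     :: CmmExprSketch
  , cpTrue     :: BlockIdSketch
  , cpFalse    :: BlockIdSketch
  , cpTrueProb :: Maybe Float   -- probability of taking the true branch
  }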

Target probabilities on CmmSwitches could also help create smarter
SwitchPlans. Slides 20-21 in [2] demonstrate a lower-cost decision tree
based on these probabilities.
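
For instance (a sketch only; GHC's real SwitchPlan type is more
elaborate), knowing that one alternative dominates lets the plan test it
with a single comparison before falling back to a search tree over the
rest:

data PlanSketch label
  = Unconditionally label
  | IfEqual Integer label (PlanSketch label)           -- peel off a hot case
  | IfBelow Integer (PlanSketch label) (PlanSketch label)

-- If tag 2 is taken ~90% of the time for a switch over 0..3:
hotCaseFirst :: PlanSketch String
hotCaseFirst =
  IfEqual 2 "alt2"
    (IfBelow 2 (IfEqual 0 "alt0" (Unconditionally "alt1"))
               (Unconditionally "alt3"))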


* Code Layout *

The best all-in-one source for static code positioning I've seen is in
[5], and might be a good starting point for exploring that space.  More
importantly, [5] talks about function positioning, which is something I
think we're missing.  A more sophisticated extension to [5]'s function
positioning can be found in [6].

Keeping in mind that LLVM is tuned to optimize loops within functions,
at a high level LLVM does the following [4]:

The algorithm works from the inner-most loop within a 
function outward, and at each stage walks through the
basic blocks, trying to coalesce them into sequential
chains where allowed by the CFG (or demanded by heavy
probabilities). Finally, it walks the blocks in 
topological order, and the first time it reaches a 
chain of basic blocks, it schedules them in the 
function in-order.
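
A minimal, self-contained sketch of that chain-based approach
(placeholder types; nothing here uses GHC's actual Cmm or block
representations):

import Data.List (find, sortBy)
import Data.Ord  (Down (..), comparing)

type Block  = Int
type Weight = Double

-- Coalesce blocks into chains along the heaviest edges, then emit the
-- chains in the order in which their first block appears in the given
-- (e.g. topological) ordering.
layout :: [Block] -> [((Block, Block), Weight)] -> [Block]
layout topo edges = concat [ c | b <- topo, c <- chains, head c == b ]
  where
    chains = foldl tryMerge [ [b] | b <- topo ]
                   (sortBy (comparing (Down . snd)) edges)

    -- Merge two chains when the edge runs from the end of one chain to
    -- the start of another; otherwise leave the chains untouched.
    tryMerge cs ((from, to), _) =
      case (find ((== from) . last) cs, find ((== to) . head) cs) of
        (Just c1, Just c2) | c1 /= c2 ->
          (c1 ++ c2) : filter (`notElem` [c1, c2]) cs
        _ -> cs

For example, layout [0,1,2,3] [((0,2),0.9),((2,3),0.8),((0,1),0.1)]
gives [0,2,3,1]: the hot path 0->2->3 stays contiguous even though block
1 is the textual fall-through successor of block 0.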

There are also plenty of heuristics, such as "tail duplication", to deal
with diamonds and other odd cases in the CFG that are harder to lay out.
Unfortunately, there don't seem to be any sources cited.  We may want
to develop our own heuristics to modify the CFG for better layout as
well.



[1] Thomas Ball, James R. Larus. Branch Prediction for Free.
(https://doi.org/10.1145/173262.155119)

[2] Hans Wennborg. The recent switch lowering improvements.
(http://llvm.org/devmtg/2015-10/slides/Wennborg-SwitchLowering.pdf)
See also: https://www.youtube.com/watch?v=gMqSinyL8uk

[3] James E. Smith. A study of branch prediction strategies.
(https://dl.acm.org/citation.cfm?id=801871)

[4] http://llvm.org/doxygen/MachineBlockPlacement_8cpp_source.html

[5] Karl Pettis, Robert C. Hansen. Profile guided code positioning.
(https://doi.org/10.1145/93542.93550)

[6] Hashemi et al. Efficient procedure mapping using cache line
coloring. (https://doi.org/10.1145/258915.258931)


~kavon


On Sat, 2018-05-05 at 21:23 +0200, Andreas Klebinger wrote:
> Does anyone have good hints for literature on basic block layout
> algorithms?
> I've run into a few examples where the current algorithm falls apart 
> while working on Cmm.
> 
> There is a trac ticket https://ghc.haskell.org/trac/ghc/ticket/15124#ticket
> where I tracked some of the issues I ran into.
> 
> As it stands, some Cmm optimizations are far outweighed by the
> accidental changes they cause in the layout of basic blocks.
> 
> The main problem seems to be that the current codegen only considers
> the last jump in a basic block as relevant for code layout.
> 
> This works well for linear chains of control flow but behaves badly
> and somewhat unpredictably when dealing with branch-heavy code where
> blocks have more than one successor or calls.
> 
> In particular if we have a loop
> 
> A jmp B call C call D
> 
> which we enter at block B from block E,
> we would like something like:
> 
> E,B,C,D,A
> 
> which means that with some luck C/D might still be in cache when we
> return from the call.
> 
> However we can currently get:
> 
> E,B,A,X,D,X,C
> 
> where the X are other, unrelated blocks. This happens since call edges
> are invisible to the layout algorithm.
> It even happens when we have (conditional) jumps from B to C and from C
> to D, since these are invisible as well!
> 
> I came across cases where inverting conditions led to big performance
> losses since suddenly the block layout got all messed up (~4% slowdown
> for the worst offenders).
> 
> So I'm looking for solutions there.
> 



Re: Basic Block Layout in the NCG

2018-05-06 Thread Kavon Farvardin
> 4% is far from being "big", look e.g. at
> https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
> where changing just the alignment of the code led to a 10% difference.
> :-/ The code itself or its layout wasn't changed at all. The "Producing
> Wrong Data Without Doing Anything Obviously Wrong!" paper gives more
> funny examples.

This particular instance of performance difference due to aligning
blocks/functions is not very surprising (it is mentioned in section
3.4.1.5 of the Intel Optimization Manual). I also think 4% on large,
non-trivial programs is "bigger" than 10% on the tiny example in that
blog post (especially since the alignment issues disappear above 1024
iterations of the loop).

~kavon

On Sun, 2018-05-06 at 14:42 +0200, Sven Panne wrote:
> 2018-05-05 21:23 GMT+02:00 Andreas Klebinger:
> > [...] I came across cases where inverting conditions lead to big
> > performance losses since suddenly block layout
> > got all messed up. (~4% slowdown for the worst offenders). [...]
> 
>  
> 4% is far from being "big", look e.g. at
> https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
> where changing just the alignment of the code led to a 10% difference. :-/ The code itself
> or its layout wasn't changed at all. The "Producing Wrong Data
> Without Doing Anything Obviously Wrong!" paper gives more funny
> examples.
> 
> I'm not saying that code layout has no impact, quite the opposite.
> The main point is: Do we really have a benchmarking machinery in
> place which can tell you if you've improved the real run time or made
> it worse? I doubt that, at least at the scale of a few percent. To
> reach just that simple yes/no conclusion, you would need quite a
> heavy machinery involving randomized linking order, varying
> environments (in the sense of "number and contents of environment
> variables"), various CPU models etc. If you do not do that, modern HW
> will leave you with a lot of "WTF?!" moments and wrong conclusions.
> 



Re: Re: potential for GHC benchmarks w.r.t. optimisations being incorrect

2018-05-06 Thread Joachim Breitner
Hi,

On Sunday, 2018-05-06, at 16:41 +0200, Andreas Klebinger wrote:
> With a high number of NoFibRuns (30+) , disabling frequency scaling,
> stopping background tasks and walking away from the computer
> till it was done I got noise down to differences of about +/-0.2% for
> subsequent runs.
> 
> This doesn't eliminate alignment bias and the like but at least it
> gives fairly reproducible results.

That’s true, but it leaves alignment bias. This bit me in my work on
Call Arity, as I write in my thesis:

   Initially, I attempted to use the actual run time measurements, but it
   turned out to be a mostly pointless endeavour. For example the knights
   benchmark would become 9% slower when enabling Call Arity (i.e. when
   comparing (A) to (B)), a completely unexpected result, given that the
   changes to the GHC Core code were reasonable. Further investigation
   using performance data obtained from the CPU indicated that with the
   changed code, the CPU’s instruction decoder was idling for more cycles,
   hinting at cache effects and/or bad program layout.
   Indeed: When I compiled the code with the compiler flag -g, which
   includes debugging information in the resulting binary, but should otherwise
   not affect the relative performance characteristics much, the unexpected
   difference vanished. I conclude that non-local changes to the
   Haskell or Core code will change the layout of the generated program
   code in unpredictable ways and render such run time measurements
   mostly meaningless.

   This conclusion has been drawn before [MDHS09], and recently, tools
   to mitigate this effect, e.g. by randomising the code layout [CB13], were
   created. Unfortunately, these currently target specific C compilers, so I
   could not use them here.

   In the following measurements, I avoid this problem by not measuring
   program execution time, but simply by counting the number of instructions 
performed.
   This way, the variability in execution time due to code
   layout does not affect the results. To obtain the instruction counts I employ
   valgrind [NS07], which runs the benchmarks on a virtual CPU and
   thus produces more reliable and reproducible measurements.

Unpleasant experience.

Cheers,
Joachim
-- 
Joachim Breitner
  m...@joachim-breitner.de
  http://www.joachim-breitner.de/




Re: Re: potential for GHC benchmarks w.r.t. optimisations being incorrect

2018-05-06 Thread Andreas Klebinger

Joachim Breitner wrote:

This runs on a dedicated physical machine, and still the run-time
numbers were varying too widely and gave us many false warnings (and
probably reported many false improvements which we of course were happy
to believe). I have since switched to measuring only dynamic
instruction counts with valgrind. This means that we cannot detect
improvement or regressions due to certain low-level stuff, but we gain
the ability to reliably measure *something* that we expect to change
when we improve (or accidentally worsen) the high-level
transformations.
While this matches my experience with the default settings, I had good
results by tuning the number of measurements nofib does.
With a high number of NoFibRuns (30+), disabling frequency scaling,
stopping background tasks and walking away from the computer
till it was done, I got the noise down to differences of about +/-0.2%
between subsequent runs.


This doesn't eliminate alignment bias and the like but at least it gives 
fairly reproducible results.


Sven Panne wrote:
4% is far from being "big", look e.g. at
https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
where changing just the alignment of the code led to a 10%
difference. :-/ The code itself or its layout wasn't changed at all.
The "Producing Wrong Data Without Doing Anything Obviously Wrong!"
paper gives more funny examples.


I'm not saying that code layout has no impact, quite the opposite. The 
main point is: Do we really have a benchmarking machinery in place 
which can tell you if you've improved the real run time or made it 
worse? I doubt that, at least at the scale of a few percent. To reach 
just that simple yes/no conclusion, you would need quite a heavy 
machinery involving randomized linking order, varying environments (in 
the sense of "number and contents of environment variables"), various 
CPU models etc. If you do not do that, modern HW will leave you with a 
lot of "WTF?!" moments and wrong conclusions.
You raise good points. While the example in the blog seems a bit
contrived, with the whole loop fitting into a cache line, the principle
is a real concern.
I've hit alignment issues and WTF moments plenty of times in the past
when looking at micro benchmarks.


However, on the scale of nofib I haven't really seen this happen so far.
It's good to be aware of the chance for a whole suite to give wrong
results though.
I wonder if this effect is limited by GHC's tendency to use 8-byte
alignment for all code (at least with tables next to code)?
If we only consider 16-byte (DSB buffer) and 32-byte (cache line)
boundaries relevant, an 8-byte-aligned block can only start at two,
respectively four, distinct offsets within such a window, which reduces
the possibilities by a lot after all.


In the particular example I've hit, however, it's pretty obvious that
alignment is not the issue (and I still verified that).
In the end, how big the impact of a better layout would be in general is
hard to quantify. Hence the question whether anyone has pointers to good
literature which looks into this.

Cheers
Andreas




Re: Basic Block Layout in the NCG

2018-05-06 Thread Sven Panne
2018-05-05 21:23 GMT+02:00 Andreas Klebinger:

> [...] I came across cases where inverting conditions lead to big
> performance losses since suddenly block layout
> got all messed up. (~4% slowdown for the worst offenders). [...]
>

4% is far from being "big", look e.g. at
https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues where
changing just the alignment of the code led to a 10% difference. :-/ The
code itself or its layout wasn't changed at all. The "Producing Wrong Data
Without Doing Anything Obviously Wrong!" paper gives more funny examples.

I'm not saying that code layout has no impact, quite the opposite. The main
point is: Do we really have a benchmarking machinery in place which can
tell you if you've improved the real run time or made it worse? I doubt
that, at least at the scale of a few percent. To reach just that simple
yes/no conclusion, you would need quite a heavy machinery involving
randomized linking order, varying environments (in the sense of "number and
contents of environment variables"), various CPU models etc. If you do not
do that, modern HW will leave you with a lot of "WTF?!" moments and wrong
conclusions.