Re: [julia-users] Re: Strange performance problem for array scaling
On Mon, Jul 13, 2015 at 2:20 AM, Jeffrey Sarnoff jeffrey.sarn...@gmail.com wrote: [...]

Thank you very much for the references. Yes, I definitely believe every part of the IEEE floating point standard has its reason to be there, and I'm more wondering in which cases they are significant. OTOH, I also believe there are certain kinds of computation that do not care about them, which is why -ffast-math exists. From the MathWorks blog reference: "Double precision denormals are so tiny that they are rarely numerically significant, but single precision denormals can be in the range where they affect some otherwise unremarkable computations." For my computation, I'm currently using double, but I've already checked that switching to single precision still gives me enough precision. Based on this, can I say that I can ignore them if I use double precision, and may need to keep them if I switch to single precision? Using Float64 takes twice as long as using Float32, while keeping denormals seems to take 10x the time.

As for doing it in julia, I found @simonbyrne's mxcsr.jl[1]. However, I couldn't get it working without #11604[2]. Inline assembly in llvmcall is working on LLVM 3.6 though[3], in case it's useful for others.
[1] https://gist.github.com/simonbyrne/9c1e4704be46b66b1485
[2] https://github.com/JuliaLang/julia/pull/11604
[3] https://github.com/yuyichao/explore/blob/a47cef8c84ad3f43b18e0fd797dca9debccdd250/julia/array_prop/array_prop.jl#L3

On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
And for future reference, I found #789[1], which is not documented anywhere AFAICT (will probably file a doc issue...). It also supports runtime detection of CPU features, so it should be much more portable.

[1] https://github.com/JuliaLang/julia/pull/789
Re: [julia-users] Re: Strange performance problem for array scaling
Cleve Moler's discussion is not quite as contextually invariant as are William Kahan's and James Demmel's. In fact, the numerical analysis community has made an overwhelmingly strong case that, roughly speaking, one is substantively better situated having denormalized floating point values used whenever they may arise than saving those extra cycles while at the mercy of the lost smoothness that comes from shoving those values to zero. And this holds widely for floating-point-centered applications and libraries.

If the world were remade with each sunrise by fixed-bitwidth floating point computations, supporting denormals would be like making house calls with a few numerical vaccines to everyone who relies on those computations to inform expectations about non-trivial work with fixed-bitwidth floating point types. It does not wipe out all forms of numerical untowardness, and some will find the vaccines more prophylactic than others; still, the analogy holds. We vaccinate many babies against measles even though some would never have become exposed to that disease .. and for those who forgot why, not long ago the news was about a Disney vacation disease nexus and how far it spread -- then California changed its law to make it more difficult to opt out of childhood vaccination. Having denormals there when the values they cover arise brings a benefit that parallels the good in that law change. The larger environment gets better by growing stronger, and that can happen because something that had been bringing weakness (disease, or bad consequences from subtle numeric misadventures) no longer operates.

There is another way denormals have been shown to matter -- but the above ought to help you feel at ease deciding not to move your work from Float64 to Float32 just to avoid values that hover around the smaller magnitudes realizable with Float64s. That sounds like a headache, and you would not have changed the theory in a way that makes things work better (or at all). Recasting the solving or transforming approach at hand to work with integer values would move the work away from any cost and benefit that accompany denormals. Other than that, thank your favorite floating point microarchitect for giving you greater throughput with denormals than everyone had a few design cycles ago. I would like their presence without measurable cost .. just not enough to dislike their availability.

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
On Mon, Jul 13, 2015 at 11:39 AM, Jeffrey Sarnoff jeffrey.sarn...@gmail.com wrote:

Thanks for sharing your view about denormal values. I hope what I said didn't sound like I want to get rid of them completely (and if it did, I didn't mean it...). I haven't read the more detailed analyses of their impact, but I believe you that they are important in general.

For my specific application, I'm doing time propagation of a wavefunction (that can decay). For my purposes, there are many other sources of uncertainty, and I'm mainly interested in how the majority of the wavefunction behaves. Therefore, I don't really care about the actual value of anything smaller than 10^-10, but I do want it to run fast. Since this is a linear problem, I can also scale the values by a constant factor to make underflow less of a problem.

I have not looked at the specifics of what is going on ... [...]

On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
It is a fairer test working from a second copy of the data that has been prescaled: prescale(x::Float64) = x * 2.0^64

On Monday, July 13, 2015 at 12:51:33 PM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 12:04:32 PM UTC-4, Yichao Yu wrote: [...]

On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
I have not looked at the specifics of what is going on ... Dismissing denormals is particularly dicey when your functional data flow is generating many denormalized values. Do you know what is causing many values of very small magnitude to occur as you run this? Is the data holding them explicitly? If so, and you have access to preprocess the data, and you are sure that the software cannot accumulate, reciprocate, exponentiate, etc. them, clamp those values to zero and then use the data. Does the code operate as a denormalized value generator? If so, a small alteration to the order of operations may help.

On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
Staying with Float64, see if the runtime comes way down when you prescale the data using prescale(x) = x * 2.0^64. Guessing your values to be less than 10^15, and assuming the worst-case smallest magnitude, the base-10 exponent of your largest data value stays below 70; scaling by a constant is a good strategy when the largest of the data values is not large.

On Monday, July 13, 2015 at 12:04:32 PM UTC-4, Yichao Yu wrote: [...]

On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
Denormals were made part of the IEEE Floating Point standard after some very careful numerical analysis showed that accommodating them would substantively improve the quality of floating point results, and this would lift the quality of all floating point work. Surprising as it may be, nonetheless you (and if not you today, then you tomorrow, or one of your neighbors today) really do care about those unusual, and often rarely observed, values. FYI:

William Kahan on the introduction of denormals to the standard: https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

and an early, important paper on this: Effects of Underflow on Solving Linear Systems - J. Demmel 1981: http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf

On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:

On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu yyc...@gmail.com wrote:

Further update: I made a c++ version[1] and see a similar effect (depending on optimization levels), so it's not a julia issue (not that I think it really was to begin with...). After investigating the c++ version more, I find that the difference between the fast_math and the non-fast_math versions is that the compiler emits a function called `set_fast_math` (see below). From what I can tell, the function sets bit 6 and bit 15 in the MXCSR register (for SSE), and according to this page[1] these are the DAZ and FZ bits (both related to underflow). It also describes denormals as taking considerably longer to process. Since the operation I have keeps decreasing the value, I guess it makes sense that there's value-dependent performance (and it kind of makes sense that fft also suffers from these values).

So now the questions are:
1. How important are underflow and denormal values? Note that I'm not catching underflow explicitly anyway, and I don't really care about values that are really small compared to 1.
2. Is there a way to set up the SSE registers as done by the C compilers? @fastmath does not seem to be doing this.

```
05b0 set_fast_math:
 5b0: 0f ae 5c 24 fc          stmxcsr -0x4(%rsp)
 5b5: 81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
 5bc: 00
 5bd: 0f ae 54 24 fc          ldmxcsr -0x4(%rsp)
 5c2: c3                      retq
 5c3: 66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 5ca: 00 00 00
 5cd: 0f 1f 00                nopl   (%rax)
```

[1] http://softpixel.com/~cwright/programming/simd/sse.php

The slowdown is present in the c++ version for all optimization levels except -Ofast and -ffast-math. The julia version is faster than the default performance for both gcc and clang, but is slower than the fast case at higher optimization levels. (The slowness of the julia version seems to be due to multidimensional arrays; using a 1d array yields similar performance to C.) For O2 and higher, the c++ version shows a ~100x slowdown for the slow case. @fast_math in julia doesn't seem to have an effect for this, although it does for clang and gcc...

[1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp

On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc...@gmail.com wrote:

Update: I've just got an even simpler version without any complex numbers that only has Float64. The two loops are as small as the following LLVM-IR now, and there's only simple arithmetic in the loop body.

```llvm
L9.preheader:                        ; preds = %L12, %L9.preheader.preheader
  %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9
L9:                                  ; preds = %L9, %L9.preheader
  %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %#s4.0, 1
  %45 = add i64 %#s4.0, -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %#s4.0, %12
  br i1 %59, label %L12, label %L9
L12:                                 ; preds = %L9
  %60 = add i64 %#s3.0, 1
  %61 = icmp eq i64 %#s3.0, %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```

On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu yyc...@gmail.com wrote:

On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin@gmail.com wrote: I can't really help you debug the IR code, but I can at least say I'm
Re: [julia-users] Re: Strange performance problem for array scaling
and this: Cleve Moler tries to see it your way Moler on floating point denormals http://blogs.mathworks.com/cleve/2014/07/21/floating-point-denormals-insignificant-but-controversial-2/ On Monday, July 13, 2015 at 2:11:22 AM UTC-4, Jeffrey Sarnoff wrote: Denormals were made part of the IEEE Floating Point standard after some very careful numerical analysis showed that accomodating them would substantively improve the quality of floating point results and this would lift the quality of all floating point work. Surprising it may be, nonetheless you (and if not you today, you tomorrow and one of your neighbors today) really do care about those unusual, and often rarely observed values. fyi William Kahan on the introduction of denormals to the standard https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html and an early, important paper on this Effects of Underflow on Solving Linear Systems - J.Demmel 1981 http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote: On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu yyc...@gmail.com wrote: Further update: I made a c++ version[1] and see a similar effect (depending on optimization levels) so it's not a julia issue (not that I think it really was to begin with...). After investigating the c++ version more, I find that the difference between the fast_math and the non-fast_math version is that the compiler emit a function called `set_fast_math` (see below). From what I can tell, the function sets bit 6 and bit 15 on the MXCSR register (for SSE) and according to this page[1] these are DAZ and FZ bits (both related to underflow). It also describe denormals as take considerably longer to process. Since the operation I have keeps decreasing the value, I guess it makes sense that there's a value dependent performance (and it kind of make sense that fft also suffers from these values) So now the question is: 1. How important are underflow and denormal values? 
Note that I'm not catching underflow explicitly anyway, and I don't really care about values that are really small compared to 1.

2. Is there a way to set up the SSE registers the way the c compilers do? @fastmath does not seem to be doing this.

```
05b0 <set_fast_math>:
 5b0:  0f ae 5c 24 fc           stmxcsr -0x4(%rsp)
 5b5:  81 4c 24 fc 40 80 00     orl    $0x8040,-0x4(%rsp)
 5bc:  00
 5bd:  0f ae 54 24 fc           ldmxcsr -0x4(%rsp)
 5c2:  c3                       retq
 5c3:  66 2e 0f 1f 84 00 00     nopw   %cs:0x0(%rax,%rax,1)
 5ca:  00 00 00
 5cd:  0f 1f 00                 nopl   (%rax)
```

[1] http://softpixel.com/~cwright/programming/simd/sse.php

The slowdown is present in the c++ version at all optimization levels except -Ofast and -ffast-math. The julia version is faster than the default c++ performance for both gcc and clang, but is slower than the fast case at higher optimization levels. (The slowness of the julia version seems to be due to multidimensional arrays; using a 1d array yields performance similar to C.) For -O2 and higher, the c++ version shows a ~100x slowdown for the slow case. @fastmath in julia doesn't seem to have an effect on this, although the equivalent does for clang and gcc...

[1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp

On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc...@gmail.com wrote: Update: I've just got an even simpler version without any complex numbers, using only Float64. The two loops are as small as the following LLVM-IR now, and there's only simple arithmetic in the loop body.
```llvm
L9.preheader:                       ; preds = %L12, %L9.preheader.preheader
  %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9

L9:                                 ; preds = %L9, %L9.preheader
  %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %#s4.0, 1
  %45 = add i64 %#s4.0, -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %#s4.0, %12
  br i1 %59, label %L12, label %L9

L12:                                ; preds = %L9
  %60 = add i64 %#s3.0, 1
  %61 = icmp eq i64 %#s3.0, %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```
Re: [julia-users] Re: Strange performance problem for array scaling
Further update: I made a c++ version[1] and see a similar effect (depending on optimization levels), so it's not a julia issue (not that I think it really was to begin with...). The slowdown is present in the c++ version at all optimization levels except -Ofast and -ffast-math. The julia version is faster than the default c++ performance for both gcc and clang, but is slower than the fast case at higher optimization levels. (The slowness of the julia version seems to be due to multidimensional arrays; using a 1d array yields performance similar to C.) For -O2 and higher, the c++ version shows a ~100x slowdown for the slow case. @fastmath in julia doesn't seem to have an effect on this, although the equivalent does for clang and gcc...

[1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
Re: [julia-users] Re: Strange performance problem for array scaling
On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin.j.ow...@gmail.com wrote:

> I can't really help you debug the IR code, but I can at least say I'm seeing a similar thing. It starts to slow down just after 0.5, and doesn't get back to where it was at 0.5 until 0.87.

Thanks for the confirmation. At least I'm not crazy (or not the only one, to be more precise :P).

> Can you compare the IR code when two different values are used, to see what's different? When I tried looking at the difference between 0.50 and 0.51, the biggest thing that popped out to me was that the numbers after !dbg were different.

This is exactly the strange part. I don't think either llvm or julia is doing constant propagation here, and different input values should be using the same function. The evidence:

1. Julia only specializes on types, not values (for now).
2. The function is not inlined, which has to be the case in global scope and can be double-checked by adding `@noinline`. It's also too big to be inline_worthy.
3. No codegen happens except on the first call. This can be seen from the output of `@time`: only the first call has thousands of allocations; following calls have fewer than 5 (on 0.4 at least). If julia could compile a function with fewer than 5 allocations, I'd be very happy.
4. My original version also has much more complicated logic to compute that scaling factor (ok, it's just a function call, but with parameters gathered from different arguments). I'd be really surprised if either llvm or julia could reason anything about it.

The difference in the debug info should just be an artifact of emitting it twice.

> Even 0.50001 is a lot slower:
>
> julia> for eΓ in 0.5:0.1:0.50015
>            println(eΓ)
>            gc()
>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>        end
> 0.5
> elapsed time: 0.065609581 seconds (16 bytes allocated)
> 0.50001
> elapsed time: 0.875806461 seconds (16 bytes allocated)

This is actually interesting, and I can confirm the same here: `0.5` takes 28ms while 0.5001 takes 320ms. I was wondering whether the cpu is doing something special depending on the bit pattern, but I still don't understand why it would be very bad for a certain range (and it is not only a function of this value either; it is also affected by all the other values).

> julia> versioninfo()
> Julia Version 0.3.9
> Commit 31efe69 (2015-05-30 11:24 UTC)
> Platform Info:
>   System: Darwin (x86_64-apple-darwin13.4.0)
>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>   LAPACK: libopenblas
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.3

Good, so it's not Linux-only.
Re: [julia-users] Re: Strange performance problem for array scaling
Update: I've just got an even simpler version without any complex numbers, using only Float64. The two loops are as small as the following LLVM-IR now, and there's only simple arithmetic in the loop body.

```llvm
L9.preheader:                       ; preds = %L12, %L9.preheader.preheader
  %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9

L9:                                 ; preds = %L9, %L9.preheader
  %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %#s4.0, 1
  %45 = add i64 %#s4.0, -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %#s4.0, %12
  br i1 %59, label %L12, label %L9

L12:                                ; preds = %L9
  %60 = add i64 %#s3.0, 1
  %61 = icmp eq i64 %#s3.0, %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```
[julia-users] Re: Strange performance problem for array scaling
I can't really help you debug the IR code, but I can at least say I'm seeing a similar thing. It starts to slow down just after 0.5, and doesn't get back to where it was at 0.5 until 0.87. Can you compare the IR code when two different values are used, to see what's different? When I tried looking at the difference between 0.50 and 0.51, the biggest thing that popped out to me was that the numbers after !dbg were different. Even 0.50001 is a lot slower:

```julia
julia> for eΓ in 0.5:0.1:0.50015
           println(eΓ)
           gc()
           @time ψs = propagate(P, ψ0, ψs, eΓ)
       end
0.5
elapsed time: 0.065609581 seconds (16 bytes allocated)
0.50001
elapsed time: 0.875806461 seconds (16 bytes allocated)
```

```
julia> versioninfo()
Julia Version 0.3.9
Commit 31efe69 (2015-05-30 11:24 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
```
[julia-users] Re: Strange performance problem for array scaling
On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu yyc1...@gmail.com wrote: P.S. Given how strange this problem is for me, I would appreciate it if anyone could confirm either that this is a real issue or that I'm somehow being crazy or stupid. One additional strange property of this issue: I used to have much costlier operations in the (outer) loop (the one that iterates over nsteps with i), like Fourier transforms. However, when the scaling factor takes the bad value, it slows everything down (i.e. the Fourier transform is also slower by ~10x).

On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu yyc1...@gmail.com wrote: Hi, I've just seen a very strange (for me) performance difference for exactly the same code on slightly different input, with no explicit branches. The code is available here[1]. The most relevant part is the following function (all other parts of the code are for initialization and benchmarking). This is a simplified version of my simulation that computes the next column in the array based on the previous one. The strange part is that the performance of this function can differ by 10x depending on the value of the scaling factor (`eΓ`, whose only use is marked in the code below), even though I don't see any branches that depend on that value in the relevant code (unless the cpu is 10x less efficient for certain input values).

```julia
function propagate(P, ψ0, ψs, eΓ)
    @inbounds for i in 1:P.nele
        ψs[1, i, 1] = ψ0[1, i]
        ψs[2, i, 1] = ψ0[2, i]
    end
    T12 = im * sin(P.Ω)
    T11 = cos(P.Ω)
    @inbounds for i in 2:(P.nstep + 1)
        for j in 1:P.nele
            ψ_e = ψs[1, j, i - 1]
            ψ_g = ψs[2, j, i - 1] * eΓ # Scaling factor
            ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
            ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
        end
    end
    ψs
end
```

The output of the full script is attached, and it can be clearly seen that for scaling factors 0.6-0.8 the performance is 5-10 times slower than for others. The assembly[2] and llvm[3] code of this function are also in the same repo. I see the same behavior on both 0.3 and 0.4, with LLVM 3.3 and LLVM 3.6, on two different x86_64 machines (my laptop and a linode VPS). (The only platform I've tried that doesn't show similar behavior is julia 0.4 on qemu-arm, although the performance between different values still differs by ~30%, which is bigger than noise.) This also seems to depend on the initial value. Has anyone seen similar problems before?

Outputs:

```
325.821 milliseconds (25383 allocations: 1159 KB)
307.826 milliseconds (4 allocations: 144 bytes)
0.0  19.227 milliseconds (2 allocations: 48 bytes)
0.1  17.291 milliseconds (2 allocations: 48 bytes)
0.2  17.404 milliseconds (2 allocations: 48 bytes)
0.3  19.231 milliseconds (2 allocations: 48 bytes)
0.4  20.278 milliseconds (2 allocations: 48 bytes)
0.5  23.692 milliseconds (2 allocations: 48 bytes)
0.6  328.107 milliseconds (2 allocations: 48 bytes)
0.7  312.425 milliseconds (2 allocations: 48 bytes)
0.8  201.494 milliseconds (2 allocations: 48 bytes)
0.9  16.314 milliseconds (2 allocations: 48 bytes)
1.0  16.264 milliseconds (2 allocations: 48 bytes)
```

[1] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
[2] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
[3] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll