Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Yichao Yu
On Mon, Jul 13, 2015 at 2:20 AM, Jeffrey Sarnoff
jeffrey.sarn...@gmail.com wrote:
 and this: Cleve Moler tries to see it your way
 Moler on floating point denormals


 On Monday, July 13, 2015 at 2:11:22 AM UTC-4, Jeffrey Sarnoff wrote:

 Denormals were made part of the IEEE Floating Point standard after some
 very careful numerical analysis showed that accommodating them would
 substantively improve the quality of floating point results, and that this
 would lift the quality of all floating point work. Surprising as it may be,
 nonetheless you (and if not you today, then you tomorrow, or one of your
 neighbors today) really do care about those unusual, and often rarely
 observed, values.

 fyi
 William Kahan on the introduction of denormals to the standard
 and an early, important paper on this
 Effects of Underflow on Solving Linear Systems - J.Demmel 1981


Thank you very much for the references.

Yes, I definitely believe every part of the IEEE Floating Point
standard has its reason to be there; I'm more wondering in which
cases they are significant. OTOH, I also believe there are certain
kinds of computation that do not care about them, which is why
-ffast-math exists.

From the MathWorks blog post referenced:

 Double precision denormals are so tiny that they are rarely numerically 
 significant, but single precision denormals can be in the range where they 
 affect some otherwise unremarkable computations.

For my computation, I'm currently using double, but I've already
checked that switching to single precision still gives me enough
precision. Based on this, can I say that I can ignore denormals if I use
double precision, but may need to keep them if I switch to single
precision? Using Float64 takes twice as long as using Float32, while
keeping denormals seems to take 10x as long.



As for doing it in julia, I found @simonbyrne's mxcsr.jl[1]. However,
I couldn't get it working without #11604[2]. Inline assembly in
llvmcall is working on LLVM 3.6 though[3], in case it's useful for
others.


[1] https://gist.github.com/simonbyrne/9c1e4704be46b66b1485
[2] https://github.com/JuliaLang/julia/pull/11604
[3] 
https://github.com/yuyichao/explore/blob/a47cef8c84ad3f43b18e0fd797dca9debccdd250/julia/array_prop/array_prop.jl#L3


 On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:

 On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu yyc...@gmail.com wrote:
  Further update:
 
  I made a c++ version[1] and see a similar effect (depending on
  optimization levels) so it's not a julia issue (not that I think it
  really was to begin with...).

 After investigating the c++ version more, I find that the difference
 between the fast_math and the non-fast_math version is that the
 compiler emits a function called `set_fast_math` (see below).

 From what I can tell, the function sets bit 6 and bit 15 in the MXCSR
 register (for SSE), and according to this page[1] these are the DAZ and FZ
 bits (both related to underflow). It also describes denormals as taking
 considerably longer to process. Since the operation I have keeps
 decreasing the value, I guess it makes sense that the performance is
 value dependent (and it kind of makes sense that fft also
 suffers from these values).

 So now the question is:

 1. How important are underflow and denormal values? Note that I'm not
 catching underflow explicitly anyway, and I don't really care about
 values that are really small compared to 1.

 2. Is there a way to set up the SSE registers the way the c
 compilers do? @fastmath does not seem to be doing this.

 05b0 set_fast_math:
  5b0:0f ae 5c 24 fc   stmxcsr -0x4(%rsp)
  5b5:81 4c 24 fc 40 80 00 orl$0x8040,-0x4(%rsp)
  5bc:00
  5bd:0f ae 54 24 fc   ldmxcsr -0x4(%rsp)
  5c2:c3   retq
  5c3:66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1)
  5ca:00 00 00
  5cd:0f 1f 00 nopl   (%rax)

 [1] http://softpixel.com/~cwright/programming/simd/sse.php

 
  The slowdown is present in the c++ version for all optimization
  levels except -Ofast and -ffast-math. The julia version is faster than
  the default performance for both gcc and clang, but is slower than the
  fast case at higher optimization levels. For O2 and higher, the c++

 The slowness of the julia version seems to be due to multidimensional
 arrays. Using a 1d array yields performance similar to C.

  version shows a ~100x slowdown for the slow case.
 
  @fast_math in julia doesn't seem to have an effect for this although
  it does for clang and gcc...
 
  [1]
  https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
 
  On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc...@gmail.com wrote:
  Update:
 
  I've just got an even simpler version without any complex numbers and
  only has Float64. The two loops are as small as the following LLVM-IR
  now and there's only simple arithmetics in the loop body.
 
  ```llvm
  L9.preheader: 

Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Yichao Yu
 As for doing it in julia, I found @simonbyrne's mxcsr.jl[1]. However,
 I couldn't get it working without #11604[2]. Inline assembly in
 llvmcall is working on LLVM 3.6 though[3], in case it's useful for
 others.


And for future reference, I found #789, which is not documented
anywhere AFAICT (I will probably file a doc issue...).
It also supports runtime detection of CPU features, so it should be much
more portable.

[1] https://github.com/JuliaLang/julia/pull/789


 [1] https://gist.github.com/simonbyrne/9c1e4704be46b66b1485
 [2] https://github.com/JuliaLang/julia/pull/11604
 [3] 
 https://github.com/yuyichao/explore/blob/a47cef8c84ad3f43b18e0fd797dca9debccdd250/julia/array_prop/array_prop.jl#L3



Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Jeffrey Sarnoff
Cleve Moler's discussion is not quite as contextually invariant as 
William Kahan's and James Demmel's.
In fact, the numerical analysis community has made an overwhelmingly 
strong case that, roughly speaking, one is substantively better situated 
when denormalized floating point values are used wherever they may arise 
than when one is free of those extra cycles, at the mercy of an absent 
smoothness shoving those values to zero. And this holds widely for 
floating-point-centered applications and libraries.

If the world were remade with each sunrise by fixed-bitwidth floating 
point computations, supporting denormals would be to have made house 
calls, with numerical vaccines, to everyone relying on those computations 
to inform expectations about non-trivial work with fixed-bitwidth 
floating point types. It does not wipe out all forms of numerical 
untowardness, and some will find the vaccines more prophylactic than 
others; still, the analogy holds.

We vaccinate many babies against measles even though some would never 
have become exposed to that disease .. and for those who forgot why: not 
long ago the news was about a Disney vacation disease nexus and how far 
it spread -- then California changed its law to make it more difficult 
to opt out of childhood vaccination. Having denormals there when the 
values they cover arise brings a benefit that parallels the good in that 
law change. The larger social environment gets better by growing 
stronger, and that can happen because something that had been bringing 
weakness (disease, or bad consequences from subtle numerical 
misadventures) no longer operates.

There is another way denormals have been shown to matter -- the 
discussion above ought to help you feel at ease with deciding not to 
move your work from Float64 to Float32 for the purpose of avoiding 
values that hover around the smaller magnitudes realizable with 
Float64s. That sounds like a headache, and it would not have changed the 
theory in a way that makes things work (or at all). Recasting the 
approach, to solve or transform the problem at hand with integer values, 
would move the work away from any cost and benefit that accompany 
denormals. Other than that, thank your favorite floating point 
microarchitect for giving you greater throughput with denormals than 
everyone had a few design cycles ago.

I would like their presence without measurable cost .. just not enough 
to dislike their availability.



Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Yichao Yu
On Mon, Jul 13, 2015 at 11:39 AM, Jeffrey Sarnoff
jeffrey.sarn...@gmail.com wrote:

Thanks for sharing your view on denormal values. I hope what I said
didn't sound like I want to get rid of them completely (and if it did
sound like that, I didn't mean it...). I haven't read the more detailed
analyses of their impact, but I'll take your word that they are
important in general.

For my specific application, I'm doing time propagation of a
wavefunction (which can decay). For my purposes, there are many other
sources of uncertainty, and I'm mainly interested in how the majority
of the wavefunction behaves. Therefore I don't really care about the
actual value of anything smaller than 10^-10, but I do want it to
run fast. Since this is a linear problem, I can also scale the values
by a constant factor to make underflow less of a problem.

 I have not looked at the specifics of what is going on ...
 Dismissing denormals is particularly dicey when your functional data flow is
 generating many denormalized values.

 Do you know what is causing the many values of very small magnitude as
 you run this?

 Is the data holding them explicitly?  If so, and you have access to
 preprocess the data, and you are sure that the software cannot
 accumulate or reciprocate or exp etc. them, clamp those values to zero
 and then use the data.

 Does the code operate as a denormalized-value generator? If so, a small





Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Jeffrey Sarnoff
it is a fairer test to work from a second copy of the data that has been
prescaled: prescale(x::Float64) = x * 2.0^64


Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Jeffrey Sarnoff
I have not looked at the specifics of what is going on ...
Dismissing denormals is particularly dicey when your functional data flow 
is generating many denormalized values.

Do you know what is causing the many values of very small magnitude as 
you run this?

Is the data holding them explicitly?  If so, and you have access to 
preprocess the data, and you are sure that the software cannot 
accumulate or reciprocate or exp etc. them, clamp those values to zero 
and then use the data.

Does the code operate as a denormalized-value generator? If so, a small 
alteration to the order of operations may help.



On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote:

 Cleve Moler's discussion is not quite as contextually invariant as are 
 William Kahan's and James Demmel's.
 In fact the numerical analysis community has made an overwhelmingly 
 strong case that, roughly speaking,
 one is substantively better situated where denormalized floating point 
 values will be used whenever they may
 arise than being free of those extra cycles at the mercy of an absent 
 smoothness shoving those values to zero.
 And this holds widely for floating point centered applications or 
 libraries. 

 If the world were remade with each sunrise by fixed bitwidth floating 
 point computations, supporting denormals
 is to have made house-calls with few numerical vaccines to everyone who 
 will be relying on those computations
 to inform expectations about non-trivial work with fixed bitwdith floating 
 point types.  It does not wipe out all forms
 of numerical untowardness, and some will find the vaccinces more 
 prophylatic than others; still, the analogy holds.

 We vaccinate many babies against measles even though there are some who 
 would never have become exposed
 to that disease .. and for those who forgot why, not long ago the news was 
 about a Disney vaction disease nexus
 and how far it spread -- then California changed its law to make it more 
 difficult to opt-out of childhood vaccination.
 Having denormals there when the values they cover arise brings benifit 
 that parallels the good in that law change.
 The larger social environment  gets better by growing stronger and that 
 can happen because somethat that had
 been bringing weakness (disease or bad consequences from subtile numbery 
 misadventures) no longer operates.

 There is another way denormals have been shown to be matter -- the way 
 above ought to help you feel at ease
 with deciding not to move your work from Float64 to Float32 for the 
 purpose of avoiding values that hover around
 smaller magnitudes realizable with Float64s.  That sounds like a headache, 
 and you would not have changed
 the theory in a way that makes things work  (or at all).  Recasting the 
 approch to solving ot transforming at hand
 to work with integer values would move the work away from any cost and 
 benefit that accompany denormals.
 Other that that, thank your favorite floating point microarchitect for 
 giving you greater throughput with denormals
 than everyone had a few design cycles ago.

 I would like their presence without measureable cost .. just not enough to 
 dislike their availability.

 On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote:

  As for doing it in julia, I found @simonbyrne's mxcsr.jl[1]. However, 
  I couldn't get it working without #11604[2]. Inline assembly in 
  llvmcall is working on LLVM 3.6 though[3], in case it's useful for 
  others. 
  

 And for future references I find #789, which is not documented 
 anywhere AFAICT (will probably file a doc issue...) 
 It also supports runtime detection of cpu feature so it should be much 
 more portable. 

 [1] https://github.com/JuliaLang/julia/pull/789 

  
  [1] https://gist.github.com/simonbyrne/9c1e4704be46b66b1485 
  [2] https://github.com/JuliaLang/julia/pull/11604 
  [3] 
 https://github.com/yuyichao/explore/blob/a47cef8c84ad3f43b18e0fd797dca9debccdd250/julia/array_prop/array_prop.jl#L3
  
  



Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Jeffrey Sarnoff
Staying with Float64, see if the runtime comes way down when you prescale
the data using prescale(x) = x * 2.0^64.

Guessing your values to be less than 10^15, and assuming the worst-case
smallest magnitude, the base-10 exponent of your largest data value is
below 70. Scaling by a constant is a good strategy when the largest of
the data values is not large.


Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Jeffrey Sarnoff
Denormals were made part of the IEEE Floating Point standard after some 
very careful numerical analysis showed that accommodating them would 
substantively improve the quality of floating point results, and that this 
would lift the quality of all floating point work. Surprising as it may be, 
nonetheless you (and if not you today, then you tomorrow, or one of your 
neighbors today) really do care about those unusual, and often rarely 
observed, values.

fyi
William Kahan on the introduction of denormals to the standard 
https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
and an early, important paper on this
Effects of Underflow on Solving Linear Systems - J.Demmel 1981 
http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf
  

On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:

 On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu yyc...@gmail.com 
 javascript: wrote: 
  Further update: 
  
  I made a c++ version[1] and see a similar effect (depending on 
  optimization levels) so it's not a julia issue (not that I think it 
  really was to begin with...). 

 After investigating the c++ version more, I find that the difference
 between the fast_math and the non-fast_math version is that the
 compiler emits a function called `set_fast_math` (see below).

 From what I can tell, the function sets bit 6 and bit 15 in the MXCSR
 register (for SSE), and according to this page[1] these are the DAZ and FZ
 bits (both related to underflow). It also describes denormals as taking
 considerably longer to process. Since the operation I have keeps
 decreasing the value, I guess it makes sense that the performance is
 value dependent (and it kind of makes sense that fft also
 suffers from these values).

 So now the question is: 

 1. How important are underflow and denormal values? Note that I'm not
 catching underflow explicitly anyway, and I don't really care about
 values that are really small compared to 1.

 2. Is there a way to set up the SSE registers the way the c
 compilers do? @fastmath does not seem to be doing this.

 05b0 set_fast_math: 
  5b0:0f ae 5c 24 fc   stmxcsr -0x4(%rsp) 
  5b5:81 4c 24 fc 40 80 00 orl$0x8040,-0x4(%rsp) 
  5bc:00 
  5bd:0f ae 54 24 fc   ldmxcsr -0x4(%rsp) 
  5c2:c3   retq 
  5c3:66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1) 
  5ca:00 00 00 
  5cd:0f 1f 00 nopl   (%rax) 

 [1] http://softpixel.com/~cwright/programming/simd/sse.php 

  
  The slowdown is present in the c++ version for all optimization
  levels except -Ofast and -ffast-math. The julia version is faster than
  the default performance for both gcc and clang, but is slower than the
  fast case at higher optimization levels. For O2 and higher, the c++

 The slowness of the julia version seems to be due to multidimensional
 arrays. Using a 1d array yields performance similar to C.

  version shows a ~100x slowdown for the slow case.
  
  @fast_math in julia doesn't seem to have an effect for this although 
  it does for clang and gcc... 
  
  [1] 
 https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
  
  
  On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc...@gmail.com 
 javascript: wrote: 
  Update: 
  
  I've just got an even simpler version without any complex numbers and 
  only has Float64. The two loops are as small as the following LLVM-IR 
  now and there's only simple arithmetics in the loop body. 
  
  ```llvm 
  L9.preheader: ; preds = %L12, 
  %L9.preheader.preheader 
%#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ] 
br label %L9 
  
  L9:   ; preds = %L9, 
 %L9.preheader 
%#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ] 
%44 = add i64 %#s4.0, 1 
%45 = add i64 %#s4.0, -1 
%46 = mul i64 %45, %10 
%47 = getelementptr double* %7, i64 %46 
%48 = load double* %47, align 8 
%49 = add i64 %46, 1 
%50 = getelementptr double* %7, i64 %49 
%51 = load double* %50, align 8 
%52 = fmul double %51, %3 
%53 = fmul double %38, %48 
%54 = fmul double %33, %52 
%55 = fadd double %53, %54 
store double %55, double* %50, align 8 
%56 = fmul double %38, %52 
%57 = fmul double %33, %48 
%58 = fsub double %56, %57 
store double %58, double* %47, align 8 
%59 = icmp eq i64 %#s4.0, %12 
br i1 %59, label %L12, label %L9 
  
  L12:  ; preds = %L9 
%60 = add i64 %#s3.0, 1 
%61 = icmp eq i64 %#s3.0, %42 
br i1 %61, label %L14.loopexit, label %L9.preheader 
  ``` 
  
  On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu yyc...@gmail.com 
 javascript: wrote: 
  On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin@gmail.com 
 javascript: wrote: 
  I can't really help you debug the IR code, but I can at least say I'm 
 

Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-13 Thread Jeffrey Sarnoff
and this: Cleve Moler tries to see it your way 
Moler on floating point denormals 
http://blogs.mathworks.com/cleve/2014/07/21/floating-point-denormals-insignificant-but-controversial-2/

On Monday, July 13, 2015 at 2:11:22 AM UTC-4, Jeffrey Sarnoff wrote:

 Denormals were made part of the IEEE Floating Point standard after some
 very careful numerical analysis showed that accommodating them would
 substantively improve the quality of floating point results, and this would
 lift the quality of all floating point work. Surprising as it may be,
 you (and if not you today, then you tomorrow, or one of your
 neighbors today) really do care about those unusual, and often rarely
 observed, values.

 fyi
 William Kahan on the introduction of denormals to the standard 
 https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
 and an early, important paper on this
 Effects of Underflow on Solving Linear Systems - J.Demmel 1981 
 http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf
   

 On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:

 On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu yyc...@gmail.com wrote: 
  Further update: 
  
  I made a c++ version[1] and see a similar effect (depending on 
  optimization levels) so it's not a julia issue (not that I think it 
  really was to begin with...). 

 After investigating the c++ version more, I find that the difference
 between the fast_math and the non-fast_math version is that the
 compiler emits a function called `set_fast_math` (see below).

 From what I can tell, the function sets bit 6 and bit 15 of the MXCSR
 register (for SSE), and according to this page[1] these are the DAZ and FZ
 bits (both related to underflow). It also describes denormals as taking
 considerably longer to process. Since the operation I have keeps
 decreasing the value, I guess it makes sense that the performance is
 value dependent (and it kind of makes sense that fft also
 suffers from these values)

 So now the questions are:

 1. How important are underflow and denormal values? Note that I'm not
 catching underflow explicitly anyway, and I don't really care about
 values that are really small compared to 1.

 2. Is there a way to set up the SSE registers as the C
 compilers do? @fastmath does not seem to be doing this.

 05b0 set_fast_math: 
  5b0:0f ae 5c 24 fc   stmxcsr -0x4(%rsp) 
  5b5:81 4c 24 fc 40 80 00 orl$0x8040,-0x4(%rsp) 
  5bc:00 
  5bd:0f ae 54 24 fc   ldmxcsr -0x4(%rsp) 
  5c2:c3   retq 
  5c3:66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1) 
  5ca:00 00 00 
  5cd:0f 1f 00 nopl   (%rax) 

 [1] http://softpixel.com/~cwright/programming/simd/sse.php 

  
  The slow down is present in the c++ version at all optimization
  levels except Ofast and ffast-math. The julia version is faster than
  the default performance for both gcc and clang, but is slower than the
  fast case at higher optimization levels. For O2 and higher, the c++

 The slowness of the julia version seems to be due to multidimensional
 arrays. Using a 1d array yields performance similar to C.

  version shows a ~100x slow down for the slow case. 
  
  @fast_math in julia doesn't seem to have an effect for this although 
  it does for clang and gcc... 
  
  [1] 
 https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
  
  
  On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc...@gmail.com wrote: 
  Update: 
  
  I've just got an even simpler version without any complex numbers,
  using only Float64. The two loops are as small as the following LLVM-IR
  now, and there's only simple arithmetic in the loop body.
  
  ```llvm 
  L9.preheader: ; preds = %L12, 
  %L9.preheader.preheader 
%#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ] 
br label %L9 
  
  L9:   ; preds = %L9, 
 %L9.preheader 
%#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ] 
%44 = add i64 %#s4.0, 1 
%45 = add i64 %#s4.0, -1 
%46 = mul i64 %45, %10 
%47 = getelementptr double* %7, i64 %46 
%48 = load double* %47, align 8 
%49 = add i64 %46, 1 
%50 = getelementptr double* %7, i64 %49 
%51 = load double* %50, align 8 
%52 = fmul double %51, %3 
%53 = fmul double %38, %48 
%54 = fmul double %33, %52 
%55 = fadd double %53, %54 
store double %55, double* %50, align 8 
%56 = fmul double %38, %52 
%57 = fmul double %33, %48 
%58 = fsub double %56, %57 
store double %58, double* %47, align 8 
%59 = icmp eq i64 %#s4.0, %12 
br i1 %59, label %L12, label %L9 
  
  L12:  ; preds = %L9 
%60 = add i64 %#s3.0, 1 
%61 = icmp eq i64 %#s3.0, %42 
br i1 %61, label %L14.loopexit, label %L9.preheader 
  ``` 
  
  

Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-12 Thread Yichao Yu
Further update:

I made a c++ version[1] and see a similar effect (depending on
optimization levels) so it's not a julia issue (not that I think it
really was to begin with...).

The slowdown is present in the c++ version at all optimization
levels except Ofast and ffast-math. The julia version is faster than
the default performance for both gcc and clang, but is slower than the
fast case at higher optimization levels. For O2 and higher, the c++
version shows a ~100x slowdown for the slow case.

@fast_math in julia doesn't seem to have an effect for this although
it does for clang and gcc...

[1] 
https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp

On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc1...@gmail.com wrote:
 Update:

 I've just got an even simpler version without any complex numbers,
 using only Float64. The two loops are as small as the following LLVM-IR
 now, and there's only simple arithmetic in the loop body.

 ```llvm
 L9.preheader: ; preds = %L12,
 %L9.preheader.preheader
   %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
   br label %L9

 L9:   ; preds = %L9, %L9.preheader
   %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
   %44 = add i64 %#s4.0, 1
   %45 = add i64 %#s4.0, -1
   %46 = mul i64 %45, %10
   %47 = getelementptr double* %7, i64 %46
   %48 = load double* %47, align 8
   %49 = add i64 %46, 1
   %50 = getelementptr double* %7, i64 %49
   %51 = load double* %50, align 8
   %52 = fmul double %51, %3
   %53 = fmul double %38, %48
   %54 = fmul double %33, %52
   %55 = fadd double %53, %54
   store double %55, double* %50, align 8
   %56 = fmul double %38, %52
   %57 = fmul double %33, %48
   %58 = fsub double %56, %57
   store double %58, double* %47, align 8
   %59 = icmp eq i64 %#s4.0, %12
   br i1 %59, label %L12, label %L9

 L12:  ; preds = %L9
   %60 = add i64 %#s3.0, 1
   %61 = icmp eq i64 %#s3.0, %42
   br i1 %61, label %L14.loopexit, label %L9.preheader
 ```

 On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu yyc1...@gmail.com wrote:
 On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin.j.ow...@gmail.com wrote:
 I can't really help you debug the IR code, but I can at least say I'm seeing
 a similar thing. It starts to slow down around just after 0.5, and doesn't

 Thanks for the confirmation. At least I'm not crazy (or not the only
 one to be more precise :P )

 get back to where it was at 0.5 until 0.87. Can you compare the IR code when
 two different values are used, to see what's different? When I tried looking
 at the difference between 0.50 and 0.51, the biggest thing that popped out
 to me was that the numbers after !dbg were different.

 This is exactly the strange part. I don't think either llvm or julia
 is doing constant propagation here, and different input values should
 be using the same function. The evidence:

 1. Julia only specializes on types, not values (for now).
 2. The function is not inlined, which has to be the case when called
 from global scope and can be double checked by adding `@noinline`.
 It's also too big to be inline_worthy.
 3. No codegen happens except on the first call. This can be seen
 from the output of `@time`: only the first call has thousands of
 allocations; subsequent calls have fewer than 5 (on 0.4 at least). If
 julia could compile a function with fewer than 5 allocations, I'd be
 very happy.
 4. My original version also has much more complicated logic to compute
 the scaling factor (ok, it's just a function call, but with
 parameters gathered from different arguments). I'd be really surprised
 if either llvm or julia could reason anything about it.

 The difference in the debug info should just be an artifact of
 emitting it twice.


 Even 0.50001 is a lot slower:

 julia for eΓ in 0.5:0.1:0.50015
println(eΓ)
gc()
@time ψs = propagate(P, ψ0, ψs, eΓ)
end
 0.5
 elapsed time: 0.065609581 seconds (16 bytes allocated)
 0.50001
 elapsed time: 0.875806461 seconds (16 bytes allocated)

 This is actually interesting and I can confirm the same here. `0.5`
 takes 28ms while 0.5001 takes 320ms. I was wondering
 whether the cpu does something special depending on the bit pattern,
 but I still don't understand why it would be very bad for a certain
 range (and it's not only a function of this value either; it's also
 affected by all the other values)



 julia versioninfo()
 Julia Version 0.3.9
 Commit 31efe69 (2015-05-30 11:24 UTC)
 Platform Info:
   System: Darwin (x86_64-apple-darwin13.4.0)

 Good. so it's not Linux only.

   CPU: Intel(R) Core(TM)2 Duo CPU T8300  @ 2.40GHz
   WORD_SIZE: 64
   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
   LAPACK: libopenblas
   LIBM: libopenlibm
   LLVM: libLLVM-3.3




 On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:

 On Sun, Jul 12, 2015 at 7:40 PM, 

Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-12 Thread Yichao Yu
On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin.j.ow...@gmail.com wrote:
 I can't really help you debug the IR code, but I can at least say I'm seeing
 a similar thing. It starts to slow down around just after 0.5, and doesn't

Thanks for the confirmation. At least I'm not crazy (or not the only
one to be more precise :P )

 get back to where it was at 0.5 until 0.87. Can you compare the IR code when
 two different values are used, to see what's different? When I tried looking
 at the difference between 0.50 and 0.51, the biggest thing that popped out
 to me was that the numbers after !dbg were different.

This is exactly the strange part. I don't think either llvm or julia
is doing constant propagation here, and different input values should
be using the same function. The evidence:

1. Julia only specializes on types, not values (for now).
2. The function is not inlined, which has to be the case when called
from global scope and can be double checked by adding `@noinline`.
It's also too big to be inline_worthy.
3. No codegen happens except on the first call. This can be seen from
the output of `@time`: only the first call has thousands of
allocations; subsequent calls have fewer than 5 (on 0.4 at least). If
julia could compile a function with fewer than 5 allocations, I'd be
very happy.
4. My original version also has much more complicated logic to compute
the scaling factor (ok, it's just a function call, but with parameters
gathered from different arguments). I'd be really surprised if either
llvm or julia could reason anything about it.

The difference in the debug info should just be an artifact of
emitting it twice.


 Even 0.50001 is a lot slower:

 julia for eΓ in 0.5:0.1:0.50015
println(eΓ)
gc()
@time ψs = propagate(P, ψ0, ψs, eΓ)
end
 0.5
 elapsed time: 0.065609581 seconds (16 bytes allocated)
 0.50001
 elapsed time: 0.875806461 seconds (16 bytes allocated)

This is actually interesting and I can confirm the same here. `0.5`
takes 28ms while 0.5001 takes 320ms. I was wondering
whether the cpu does something special depending on the bit pattern,
but I still don't understand why it would be very bad for a certain
range (and it's not only a function of this value either; it's also
affected by all the other values)



 julia versioninfo()
 Julia Version 0.3.9
 Commit 31efe69 (2015-05-30 11:24 UTC)
 Platform Info:
   System: Darwin (x86_64-apple-darwin13.4.0)

Good, so it's not Linux-only.

   CPU: Intel(R) Core(TM)2 Duo CPU T8300  @ 2.40GHz
   WORD_SIZE: 64
   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
   LAPACK: libopenblas
   LIBM: libopenlibm
   LLVM: libLLVM-3.3




 On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:

 On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu yyc...@gmail.com wrote:
  P.S. Given how strange this problem is for me, I would appreciate if
  anyone can confirm either this is a real issue or I'm somehow being
  crazy or stupid.
 

 One additional strange property of this issue: I used to have
 much more costly operations in the (outer) loop (the one that iterates
 over nsteps with i), like Fourier transforms. However, when the
 scaling factor takes the bad value, it slows everything down (i.e. the
 Fourier transform is also ~10x slower).

 
 
  On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu yyc...@gmail.com wrote:
  Hi,
 
  I've just seen a very strange (for me) performance difference for
  exactly the same code on slightly different input with no explicit
  branches.
 
  The code is available here[1]. The most relavant part is the following
  function. (All other part of the code are for initialization and bench
  mark). This is a simplified version of my similation that compute the
  next array column in the array based on the previous one.
 
  The strange part is that the performance of this function can differ
  by 10x depend on the value of the scaling factor (`eΓ`, the only use
  of which is marked in the code below) even though I don't see any
  branches that depends on that value in the relavant code. (unless the
  cpu is 10x less efficient for certain input values)
 
  function propagate(P, ψ0, ψs, eΓ)
  @inbounds for i in 1:P.nele
  ψs[1, i, 1] = ψ0[1, i]
  ψs[2, i, 1] = ψ0[2, i]
  end
  T12 = im * sin(P.Ω)
  T11 = cos(P.Ω)
  @inbounds for i in 2:(P.nstep + 1)
  for j in 1:P.nele
  ψ_e = ψs[1, j, i - 1]
  ψ_g = ψs[2, j, i - 1] * eΓ #  Scaling factor
  ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
  ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
  end
  end
  ψs
  end
 
  The output of the full script is attached and it can be clearly seen
  that for scaling factor 0.6-0.8, the performance is 5-10 times slower
  than others.
 
  The assembly[2] and llvm[3] code of this function is also in the same
  repo. I see the same behavior on both 0.3 and 0.4 and with LLVM 3.3
  and LLVM 3.6 on two different x86_64 machine 

Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-12 Thread Yichao Yu
Update:

I've just got an even simpler version without any complex numbers,
using only Float64. The two loops are as small as the following LLVM-IR
now, and there's only simple arithmetic in the loop body.

```llvm
L9.preheader: ; preds = %L12,
%L9.preheader.preheader
  %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9

L9:   ; preds = %L9, %L9.preheader
  %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %#s4.0, 1
  %45 = add i64 %#s4.0, -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %#s4.0, %12
  br i1 %59, label %L12, label %L9

L12:  ; preds = %L9
  %60 = add i64 %#s3.0, 1
  %61 = icmp eq i64 %#s3.0, %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```

On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu yyc1...@gmail.com wrote:
 On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin.j.ow...@gmail.com wrote:
 I can't really help you debug the IR code, but I can at least say I'm seeing
 a similar thing. It starts to slow down around just after 0.5, and doesn't

 Thanks for the confirmation. At least I'm not crazy (or not the only
 one to be more precise :P )

 get back to where it was at 0.5 until 0.87. Can you compare the IR code when
 two different values are used, to see what's different? When I tried looking
 at the difference between 0.50 and 0.51, the biggest thing that popped out
 to me was that the numbers after !dbg were different.

 This is exactly the strange part. I don't think either llvm or julia
 is doing constant propagation here, and different input values should
 be using the same function. The evidence:

 1. Julia only specializes on types, not values (for now).
 2. The function is not inlined, which has to be the case when called
 from global scope and can be double checked by adding `@noinline`.
 It's also too big to be inline_worthy.
 3. No codegen happens except on the first call. This can be seen
 from the output of `@time`: only the first call has thousands of
 allocations; subsequent calls have fewer than 5 (on 0.4 at least). If
 julia could compile a function with fewer than 5 allocations, I'd be
 very happy.
 4. My original version also has much more complicated logic to compute
 the scaling factor (ok, it's just a function call, but with
 parameters gathered from different arguments). I'd be really surprised
 if either llvm or julia could reason anything about it.

 The difference in the debug info should just be an artifact of
 emitting it twice.


 Even 0.50001 is a lot slower:

 julia for eΓ in 0.5:0.1:0.50015
println(eΓ)
gc()
@time ψs = propagate(P, ψ0, ψs, eΓ)
end
 0.5
 elapsed time: 0.065609581 seconds (16 bytes allocated)
 0.50001
 elapsed time: 0.875806461 seconds (16 bytes allocated)

 This is actually interesting and I can confirm the same here. `0.5`
 takes 28ms while 0.5001 takes 320ms. I was wondering
 whether the cpu does something special depending on the bit pattern,
 but I still don't understand why it would be very bad for a certain
 range (and it's not only a function of this value either; it's also
 affected by all the other values)



 julia versioninfo()
 Julia Version 0.3.9
 Commit 31efe69 (2015-05-30 11:24 UTC)
 Platform Info:
   System: Darwin (x86_64-apple-darwin13.4.0)

 Good. so it's not Linux only.

   CPU: Intel(R) Core(TM)2 Duo CPU T8300  @ 2.40GHz
   WORD_SIZE: 64
   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
   LAPACK: libopenblas
   LIBM: libopenlibm
   LLVM: libLLVM-3.3




 On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:

 On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu yyc...@gmail.com wrote:
  P.S. Given how strange this problem is for me, I would appreciate if
  anyone can confirm either this is a real issue or I'm somehow being
  crazy or stupid.
 

 One additional strange property of this issue: I used to have
 much more costly operations in the (outer) loop (the one that iterates
 over nsteps with i), like Fourier transforms. However, when the
 scaling factor takes the bad value, it slows everything down (i.e. the
 Fourier transform is also ~10x slower).

 
 
  On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu yyc...@gmail.com wrote:
  Hi,
 
  I've just seen a very strange (for me) performance difference for
  exactly the same code on slightly different input with no explicit
  branches.
 
  The code is available here[1]. The most relevant part is the 

[julia-users] Re: Strange performance problem for array scaling

2015-07-12 Thread Kevin Owens
I can't really help you debug the IR code, but I can at least say I'm 
seeing a similar thing. It starts to slow down around just after 0.5, and 
doesn't get back to where it was at 0.5 until 0.87. Can you compare the IR 
code when two different values are used, to see what's different? When I 
tried looking at the difference between 0.50 and 0.51, the biggest thing 
that popped out to me was that the numbers after !dbg were different.

Even 0.50001 is a lot slower:

julia for eΓ in 0.5:0.1:0.50015
   println(eΓ)
   gc()
   @time ψs = propagate(P, ψ0, ψs, eΓ)
   end
0.5
elapsed time: 0.065609581 seconds (16 bytes allocated)
0.50001
elapsed time: 0.875806461 seconds (16 bytes allocated)


julia versioninfo()
Julia Version 0.3.9
Commit 31efe69 (2015-05-30 11:24 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU T8300  @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3




On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:

 On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu yyc...@gmail.com javascript: 
 wrote: 
  P.S. Given how strange this problem is for me, I would appreciate if 
  anyone can confirm either this is a real issue or I'm somehow being 
  crazy or stupid. 
  

 One additional strange property of this issue: I used to have
 much more costly operations in the (outer) loop (the one that iterates
 over nsteps with i), like Fourier transforms. However, when the
 scaling factor takes the bad value, it slows everything down (i.e. the
 Fourier transform is also ~10x slower).

  
  
  On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu yyc...@gmail.com 
 javascript: wrote: 
  Hi, 
  
  I've just seen a very strange (for me) performance difference for
  exactly the same code on slightly different input, with no explicit
  branches.

  The code is available here[1]. The most relevant part is the following
  function (all other parts of the code are for initialization and
  benchmarking). This is a simplified version of my simulation that
  computes the next column in the array based on the previous one.

  The strange part is that the performance of this function can differ
  by 10x depending on the value of the scaling factor (`eΓ`, the only
  use of which is marked in the code below), even though I don't see any
  branches that depend on that value in the relevant code (unless the
  cpu is 10x less efficient for certain input values)
  
  function propagate(P, ψ0, ψs, eΓ) 
  @inbounds for i in 1:P.nele 
  ψs[1, i, 1] = ψ0[1, i] 
  ψs[2, i, 1] = ψ0[2, i] 
  end 
  T12 = im * sin(P.Ω) 
  T11 = cos(P.Ω) 
  @inbounds for i in 2:(P.nstep + 1) 
  for j in 1:P.nele 
  ψ_e = ψs[1, j, i - 1] 
  ψ_g = ψs[2, j, i - 1] * eΓ #  Scaling factor 
  ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g 
  ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e 
  end 
  end 
  ψs 
  end 
  
  The output of the full script is attached and it can be clearly seen 
  that for scaling factor 0.6-0.8, the performance is 5-10 times slower 
  than others. 
  
  The assembly[2] and llvm[3] code of this function is also in the same 
  repo. I see the same behavior on both 0.3 and 0.4 and with LLVM 3.3 
  and LLVM 3.6 on two different x86_64 machine (my laptop and a linode 
  VPS) (the only platform I've tried that doesn't show similar behavior 
  is running julia 0.4 on qemu-arm... although the performance 
  between different values also differ by ~30% which is bigger than 
  noise) 
  
  This also seems to depend on the initial value. 
  
  Has anyone seen similar problems before? 
  
  Outputs: 
  
  325.821 milliseconds (25383 allocations: 1159 KB) 
  307.826 milliseconds (4 allocations: 144 bytes) 
  0.0 
   19.227 milliseconds (2 allocations: 48 bytes) 
  0.1 
   17.291 milliseconds (2 allocations: 48 bytes) 
  0.2 
   17.404 milliseconds (2 allocations: 48 bytes) 
  0.3 
   19.231 milliseconds (2 allocations: 48 bytes) 
  0.4 
   20.278 milliseconds (2 allocations: 48 bytes) 
  0.5 
   23.692 milliseconds (2 allocations: 48 bytes) 
  0.6 
  328.107 milliseconds (2 allocations: 48 bytes) 
  0.7 
  312.425 milliseconds (2 allocations: 48 bytes) 
  0.8 
  201.494 milliseconds (2 allocations: 48 bytes) 
  0.9 
   16.314 milliseconds (2 allocations: 48 bytes) 
  1.0 
   16.264 milliseconds (2 allocations: 48 bytes) 
  
  
  [1] 
 https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
  
  [2] 
 https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
  
  [2] 
 https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll
  



[julia-users] Re: Strange performance problem for array scaling

2015-07-12 Thread Yichao Yu
On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu yyc1...@gmail.com wrote:
 P.S. Given how strange this problem is for me, I would appreciate if
 anyone can confirm either this is a real issue or I'm somehow being
 crazy or stupid.


One additional strange property of this issue is that I used to have
much costy operations in the (outer) loop (the one that iterate over
nsteps with i) like Fourier transformations. However, when the scaling
factor is taking the bad value, it slows everything down (i.e. the
Fourier transformation is also slower by ~10x).



 On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu yyc1...@gmail.com wrote:
 Hi,

 I've just seen a very strange (for me) performance difference for
 exactly the same code on slightly different input, with no explicit
 branches.

 The code is available here[1]. The most relevant part is the following
 function (all other parts of the code are for initialization and
 benchmarking). This is a simplified version of my simulation that
 computes the next column in the array based on the previous one.

 The strange part is that the performance of this function can differ
 by 10x depending on the value of the scaling factor (`eΓ`, the only
 use of which is marked in the code below), even though I don't see any
 branches that depend on that value in the relevant code (unless the
 cpu is 10x less efficient for certain input values)

 function propagate(P, ψ0, ψs, eΓ)
 @inbounds for i in 1:P.nele
 ψs[1, i, 1] = ψ0[1, i]
 ψs[2, i, 1] = ψ0[2, i]
 end
 T12 = im * sin(P.Ω)
 T11 = cos(P.Ω)
 @inbounds for i in 2:(P.nstep + 1)
 for j in 1:P.nele
 ψ_e = ψs[1, j, i - 1]
 ψ_g = ψs[2, j, i - 1] * eΓ #  Scaling factor
 ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
 ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
 end
 end
 ψs
 end

 The output of the full script is attached and it can be clearly seen
 that for scaling factor 0.6-0.8, the performance is 5-10 times slower
 than others.

 The assembly[2] and llvm[3] code of this function are also in the same
 repo. I see the same behavior on both 0.3 and 0.4, with LLVM 3.3
 and LLVM 3.6, on two different x86_64 machines (my laptop and a linode
 VPS). (The only platform I've tried that doesn't show similar behavior
 is julia 0.4 on qemu-arm... although even there the performance
 between different values differs by ~30%, which is bigger than
 noise.)

 This also seems to depend on the initial value.

 Has anyone seen similar problems before?

 Outputs:

 325.821 milliseconds (25383 allocations: 1159 KB)
 307.826 milliseconds (4 allocations: 144 bytes)
 0.0
  19.227 milliseconds (2 allocations: 48 bytes)
 0.1
  17.291 milliseconds (2 allocations: 48 bytes)
 0.2
  17.404 milliseconds (2 allocations: 48 bytes)
 0.3
  19.231 milliseconds (2 allocations: 48 bytes)
 0.4
  20.278 milliseconds (2 allocations: 48 bytes)
 0.5
  23.692 milliseconds (2 allocations: 48 bytes)
 0.6
 328.107 milliseconds (2 allocations: 48 bytes)
 0.7
 312.425 milliseconds (2 allocations: 48 bytes)
 0.8
 201.494 milliseconds (2 allocations: 48 bytes)
 0.9
  16.314 milliseconds (2 allocations: 48 bytes)
 1.0
  16.264 milliseconds (2 allocations: 48 bytes)


 [1] 
 https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
 [2] 
 https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
 [2] 
 https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll


Re: [julia-users] Re: Strange performance problem for array scaling

2015-07-12 Thread Yichao Yu
On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu yyc1...@gmail.com wrote:
 Further update:

 I made a c++ version[1] and see a similar effect (depending on
 optimization levels) so it's not a julia issue (not that I think it
 really was to begin with...).

After investigating the c++ version more, I find that the difference
between the fast_math and the non-fast_math version is that the
compiler emits a function called `set_fast_math` (see below).

From what I can tell, the function sets bit 6 and bit 15 of the MXCSR
register (for SSE), and according to this page[1] these are the DAZ and FZ
bits (both related to underflow). It also describes denormals as taking
considerably longer to process. Since the operation I have keeps
decreasing the value, I guess it makes sense that the performance is
value dependent (and it kind of makes sense that fft also
suffers from these values)

So now the questions are:

1. How important are underflow and denormal values? Note that I'm not
catching underflow explicitly anyway, and I don't really care about
values that are really small compared to 1.

2. Is there a way to set up the SSE registers as the C
compilers do? @fastmath does not seem to be doing this.

05b0 set_fast_math:
 5b0:0f ae 5c 24 fc   stmxcsr -0x4(%rsp)
 5b5:81 4c 24 fc 40 80 00 orl$0x8040,-0x4(%rsp)
 5bc:00
 5bd:0f ae 54 24 fc   ldmxcsr -0x4(%rsp)
 5c2:c3   retq
 5c3:66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1)
 5ca:00 00 00
 5cd:0f 1f 00 nopl   (%rax)

[1] http://softpixel.com/~cwright/programming/simd/sse.php


 The slow down is present in the c++ version at all optimization
 levels except Ofast and ffast-math. The julia version is faster than
 the default performance for both gcc and clang, but is slower than the
 fast case at higher optimization levels. For O2 and higher, the c++

The slowness of the julia version seems to be due to multidimensional
arrays. Using a 1d array yields performance similar to C.

 version shows a ~100x slow down for the slow case.

 @fast_math in julia doesn't seem to have an effect for this although
 it does for clang and gcc...

 [1] 
 https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp

 On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc1...@gmail.com wrote:
 Update:

 I've just got an even simpler version without any complex numbers,
 using only Float64. The two loops are as small as the following LLVM-IR
 now, and there's only simple arithmetic in the loop body.

 ```llvm
 L9.preheader: ; preds = %L12,
 %L9.preheader.preheader
   %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
   br label %L9

 L9:   ; preds = %L9, 
 %L9.preheader
   %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
   %44 = add i64 %#s4.0, 1
   %45 = add i64 %#s4.0, -1
   %46 = mul i64 %45, %10
   %47 = getelementptr double* %7, i64 %46
   %48 = load double* %47, align 8
   %49 = add i64 %46, 1
   %50 = getelementptr double* %7, i64 %49
   %51 = load double* %50, align 8
   %52 = fmul double %51, %3
   %53 = fmul double %38, %48
   %54 = fmul double %33, %52
   %55 = fadd double %53, %54
   store double %55, double* %50, align 8
   %56 = fmul double %38, %52
   %57 = fmul double %33, %48
   %58 = fsub double %56, %57
   store double %58, double* %47, align 8
   %59 = icmp eq i64 %#s4.0, %12
   br i1 %59, label %L12, label %L9

 L12:  ; preds = %L9
   %60 = add i64 %#s3.0, 1
   %61 = icmp eq i64 %#s3.0, %42
   br i1 %61, label %L14.loopexit, label %L9.preheader
 ```

 On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu yyc1...@gmail.com wrote:
 On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin.j.ow...@gmail.com 
 wrote:
 I can't really help you debug the IR code, but I can at least say I'm 
 seeing
 a similar thing. It starts to slow down around just after 0.5, and doesn't

 Thanks for the confirmation. At least I'm not crazy (or not the only
 one to be more precise :P )

 get back to where it was at 0.5 until 0.87. Can you compare the IR code 
 when
 two different values are used, to see what's different? When I tried 
 looking
 at the difference between 0.50 and 0.51, the biggest thing that popped out
 to me was that the numbers after !dbg were different.

 This is exactly the strange part. I don't think either llvm or julia
 is doing constant propagation here, and different input values should
 be using the same function. The evidence:

 1. Julia only specializes on types, not values (for now).
 2. The function is not inlined, which has to be the case when called
 from global scope and can be double checked by adding `@noinline`.
 It's also too big to be inline_worthy.
 3. No codegen happens except on the first call. This can be seen
 from the output of `@time`: only the first call has thousands of
 allocations. Following