Re: [julia-users] Re: Strange performance problem for array scaling
On Mon, Jul 13, 2015 at 2:20 AM, Jeffrey Sarnoff jeffrey.sarn...@gmail.com wrote: [...]

Thank you very much for the references. Yes, I definitely believe every part of the IEEE floating point standard has its reason to be there, and I'm more wondering in which cases they are significant. OTOH, I also believe there are certain kinds of computation that do not care about them, which is why -ffast-math exists. From the MathWorks blog reference: "Double precision denormals are so tiny that they are rarely numerically significant, but single precision denormals can be in the range where they affect some otherwise unremarkable computations." For my computation, I'm currently using double, but I've already checked that switching to single precision still gives me enough precision. Based on this, can I say that I can ignore them if I use double precision, and may need to keep them if I switch to single precision? Using Float64 takes twice as long as using Float32, while keeping denormals seems to take 10x the time.

As for doing it in julia, I found @simonbyrne's mxcsr.jl[1]. However, I couldn't get it working without #11604[2]. Inline assembly in llvmcall is working on LLVM 3.6 though[3], in case it's useful for others.
[1] https://gist.github.com/simonbyrne/9c1e4704be46b66b1485
[2] https://github.com/JuliaLang/julia/pull/11604
[3] https://github.com/yuyichao/explore/blob/a47cef8c84ad3f43b18e0fd797dca9debccdd250/julia/array_prop/array_prop.jl#L3

On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
And for future reference, I found #789[1], which is not documented anywhere AFAICT (will probably file a doc issue...). It also supports runtime detection of CPU features, so it should be much more portable.

[1] https://github.com/JuliaLang/julia/pull/789
Re: [julia-users] Re: Strange performance problem for array scaling
Cleve Moler's discussion is not quite as contextually invariant as are William Kahan's and James Demmel's. In fact, the numerical analysis community has made an overwhelmingly strong case that, roughly speaking, one is substantively better situated having denormalized floating point values used whenever they may arise than saving those extra cycles while at the mercy of the lost smoothness that comes from shoving those values to zero. And this holds widely for floating-point-centered applications and libraries.

If the world were remade with each sunrise by fixed-bitwidth floating point computations, supporting denormals would be like making house calls with a few numerical vaccines to everyone who relies on those computations to inform expectations about non-trivial work with fixed-bitwidth floating point types. It does not wipe out all forms of numerical untowardness, and some will find the vaccines more prophylactic than others; still, the analogy holds. We vaccinate many babies against measles even though some would never have become exposed to that disease .. and for those who forgot why, not long ago the news was about a Disney vacation disease nexus and how far it spread -- then California changed its law to make it more difficult to opt out of childhood vaccination. Having denormals there when the values they cover arise brings a benefit that parallels the good in that law change. The larger environment gets better by growing stronger, and that can happen because something that had been bringing weakness (disease, or bad consequences from subtle numeric misadventures) no longer operates.

There is another way denormals have been shown to matter -- but the above ought to help you feel at ease deciding not to move your work from Float64 to Float32 just to avoid values that hover around the smaller magnitudes realizable with Float64s. That sounds like a headache, and you would not have changed the theory in a way that makes things work better (or at all). Recasting the solving or transforming approach at hand to work with integer values would move the work away from any cost and benefit that accompany denormals. Other than that, thank your favorite floating point microarchitect for giving you greater throughput with denormals than everyone had a few design cycles ago. I would like their presence without measurable cost .. just not enough to dislike their availability.

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
On Mon, Jul 13, 2015 at 11:39 AM, Jeffrey Sarnoff jeffrey.sarn...@gmail.com wrote:

Thanks for sharing your view about denormal values. I hope what I said didn't sound like I want to get rid of them completely (and if it did, I didn't mean it...). I haven't read the more detailed analyses of their impact, but I believe you that they are important in general.

For my specific application, I'm doing time propagation of a wavefunction (that can decay). For my purposes, there are many other sources of uncertainty, and I'm mainly interested in how the majority of the wavefunction behaves. Therefore, I don't really care about the actual value of anything smaller than 10^-10, but I do want it to run fast. Since this is a linear problem, I can also scale the values by a constant factor to make underflow less of a problem.

I have not looked at the specifics of what is going on ... [...]

On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
It is a fairer test working from a second copy of the data that has been prescaled: prescale(x::Float64) = x * 2.0^64

On Monday, July 13, 2015 at 12:51:33 PM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 12:04:32 PM UTC-4, Yichao Yu wrote: [...]

On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
I have not looked at the specifics of what is going on ... Dismissing denormals is particularly dicey when your functional data flow is generating many denormalized values. Do you know what is causing many values of very small magnitude to occur as you run this? Is the data holding them explicitly? If so, and you have access to preprocess the data, and you are sure that the software cannot accumulate, reciprocate, exponentiate, etc. them, clamp those values to zero and then use the data. Does the code operate as a denormalized value generator? If so, a small alteration to the order of operations may help.

On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
Staying with Float64, see if the runtime comes way down when you prescale the data using prescale(x) = x * 2.0^64. Guessing your values to be less than 10^15, and assuming the worst-case smallest magnitude, the base-10 exponent of your largest data value stays below 70; scaling by a constant is a good strategy when the largest of the data values is not large.

On Monday, July 13, 2015 at 12:04:32 PM UTC-4, Yichao Yu wrote: [...]

On Monday, July 13, 2015 at 9:45:59 AM UTC-4, Jeffrey Sarnoff wrote: [...]

On Monday, July 13, 2015 at 8:02:13 AM UTC-4, Yichao Yu wrote: [...]
Re: [julia-users] Re: Strange performance problem for array scaling
Denormals were made part of the IEEE Floating Point standard after some very careful numerical analysis showed that accommodating them would substantively improve the quality of floating point results, and this would lift the quality of all floating point work. Surprising as it may be, nonetheless you (and if not you today, then you tomorrow, or one of your neighbors today) really do care about those unusual, and often rarely observed, values. FYI:

William Kahan on the introduction of denormals to the standard: https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

and an early, important paper on this: Effects of Underflow on Solving Linear Systems - J. Demmel 1981: http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf

On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:

On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu yyc...@gmail.com wrote:

Further update: I made a c++ version[1] and see a similar effect (depending on optimization levels), so it's not a julia issue (not that I think it really was to begin with...). After investigating the c++ version more, I find that the difference between the fast_math and the non-fast_math versions is that the compiler emits a function called `set_fast_math` (see below). From what I can tell, the function sets bit 6 and bit 15 in the MXCSR register (for SSE), and according to this page[1] these are the DAZ and FZ bits (both related to underflow). It also describes denormals as taking considerably longer to process. Since the operation I have keeps decreasing the value, I guess it makes sense that there's value-dependent performance (and it kind of makes sense that fft also suffers from these values).

So now the questions are:
1. How important are underflow and denormal values? Note that I'm not catching underflow explicitly anyway, and I don't really care about values that are really small compared to 1.
2. Is there a way to set up the SSE registers as done by the C compilers? @fastmath does not seem to be doing this.

```
05b0 set_fast_math:
 5b0: 0f ae 5c 24 fc          stmxcsr -0x4(%rsp)
 5b5: 81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
 5bc: 00
 5bd: 0f ae 54 24 fc          ldmxcsr -0x4(%rsp)
 5c2: c3                      retq
 5c3: 66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 5ca: 00 00 00
 5cd: 0f 1f 00                nopl   (%rax)
```

[1] http://softpixel.com/~cwright/programming/simd/sse.php

The slowdown is present in the c++ version for all optimization levels except -Ofast and -ffast-math. The julia version is faster than the default performance for both gcc and clang, but is slower than the fast case at higher optimization levels. (The slowness of the julia version seems to be due to multidimensional arrays; using a 1d array yields similar performance to C.) For O2 and higher, the c++ version shows a ~100x slowdown for the slow case. @fast_math in julia doesn't seem to have an effect for this, although it does for clang and gcc...

[1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp

On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc...@gmail.com wrote:

Update: I've just got an even simpler version without any complex numbers that only has Float64. The two loops are as small as the following LLVM-IR now, and there's only simple arithmetic in the loop body.

```llvm
L9.preheader:                        ; preds = %L12, %L9.preheader.preheader
  %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9
L9:                                  ; preds = %L9, %L9.preheader
  %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %#s4.0, 1
  %45 = add i64 %#s4.0, -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %#s4.0, %12
  br i1 %59, label %L12, label %L9
L12:                                 ; preds = %L9
  %60 = add i64 %#s3.0, 1
  %61 = icmp eq i64 %#s3.0, %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```

On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu yyc...@gmail.com wrote:

On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin@gmail.com wrote: I can't really help you debug the IR code, but I can at least say I'm
Re: [julia-users] Re: Strange performance problem for array scaling
and this: Cleve Moler tries to see it your way Moler on floating point denormals http://blogs.mathworks.com/cleve/2014/07/21/floating-point-denormals-insignificant-but-controversial-2/ On Monday, July 13, 2015 at 2:11:22 AM UTC-4, Jeffrey Sarnoff wrote: Denormals were made part of the IEEE Floating Point standard after some very careful numerical analysis showed that accomodating them would substantively improve the quality of floating point results and this would lift the quality of all floating point work. Surprising it may be, nonetheless you (and if not you today, you tomorrow and one of your neighbors today) really do care about those unusual, and often rarely observed values. fyi William Kahan on the introduction of denormals to the standard https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html and an early, important paper on this Effects of Underflow on Solving Linear Systems - J.Demmel 1981 http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote: On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu yyc...@gmail.com wrote: Further update: I made a c++ version[1] and see a similar effect (depending on optimization levels) so it's not a julia issue (not that I think it really was to begin with...). After investigating the c++ version more, I find that the difference between the fast_math and the non-fast_math version is that the compiler emit a function called `set_fast_math` (see below). From what I can tell, the function sets bit 6 and bit 15 on the MXCSR register (for SSE) and according to this page[1] these are DAZ and FZ bits (both related to underflow). It also describe denormals as take considerably longer to process. Since the operation I have keeps decreasing the value, I guess it makes sense that there's a value dependent performance (and it kind of make sense that fft also suffers from these values) So now the question is: 1. How important are underflow and denormal values? 
Note that I'm not catching underflow explicitly anyway, and I don't really care about values that are really small compared to 1.

2. Is there a way to set up the SSE registers the way the c compilers do? @fastmath does not seem to be doing this.

```
05b0 <set_fast_math>:
 5b0:  0f ae 5c 24 fc           stmxcsr -0x4(%rsp)
 5b5:  81 4c 24 fc 40 80 00     orl    $0x8040,-0x4(%rsp)
 5bc:  00
 5bd:  0f ae 54 24 fc           ldmxcsr -0x4(%rsp)
 5c2:  c3                       retq
 5c3:  66 2e 0f 1f 84 00 00     nopw   %cs:0x0(%rax,%rax,1)
 5ca:  00 00 00
 5cd:  0f 1f 00                 nopl   (%rax)
```

[1] http://softpixel.com/~cwright/programming/simd/sse.php

The slowdown is present in the c++ version at all optimization levels except -Ofast and -ffast-math. The julia version is faster than the default c++ performance for both gcc and clang, but is slower than the fast case at higher optimization levels. (The slowness of the julia version seems to be due to multidimensional arrays; using a 1d array yields performance similar to C.) For -O2 and higher, the c++ version shows a ~100x slowdown for the slow case. @fastmath in julia doesn't seem to have an effect on this, although the equivalent does for clang and gcc...

[1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp

On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu yyc...@gmail.com wrote: Update: I've just got an even simpler version without any complex numbers, using only Float64. The two loops are as small as the following LLVM-IR now, and there's only simple arithmetic in the loop body.
```llvm
L9.preheader:                       ; preds = %L12, %L9.preheader.preheader
  %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9

L9:                                 ; preds = %L9, %L9.preheader
  %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %#s4.0, 1
  %45 = add i64 %#s4.0, -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %#s4.0, %12
  br i1 %59, label %L12, label %L9

L12:                                ; preds = %L9
  %60 = add i64 %#s3.0, 1
  %61 = icmp eq i64 %#s3.0, %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```
Re: [julia-users] Re: Strange performance problem for array scaling
Further update: I made a c++ version[1] and see a similar effect (depending on optimization levels), so it's not a julia issue (not that I think it really was to begin with...). The slowdown is present in the c++ version at all optimization levels except -Ofast and -ffast-math. The julia version is faster than the default c++ performance for both gcc and clang, but is slower than the fast case at higher optimization levels. (The slowness of the julia version seems to be due to multidimensional arrays; using a 1d array yields performance similar to C.) For -O2 and higher, the c++ version shows a ~100x slowdown for the slow case. @fastmath in julia doesn't seem to have an effect on this, although the equivalent does for clang and gcc...

[1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
Re: [julia-users] Re: Strange performance problem for array scaling
On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens kevin.j.ow...@gmail.com wrote:

> I can't really help you debug the IR code, but I can at least say I'm seeing a similar thing. It starts to slow down just after 0.5, and doesn't get back to where it was at 0.5 until 0.87.

Thanks for the confirmation. At least I'm not crazy (or not the only one, to be more precise :P).

> Can you compare the IR code when two different values are used, to see what's different? When I tried looking at the difference between 0.50 and 0.51, the biggest thing that popped out to me was that the numbers after !dbg were different.

This is exactly the strange part. I don't think either llvm or julia is doing constant propagation here, and different input values should be using the same function. The evidence:

1. Julia only specializes on types, not values (for now).
2. The function is not inlined, which has to be the case in global scope and can be double-checked by adding `@noinline`. It's also too big to be inline_worthy.
3. No codegen happens except on the first call. This can be seen from the output of `@time`: only the first call has thousands of allocations; following calls have fewer than 5 (on 0.4 at least). If julia could compile a function with fewer than 5 allocations, I'd be very happy.
4. My original version also has much more complicated logic to compute that scaling factor (ok, it's just a function call, but with parameters gathered from different arguments). I'd be really surprised if either llvm or julia could reason anything about it.

The difference in the debug info should just be an artifact of emitting it twice.

> Even 0.50001 is a lot slower:
>
> julia> for eΓ in 0.5:0.1:0.50015
>            println(eΓ)
>            gc()
>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>        end
> 0.5
> elapsed time: 0.065609581 seconds (16 bytes allocated)
> 0.50001
> elapsed time: 0.875806461 seconds (16 bytes allocated)

This is actually interesting, and I can confirm the same here: `0.5` takes 28ms while 0.5001 takes 320ms. I was wondering whether the cpu is doing something special depending on the bit pattern, but I still don't understand why it would be very bad for a certain range (and it is not only a function of this value either; it is also affected by all the other values).

> julia> versioninfo()
> Julia Version 0.3.9
> Commit 31efe69 (2015-05-30 11:24 UTC)
> Platform Info:
>   System: Darwin (x86_64-apple-darwin13.4.0)
>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>   LAPACK: libopenblas
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.3

Good, so it's not Linux-only.
Re: [julia-users] Re: Strange performance problem for array scaling
Update: I've just got an even simpler version without any complex numbers, using only Float64. The two loops are as small as the following LLVM-IR now, and there's only simple arithmetic in the loop body.

```llvm
L9.preheader:                       ; preds = %L12, %L9.preheader.preheader
  %#s3.0 = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9

L9:                                 ; preds = %L9, %L9.preheader
  %#s4.0 = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %#s4.0, 1
  %45 = add i64 %#s4.0, -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %#s4.0, %12
  br i1 %59, label %L12, label %L9

L12:                                ; preds = %L9
  %60 = add i64 %#s3.0, 1
  %61 = icmp eq i64 %#s3.0, %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```
[julia-users] Re: Strange performance problem for array scaling
I can't really help you debug the IR code, but I can at least say I'm seeing a similar thing. It starts to slow down just after 0.5, and doesn't get back to where it was at 0.5 until 0.87. Can you compare the IR code when two different values are used, to see what's different? When I tried looking at the difference between 0.50 and 0.51, the biggest thing that popped out to me was that the numbers after !dbg were different. Even 0.50001 is a lot slower:

```julia
julia> for eΓ in 0.5:0.1:0.50015
           println(eΓ)
           gc()
           @time ψs = propagate(P, ψ0, ψs, eΓ)
       end
0.5
elapsed time: 0.065609581 seconds (16 bytes allocated)
0.50001
elapsed time: 0.875806461 seconds (16 bytes allocated)
```

```
julia> versioninfo()
Julia Version 0.3.9
Commit 31efe69 (2015-05-30 11:24 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
```
[julia-users] Re: Strange performance problem for array scaling
On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu yyc1...@gmail.com wrote: P.S. Given how strange this problem is for me, I would appreciate it if anyone could confirm either that this is a real issue or that I'm somehow being crazy or stupid. One additional strange property of this issue: I used to have much costlier operations in the (outer) loop (the one that iterates over nsteps with i), like Fourier transforms. However, when the scaling factor takes the bad value, it slows everything down (i.e. the Fourier transform is also slower by ~10x).

On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu yyc1...@gmail.com wrote: Hi, I've just seen a very strange (for me) performance difference for exactly the same code on slightly different input, with no explicit branches. The code is available here[1]. The most relevant part is the following function (all other parts of the code are for initialization and benchmarking). This is a simplified version of my simulation that computes the next column in the array based on the previous one. The strange part is that the performance of this function can differ by 10x depending on the value of the scaling factor (`eΓ`, whose only use is marked in the code below), even though I don't see any branches that depend on that value in the relevant code (unless the cpu is 10x less efficient for certain input values).

```julia
function propagate(P, ψ0, ψs, eΓ)
    @inbounds for i in 1:P.nele
        ψs[1, i, 1] = ψ0[1, i]
        ψs[2, i, 1] = ψ0[2, i]
    end
    T12 = im * sin(P.Ω)
    T11 = cos(P.Ω)
    @inbounds for i in 2:(P.nstep + 1)
        for j in 1:P.nele
            ψ_e = ψs[1, j, i - 1]
            ψ_g = ψs[2, j, i - 1] * eΓ # Scaling factor
            ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
            ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
        end
    end
    ψs
end
```

The output of the full script is attached, and it can be clearly seen that for scaling factors 0.6-0.8 the performance is 5-10 times slower than for others. The assembly[2] and llvm[3] code of this function are also in the same repo. I see the same behavior on both 0.3 and 0.4, with LLVM 3.3 and LLVM 3.6, on two different x86_64 machines (my laptop and a linode VPS). (The only platform I've tried that doesn't show similar behavior is julia 0.4 on qemu-arm, although the performance between different values still differs by ~30%, which is bigger than noise.) This also seems to depend on the initial value. Has anyone seen similar problems before?

Outputs:

```
325.821 milliseconds (25383 allocations: 1159 KB)
307.826 milliseconds (4 allocations: 144 bytes)
0.0  19.227 milliseconds (2 allocations: 48 bytes)
0.1  17.291 milliseconds (2 allocations: 48 bytes)
0.2  17.404 milliseconds (2 allocations: 48 bytes)
0.3  19.231 milliseconds (2 allocations: 48 bytes)
0.4  20.278 milliseconds (2 allocations: 48 bytes)
0.5  23.692 milliseconds (2 allocations: 48 bytes)
0.6  328.107 milliseconds (2 allocations: 48 bytes)
0.7  312.425 milliseconds (2 allocations: 48 bytes)
0.8  201.494 milliseconds (2 allocations: 48 bytes)
0.9  16.314 milliseconds (2 allocations: 48 bytes)
1.0  16.264 milliseconds (2 allocations: 48 bytes)
```

[1] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
[2] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
[3] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll