Re: [julia-users] Re: Article on `@simd`
There is a recent and ongoing discussion on the LLVM mailing list about exposing scatter and load operations as LLVM intrinsics: http://thread.gmane.org/gmane.comp.compilers.llvm.devel/79936 . - Valentin
Re: [julia-users] Re: Article on `@simd`
For all the vectorization fans out there, I stumbled across this LLVM blog post: http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html -Jacob
Re: [julia-users] Re: Article on `@simd`
This is great. Thanks, Jacob. -- John
[julia-users] Re: Article on `@simd`
Great news!
[julia-users] Re: Article on `@simd`
Update: The recent Julia 0.3.2 release supports vectorization of Float64.
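Concretely, a Float64 loop of the kind that can now vectorize might look like the following minimal sketch (the function name and data are illustrative, not taken from the article):

```julia
# A simple Float64 reduction. @simd tells the compiler it may reorder
# the additions, which lets it use SIMD instructions for the loop.
function simd_sum(x::Vector{Float64})
    s = 0.0
    @simd for i = 1:length(x)
        @inbounds s += x[i]
    end
    return s
end
```

The `@inbounds` annotation matters too: bounds checks inside the loop body would otherwise block vectorization.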
Re: [julia-users] Re: Article on `@simd`
This is great news! :) One question: why is tuple vectorization needed for fast 3D vector calculations?
Re: [julia-users] Re: Article on `@simd`
ispc has some amazing capabilities. With the stuff that Arch is doing, we are already incrementally getting there. Perhaps someone can also convince the ispc team to spend some of their time on Julia. :-) -viral
Re: [julia-users] Re: Article on `@simd`
On Wednesday, September 24, 2014 12:08:38 AM, Uwe Fechner wrote: One question: Why is tuple vectorization needed for fast 3D vector calculations? 3 doesn't fit naturally into the hardware vector width, so the way to vectorize 3D computations is with three tuples, one per coordinate. @Arch, I noticed another typo: in your second code example, `w` should become `a`. --Tim
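Tim's one-tuple-per-coordinate point can be sketched in plain Julia: store N 3D vectors as three coordinate arrays (structure-of-arrays), so each loop runs unit-stride at full hardware width. The names below are mine, not from the thread:

```julia
# Add N 3D vectors stored as three coordinate arrays (structure-of-arrays).
# Each coordinate stream is a unit-stride Float64 array that @simd can
# vectorize; a length-3 inner loop per vector could not fill the SIMD width.
function add3d!(cx, cy, cz, ax, ay, az, bx, by, bz)
    @simd for i = 1:length(ax)
        @inbounds begin
            cx[i] = ax[i] + bx[i]
            cy[i] = ay[i] + by[i]
            cz[i] = az[i] + bz[i]
        end
    end
    return cx, cy, cz
end
```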
Re: [julia-users] Re: Article on `@simd`
I've been thinking about this a bit, and as usual, Julia's multiple dispatch might make such a thing possible in a novel way. The heart of ISPC is allowing a function that looks like

    int addScalar(int a, int b) { return a + b; }

to effectively become

    vector<int> addVector(vector<int> a, vector<int> b) { return /* AVX version of */ a + b; }

This is what vectorizing compilers do, but they don't handle control flow like ISPC does. Also, ISPC's foreach and foreach_tiled allow these vectorized functions to be consumed more efficiently, for instance by handling the ragged/unaligned front and back of arrays with scalar versions, and the middle bits with vectorized functions. With support for hardware vectors in Julia, you can start to imagine writing macros that automatically generate the relevant functions, e.g. generating addVector from addScalar. However, to do anything cleverer than the (already extremely clever) LLVM vectorizer, you have to expose masking operations. To handle incoherent/divergent control flow, you issue vector operations that are masked, allowing some lanes of the vector to stop participating in the program for a period. In a contrived example,

    int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }

would be turned into something like:

    vector<int> addVector(vector<int> a, vector<int> b) {
        mask = all;                  // a register with all 1s: all lanes participate
        int mod = a % 2;             // vectorized, using mask
        mask = maskwhere(mod != 0);
        vector<int> result = a + b;  // vectorized, using mask
        mask = invert(mask);
        result = a - b;              // vectorized, using mask
        return result;
    }

If you look at it closely, you've got versions generated for each function that are:
- scalar
- vector-enabled, but for arbitrary-length vectors
- specialized for (one or more) hardware vector sizes
- specialized by alignment (as vector sizes get bigger, e.g. the 32- and 64-byte AVX versions coming out, you can't just rely on the runtime to align everything properly; it will be too wasteful)

So, I think it's a big ask, but I think it could be produced incrementally. We'd need help from the Julia language/standard library itself to expose masked vector operations. *Sebastian Good*
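Julia's closest present-day analogue to this masking is `ifelse`, which evaluates both arms and selects per element, much like the masked lanes described above. A sketch of the addScalar example in that style (my rewrite, not code from ISPC or from Sebastian):

```julia
# Branch-free version of: a % 2 != 0 ? a + b : a - b.
# Both arms are computed for every element and ifelse selects per lane,
# so the loop body is straight-line code the vectorizer can handle.
function addmod!(out, a, b)
    @simd for i = 1:length(a)
        @inbounds out[i] = ifelse(a[i] % 2 != 0, a[i] + b[i], a[i] - b[i])
    end
    return out
end
```

Unlike a true masked instruction, this trades extra arithmetic for the select, which is only a win when both arms are cheap and side-effect free.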
Re: [julia-users] Re: Article on `@simd`
... though I suspect to really profit from masked vectorization like this, it needs to be tackled at a much lower level in the compiler, likely even as an LLVM optimization pass, guided only by some hints from Julia itself. *Sebastian Good*
Re: [julia-users] Re: Article on `@simd`
You couldn't really preserve the semantics, as Julia is a much more dynamic language. ISPC can do what it does because the kernel language is fairly restrictive.
Re: [julia-users] Re: Article on `@simd`
This is an important part. One of the most important pieces of functionality in vectorizing compilers is explaining how and why they did or didn't vectorize your code. It can be terrifically complicated to figure out. With ISPC, the language is constrained such that everything can be vectorized, and so it's much easier to figure out. (Figuring out whether it was a good idea or not is left to the programmer!) *Sebastian Good*
Re: [julia-users] Re: Article on `@simd`
Based on this thread, I spent a few days playing around with a toy algorithm in Julia and C++ (trying OpenCL and OpenMP), and finally ISPC. My verdict? ISPC is nothing short of magical. While my code was easily parallelizable (working independently on each element of a large array), it was not readily vectorizable by the usual suspects (LLVM or g++) due to potential branch divergence. The inner loop contains several if statements and even a while loop. In practice these branches are almost never taken, but their presence seems to discourage vectorizers enough that they don't attempt a transformation. On my machine, reasonably optimized C++ (gcc 5.0) runs through a 440MB computation in about 380ms (Julia's not far behind). Taking the inner loop functions and compiling them in ispc was almost entirely a copy/paste exercise; some thought was required, but far less than with other approaches. My 8-wide AVX-enabled Intel CPU now runs the same benchmark in 140ms, or 2.7 times faster. I'm not a vector wizard, so perhaps it's possible to get much closer to the theoretical 8x speedup, but for minimal effort, unlocking the 2.7x left otherwise idle in the processor seems like a tremendous thing.

Implementing something ispc-like as Julia macros would not be simple. It's sufficiently different from scalar code as to require a different type system, differentiating between values which are inherently vectors (varying) and those which remain scalar (uniform). It has some new constructs (foreach, foreach_tiled, etc.), and if you want to take explicit advantage of the vectorized code, there is a large family of functions giving access to typical vector instructions (shuffle, rotate, scatter, etc.). I think if you want scalar code to be automatically vectorized, you just have to wait for the state of the art in LLVM to improve. But if you're willing to make what are often minor changes to your loop, I suspect Julia could help with a properly designed macro that applied ISPC-like transformations. It would be extremely powerful, but also expensive to build; this hypothetical cleverer @simd would be a very large compiler unto itself.

On Wednesday, September 17, 2014 8:58:11 PM UTC-4, Erik Schnetter wrote: On Wed, Sep 17, 2014 at 7:14 PM, gael@gmail.com wrote: Slightly OT, but since I won't talk about it myself, I don't feel this will harm the current thread... I don't know if it can be of any help/use/interest for any of you, but some people (some at Intel) are actively working on SIMD use with LLVM: https://ispc.github.io/index.html But I really don't have the skills to tell you if they just wrote a new C-like language that autovectorizes well or if they do even smarter stuff to get maximum performance. I think they are up to something clever. If I read things correctly, ispc adds new keywords that describe the memory layout (!) of data structures that are accessed via SIMD instructions. There exist a few commonly used data layout optimizations that are generally necessary to achieve good performance with SIMD code, called SOA or replicated or similar. Apparently, ispc introduces respective keywords that automatically transform the layout of data structures. I wonder whether something equivalent could be implemented via macros in Julia. These would be macros acting on type declarations, not on code. Presumably these would be array- or structure-like data types, and accessing them is then slightly more complex, so one would also need to automatically define respective iterators. Maybe there could be a companion macro that acts on loops, so that the loops are transformed (and simd'ized) the same way as the data types... -erik -- Erik Schnetter schn...@cct.lsu.edu http://www.perimeterinstitute.ca/personal/eschnetter/
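The AOS-to-SOA transform Erik describes can already be done by hand in Julia, which hints at what such a layout macro would automate. A hypothetical sketch (the `Point` type and function names are mine, not from the thread):

```julia
# Array-of-structs: fields interleaved in memory; strided loads hurt SIMD.
struct Point
    x::Float64
    y::Float64
end

# Manual struct-of-arrays transform: one contiguous array per field.
to_soa(ps) = ([p.x for p in ps], [p.y for p in ps])

# The hot loop then runs over unit-stride arrays that @simd can vectorize.
function norms!(out, xs, ys)
    @simd for i = 1:length(xs)
        @inbounds out[i] = sqrt(xs[i]^2 + ys[i]^2)
    end
    return out
end
```

A layout macro in the spirit of ispc would generate the SOA type, the transform, and matching iterators from the `Point` declaration, instead of the programmer writing them out.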
Re: [julia-users] Re: Article on `@simd`
Could this theoretical thing be approached incrementally? Meaning, here's a project and here are some intermediate results: now it's 1.5x faster, and now here's something better and it's 2.7x, all the while the goal is apparent but difficult. Or would it kind of be all-works-or-doesn't?
Re: [julia-users] Re: Article on `@simd`
Update on 64-bit support for vectorizing loops: the support just went into the Github sources. See https://github.com/JuliaLang/julia/pull/8452 . 3D vectors, though, are in need of tuple vectorization; see https://github.com/JuliaLang/julia/pull/6271 for the prototype. Unfortunately the prototype slowed down compilation too much to be enabled by default. But it's possible we might evolve a way to turn it on for specially marked regions of code, or speed up how fast it can reject uninteresting code.
Re: [julia-users] Re: Article on `@simd`
Just another tidbit I've noticed with regards to the scatter principle: it seems it only inhibits vectorization when used in the *assigning* matrix/vector, as opposed to the location where data is being fetched (so setindex!, not getindex). The following code on a sparse matrix *is able* to vectorize, despite the scatter/indirect indexing through the sparse matrix rows, for example:

    _nnz = nonzeros(data)
    _rows = data.rowval
    @inbounds for m = 1:M
        # dot product of data[:,m], centroids[:,m]
        tmp::Float32 = 0.0
        @simd for n = data.colptr[m]:(data.colptr[m+1]-1)
            tmp += _nnz[n] * centroids[_rows[n],k]  # indirection through sparse rows vector
        end
        # distance[column m, cluster k] = 1 - dot / (column norm * cluster norm)
        dist[m,k] = 1.0 - tmp / (data_sumsq[m] * centroidssum[k])
    end

Note that I had to explicitly hoist the variables `_nnz` and `_rows` in order for it to vectorize. -Jacob
Re: [julia-users] Re: Article on `@simd`
There are still three arguments to max in the last of those examples. Actually, it's not clear that you can make an equivalent expression with min and max. Functionally (with intended use) x[i] = max(a, min(b, x[i])) does the same thing as the earlier examples, but it expands to

    x[i] = ifelse(ifelse(b < x[i], b, x[i]) < a, a, ifelse(b < x[i], b, x[i]))

which should be hard for a compiler to optimize to the earlier examples, since they don't give the same result in the degenerate case of a > b. A closer correspondence is given by the clamp function, which is implemented as a nested ifelse in the same way as example 2 (although in the opposite order, so it also differs for a > b).
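Gunnar's point is easy to verify numerically: the max/min composition and the clamp-style nested ifelse agree when a ≤ b but diverge in the degenerate case a > b. A small illustrative check (function names are mine):

```julia
# max/min composition versus clamp's nested-ifelse evaluation order.
clamp_maxmin(x, a, b) = max(a, min(b, x))                      # max/min form
clamp_ifelse(x, a, b) = ifelse(x < a, a, ifelse(x > b, b, x))  # clamp-style order

clamp_maxmin(5, 0, 3)   # 3: agrees with clamp_ifelse when a <= b
clamp_maxmin(5, 4, 2)   # 4: but in the degenerate case a > b ...
clamp_ifelse(5, 4, 2)   # 2: ... the two forms disagree
```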
Re: [julia-users] Re: Article on `@simd`
Thanks for pointing out the problems, particularly the a > b issue. I've reworked that section.
Re: [julia-users] Re: Article on `@simd`
ISPC is not only an explicit vectorization language, but has some novel semantics, particularly for structures. Not only SOA vs. AOS, but the whole notion of uniform vs. varying fields of a structure is a new thing. A macro-based imitation might be plausible.
[julia-users] Re: Article on `@simd`
In the section The Loop Body Should Be Straight-Line Code, the first and second code examples look identical, with ifelse constructions. I assume the first one should use ? instead. Also, the third code example has a stray x[i]a argument to the max function. On Monday, 15 September 2014 at 23:39:20 UTC+2, Arch Robison wrote: I've posted an article on the @simd feature to https://software.intel.com/en-us/articles/vectorization-in-julia . @simd is an experimental feature http://julia.readthedocs.org/en/release-0.3/manual/performance-tips/#performance-annotations in Julia 0.3 that gives the compiler more latitude to vectorize loops. Corrections/suggestions appreciated. - Arch D. Robison, Intel Corporation
Re: [julia-users] Re: Article on `@simd`
Thanks. Now fixed.
Re: [julia-users] Re: Article on `@simd`
There is support in LLVM 3.5 for remarks from the vectorizer, such as "vectorization is not beneficial and is not explicitly forced." I didn't see any remarks that explain the why in more detail, though that seems possible to improve, since the vectorizer has debugging remarks that go into the why question (e.g. "LV: Not vectorizing: Cannot prove legality."). The hard part is coming up with messages that are understandable to non-experts and pertinent; having too many messages can bury the useful ones. I opened issue #8392 https://github.com/JuliaLang/julia/issues/8392 for the subject. On Wed, Sep 17, 2014 at 9:28 AM, Arch Robison arch.d.robi...@gmail.com wrote: Thanks. Now fixed.
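Until such remarks are surfaced in Julia, the practical way to check a loop is to inspect the generated LLVM IR directly: a vectorized loop shows vector types such as `<4 x float>` or `<8 x float>`. A minimal sketch (the function and its name are hypothetical):

```julia
# A textbook vectorizable loop: unit stride, no branches, no aliasing
# concerns once @inbounds removes the bounds checks.
function axpy!(a, x, y)
    @inbounds @simd for i = 1:length(x)
        y[i] += a * x[i]
    end
    return y
end

# Print the LLVM IR; look for vector types like <4 x float> in the
# loop body to confirm that vectorization happened.
@code_llvm axpy!(2.0f0, zeros(Float32, 8), zeros(Float32, 8))
```

If the IR shows only scalar `float` operations, the vectorizer bailed out, and one is back to guessing why, which is exactly what issue #8392 asks to improve.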
Re: [julia-users] Re: Article on `@simd`
Any idea when vectorization of 64-bit double values will be supported? (I work a lot with 3D double vectors; they could be calculated with one command on the Haswell CPUs.) On Wednesday, September 17, 2014 4:48:26 PM UTC+2, Arch Robison wrote: There is support in LLVM 3.5 for remarks from the vectorizer. I opened issue #8392 https://github.com/JuliaLang/julia/issues/8392 for the subject.
Re: [julia-users] Re: Article on `@simd`
I don't know. It may require moving Julia to a newer version of LLVM. The LLVM vectorizer has been undergoing rapid improvements since the LLVM 3.3 that Julia currently uses, but some other non-vectorization issues have kept Julia on LLVM 3.3 so far. Last time I looked at the issue of vectorizing Float64, LLVM was punting on using the vector instructions because its cost model indicated that they were costlier than the serial equivalent; the costs it was using for vector instructions seemed unreasonably high. My copy of Clang (the LLVM C compiler) based on LLVM trunk (the future LLVM 3.6) vectorizes 64-bit arithmetic just fine for C. So there is hope. On Wed, Sep 17, 2014 at 10:10 AM, Uwe Fechner uwe.fechner@gmail.com wrote: Any idea when vectorization of 64-bit double values will be supported?
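Once Float64 vectorization lands, Uwe's 3-D vector case should need no special code: a plain `@simd` reduction over flat `Float64` storage is exactly the unit-stride, straight-line pattern the LLVM vectorizer targets. A minimal sketch (the function name and layout are hypothetical):

```julia
# Sum of squared norms over 3-D points packed as a flat array
# [x1, y1, z1, x2, y2, z2, ...]. Summing the squares of all
# components equals summing the squared norms, and the loop is a
# unit-stride Float64 reduction with no branches.
function norms2(p::Vector{Float64})
    s = 0.0
    @inbounds @simd for i = 1:length(p)
        s += p[i] * p[i]
    end
    return s
end
```

The `@simd` annotation matters here: it licenses the compiler to reassociate the floating-point sum, which a reduction needs in order to run in vector lanes.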
Re: [julia-users] Re: Article on `@simd`
Slightly OT, but since I won't talk about it myself, I don't feel this will harm the current thread... I don't know if it can be of any help/use/interest to any of you, but some people (some at Intel) are actively working on SIMD use with LLVM: https://ispc.github.io/index.html . I really don't have the skills to tell you whether they just wrote a new C-like language that autovectorizes well or whether they do even smarter stuff to get maximum performance.
Re: [julia-users] Re: Article on `@simd`
On Wed, Sep 17, 2014 at 7:14 PM, gael.mc...@gmail.com wrote: Some people (some at Intel) are actively working on SIMD use with LLVM: https://ispc.github.io/index.html I think they are up to something clever. If I read things correctly: ispc adds new keywords that describe the memory layout (!) of data structures that are accessed via SIMD instructions. There exist a few commonly used data-layout optimizations that are generally necessary to achieve good performance with SIMD code, called SOA, replicated, or similar. Apparently, ispc introduces respective keywords that automatically transform the layout of data structures. I wonder whether something equivalent could be implemented via macros in Julia. These would be macros acting on type declarations, not on code. Presumably these would be array- or structure-like data types, and accessing them is then slightly more complex, so one would also need to automatically define respective iterators. Maybe there could be a companion macro that acts on loops, so that the loops are transformed (and simd'ized) the same way as the data types... -erik -- Erik Schnetter schnet...@cct.lsu.edu http://www.perimeterinstitute.ca/personal/eschnetter/
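A hand-written version of the layout transformation Erik describes might look like the following (written in current `struct` syntax; the 0.3-era keywords were `immutable` and `type`). A macro of the kind he proposes would generate the SOA type and its converters from the AOS declaration; all names here are hypothetical:

```julia
# AOS: the fields of each particle are interleaved in memory,
# so a loop over one field reads with stride 2.
struct ParticleAOS
    x::Float64
    y::Float64
end

# SOA: one contiguous array per field, so a loop over a single
# field reads unit-stride memory and vectorizes cleanly.
struct ParticlesSOA
    x::Vector{Float64}
    y::Vector{Float64}
end

# The transformation a layout macro would automate.
soa(ps::Vector{ParticleAOS}) =
    ParticlesSOA([p.x for p in ps], [p.y for p in ps])

# A kernel over the SOA layout: straight-line, unit-stride.
function shift_x!(ps::ParticlesSOA, dx)
    @inbounds @simd for i = 1:length(ps.x)
        ps.x[i] += dx
    end
    return ps
end
```

The companion loop macro Erik mentions would then let one write the kernel against the AOS-style field names while it rewrites the accesses to the SOA arrays.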