Re: [julia-users] Re: Article on `@simd`

2014-12-23 Thread Valentin Churavy
There is a recent and ongoing discussion on the LLVM mailing list about exposing 
scatter and gather operations as LLVM 
intrinsics: http://thread.gmane.org/gmane.comp.compilers.llvm.devel/79936 . 

- Valentin

On Monday, 1 December 2014 17:43:11 UTC+1, John Myles White wrote:

 This is great. Thanks, Jacob.

  -- John

 On Dec 1, 2014, at 8:32 AM, Jacob Quinn quinn@gmail.com wrote:

 For all the vectorization fans out there, I stumbled across this LLVM blog 
 post: http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html

 -Jacob

 On Wed, Oct 29, 2014 at 3:48 AM, Uwe Fechner uwe.fec...@gmail.com wrote:

 Great news!

 On Tuesday, October 28, 2014 5:06:18 PM UTC+1, Arch Robison wrote:

 Update: The recent Julia 0.3.2 release supports vectorization of 
 Float64.





Re: [julia-users] Re: Article on `@simd`

2014-12-01 Thread Jacob Quinn
For all the vectorization fans out there, I stumbled across this LLVM blog
post: http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html

-Jacob

On Wed, Oct 29, 2014 at 3:48 AM, Uwe Fechner uwe.fechner@gmail.com
wrote:

 Great news!

 On Tuesday, October 28, 2014 5:06:18 PM UTC+1, Arch Robison wrote:

 Update: The recent Julia 0.3.2 release supports vectorization of Float64.




Re: [julia-users] Re: Article on `@simd`

2014-12-01 Thread John Myles White
This is great. Thanks, Jacob.

 -- John

On Dec 1, 2014, at 8:32 AM, Jacob Quinn quinn.jac...@gmail.com wrote:

 For all the vectorization fans out there, I stumbled across this LLVM blog 
 post: http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html
 
 -Jacob
 
 On Wed, Oct 29, 2014 at 3:48 AM, Uwe Fechner uwe.fechner@gmail.com 
 wrote:
 Great news!
 
 On Tuesday, October 28, 2014 5:06:18 PM UTC+1, Arch Robison wrote:
 Update: The recent Julia 0.3.2 release supports vectorization of Float64.
 



[julia-users] Re: Article on `@simd`

2014-10-29 Thread Uwe Fechner
Great news!

On Tuesday, October 28, 2014 5:06:18 PM UTC+1, Arch Robison wrote:

 Update: The recent Julia 0.3.2 release supports vectorization of Float64.



[julia-users] Re: Article on `@simd`

2014-10-28 Thread Arch Robison
Update: The recent Julia 0.3.2 release supports vectorization of Float64.
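[Editor's note: for illustration, here is a minimal sketch (not from the thread; syntax modernized for current Julia) of the kind of Float64 loop this update enables. The function name `axpy!` is made up for the example.]

```julia
# Minimal sketch: a Float64 loop of the shape @simd is designed to
# vectorize -- straight-line body, unit-stride array accesses.
function axpy!(a::Float64, x::Vector{Float64}, y::Vector{Float64})
    @inbounds @simd for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end
```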


Re: [julia-users] Re: Article on `@simd`

2014-09-24 Thread Uwe Fechner
This is great news! :)

One question: Why is tuple vectorization needed for fast 3D vector 
calculations?

On Wednesday, September 24, 2014 12:05:47 AM UTC+2, Arch Robison wrote:

 Update on 64-bit support for vectorizing loops: The support just went into 
 the Github sources.  See https://github.com/JuliaLang/julia/pull/8452 . 
  Though for 3D vectors, those are in need of tuple vectorization.  See 
 https://github.com/JuliaLang/julia/pull/6271 for the prototype. 
  Unfortunately the prototype slowed down compilation too much to be 
 enabled by default.  But it's possible we might evolve a way to turn it on 
 for specially marked regions of code, or speed up how fast it can reject 
 uninteresting code.

 On Wednesday, September 17, 2014 10:10:56 AM UTC-5, Uwe Fechner wrote:

 Any idea when the vectorization of 64-bit double values will be 
 supported? 

 (I work a lot with 3D double vectors; they could be calculated with one 
 instruction
 on Haswell CPUs.)

 On Wednesday, September 17, 2014 4:48:26 PM UTC+2, Arch Robison wrote:

 There is support in LLVM 3.5 for remarks from the vectorizer, such as 
 "vectorization is not beneficial and is not explicitly forced".  I didn't 
 see any remarks that explained the why in more detail, though that seems 
 possible to improve, since the vectorizer has debugging remarks that go into 
 the why question (e.g. "LV: Not vectorizing: Cannot prove legality."). 
 The hard part is coming up with messages that are understandable to 
 non-experts and pertinent.  Having too many messages can bury the useful 
 ones.

 I opened issue #8392 https://github.com/JuliaLang/julia/issues/8392 
 for the subject.

 On Wed, Sep 17, 2014 at 9:28 AM, Arch Robison arch.d@gmail.com 
 wrote:

 Thanks.  Now fixed.

 On Wed, Sep 17, 2014 at 4:14 AM, Gunnar Farnebäck 
 gun...@lysator.liu.se wrote:

 In the section "The Loop Body Should Be Straight-Line Code", the first 
 and second code examples look identical, with ifelse constructions. I assume 
 the first one should use ? instead. Also the third code example has a stray 
 x[i]>a argument to the max function.




Re: [julia-users] Re: Article on `@simd`

2014-09-24 Thread Viral Shah
ispc has some amazing capabilities. With the stuff that Arch is doing, we 
are already incrementally getting there. Perhaps someone can also convince 
the ispc team to spend some of their time on Julia. :-)

-viral

On Wednesday, September 24, 2014 12:22:49 AM UTC+5:30, Jeff Waller wrote:

 Could this theoretical thing be approached incrementally?  Meaning here's 
 a project and here's some intermediate results and now it's 1.5x faster, and 
 now here's something better and it's 2.7x, all the while the goal is apparent 
 but difficult.  

 Or would it kind of be all works or doesn't?



Re: [julia-users] Re: Article on `@simd`

2014-09-24 Thread Tim Holy
On Wednesday, September 24, 2014 12:08:38 AM Uwe Fechner wrote:
 One question: Why is tuple vectorization needed for fast 3D vector
 calculations?

3 doesn't fit naturally into the hardware width. So the way to vectorize 3D 
computations is with 3 tuples, one per coordinate.

@Arch, I noticed another typo: in your second code example, `w` should become 
`a`.

--Tim
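[Editor's note: to make the width mismatch concrete, here is a hypothetical structure-of-arrays sketch (an editorial example, not from the thread): storing the coordinates in three separate arrays lets each @simd loop stream over unit-stride data of arbitrary length, so the hardware vector width no longer has to divide 3.]

```julia
# Structure-of-arrays layout for N 3D vectors: one array per coordinate.
# Each statement below streams over a contiguous array, which is the
# layout @simd-style vectorization handles well.
function scale3d!(s::Float64, xs, ys, zs)
    @inbounds @simd for i in eachindex(xs, ys, zs)
        xs[i] *= s
        ys[i] *= s
        zs[i] *= s
    end
    return nothing
end
```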



Re: [julia-users] Re: Article on `@simd`

2014-09-24 Thread Sebastian Good
I've been thinking about this a bit, and as usual, Julia's multiple
dispatch might make such a thing possible in a novel way. The heart of ISPC
is allowing a function that looks like

int addScalar (int a, int b) { return a + b; }

effectively be

vector<int> addVector(vector<int> a, vector<int> b) { return /* AVX version
of */ a + b; }

This is what vectorizing compilers do, but they don't handle control flow
like ISPC does. Also, ISPC's foreach and foreach_tiled allow these
vectorized functions to be consumed more efficiently, for instance by
handling the ragged/unaligned front and back of arrays with scalar
versions, and the middle bits with vectorized functions.

With support for hardware vectors in Julia, you can start to imagine
writing macros that automatically generate the relevant functions, e.g.
generating AddVector from addScalar. However, to do anything cleverer than
the (already extremely clever) LLVM vectorizer, you have to expose masking
operations. To handle incoherent/divergent control flow, you issue vector
operations that are masked, allowing some lanes of the vector to stop
participating in the program for a period.  In a contrived example

int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }

would be turned into something like the below

vector<int> addVector(vector<int> a, vector<int> b) {
  mask = all; // a register with all 1s, indicating all lanes participate
  int mod = a % 2; // vectorized, using mask
  mask = maskwhere(mod != 0);
  vector<int> result = a + b; // vectorized, using mask
  mask = invert(mask);
  result = a - b; // vectorized, using mask
  return result;
}

If you look at it closely, you've got versions generated for each function
that are
- scalar
- vector-enabled, but for arbitrary length vectors
- specialized for (one or more hardware) vector sizes
- specialized by alignment (as vector sizes get bigger, e.g. the 32- and
64-byte AVX versions coming out, you can't just rely on the runtime to
align everything properly, it will be too wasteful)

So, I think it's a big ask, but I think it could be produced incrementally.
We'd need help from the Julia language/standard library itself to expose
masked vector operations.


*Sebastian Good*


On Tue, Sep 23, 2014 at 2:52 PM, Jeff Waller truth...@gmail.com wrote:

 Could this theoretical thing be approached incrementally?  Meaning here's
 a project and here's some intermediate results and now it's 1.5x faster, and
 now here's something better and it's 2.7x, all the while the goal is apparent
 but difficult.

 Or would it kind of be all works or doesn't?



Re: [julia-users] Re: Article on `@simd`

2014-09-24 Thread Sebastian Good
... though I suspect to really profit from masked vectorization like this,
it needs to be tackled at a much lower level in the compiler, likely even
as an LLVM optimization pass, guided only by some hints from Julia itself.

*Sebastian Good*


On Wed, Sep 24, 2014 at 10:16 AM, Sebastian Good 
sebast...@palladiumconsulting.com wrote:

 I've been thinking about this a bit, and as usual, Julia's multiple
 dispatch might make such a thing possible in a novel way. The heart of ISPC
 is allowing a function that looks like

 int addScalar (int a, int b) { return a + b; }

 effectively be

 vector<int> addVector(vector<int> a, vector<int> b) { return /* AVX
 version of */ a + b; }

 This is what vectorizing compilers do, but they don't handle control flow
 like ISPC does. Also, ISPC's foreach and foreach_tiled allow these
 vectorized functions to be consumed more efficiently, for instance by
 handling the ragged/unaligned front and back of arrays with scalar
 versions, and the middle bits with vectorized functions.

 With support for hardware vectors in Julia, you can start to imagine
 writing macros that automatically generate the relevant functions, e.g.
 generating AddVector from addScalar. However, to do anything cleverer than
 the (already extremely clever) LLVM vectorizer, you have to expose masking
 operations. To handle incoherent/divergent control flow, you issue vector
 operations that are masked, allowing some lanes of the vector to stop
 participating in the program for a period.  In a contrived example

 int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }

 would be turned into something like the below

 vector<int> addVector(vector<int> a, vector<int> b) {
   mask = all; // a register with all 1s, indicating all lanes participate
   int mod = a % 2; // vectorized, using mask
   mask = maskwhere(mod != 0);
   vector<int> result = a + b; // vectorized, using mask
   mask = invert(mask);
   result = a - b; // vectorized, using mask
   return result;
 }

 If you look at it closely, you've got versions generated for each function
 that are
 - scalar
 - vector-enabled, but for arbitrary length vectors
 - specialized for (one or more hardware) vector sizes
 - specialized by alignment (as vector sizes get bigger, e.g. the 32- and
 64-byte AVX versions coming out, you can't just rely on the runtime to
 align everything properly, it will be too wasteful)

 So, I think it's a big ask, but I think it could be produced
 incrementally. We'd need help from the Julia language/standard library
 itself to expose masked vector operations.


 *Sebastian Good*


 On Tue, Sep 23, 2014 at 2:52 PM, Jeff Waller truth...@gmail.com wrote:

 Could this theoretical thing be approached incrementally?  Meaning here's
 a project and here's some intermediate results and now it's 1.5x faster, and
 now here's something better and it's 2.7x, all the while the goal is apparent
 but difficult.

 Or would it kind of be all works or doesn't?





Re: [julia-users] Re: Article on `@simd`

2014-09-24 Thread Jake Bolewski
You couldn't really preserve the semantics as Julia is a much more dynamic 
language.  ISPC can do what it does because the kernel language is fairly 
restrictive.

On Wednesday, September 24, 2014 11:30:56 AM UTC-4, Sebastian Good wrote:

 ... though I suspect to really profit from masked vectorization like this, 
 it needs to be tackled at a much lower level in the compiler, likely even 
 as an LLVM optimization pass, guided only by some hints from Julia itself.

 *Sebastian Good*

  
 On Wed, Sep 24, 2014 at 10:16 AM, Sebastian Good 
 seba...@palladiumconsulting.com wrote:

 I've been thinking about this a bit, and as usual, Julia's multiple 
 dispatch might make such a thing possible in a novel way. The heart of ISPC 
 is allowing a function that looks like

 int addScalar (int a, int b) { return a + b; }

 effectively be

 vector<int> addVector(vector<int> a, vector<int> b) { return /* AVX 
 version of */ a + b; }

 This is what vectorizing compilers do, but they don't handle control flow 
 like ISPC does. Also, ISPC's foreach and foreach_tiled allow these 
 vectorized functions to be consumed more efficiently, for instance by 
 handling the ragged/unaligned front and back of arrays with scalar 
 versions, and the middle bits with vectorized functions.

 With support for hardware vectors in Julia, you can start to imagine 
 writing macros that automatically generate the relevant functions, e.g. 
 generating AddVector from addScalar. However, to do anything cleverer than 
 the (already extremely clever) LLVM vectorizer, you have to expose masking 
 operations. To handle incoherent/divergent control flow, you issue vector 
 operations that are masked, allowing some lanes of the vector to stop 
 participating in the program for a period.  In a contrived example

 int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }

 would be turned into something like the below

 vector<int> addVector(vector<int> a, vector<int> b) {
   mask = all; // a register with all 1s, indicating all lanes participate
   int mod = a % 2; // vectorized, using mask
   mask = maskwhere(mod != 0);
   vector<int> result = a + b; // vectorized, using mask
   mask = invert(mask);
   result = a - b; // vectorized, using mask
   return result;
 }

 If you look at it closely, you've got versions generated for each 
 function that are
 - scalar
 - vector-enabled, but for arbitrary length vectors
 - specialized for (one or more hardware) vector sizes
 - specialized by alignment (as vector sizes get bigger, e.g. the 32- and 
 64-byte AVX versions coming out, you can't just rely on the runtime to 
 align everything properly, it will be too wasteful)

 So, I think it's a big ask, but I think it could be produced 
 incrementally. We'd need help from the Julia language/standard library 
 itself to expose masked vector operations.  


 *Sebastian Good*

  
 On Tue, Sep 23, 2014 at 2:52 PM, Jeff Waller trut...@gmail.com wrote:

 Could this theoretical thing be approached incrementally?  Meaning 
 here's a project and here's some intermediate results and now it's 1.5x 
 faster, and now here's something better and it's 2.7x, all the while the goal 
 is apparent but difficult.  

 Or would it kind of be all works or doesn't?





Re: [julia-users] Re: Article on `@simd`

2014-09-24 Thread Sebastian Good
This is an important part. One of the most important pieces of
functionality in vectorizing compilers is explaining how and why they did
or didn't vectorize your code. It can be terrifically complicated to figure
out. With ISPC, the language is constrained such that everything can be
vectorized and so it's much easier to figure out. (Figuring out whether it
was a good idea or not is left to the programmer!)

*Sebastian Good*


On Wed, Sep 24, 2014 at 12:52 PM, Jake Bolewski jakebolew...@gmail.com
wrote:

 You couldn't really preserve the semantics as Julia is a much more dynamic
 language.  ISPC can do what it does because the kernel language is fairly
 restrictive.

 On Wednesday, September 24, 2014 11:30:56 AM UTC-4, Sebastian Good wrote:

 ... though I suspect to really profit from masked vectorization like
 this, it needs to be tackled at a much lower level in the compiler, likely
 even as an LLVM optimization pass, guided only by some hints from Julia
 itself.

 *Sebastian Good*


 On Wed, Sep 24, 2014 at 10:16 AM, Sebastian Good 
 seba...@palladiumconsulting.com wrote:

 I've been thinking about this a bit, and as usual, Julia's multiple
 dispatch might make such a thing possible in a novel way. The heart of ISPC
 is allowing a function that looks like

 int addScalar (int a, int b) { return a + b; }

 effectively be

 vector<int> addVector(vector<int> a, vector<int> b) { return /* AVX
 version of */ a + b; }

 This is what vectorizing compilers do, but they don't handle control
 flow like ISPC does. Also, ISPC's foreach and foreach_tiled allow these
 vectorized functions to be consumed more efficiently, for instance by
 handling the ragged/unaligned front and back of arrays with scalar
 versions, and the middle bits with vectorized functions.

 With support for hardware vectors in Julia, you can start to imagine
 writing macros that automatically generate the relevant functions, e.g.
 generating AddVector from addScalar. However, to do anything cleverer than
 the (already extremely clever) LLVM vectorizer, you have to expose masking
 operations. To handle incoherent/divergent control flow, you issue vector
 operations that are masked, allowing some lanes of the vector to stop
 participating in the program for a period.  In a contrived example

 int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }

 would be turned into something like the below

 vector<int> addVector(vector<int> a, vector<int> b) {
   mask = all; // a register with all 1s, indicating all lanes participate
   int mod = a % 2; // vectorized, using mask
   mask = maskwhere(mod != 0);
   vector<int> result = a + b; // vectorized, using mask
   mask = invert(mask);
   result = a - b; // vectorized, using mask
   return result;
 }

 If you look at it closely, you've got versions generated for each
 function that are
 - scalar
 - vector-enabled, but for arbitrary length vectors
 - specialized for (one or more hardware) vector sizes
 - specialized by alignment (as vector sizes get bigger, e.g. the 32- and
 64-byte AVX versions coming out, you can't just rely on the runtime to
 align everything properly, it will be too wasteful)

 So, I think it's a big ask, but I think it could be produced
 incrementally. We'd need help from the Julia language/standard library
 itself to expose masked vector operations.


 *Sebastian Good*


 On Tue, Sep 23, 2014 at 2:52 PM, Jeff Waller trut...@gmail.com wrote:

 Could this theoretical thing be approached incrementally?  Meaning
 here's a project and here's some intermediate results and now it's 1.5x
 faster, and now here's something better and it's 2.7x, all the while the goal
 is apparent but difficult.

 Or would it kind of be all works or doesn't?






Re: [julia-users] Re: Article on `@simd`

2014-09-23 Thread Sebastian Good
Based on this thread, I spent a few days playing around with a toy 
algorithm in Julia and C++ (trying OpenCL & OpenMP), and finally ISPC.

My verdict? ISPC is nothing short of magical. While my code was easily 
parallelizable (working independently on each element of a large array), it 
was not readily vectorizable by the usual suspects (LLVM or g++) due to 
potential branch divergence. The inner loop contains several if statements 
and even a while loop. In practice, these branches are almost never taken, 
but their presence seems to sufficiently discourage vectorizers such that 
they don't attempt a transformation.

On my machine, a reasonably optimized C++/gcc 5.0 runs through a 440MB 
computation in about 380ms. (Julia's not far behind). Taking the inner loop 
functions and compiling them in ispc was almost entirely a copy/paste 
exercise. Some thought was required but far less than other approaches. My 
8-wide AVX-enabled Intel CPU now runs the same benchmark in 140ms, or 2.7 
times faster. I'm not a vector wizard, so perhaps it's possible to get much 
closer to the theoretical 8x speedup, but for minimal effort, unlocking the 
2.7x left otherwise idle in the processor seems like a tremendous thing.

Implementing something ispc-like as Julia macros would not be simple. It's 
sufficiently different from scalar code as to require a different type 
system, differentiating between values which are inherently vectors 
(varying) and those which remain scalar (uniform). It has some new 
constructs (foreach, foreach_tiled, etc.). If you want to take explicit 
advantage of the vectorized code, there is also a large family of 
functions which give access to typical vector instructions (shuffle, 
rotate, scatter, etc.)

I think if you want scalar code to be automatically vectorized, then you 
just have to wait for the state of the art to improve in LLVM. But if you're 
willing to make what are often minor changes to your loop, I suspect 
Julia could help with a properly designed macro that applied ISPC-like 
transformations. It would be extremely powerful, but also expensive to 
build. This hypothetical cleverer @simd version would be a very large 
compiler unto itself.
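[Editor's note: one small piece of what such a macro might emit can already be written by hand today: divergent control flow rewritten as a branch-free masked select via ifelse, which the current @simd/LLVM pipeline can vectorize. This is an illustrative editorial sketch, not ispc output; the names are made up.]

```julia
# Masked-select version of `a % 2 != 0 ? a + b : a - b`: both arms are
# computed for every element and ifelse picks per element, so there is
# no branch in the loop body to block vectorization.
function addsub!(out, a, b)
    @inbounds @simd for i in eachindex(out, a, b)
        mask = isodd(a[i])                      # per-"lane" mask
        out[i] = ifelse(mask, a[i] + b[i], a[i] - b[i])
    end
    return out
end
```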



On Wednesday, September 17, 2014 8:58:11 PM UTC-4, Erik Schnetter wrote:

 On Wed, Sep 17, 2014 at 7:14 PM, gael@gmail.com 
 wrote: 
  Slightly OT, but since I won't talk about it myself I don't feel this 
 will harm the current thread ... 
  
  
  I don't know if it can be of any help/use/interest for any of you but 
 some people (some at Intel) are actively working on SIMD use with LLVM: 
  
  https://ispc.github.io/index.html 
  
  But I really don't have the skills to tell you if they just wrote a 
 new C-like language that is autovectorizing well or if they do some even 
 smarter stuff to get maximum performances. 

 I think they are up to something clever. 

 If I read things correctly: ispc adds new keywords that describe the 
 memory layout (!) of data structures that are accessed via SIMD 
 instructions. There exist a few commonly-used data layout 
 optimizations that are generally necessary to achieve good performance 
 with SIMD code, called "SOA" or "replicated" or similar. Apparently, 
 ispc introduces respective keywords that automatically transform the 
 layout of data structures. 

 I wonder whether something equivalent could be implemented via macros 
 in Julia. These would be macros acting on type declarations, not on 
 code. Presumably, these would be array- or structure-like data types, 
 and accessing them is then slightly more complex, so that one would 
 also need to automatically define respective iterators. Maybe there 
 could be a companion macro that acts on loops, so that the loops are 
 transformed (and simd'ized) the same way as the data types... 

 -erik 

 -- 
 Erik Schnetter schn...@cct.lsu.edu 
 http://www.perimeterinstitute.ca/personal/eschnetter/ 



Re: [julia-users] Re: Article on `@simd`

2014-09-23 Thread Jeff Waller
Could this theoretical thing be approached incrementally?  Meaning here's a 
project and here's some intermediate results and now it's 1.5x faster, and 
now here's something better and it's 2.7x, all the while the goal is apparent 
but difficult.  

Or would it kind of be all works or doesn't?


Re: [julia-users] Re: Article on `@simd`

2014-09-23 Thread Arch Robison
Update on 64-bit support for vectorizing loops: The support just went into 
the Github sources.  See https://github.com/JuliaLang/julia/pull/8452 . 
 Though for 3D vectors, those are in need of tuple vectorization.  See 
https://github.com/JuliaLang/julia/pull/6271 for the prototype. 
 Unfortunately the prototype slowed down compilation too much to be 
enabled by default.  But it's possible we might evolve a way to turn it on 
for specially marked regions of code, or speed up how fast it can reject 
uninteresting code.

On Wednesday, September 17, 2014 10:10:56 AM UTC-5, Uwe Fechner wrote:

 Any idea when the vectorization of 64-bit double values will be supported? 

 (I work a lot with 3D double vectors; they could be calculated with one 
 instruction
 on Haswell CPUs.)

 On Wednesday, September 17, 2014 4:48:26 PM UTC+2, Arch Robison wrote:

 There is support in LLVM 3.5 for remarks from the vectorizer, such as 
 "vectorization is not beneficial and is not explicitly forced".  I didn't 
 see any remarks that explained the why in more detail, though that seems 
 possible to improve, since the vectorizer has debugging remarks that go into 
 the why question (e.g. "LV: Not vectorizing: Cannot prove legality."). 
 The hard part is coming up with messages that are understandable to 
 non-experts and pertinent.  Having too many messages can bury the useful 
 ones.

 I opened issue #8392 https://github.com/JuliaLang/julia/issues/8392 
 for the subject.

 On Wed, Sep 17, 2014 at 9:28 AM, Arch Robison arch.d@gmail.com 
 wrote:

 Thanks.  Now fixed.

 On Wed, Sep 17, 2014 at 4:14 AM, Gunnar Farnebäck gun...@lysator.liu.se
  wrote:

 In the section "The Loop Body Should Be Straight-Line Code", the first 
 and second code examples look identical, with ifelse constructions. I assume 
 the first one should use ? instead. Also the third code example has a stray 
 x[i]>a argument to the max function.




Re: [julia-users] Re: Article on `@simd`

2014-09-22 Thread Jacob Quinn
Just another tidbit I've noticed with regard to the scatter principle:
it seems it only inhibits vectorization when used in the *assigning*
matrix/vector,
as opposed to the location where data is being fetched (so setindex!, not
getindex). The following code on a sparse matrix *is able* to vectorize,
despite the scatter/indirection indexing through the sparse matrix rows,
for example:

_nnz = nonzeros(data)
_rows = data.rowval
@inbounds for m = 1:M
    # dot product of data[:,m], centroids[:,m]
    tmp::Float32 = 0.0
    @simd for n = data.colptr[m]:(data.colptr[m+1]-1)
        tmp += _nnz[n] * centroids[_rows[n],k] # indirection through sparse rows vector
    end
    # distance[column m, cluster k] = 1 - dot / (column norm * cluster norm)
    dist[m,k] = 1.0 - tmp / (data_sumsq[m] * centroidssum[k])
end

Note that I had to explicitly introduce the variables `_nnz` and `_rows` in
order for it to vectorize.

-Jacob

On Thu, Sep 18, 2014 at 12:48 PM, Arch Robison arch.d.robi...@gmail.com
wrote:

 ISPC is not only an explicit vectorization language, but has some novel
 semantics, particularly for structures.  Not only SOA vs. AOS, but the
 whole notion of uniform vs. varying fields of a structure is a new
 thing.  A macro-based imitation might be plausible.

 On Wed, Sep 17, 2014 at 7:58 PM, Erik Schnetter schnet...@cct.lsu.edu
 wrote:

 On Wed, Sep 17, 2014 at 7:14 PM,  gael.mc...@gmail.com wrote:
  Slightly OT, but since I won't talk about it myself I don't feel this
 will harm the current thread ...
 
 
  I don't know if it can be of any help/use/interest for any of you but
 some people (some at Intel) are actively working on SIMD use with LLVM:
 
  https://ispc.github.io/index.html
 
  But I really don't have the skills to tell you if they just wrote a
 new C-like language that is autovectorizing well or if they do some even
 smarter stuff to get maximum performances.

 I think they are up to something clever.

 If I read things correctly: ispc adds new keywords that describe the
 memory layout (!) of data structures that are accessed via SIMD
 instructions. There exist a few commonly-used data layout
 optimizations that are generally necessary to achieve good performance
 with SIMD code, called "SOA" or "replicated" or similar. Apparently,
 ispc introduces respective keywords that automatically transform the
 layout of data structures.

 I wonder whether something equivalent could be implemented via macros
 in Julia. These would be macros acting on type declarations, not on
 code. Presumably, these would be array- or structure-like data types,
 and accessing them is then slightly more complex, so that one would
 also need to automatically define respective iterators. Maybe there
 could be a companion macro that acts on loops, so that the loops are
 transformed (and simd'ized) the same way as the data types...

 -erik

 --
 Erik Schnetter schnet...@cct.lsu.edu
 http://www.perimeterinstitute.ca/personal/eschnetter/





Re: [julia-users] Re: Article on `@simd`

2014-09-18 Thread Gunnar Farnebäck
There are still three arguments to max in the last of those examples. 
Actually it's not clear that you can make an equivalent expression with min 
and max. Functionally (with intended use)
x[i] = max(a, min(b, x[i]))
does the same thing as the earlier examples, but it expands to
x[i] = ifelse(ifelse(b < x[i], b, x[i]) < a, a, ifelse(b < x[i], b, x[i]))
which should be hard for a compiler to optimize to the earlier examples, 
since they don't give the same result in the degenerate case of a > b.

A closer correspondence is given by the clamp function, which is implemented 
as a nested ifelse in the same way as example 2 (although in the opposite 
order, so it also differs for a > b).
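[Editor's note: the expansion and the degenerate case can be checked directly. A small editorial sketch, writing max(a, min(b, x)) in the nested-ifelse form; the name `minmax_clamp` is made up.]

```julia
# max(a, min(b, x)) written out as nested ifelse:
# inner ifelse computes min(b, x); outer ifelse computes max(a, ...).
minmax_clamp(x, a, b) = ifelse(ifelse(b < x, b, x) < a, a, ifelse(b < x, b, x))
```

In the degenerate case a > b this always returns a, whereas Base's clamp (nested in the opposite order) returns b, so the two really do differ there.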

On Wednesday, 17 September 2014 at 16:28:45 UTC+2, Arch Robison wrote:

 Thanks.  Now fixed.

 On Wed, Sep 17, 2014 at 4:14 AM, Gunnar Farnebäck gun...@lysator.liu.se 
 wrote:

 In the section "The Loop Body Should Be Straight-Line Code", the first 
 and second code examples look identical, with ifelse constructions. I assume 
 the first one should use ? instead. Also the third code example has a stray 
 x[i]>a argument to the max function.



Re: [julia-users] Re: Article on `@simd`

2014-09-18 Thread Arch Robison
Thanks for pointing out the problems, particularly the ab issue.  I've
reworked that section.

On Thu, Sep 18, 2014 at 4:57 AM, Gunnar Farnebäck gun...@lysator.liu.se
wrote:

 There are still three arguments to max in the last of those examples.
 Actually it's not clear that you can make an equivalent expression with min
 and max. Functionally (with intended use)
 x[i] = max(a, min(b, x[i]))
 does the same thing as the earlier examples, but it expands to
 x[i] = ifelse(ifelse(b < x[i], b, x[i]) < a, a, ifelse(b < x[i], b, x[i]))
 which should be hard for a compiler to optimize to the earlier examples,
 since they don't give the same result in the degenerate case of a > b.

 A closer correspondence is given by the clamp function, which is
 implemented as a nested ifelse in the same way as example 2 (although in
 the opposite order, so it also differs for a > b).

 On Wednesday, 17 September 2014 at 16:28:45 UTC+2, Arch Robison wrote:

 Thanks.  Now fixed.

 On Wed, Sep 17, 2014 at 4:14 AM, Gunnar Farnebäck gun...@lysator.liu.se
 wrote:

 In the section "The Loop Body Should Be Straight-Line Code", the first
 and second code examples look identical, with ifelse constructions. I assume
 the first one should use ? instead. Also the third code example has a stray
 x[i]>a argument to the max function.




Re: [julia-users] Re: Article on `@simd`

2014-09-18 Thread Arch Robison
ISPC is not only an explicit vectorization language, but has some novel
semantics, particularly for structures.  Not only SOA vs. AOS, but the
whole notion of uniform vs. varying fields of a structure is a new
thing.  A macro-based imitation might be plausible.

On Wed, Sep 17, 2014 at 7:58 PM, Erik Schnetter schnet...@cct.lsu.edu
wrote:

 On Wed, Sep 17, 2014 at 7:14 PM,  gael.mc...@gmail.com wrote:
  Slightly OT, but since I won't talk about it myself I don't feel this
 will harm the current thread ...
 
 
  I don't know if it can be of any help/use/interest for any of you but
 some people (some at Intel) are actively working on SIMD use with LLVM:
 
  https://ispc.github.io/index.html
 
  But I really don't have the skills to tell you if they just wrote a
 new C-like language that is autovectorizing well or if they do some even
 smarter stuff to get maximum performances.

 I think they are up to something clever.

 If I read things correctly: ispc adds new keywords that describe the
 memory layout (!) of data structures that are accessed via SIMD
 instructions. There exist a few commonly-used data layout
 optimizations that are generally necessary to achieve good performance
 with SIMD code, called "SOA" or "replicated" or similar. Apparently,
 ispc introduces respective keywords that automatically transform the
 layout of data structures.

 I wonder whether something equivalent could be implemented via macros
 in Julia. These would be macros acting on type declarations, not on
 code. Presumably, these would be array- or structure-like data types,
 and accessing them is then slightly more complex, so that one would
 also need to automatically define respective iterators. Maybe there
 could be a companion macro that acts on loops, so that the loops are
 transformed (and simd'ized) the same way as the data types...

 -erik

 --
 Erik Schnetter schnet...@cct.lsu.edu
 http://www.perimeterinstitute.ca/personal/eschnetter/



[julia-users] Re: Article on `@simd`

2014-09-17 Thread Gunnar Farnebäck
In the section "The Loop Body Should Be Straight-Line Code", the first and
second code examples look identical, both with ifelse constructions. I assume
the first one should use ? instead. Also, the third code example has a stray
x[i]>a argument to the max function.
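
For readers following along, the distinction at issue can be sketched as
follows (a minimal illustration; clip_branch! and clip_straight! are made-up
names, not from the article):

```julia
# Branching form: the ternary operator introduces control flow into the
# loop body, which can block vectorization.
function clip_branch!(x, a)
    @simd for i = 1:length(x)
        @inbounds x[i] = x[i] > a ? x[i] : a
    end
    return x
end

# Straight-line form: ifelse evaluates both arms and selects one,
# keeping the loop body branch-free and SIMD-friendly.
function clip_straight!(x, a)
    @simd for i = 1:length(x)
        @inbounds x[i] = ifelse(x[i] > a, x[i], a)
    end
    return x
end
```

Both compute the same elementwise clip; only the second is guaranteed to
compile to straight-line code.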

On Monday, 15 September 2014 at 23:39:20 UTC+2, Arch Robison wrote:

 I've posted an article on the @simd feature to 
 https://software.intel.com/en-us/articles/vectorization-in-julia .   
 @simd is an experimental feature 
 http://julia.readthedocs.org/en/release-0.3/manual/performance-tips/#performance-annotations
  
 in Julia 0.3 that gives the compiler more latitude to vectorize loops.
 Corrections/suggestions appreciated.

 - Arch D. Robison
   Intel Corporation



Re: [julia-users] Re: Article on `@simd`

2014-09-17 Thread Arch Robison
Thanks.  Now fixed.

On Wed, Sep 17, 2014 at 4:14 AM, Gunnar Farnebäck gun...@lysator.liu.se
wrote:

 In the section "The Loop Body Should Be Straight-Line Code", the first and
 second code examples look identical, both with ifelse constructions. I assume
 the first one should use ? instead. Also, the third code example has a stray
 x[i]>a argument to the max function.



Re: [julia-users] Re: Article on `@simd`

2014-09-17 Thread Arch Robison
There is support in LLVM 3.5 for remarks from the vectorizer, such as
"vectorization is not beneficial and is not explicitly forced".  I didn't
see any remarks that explained the "why" in more detail, though that seems
possible to improve, since the vectorizer has debugging remarks that go into
the "why" question (e.g. "LV: Not vectorizing: Cannot prove legality.").
The hard part is coming up with messages that are understandable to
non-experts and pertinent.  Having too many messages can bury the useful
ones.

I opened issue #8392 https://github.com/JuliaLang/julia/issues/8392 for
the subject.
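
In the meantime, one way to check what the vectorizer did, absent friendly
remarks, is to inspect the generated IR directly (a sketch; sum_simd is an
illustrative name):

```julia
# A simple reduction that @simd permits the compiler to vectorize.
function sum_simd(x)
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i]
    end
    return s
end

# In the REPL, code_llvm(sum_simd, (Vector{Float32},)) prints the LLVM IR
# for that signature; a vectorized loop shows vector types such as
# <4 x float> in the output.
```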

On Wed, Sep 17, 2014 at 9:28 AM, Arch Robison arch.d.robi...@gmail.com
wrote:

 Thanks.  Now fixed.

 On Wed, Sep 17, 2014 at 4:14 AM, Gunnar Farnebäck gun...@lysator.liu.se
 wrote:

 In the section "The Loop Body Should Be Straight-Line Code", the first
 and second code examples look identical, both with ifelse constructions. I
 assume the first one should use ? instead. Also, the third code example has
 a stray x[i]>a argument to the max function.




Re: [julia-users] Re: Article on `@simd`

2014-09-17 Thread Uwe Fechner
Any idea when vectorization of 64-bit double values will be supported?

(I work a lot with 3D double vectors; they could be computed with a single
instruction on Haswell CPUs.)

On Wednesday, September 17, 2014 4:48:26 PM UTC+2, Arch Robison wrote:

 There is support in LLVM 3.5 for remarks from the vectorizer, such as
 "vectorization is not beneficial and is not explicitly forced".  I didn't
 see any remarks that explained the "why" in more detail, though that seems
 possible to improve, since the vectorizer has debugging remarks that go into
 the "why" question (e.g. "LV: Not vectorizing: Cannot prove legality.").
 The hard part is coming up with messages that are understandable to
 non-experts and pertinent.  Having too many messages can bury the useful
 ones.

 I opened issue #8392 https://github.com/JuliaLang/julia/issues/8392 for 
 the subject.

 On Wed, Sep 17, 2014 at 9:28 AM, Arch Robison arch.d@gmail.com 
 wrote:

 Thanks.  Now fixed.

 On Wed, Sep 17, 2014 at 4:14 AM, Gunnar Farnebäck gun...@lysator.liu.se 
 wrote:

 In the section "The Loop Body Should Be Straight-Line Code", the first
 and second code examples look identical, both with ifelse constructions. I
 assume the first one should use ? instead. Also, the third code example has
 a stray x[i]>a argument to the max function.




Re: [julia-users] Re: Article on `@simd`

2014-09-17 Thread Arch Robison
I don't know.  It may require moving Julia to a newer version of LLVM.  The
LLVM vectorizer has been undergoing rapid improvements since the LLVM 3.3
that Julia currently uses.  Some other non-vectorization issues have kept
Julia on LLVM 3.3 so far.

Last time I looked at the issue of vectorizing Float64, LLVM was punting on
using the vector instructions because its cost model indicated that the
instructions were costlier than the serial equivalent.  The costs that it
was using for vector instructions seemed unreasonably high.  My copy of
Clang (the LLVM C compiler) based on LLVM trunk (future LLVM 3.6)
vectorizes 64-bit arithmetic just fine for C.  So there is hope.
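
A minimal Float64 kernel of the kind at issue, which the older cost model
was declining to vectorize, might look like this (a sketch; axpy! is just an
illustrative name for the classic a*x + y update):

```julia
# Classic a*x + y update over Float64 arrays; with @simd, the compiler
# is permitted (but not obliged) to emit vector instructions.
function axpy!(y, a, x)
    @simd for i = 1:length(y)
        @inbounds y[i] += a * x[i]
    end
    return y
end
```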

On Wed, Sep 17, 2014 at 10:10 AM, Uwe Fechner uwe.fechner@gmail.com
wrote:

 Any idea when vectorization of 64-bit double values will be supported?

 (I work a lot with 3D double vectors; they could be computed with a single
 instruction on Haswell CPUs.)

 On Wednesday, September 17, 2014 4:48:26 PM UTC+2, Arch Robison wrote:

 There is support in LLVM 3.5 for remarks from the vectorizer, such as
 "vectorization is not beneficial and is not explicitly forced".  I didn't
 see any remarks that explained the "why" in more detail, though that seems
 possible to improve, since the vectorizer has debugging remarks that go into
 the "why" question (e.g. "LV: Not vectorizing: Cannot prove legality.").
 The hard part is coming up with messages that are understandable to
 non-experts and pertinent.  Having too many messages can bury the useful
 ones.

 I opened issue #8392 https://github.com/JuliaLang/julia/issues/8392
 for the subject.

 On Wed, Sep 17, 2014 at 9:28 AM, Arch Robison arch.d@gmail.com
 wrote:

 Thanks.  Now fixed.

 On Wed, Sep 17, 2014 at 4:14 AM, Gunnar Farnebäck gun...@lysator.liu.se
  wrote:

 In the section "The Loop Body Should Be Straight-Line Code", the first
 and second code examples look identical, both with ifelse constructions. I
 assume the first one should use ? instead. Also, the third code example has
 a stray x[i]>a argument to the max function.





Re: [julia-users] Re: Article on `@simd`

2014-09-17 Thread gael . mcdon
Slightly OT, but since I won't talk about it myself I don't feel this will harm 
the current thread ...


I don't know if it can be of any help/use/interest for any of you but some 
people (some at Intel) are actively working on SIMD use with LLVM:

https://ispc.github.io/index.html

But I really don't have the skills to tell you whether they just wrote a new
C-like language that autovectorizes well or whether they do even smarter
stuff to get maximum performance.


Re: [julia-users] Re: Article on `@simd`

2014-09-17 Thread Erik Schnetter
On Wed, Sep 17, 2014 at 7:14 PM,  gael.mc...@gmail.com wrote:
 Slightly OT, but since I won't talk about it myself I don't feel this will 
 harm the current thread ...


 I don't know if it can be of any help/use/interest for any of you but some 
 people (some at Intel) are actively working on SIMD use with LLVM:

 https://ispc.github.io/index.html

 But I really don't have the skills to tell you whether they just wrote a new
 C-like language that autovectorizes well or whether they do even smarter
 stuff to get maximum performance.

I think they are up to something clever.

If I read things correctly: ispc adds new keywords that describe the
memory layout (!) of data structures that are accessed via SIMD
instructions. There exist a few commonly used data-layout
optimizations that are generally necessary to achieve good performance
with SIMD code, called "SOA" or "replicated" or similar. Apparently,
ispc introduces respective keywords that automatically transform the
layout of data structures.
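
The two layouts can be illustrated without any new keywords (a sketch using
plain arrays; the names are illustrative, not an existing API):

```julia
# AoS-style layout: an array of (x, y, z) tuples; the fields of one
# point are adjacent, so the lanes of one field are strided in memory.
aos = [(1.0, 2.0, 2.0), (3.0, 4.0, 0.0)]

# SoA-style layout: one contiguous vector per field; a loop over points
# then maps each field onto unit-stride SIMD loads.
xs = Float64[p[1] for p in aos]
ys = Float64[p[2] for p in aos]
zs = Float64[p[3] for p in aos]

# Euclidean norms over the SoA layout; the @simd loop body touches only
# contiguous, unit-stride arrays.
function norms!(out, xs, ys, zs)
    @simd for i = 1:length(out)
        @inbounds out[i] = sqrt(xs[i]*xs[i] + ys[i]*ys[i] + zs[i]*zs[i])
    end
    return out
end
```

A macro of the kind described would automate exactly this tuple-of-arrays
transformation behind a structure-like interface.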

I wonder whether something equivalent could be implemented via macros
in Julia. These would be macros acting on type declarations, not on
code. Presumably, these would be array- or structure-like data types,
and accessing them is then slightly more complex, so that one would
also need to automatically define respective iterators. Maybe there
could be a companion macro that acts on loops, so that the loops are
transformed (and simd'ized) the same way as the data types...

-erik

-- 
Erik Schnetter schnet...@cct.lsu.edu
http://www.perimeterinstitute.ca/personal/eschnetter/