Re: N-body bench

2014-01-30 Thread Stanislav Blinov
On Wednesday, 29 January 2014 at 18:05:41 UTC, Stanislav Blinov 
wrote:


Yep, doesn't seem to be simd-related:

struct S(T) { T v1, v2; }

void main() {
    alias T = double; // integrals and float are ok :\
    version (workaround) {
        S!T[1] p = void;
    } else {
        S!T[1] p;
    }
}

Anyway, here's the revised (and bugfixed :o)) code, if anyone's 
interested:


http://dpaste.dzfl.pl/52d9e1fdc0fd

On my machine, dmd -release -O -inline -noboundscheck is only 6 
times slower than that C++ version :D


I'll try to get around to making it work with ldc on the weekend.


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

Ok, didn't need to wait for the weekend :)

Looks like both dmd and ldc don't optimize slice operations yet, 
had to revert to loops (shaved off ~1.5 seconds for ldc, ~9 
seconds for dmd). Also, my local pull of ldc had some issues with 
to!int(string), reverted that to atoi :)
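
To make the slice-versus-loop remark concrete, here is a minimal sketch of the two forms. The helper names and the three-element update are assumptions for illustration and are not taken from the linked paste; the timings above are the only evidence that the loop form was faster on those compiler builds.

```d
// Hypothetical helpers, not copied from the linked code.

// Array-operation (slice) form: concise, but reported above as the
// slower variant on the dmd and ldc builds used in this thread.
void stepSlices(ref double[3] x, ref const(double[3]) v, double dt)
{
    x[] += v[] * dt;
}

// Explicit loop over the statically known length.
void stepLoop(ref double[3] x, ref const(double[3]) v, double dt)
{
    foreach (i; 0 .. 3)
        x[i] += v[i] * dt;
}

void main()
{
    double[3] a = [1.0, 2.0, 3.0];
    double[3] b = a;
    const double[3] v = [0.5, -1.0, 2.0];

    stepSlices(a, v, 0.5);
    stepLoop(b, v, 0.5);
    assert(a == b); // both forms compute the same update
}
```

Both functions perform the same arithmetic, so switching between them only changes what the optimizer gets to work with.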


Here's the code:

http://dpaste.dzfl.pl/4b6df0771696

C++ version compiled with the provided flags.

dmd -release -O -inline -noboundscheck

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops

Here are the results on my machine (i3 2100 @3.1GHz):

time ./nbody-cpp 5000:
-0.169075164
-0.169059907
0:05.20 real, 5.18 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 5000:
-0.169075164
-0.169059907
0:07.84 real, 7.82 user, 0.00 sys, 1324 kb, 99% cpu

time ./nbody-dmd 5000:
-0.169075164
-0.169059907
0:23.35 real, 23.29 user, 0.00 sys, 1184 kb, 99% cpu



Re: N-body bench

2014-01-30 Thread Stanislav Blinov
On Thursday, 30 January 2014 at 14:17:16 UTC, Stanislav Blinov 
wrote:


Forgot one slice assignment in toDouble2(). Now the results are 
more interesting:


time ./nbody-cpp 5000:
-0.169075164
-0.169059907
0:05.20 real, 5.18 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 5000:
-0.169075164
-0.169059907
0:05.94 real, 5.92 user, 0.00 sys, 1320 kb, 99% cpu

time ./nbody-dmd 5000:
-0.169075164
-0.169059907
0:19.62 real, 19.57 user, 0.00 sys, 1188 kb, 99% cpu

:)


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:

Forgot one slice assignment in toDouble2(). Now the results are 
more interesting:


Is the latest link shown the last version?

I need the 0.13.0-alpha1 to compile the code.
I am seeing a significant performance difference between C++ and 
D-ldc2.


Bye,
bearophile


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:


You mean with your current version of ldc?


Yes. The older version of LDC2 doesn't even compile the code. I 
need to use 0.13.0-alpha1.


Your D code with small changes:
http://codepad.org/xqqScd42

Asm generated by G++ for the advance function (that is the one 
that uses most of the run time):

http://codepad.org/tApRNsVy

Asm generated by ldc2:
http://codepad.org/jKSJcOAZ

With N = 5_000_000 my timings on an old CPU are 2.23 seconds for 
ldc2 and 1.83 seconds for g++. So there's some performance 
difference.


I have tried to manually unroll the loop in the D code, but I see 
worse performance. I'll try some more later.


Bye,
bearophile


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

On Thursday, 30 January 2014 at 16:53:22 UTC, bearophile wrote:

Yes. The older version of LDC2 doesn't even compile the code. I 
need to use 0.13.0-alpha1.


Hmm.


Your D code with small changes:
http://codepad.org/xqqScd42


That won't compile with dmd (at least, with 2.064.2): it expects 
constants as initializers for vectors. :( That's why I rolled up 
that toDouble2() function.


With N = 5_000_000 my timings on an old CPU are 2.23 seconds 
for ldc2 and 1.83 seconds for g++. So there's some performance 
difference.


What about 50_000_000?



I have tried to manually unroll the loop in the D code, but I 
see worse performance. I'll try some more later.


I'm also fiddling :)


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:

That won't compile with dmd (at least, with 2.064.2): it 
expects constants as initializers for vectors. :( That's why I 
rolled up that toDouble2() function.


I see. Then probably I will have to put it back...


With N = 5_000_000 my timings on an old CPU are 2.23 seconds 
for ldc2 and 1.83 seconds for g++. So there's some performance 
difference.


What about 50_000_000?


First let me try to fiddle with the code some more :-)

Once done, this should go somewhere (like the wiki) as a simple 
example of SIMD usage in D.


Bye,
bearophile


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:

That won't compile with dmd (at least, with 2.064.2): it 
expects constants as initializers for vectors. :( That's why I 
rolled up that toDouble2() function.


Few more changes, but this version still lacks the toDouble2:
http://codepad.org/SpMprWym

Bye,
bearophile


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

On Thursday, 30 January 2014 at 18:29:42 UTC, bearophile wrote:

I see you're compiling with

ldmd2 -wi -O -release -inline -noboundscheck nbody.d

Try

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:

Looks like both dmd and ldc don't optimize slice operations 
yet, had to revert to loops


It's a very silly problem for a statically typed language. The D 
type system knows the static length of those arrays, but it 
doesn't use such information.
(Similarly, several algorithms in Phobos force you to throw away 
this very precious compile-time information by requiring dynamic 
arrays as input.)


I have just suggested a fix for ldc2:
http://forum.dlang.org/thread/qeytzeqnygxpocywy...@forum.dlang.org

I have had similar enhancement requests in Bugzilla for some time:
https://d.puremagic.com/issues/show_bug.cgi?id=10523
https://d.puremagic.com/issues/show_bug.cgi?id=10305

Bye,
bearophile


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

On Thursday, 30 January 2014 at 18:43:02 UTC, bearophile wrote:

It's a very silly problem for a statically typed language. The 
D type system knows the static length of those arrays, but it 
doesn't use such information.


I agree.


Unrolling everything except the loop in energy() seems to have 
squeezed the bits needed to outperform c++, at least on my machine 
:)
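
As a rough illustration of what unrolling over a statically known length can look like, here is a sketch. It assumes a compiler new enough to support `static foreach`, which is newer than the dmd and ldc versions discussed in this thread, and the helper name `addScaled` is hypothetical rather than taken from the linked code.

```d
// Hypothetical helper: dst[i] += k * src[i] for a fixed-size array.
// N is part of the type, so the body is expanded N times at compile
// time instead of being emitted as a runtime loop.
void addScaled(size_t N)(ref double[N] dst, ref const(double[N]) src, double k)
{
    static foreach (i; 0 .. N)
        dst[i] += k * src[i];
}

void main()
{
    double[3] v = [1.0, 2.0, 3.0];
    double[3] r = [0.5, 0.5, 0.5];

    addScaled(v, r, 2.0);
    assert(v == [2.0, 3.0, 4.0]);
}
```

Whether the unrolled form is actually faster depends on the compiler and flags, as the mixed results in this thread show.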


http://dpaste.dzfl.pl/45e98e476daf

(I'm sticking to atoi because my copy of ldc seems to have an 
issue in std.conv).


time ./nbody-cpp 5000:
-0.169075164
-0.169059907
0:05.15 real, 5.14 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 5000:
-0.169075164
-0.169059907
0:04.41 real, 4.40 user, 0.00 sys, 1308 kb, 99% cpu

time ./nbody-dmd 5000:
-0.169075164
-0.169059907
0:15.39 real, 15.34 user, 0.00 sys, 1192 kb, 99% cpu



Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:

Unrolling everything except the loop in energy() seems to have 
squeezed the bits needed to outperform c++, at least on my 
machine :)


That should be impossible, as I remember from my old profilings 
that energy() should use only an irrelevant amount of run time.




http://dpaste.dzfl.pl/45e98e476daf


While I benchmark some variants of this program I am seeing a 
large variety of problems, limitations, bugs and regressions.


Your latest D code crashes my ldc2 V.0.12.1, while 0.13.0-alpha1 
compiles it. My older version of your D code runs with both 
compiler versions, but V.0.12.1 generates faster code.


Plus you can't make those double2 immutable, you can't use vector 
ops (because of performance, and also because they aren't nothrow 
in V.0.12.1).


I was also experimenting with (note the align):

align(16) struct Body {
    double[3] x, v;
    double mass;
}

struct NBodySystem {
private:
    __gshared static Body[5] bodies = [
        // Sun.
        Body([0., 0., 0.],
             [0., 0., 0.],
             solarMass),
        ...

But this improves the code for V.0.12.1 and worsens it for 
0.13.0-alpha1.



Also I think the __gshared is ignored in V.0.12.1, but this bug 
could be fixed in more recent versions of ldc2.



(I'm sticking to atoi because my copy of ldc seems to have an 
issue in std.conv).


My version seems to use to!() correctly.

If ldc2 developers are reading this thread there is enough 
strange stuff here to give one or two headaches :-)


Now I don't know which final version of this program I should 
keep :-)


Bye,
bearophile


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:

ldc2 -release -O3 -disable-boundscheck -vectorize 
-vectorize-loops


All my versions of ldc2 don't even accept -vectorize :-)

ldc2: Unknown command line argument '-vectorize'.  Try: 'ldc2 
-help'

ldc2: Did you mean '-vectorize-slp'?

And -vectorize-loops should be active by default on recent 
versions of ldc2 (including V.0.12.1), and indeed I see no 
performance difference when using it.


Bye,
bearophile


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

On Thursday, 30 January 2014 at 21:04:06 UTC, bearophile wrote:

Stanislav Blinov:

Unrolling everything except the loop in energy() seems to have 
squeezed the bits needed to outperform c++, at least on my 
machine :)


That should be impossible, as I remember from my old profilings 
that energy() should use only an irrelevant amount of run time.


I meant that if I unroll it, it's not irrelevant anymore :)

While I benchmark some variants of this program I am seeing a 
large variety of problems, limitations, bugs and regressions...


:)

Your latest D code crashes my ldc2 V.0.12.1, while 0.13.0-alpha1 
compiles it.


:))

My older version of your D code runs with both compiler 
versions, but V.0.12.1 generates faster code.


:)))

Plus you can't make those double2 immutable, you can't use 
vector ops (because of performance, and also because they 
aren't nothrow in V.0.12.1).


Well, not being able to make them immutable is not *that* big of 
a problem now, is it? What would be actually cool to have are 
those slice operations.



I was also experimenting with (note the align):

align(16) struct Body {
    double[3] x, v;
    double mass;
}

struct NBodySystem {
private:
    __gshared static Body[5] bodies = [
        // Sun.
        Body([0., 0., 0.],
             [0., 0., 0.],
             solarMass),


Yeah... I've even thrown away that filler in the latest version 
:o)


But this improves the code for V.0.12.1 and worsens it for 
0.13.0-alpha1.


%|

(I'm sticking to atoi because my copy of ldc seems to have an 
issue in std.conv).


My version seems to use to!() correctly.


I'm using the git head (704ab3, last commit Sun Jan 26 00:00:21). 
I haven't tried the release yet.


If ldc2 developers are reading this thread there is enough 
strange stuff here to give one or two headaches :-)


Indeed.

Now I don't know which final version of this program I should 
keep :-)


I was going to compare the asm listings, but C++ seems to have 
unrolled and inlined the outer loop right inside main(), and now 
I'm slightly lost in it :)


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:


I meant that if I unroll it, it's not irrelevant anymore :)


If a function takes no time to run, and you tweak it, your 
program is not supposed to go faster.
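
That argument is Amdahl's law. A small sketch with made-up numbers (the fractions and the factor of 2 below are hypothetical, not measurements from this thread) shows how little a tweak to a cheap function can buy:

```d
import std.stdio : writefln;

// Amdahl's law: overall speedup when a fraction `p` of the run time
// is made `s` times faster and the remaining time is unchanged.
double overallSpeedup(double p, double s) pure nothrow @nogc @safe
{
    return 1.0 / ((1.0 - p) + p / s);
}

void main()
{
    // Hypothetical figures for illustration only.
    writefln("%.4f", overallSpeedup(0.01, 2.0)); // function takes 1% of the time
    writefln("%.4f", overallSpeedup(0.90, 2.0)); // function takes 90% of the time
}
```

With a 1% share the whole program gains about half a percent, so a large measured gain has to come from somewhere else.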



I was going to compare the asm listings, but C++ seems to have 
unrolled and inlined the outer loop right inside main(), and 
now I'm slightly lost in it :)


Try using -fkeep-inline-functions.

Bye,
bearophile


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

On Thursday, 30 January 2014 at 21:33:38 UTC, bearophile wrote:

If a function takes no time to run, and you tweak it, your 
program is not supposed to go faster.


Right.

I was going to compare the asm listings, but C++ seems to have 
unrolled and inlined the outer loop right inside main(), and 
now I'm slightly lost in it :)


Try using -fkeep-inline-functions.


Thanks.

G++:
http://codepad.org/oOZQw1VQ

LDC:
http://codepad.org/5nHoZL1k


LDC basically generated something that I can only call one 
straight *whsh*... This reminds me of Andrei's talk at (last 
year's?) GoingNative (more instructions is not always slower 
code).


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:


G++:
http://codepad.org/oOZQw1VQ

LDC:
http://codepad.org/5nHoZL1k


You seem to have a quite recent CPU, as the G++ code contains 
instructions like vmovsd. So you can try to do the same with 
ldc2, and use AVX or AVX2.


There are the switches:

-march=<string>          - Architecture to generate code for:
-mattr=<a1,+a2,-a3,...>  - Target specific attributes (-mattr=help for details)
-mcpu=<cpu-name>         - Target a specific cpu type (-mcpu=help for details)



LDC basically generated something that I can only call one 
straight *whsh*...


:-)

Bye,
bearophile


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

On Thursday, 30 January 2014 at 21:54:17 UTC, bearophile wrote:


You seem to have a quite recent CPU,


An aging i3?

as the G++ code contains instructions like vmovsd. So you can 
try to do the same with ldc2, and use AVX or AVX2.


Hmm...


This is getting a bit silly now. I must have some compile 
switches for g++ wrong:


g++ -Ofast -fkeep-inline-functions -fomit-frame-pointer 
-march=native -mfpmath=sse -mavx -mssse3 -flto --std=c++11 
-fopenmp nbody.cpp -o nbody-cpp


time ./nbody-cpp 5000:
-0.169075164
-0.169059907
0:05.09 real, 5.07 user, 0.00 sys, 1140 kb, 99% cpu

ldc2 -release -O3 -disable-boundscheck -vectorize 
-vectorize-loops -ofnbody-ldc -mattr=+avx,+ssse3 nbody.d


time ./nbody-ldc 5000:
-0.169075164
-0.169059907
0:04.02 real, 4.01 user, 0.00 sys, 1304 kb, 99% cpu


Re: N-body bench

2014-01-30 Thread bearophile

Stanislav Blinov:


An aging i3?


My CPU is older, it doesn't support AVX2 and AVX.


This is getting a bit silly now. I must have some compile 
switches for g++ wrong:


g++ -Ofast -fkeep-inline-functions -fomit-frame-pointer 
-march=native -mfpmath=sse -mavx -mssse3 -flto --std=c++11 
-fopenmp nbody.cpp -o nbody-cpp


time ./nbody-cpp 5000:
-0.169075164
-0.169059907
0:05.09 real, 5.07 user, 0.00 sys, 1140 kb, 99% cpu

ldc2 -release -O3 -disable-boundscheck -vectorize 
-vectorize-loops -ofnbody-ldc -mattr=+avx,+ssse3 nbody.d


time ./nbody-ldc 5000:
-0.169075164
-0.169059907
0:04.02 real, 4.01 user, 0.00 sys, 1304 kb, 99% cpu


Now the ldc2-compiled binary runs in 4 seconds; this sounds 
correct. If you have paid for a CPU with AVX2 or AVX, it's right 
to use that :-)


Bye,
bearophile


Re: N-body bench

2014-01-30 Thread bearophile
Since my post someone has added a Fortran version based on the 
algorithm used in the C++11 code. It's a little faster than the 
C++11 code and it's much nicer looking:

http://benchmarksgame.alioth.debian.org/u32/program.php?test=nbody&lang=ifc&id=5


pure subroutine advance(tstep, x, v, mass)
  real*8, intent(in) :: tstep
  real*8, dimension(4,nb), intent(inout) :: x, v
  real*8, dimension(nb), intent(in) :: mass
  real*8 :: r(4,N), mag(N)

  real*8 :: distance, d2
  integer :: i, j, m
  m = 1
  do i = 1, nb
    do j = i + 1, nb
      r(1,m) = x(1,i) - x(1,j)
      r(2,m) = x(2,i) - x(2,j)
      r(3,m) = x(3,i) - x(3,j)
      m = m + 1
    end do
  end do

  do m = 1, N
    d2 = r(1,m)**2 + r(2,m)**2 + r(3,m)**2
    distance = 1/sqrt(real(d2))
    distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
    !distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)

    mag(m) = tstep * distance**3
  end do

  m = 1
  do i = 1, nb
    do j = i + 1, nb
      v(1,i) = v(1,i) - r(1,m) * mass(j) * mag(m)
      v(2,i) = v(2,i) - r(2,m) * mass(j) * mag(m)
      v(3,i) = v(3,i) - r(3,m) * mass(j) * mag(m)

      v(1,j) = v(1,j) + r(1,m) * mass(i) * mag(m)
      v(2,j) = v(2,j) + r(2,m) * mass(i) * mag(m)
      v(3,j) = v(3,j) + r(3,m) * mass(i) * mag(m)

      m = m + 1
    end do
  end do

  do i = 1, nb
    x(1,i) = x(1,i) + tstep * v(1,i)
    x(2,i) = x(2,i) + tstep * v(2,i)
    x(3,i) = x(3,i) + tstep * v(3,i)
  end do
end subroutine advance
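
The middle loop of the Fortran code takes a single-precision estimate of 1/sqrt(d2) and refines it with one Newton-Raphson step (a second step is commented out). A minimal D sketch of just that step follows; the function name `refinedInvSqrt` is hypothetical and the code is a scalar illustration, not a translation of the benchmark entry.

```d
import std.math : abs, sqrt;

// Reciprocal square root: float-precision estimate plus one
// Newton-Raphson refinement, mirroring the two `distance = ...`
// lines in the Fortran loop above.
double refinedInvSqrt(double d2) pure nothrow @nogc @safe
{
    double est = 1.0 / sqrt(cast(float) d2); // single-precision estimate
    est *= 1.5 - 0.5 * d2 * est * est;       // one Newton-Raphson step
    return est;
}

void main()
{
    // The refined value agrees closely with the full double-precision result.
    assert(abs(refinedInvSqrt(2.0) - 1.0 / sqrt(2.0)) < 1e-12);
}
```

Each Newton-Raphson step roughly squares the relative error, which is why one step after a float-precision estimate is already close to double precision.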


Bye,
bearophile


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

On Thursday, 30 January 2014 at 22:45:45 UTC, bearophile wrote:
Since my post someone has added a Fortran version based on the 
algorithm used in the C++11 code. It's a little faster than the 
C++11 code and it's much nicer looking:


Yup, I saw it. They're cheating, they almost don't have to 
explicitly handle any SSE business :o) I'm wondering how our 
little code could perform on that machine.


It looks nice too, by the way:

http://dpaste.dzfl.pl/a81a475bbcf6

I've rearranged some bits, brought back to!int (turned out there 
weren't any issues, it's just that ldc generated errors regarding 
to! when there were other compiler errors %\), replaced 
TypeTuples with your Iota... the works :)


Re: N-body bench

2014-01-30 Thread Stanislav Blinov

Gah! G'Kar moment...

http://dpaste.dzfl.pl/203d237d7413



Re: N-body bench

2014-01-29 Thread Stanislav Blinov

On Friday, 24 January 2014 at 15:56:26 UTC, bearophile wrote:
If someone is willing to test LDC2 with a known benchmark, 
there's this one:


http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

A reformatted C++11 version, good as a starting point for a D 
translation:

http://codepad.org/4mOHW0fz

Bye,
bearophile


Hmm.. How would one use core.simd with LDC2? It doesn't seem to 
define D_SIMD.

Or should I go for builtins?


Re: N-body bench

2014-01-29 Thread bearophile

Stanislav Blinov:

Hmm.. How would one use core.simd with LDC2? It doesn't seem to 
define D_SIMD.

Or should I go for builtins?


I don't know if this is useful for you, but here I wrote a basic 
usage example of SIMD in ldc2 (second D entry):

http://rosettacode.org/wiki/Four_bits_adder#D

Bye,
bearophile


Re: N-body bench

2014-01-29 Thread Stanislav Blinov

On Wednesday, 29 January 2014 at 16:43:35 UTC, bearophile wrote:

Stanislav Blinov:

Hmm.. How would one use core.simd with LDC2? It doesn't seem 
to define D_SIMD.

Or should I go for builtins?


I don't know if this is useful for you, but here I wrote a 
basic usage example of SIMD in ldc2 (second D entry):

http://rosettacode.org/wiki/Four_bits_adder#D

Bye,
bearophile


I meant how to make it compile with ldc2? I've translated the 
code, it compiles and works with dmd (although segfaults in 
-release mode for some reason, probably a bug somewhere).


But with ldc2:

nbody.d(68): Error: undefined identifier __simd
nbody.d(68): Error: undefined identifier XMM

those are needed for that sqrt reciprocal call.


Re: N-body bench

2014-01-29 Thread bearophile

Stanislav Blinov:

I meant how to make it compile with ldc2? I've translated the 
code, it compiles and works with dmd (although segfaults in 
-release mode for some reason, probably a bug somewhere).


But with ldc2:

nbody.d(68): Error: undefined identifier __simd
nbody.d(68): Error: undefined identifier XMM

those are needed for that sqrt reciprocal call.


Usually ldc2 works with simd for me. Perhaps you have to show us 
the code, ask for help in the ldc newsgroup, or ask for help in 
the #ldc IRC channel.


Regarding dmd with -release, I suggest you minimize the code 
and put the problem in Bugzilla. Benchmarks are also useful to 
find and fix compiler bugs.


Bye,
bearophile


Re: N-body bench

2014-01-29 Thread Stanislav Blinov

On Wednesday, 29 January 2014 at 16:54:54 UTC, bearophile wrote:

Stanislav Blinov:

I meant how to make it compile with ldc2? I've translated the 
code, it compiles and works with dmd (although segfaults in 
-release mode for some reason, probably a bug somewhere).


But with ldc2:

nbody.d(68): Error: undefined identifier __simd
nbody.d(68): Error: undefined identifier XMM

those are needed for that sqrt reciprocal call.


Usually ldc2 works with simd for me. Perhaps you have to show 
us the code, ask for help in the ldc newsgroup, or ask for help 
in the #ldc IRC channel.


It's a direct translation of that C++ code:

http://dpaste.dzfl.pl/89517fd0bf8fa

This line:

distance = __simd(XMM.CVTPS2PD, __simd(XMM.RSQRTPS, 
__simd(XMM.CVTPD2PS, dsquared)));


The XMM enum and __simd functions are defined only when D_SIMD 
version is set. ldc2 doesn't seem to set this, unless I'm missing 
some sort of compiler switch.
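
One way to keep such code building on both compilers is to guard the intrinsic call with the same version identifier. The sketch below is an assumption-laden illustration: the wrapper name `rsqrtEstimate` is hypothetical, the fallback is a plain float-precision square root rather than anything ldc-specific, and it assumes a target (such as x86-64) where core.simd defines double2.

```d
import core.simd;
import std.math : abs, sqrt;

// Hypothetical wrapper: low-precision estimate of 1/sqrt for two lanes.
double[2] rsqrtEstimate(double2 dsquared)
{
    version (D_SIMD)
    {
        // Intrinsic path: the instruction sequence from the line above.
        // RSQRTPS is only an approximation of the reciprocal square root.
        double2 r = cast(double2) __simd(XMM.CVTPS2PD,
            __simd(XMM.RSQRTPS, __simd(XMM.CVTPD2PS, dsquared)));
        return r.array;
    }
    else
    {
        // Portable fallback used when XMM and __simd are not declared.
        double[2] r = [1.0 / sqrt(cast(float) dsquared.array[0]),
                       1.0 / sqrt(cast(float) dsquared.array[1])];
        return r;
    }
}

void main()
{
    double2 d = [4.0, 16.0];
    const r = rsqrtEstimate(d);
    // Both paths are estimates, so compare with a loose tolerance.
    assert(abs(r[0] - 0.5) < 1e-3);
    assert(abs(r[1] - 0.25) < 1e-3);
}
```

The two paths do not return bit-identical values, so any refinement step that follows should not assume a particular starting accuracy.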




Regarding dmd with -release, I suggest you minimize the code 
and put the problem in Bugzilla. Benchmarks are also useful to 
find and fix compiler bugs.


I'm already onto it :)


Re: N-body bench

2014-01-28 Thread Jerry
bearophile bearophileh...@lycos.com writes:

 If someone is willing to test LDC2 with a known benchmark, there's this one:

 http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

 A reformatted C++11 version, good as a starting point for a D translation:
 http://codepad.org/4mOHW0fz

Just playing with the C++ version in gcc 4.7.3, I see a significant
speedup by using -funroll-loops.  You might want to make sure that's
enabled.

Jerry


N-body bench

2014-01-24 Thread bearophile
If someone is willing to test LDC2 with a known benchmark, 
there's this one:


http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

A reformatted C++11 version, good as a starting point for a D 
translation:

http://codepad.org/4mOHW0fz

Bye,
bearophile