I need some help benchmarking SoA vs AoS

2016-03-26 Thread maik klein via Digitalmars-d-learn
I recently wrote an article on SoA 
https://maikklein.github.io/post/soa-d/


But now I wanted to actually benchmark SoA vs AoS and it is so 
much harder than I thought.


With DMD, SoA basically always beats AoS by a huge margin: SoA is 
always at least twice as fast as AoS.


But with LDC it is just really hard to get reliable results.

I have created an example without any dependencies:

http://dpaste.dzfl.pl/877a925e0a33

dub run -b release --compiler=dmd

benchmarking complete access
AoS: 419 ms, 938 μs, and 2 hnsecs
SoA: 11 μs and 8 hnsecs
benchmarking partial access
AoS: 521 ms, 381 μs, and 3 hnsecs
SoA: 5 μs


dub run -b release --compiler=ldc

benchmarking complete access
AoS: 1 μs and 6 hnsecs
SoA: 1 hnsec
benchmarking partial access
AoS: 1 hnsec
SoA: 1 hnsec


The problem I have is that LDC always seems to optimize the 
functions too much. At least one function always executes in "1 
hnsec".


When I do manage to create a test case where AoS and SoA both take 
a reasonable amount of time, they are equally fast, no matter what I do.


What I wanted to see:

If I iterate over a collection and always access all members, I 
would expect AoS and SoA to be equally fast.


If I iterate over a collection whose elements are large and I only 
access a few members, I would expect SoA to be much faster.


Any help/tips are greatly appreciated.
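[For readers without the paste at hand, the two layouts under comparison can be sketched like this. The types and field names (`ParticleAoS`, `ParticlesSoA`) are illustrative, not the code from the paste:]

```d
import std.stdio;

// AoS: one array of whole structs; all fields of an element sit together.
struct ParticleAoS { float x, y, z; float mass; }

// SoA: one array per field; iterating a single field reads contiguous memory.
struct ParticlesSoA
{
    float[] x, y, z;
    float[] mass;
}

void main()
{
    auto aos = new ParticleAoS[4];
    ParticlesSoA soa;
    soa.x.length = soa.y.length = soa.z.length = soa.mass.length = 4;

    foreach (i, ref p; aos) p = ParticleAoS(0, 0, 0, i + 1);
    foreach (i, ref m; soa.mass) m = i + 1;

    // Partial access: summing only mass reads a quarter of the data in SoA,
    // but strides over whole structs in AoS.
    float sum = 0;
    foreach (ref p; aos) sum += p.mass;
    foreach (m; soa.mass) sum += m;
    writeln(sum);
}
```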


Re: I need some help benchmarking SoA vs AoS

2016-03-26 Thread ag0aep6g via Digitalmars-d-learn

On 26.03.2016 14:47, maik klein wrote:

The problem I have is that LDC always seems to optimize the functions
too much. At least one function executes always in "1 hnsec".


Let the output depend on the results somehow. Simply printing them out 
should do the trick. You can also try throwing an Exception on wrong 
results. Else the calculations will be optimized away completely.


Also make sure that data that's supposed to be dynamic actually comes 
from some kind of input. Else the program may just print precalculated 
results.
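[A minimal sketch of that advice; `work` is a stand-in workload, not code from the paste. The result feeds both enforce and writeln, and the loop count comes from the command line, so the compiler can neither delete the loop nor precompute its result:]

```d
import std.conv : to;
import std.datetime.stopwatch : benchmark;
import std.exception : enforce;
import std.stdio : writeln;

// Stand-in workload whose result we can check.
long work(long n)
{
    long sum = 0;
    foreach (i; 0 .. n)
        sum += i;
    return sum;
}

void main(string[] args)
{
    // Reading n at runtime keeps it opaque to the optimizer.
    immutable n = args.length > 1 ? args[1].to!long : 1_000_000;

    long sum;
    auto r = benchmark!(() { sum = work(n); })(10);

    // The output depends on the result, so the computation must happen.
    enforce(sum == n * (n - 1) / 2, "wrong result");
    writeln(r[0], " for sum = ", sum);
}
```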


Re: I need some help benchmarking SoA vs AoS

2016-03-26 Thread ag0aep6g via Digitalmars-d-learn

On 26.03.2016 17:31, ag0aep6g wrote:

Let the output depend on the results somehow. Simply printing them out
should do the trick. You can also try throwing an Exception on wrong
results. Else the calculations will be optimized away completely.

Also make sure that data that's supposed to be dynamic actually comes
from some kind of input. Else the program may just print precalculated
results.


I've crudely applied my advice here:

https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions

(Can I somehow link to one specific commit of a gist?)

size is now read from the program arguments instead of being an enum, 
and the results are checked with enforce. ldmd can no longer avoid 
calculating anything or precalculate the results completely.


It may still do ridiculous optimizations that wouldn't be possible with 
real world data, because the data is so formulaic. To avoid that you 
could prepare a file with test data and read it from there in the 
benchmark program. That way ldmd can't make predictions about it.
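[One minimal way to get non-formulaic input without a data file is a runtime-seeded RNG; `randomData` here is a hypothetical helper, not code from the gist:]

```d
import std.random : Random, uniform, unpredictableSeed;
import std.stdio : writeln;

// A runtime seed means the values are unknown at compile time,
// so the optimizer cannot precompute anything about them.
float[] randomData(size_t n)
{
    auto rng = Random(unpredictableSeed);
    auto data = new float[n];
    foreach (ref v; data)
        v = uniform(0.0f, 1.0f, rng);
    return data;
}

void main()
{
    auto data = randomData(1000);
    float sum = 0;
    foreach (v; data)
        sum += v;
    writeln(sum); // using the result keeps the loop alive
}
```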


Note that I quick-and-dirty worked around two correctness issues in your 
code:


1) SOA didn't set its length. I've added it to allocate. No idea if 
that's the right spot.
2) SOA doesn't set the correct initial values. I'm just setting them 
from the outside here.


Re: I need some help benchmarking SoA vs AoS

2016-03-26 Thread ag0aep6g via Digitalmars-d-learn

On 26.03.2016 18:04, ag0aep6g wrote:

https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions


PS: Those enforces are for a size of 100_000 not 1_000_000, because I'm 
impatient.




Re: I need some help benchmarking SoA vs AoS

2016-03-26 Thread maik klein via Digitalmars-d-learn

On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:

On 26.03.2016 18:04, ag0aep6g wrote:

https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions


PS: Those enforces are for a size of 100_000 not 1_000_000, 
because I'm impatient.


Thanks, okay, that gives me much more reliable results.
for 1_000_000

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more benchmarks 
now. I'll probably also randomize the inputs.


Re: I need some help benchmarking SoA vs AoS

2016-03-26 Thread Marco Leise via Digitalmars-d-learn
Am Sat, 26 Mar 2016 17:43:48 +
schrieb maik klein :

> On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:
> > On 26.03.2016 18:04, ag0aep6g wrote:
> >> https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions
> >
> > PS: Those enforces are for a size of 100_000 not 1_000_000, 
> > because I'm impatient.
> 
> Thanks, okay that gives me more more reliable results.
> for 1_000_000
> 
> benchmarking complete access
> AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
> SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
> benchmarking partial access
> AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
> SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec
> 
> This is sort of what I expected. I will do a few more benchmarks 
> now. I probably also randomize the inputs.

That looks more like it. :) There are a few things to keep in
mind. When you use constant data and don't use the result,
compilers can:

- Const-fold computations away.
- Specialize functions on compile-time known arguments. That
  works mostly as if the argument was a template argument. A
  new instance of the function is created for each invocation
  with a compile-time known value. (Disabling inlining won't
  prevent this.)
- Call pure functions with the same argument only once in a
  loop of 1_000_000.
- Replace 1_000_000 additions of the number X in a loop with
  the expression 1_000_000*X.

In addition to these real-world optimizations, when you don't
accumulate the result of the function call and print it or
store it in some global variable, the whole computation may be
removed as having no side effects, as others have pointed out. When
inlining is used the compiler may also see through attempts to
only use a part of the result and remove instructions that
lead to the rest of it. For example when you return a struct
with two fields - a and b - and store the sum of a, but ignore
b, then the compiler may remove computations that are only
needed for b!

Try to generate input from random number generators or
external files. Disable inlining for the benchmarked function
via attribute or pragma(inline, false) or otherwise make sure
that the compiler cannot guess what any of the arguments are
and perform const-folding after inlining. When the result is
returned, make sure you use enough of it that the
compiler cannot elide instructions after inlining. It is
often enough to just store it in a global variable.
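[Those last two suggestions might look like this; `sumTo` and `sink` are illustrative names, not code from the thread:]

```d
import std.conv : to;
import std.stdio : writeln;

long sink; // global: storing the result here defeats dead-code elimination

// pragma(inline, false) keeps the call opaque, so the compiler cannot
// const-fold the body against compile-time-known arguments.
pragma(inline, false)
long sumTo(long n)
{
    long s = 0;
    foreach (i; 0 .. n)
        s += i;
    return s;
}

void main(string[] args)
{
    // A runtime argument, so the compiler cannot guess n.
    immutable n = args.length > 1 ? args[1].to!long : 1_000_000;
    sink = sumTo(n);
    writeln(sink);
}
```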

-- 
Marco



Re: I need some help benchmarking SoA vs AoS

2016-03-29 Thread ZombineDev via Digitalmars-d-learn

On Saturday, 26 March 2016 at 17:43:48 UTC, maik klein wrote:

On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:

On 26.03.2016 18:04, ag0aep6g wrote:

https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions


PS: Those enforces are for a size of 100_000 not 1_000_000, 
because I'm impatient.


Thanks, okay that gives me more more reliable results.
for 1_000_000

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more 
benchmarks now. I probably also randomize the inputs.


Can you benchmark my implementation 
(http://dpaste.dzfl.pl/3de1e18756f8) against yours? It should 
have roughly the same API, except that mine doesn't support 
pre-allocation (reserving capacity) - I wanted to keep it 
minimalistic.
During access it should have the same performance, but the 
insertion behavior should be noticeably different. I'm interested 
to see whether one larger allocation on insertion is faster than 
a number of small ones (or vice versa). Though AoS should be the 
fastest in this regard (growth performance).
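[A minimal sketch of why the growth behavior differs; `insertSoA`/`insertAoS` are my illustrative names, not the API from either paste:]

```d
import std.stdio : writeln;

struct Point { float x; double y; }

// SoA: one slice per field; AoS: one slice of whole structs.
float[] xs;
double[] ys;
Point[] aos;

// Inserting into SoA may touch (and reallocate) one array per field...
void insertSoA(float x, double y) { xs ~= x; ys ~= y; }

// ...while AoS growth is a single append into one buffer.
void insertAoS(float x, double y) { aos ~= Point(x, y); }

void main()
{
    foreach (i; 0 .. 1000)
    {
        insertSoA(i, i * 2.0);
        insertAoS(i, i * 2.0);
    }
    writeln(xs.length, " ", aos.length);
}
```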


Re: I need some help benchmarking SoA vs AoS

2016-03-29 Thread maik klein via Digitalmars-d-learn

On Tuesday, 29 March 2016 at 21:27:15 UTC, ZombineDev wrote:

On Saturday, 26 March 2016 at 17:43:48 UTC, maik klein wrote:

On Saturday, 26 March 2016 at 17:06:39 UTC, ag0aep6g wrote:

On 26.03.2016 18:04, ag0aep6g wrote:

https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe/revisions


PS: Those enforces are for a size of 100_000 not 1_000_000, 
because I'm impatient.


Thanks, okay that gives me more more reliable results.
for 1_000_000

benchmarking complete access
AoS: 1 sec, 87 ms, 266 μs, and 4 hnsecs
SoA: 1 sec, 491 ms, 186 μs, and 6 hnsecs
benchmarking partial access
AoS: 7 secs, 167 ms, 635 μs, and 8 hnsecs
SoA: 1 sec, 20 ms, 573 μs, and 1 hnsec

This is sort of what I expected. I will do a few more 
benchmarks now. I probably also randomize the inputs.


Can you benchmark my implementation 
(http://dpaste.dzfl.pl/3de1e18756f8) against yours? It should 
have roughly the same API, except that mine doesn't support 
pre-allocation (reserving capacity) - I wanted to keep it 
minimalistic.
During access it should have the same performance, however the 
insertion behavior should be noticeably different. I'm 
interested to see if one larger allocation on insertion would 
be faster than a number of small ones (or vice-versa). Though, 
AoS should be the fastest in this regard (growth performance).


Yes, I'll do it tomorrow, though I'm not sure how much you can 
rely on the data. I think it really depends on how fragmented the 
heap currently is.


I will also benchmark memory access, just to see if there is any 
difference. I recently asked a question about it on SO 
https://stackoverflow.com/questions/36296960/pros-and-cons-of-one-big-buffer-vs-many-smaller-buffer?


It doesn't seem to have been well received, though.