On Fri, Nov 6, 2015 at 12:32 PM, Lionel du Peloux <lionel.dupel...@gmail.com> wrote:

>
> Yichao, thank you for this meaningful answer.
>
> I understand points 1-4 to improve my coding.
>
> - I’ve redrawn a single graph with your 2 benchmarking methods.
> - I’ve added a broadcast! version of sqrt, and two versions from the
> MKL/VML library (with VML.jl).
> - I’ve finally added my first custom benchmark function to the plot (in
> black).
>
> => My custom benchmark function is clearly out of scope for small n and it
> seems to come from point 5.
> => With your method you’re also measuring the inner for loops: is the
> cost of this loop negligible compared to the cost of sqrt?
>

Well, I would imagine that the cost of a loop is much smaller than the cost
of measuring the time. On the machine where I ran the benchmark, the overhead
of calling an empty non-inlined function in a loop is ~1.17 ns, and it is
certainly negligible compared to the GC allocation cost (which is ~2-3 ns per
64 bits).
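
To illustrate the point about measurement cost, here is a minimal sketch that times many passes inside a single timed region, so the clock is read only twice no matter how small n is. `time_sqrt!` is a hypothetical helper written for this illustration, not the benchmark function used earlier in the thread, and the numbers it prints will vary by machine.

```julia
# Hypothetical helper: run `nrep` in-place sqrt passes over `x` inside one
# timed region, so the cost of reading the clock is paid once, not per pass.
function time_sqrt!(out, x, nrep)
    t0 = time_ns()
    for _ in 1:nrep
        @inbounds for i in eachindex(x)
            out[i] = sqrt(x[i])
        end
    end
    t1 = time_ns()
    return (t1 - t0) / (nrep * length(x))  # average nanoseconds per element
end

x = rand(10)                 # small n, where timer overhead matters most
out = similar(x)             # preallocated output, so no allocation is timed
time_sqrt!(out, x, 1)        # warm-up call so compilation is not measured
ns_per_elem = time_sqrt!(out, x, 10^6)
println("approx. ns per sqrt: ", ns_per_elem)
```

The result includes the loop overhead, which is the quantity being asked about; comparing it across several values of n gives a rough idea of how much of the small-n penalty comes from the measurement itself.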


> => On my machine, I get quite different results: allocation is 2.5x
> faster than sqrt, and there is still a huge loss of performance for n < 1e2.
>

Which LLVM version are you using? IIRC we are not using the sqrt intrinsic
on LLVM 3.3 (the default one). I'm using LLVM 3.7, and the sqrt function on
my machine uses the `vsqrtsd` instruction rather than calling the libm
function, and this makes a big difference. (It doesn't seem to be vectorized
(SIMD), and I'm not sure why.)
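
One way to check which instruction is emitted on a given setup is to look at the generated native code. The following is a minimal sketch, assuming Julia 0.7 or later (where `code_native` is provided by `InteractiveUtils`) and an x86-64 machine; `scalar_sqrt` is a hypothetical wrapper name used only for this illustration.

```julia
using InteractiveUtils  # provides code_native outside the REPL (Julia >= 0.7)

# Hypothetical wrapper so that a small, concrete method is inspected.
scalar_sqrt(x::Float64) = sqrt(x)

# Capture the generated assembly as a string instead of printing it directly.
io = IOBuffer()
code_native(io, scalar_sqrt, (Float64,))
asm = String(take!(io))

# Assumption: on x86-64 the hardware instruction is spelled sqrtsd or vsqrtsd.
if occursin("sqrtsd", asm)
    println("hardware sqrt instruction found")
else
    println("no sqrtsd found; inspect `asm` manually")
end
```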


> => Using broadcast!, allocation should not be part of the measurement,
> right? But there’s still a gap in performance ...
>

There's also the cost of the anonymous function. I'm not sure how you wrote
the broadcast! version, and this could be a problem.
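
For reference, the two ways of writing the call look like this. This is only a sketch of the distinction, not the code that was actually benchmarked; the array names are made up for the example. As far as I know, on Julia releases of that period (before 0.5) the compiler did not specialize on anonymous functions, so the second form could be noticeably slower there.

```julia
x = rand(100)
out_named = similar(x)   # preallocated outputs, so broadcast! does not allocate them
out_anon = similar(x)

# Pass the named function directly.
broadcast!(sqrt, out_named, x)

# Wrap the same call in an anonymous function.
broadcast!(v -> sqrt(v), out_anon, x)

# Both forms compute the same values; only the call overhead may differ.
println(out_named == out_anon)
```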


>
> So, do you think your explanation for point 6 is valid?
>

I think it is certainly valid on my machine. Not sure about other setups.


> Is it just a matter of measuring, or is the performance of vectorized
> operations penalized (by what?) for small n?
>
> Thanks,
> Lionel
>
> Note: I’m going to implement a nonlinear solver which deals with about
> 100 beam elements, and each element has about 10 to 100 nodes.
> I want to evaluate what could be the impact of modeling my problem with
> one big DOF vector (1e3 to 1e4 nodes) versus a nested vector (a vector of
> 100 vectors, each of 10 to 100 elements).
>
>
>
>
> <https://lh3.googleusercontent.com/-SY5BcG0XvaQ/Vjzj5-tT8cI/AAAAAAAAEgs/dl1Rr34VcbA/s1600/sqrt_yichao.png>
>
>
> <https://lh3.googleusercontent.com/-HJDgvDskYWo/Vjzjs3D7cHI/AAAAAAAAEgk/Ha-eqUsA3r8/s1600/sqrt_bench.png>
>
>
>
>
>
