For the first benchmark: A.dot(B) with A real and B complex is a
known performance issue; see https://github.com/numpy/numpy/issues/10468
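While that issue stands, one workaround (a minimal sketch, assuming a is
real float64 and b is complex128 as in your benchmark; real_dot_complex
is just an illustrative name) is to split the complex vector and do two
real BLAS products, avoiding the slow mixed-type fallback:

import numpy as np

def real_dot_complex(a, b):
    # Two plain float64 matrix-vector products (fast BLAS dgemv path),
    # recombined into the complex result. For real a this is
    # mathematically equivalent to np.dot(a, b).
    return np.dot(a, b.real) + 1j * np.dot(a, b.imag)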
In general, it might be worth trying different BLAS backends. For
instance, if you install numpy from conda-forge you should be able to
switch between OpenBLAS, MKL and BLIS:
https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation
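You can check which BLAS a given numpy build actually links against
with np.show_config():

import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries this build was linked with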
Roman
On 23/02/2021 19:32, Andrea Gavana wrote:
Hi,
On Tue, 23 Feb 2021 at 19:11, Neal Becker <ndbeck...@gmail.com> wrote:
I have code that performs a dot product of a 2D matrix of size (on the
order of) [1000,16] with a vector of size [16]. The matrix is
float64 and the vector is complex128. I was using numpy.dot but it
turned out to be a bottleneck.
So I coded dot2x1 in C++ (using xtensor-python just for the
interface). No fancy SIMD was used, unless g++ did it on its own.
On a simple benchmark using timeit I find my hand-coded routine is on
the order of 1000x faster than numpy. Here is the test code:
My custom C++ code is dot2x1. I'm not copying it here because it has
some dependencies. Any idea what is going on?
I had a similar experience, albeit with an older numpy and Python 2.7,
so my comments may well be outdated and irrelevant. This was on Windows
10 64-bit, with far more than enough RAM.
It took me forever to find out that numpy.dot was the culprit, and I
ended up using Fortran + f2py. Even with the overhead of going through
the f2py bridge, the Fortran dot_product was several times faster.
Sorry if it doesn't help much.
Andrea.
import numpy as np
from dot2x1 import dot2x1

# Real float64 matrix and complex128 vector of matching inner dimension.
a = np.ones((1000, 16))
b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
              -0.80311816+0.80311816j, -0.80311816-0.80311816j,
               1.09707981+0.29396165j,  1.09707981-0.29396165j,
              -1.09707981+0.29396165j, -1.09707981-0.29396165j,
               0.29396165+1.09707981j,  0.29396165-1.09707981j,
              -0.29396165+1.09707981j, -0.29396165-1.09707981j,
               0.25495815+0.25495815j,  0.25495815-0.25495815j,
              -0.25495815+0.25495815j, -0.25495815-0.25495815j])

def F1():
    d = dot2x1(a, b)    # hand-coded C++ routine

def F2():
    d = np.dot(a, b)    # NumPy's mixed real/complex dot

from timeit import timeit
print(timeit('F1()', globals=globals(), number=1000))
print(timeit('F2()', globals=globals(), number=1000))
0.013910860987380147    << 1st timeit (F1, dot2x1)
28.608758996007964      << 2nd timeit (F2, np.dot)
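One more experiment that should isolate the mixed-type fallback (a
sketch in the same harness, untested here; F3 and ac are illustrative
names): cast the real matrix to complex128 once, outside the timed
loop, so np.dot dispatches to the complex BLAS routine instead of the
slow mixed-type path:

ac = a.astype(np.complex128)   # one-time cast, not counted in the timing

def F3():
    d = np.dot(ac, b)          # pure complex128 dot, handled by BLAS

print(timeit('F3()', globals=globals(), number=1000))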
--
Those who don't understand recursion are doomed to repeat it