I have code that computes the dot product of a 2D matrix of size (on
the order of) [1000,16] with a vector of size [16].  The matrix is
float64 and the vector is complex128.  I was using numpy.dot, but it
turned out to be a bottleneck.

So I coded dot2x1 in C++ (using xtensor-python just for the
interface).  No fancy SIMD was used, unless g++ did it on its own.

On a simple benchmark using timeit, my hand-coded routine comes out on
the order of 1000x faster than numpy.  My custom C++ code is dot2x1;
I'm not copying it here because it has some dependencies.  Any idea
what is going on?  Here is the test code:

import numpy as np

from dot2x1 import dot2x1

a = np.ones ((1000,16))
b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
       -0.80311816+0.80311816j, -0.80311816-0.80311816j,
        1.09707981+0.29396165j,  1.09707981-0.29396165j,
       -1.09707981+0.29396165j, -1.09707981-0.29396165j,
        0.29396165+1.09707981j,  0.29396165-1.09707981j,
       -0.29396165+1.09707981j, -0.29396165-1.09707981j,
        0.25495815+0.25495815j,  0.25495815-0.25495815j,
       -0.25495815+0.25495815j, -0.25495815-0.25495815j])

def F1():
    d = dot2x1 (a, b)

def F2():
    d = np.dot (a, b)

from timeit import timeit
print (timeit ('F1()', globals=globals(), number=1000))
print (timeit ('F2()', globals=globals(), number=1000))

Output:

0.013910860987380147  << 1st timeit (dot2x1)
28.608758996007964  << 2nd timeit (np.dot)
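
To check whether the slowdown comes from mixing float64 and complex128 inputs rather than from np.dot in general, a minimal sketch like the following could be timed as well.  It does not need dot2x1.  The random b is only a stand-in for the vector above, and the names dot_mixed, dot_split, and dot_cast are made up for this illustration; no particular timing is assumed.

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
a = np.ones((1000, 16))  # float64 matrix, same shape as in the test above
# complex128 stand-in for b (assumption: any 16 complex values will do)
b = rng.standard_normal(16) + 1j * rng.standard_normal(16)

def dot_mixed():
    # float64 matrix times complex128 vector, exactly as in F2
    return np.dot(a, b)

def dot_split():
    # two real-valued products, recombined into one complex result
    return np.dot(a, b.real) + 1j * np.dot(a, b.imag)

def dot_cast():
    # convert the matrix first so both operands are complex128
    return np.dot(a.astype(np.complex128), b)

# all three variants should agree numerically
assert np.allclose(dot_mixed(), dot_split())
assert np.allclose(dot_mixed(), dot_cast())

for f in (dot_mixed, dot_split, dot_cast):
    print(f.__name__, timeit(f, number=1000))
```

If dot_split or dot_cast turns out to be much faster than dot_mixed, that would point at the mixed-dtype path; if all three are equally slow, the cause is more likely elsewhere (for example in how numpy was built or linked).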
