Hi all,
Thanks for the replies. As mentioned, I'm parallelizing so that I can take
many inner products simultaneously (which I agree is embarrassingly
parallel). The library I'm writing asks the user to supply a function that
takes two objects and returns their inner product. After all the discussion,
though, it seems this is too simplistic an approach. Instead, I plan to
write this part of the library as if the inner product function supplied by
the user already uses all available cores (e.g. numpy and/or numexpr built
against a multithreaded BLAS/LAPACK implementation such as MKL).
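
For concreteness, here is a minimal sketch of the kind of callable the
library expects (the name and the choice of Frobenius inner product are
illustrative only, not the library's actual API):

    import numpy as np

    def inner_product(a, b):
        # Illustrative user-supplied callable: elementwise multiply and sum,
        # i.e. the Frobenius inner product of two equal-shaped arrays.
        return np.sum(a * b)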

As far as using Fortran or C with OpenMP goes, that probably isn't worth
the time it would take, either for me or for the user.

I've tried increasing the array sizes and found the same trends, so the
slowdown isn't only because the arrays are too small to benefit from
multiprocessing. I wrote the code to be easy for anyone to experiment with,
so feel free to play around with what is included in the profiling, the
array sizes, the functions used, and so on.
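
For reference, the measurement in the attached script looks roughly like
this (a reconstructed sketch, not the attached code verbatim):

    import time
    import numpy as np

    arraySize = (3000, 1000)
    a = np.random.random(arraySize)
    b = np.random.random(arraySize)

    start = time.time()
    for _ in range(10):          # repeat to average out timing noise
        np.sum(a * b)            # the serial inner product being profiled
    print('numpy array multiplication took',
          (time.time() - start) / 10., 'seconds')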

I also tried using handythread.foreach with arraySize = (3000, 1000), and
found the following:
No shared memory, numpy array multiplication took 1.57585811615 seconds
Shared memory, numpy array multiplication took 1.25499510765 seconds
This is definitely an improvement over multiprocessing, but, without knowing
any better, I was hoping to see roughly an 8x speedup on my 8-core
workstation.
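
For anyone who wants to reproduce this, the call looks roughly like the
following (assuming the SciPy cookbook's handythread module, whose foreach
takes a function, a sequence, and a thread count):

    import numpy as np
    import handythread  # thread-based map from the SciPy cookbook

    base = np.random.random((3000, 1000))
    arrayList = [np.random.random((3000, 1000)) for _ in range(8)]

    def inner(a):
        # numpy releases the GIL inside the multiply/sum, so threads overlap.
        return np.sum(base * a)

    results = handythread.foreach(inner, arrayList, threads=8)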

Based on what Chris sent, it seems there is some large overhead caused by
multiprocessing pickling numpy arrays. To test what Robin mentioned:

> If you are on Linux or Mac then fork works nicely so you have read
> only shared memory you just have to put it in a module before the fork
> (so before pool = Pool() ) and then all the subprocesses can access it
> without any pickling required. ie
> myutil.data = listofdata
> p = multiprocessing.Pool(8)
> def mymapfunc(i):
>   return mydatafunc(myutil.data[i])
>
> p.map(mymapfunc, range(len(myutil.data)))
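
For completeness, here is a runnable arrangement of that sketch (my code,
with multiply-and-sum standing in for the inner product). One correction:
the map function has to be defined before the Pool is created, since with
fork the workers only see what existed at fork time:

    import multiprocessing
    import numpy as np
    import myutil

    myutil.data = [np.random.random((3000, 1000)) for _ in range(8)]

    def mymapfunc(i):
        # Only the integer index is pickled; the array itself is read from
        # the copy-on-write memory inherited from the parent at fork time.
        return np.sum(myutil.data[i] * myutil.data[i])

    if __name__ == '__main__':
        p = multiprocessing.Pool(8)  # created after mymapfunc is defined
        results = p.map(mymapfunc, range(len(myutil.data)))
        p.close()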

I tried creating the arrayList in the myutil module and using
multiprocessing to find the inner products of myutil.arrayList; however,
this was still slower than not using multiprocessing, so I believe there is
still some large overhead. Here are the results:
No shared memory, numpy array multiplication took 1.55906510353 seconds
Shared memory, numpy array multiplication took 9.82426381111 seconds
Shared memory, myutil.arrayList numpy array multiplication took 8.77094507217 seconds
I'm attaching this code.

I'm going to work around this numpy/multiprocessing behavior by using
numpy/numexpr built against a multithreaded BLAS/LAPACK such as MKL. It
would be good to know exactly what's causing the overhead, though. It would
be nice if there were a way to get the ideal speedup via multiprocessing,
regardless of the internal workings of the single-threaded inner product
function, as this was the behavior I expected. I imagine other people will
come across similar situations, but again, I'm going to try to get around
this by letting the threaded BLAS make use of all available cores.
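
As a sketch of that plan (assuming numexpr is installed; set_num_threads
and the sum reduction are part of its documented interface, and an
MKL-linked numpy can likewise be capped with the MKL_NUM_THREADS
environment variable):

    import numpy as np
    import numexpr as ne

    a = np.random.random((3000, 1000))
    b = np.random.random((3000, 1000))

    ne.set_num_threads(8)            # numexpr manages its own thread pool
    ip = ne.evaluate('sum(a * b)')   # multithreaded multiply + reduction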

Thanks again,
Brandt

Attachment: myutil.py
Attachment: shared_mem.py
