Hi all,

Thanks for the replies. As mentioned, I'm parallelizing so that I can take many inner products simultaneously (which I agree is embarrassingly parallel). The library I'm writing asks the user to supply a function that takes two objects and returns their inner product. After all the discussion, though, it seems this is too simplistic an approach. Instead, I plan to write this part of the library assuming that the user-supplied inner product function already uses all available cores (e.g. with numpy and/or numexpr built with MKL or LAPACK).
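For concreteness, a user-supplied inner product function might look something like the sketch below. The name inner_product, the assumption that the objects are 1-D numpy arrays, and the choice of np.dot are all illustrative, not requirements of the library:

    import numpy as np

    def inner_product(vec1, vec2):
        # Plain Euclidean inner product of two 1-D numpy arrays.
        return np.dot(vec1, vec2)

    # The library calls the user-supplied function for every pair of
    # objects whose inner product it needs, roughly:
    #     ip = inner_product(obj1, obj2)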
As far as using Fortran or C with OpenMP goes, that probably isn't worth the time it would take, both for me and for the user.

I've tried increasing the array sizes and found the same trends, so the slowdown isn't only because the arrays are too small to show the benefit of multiprocessing. I wrote the code to be easy for anyone to experiment with, so feel free to play around with what is included in the profiling, the sizes of the arrays, the functions used, etc.

I also tried using handythread.foreach with arraySize = (3000, 1000) and found the following:

No shared memory, numpy array multiplication took 1.57585811615 seconds
Shared memory, numpy array multiplication took 1.25499510765 seconds

This is definitely an improvement over multiprocessing, but without knowing any better, I was hoping to see roughly an 8x speedup on my 8-core workstation. Based on what Chris sent, it seems there is some large overhead caused by multiprocessing pickling numpy arrays.

To test what Robin mentioned:

> If you are on Linux or Mac then fork works nicely so you have read
> only shared memory you just have to put it in a module before the fork
> (so before pool = Pool() ) and then all the subprocesses can access it
> without any pickling required. ie
>
> myutil.data = listofdata
> p = multiprocessing.Pool(8)
> def mymapfunc(i):
>     return mydatafunc(myutil.data[i])
>
> p.map(mymapfunc, range(len(myutil.data)))

I tried creating the arrayList in the myutil module and using multiprocessing to find the inner products of myutil.arrayList, but this was still slower than not using multiprocessing, so I believe there is still some large overhead. Here are the results:

No shared memory, numpy array multiplication took 1.55906510353 seconds
Shared memory, numpy array multiplication took 9.82426381111 seconds
Shared memory, myutil.arrayList numpy array multiplication took 8.77094507217 seconds

I'm attaching this code.

I'm going to work around this numpy/multiprocessing behavior by letting numpy/numexpr built with MKL or LAPACK make use of all available cores, but it would be good to know exactly what's causing the overhead. It would be nice if there were a way to get the ideal speedup via multiprocessing, regardless of the internal workings of the single-threaded inner product function, since that was the behavior I expected. I imagine other people will come across similar situations.

Thanks again,
Brandt
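P.S. For anyone who wants to try the fork-based approach without digging into the attachments, below is a minimal, self-contained sketch of the idea (not the attached files themselves): the data is created at module level in myutil so that workers forked after the import inherit it, and only array indices are pickled. The array shape mirrors the tests above; the inner_product helper and the use of np.sum(a * b) are illustrative.

    # myutil.py -- the data lives at module level so that processes forked
    # after "import myutil" inherit it without any pickling.
    import numpy as np
    arrayList = [np.random.random((3000, 1000)) for _ in range(8)]

    # driver script -- only the integer index i is sent to each worker;
    # the arrays themselves are read from memory inherited at fork time.
    import multiprocessing
    import numpy as np
    import myutil

    def inner_product(i):
        # Elementwise product summed over all entries, as in the tests above.
        return np.sum(myutil.arrayList[i] * myutil.arrayList[0])

    if __name__ == '__main__':
        pool = multiprocessing.Pool(processes=8)
        results = pool.map(inner_product, range(len(myutil.arrayList)))
        pool.close()
        pool.join()
        print(results)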
Attachments: myutil.py, shared_mem.py