On 9 Apr, 09:36, John Ladasky <lada...@my-deja.com> wrote:

> Thanks for finding my discussion! Yes, it's about passing numpy
> arrays to multiple processors. I'll accomplish that any way that I
> can.
My preferred ways of doing this are:

1. Most cases for parallel processing are covered by libraries, even
for neural nets. This particularly involves linear algebra solvers and
FFTs, or calling certain expensive functions (sin, cos, exp) over and
over again. The solution here is optimised LAPACK and BLAS (Intel MKL,
AMD ACML, GotoBLAS, ATLAS, Cray libsci), optimised FFTs (FFTW, Intel
MKL, ACML), and fast vector math libraries (Intel VML, ACML). For
example, if you want to make multiple calls to the function "exp",
there is a good chance you want to use a vector math library. Despite
this, most Python programmers' instinct seems to be to use multiple
processes with numpy.exp or math.exp, or to use multiple threads in C
with exp from libm (cf. math.h). Why go through this pain when a
single function call to Intel VML or AMD ACML (acml-vm) will do much
better?

It is common to see scholars argue that "yes, but my needs are so
special that I need to customise everything myself." Usually this
translates to "I don't know these libraries (not even that they exist)
and am happy to reinvent the wheel." Thus, if you think you need
manually managed threads or processes for parallel technical
computing, and even contemplate that the GIL might get in your way,
there is a 99% chance you are wrong. You will almost ALWAYS want to
use a fast library, either directly from Python or linked to your own
serial C or Fortran code. You have probably heard that "premature
optimisation is the root of all evil in computer programming." It
particularly applies here. Learn to use the available performance
libraries, and it does not matter which language they are called from
(C or Fortran will not be faster than Python here!) This is one of the
major reasons Python can be used for HPC (high-performance computing)
even though the Python part itself is "slow". Most of these libraries
are available for free (GotoBLAS, ATLAS, FFTW, ACML), but Intel MKL
and VML require a license fee.
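To make the point about vector math concrete, here is a minimal sketch
in plain NumPy. I don't have VML or ACML to hand, so numpy.exp stands
in for a tuned vector library; the principle is the same: one library
call over the whole array instead of a Python-level loop (with or
without extra processes).

```python
import math
import numpy as np

x = np.linspace(0.0, 1.0, 100_000)

# One library call evaluates exp over the entire array. A build linked
# against a tuned vector library would do this even faster, but the
# calling pattern is identical.
y_vec = np.exp(x)

# The "instinctive" approach: a Python-level loop over math.exp. This
# is typically orders of magnitude slower than the single vectorized
# call, before any threads or processes are even involved.
y_loop = np.array([math.exp(v) for v in x])

assert np.allclose(y_vec, y_loop)
```

Parallelising the loop version over multiple processes just spreads
the interpreter overhead around; replacing it with the single library
call removes the overhead instead.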
Also note that there are comprehensive numerical libraries you can
use, which can be linked with multi-threaded performance libraries
under the hood. Of particular interest are the Fortran libraries from
NAG and IMSL, which are the two gold standards of technical computing.
Also note that the linear algebra solvers of NumPy and SciPy in the
Enthought Python Distribution are linked with Intel MKL. Enthought's
license is cheaper than Intel's, and you don't need to build NumPy or
SciPy against MKL yourself. Using scipy.linalg from EPD is likely to
cover your needs for parallel computing with neural nets.

2. Use Cython and threading.Thread, and release the GIL. Perhaps we
should have a cookbook example on scipy.org for this. In the "nogil"
block you can call a library or do certain things that Cython allows
without holding the GIL.

3. Use C, C++ or Fortran with OpenMP, and call this code using Cython,
ctypes or f2py. (I prefer Cython and Fortran 95, but any combination
will do.)

4. Use mpi4py with any MPI implementation.

Note that 1-4 are not mutually exclusive; you can always use some
combination of them.

> I will retain a copy of YOUR shmarray code (not the Bitbucket code)
> for some time in the future. I anticipate that my arrays might get
> really large, and then copying them might not be practical in terms of
> time and memory usage.

The expensive overhead in passing a NumPy array to
multiprocessing.Queue is related to pickle/cPickle, not IPC or making
a copy of the buffer. For any NumPy array you can afford to copy in
terms of memory, just work with copies. The shared memory arrays I
made are only useful for large arrays. They are just as expensive to
pickle in terms of time, but can be inexpensive in terms of memory.
Also beware that the buffer is not copied to the pickle, so you need
to call .copy() to pickle the contents of the buffer.

But again, I'd urge you to consider a library or threads
(threading.Thread in Cython or OpenMP) before you consider multiple
processes.
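For readers who want to see the shared-memory idea without my shmarray
code: the standard library can sketch the same thing. This is a
minimal illustration, assuming CPython's multiprocessing module, using
multiprocessing.Array plus numpy.frombuffer; it is not the code under
discussion, just the underlying technique.

```python
import multiprocessing as mp
import numpy as np

n = 4

# A block of shared memory holding n doubles. A multiprocessing.Process
# handed this object operates on the very same buffer -- the array
# contents are never pickled, only the small handle is.
shared = mp.Array("d", n)

# Zero-copy view: frombuffer reinterprets the shared buffer as an
# ndarray, so writes through the view land directly in shared memory.
a = np.frombuffer(shared.get_obj(), dtype=np.float64, count=n)
a[:] = [1.0, 2.0, 3.0, 4.0]

# The ctypes side sees the writes made through the NumPy view.
assert shared[:] == [1.0, 2.0, 3.0, 4.0]
```

This is only worthwhile for arrays too large to copy; as noted above,
for anything you can afford to copy, plain copies through
multiprocessing.Queue are simpler.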
The reason I have not updated the sharedmem arrays for two years is
that I have come to the conclusion that there are better ways to do
this (particularly vendor-tuned libraries). But since they are mostly
useful on 64-bit systems (i.e. large arrays), I'll post an update
soon.

If you decide to use a multithreaded solution (or shared memory as
IPC), beware of "false sharing". If multiple processors write to the
same cache line (typically 64 or 128 bytes, depending on hardware),
you'll create an invisible "GIL" that will kill any scalability. That
is because dirty cache lines need to be synchronized with RAM. "False
sharing" is one of the major reasons that "home-brewed"
compute-intensive code will not scale.

It is not uncommon to see Java programmers complain about Python's
GIL, and then go on to write I/O-bound or falsely shared code. Rest
assured that multi-threaded Java will not scale better than Python in
these cases :-)

Regards,
Sturla

--
http://mail.python.org/mailman/listinfo/python-list
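A footnote on the false-sharing point above: the usual fix is to pad
each worker's hot data out to its own cache line. The sketch below
shows the layout trick with shared memory and multiprocessing; the 64
byte line size and the alignment of the buffer are assumptions, and
this only demonstrates the padded layout, not a timing comparison.

```python
import multiprocessing as mp
import numpy as np

# Assuming 64-byte cache lines and 8-byte doubles: 8 doubles per line.
DOUBLES_PER_LINE = 8

def accumulate(shared, slot, iters):
    # Each worker updates only index slot * DOUBLES_PER_LINE, so
    # (assuming the buffer is line-aligned) no two workers ever write
    # to the same cache line. Packing the counters at adjacent indices
    # instead would put several of them on one line -- false sharing.
    a = np.frombuffer(shared.get_obj(), dtype=np.float64)
    for _ in range(iters):
        a[slot * DOUBLES_PER_LINE] += 1.0

if __name__ == "__main__":
    nworkers, iters = 4, 10_000
    shared = mp.Array("d", nworkers * DOUBLES_PER_LINE)
    ps = [mp.Process(target=accumulate, args=(shared, i, iters))
          for i in range(nworkers)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    a = np.frombuffer(shared.get_obj(), dtype=np.float64)
    assert all(a[i * DOUBLES_PER_LINE] == iters
               for i in range(nworkers))
```

The padding wastes a little memory per worker, which is exactly the
trade you want: a few cache lines of slack in exchange for not
serialising every write through the coherency protocol.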