On 08.05.2014 02:48, Frédéric Bastien wrote:
> Just a quick question/possibility.
>
> What about just parallelizing ufuncs with only one input that is C or
> Fortran contiguous, like the trigonometric functions? Is there a fast
> path in the ufunc mechanism when the input is Fortran/C contiguous? If
> that is the case, it would be relatively easy to add an OpenMP pragma
> to parallelize that loop, with a condition on a minimum number of
> elements.
OpenMP is problematic, as GNU OpenMP deadlocks on fork (multiprocessing).
If we do consider adding support, I think multiprocessing.pool.ThreadPool
could be a good option. But it is also not difficult for the user to just
write a wrapper function like this:

    def parallel_trig(x, func, pool, nthreads=4):
        # split into one chunk per thread; assumes 1d and no remainder
        x = x.reshape(nthreads, -1)
        # use functools.partial to pass the out argument
        return np.concatenate(pool.map(func, x))

> Anyway, I won't do it. I'm just outlining what I think is the easiest
> case (depending on NumPy internals that I don't know well enough) to
> implement, and I think the most frequent (so possibly a quick fix for
> someone with knowledge of that code).
>
> In Theano, we found on a few CPUs that for addition we need a minimum
> of 200k elements for the parallelization of elemwise to be useful. We
> use that number by default for all operations to make it easy. This is
> user configurable. This guarantees that on the current generation, the
> threading doesn't slow things down. I think that this is the more
> important point: don't show users a slowdown by default with a new
> version.
>
> Fred
>
>
> On Wed, May 7, 2014 at 2:27 PM, Julian Taylor
> <jtaylor.deb...@googlemail.com <mailto:jtaylor.deb...@googlemail.com>>
> wrote:
>
>     On 07.05.2014 20:11, Sturla Molden wrote:
>     > On 03/05/14 23:56, Siegfried Gonzi wrote:
>     >
>     > A more technical answer is that NumPy's internals do not play
>     > very nicely with multithreading. For example, the array iterators
>     > used in ufuncs store an internal state. Multithreading would
>     > imply excessive contention for this state, as well as induce
>     > false sharing of the iterator object. Therefore, a multithreaded
>     > NumPy would have performance problems due to synchronization as
>     > well as hierarchical memory collisions. Adding multithreading
>     > support to the current NumPy core would just degrade the
>     > performance.
>     > NumPy will not be able to use multithreading efficiently unless
>     > we redesign the iterators in the NumPy core. That is a massive
>     > undertaking which probably means rewriting most of NumPy's core C
>     > code. A better strategy would be to monkey-patch some of the more
>     > common ufuncs with multithreaded versions.
>
>     I wouldn't say that the iterator is a problem; the important
>     iterator functions are thread-safe, and there is support for
>     multithreaded iteration using NpyIter_Copy, so no data is shared
>     between threads.
>
>     I'd say the main issue is that there simply aren't many functions
>     worth parallelizing in NumPy. Most of the commonly used stuff is
>     already memory-bandwidth bound with only one or two threads.
>     The only things I can think of that would profit are
>     sorting/partition and the special functions like sqrt, exp, log,
>     etc.
>
>     Generic efficient parallelization would require merging of
>     operations to improve the FLOPS/loads ratio. E.g. numexpr and
>     Theano are able to do so and thus also have built-in support for
>     multithreading.
>
>     That being said, you can use Python threads with NumPy, as
>     (especially in 1.9) most expensive functions release the GIL. But
>     unless you are doing very FLOP-intensive stuff, you will probably
>     have to manually block your operations to the last-level cache
>     size if you want to scale beyond one or two threads.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
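[Editor's note: the ideas in this thread (Python threads relying on ufuncs
releasing the GIL, Theano's minimum-element cutoff, and blocking work to
cache-sized chunks) can be combined into one runnable sketch. The names
MIN_PARALLEL, BLOCK, and parallel_ufunc are illustrative choices for this
example, not NumPy API, and the constants are placeholders rather than
tuned values.]

```python
from multiprocessing.pool import ThreadPool
import numpy as np

MIN_PARALLEL = 200_000   # below this, threading tends to slow things down
BLOCK = 64 * 1024        # elements per chunk; a real value would match LLC size

def parallel_ufunc(x, func, pool):
    """Apply a one-input ufunc to a 1-d array using a thread pool."""
    if x.size < MIN_PARALLEL:
        return func(x)            # serial fast path for small inputs
    out = np.empty_like(x)
    def run_block(start):
        sl = slice(start, start + BLOCK)
        # writing through the out= view fills the shared result array;
        # NumPy releases the GIL inside the ufunc call, so the pool's
        # threads genuinely overlap on large inputs
        func(x[sl], out=out[sl])
    pool.map(run_block, range(0, x.size, BLOCK))
    return out

with ThreadPool(4) as pool:
    x = np.linspace(0.0, 1.0, 400_000)
    y = parallel_ufunc(x, np.sin, pool)
```

Threads (rather than processes) are used here precisely because of the
fork/OpenMP deadlock mentioned above, and because the ufunc writes into a
shared output array without any copying between workers.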