Hi Tobias, I saw your pull request and have been following it closely, nice work ;)
Though, in the case of element-wise matrix operations like tanh, there is
no need for extra allocations, since the output buffer only has to be
allocated once.

From your first code snippet, is Julia smart enough to pre-compute i*N/2?
In such cases, creating a kind of array view on the original data would
probably be faster, right? (Though I don't know how allocations work
here.) A rough sketch of what I mean is below.

For vectorize_1arg_openmp, I was thinking of "hard-coding" it for known
operations such as the trigonometric ones, which benefit a lot from
multi-threading. I know this is a hack, but it is quick to implement and
brings an amazing speed-up (8x in the case of the code I posted above).
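
To make the first point concrete, here is a minimal sketch of what I mean,
assuming parapply passes the chunk index i exactly as in your first
snippet quoted below (the function name and the hoisted locals are mine):

    function tanh_core_hoisted(x, y, i)
        half = div(length(x), 2)   # chunk length, computed once
        off  = i*half              # chunk offset, hoisted out of the loop
        for l = 1:half
            y[l+off] = tanh(x[l+off])
        end
    end

If the compiler already hoists i*N/2 out of the loop this changes nothing;
a SubArray view (via sub) over each chunk would be the other
allocation-light way to express the same thing.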
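
And for the hard-coded route, the Julia side could be little more than a
ccall into a small C helper whose loop carries a #pragma omp parallel for.
A purely hypothetical sketch (libvecmath and tanh_omp are made-up names,
not an existing library):

    # Assumes a C library "libvecmath" exporting
    #     void tanh_omp(double *y, const double *x, size_t n)
    # whose loop is annotated with #pragma omp parallel for.
    function tanh_openmp(x::Array{Float64})
        y = similar(x)                      # single allocation, as above
        ccall((:tanh_omp, "libvecmath"), Void,
              (Ptr{Float64}, Ptr{Float64}, Csize_t),
              y, x, length(x))
        y
    end

The obvious downside is that every operation has to be hard-coded on the
C side, which is exactly the hack I am proposing to accept for the common
trigonometric functions.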
------------------------------------------
Carlos


On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp
<tobias.kn...@googlemail.com> wrote:

> Hi Carlos,
>
> I am working on something that will allow multithreading of Julia
> functions (https://github.com/JuliaLang/julia/pull/6741). Implementing
> vectorize_1arg_openmp is actually a lot less trivial, as the Julia
> runtime is not thread-safe (yet).
>
> Your example is great. I first got a slowdown of 10 because the example
> revealed a locking issue. With a little trick I now get a speedup of
> 1.75 on a 2-core machine. Not too bad, taking into account that memory
> allocation cannot be parallelized.
>
> The tweaked code looks like
>
> function tanh_core(x, y, i)
>     N = length(x)
>     for l = 1:div(N,2)
>         y[l+i*div(N,2)] = tanh(x[l+i*div(N,2)])
>     end
> end
>
> function ptanh(x; numthreads=2)
>     y = similar(x)
>     N = length(x)
>     parapply(tanh_core, (x,y), 0:1, numthreads=numthreads)
>     y
> end
>
> I actually want this to also be fast for
>
> function tanh_core(x, y, i)
>     y[i] = tanh(x[i])
> end
>
> function ptanh(x; numthreads=2)
>     y = similar(x)
>     N = length(x)
>     parapply(tanh_core, (x,y), 1:N, numthreads=numthreads)
>     y
> end
>
> On Sunday, May 18, 2014 11:40:13 UTC+2, Carlos Becker wrote:
>
>> Now that I think about it, maybe OpenBLAS has nothing to do here, since
>> @which tanh(y) leads to a call to vectorize_1arg().
>>
>> If that's the case, wouldn't it be advantageous to have a
>> vectorize_1arg_openmp() function (defined in C/C++) that works for
>> element-wise operations on scalar arrays, multi-threading with OpenMP?
>>
>> On Sunday, May 18, 2014 11:34:11 UTC+2, Carlos Becker wrote:
>>>
>>> Forgot to add versioninfo():
>>>
>>> julia> versioninfo()
>>> Julia Version 0.3.0-prerelease+2921
>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>> Platform Info:
>>>   System: Linux (x86_64-linux-gnu)
>>>   CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
>>>   WORD_SIZE: 64
>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>   LAPACK: libopenblas
>>>   LIBM: libopenlibm
>>>
>>> On Sunday, May 18, 2014 11:33:45 UTC+2, Carlos Becker wrote:
>>>>
>>>> This is probably related to OpenBLAS, but it seems that tanh() is
>>>> not multi-threaded, which prevents a considerable speed improvement.
>>>> For example, MATLAB does multi-thread it and gets around a 3x
>>>> speed-up over the single-threaded version.
>>>>
>>>> For example,
>>>>
>>>> x = rand(100000,200);
>>>> @time y = tanh(x);
>>>>
>>>> yields:
>>>> - 0.71 sec in Julia
>>>> - 0.76 sec in MATLAB with -singleCompThread
>>>> - 0.09 sec in MATLAB (this one uses multi-threading by default)
>>>>
>>>> The good news is that Julia (w/ OpenBLAS) is competitive with MATLAB's
>>>> single-threaded version, though setting the env variable
>>>> OPENBLAS_NUM_THREADS doesn't have any effect on the timings, nor do I
>>>> see higher CPU usage with 'top'.
>>>>
>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I
>>>> missing?
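
PS: the OPENBLAS_NUM_THREADS question quoted above answers itself once we
know that tanh goes through vectorize_1arg rather than BLAS; OpenBLAS's
thread count can never affect it, which is why neither the environment
variable nor 'top' showed any difference. For operations that do go
through BLAS, the thread count can also be set from within Julia:

    blas_set_num_threads(4)                 # affects BLAS calls only
    @time rand(2000,2000)*rand(2000,2000);  # matrix multiply scales with it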