Sounds great! I just gave it a try, and with 16 threads I get 0.07 sec, which is impressive.
That is, when I tried it in isolated code. When put together with other Julia code I have, it segfaults. Have you experienced this as well?

On 18 May 2014 16:05, "Tobias Knopp" <tobias.kn...@googlemail.com> wrote:

> Sure, the function is Base.parapply though. I had explicitly imported it.
>
> In the case of vectorize_1arg it would be great to automatically
> parallelize comprehensions. If someone could tell me where the actual
> looping happens, that would be great. I have not found it yet; it seems to
> be somewhere in the parser.
>
> On Sunday, 18 May 2014 14:30:49 UTC+2, Carlos Becker wrote:
>>
>> By the way, does the code you just sent work as is with your pull
>> request branch?
>>
>> ------------------------------------------
>> Carlos
>>
>> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>>
>>> Hi Tobias, I saw your pull request and have been following it closely,
>>> nice work ;)
>>>
>>> Though, in the case of element-wise matrix operations like tanh, there
>>> is no need for extra allocations, since the buffer should be allocated
>>> only once.
>>>
>>> From your first code snippet, is Julia smart enough to pre-compute
>>> i*N/2? In such cases, creating a kind of array view on the original data
>>> would probably be faster, right? (Though I don't know how allocations
>>> work here.)
>>>
>>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for known
>>> operations such as trigonometric ones, which benefit a lot from
>>> multi-threading. I know this is a hack, but it is quick to implement and
>>> brings an amazing speed-up (8x in the case of the code I posted above).
>>>
>>> ------------------------------------------
>>> Carlos
>>>
>>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp <tobias...@googlemail.com> wrote:
>>>
>>>> Hi Carlos,
>>>>
>>>> I am working on something that will allow multithreading of Julia
>>>> functions (https://github.com/JuliaLang/julia/pull/6741).
>>>> Implementing vectorize_1arg_openmp is actually a lot less trivial, as
>>>> the Julia runtime is not thread-safe (yet).
>>>>
>>>> Your example is great. I first got a 10x slowdown because the example
>>>> revealed a locking issue. With a little trick I now get a speedup of
>>>> 1.75x on a 2-core machine. Not too bad, taking into account that memory
>>>> allocation cannot be parallelized.
>>>>
>>>> The tweaked code looks like this:
>>>>
>>>> function tanh_core(x, y, i)
>>>>     N = length(x)
>>>>     for l = 1:div(N, 2)
>>>>         y[l + i*div(N, 2)] = tanh(x[l + i*div(N, 2)])
>>>>     end
>>>> end
>>>>
>>>> function ptanh(x; numthreads = 2)
>>>>     y = similar(x)
>>>>     parapply(tanh_core, (x, y), 0:1, numthreads = numthreads)
>>>>     y
>>>> end
>>>>
>>>> I actually want this to also be fast for:
>>>>
>>>> function tanh_core(x, y, i)
>>>>     y[i] = tanh(x[i])
>>>> end
>>>>
>>>> function ptanh(x; numthreads = 2)
>>>>     y = similar(x)
>>>>     N = length(x)
>>>>     parapply(tanh_core, (x, y), 1:N, numthreads = numthreads)
>>>>     y
>>>> end
>>>>
>>>> On Sunday, 18 May 2014 11:40:13 UTC+2, Carlos Becker wrote:
>>>>
>>>>> Now that I think about it, maybe OpenBLAS has nothing to do here,
>>>>> since @which tanh(y) leads to a call to vectorize_1arg().
>>>>>
>>>>> If that's the case, wouldn't it be advantageous to have a
>>>>> vectorize_1arg_openmp() function (defined in C/C++) that works for
>>>>> element-wise operations on scalar arrays, multi-threading with OpenMP?
>>>>> On Sunday, 18 May 2014 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>>
>>>>>> Forgot to add versioninfo():
>>>>>>
>>>>>> julia> versioninfo()
>>>>>> Julia Version 0.3.0-prerelease+2921
>>>>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>>> Platform Info:
>>>>>>   System: Linux (x86_64-linux-gnu)
>>>>>>   CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
>>>>>>   WORD_SIZE: 64
>>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>>   LAPACK: libopenblas
>>>>>>   LIBM: libopenlibm
>>>>>>
>>>>>> On Sunday, 18 May 2014 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>>
>>>>>>> This is probably related to OpenBLAS, but it seems that tanh() is
>>>>>>> not multi-threaded, which prevents a considerable speed improvement.
>>>>>>> For example, MATLAB does multi-thread it and gets around a 3x
>>>>>>> speed-up over the single-threaded version.
>>>>>>>
>>>>>>> For example,
>>>>>>>
>>>>>>> x = rand(100000, 200);
>>>>>>> @time y = tanh(x);
>>>>>>>
>>>>>>> yields:
>>>>>>> - 0.71 sec in Julia
>>>>>>> - 0.76 sec in MATLAB with -singleCompThread
>>>>>>> - 0.09 sec in MATLAB (which uses multi-threading by default)
>>>>>>>
>>>>>>> The good news is that Julia (with OpenBLAS) is competitive with the
>>>>>>> single-threaded MATLAB version, though setting the environment
>>>>>>> variable OPENBLAS_NUM_THREADS doesn't have any effect on the
>>>>>>> timings, nor do I see higher CPU usage with 'top'.
>>>>>>>
>>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I
>>>>>>> missing?
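[Editor's note: the `parapply` API discussed in this thread comes from pull request #6741, which was never merged into Base. The chunked element-wise pattern the thread describes can be sketched with the `Threads.@threads` macro that later shipped with Julia; this is a hedged sketch of the same idea, not code from the thread, and assumes a modern Julia started with `JULIA_NUM_THREADS` set.]

```julia
# Parallel element-wise tanh: the loop range is split across threads,
# mirroring the tanh_core/ptanh chunking discussed above.
function ptanh(x::AbstractArray)
    y = similar(x)
    Threads.@threads for i in eachindex(x, y)
        @inbounds y[i] = tanh(x[i])
    end
    return y
end

x = rand(100_000, 200)
y = ptanh(x)
@assert y ≈ tanh.(x)
```

Each thread writes to a disjoint slice of the preallocated output `y`, so no locking is needed and, as noted in the thread, the single allocation happens once up front rather than inside the parallel region.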