Sure, though the function is Base.parapply. I had explicitly imported it. In the case of vectorize_1arg it would be great to automatically parallelize comprehensions. If someone could tell me where the actual looping happens, that would be great; I have not found it yet. It seems to be somewhere in the parser.
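(For reference: the looping does not happen in the parser itself. @vectorize_1arg, defined in Base at the time, generates array methods whose bodies are comprehensions, and lowering turns each comprehension into an ordinary loop. A simplified, hypothetical sketch of that pattern; the real Base macro differs in detail:)

```julia
# Hypothetical, simplified version of the @vectorize_1arg pattern.
# The generated method body is a comprehension, which lowering turns
# into a plain loop -- so the loop lives in the generated method,
# not in the parser.
macro my_vectorize_1arg(f)
    f = esc(f)
    quote
        $f(x::AbstractArray) = reshape([ $f(x[i]) for i in 1:length(x) ], size(x))
    end
end

square(v::Number) = v * v      # scalar method (hypothetical example function)
@my_vectorize_1arg square      # generates the elementwise array method

square([1 2; 3 4])             # returns [1 4; 9 16]
```

This is also why a parallel vectorize would only need to swap the generated comprehension for a parapply call over 1:length(x).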
On Sunday, May 18, 2014 14:30:49 UTC+2, Carlos Becker wrote:
>
> btw, the code you just sent works as is with your pull request branch?
>
> ------------------------------------------
> Carlos
>
> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>
>> Hi Tobias, I saw your pull request and have been following it closely,
>> nice work ;)
>>
>> Though, in the case of element-wise matrix operations like tanh, there
>> is no need for extra allocations, since the buffer should be allocated
>> only once.
>>
>> From your first code snippet, is Julia smart enough to pre-compute i*N/2?
>> In such cases, creating a kind of array view on the original data would
>> probably be faster, right? (though I don't know how allocations work here)
>>
>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for known
>> operations, such as the trigonometric ones, that benefit a lot from
>> multi-threading.
>> I know this is a hack, but it is quick to implement and brings an amazing
>> speed-up (8x in the case of the code I posted above).
>>
>> ------------------------------------------
>> Carlos
>>
>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp <tobias...@googlemail.com> wrote:
>>
>>> Hi Carlos,
>>>
>>> I am working on something that will allow multithreading of Julia
>>> functions (https://github.com/JuliaLang/julia/pull/6741). Implementing
>>> vectorize_1arg_openmp is actually a lot less trivial, as the Julia
>>> runtime is not thread-safe (yet).
>>>
>>> Your example is great. I first got a 10x slowdown because the example
>>> revealed a locking issue. With a little trick I now get a speedup of
>>> 1.75x on a 2-core machine. Not too bad, taking into account that memory
>>> allocation cannot be parallelized.
>>>
>>> The tweaked code looks like:
>>>
>>>     function tanh_core(x, y, i)
>>>         N = length(x)
>>>         for l = 1:N/2
>>>             y[l + i*N/2] = tanh(x[l + i*N/2])
>>>         end
>>>     end
>>>
>>>     function ptanh(x; numthreads=2)
>>>         y = similar(x)
>>>         N = length(x)
>>>         parapply(tanh_core, (x, y), 0:1, numthreads=numthreads)
>>>         y
>>>     end
>>>
>>> I actually want this to also be fast for:
>>>
>>>     function tanh_core(x, y, i)
>>>         y[i] = tanh(x[i])
>>>     end
>>>
>>>     function ptanh(x; numthreads=2)
>>>         y = similar(x)
>>>         N = length(x)
>>>         parapply(tanh_core, (x, y), 1:N, numthreads=numthreads)
>>>         y
>>>     end
>>>
>>> On Sunday, May 18, 2014 11:40:13 UTC+2, Carlos Becker wrote:
>>>
>>>> now that I think about it, maybe openblas has nothing to do here,
>>>> since @which tanh(y) leads to a call to vectorize_1arg().
>>>>
>>>> If that's the case, wouldn't it be advantageous to have a
>>>> vectorize_1arg_openmp() function (defined in C/C++) that works for
>>>> element-wise operations on scalar arrays, multi-threading with OpenMP?
>>>>
>>>> On Sunday, May 18, 2014 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>
>>>>> forgot to add versioninfo():
>>>>>
>>>>>     julia> versioninfo()
>>>>>     Julia Version 0.3.0-prerelease+2921
>>>>>     Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>>     Platform Info:
>>>>>       System: Linux (x86_64-linux-gnu)
>>>>>       CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
>>>>>       WORD_SIZE: 64
>>>>>       BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>       LAPACK: libopenblas
>>>>>       LIBM: libopenlibm
>>>>>
>>>>> On Sunday, May 18, 2014 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>
>>>>>> This is probably related to openblas, but it seems that tanh() is
>>>>>> not multi-threaded, which prevents a considerable speed improvement.
>>>>>> For example, MATLAB does multi-thread it and gets around a 3x
>>>>>> speed-up over the single-threaded version.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>>     x = rand(100000, 200);
>>>>>>     @time y = tanh(x);
>>>>>>
>>>>>> yields:
>>>>>> - 0.71 sec in Julia
>>>>>> - 0.76 sec in MATLAB with -singleCompThread
>>>>>> - 0.09 sec in MATLAB (which uses multi-threading by default)
>>>>>>
>>>>>> The good news is that Julia (w/ openblas) is competitive with the
>>>>>> single-threaded MATLAB version, though setting the env variable
>>>>>> OPENBLAS_NUM_THREADS doesn't have any effect on the timings, nor do
>>>>>> I see higher CPU usage with 'top'.
>>>>>>
>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I
>>>>>> missing?
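(A note on the chunked snippet quoted above: in Julia, N/2 is floating-point division, so 1:N/2 and l + i*N/2 only work via float indexing; div(N, 2) keeps everything integer. A self-contained sketch below, with two labeled assumptions: serial_apply is a hypothetical serial stand-in for parapply, which only exists on the PR #6741 branch, and length(x) is even:)

```julia
# Hypothetical serial stand-in for parapply (PR #6741), so the chunked
# kernel can be run without the branch; the real version would run the
# iterations of r on separate threads.
function serial_apply(f, args, r; numthreads=2)
    for i in r
        f(args..., i)
    end
end

function tanh_core(x, y, i)
    n = div(length(x), 2)            # integer chunk length (N/2 is a Float64)
    for l = 1:n
        y[l + i*n] = tanh(x[l + i*n])
    end
end

# Assumes length(x) is even, as in the original two-chunk split.
function ptanh(x; numthreads=2)
    y = similar(x)
    serial_apply(tanh_core, (x, y), 0:1, numthreads=numthreads)
    y
end

x = rand(10)
ptanh(x) == map(tanh, x)             # true
```

As an aside, this also answers the OPENBLAS_NUM_THREADS question: elementwise tanh never calls into BLAS at all; it is a Julia-level loop over scalar libm (openlibm) calls, so only Julia-side threading such as the PR above can speed it up.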