Well, when I started I got segfaults all the time :-) Could you please send me a minimal code example that segfaults? That would be great! It is the only way we can get this stable.
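Even something with this shape would already help (a sketch of what I mean, using Base.parapply from https://github.com/JuliaLang/julia/pull/6741; fill_core is just a made-up placeholder, to be replaced piece by piece with your real code until the crash reappears):

function fill_core(x, i)
    x[i] = tanh(i)   # placeholder work; substitute whatever your code does here
end

x = zeros(10^6)
Base.parapply(fill_core, (x,), 1:length(x), numthreads=16)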
On Sunday, May 18, 2014 16:35:47 UTC+2, Carlos Becker wrote:
>
> Sounds great!
> I just gave it a try, and with 16 threads I get 0.07 sec, which is
> impressive.
>
> That is when I tried it in isolated code. When I put it together with
> other Julia code I have, it segfaults. Have you experienced this as well?
>
> On May 18, 2014 16:05, "Tobias Knopp" <tobias...@googlemail.com> wrote:
>
>> Sure, though note that the function is Base.parapply; I had imported it
>> explicitly.
>>
>> In the case of vectorize_1arg it would be great to automatically
>> parallelize comprehensions. If someone could tell me where the actual
>> looping happens, that would be great. I have not found it yet; it seems
>> to be somewhere in the parser.
>>
>> On Sunday, May 18, 2014 14:30:49 UTC+2, Carlos Becker wrote:
>>>
>>> By the way, does the code you just sent work as-is with your pull
>>> request branch?
>>>
>>> ------------------------------------------
>>> Carlos
>>>
>>> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>>>
>>>> Hi Tobias, I saw your pull request and have been following it closely,
>>>> nice work ;)
>>>>
>>>> Though, in the case of element-wise matrix operations like tanh, there
>>>> is no need for extra allocations, since the output buffer should be
>>>> allocated only once.
>>>>
>>>> From your first code snippet: is Julia smart enough to pre-compute
>>>> i*N/2 (i.e. hoist it out of the loop)? In such cases, creating a kind
>>>> of array view on the original data would probably be faster, right?
>>>> (Though I don't know how allocations work here.)
>>>>
>>>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for
>>>> known operations such as the trigonometric ones, which benefit a lot
>>>> from multi-threading. I know this is a hack, but it is quick to
>>>> implement and brings an amazing speed-up (8x in the case of the code I
>>>> posted above).
>>>>
>>>> ------------------------------------------
>>>> Carlos
>>>>
>>>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp <tobias...@googlemail.com> wrote:
>>>>
>>>>> Hi Carlos,
>>>>>
>>>>> I am working on something that will make it possible to run Julia
>>>>> functions on multiple threads (https://github.com/JuliaLang/julia/pull/6741).
>>>>> Implementing vectorize_1arg_openmp is actually a lot less trivial,
>>>>> because the Julia runtime is not thread-safe (yet).
>>>>>
>>>>> Your example is great. At first I got a 10x slowdown because the
>>>>> example revealed a locking issue. With a little trick I now get a
>>>>> speedup of 1.75 on a 2-core machine. Not too bad, taking into account
>>>>> that memory allocation cannot be parallelized.
>>>>>
>>>>> The tweaked code looks like this:
>>>>>
>>>>> function tanh_core(x, y, i)
>>>>>     N = length(x)
>>>>>     # div keeps the range and indices integer (N/2 would be a Float64)
>>>>>     for l = 1:div(N, 2)
>>>>>         y[l + i*div(N, 2)] = tanh(x[l + i*div(N, 2)])
>>>>>     end
>>>>> end
>>>>>
>>>>> function ptanh(x; numthreads=2)
>>>>>     y = similar(x)
>>>>>     parapply(tanh_core, (x, y), 0:1, numthreads=numthreads)
>>>>>     y
>>>>> end
>>>>>
>>>>> I actually want this to also be fast for the per-element version:
>>>>>
>>>>> function tanh_core(x, y, i)
>>>>>     y[i] = tanh(x[i])
>>>>> end
>>>>>
>>>>> function ptanh(x; numthreads=2)
>>>>>     y = similar(x)
>>>>>     parapply(tanh_core, (x, y), 1:length(x), numthreads=numthreads)
>>>>>     y
>>>>> end
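>>>>>
>>>>> For reference, I am timing it against your example roughly like this
>>>>> (a sketch; it assumes the branch from the pull request is built, and
>>>>> the numbers will of course vary by machine):
>>>>>
>>>>> x = rand(100000, 200);
>>>>> @time y1 = tanh(x);                  # serial baseline
>>>>> @time y2 = ptanh(x, numthreads=2);   # threaded version from above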
>>>>>
>>>>> On Sunday, May 18, 2014 11:40:13 UTC+2, Carlos Becker wrote:
>>>>>
>>>>>> Now that I think about it, maybe OpenBLAS has nothing to do with
>>>>>> this, since @which tanh(y) leads to a call to vectorize_1arg().
>>>>>>
>>>>>> If that's the case, wouldn't it be advantageous to have a
>>>>>> vectorize_1arg_openmp() function (defined in C/C++) that performs
>>>>>> element-wise operations on numeric arrays, multi-threaded with
>>>>>> OpenMP?
>>>>>>
>>>>>> On Sunday, May 18, 2014 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>>>
>>>>>>> Forgot to add versioninfo():
>>>>>>>
>>>>>>> julia> versioninfo()
>>>>>>> Julia Version 0.3.0-prerelease+2921
>>>>>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>>>> Platform Info:
>>>>>>>   System: Linux (x86_64-linux-gnu)
>>>>>>>   CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
>>>>>>>   WORD_SIZE: 64
>>>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>>>   LAPACK: libopenblas
>>>>>>>   LIBM: libopenlibm
>>>>>>>
>>>>>>> On Sunday, May 18, 2014 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>>>
>>>>>>>> This is probably related to OpenBLAS, but it seems that tanh() is
>>>>>>>> not multi-threaded, which leaves a considerable speed improvement
>>>>>>>> on the table. For example, MATLAB does multi-thread it and gets
>>>>>>>> around a 3x speed-up over the single-threaded version.
>>>>>>>>
>>>>>>>> For example,
>>>>>>>>
>>>>>>>> x = rand(100000, 200);
>>>>>>>> @time y = tanh(x);
>>>>>>>>
>>>>>>>> yields:
>>>>>>>>   - 0.71 sec in Julia
>>>>>>>>   - 0.76 sec in MATLAB with -singleCompThread
>>>>>>>>   - 0.09 sec in MATLAB (which uses multi-threading by default)
>>>>>>>>
>>>>>>>> The good news is that Julia (with OpenBLAS) is competitive with the
>>>>>>>> single-threaded MATLAB version, though setting the environment
>>>>>>>> variable OPENBLAS_NUM_THREADS has no effect on the timings, nor do
>>>>>>>> I see higher CPU usage with 'top'.
>>>>>>>>
>>>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I
>>>>>>>> missing?
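PS regarding Carlos' vectorize_1arg_openmp idea: if we did go that route, even just hard-coded for the trigonometric functions, I imagine the Julia side would be little more than a ccall wrapper, roughly like this (a sketch only: "libtanh_omp" and :tanh_omp are made-up names, and the C routine containing the actual OpenMP loop would still have to be written):

function tanh_openmp(x::Array{Float64})
    y = similar(x)
    # Hypothetical C routine doing "#pragma omp parallel for" over the data;
    # neither the library nor the symbol exists today.
    ccall((:tanh_omp, "libtanh_omp"), Void,
          (Ptr{Float64}, Ptr{Float64}, Csize_t),
          y, x, length(x))
    return y
end

The hard part is everything the C side must not do: as long as the Julia runtime is not thread-safe, that routine cannot allocate or call back into Julia.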