The computation of `tanh` is done in openlibm, not openblas, and it is not
multithreaded. MATLAB probably uses Intel's Vector Mathematical Functions
(VML) from MKL. If you have MKL, you can call it yourself, and it makes a
big difference, as you also saw in MATLAB. With openlibm I get

julia> @time y = tanh(x);
elapsed time: 1.229392453 seconds (160000096 bytes allocated)

and with VML I get

julia> @time (ymkl = similar(x);
              ccall((:vdtanh_, Base.libblas_name), Void,
                    (Ptr{Int}, Ptr{Float64}, Ptr{Float64}),
                    &length(x), x, ymkl))
elapsed time: 0.086282489 seconds (160000112 bytes allocated)
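
For convenience, the VML call can be wrapped in a small function. A minimal
sketch, assuming a Julia build where Base.libblas_name resolves to MKL (the
symbol vdtanh_ and its argument list are taken from the snippet above;
vml_tanh is just an illustrative name):

function vml_tanh(x::Array{Float64})
    y = similar(x)
    # vdTanh computes y[i] = tanh(x[i]) over the whole array in one call
    ccall((:vdtanh_, Base.libblas_name), Void,
          (Ptr{Int}, Ptr{Float64}, Ptr{Float64}),
          &length(x), x, y)
    y
end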

It appears that we can get something similar with Tobias' work, which is
cool.


2014-05-18 16:35 GMT+02:00 Carlos Becker <carlosbec...@gmail.com>:

> Sounds great!
> I just gave it a try, and with 16 threads I get 0.07 sec, which is
> impressive.
>
> That was when I tried it in isolated code, though. When put together
> with other Julia code I have, it segfaults. Have you experienced this
> as well?
> On 18 May 2014 16:05, "Tobias Knopp" <tobias.kn...@googlemail.com>
> wrote:
>
>> Sure, the function is Base.parapply though. I had explicitly imported it.
>>
>> In the case of vectorize_1arg it would be great to automatically
>> parallelize comprehensions. It would help if someone could tell me where
>> the actual looping happens; I have not found that yet. It seems to be
>> somewhere in the parser.
>>
>> On Sunday, 18 May 2014 14:30:49 UTC+2, Carlos Becker wrote:
>>>
>>> By the way, does the code you just sent work as-is with your pull
>>> request branch?
>>>
>>>
>>> ------------------------------------------
>>> Carlos
>>>
>>>
>>> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>>>
>>>> Hi Tobias, I saw your pull request and have been following it closely.
>>>> Nice work ;)
>>>>
>>>> Though, in the case of element-wise matrix operations like tanh, there
>>>> is no need for extra allocations, since the output buffer should be
>>>> allocated only once.
>>>>
>>>> From your first code snippet, is Julia smart enough to pre-compute
>>>> i*N/2?
>>>> In such cases, creating a kind of array view on the original data would
>>>> probably be faster, right? (though I don't know how allocations work
>>>> here).
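>>>>
>>>> As a rough sketch of the view idea (untested; sub() is Julia 0.3's
>>>> array view, tanh_chunk! is just an illustrative name, and x is assumed
>>>> to be a vector to keep the indexing simple):
>>>>
>>>> function tanh_chunk!(yv, xv)
>>>>     # operates on views, so no offset arithmetic and no copying
>>>>     for l = 1:length(xv)
>>>>         yv[l] = tanh(xv[l])
>>>>     end
>>>> end
>>>>
>>>> y = similar(x)
>>>> half = div(length(x), 2)
>>>> tanh_chunk!(sub(y, 1:half), sub(x, 1:half))
>>>> tanh_chunk!(sub(y, half+1:length(x)), sub(x, half+1:length(x)))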
>>>>
>>>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for
>>>> known operations such as trigonometric ones, which benefit a lot from
>>>> multi-threading.
>>>> I know this is a hack, but it is quick to implement and brings an
>>>> amazing speed-up (8x in the case of the code I posted above).
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------
>>>> Carlos
>>>>
>>>>
>>>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp
>>>> <tobias...@googlemail.com> wrote:
>>>>
>>>>> Hi Carlos,
>>>>>
>>>>> I am working on something that will allow multithreading of Julia
>>>>> functions (https://github.com/JuliaLang/julia/pull/6741). Implementing
>>>>> vectorize_1arg_openmp is actually a lot less trivial, as the Julia
>>>>> runtime is not thread-safe (yet).
>>>>>
>>>>> Your example is great. I first got a 10x slowdown because the example
>>>>> revealed a locking issue. With a little trick I now get a speedup of
>>>>> 1.75 on a 2-core machine. Not too bad, taking into account that memory
>>>>> allocation cannot be parallelized.
>>>>>
>>>>> The tweaked code looks like
>>>>>
>>>>> function tanh_core(x, y, i)
>>>>>     # task i (0 or 1) handles one half of the array; div() keeps
>>>>>     # the indices integers (this assumes length(x) is even)
>>>>>     halfN = div(length(x), 2)
>>>>>     for l = 1:halfN
>>>>>         y[l + i*halfN] = tanh(x[l + i*halfN])
>>>>>     end
>>>>> end
>>>>>
>>>>> function ptanh(x; numthreads=2)
>>>>>     y = similar(x)
>>>>>     parapply(tanh_core, (x, y), 0:1, numthreads=numthreads)
>>>>>     y
>>>>> end
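>>>>>
>>>>> For reference, a usage sketch (this assumes a build of the branch from
>>>>> the pull request above, so that parapply is available; timings will of
>>>>> course vary by machine):
>>>>>
>>>>> x = rand(100000, 200)
>>>>> @time y = ptanh(x, numthreads=2)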
>>>>>
>>>>>
>>>>> I actually want this to be fast also for
>>>>>
>>>>> function tanh_core(x, y, i)
>>>>>     y[i] = tanh(x[i])
>>>>> end
>>>>>
>>>>> function ptanh(x; numthreads=2)
>>>>>     y = similar(x)
>>>>>     N = length(x)
>>>>>     parapply(tanh_core, (x, y), 1:N, numthreads=numthreads)
>>>>>     y
>>>>> end
>>>>>
>>>>> On Sunday, 18 May 2014 11:40:13 UTC+2, Carlos Becker wrote:
>>>>>
>>>>>> Now that I think about it, maybe openblas has nothing to do with
>>>>>> this, since @which tanh(y) leads to a call to vectorize_1arg().
>>>>>>
>>>>>> If that's the case, wouldn't it be advantageous to have a
>>>>>> vectorize_1arg_openmp() function (defined in C/C++) that works for
>>>>>> element-wise operations on scalar arrays, multi-threading them with
>>>>>> OpenMP?
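>>>>>>
>>>>>> Concretely, the Julia side could look something like this (a sketch
>>>>>> only: tanh_openmp and libvml_omp are hypothetical names for the
>>>>>> C/OpenMP routine and library proposed above):
>>>>>>
>>>>>> function tanh_omp(x::Array{Float64})
>>>>>>     y = similar(x)
>>>>>>     # hypothetical C routine applying tanh element-wise,
>>>>>>     # parallelized with OpenMP
>>>>>>     ccall((:tanh_openmp, "libvml_omp"), Void,
>>>>>>           (Ptr{Float64}, Ptr{Float64}, Int), x, y, length(x))
>>>>>>     y
>>>>>> end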
>>>>>>
>>>>>>
>>>>>> On Sunday, 18 May 2014 at 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>>>
>>>>>>> forgot to add versioninfo():
>>>>>>>
>>>>>>> julia> versioninfo()
>>>>>>> Julia Version 0.3.0-prerelease+2921
>>>>>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>>>> Platform Info:
>>>>>>>   System: Linux (x86_64-linux-gnu)
>>>>>>>   CPU: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
>>>>>>>   WORD_SIZE: 64
>>>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>>>   LAPACK: libopenblas
>>>>>>>   LIBM: libopenlibm
>>>>>>>
>>>>>>>
>>>>>>> On Sunday, 18 May 2014 at 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>>>
>>>>>>>> This is probably related to openblas, but it seems that tanh() is
>>>>>>>> not multi-threaded, which prevents a considerable speed
>>>>>>>> improvement.
>>>>>>>> For example, MATLAB does multi-thread it and gets around a 3x
>>>>>>>> speed-up over the single-threaded version.
>>>>>>>>
>>>>>>>> For example,
>>>>>>>>
>>>>>>>>   x = rand(100000,200);
>>>>>>>>   @time y = tanh(x);
>>>>>>>>
>>>>>>>> yields:
>>>>>>>>   - 0.71 sec in Julia
>>>>>>>>   - 0.76 sec in MATLAB with -singleCompThread
>>>>>>>>   - 0.09 sec in MATLAB (which uses multi-threading by default)
>>>>>>>>
>>>>>>>> The good news is that Julia (w/ openblas) is competitive with the
>>>>>>>> single-threaded MATLAB version, though setting the env variable
>>>>>>>> OPENBLAS_NUM_THREADS doesn't have any effect on the timings, nor do
>>>>>>>> I see higher CPU usage with 'top'.
>>>>>>>>
>>>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I
>>>>>>>> missing?
>>>>>>>>
>>>>>>>
>>>>
>>>


-- 
Kind regards

Andreas Noack Jensen
