Re: [julia-users] Re: tanh() speed / multi-threading

Tobias Knopp Sun, 18 May 2014 08:27:06 -0700

And I am pretty excited that it seems to scale so well on your setup. I 
have only 2 cores, so I could not check whether it scales to more cores.


On Sunday, 18 May 2014 16:40:18 UTC+2, Tobias Knopp wrote:
>
> Well, when I started I got segfaults all the time :-)
>
> Could you please send me a minimal code example that segfaults? This would 
> be great! This is the only way we can get this stable.
>
> On Sunday, 18 May 2014 16:35:47 UTC+2, Carlos Becker wrote:
>>
>> Sounds great!
>> I just gave it a try, and with 16 threads I get 0.07 sec, which is 
>> impressive.
>>
>> That is when I tried it in isolated code. When combined with other 
>> Julia code I have, it segfaults. Have you experienced this as well?
>> On 18 May 2014 16:05, "Tobias Knopp" <tobias...@googlemail.com> wrote:
>>
>>> Sure, though the function is Base.parapply. I had explicitly imported it.
>>>
>>> In the case of vectorize_1arg, it would be great to automatically 
>>> parallelize comprehensions. If someone could tell me where the actual 
>>> looping happens, that would be great. I have not found it yet; it seems to 
>>> be somewhere in the parser.
>>>
>>> On Sunday, 18 May 2014 14:30:49 UTC+2, Carlos Becker wrote:
>>>>
>>>> Btw, does the code you just sent work as is with your pull request branch?
>>>>
>>>>
>>>> ------------------------------------------
>>>> Carlos
>>>>  
>>>>
>>>> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>>>>
>>>>> Hi Tobias, I saw your pull request and have been following it closely, 
>>>>> nice work ;)
>>>>>
>>>>> Though, in the case of element-wise matrix operations, like tanh, 
>>>>> there is no need for extra allocations, since the buffer should be 
>>>>> allocated only once.
>>>>>
>>>>> From your first code snippet, is Julia smart enough to pre-compute 
>>>>> i*N/2?
>>>>> In such cases, creating a kind of array view on the original data 
>>>>> would probably be faster, right? (Though I don't know how allocations 
>>>>> work here.)
>>>>>
>>>>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for 
>>>>> known operations, such as trigonometric ones, which benefit a lot from 
>>>>> multi-threading.
>>>>> I know this is a hack, but it is quick to implement and brings an 
>>>>> amazing speed-up (8x in the case of the code I posted above).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------
>>>>> Carlos
>>>>>  
>>>>>
>>>>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp <
>>>>> tobias...@googlemail.com> wrote:
>>>>>
>>>>>> Hi Carlos,
>>>>>>
>>>>>> I am working on something that will allow multithreading of 
>>>>>> Julia functions (https://github.com/JuliaLang/julia/pull/6741). 
>>>>>> Implementing vectorize_1arg_openmp is actually a lot less trivial than it 
>>>>>> looks, as the Julia runtime is not thread-safe (yet).
>>>>>>
>>>>>> Your example is great. I first got a slowdown by a factor of 10 because the 
>>>>>> example revealed a locking issue. With a little trick I now get a 
>>>>>> speedup of 1.75 on a 2-core machine. Not too bad, taking into account that 
>>>>>> memory allocation cannot be parallelized.
>>>>>>
>>>>>> The tweaked code looks like
>>>>>>
>>>>>> function tanh_core(x,y,i)
>>>>>>     N=length(x)
>>>>>>     for l=1:N/2
>>>>>>       y[l+i*N/2] = tanh(x[l+i*N/2])
>>>>>>     end
>>>>>> end
>>>>>>
>>>>>> function ptanh(x;numthreads=2)
>>>>>>     y = similar(x)
>>>>>>     N = length(x)
>>>>>>     parapply(tanh_core,(x,y), 0:1, numthreads=numthreads)
>>>>>>     y
>>>>>> end
>>>>>>
>>>>>>
>>>>>> I actually want this to be also fast for
>>>>>>
>>>>>>
>>>>>> function tanh_core(x,y,i)
>>>>>>     y[i] = tanh(x[i])
>>>>>> end
>>>>>>
>>>>>> function ptanh(x;numthreads=2)
>>>>>>     y = similar(x)
>>>>>>     N = length(x)
>>>>>>     parapply(tanh_core,(x,y), 1:N, numthreads=numthreads)
>>>>>>     y
>>>>>> end
>>>>>>
>>>>>> On Sunday, 18 May 2014 11:40:13 UTC+2, Carlos Becker wrote:
>>>>>>
>>>>>>> Now that I think about it, maybe OpenBLAS has nothing to do with this, 
>>>>>>> since @which tanh(y) leads to a call to vectorize_1arg().
>>>>>>>
>>>>>>> If that's the case, wouldn't it be advantageous to have a 
>>>>>>> vectorize_1arg_openmp() function (defined in C/C++) that works for 
>>>>>>> element-wise operations on scalar arrays, multi-threaded with OpenMP?
>>>>>>>
>>>>>>>
>>>>>>> On Sunday, 18 May 2014 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>>>>
>>>>>>>> forgot to add versioninfo():
>>>>>>>>
>>>>>>>> julia> versioninfo()
>>>>>>>> Julia Version 0.3.0-prerelease+2921
>>>>>>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>>>>> Platform Info:
>>>>>>>>   System: Linux (x86_64-linux-gnu)
>>>>>>>>   CPU: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
>>>>>>>>   WORD_SIZE: 64
>>>>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>>>>   LAPACK: libopenblas
>>>>>>>>   LIBM: libopenlibm
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sunday, 18 May 2014 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>>>>
>>>>>>>>> This is probably related to OpenBLAS, but it seems that 
>>>>>>>>> tanh() is not multi-threaded, which rules out a considerable speed 
>>>>>>>>> improvement.
>>>>>>>>> MATLAB, by contrast, does multi-thread it and gets something around 
>>>>>>>>> 3x speed-up over the single-threaded version.
>>>>>>>>>
>>>>>>>>> For example,
>>>>>>>>>
>>>>>>>>>   x = rand(100000,200);
>>>>>>>>>   @time y = tanh(x);
>>>>>>>>>
>>>>>>>>> yields:
>>>>>>>>>   - 0.71 sec in Julia
>>>>>>>>>   - 0.76 sec in MATLAB with -singleCompThread
>>>>>>>>>   - 0.09 sec in MATLAB (which uses multi-threading by default)
>>>>>>>>>
>>>>>>>>> The good news is that Julia (with OpenBLAS) is competitive with the 
>>>>>>>>> single-threaded MATLAB version, though setting the environment variable 
>>>>>>>>> OPENBLAS_NUM_THREADS has no effect on the timings, nor do I see 
>>>>>>>>> higher CPU usage with 'top'.
>>>>>>>>>
>>>>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I 
>>>>>>>>> missing?
>>>>>>>>>
>>>>>>>>
>>>>>  
>>>>
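
For reference, the chunked approach discussed in this thread (split the
index range into one block per thread and write into a buffer that is
allocated once) can be sketched with the Base.Threads API that ships with
current Julia releases. This is only a sketch under assumptions: it does
not use parapply from the pull request, the names ptanh! and ptanh and the
nchunks keyword are made up for illustration, and it computes the chunk
bounds with integer division because current Julia rejects floating-point
indices such as l + i*N/2.

```julia
using Base.Threads: @threads, nthreads

# In-place, chunked element-wise tanh.
# NOTE: `ptanh!`, `ptanh` and `nchunks` are hypothetical names for this
# sketch; they are not part of Base or of the pull request in the thread.
function ptanh!(y::Array, x::Array; nchunks::Integer = nthreads())
    length(x) == length(y) ||
        throw(DimensionMismatch("x and y must have the same length"))
    nchunks >= 1 || throw(ArgumentError("nchunks must be at least 1"))
    N = length(x)
    @threads for c in 1:nchunks
        # Integer chunk bounds: the chunks are contiguous and together
        # cover 1:N exactly (a chunk is empty when nchunks > N).
        lo = div((c - 1) * N, nchunks) + 1
        hi = div(c * N, nchunks)
        @inbounds for l in lo:hi
            y[l] = tanh(x[l])
        end
    end
    return y
end

# Allocating convenience wrapper: the output buffer is created once.
ptanh(x::Array; kwargs...) = ptanh!(similar(x), x; kwargs...)

# Usage: start Julia with several threads (e.g. `julia --threads 4`).
x = rand(1000, 20)
y = ptanh(x)        # allocates the result
ptanh!(y, x)        # reuses an existing buffer
```

The in-place variant addresses the point about avoiding extra allocations:
a caller that applies tanh repeatedly can reuse y. In current Julia the
single-threaded equivalent is the broadcast tanh.(x), which replaced the
vectorize_1arg mechanism mentioned above.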
