Sounds great!
I just gave it a try, and with 16 threads I get 0.07 sec, which is impressive.

That was when I tried it in isolated code, though. When put together with other
Julia code I have, it segfaults. Have you experienced this as well?
On 18 May 2014 16:05, "Tobias Knopp" <tobias.kn...@googlemail.com> wrote:

> Sure, the function is Base.parapply, though. I had explicitly imported it.
>
> In the case of vectorize_1arg, it would be great to automatically
> parallelize comprehensions. If someone could tell me where the actual
> looping happens, that would help; I have not found it yet. It seems to
> be somewhere in the parser.
>
> On Sunday, 18 May 2014 14:30:49 UTC+2, Carlos Becker wrote:
>>
>> btw, the code you just sent works as is with your pull request branch?
>>
>>
>> ------------------------------------------
>> Carlos
>>
>>
>> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>>
>>> Hi Tobias, I saw your pull request and have been following it closely,
>>> nice work ;)
>>>
>>> Though, in the case of element-wise matrix operations like tanh, there
>>> is no need for extra allocations, since the output buffer only needs to be
>>> allocated once.
>>>
>>> From your first code snippet, is Julia smart enough to pre-compute i*N/2?
>>> In such cases, creating a kind of array view on the original data would
>>> probably be faster, right? (Though I don't know how allocations work here.)
>>>
>>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for known
>>> operations such as trigonometric ones, that benefit a lot from
>>> multi-threading.
>>> I know this is a hack, but it is quick to implement and brings an
>>> amazing speed-up (8x in the case of the code I posted above).
>>>
>>>
>>>
>>>
>>> ------------------------------------------
>>> Carlos
>>>
>>>
>>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp <tobias...@googlemail.com> wrote:
>>>
>>>> Hi Carlos,
>>>>
>>>> I am working on something that will allow multithreading Julia
>>>> functions (https://github.com/JuliaLang/julia/pull/6741). Implementing
>>>> vectorize_1arg_openmp is actually a lot less trivial, as the Julia runtime
>>>> is not thread-safe (yet).
>>>>
>>>> Your example is great. I first got a 10x slowdown because the example
>>>> revealed a locking issue. With a little trick I now get a speedup of 1.75
>>>> on a 2-core machine. Not too bad, taking into account that memory
>>>> allocation cannot be parallelized.
>>>>
>>>> The tweaked code looks like this:
>>>>
>>>> function tanh_core(x, y, i)
>>>>     n = div(length(x), 2)  # integer division, so the indices stay Int
>>>>     for l = 1:n
>>>>         y[l + i*n] = tanh(x[l + i*n])
>>>>     end
>>>> end
>>>>
>>>> function ptanh(x; numthreads=2)
>>>>     y = similar(x)
>>>>     parapply(tanh_core, (x, y), 0:1, numthreads=numthreads)
>>>>     y
>>>> end
>>>>
>>>>
>>>> I actually want this to also be fast for:
>>>>
>>>>
>>>> function tanh_core(x, y, i)
>>>>     y[i] = tanh(x[i])
>>>> end
>>>>
>>>> function ptanh(x; numthreads=2)
>>>>     y = similar(x)
>>>>     N = length(x)
>>>>     parapply(tanh_core, (x, y), 1:N, numthreads=numthreads)
>>>>     y
>>>> end
>>>>
>>>> On Sunday, 18 May 2014 11:40:13 UTC+2, Carlos Becker wrote:
>>>>
>>>>> Now that I think about it, maybe OpenBLAS has nothing to do with this,
>>>>> since @which tanh(y) leads to a call to vectorize_1arg().
>>>>>
>>>>> If that's the case, wouldn't it be advantageous to have a
>>>>> vectorize_1arg_openmp() function (defined in C/C++) that performs
>>>>> element-wise operations on numeric arrays, multi-threaded with OpenMP?
>>>>>
>>>>>
>>>>> On Sunday, 18 May 2014 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>>
>>>>>> forgot to add versioninfo():
>>>>>>
>>>>>> julia> versioninfo()
>>>>>> Julia Version 0.3.0-prerelease+2921
>>>>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>>> Platform Info:
>>>>>>   System: Linux (x86_64-linux-gnu)
>>>>>>   CPU: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
>>>>>>   WORD_SIZE: 64
>>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>>   LAPACK: libopenblas
>>>>>>   LIBM: libopenlibm
>>>>>>
>>>>>>
>>>>>> On Sunday, 18 May 2014 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>>
>>>>>>> This is probably related to OpenBLAS, but it seems that tanh()
>>>>>>> is not multi-threaded, which prevents a considerable speed improvement.
>>>>>>> For example, MATLAB does multi-thread it and gets roughly a 3x
>>>>>>> speed-up over the single-threaded version.
>>>>>>>
>>>>>>> For example,
>>>>>>>
>>>>>>>   x = rand(100000,200);
>>>>>>>   @time y = tanh(x);
>>>>>>>
>>>>>>> yields:
>>>>>>>   - 0.71 sec in Julia
>>>>>>>   - 0.76 sec in matlab with -singleCompThread
>>>>>>>   - and 0.09 sec in Matlab (this one uses multi-threading by default)
>>>>>>>
>>>>>>> The good news is that Julia (with OpenBLAS) is competitive with
>>>>>>> MATLAB's single-threaded version, though setting the environment
>>>>>>> variable OPENBLAS_NUM_THREADS doesn't have any effect on the timings,
>>>>>>> nor do I see higher CPU usage in 'top'.
>>>>>>>
>>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I
>>>>>>> missing?
>>>>>>>
>>>>>>
>>>
>>
