Sure, though the function is Base.parapply; I had imported it explicitly.

In the case of vectorize_1arg it would be great to automatically 
parallelize comprehensions. It would help if someone could tell me where 
the actual looping happens; I have not found it yet. It seems to be 
somewhere in the parser.
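
For reference, here is a hedged sketch of the pattern that @vectorize_1arg 
expands to (simplified from memory, not the exact Base source; mytanh is an 
illustrative name). The comprehensions are where the looping ends up, since 
the front end lowers each comprehension into an explicit fill loop:

# Simplified sketch of the methods @vectorize_1arg generates for a scalar
# function such as tanh (illustrative, not the exact Base source).
mytanh(x::Number) = tanh(x)   # the scalar kernel
# The comprehensions below are the actual loops: the front end lowers each
# one into an explicit loop that fills a freshly allocated output array.
mytanh{T<:Number}(x::AbstractArray{T,1}) = [ mytanh(x[i]) for i = 1:length(x) ]
mytanh{T<:Number}(x::AbstractArray{T,2}) = [ mytanh(x[i,j]) for i = 1:size(x,1), j = 1:size(x,2) ]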

On Sunday, May 18, 2014 at 14:30:49 UTC+2, Carlos Becker wrote:
>
> By the way, does the code you just sent work as-is with your pull request 
> branch?
>
>
> ------------------------------------------
> Carlos
>  
>
> On Sun, May 18, 2014 at 1:04 PM, Carlos Becker <carlos...@gmail.com> wrote:
>
>> Hi Tobias, I saw your pull request and have been following it closely; 
>> nice work ;)
>>
>> Though, for element-wise matrix operations like tanh, there is no need 
>> for extra allocations, since the output buffer only needs to be 
>> allocated once.
>>
>> From your first code snippet, is Julia smart enough to pre-compute the 
>> i*n chunk offset inside the loop?
>> In such cases, creating a kind of array view onto the original data 
>> would probably be faster, right? (Though I don't know how allocations 
>> work here.) A sketch of both options follows.
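>>
>> A hedged sketch of both options (chunk_tanh_core and view_tanh_core are 
>> hypothetical names, and x, y are assumed to be vectors here): hoisting 
>> the offset manually guarantees it is computed once, and sub() gives a 
>> lightweight view without copying:
>>
>> # Option 1: hoist the chunk offset out of the loop by hand
>> # (assumes length(x) is even; i is the chunk index, 0 or 1).
>> function chunk_tanh_core(x, y, i)
>>     n = div(length(x), 2)
>>     offset = i*n               # computed once per call, not per iteration
>>     for l = 1:n
>>         y[l+offset] = tanh(x[l+offset])
>>     end
>> end
>>
>> # Option 2: operate on sub() views, so the kernel needs no offset
>> # arithmetic at all (SubArrays reference the parent data, no copy).
>> function view_tanh_core(x, y, i)
>>     n = div(length(x), 2)
>>     xs = sub(x, (1:n)+i*n)
>>     ys = sub(y, (1:n)+i*n)
>>     for l = 1:n
>>         ys[l] = tanh(xs[l])
>>     end
>> end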
>>
>> For vectorize_1arg_openmp, I was thinking of "hard-coding" it for known 
>> operations, such as trigonometric functions, which benefit a lot from 
>> multi-threading.
>> I know this is a hack, but it is quick to implement and brings an 
>> amazing speed-up (8x in the case of the code I posted above).
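>>
>> For illustration, such a hard-coded path could look roughly like the 
>> sketch below; "libvml" and :vml_tanh are made-up names standing in for 
>> whatever C/OpenMP library would actually provide the kernel:
>>
>> # Hypothetical sketch of a hard-coded multithreaded path for tanh.
>> # The real C side would be an OpenMP loop, roughly:
>> #   #pragma omp parallel for
>> #   for (int k = 0; k < n; k++) y[k] = tanh(x[k]);
>> function openmp_tanh(x::Array{Float64})
>>     y = similar(x)
>>     ccall((:vml_tanh, "libvml"), Void,
>>           (Ptr{Float64}, Ptr{Float64}, Cint),
>>           x, y, length(x))
>>     y
>> end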
>>
>>
>> ------------------------------------------
>> Carlos
>>  
>>
>> On Sun, May 18, 2014 at 12:30 PM, Tobias Knopp <tobias...@googlemail.com> wrote:
>>
>>> Hi Carlos,
>>>
>>> I am working on something that will allow multithreading of Julia 
>>> functions (https://github.com/JuliaLang/julia/pull/6741). Implementing 
>>> vectorize_1arg_openmp is actually a lot less trivial than it looks, as 
>>> the Julia runtime is not thread-safe (yet).
>>>
>>> Your example is great. I first got a 10x slowdown because the example 
>>> revealed a locking issue. With a little trick I now get a speedup of 
>>> 1.75 on a 2-core machine. Not too bad, taking into account that memory 
>>> allocation cannot be parallelized.
>>>
>>> The tweaked code looks like this:
>>>
>>> function tanh_core(x, y, i)
>>>     # each call processes one contiguous half of x
>>>     # (assumes length(x) is even; i is the chunk index, 0 or 1)
>>>     n = div(length(x), 2)
>>>     for l = 1:n
>>>         y[l+i*n] = tanh(x[l+i*n])
>>>     end
>>> end
>>>
>>> function ptanh(x; numthreads=2)
>>>     y = similar(x)
>>>     # run tanh_core once per chunk index in 0:1, spread over the threads
>>>     parapply(tanh_core, (x,y), 0:1, numthreads=numthreads)
>>>     y
>>> end
>>>
>>>
>>> I actually want this to also be fast for:
>>>
>>>
>>> function tanh_core(x, y, i)
>>>     # per-element kernel: parapply partitions 1:N across the threads
>>>     y[i] = tanh(x[i])
>>> end
>>>
>>> function ptanh(x; numthreads=2)
>>>     y = similar(x)
>>>     N = length(x)
>>>     parapply(tanh_core, (x,y), 1:N, numthreads=numthreads)
>>>     y
>>> end
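>>>
>>> For illustration, a usage sketch, assuming a build of the #6741 branch 
>>> where parapply is available:
>>>
>>> x = rand(100000, 200)
>>> @time y = ptanh(x, numthreads=2)   # compare against @time tanh(x)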
>>>
>>> On Sunday, May 18, 2014 at 11:40:13 UTC+2, Carlos Becker wrote:
>>>
>>>> Now that I think about it, maybe OpenBLAS has nothing to do with this, 
>>>> since @which tanh(y) points at a method generated by vectorize_1arg.
>>>>
>>>> If that's the case, wouldn't it be advantageous to have a 
>>>> vectorize_1arg_openmp() function (defined in C/C++) that performs 
>>>> element-wise operations on numeric arrays, multi-threaded with OpenMP?
>>>>
>>>>
>>>> On Sunday, May 18, 2014 at 11:34:11 UTC+2, Carlos Becker wrote:
>>>>>
>>>>> forgot to add versioninfo():
>>>>>
>>>>> julia> versioninfo()
>>>>> Julia Version 0.3.0-prerelease+2921
>>>>> Commit ea70e4d* (2014-05-07 17:56 UTC)
>>>>> Platform Info:
>>>>>   System: Linux (x86_64-linux-gnu)
>>>>>   CPU: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
>>>>>   WORD_SIZE: 64
>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
>>>>>   LAPACK: libopenblas
>>>>>   LIBM: libopenlibm
>>>>>
>>>>>
>>>>> On Sunday, May 18, 2014 at 11:33:45 UTC+2, Carlos Becker wrote:
>>>>>>
>>>>>> This is probably related to OpenBLAS, but it seems that tanh() is 
>>>>>> not multi-threaded, which leaves a considerable speed improvement on 
>>>>>> the table. MATLAB, for example, does multi-thread it and gets roughly 
>>>>>> a 3x speed-up over its single-threaded version.
>>>>>>
>>>>>> For example,
>>>>>>
>>>>>>   x = rand(100000,200);
>>>>>>   @time y = tanh(x);
>>>>>>
>>>>>> yields:
>>>>>>   - 0.71 sec in Julia
>>>>>>   - 0.76 sec in MATLAB with -singleCompThread
>>>>>>   - 0.09 sec in MATLAB (which multi-threads by default)
>>>>>>
>>>>>> The good news is that Julia (with OpenBLAS) is competitive with the 
>>>>>> single-threaded MATLAB version, though setting the environment 
>>>>>> variable OPENBLAS_NUM_THREADS has no effect on the timings, nor do I 
>>>>>> see higher CPU usage in 'top'.
>>>>>>
>>>>>> Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I 
>>>>>> missing?
>>>>>>
>>>>>
>>  
>
