You may want to try using a profiler. I recently used the ProfileView.jl 
<https://github.com/timholy/ProfileView.jl> package to great success. 
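[Editor's note: a minimal sketch of the profiling workflow suggested above. The function `work` is illustrative, not from the original code; `@profile` collects samples that `Profile.print()` reports as text and `ProfileView.view()` renders as a flame graph.]

```julia
# Sketch: profile a hot function with Julia's built-in sampling profiler.
# (On Julia 0.3 the profiler lives in Base; on later versions, `using Profile`.)

function work(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)   # arbitrary arithmetic to generate samples
    end
    return s
end

work(1)                 # warm up: compile before profiling, so samples
                        # reflect the optimized code, not compilation
Profile.clear()
@profile work(10_000_000)
Profile.print()         # text report; ProfileView.view() for the flame graph
```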

On Friday, February 20, 2015 at 11:53:56 PM UTC-8, Zhixuan Yang wrote:
>
> After recompiling a native-arch build of Julia and OpenBLAS, my code is 
> about 8x slower than the C code, and I think that is close to the highest 
> performance it can achieve: the C code was intensively optimized at the 
> cache level, with all loops unrolled, whereas my Julia code is much more 
> flexible and extensible. 
>
> Maybe I should try using more machines. Currently my code is parallelized 
> using pmap(). I hope communication overhead will not become a new 
> bottleneck when I run it on a local network cluster.
>
> Thanks for your help! 
>
> Regards, Yang Zhixuan
>
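[Editor's note: a minimal sketch of the pmap() pattern described above. The names `train_chunk` and `chunks` are illustrative, not from the original code. Start workers with `julia -p 4` or `addprocs(4)`; on Julia 0.7 and later, `using Distributed` is also required.]

```julia
# pmap() ships each chunk to a worker, runs the function there, and
# collects the results; with no workers attached it falls back to a
# serial map, so the sketch runs either way.

@everywhere function train_chunk(chunk)
    # each worker handles one chunk independently; only the chunk and
    # its result cross the network, so larger chunks amortize the
    # per-call serialization overhead the poster is worried about
    sum(abs2, chunk)
end

chunks = [rand(10_000) for _ in 1:8]   # eight independent work units
results = pmap(train_chunk, chunks)    # one result per chunk
```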
> On Saturday, February 21, 2015 at 2:23:37 PM UTC+8, Viral Shah wrote:
>>
>> So, where is the performance now compared to the C program? I don't think 
>> MKL will give you much if you were on the order of 100x slower to start 
>> with.
>>
>> -viral
>>
>> On Friday, February 20, 2015 at 8:19:50 PM UTC+5:30, Zhixuan Yang wrote:
>>>
>>> Mauro, Sean, and Tim, thanks for your help. 
>>>
>>> Following your suggestions, I removed keyword arguments and split the 
>>> function to avoid conditional statements. These helped a bit. 
>>>
>>> But I got a surprising result after replacing the BLAS functions with 
>>> simple for loops: the loops are about 1.5x faster than the BLAS calls. 
>>> My Julia is compiled on my computer with the default configuration (the 
>>> versioninfo() output is listed below). Do you think it would help to 
>>> build Julia against a faster, more optimized BLAS implementation such as 
>>> Intel's MKL? 
>>>
>>> Julia Version 0.3.6-pre+70
>>> Commit 638fa02 (2015-02-12 13:59 UTC)
>>> Platform Info:
>>>  System: Darwin (x86_64-apple-darwin14.1.0)
>>>  CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
>>>  WORD_SIZE: 64
>>>  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>  LAPACK: libopenblas
>>>  LIBM: libopenlibm
>>>  LLVM: libLLVM-3.3
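[Editor's note: the 1.5x result above is plausible, because for the small vectors typical of word-embedding training the fixed overhead of a BLAS call can exceed the work itself. A hedged sketch of a plain-loop rank-1 update with the same semantics as `BLAS.ger!`, i.e. `W += α * x * y'`; the name `ger_loops!` is illustrative.]

```julia
# Plain-loop equivalent of BLAS.ger!(α, x, y, W): W += α * x * y'.
# The inner loop runs down columns (first index) to match Julia's
# column-major memory layout.
function ger_loops!(α::Float64, x::Vector{Float64},
                    y::Vector{Float64}, W::Matrix{Float64})
    @inbounds for j in 1:length(y)
        αyj = α * y[j]          # hoist the column-constant factor
        for i in 1:length(x)
            W[i, j] += x[i] * αyj
        end
    end
    return W
end
```

For large matrices BLAS will win back the advantage; the crossover depends on the dimensions, which is worth measuring before committing to either version.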
>>>
>>>
>>> Regards, Yang Zhixuan
>>>
>>> On Thursday, February 19, 2015 at 10:51:20 PM UTC+8, Zhixuan Yang wrote:
>>>>
>>>> Hello everyone, 
>>>>
>>>> Recently I've been working on my first Julia project, a word embedding 
>>>> training program similar to Google's word2vec 
>>>> <https://code.google.com/p/word2vec/> (the word2vec code is indeed 
>>>> very high-quality, but I want to add more features, so I decided to 
>>>> write a new one). Thanks to Julia's expressiveness, it took me less 
>>>> than two days to write the entire program. But it runs really slowly, 
>>>> about 100x slower than the word2vec C code (the algorithm is the same). 
>>>> I've been trying to optimize my code for several days (adding type 
>>>> annotations, using BLAS for computation, eliminating memory 
>>>> allocations, ...), but it is still 30x slower than the C code. 
>>>>
>>>> The critical part of my program is the following function (it also 
>>>> consumes most of the time according to the profiling result):
>>>>
>>>> function train_one(c :: LinearClassifier, x :: Array{Float64},
>>>>                    y :: Int64; α :: Float64 = 0.025,
>>>>                    input_gradient :: Union(Nothing, Array{Float64}) = nothing)
>>>>     predict!(c, x)
>>>>     c.outputs[y] -= 1
>>>>
>>>>     if input_gradient != nothing
>>>>         # input_gradient = ( c.weights * outputs' )'
>>>>         BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
>>>>     end
>>>>
>>>>     # c.weights -= α * x' * outputs;
>>>>     BLAS.ger!(-α, vec(x), c.outputs, c.weights)
>>>> end
>>>>
>>>> function predict!(c :: LinearClassifier, x :: Array{Float64})
>>>>     c.outputs = vec(softmax(x * c.weights))
>>>> end
>>>>
>>>> type LinearClassifier
>>>>     k :: Int64 # number of outputs
>>>>     n :: Int64 # number of inputs
>>>>     weights :: Array{Float64, 2} # k * n weight matrix
>>>>
>>>>     outputs :: Vector{Float64}
>>>> end
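[Editor's note: one likely allocation hot spot in the code above is `predict!`, which allocates fresh temporaries on every call via `x * c.weights`, `softmax`, and `vec`. A hedged sketch of an allocation-free variant; the name `predict_inplace!` is illustrative, and it treats the weight matrix as n-inputs-by-k-outputs, which is what `x * c.weights` and the `gemv!('N', ...)` call above imply.]

```julia
# In-place predict: out = softmax(W' * x), written with explicit loops
# so no temporary arrays are allocated in the training hot path.
function predict_inplace!(out::Vector{Float64}, x::Vector{Float64},
                          W::Matrix{Float64})
    k = size(W, 2)
    m = -Inf
    @inbounds for j in 1:k
        s = 0.0
        for i in 1:length(x)
            s += W[i, j] * x[i]    # j-th logit: dot(W[:, j], x)
        end
        out[j] = s
        m = max(m, s)              # track the max for numerical stability
    end
    z = 0.0
    @inbounds for j in 1:k
        out[j] = exp(out[j] - m)   # subtract max before exp to avoid overflow
        z += out[j]
    end
    @inbounds for j in 1:k
        out[j] /= z                # normalize to probabilities
    end
    return out
end
```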
>>>>
>>>> The entire program can be found here 
>>>> <https://github.com/yangzhixuan/embed>. Could you please check my code 
>>>> and tell me what I can do to get performance comparable to the C version? 
>>>>
>>>> Regards.
>>>> Yang Zhixuan
>>>>
>>>
