On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
<igor.cerov...@2bridgz.com> wrote:
> I mentioned the factor of ~20% just because it is something I have
> commonly observed; of course it can vary, and it isn't that important.
>
> What bothers me is: why does the performance drop by a factor of ~2 when I
> combine two routines, each of which alone causes only a ~20% slowdown?

I looked at the IJulia notebook you posted, but it wasn't obvious
which routines you mean. Can you point to exactly the source code you
are comparing?

-erik

> In other words, I have routines foo() and bar() and their BLAS equivalents
> fooblas() and barblas(), where
> @elapsed foo() / @elapsed fooblas() ~= 1.2
> and the same holds for bar. Consider the following pseudo-code:
> function foobar()
>   for k in 1:N
>     foo()  # my Julia implementation of a BLAS function, for example gemv
>     bar()  # my Julia implementation of a BLAS function, for example ger
>   end
> end
>
>
> function foobarblas()
>   for k in 1:N
>     fooblas()  # this is equivalent of foo in BLAS for example gemv
>     barblas()  # this is equivalent of bar in BLAS for example ger
>   end
> end
> then @elapsed foobar() / @elapsed foobarblas() ~= 2.6
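>
> (For concreteness, here is a minimal sketch of what such a comparison could
> look like for the gemv/ger pair. The mygemv! body below is only an
> illustrative plain-loop version, not necessarily identical to the one in my
> original post, rank1update! refers to the routine quoted further down, and
> N and the sizes are arbitrary.)
>
> using LinearAlgebra   # needed for BLAS.* on Julia >= 0.7; it was in Base back then
>
> # illustrative pure-Julia gemv: y = A*x
> function mygemv!(y, A, x)
>     fill!(y, zero(eltype(y)))
>     for j in 1:size(A, 2)
>         @fastmath @inbounds @simd for i in 1:size(A, 1)
>             y[i] += A[i, j] * x[j]
>         end
>     end
> end
>
> function foobar!(A, x, y, N)
>     for k in 1:N
>         mygemv!(y, A, x)        # pure-Julia gemv
>         rank1update!(A, x, y)   # pure-Julia ger (quoted further down)
>     end
> end
>
> function foobarblas!(A, x, y, N)
>     for k in 1:N
>         BLAS.gemv!('N', 1.0, A, x, 0.0, y)   # y = 1.0*A*x + 0.0*y
>         BLAS.ger!(1.1, x, y, A)              # A += 1.1*x*y'
>     end
> end
>
> # e.g. with A = rand(1000, 1000); x = rand(1000); y = zeros(1000); N = 100:
> # @elapsed foobar!(copy(A), x, copy(y), N) / @elapsed foobarblas!(copy(A), x, copy(y), N)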
>
>
> On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>>
>> The architecture-specific, manual BLAS optimizations don't just give
>> you an additional 20%. They can improve speed by a factor of a few.
>>
>> If you see a factor of 2.6, then that probably has to be accepted,
>> unless you really look into the details (examine the generated assembler
>> code, measure cache misses, introduce manual vectorization and loop
>> unrolling, etc.). And you'll have to repeat that analysis if you're
>> using a different system.
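>>
>> For example, one quick way to start that kind of inspection in Julia, shown
>> here for the rank1update! routine quoted further down (just a sketch; the
>> sizes are arbitrary):
>>
>> A = rand(1000, 1000); x = rand(1000); y = rand(1000)
>> @code_llvm rank1update!(A, x, y)     # LLVM IR: check whether the inner loop was vectorized
>> @code_native rank1update!(A, x, y)   # machine code actually generated for this CPU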
>>
>> -erik
>>
>> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
>> <igor.c...@2bridgz.com> wrote:
>> > Well, maybe the subject of the post is confusing. I've tried to write an
>> > algorithm which runs approximately as fast as the BLAS functions, but
>> > using a pure Julia implementation. Sure, we know that BLAS is highly
>> > optimized; I didn't want to beat BLAS, just to be a bit slower, let us
>> > say ~1.2-times.
>> >
>> > If I take a part of the algorithm and run it separately, all works fine.
>> > Consider the code below:
>> > # pure-Julia rank-1 update: A += 1.1 * x * y'
>> > function rank1update!(A, x, y)
>> >     for j = 1:size(A, 2)
>> >         @fastmath @inbounds @simd for i = 1:size(A, 1)
>> >             A[i,j] += 1.1 * y[j] * x[i]
>> >         end
>> >     end
>> > end
>> >
>> > # the same rank-1 update via BLAS
>> > function rank1updateb!(A, x, y)
>> >     BLAS.ger!(1.1, x, y, A)
>> > end
>> >
>> > Here BLAS is ~1.2-times faster.
>> > However, when it is called together with 'mygemv!' in the loop (see the
>> > code in the original post), the performance drops to ~2.6 times slower
>> > than using the BLAS functions (gemv, ger).
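>> >
>> > (A rough sketch of how I time the isolated case; the matrices are
>> > 1000 x 1000 as mentioned below, and the exact numbers of course vary
>> > from run to run:)
>> >
>> > using LinearAlgebra   # for BLAS.ger! on current Julia versions
>> >
>> > A = rand(1000, 1000); B = copy(A)
>> > x = rand(1000); y = rand(1000)
>> > rank1update!(A, x, y); rank1updateb!(B, x, y)   # warm up / compile first
>> > t_julia = @elapsed rank1update!(A, x, y)
>> > t_blas  = @elapsed rank1updateb!(B, x, y)
>> > println(t_julia / t_blas)   # ~1.2 here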
>> >
>> >
>> >
>> >
>> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>> >>
>> >> I'm not sure what the expected result here is. BLAS is designed to be
>> >> as fast as possible at matrix multiply. I'd be more concerned if you
>> >> wrote straightforward loop code and beat BLAS, since that would mean
>> >> the BLAS is badly mistuned.
>> >>
>> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <igor.c...@2bridgz.com>
>> >> wrote:
>> >>>
>> >>> Thanks Steven, I thought there was something more behind it...
>> >>>
>> >>> I should note that I forgot to mention the matrix dimensions, which are
>> >>> 1000 x 1000.
>> >>>
>> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>> >>>>
>> >>>> You need a lot more than just fast loops to match the performance of
>> >>>> an optimized BLAS. See e.g. this notebook for some comments on the
>> >>>> related case of matrix multiplication:
>> >>>>
>> >>>>
>> >>>>
>> >>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
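>> >>>>
>> >>>> (For illustration, roughly the kind of "just fast loops" baseline that
>> >>>> gets contrasted with the BLAS-backed built-in multiplication; a sketch
>> >>>> only, the function name is made up here:)
>> >>>>
>> >>>> function naive_matmul!(C, A, B)
>> >>>>     fill!(C, zero(eltype(C)))
>> >>>>     for j in 1:size(B, 2), k in 1:size(A, 2)
>> >>>>         @inbounds @simd for i in 1:size(A, 1)
>> >>>>             C[i, j] += A[i, k] * B[k, j]
>> >>>>         end
>> >>>>     end
>> >>>>     return C
>> >>>> end
>> >>>>
>> >>>> # compare, e.g.: A = rand(1000, 1000); B = rand(1000, 1000); C = zeros(1000, 1000)
>> >>>> # @elapsed naive_matmul!(C, A, B)  vs.  @elapsed A * B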
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Erik Schnetter <schn...@gmail.com>
>> http://www.perimeterinstitute.ca/personal/eschnetter/



-- 
Erik Schnetter <schnet...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/
