I get a time ratio (bc / bb) of 1.1

It could be that you're just having bad luck with the particular
optimization decisions that LLVM makes for the combined code, or with
the parameters (sizes) for this benchmark. Maybe the performance
difference changes for different matrix sizes? There are a million
things you can try, e.g. starting Julia with the "-O" option, or using
a different LLVM version. What would really help is to gather more
detailed information, e.g. by looking at the disassembled loop kernels
(to see whether something is wrong), using a profiler to see where
the time is spent (Julia has a built-in profiler), or gathering
statistics about floating-point instructions executed and cache
operations (that requires an external tool).
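
For the profiler, a minimal sketch would look something like this
(`mgs` and the 1000x1000 size are stand-ins for whatever you are
actually benchmarking; on recent Julia versions the profiler lives in
the Profile standard library):

using Profile            # built-in sampling profiler

A = rand(1000, 1000)
mgs(copy(A))             # run once first so compilation is not profiled
Profile.clear()
@profile mgs(copy(A))    # collect samples while the algorithm runs
Profile.print()          # report of where the time is spent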

The disassembled code is CPU-specific and also depends on the LLVM
version. I'd be happy to have a quick glance at it if you create a
listing (with `@code_native`) and e.g. put it up as a gist
<gist.github.com>. I'd also need your CPU type (`versioninfo()` in
Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though.
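
For reference, the listing could be produced roughly like this (the
`rank1update!` kernel and the 1000x1000 sizes are taken from further
down in the thread; the file name is just an example):

A = rand(1000, 1000); x = rand(1000); y = rand(1000)

@code_native rank1update!(A, x, y)   # prints the disassembly in the REPL

# or, to write it to a file for a gist:
open("rank1update_native.txt", "w") do io
    code_native(io, rank1update!, typeof((A, x, y)))
end

versioninfo()   # CPU and LLVM version, to include alongside the listing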

-erik

On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky
<igor.cerov...@2bridgz.com> wrote:
> I've attached two notebooks; you can check the comparisons there.
> The first one compares the rank1update! and rank1updateb! functions. The
> Julia-to-BLAS comparison gives a ratio of 1.13, which is nice. The same
> applies to mygemv vs BLAS.gemv.
> Combining the same routines into the mgs algorithm from the very first
> post, the resulting ratio mgs / mgs_blas is 2.6 on my computer with an
> i7-6700HQ (that is important to mention, because on older processors the
> difference is not that big; it is similar to comparing the routines
> rank1update! and BLAS.ger!). This is what I'm trying to figure out.
>
>
> On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
>>
>> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
>> <igor.c...@2bridgz.com> wrote:
>> > I mentioned the factor of ~20% just because it is something I've
>> > commonly observed; of course it can vary, and it isn't that important.
>> >
>> > What bothers me is: why does the performance drop 2-fold when I combine
>> > two routines, where each one alone is only ~1.2-times slower?
>>
>> I looked at the IJulia notebook you posted, but it wasn't obvious
>> which routines you mean. Can you point to exactly the source code you
>> are comparing?
>>
>> -erik
>>
>> > In other words, I have routines foo() and bar() and their BLAS
>> > equivalents fooblas() and barblas(), where
>> > @elapsed foo() / @elapsed fooblas() ~= 1.2
>> > and the same holds for bar. Consider the following pseudo-code:
>> >
>> > function foobar()
>> >   for k in 1:N
>> >     foo()  # my Julia implementation of a BLAS function, for example gemv
>> >     bar()  # my Julia implementation of a BLAS function, for example ger
>> >   end
>> > end
>> >
>> > function foobarblas()
>> >   for k in 1:N
>> >     fooblas()  # the BLAS equivalent of foo, for example gemv
>> >     barblas()  # the BLAS equivalent of bar, for example ger
>> >   end
>> > end
>> >
>> > Then @elapsed foobar() / @elapsed foobarblas() ~= 2.6.
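>> >
>> > (To make that concrete, here is a minimal self-contained sketch of such
>> > a comparison; the kernels are simple stand-ins with assumed signatures,
>> > and N and the sizes are chosen arbitrarily:)
>> >
>> > using LinearAlgebra      # for BLAS; not needed on older Julia
>> >
>> > function mygemv!(y, A, x)                # "foo": pure-Julia y = A*x
>> >     fill!(y, 0.0)
>> >     for j = 1:size(A, 2)
>> >         @fastmath @inbounds @simd for i = 1:size(A, 1)
>> >             y[i] += A[i,j] * x[j]
>> >         end
>> >     end
>> > end
>> >
>> > function myger!(A, x, y)                 # "bar": pure-Julia rank-1 update
>> >     for j = 1:size(A, 2)
>> >         @fastmath @inbounds @simd for i = 1:size(A, 1)
>> >             A[i,j] += 1.1 * y[j] * x[i]
>> >         end
>> >     end
>> > end
>> >
>> > function foobar!(A, x, y, N)
>> >     for k in 1:N
>> >         mygemv!(y, A, x)
>> >         myger!(A, x, y)
>> >     end
>> > end
>> >
>> > function foobarblas!(A, x, y, N)
>> >     for k in 1:N
>> >         BLAS.gemv!('N', 1.0, A, x, 0.0, y)
>> >         BLAS.ger!(1.1, x, y, A)
>> >     end
>> > end
>> >
>> > A = rand(1000, 1000); x = rand(1000); y = rand(1000); N = 50
>> > foobar!(copy(A), x, copy(y), 1)          # warm up (compile) both versions
>> > foobarblas!(copy(A), x, copy(y), 1)
>> > t1 = @elapsed foobar!(copy(A), x, copy(y), N)
>> > t2 = @elapsed foobarblas!(copy(A), x, copy(y), N)
>> > t1 / t2                                  # the ratio in question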
>> >
>> >
>> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>> >>
>> >> The architecture-specific, manual BLAS optimizations don't just give
>> >> you an additional 20%. They can improve speed by a factor of a few.
>> >>
>> >> If you see a factor of 2.6, then that's probably to be accepted,
>> >> unless you really look into the details (generated assembler code,
>> >> measuring cache misses, introducing manual vectorization and loop
>> >> unrolling, etc.). And you'll have to repeat that analysis if you're
>> >> using a different system.
>> >>
>> >> -erik
>> >>
>> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
>> >> <igor.c...@2bridgz.com> wrote:
>> >> > Well, maybe the subject of the post is confusing. I've tried to write
>> >> > an algorithm which runs approximately as fast as one using BLAS
>> >> > functions, but in a pure Julia implementation. Sure, we know that BLAS
>> >> > is highly optimized; I didn't want to beat BLAS, just to be a bit
>> >> > slower, let us say ~1.2-times.
>> >> >
>> >> > If I take a part of the algorithm and run it separately, all works
>> >> > fine. Consider the code below:
>> >> > function rank1update!(A, x, y)
>> >> >     for j = 1:size(A, 2)
>> >> >         @fastmath @inbounds @simd for i = 1:size(A, 1)
>> >> >             A[i,j] += 1.1 * y[j] * x[i]
>> >> >         end
>> >> >     end
>> >> > end
>> >> >
>> >> > function rank1updateb!(A, x, y)
>> >> >     R = BLAS.ger!(1.1, x, y, A)
>> >> > end
>> >> >
>> >> > Here BLAS is ~1.2-times faster.
>> >> > However, calling it together with 'mygemv!' in the loop (see the code
>> >> > in the original post), the performance drops to ~2.6-times slower than
>> >> > using the BLAS functions (gemv, ger).
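>> >> >
>> >> > (A minimal sketch of how this individual comparison might be timed;
>> >> > the sizes and repetition count are arbitrary:)
>> >> >
>> >> > A = rand(1000, 1000); x = rand(1000); y = rand(1000)
>> >> > rank1update!(copy(A), x, y); rank1updateb!(copy(A), x, y)  # compile
>> >> > t_julia = @elapsed for k = 1:50; rank1update!(A, x, y); end
>> >> > t_blas  = @elapsed for k = 1:50; rank1updateb!(A, x, y); end
>> >> > t_julia / t_blas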
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>> >> >>
>> >> >> I'm not sure what the expected result here is. BLAS is designed to be
>> >> >> as fast as possible at matrix multiplication. I'd be more concerned if
>> >> >> you wrote straightforward loop code and beat BLAS, since that would
>> >> >> mean the BLAS is badly mistuned.
>> >> >>
>> >> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky
>> >> >> <igor.c...@2bridgz.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Thanks Steven, I thought there was something more behind it...
>> >> >>>
>> >> >>> I should note that I forgot to mention the matrix dimensions, which
>> >> >>> are 1000 x 1000.
>> >> >>>
>> >> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>> >> >>>>
>> >> >>>> You need a lot more than just fast loops to match the performance
>> >> >>>> of an optimized BLAS. See e.g. this notebook for some comments on
>> >> >>>> the related case of matrix multiplication:
>> >> >>>>
>> >> >>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Erik Schnetter <schn...@gmail.com>
>> >> http://www.perimeterinstitute.ca/personal/eschnetter/
>>
>>
>>
>> --
>> Erik Schnetter <schn...@gmail.com>
>> http://www.perimeterinstitute.ca/personal/eschnetter/



-- 
Erik Schnetter <schnet...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/
