Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-04-04 Thread Igor Cerovsky
Thank you all for the suggestions and troubleshooting. I'll try the Julia
0.4.5 version later and post the results.

On Thursday, 24 March 2016 14:31:29 UTC+1, Yichao Yu wrote:
>
> On Thu, Mar 24, 2016 at 9:26 AM, Erik Schnetter  > wrote: 
> > Using only SSE2 instead of AVX is slower by a factor of two. 
> > 
> > You can try a newer version of Julia (0.4.5?) that should build 
> > against LLVM 3.7.1 instead of LLVM 3.3. 
>
> IIRC the 0.4.* binaries should be built with llvm 3.3. 
>
> > 
> > -erik 
> > 
> > On Thu, Mar 24, 2016 at 8:54 AM, Yichao Yu  > wrote: 
> >> On Thu, Mar 24, 2016 at 3:22 AM, Igor Cerovsky 
> >>  wrote: 
> >>> I cannot find ymm instructions in the assembly code. Does 
> >>> libLLVM-3.3 support the AVX2 instruction set? 
> >> 
> >> Disabled due to llvm bug. The nightly binary is probably also compiled 
> >> without it. 
> > 
> > 
> > 
> > -- 
> > Erik Schnetter  
> > http://www.perimeterinstitute.ca/personal/eschnetter/ 
>


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-24 Thread Yichao Yu
On Thu, Mar 24, 2016 at 9:26 AM, Erik Schnetter  wrote:
> Using only SSE2 instead of AVX is slower by a factor of two.
>
> You can try a newer version of Julia (0.4.5?) that should build
> against LLVM 3.7.1 instead of LLVM 3.3.

IIRC the 0.4.* binaries should be built with llvm 3.3.

>
> -erik
>
> On Thu, Mar 24, 2016 at 8:54 AM, Yichao Yu  wrote:
>> On Thu, Mar 24, 2016 at 3:22 AM, Igor Cerovsky
>>  wrote:
>>> I cannot find ymm instructions in the assembly code. Does
>>> libLLVM-3.3 support the AVX2 instruction set?
>>
>> Disabled due to llvm bug. The nightly binary is probably also compiled
>> without it.
>
>
>
> --
> Erik Schnetter 
> http://www.perimeterinstitute.ca/personal/eschnetter/


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-24 Thread Erik Schnetter
Using only SSE2 instead of AVX is slower by a factor of two.

You can try a newer version of Julia (0.4.5?) that should build
against LLVM 3.7.1 instead of LLVM 3.3.

-erik

On Thu, Mar 24, 2016 at 8:54 AM, Yichao Yu  wrote:
> On Thu, Mar 24, 2016 at 3:22 AM, Igor Cerovsky
>  wrote:
>> I cannot find ymm instructions in the assembly code. Does
>> libLLVM-3.3 support the AVX2 instruction set?
>
> Disabled due to llvm bug. The nightly binary is probably also compiled
> without it.



-- 
Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/
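
For reference, a quick way to confirm which LLVM a given binary was built against, without digging through the build logs (a sketch against 0.4-era APIs; Base.libllvm_version is assumed to be defined in this build):

versioninfo()            # the "LLVM:" line shows e.g. libLLVM-3.3
Base.libllvm_version     # the bare version string, e.g. "3.3", if defined in this build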


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-24 Thread Yichao Yu
On Thu, Mar 24, 2016 at 3:22 AM, Igor Cerovsky
 wrote:
> I cannot find ymm instructions in the assembly code. Does
> libLLVM-3.3 support the AVX2 instruction set?

Disabled due to an LLVM bug. The nightly binary is probably also compiled
without it.


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-24 Thread Igor Cerovsky
I cannot find ymm instructions in the assembly code. Does
libLLVM-3.3 support the AVX2 instruction set?
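
One way to generate such a listing and scan it for AVX usage (a sketch; the 3-argument rank1update! is the kernel from the 21 March message further down in this digest, and the sizes are arbitrary):

A = rand(1000, 1000); x = rand(1000); y = rand(1000)
@code_native rank1update!(A, x, y)   # AVX code uses ymm registers; SSE2-only code uses xmm

If only xmm registers appear, the inner loop is being vectorized with 128-bit SSE2 at best rather than 256-bit AVX.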


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-24 Thread Igor Cerovsky
The assembler code is attached.

versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Igor
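
A listing like the attached one can be produced and saved to a file roughly as follows (a sketch; it assumes the IO-accepting code_native method is available in this Julia version, and the file name and choice of kernel are arbitrary):

open("rank1update_native.asm", "w") do io
    # write the native code for the 3-argument rank1update! kernel to a file for attaching
    code_native(io, rank1update!, Tuple{Matrix{Float64}, Vector{Float64}, Vector{Float64}})
end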

On Wednesday, 23 March 2016 16:05:20 UTC+1, Erik Schnetter wrote:
>
> I was using a "CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz". 
>
> -erik 
>
> On Wed, Mar 23, 2016 at 10:53 AM, Igor Cerovsky 
>  wrote: 
> > Thanks, Erik. I've thought there is something deeper in the LLVM. 
> > Since I'm quite new to Julia, I'll follow your suggestions and send you 
> > some outputs. 
> > What is a processor you were running the benchmarks? 
> > 
> > On 23 March 2016 at 15:42, Erik Schnetter  > wrote: 
> >> 
> >> I get a time ratio (bc / bb) of 1.1 
> >> 
> >> It could be that you're just having bad luck with the particular 
> >> optimization decisions that LLVM makes for the combined code, or with 
> >> the parameters (sizes) for this benchmark. Maybe the performance 
> >> difference changes for different matrix sizes? There's a million 
> >> things you can try, e.g. starting Julia with the "-O" option, or using 
> >> a different LLVM version. What would really help is gather more 
> >> detailed information, e.g. by looking at the disassembled loop kernels 
> >> (to see whether something is wrong), or using a profiler to see where 
> >> the time is spent (Julia has a built-in profiler), or gathering 
> >> statistics about floating point instructions executed and cache 
> >> operations (that requires an external tool). 
> >> 
> >> The disassembled code is CPU-specific and also depends on the LLVM 
> >> version. I'd be happy to have a quick glance at it if you create a 
> >> listing (with `@code_native`) and e.g. put it up as a gist 
> >> . I'd also need your CPU type (`versioninfo()` in 
> >> Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though. 
> >> 
> >> -erik 
> >> 
> >> On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky 
> >>  wrote: 
> >> > I've attached two notebooks, you can check the comparisons. 
> >> > The first one is to compare rank1updatede! and rank1updateb! 
> functions. 
> >> > The 
> >> > Julia to BLAS equivalent comparison gives ratio 1.13, what is nice. 
> The 
> >> > same 
> >> > applies to mygemv vs Blas.gemv. 
> >> > Combining the same routines into the mgs algorithm in the very first 
> >> > post, 
> >> > the resulting performance is mgs / mgs_blas is 2.6 on my computer i7 
> >> > 6700HQ 
> >> > (that is important to mention, because on older processors the 
> >> > difference is 
> >> > not that big, it similar to comparing the routines rank1update and 
> >> > BLAS.ger). This is something what I'm trying to figure out why? 
> >> > 
> >> > 
> >> > On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote: 
> >> >> 
> >> >> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky 
> >> >>  wrote: 
> >> >> > The factor ~20% I've mentioned just because it is something what 
> I've 
> >> >> > commonly observed, and of course can vary, and isn't that 
> important. 
> >> >> > 
> >> >> > What bothers me is: why the performance drops 2-times, when I 
> combine 
> >> >> > two 
> >> >> > routines where each one alone causes performance drop 0.2-times? 
> >> >> 
> >> >> I looked at the IJulia notebook you posted, but it wasn't obvious 
> >> >> which routines you mean. Can you point to exactly the source codes 
> you 
> >> >> are comparing? 
> >> >> 
> >> >> -erik 
> >> >> 
> >> >> > In other words I have routines foo() and bar() and their 
> equivalents 
> >> >> > in 
> >> >> > BLAS 
> >> >> > fooblas() barblas(); where 
> >> >> > @elapsed foo() / @elapsed fooblas() ~= 1.2 
> >> >> > The same for bar. Consider following pseudo-code 
> >> >> >   for k in 1:N 
> >> >> > foo()  # my Julia implementation of a BLAS function for 
> example 
> >> >> > gemv 
> >> >> > bar()  # my Julia implementation of a BLAS function for 
> example 
> >> >> > ger 
> >> >> >   end 
> >> >> > end 
> >> >> > 
> >> >> > 
> >> >> > function foobarblas() 
> >> >> >   for k in 1:N 
> >> >> > fooblas()  # this is equivalent of foo in BLAS for example 
> gemv 
> >> >> > barblas()  # this is equivalent of bar in BLAS for example ger 
> >> >> >   end 
> >> >> > end 
> >> >> > then @elapsed foobar() / @elapsed foobarblas() ~= 2.6 
> >> >> > 
> >> >> > 
> >> >> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote: 
> >> >> >> 
> >> >> >> The architecture-specific, manual BLAS optimizations don't just 
> give 
> >> >> >> you an additional 20%. They can improve speed by a factor of a 
> few. 
> >> >> >> 
> >> >> >> If you see a factor of 2.6, then that's probably to be accepted, 
> >> >> >> unless to really look into the details (generated assembler code, 
> >> >> >> 

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-23 Thread Erik Schnetter
I was using a "CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz".

-erik

On Wed, Mar 23, 2016 at 10:53 AM, Igor Cerovsky
 wrote:
> Thanks, Erik. I've thought there is something deeper in the LLVM.
> Since I'm quite new to Julia, I'll follow your suggestions and send you
> some outputs.
> What is a processor you were running the benchmarks?
>
> On 23 March 2016 at 15:42, Erik Schnetter  wrote:
>>
>> I get a time ratio (bc / bb) of 1.1
>>
>> It could be that you're just having bad luck with the particular
>> optimization decisions that LLVM makes for the combined code, or with
>> the parameters (sizes) for this benchmark. Maybe the performance
>> difference changes for different matrix sizes? There's a million
>> things you can try, e.g. starting Julia with the "-O" option, or using
>> a different LLVM version. What would really help is gather more
>> detailed information, e.g. by looking at the disassembled loop kernels
>> (to see whether something is wrong), or using a profiler to see where
>> the time is spent (Julia has a built-in profiler), or gathering
>> statistics about floating point instructions executed and cache
>> operations (that requires an external tool).
>>
>> The disassembled code is CPU-specific and also depends on the LLVM
>> version. I'd be happy to have a quick glance at it if you create a
>> listing (with `@code_native`) and e.g. put it up as a gist
>> . I'd also need your CPU type (`versioninfo()` in
>> Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though.
>>
>> -erik
>>
>> On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky
>>  wrote:
>> > I've attached two notebooks, you can check the comparisons.
>> > The first one is to compare rank1updatede! and rank1updateb! functions.
>> > The
>> > Julia to BLAS equivalent comparison gives ratio 1.13, what is nice. The
>> > same
>> > applies to mygemv vs Blas.gemv.
>> > Combining the same routines into the mgs algorithm in the very first
>> > post,
>> > the resulting performance is mgs / mgs_blas is 2.6 on my computer i7
>> > 6700HQ
>> > (that is important to mention, because on older processors the
>> > difference is
>> > not that big, it similar to comparing the routines rank1update and
>> > BLAS.ger). This is something what I'm trying to figure out why?
>> >
>> >
>> > On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
>> >>
>> >> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
>> >>  wrote:
>> >> > The factor ~20% I've mentioned just because it is something what I've
>> >> > commonly observed, and of course can vary, and isn't that important.
>> >> >
>> >> > What bothers me is: why the performance drops 2-times, when I combine
>> >> > two
>> >> > routines where each one alone causes performance drop 0.2-times?
>> >>
>> >> I looked at the IJulia notebook you posted, but it wasn't obvious
>> >> which routines you mean. Can you point to exactly the source codes you
>> >> are comparing?
>> >>
>> >> -erik
>> >>
>> >> > In other words I have routines foo() and bar() and their equivalents
>> >> > in
>> >> > BLAS
>> >> > fooblas() barblas(); where
>> >> > @elapsed foo() / @elapsed fooblas() ~= 1.2
>> >> > The same for bar. Consider following pseudo-code
>> >> >   for k in 1:N
>> >> > foo()  # my Julia implementation of a BLAS function for example
>> >> > gemv
>> >> > bar()  # my Julia implementation of a BLAS function for example
>> >> > ger
>> >> >   end
>> >> > end
>> >> >
>> >> >
>> >> > function foobarblas()
>> >> >   for k in 1:N
>> >> > fooblas()  # this is equivalent of foo in BLAS for example gemv
>> >> > barblas()  # this is equivalent of bar in BLAS for example ger
>> >> >   end
>> >> > end
>> >> > then @elapsed foobar() / @elapsed foobarblas() ~= 2.6
>> >> >
>> >> >
>> >> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>> >> >>
>> >> >> The architecture-specific, manual BLAS optimizations don't just give
>> >> >> you an additional 20%. They can improve speed by a factor of a few.
>> >> >>
>> >> >> If you see a factor of 2.6, then that's probably to be accepted,
>> >> >> unless to really look into the details (generated assembler code,
>> >> >> measure cache misses, introduce manual vectorization and loop
>> >> >> unrolling, etc.) And you'll have to repeat that analysis if you're
>> >> >> using a different system.
>> >> >>
>> >> >> -erik
>> >> >>
>> >> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
>> >> >>  wrote:
>> >> >> > Well, maybe the subject of the post is confusing. I've tried to
>> >> >> > write
>> >> >> > an
>> >> >> > algorithm which runs approximately as fast as using BLAS
>> >> >> > functions,
>> >> >> > but
>> >> >> > using pure Julia implementation. Sure, we know, that BLAS is
>> >> >> > highly
>> >> >> > optimized, I don't wanted to beat BLAS, jus to be a bit slower,
>> >> >> > let
>> >> >> > us
>> >> >> > say
>> >> >> > ~1.2-times.
>> >> >> >
>> 

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-23 Thread Igor Cerovsky
Thanks, Erik. I thought there was something deeper in LLVM.
Since I'm quite new to Julia, I'll follow your suggestions and send you
some outputs.
What processor were you running the benchmarks on?

On 23 March 2016 at 15:42, Erik Schnetter  wrote:

> I get a time ratio (bc / bb) of 1.1
>
> It could be that you're just having bad luck with the particular
> optimization decisions that LLVM makes for the combined code, or with
> the parameters (sizes) for this benchmark. Maybe the performance
> difference changes for different matrix sizes? There's a million
> things you can try, e.g. starting Julia with the "-O" option, or using
> a different LLVM version. What would really help is gather more
> detailed information, e.g. by looking at the disassembled loop kernels
> (to see whether something is wrong), or using a profiler to see where
> the time is spent (Julia has a built-in profiler), or gathering
> statistics about floating point instructions executed and cache
> operations (that requires an external tool).
>
> The disassembled code is CPU-specific and also depends on the LLVM
> version. I'd be happy to have a quick glance at it if you create a
> listing (with `@code_native`) and e.g. put it up as a gist
> . I'd also need your CPU type (`versioninfo()` in
> Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though.
>
> -erik
>
> On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky
>  wrote:
> > I've attached two notebooks, you can check the comparisons.
> > The first one is to compare rank1updatede! and rank1updateb! functions.
> The
> > Julia to BLAS equivalent comparison gives ratio 1.13, what is nice. The
> same
> > applies to mygemv vs Blas.gemv.
> > Combining the same routines into the mgs algorithm in the very first
> post,
> > the resulting performance is mgs / mgs_blas is 2.6 on my computer i7
> 6700HQ
> > (that is important to mention, because on older processors the
> difference is
> > not that big, it similar to comparing the routines rank1update and
> > BLAS.ger). This is something what I'm trying to figure out why?
> >
> >
> > On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
> >>
> >> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
> >>  wrote:
> >> > The factor ~20% I've mentioned just because it is something what I've
> >> > commonly observed, and of course can vary, and isn't that important.
> >> >
> >> > What bothers me is: why the performance drops 2-times, when I combine
> >> > two
> >> > routines where each one alone causes performance drop 0.2-times?
> >>
> >> I looked at the IJulia notebook you posted, but it wasn't obvious
> >> which routines you mean. Can you point to exactly the source codes you
> >> are comparing?
> >>
> >> -erik
> >>
> >> > In other words I have routines foo() and bar() and their equivalents
> in
> >> > BLAS
> >> > fooblas() barblas(); where
> >> > @elapsed foo() / @elapsed fooblas() ~= 1.2
> >> > The same for bar. Consider following pseudo-code
> >> >   for k in 1:N
> >> > foo()  # my Julia implementation of a BLAS function for example
> gemv
> >> > bar()  # my Julia implementation of a BLAS function for example
> ger
> >> >   end
> >> > end
> >> >
> >> >
> >> > function foobarblas()
> >> >   for k in 1:N
> >> > fooblas()  # this is equivalent of foo in BLAS for example gemv
> >> > barblas()  # this is equivalent of bar in BLAS for example ger
> >> >   end
> >> > end
> >> > then @elapsed foobar() / @elapsed foobarblas() ~= 2.6
> >> >
> >> >
> >> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
> >> >>
> >> >> The architecture-specific, manual BLAS optimizations don't just give
> >> >> you an additional 20%. They can improve speed by a factor of a few.
> >> >>
> >> >> If you see a factor of 2.6, then that's probably to be accepted,
> >> >> unless to really look into the details (generated assembler code,
> >> >> measure cache misses, introduce manual vectorization and loop
> >> >> unrolling, etc.) And you'll have to repeat that analysis if you're
> >> >> using a different system.
> >> >>
> >> >> -erik
> >> >>
> >> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
> >> >>  wrote:
> >> >> > Well, maybe the subject of the post is confusing. I've tried to
> write
> >> >> > an
> >> >> > algorithm which runs approximately as fast as using BLAS functions,
> >> >> > but
> >> >> > using pure Julia implementation. Sure, we know, that BLAS is highly
> >> >> > optimized, I don't wanted to beat BLAS, jus to be a bit slower, let
> >> >> > us
> >> >> > say
> >> >> > ~1.2-times.
> >> >> >
> >> >> > If I take a part of the algorithm, and run it separately all works
> >> >> > fine.
> >> >> > Consider code below:
> >> >> > function rank1update!(A, x, y)
> >> >> > for j = 1:size(A, 2)
> >> >> > @fastmath @inbounds @simd for i = 1:size(A, 1)
> >> >> > A[i,j] += 1.1 * y[j] * x[i]
> >> >> > end
> >> >> 

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-23 Thread Erik Schnetter
I get a time ratio (bc / bb) of 1.1

It could be that you're just having bad luck with the particular
optimization decisions that LLVM makes for the combined code, or with
the parameters (sizes) for this benchmark. Maybe the performance
difference changes for different matrix sizes? There's a million
things you can try, e.g. starting Julia with the "-O" option, or using
a different LLVM version. What would really help is to gather more
detailed information, e.g. by looking at the disassembled loop kernels
(to see whether something is wrong), or using a profiler to see where
the time is spent (Julia has a built-in profiler), or gathering
statistics about floating point instructions executed and cache
operations (that requires an external tool).

The disassembled code is CPU-specific and also depends on the LLVM
version. I'd be happy to have a quick glance at it if you create a
listing (with `@code_native`) and e.g. put it up as a gist.
I'd also need your CPU type (`versioninfo()` in
Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though.

-erik
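
A minimal sketch of the profiling route suggested above, using the built-in profiler (mgsf is the function from the original post; T stands in for its 1000 x 1000 test matrix, re-created here with rand):

T = rand(1000, 1000)
mgsf(T)             # warm up first so compilation time is not profiled
Profile.clear()
@profile mgsf(T)
Profile.print()     # tree of where the samples land; look for the hot inner loops

# Separately, Julia can be started as `julia -O` to enable the extra optimization
# passes mentioned above.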

On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky
 wrote:
> I've attached two notebooks, you can check the comparisons.
> The first one is to compare rank1updatede! and rank1updateb! functions. The
> Julia to BLAS equivalent comparison gives ratio 1.13, what is nice. The same
> applies to mygemv vs Blas.gemv.
> Combining the same routines into the mgs algorithm in the very first post,
> the resulting performance is mgs / mgs_blas is 2.6 on my computer i7 6700HQ
> (that is important to mention, because on older processors the difference is
> not that big, it similar to comparing the routines rank1update and
> BLAS.ger). This is something what I'm trying to figure out why?
>
>
> On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
>>
>> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
>>  wrote:
>> > The factor ~20% I've mentioned just because it is something what I've
>> > commonly observed, and of course can vary, and isn't that important.
>> >
>> > What bothers me is: why the performance drops 2-times, when I combine
>> > two
>> > routines where each one alone causes performance drop 0.2-times?
>>
>> I looked at the IJulia notebook you posted, but it wasn't obvious
>> which routines you mean. Can you point to exactly the source codes you
>> are comparing?
>>
>> -erik
>>
>> > In other words I have routines foo() and bar() and their equivalents in
>> > BLAS
>> > fooblas() barblas(); where
>> > @elapsed foo() / @elapsed fooblas() ~= 1.2
>> > The same for bar. Consider following pseudo-code
>> >   for k in 1:N
>> > foo()  # my Julia implementation of a BLAS function for example gemv
>> > bar()  # my Julia implementation of a BLAS function for example ger
>> >   end
>> > end
>> >
>> >
>> > function foobarblas()
>> >   for k in 1:N
>> > fooblas()  # this is equivalent of foo in BLAS for example gemv
>> > barblas()  # this is equivalent of bar in BLAS for example ger
>> >   end
>> > end
>> > then @elapsed foobar() / @elapsed foobarblas() ~= 2.6
>> >
>> >
>> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>> >>
>> >> The architecture-specific, manual BLAS optimizations don't just give
>> >> you an additional 20%. They can improve speed by a factor of a few.
>> >>
>> >> If you see a factor of 2.6, then that's probably to be accepted,
>> >> unless to really look into the details (generated assembler code,
>> >> measure cache misses, introduce manual vectorization and loop
>> >> unrolling, etc.) And you'll have to repeat that analysis if you're
>> >> using a different system.
>> >>
>> >> -erik
>> >>
>> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
>> >>  wrote:
>> >> > Well, maybe the subject of the post is confusing. I've tried to write
>> >> > an
>> >> > algorithm which runs approximately as fast as using BLAS functions,
>> >> > but
>> >> > using pure Julia implementation. Sure, we know, that BLAS is highly
>> >> > optimized, I don't wanted to beat BLAS, jus to be a bit slower, let
>> >> > us
>> >> > say
>> >> > ~1.2-times.
>> >> >
>> >> > If I take a part of the algorithm, and run it separately all works
>> >> > fine.
>> >> > Consider code below:
>> >> > function rank1update!(A, x, y)
>> >> > for j = 1:size(A, 2)
>> >> > @fastmath @inbounds @simd for i = 1:size(A, 1)
>> >> > A[i,j] += 1.1 * y[j] * x[i]
>> >> > end
>> >> > end
>> >> > end
>> >> >
>> >> > function rank1updateb!(A, x, y)
>> >> > R = BLAS.ger!(1.1, x, y, A)
>> >> > end
>> >> >
>> >> > Here BLAS is ~1.2-times faster.
>> >> > However, calling it together with 'mygemv!' in the loop (see code in
>> >> > original post), the performance drops to ~2.6 times slower then using
>> >> > BLAS
>> >> > functions (gemv, ger)
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>> >> >>
>> >> >> I'm not sure 

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-23 Thread Igor Cerovsky
I've attached two notebooks so you can check the comparisons.
The first one compares the rank1updatede! and rank1updateb! functions. The
Julia-to-BLAS-equivalent comparison gives a ratio of 1.13, which is nice. The
same applies to mygemv vs BLAS.gemv.
Combining the same routines into the mgs algorithm from the very first post,
the resulting mgs / mgs_blas performance ratio is 2.6 on *my computer*, an
i7-6700HQ (that is important to mention, because on older processors the
difference is not that big; it is similar to comparing the routines rank1update
and BLAS.ger). This is what I'm trying to figure out: why?


On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
>
> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky 
>  wrote: 
> > The factor ~20% I've mentioned just because it is something what I've 
> > commonly observed, and of course can vary, and isn't that important. 
> > 
> > What bothers me is: why the performance drops 2-times, when I combine 
> two 
> > routines where each one alone causes performance drop 0.2-times? 
>
> I looked at the IJulia notebook you posted, but it wasn't obvious 
> which routines you mean. Can you point to exactly the source codes you 
> are comparing? 
>
> -erik 
>
> > In other words I have routines foo() and bar() and their equivalents in 
> BLAS 
> > fooblas() barblas(); where 
> > @elapsed foo() / @elapsed fooblas() ~= 1.2 
> > The same for bar. Consider following pseudo-code 
> >   for k in 1:N 
> > foo()  # my Julia implementation of a BLAS function for example gemv 
> > bar()  # my Julia implementation of a BLAS function for example ger 
> >   end 
> > end 
> > 
> > 
> > function foobarblas() 
> >   for k in 1:N 
> > fooblas()  # this is equivalent of foo in BLAS for example gemv 
> > barblas()  # this is equivalent of bar in BLAS for example ger 
> >   end 
> > end 
> > then @elapsed foobar() / @elapsed foobarblas() ~= 2.6 
> > 
> > 
> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote: 
> >> 
> >> The architecture-specific, manual BLAS optimizations don't just give 
> >> you an additional 20%. They can improve speed by a factor of a few. 
> >> 
> >> If you see a factor of 2.6, then that's probably to be accepted, 
> >> unless to really look into the details (generated assembler code, 
> >> measure cache misses, introduce manual vectorization and loop 
> >> unrolling, etc.) And you'll have to repeat that analysis if you're 
> >> using a different system. 
> >> 
> >> -erik 
> >> 
> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky 
> >>  wrote: 
> >> > Well, maybe the subject of the post is confusing. I've tried to write 
> an 
> >> > algorithm which runs approximately as fast as using BLAS functions, 
> but 
> >> > using pure Julia implementation. Sure, we know, that BLAS is highly 
> >> > optimized, I don't wanted to beat BLAS, jus to be a bit slower, let 
> us 
> >> > say 
> >> > ~1.2-times. 
> >> > 
> >> > If I take a part of the algorithm, and run it separately all works 
> fine. 
> >> > Consider code below: 
> >> > function rank1update!(A, x, y) 
> >> > for j = 1:size(A, 2) 
> >> > @fastmath @inbounds @simd for i = 1:size(A, 1) 
> >> > A[i,j] += 1.1 * y[j] * x[i] 
> >> > end 
> >> > end 
> >> > end 
> >> > 
> >> > function rank1updateb!(A, x, y) 
> >> > R = BLAS.ger!(1.1, x, y, A) 
> >> > end 
> >> > 
> >> > Here BLAS is ~1.2-times faster. 
> >> > However, calling it together with 'mygemv!' in the loop (see code in 
> >> > original post), the performance drops to ~2.6 times slower then using 
> >> > BLAS 
> >> > functions (gemv, ger) 
> >> > 
> >> > 
> >> > 
> >> > 
> >> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote: 
> >> >> 
> >> >> I'm not sure what the expected result here is. BLAS is designed to 
> be 
> >> >> as 
> >> >> fast as possible at matrix multiply. I'd be more concerned if you 
> write 
> >> >> straightforward loop code and beat BLAS, since that means the BLAS 
> is 
> >> >> badly 
> >> >> mistuned. 
> >> >> 
> >> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <
> igor.c...@2bridgz.com> 
> >> >> wrote: 
> >> >>> 
> >> >>> Thanks Steven, I've thought there is something more behind... 
> >> >>> 
> >> >>> I shall note that that I forgot to mention matrix dimensions, which 
> is 
> >> >>> 1000 x 1000. 
> >> >>> 
> >> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote: 
> >>  
> >>  You need a lot more than just fast loops to match the performance 
> of 
> >>  an 
> >>  optimized BLAS.See e.g. this notebook for some comments on the 
> >>  related 
> >>  case of matrix multiplication: 
> >>  
> >>  
> >>  
> >>  
> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>  
> >> >> 
> >> >> 
> >> > 
> >> 
> >> 
> >> 
> >> -- 
> >> Erik Schnetter  
> >> 

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-22 Thread Erik Schnetter
On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
 wrote:
> The factor ~20% I've mentioned just because it is something what I've
> commonly observed, and of course can vary, and isn't that important.
>
> What bothers me is: why the performance drops 2-times, when I combine two
> routines where each one alone causes performance drop 0.2-times?

I looked at the IJulia notebook you posted, but it wasn't obvious
which routines you mean. Can you point me to exactly the source code you
are comparing?

-erik

> In other words I have routines foo() and bar() and their equivalents in BLAS
> fooblas() barblas(); where
> @elapsed foo() / @elapsed fooblas() ~= 1.2
> The same for bar. Consider following pseudo-code
>   for k in 1:N
> foo()  # my Julia implementation of a BLAS function for example gemv
> bar()  # my Julia implementation of a BLAS function for example ger
>   end
> end
>
>
> function foobarblas()
>   for k in 1:N
> fooblas()  # this is equivalent of foo in BLAS for example gemv
> barblas()  # this is equivalent of bar in BLAS for example ger
>   end
> end
> then @elapsed foobar() / @elapsed foobarblas() ~= 2.6
>
>
> On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>>
>> The architecture-specific, manual BLAS optimizations don't just give
>> you an additional 20%. They can improve speed by a factor of a few.
>>
>> If you see a factor of 2.6, then that's probably to be accepted,
>> unless to really look into the details (generated assembler code,
>> measure cache misses, introduce manual vectorization and loop
>> unrolling, etc.) And you'll have to repeat that analysis if you're
>> using a different system.
>>
>> -erik
>>
>> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
>>  wrote:
>> > Well, maybe the subject of the post is confusing. I've tried to write an
>> > algorithm which runs approximately as fast as using BLAS functions, but
>> > using pure Julia implementation. Sure, we know, that BLAS is highly
>> > optimized, I don't wanted to beat BLAS, jus to be a bit slower, let us
>> > say
>> > ~1.2-times.
>> >
>> > If I take a part of the algorithm, and run it separately all works fine.
>> > Consider code below:
>> > function rank1update!(A, x, y)
>> > for j = 1:size(A, 2)
>> > @fastmath @inbounds @simd for i = 1:size(A, 1)
>> > A[i,j] += 1.1 * y[j] * x[i]
>> > end
>> > end
>> > end
>> >
>> > function rank1updateb!(A, x, y)
>> > R = BLAS.ger!(1.1, x, y, A)
>> > end
>> >
>> > Here BLAS is ~1.2-times faster.
>> > However, calling it together with 'mygemv!' in the loop (see code in
>> > original post), the performance drops to ~2.6 times slower then using
>> > BLAS
>> > functions (gemv, ger)
>> >
>> >
>> >
>> >
>> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>> >>
>> >> I'm not sure what the expected result here is. BLAS is designed to be
>> >> as
>> >> fast as possible at matrix multiply. I'd be more concerned if you write
>> >> straightforward loop code and beat BLAS, since that means the BLAS is
>> >> badly
>> >> mistuned.
>> >>
>> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky 
>> >> wrote:
>> >>>
>> >>> Thanks Steven, I've thought there is something more behind...
>> >>>
>> >>> I shall note that that I forgot to mention matrix dimensions, which is
>> >>> 1000 x 1000.
>> >>>
>> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>> 
>>  You need a lot more than just fast loops to match the performance of
>>  an
>>  optimized BLAS.See e.g. this notebook for some comments on the
>>  related
>>  case of matrix multiplication:
>> 
>> 
>> 
>>  http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Erik Schnetter 
>> http://www.perimeterinstitute.ca/personal/eschnetter/



-- 
Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-22 Thread Igor Cerovsky
I mentioned the factor of ~20% just because it is something I've
commonly observed; it can of course vary, and it isn't that important.

What bothers me is: why does the performance drop 2-fold when I combine two
routines, when each one alone causes only a ~20% (1.2x) performance drop?
In other words, I have routines foo() and bar() and their BLAS equivalents
fooblas() and barblas(), where
*@elapsed foo() / @elapsed fooblas() ~= 1.2*
The same holds for bar. Consider the following pseudo-code:

function foobar()
  for k in 1:N
    foo()  # my Julia implementation of a BLAS function, for example gemv
    bar()  # my Julia implementation of a BLAS function, for example ger
  end
end

function foobarblas()
  for k in 1:N
    fooblas()  # the BLAS equivalent of foo, for example gemv
    barblas()  # the BLAS equivalent of bar, for example ger
  end
end

then *@elapsed foobar() / @elapsed foobarblas() ~= 2.6*


On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>
> The architecture-specific, manual BLAS optimizations don't just give 
> you an additional 20%. They can improve speed by a factor of a few. 
>
> If you see a factor of 2.6, then that's probably to be accepted, 
> unless to really look into the details (generated assembler code, 
> measure cache misses, introduce manual vectorization and loop 
> unrolling, etc.) And you'll have to repeat that analysis if you're 
> using a different system. 
>
> -erik 
>
> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky 
>  wrote: 
> > Well, maybe the subject of the post is confusing. I've tried to write an 
> > algorithm which runs approximately as fast as using BLAS functions, but 
> > using pure Julia implementation. Sure, we know, that BLAS is highly 
> > optimized, I don't wanted to beat BLAS, jus to be a bit slower, let us 
> say 
> > ~1.2-times. 
> > 
> > If I take a part of the algorithm, and run it separately all works fine. 
> > Consider code below: 
> > function rank1update!(A, x, y) 
> > for j = 1:size(A, 2) 
> > @fastmath @inbounds @simd for i = 1:size(A, 1) 
> > A[i,j] += 1.1 * y[j] * x[i] 
> > end 
> > end 
> > end 
> > 
> > function rank1updateb!(A, x, y) 
> > R = BLAS.ger!(1.1, x, y, A) 
> > end 
> > 
> > Here BLAS is ~1.2-times faster. 
> > However, calling it together with 'mygemv!' in the loop (see code in 
> > original post), the performance drops to ~2.6 times slower then using 
> BLAS 
> > functions (gemv, ger) 
> > 
> > 
> > 
> > 
> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote: 
> >> 
> >> I'm not sure what the expected result here is. BLAS is designed to be 
> as 
> >> fast as possible at matrix multiply. I'd be more concerned if you write 
> >> straightforward loop code and beat BLAS, since that means the BLAS is 
> badly 
> >> mistuned. 
> >> 
> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky  
> >> wrote: 
> >>> 
> >>> Thanks Steven, I've thought there is something more behind... 
> >>> 
> >>> I shall note that that I forgot to mention matrix dimensions, which is 
> >>> 1000 x 1000. 
> >>> 
> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote: 
>  
>  You need a lot more than just fast loops to match the performance of 
> an 
>  optimized BLAS.See e.g. this notebook for some comments on the 
> related 
>  case of matrix multiplication: 
>  
>  
>  
> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>  
> >> 
> >> 
> > 
>
>
>
> -- 
> Erik Schnetter  
> http://www.perimeterinstitute.ca/personal/eschnetter/ 
>


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-21 Thread Erik Schnetter
The architecture-specific, manual BLAS optimizations don't just give
you an additional 20%. They can improve speed by a factor of a few.

If you see a factor of 2.6, then that's probably to be accepted,
unless you really look into the details (inspect the generated assembler code,
measure cache misses, introduce manual vectorization and loop
unrolling, etc.). And you'll have to repeat that analysis if you're
using a different system.

-erik
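
For concreteness, a sketch of what the manual-unrolling tactic mentioned above might look like for the rank-1 update kernel quoted below (4-way unrolling is an arbitrary choice; this only illustrates the technique and is not a tuned replacement for BLAS.ger!):

function rank1update_unrolled!(A, x, y)
    rows, cols = size(A)
    for j = 1:cols
        c = 1.1 * y[j]
        i = 1
        @inbounds while i + 3 <= rows        # main loop: four elements per iteration
            A[i,   j] += c * x[i]
            A[i+1, j] += c * x[i+1]
            A[i+2, j] += c * x[i+2]
            A[i+3, j] += c * x[i+3]
            i += 4
        end
        @inbounds while i <= rows            # remainder loop for row counts not divisible by 4
            A[i, j] += c * x[i]
            i += 1
        end
    end
    return A
end

Whether this beats the @simd version is exactly the kind of thing that has to be re-measured per CPU and per LLVM version.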

On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
 wrote:
> Well, maybe the subject of the post is confusing. I've tried to write an
> algorithm which runs approximately as fast as using BLAS functions, but
> using pure Julia implementation. Sure, we know, that BLAS is highly
> optimized, I don't wanted to beat BLAS, jus to be a bit slower, let us say
> ~1.2-times.
>
> If I take a part of the algorithm, and run it separately all works fine.
> Consider code below:
> function rank1update!(A, x, y)
> for j = 1:size(A, 2)
> @fastmath @inbounds @simd for i = 1:size(A, 1)
> A[i,j] += 1.1 * y[j] * x[i]
> end
> end
> end
>
> function rank1updateb!(A, x, y)
> R = BLAS.ger!(1.1, x, y, A)
> end
>
> Here BLAS is ~1.2-times faster.
> However, calling it together with 'mygemv!' in the loop (see code in
> original post), the performance drops to ~2.6 times slower then using BLAS
> functions (gemv, ger)
>
>
>
>
> On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>>
>> I'm not sure what the expected result here is. BLAS is designed to be as
>> fast as possible at matrix multiply. I'd be more concerned if you write
>> straightforward loop code and beat BLAS, since that means the BLAS is badly
>> mistuned.
>>
>> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky 
>> wrote:
>>>
>>> Thanks Steven, I've thought there is something more behind...
>>>
>>> I shall note that that I forgot to mention matrix dimensions, which is
>>> 1000 x 1000.
>>>
>>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:

 You need a lot more than just fast loops to match the performance of an
 optimized BLAS.See e.g. this notebook for some comments on the related
 case of matrix multiplication:


 http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>
>>
>



-- 
Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-21 Thread Igor Cerovsky
Well, maybe the subject of the post is confusing. I've tried to write an
algorithm which runs approximately as fast as one using BLAS functions, but
using a pure Julia implementation. Sure, we know that BLAS is highly
optimized; I didn't want to beat BLAS, just to be a bit slower, let us say
~1.2-times.

If I take a part of the algorithm and run it separately, all works fine.
Consider the code below:

function rank1update!(A, x, y)
    for j = 1:size(A, 2)
        @fastmath @inbounds @simd for i = 1:size(A, 1)
            A[i,j] += 1.1 * y[j] * x[i]
        end
    end
end

function rank1updateb!(A, x, y)
    R = BLAS.ger!(1.1, x, y, A)
end

Here BLAS is ~1.2-times faster.
However, calling it together with 'mygemv!' in the loop (see the code in the
original post), the performance drops to ~2.6 times slower than using the BLAS
functions (gemv, ger).
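
A self-contained way to reproduce just this pair of timings (a sketch; the 1000 x 1000 size matches the dimensions mentioned earlier in the thread, and each function is called once first so that compilation is not timed):

n = 1000
A  = rand(n, n); Ab = copy(A)
x  = rand(n);    y  = rand(n)

rank1update!(A, x, y); rank1updateb!(Ab, x, y)    # warm-up / compile both paths

t_julia = @elapsed rank1update!(A, x, y)
t_blas  = @elapsed rank1updateb!(Ab, x, y)
println("loop / BLAS.ger! time ratio: ", t_julia / t_blas)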




On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>
> I'm not sure what the expected result here is. BLAS is designed to be as 
> fast as possible at matrix multiply. I'd be more concerned if you write 
> straightforward loop code and beat BLAS, since that means the BLAS is badly 
> mistuned.
>
> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky  > wrote:
>
>> Thanks Steven, I've thought there is something more behind...
>>
>> I shall note that that I forgot to mention matrix dimensions, which is 
>> 1000 x 1000.
>>
>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>>>
>>> You need a lot more than just fast loops to match the performance of an 
>>> optimized BLAS.See e.g. this notebook for some comments on the related 
>>> case of matrix multiplication:
>>>
>>>
>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>>
>>
>

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-21 Thread Stefan Karpinski
I'm not sure what the expected result here is. BLAS is designed to be as
fast as possible at matrix multiply. I'd be more concerned if you write
straightforward loop code and beat BLAS, since that means the BLAS is badly
mistuned.

On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky 
wrote:

> Thanks Steven, I've thought there is something more behind...
>
> I shall note that that I forgot to mention matrix dimensions, which is
> 1000 x 1000.
>
> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>>
>> You need a lot more than just fast loops to match the performance of an
>> optimized BLAS.See e.g. this notebook for some comments on the related
>> case of matrix multiplication:
>>
>>
>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>
>


[julia-users] Re: performance of loops Julia vs BLAS

2016-03-21 Thread Igor Cerovsky
Thanks Steven, I thought there was something more behind it...

I should note that I forgot to mention the matrix dimensions, which are
1000 x 1000.

On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>
> You need a lot more than just fast loops to match the performance of an 
> optimized BLAS.See e.g. this notebook for some comments on the related 
> case of matrix multiplication:
>
>
> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>


[julia-users] Re: performance of loops Julia vs BLAS

2016-03-21 Thread Steven G. Johnson
You need a lot more than just fast loops to match the performance of an
optimized BLAS. See e.g. this notebook for some comments on the related
case of matrix multiplication:

http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
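
To give a flavor of what "a lot more than fast loops" involves, here is a minimal cache-blocking sketch for matrix multiplication (the block size is an arbitrary guess; real BLAS kernels add packing, register blocking, vectorized microkernels, and per-architecture tuning on top of this idea):

function blocked_mul!(C, A, B, bs=64)
    n, m = size(C)          # C is n x m, A is n x k, B is k x m
    k = size(A, 2)
    fill!(C, 0.0)
    for jj = 1:bs:m, kk = 1:bs:k, ii = 1:bs:n      # iterate over blocks that fit in cache
        for j = jj:min(jj + bs - 1, m)
            for p = kk:min(kk + bs - 1, k)
                @inbounds b = B[p, j]
                @simd for i = ii:min(ii + bs - 1, n)
                    @inbounds C[i, j] += A[i, p] * b
                end
            end
        end
    end
    return C
end

# e.g. C = zeros(1000, 1000); blocked_mul!(C, rand(1000, 1000), rand(1000, 1000))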


[julia-users] Re: performance of loops Julia vs BLAS

2016-03-21 Thread Ján Adamčák
You can use blas_set_num_threads(1) in Julia and it will use only 1
thread... but this is not the right answer.
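
Concretely (0.4-era names; T and mgs_blas refer to the test matrix and function from the original post quoted below, and CPU_CORES is the Base constant for the core count):

blas_set_num_threads(1)           # pin OpenBLAS to a single thread for a like-for-like comparison
Q5, R5 = @time mgs_blas(T)
blas_set_num_threads(CPU_CORES)   # restore the default afterwards

This only makes the comparison fair; as noted above, it does not explain why the combined pure-Julia version is slower.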

On Monday, 21 March 2016 09:49:23 UTC+1, Igor Cerovsky wrote:
>
> Hi,
>
> Trying to write a custom implementation of the Gram-Schmidt algorithm and one 
> using BLAS functions, I got more than 2-times slower performance with Julia 
> 0.4.3 on my computer with an Intel i7-6700HQ (on an older i7-5500 processor 
> the slowdown is only 1.2-times). The code below is a bit longer, but I only 
> see the slow performance in the whole context; profiling parts of the 
> algorithm I got only slightly different performance.
>
> Custom implementation:
>
> function rank1update!(DST, A, R, k)
>     rows, cols = size(A)
>     for j = k+1:cols
>         @simd for i = 1:rows
>             @inbounds DST[i,j] -= A[i, k] * R[k, j]
>         end
>     end
> end
>
> function mygemv!(DST, A, k, alpha)
>     rows, cols = size(A)
>     for j in k+1:cols
>         s = 0.0
>         @simd for i in 1:rows
>             @inbounds s += A[i, k] * A[i, j]
>         end
>         DST[k, j] = s * alpha
>     end
> end
>
> function mgsf(M)
>     rows, cols = size(M)
>     Q = copy(M)
>     R = eye(cols)
>
>     for k in 1:cols
>         alpha = 1.0 / sumabs2(sub(Q, :, k))
>         mygemv!(R, Q, k, alpha)
>         rank1update!(Q, Q, R, k)
>     end
>     Q, R
> end
>
> Implementation using BLAS functions:
> function mgs_blas(M)
>     cols = size(M, 2)
>     Q = copy(M)
>     R = eye(cols)
>
>     for k in 1:cols
>         q_k = sub(Q, :, k)
>         Q_sub = sub(Q, :, k+1:cols)
>         R_sub = sub(R, k, k+1:cols)
>
>         alpha = 1.0 / sumabs2(q_k)
>         R[k, k+1:cols] = BLAS.gemv('T', alpha, Q_sub, q_k)
>         BLAS.ger!(-1.0, q_k, vec(R_sub), Q_sub)
>     end
>
>     Q, R
> end
>
> And results; using BLAS the performance gain is ~2.6 times:
>
> # custom implementation
> Q2, R2 = @time mgsf(T);
>
>   0.714916 seconds (4.99 k allocations: 15.411 MB, 0.08% gc time)
>
>
> # implementation using BLAS functions 
>
> Q5, R5 = @time mgs_blas(T);
>
>   0.339278 seconds (16.45 k allocations: 23.521 MB, 0.76% gc time)
>
>
>
> A hint: looking at the performance graph in the Task Manager, it seems BLAS 
> uses more cores.
> The question that remains is: what is going on?
>
> Thanks for explanation.
>
>