[julia-users] Re: JLD crash on WIN

2016-11-08 Thread Igor Cerovsky
I've also submitted an issue to JLD; the problem was originally reported 
here: https://github.com/JuliaIO/JLD.jl/issues/103

On Tuesday, 8 November 2016 14:50:49 UTC+1, Igor Cerovsky wrote:
>
> Hello,
>
> Julia 0.5 crashes when loading previously saved data using JLD; the code 
> below shows a simple example that causes the crash:
> using JLD
>
> type foo 
>   a::Int
>   b::String
> end
> f = foo(1, "a")
> save("d:/work/Julia/data.jld", "f", f)
> l = load("d:/work/Julia/data.jld", "f") # crashes here
>
> However, everything works fine with:
> using JLD
>
> a = [1,2]
> save("d:/work/Julia/data.jld", "a", a)
> l = load("d:/work/Julia/data.jld", "a")
>
> Julia Version 0.5.0 (2016-09-19 18:14 UTC); Official http://julialang.org/ 
> release; x86_64-w64-mingw32
>
> Any idea how to fix the problem?
>
> Thanks.
>


[julia-users] JLD crash on WIN

2016-11-08 Thread Igor Cerovsky
Hello,

Julia 0.5 crashes when loading previously saved data using JLD; the code 
below shows a simple example that causes the crash:
using JLD

type foo 
  a::Int
  b::String
end
f = foo(1, "a")
save("d:/work/Julia/data.jld", "f", f)
l = load("d:/work/Julia/data.jld", "f") # crashes here

However, everything works fine with:
using JLD

a = [1,2]
save("d:/work/Julia/data.jld", "a", a)
l = load("d:/work/Julia/data.jld", "a")

Julia Version 0.5.0 (2016-09-19 18:14 UTC); Official http://julialang.org/ 
release; x86_64-w64-mingw32

Any idea how to fix the problem?

Thanks.
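
Until the issue linked above is resolved, a possible workaround (my own 
sketch, not from the thread; it assumes the crash is specific to 
reconstructing the custom type, and the names f_a and f_b are arbitrary) is 
to save the plain fields and rebuild the object after loading:

using JLD

type foo
  a::Int
  b::String
end

f = foo(1, "a")

# Store the fields individually (an Int and a String load without problems),
# then reconstruct the object by hand.
save("d:/work/Julia/data.jld", "f_a", f.a, "f_b", f.b)
l = foo(load("d:/work/Julia/data.jld", "f_a"),
        load("d:/work/Julia/data.jld", "f_b"))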


Re: [julia-users] call compiled Julia image from multiple threads in C++

2016-10-06 Thread Igor Cerovsky
Thanks for the quick answer.

Is there a best practice to achieve similar behavior, i.e. calling compiled 
Julia code in parallel?



[julia-users] call compiled Julia image from multiple threads in C++

2016-10-06 Thread Igor Cerovsky
Hello,

When calling compiled Julia 0.5 code from C++ (MSVS 2015) from multiple 
threads, the code that follows fails.

Consider the following example under WIN using Julia 0.5:
#include <iostream>
#include <thread>
#include <vector>
#include "julia.h"

//@Base.ccallable Int64 jl_foo(x::Int64) = (x + 1)

void TestJl()
{
  jl_options.compile_enabled = JL_OPTIONS_COMPILE_OFF;
  jl_options.startupfile = JL_OPTIONS_STARTUPFILE_OFF;
  jl_options.use_precompiled = JL_OPTIONS_USE_PRECOMPILED_YES;
  jl_init_with_image(NULL, "myJulia.dll");

  jl_function_t *func1 = jl_get_function(jl_main_module, "jl_foo");
  jl_value_t* in = jl_box_int64(1);
  jl_value_t* ret = NULL;
  JL_GC_PUSH2(&in, &ret);
  // use computation in Julia here ...
  ret = jl_call1(func1, in);
  std::cout << "returned: " << jl_unbox_int64(ret) << "\n";
  JL_GC_POP();

  jl_atexit_hook(0);
}

int main()
{
  std::vector<std::thread> threads;
  for (size_t i = 0; i < 5; ++i)
  {
    threads.emplace_back(std::thread{ TestJl });
  }

  for (size_t i = 0; i < threads.size(); ++i)
  {
    threads[i].join();
  }

  return 0;
}

Error message: ": for the -enable-tail-merge option: may only occur zero or 
one times!"

The single-threaded version (just calling TestJl()), as well as the 1-thread 
version (one thread in the for loop), works fine.

Can anyone give some hints or an explanation as to whether calling code from 
a compiled Julia image is possible from multiple threads?


Re: [julia-users] Re: documentation reversed order for multiple methods

2016-05-12 Thread Igor Cerovsky
Thanks Michael, the behavior you mentioned in 0.5 is what I expected:
a docstring added with a single pair of double quotes should appear after
the docstring of the previous method of the given function, which is written
inside a pair of triple double quotes.

On 12 May 2016 at 13:42, Michael Hatherly <michaelhathe...@gmail.com> wrote:

> The ordering of docstrings in Julia 0.4 is based on the method
> signatures, which can sometimes not be all that obvious. They
> are not printed in a random order though. It’s best just to not
> rely on a specific order in 0.4.
>
> In 0.5 this order has been changed,
> https://github.com/JuliaLang/julia/pull/15266,
> to the definition order of the docstrings, which should be much
> less surprising.
>
> — Mike
> On Thursday, 12 May 2016 10:23:16 UTC+2, Igor Cerovsky wrote:
>>
>> Hello,
>>
>> Trying to write documentation according to the Julia manual, I've found the
>> following reversed order of methods for a given function; is that a bug?:
>> """
>> foo()
>>
>> I'm major foo...
>> """
>> function foo()
>> end
>>
>> "
>> foo(x)
>>
>> Here comes additional doc...
>> "
>> function foo(x)
>> end
>>
>> The resulting documentation looks like:
>> help?> foo
>>   foo(x)
>>
>>   Here comes additional doc...
>>
>>   foo()
>>
>>   I'm major foo...
>>
>>
>>
>>


[julia-users] documentation reversed order for multiple methods

2016-05-12 Thread Igor Cerovsky
Hello,

Trying to write documentation according to the Julia manual, I've found the 
following reversed order of methods for a given function; is that a bug?:
"""
foo()

I'm major foo...
"""
function foo()
end

"
foo(x)

Here comes additional doc...
"
function foo(x)
end

The resulting documentation looks like:
help?> foo
  foo(x)

  Here comes additional doc...

  foo()

  I'm major foo...





[julia-users] delegating constructors

2016-04-25 Thread Igor Cerovsky
Hello,

Does Julia support delegating constructors similar to C++11? If yes, what 
is the syntax? Thanks.

// C++ code
#include <string>
using std::string;

class class_a {
public:
    class_a() {}
    class_a(string str) : m_string{ str } {}
    class_a(string str, double dbl) : m_string{ str }, m_double{ dbl } {}

    double m_double;
    string m_string;
};
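
For comparison, a rough Julia sketch (my own; the default values "" and 0.0 
are an assumption, since the C++ version leaves the members uninitialized). 
Julia has no dedicated delegating-constructor syntax, but outer constructors 
can call other constructors, which covers the same pattern:

type ClassA
    m_string::String   # on Julia 0.4, use UTF8String instead of String
    m_double::Float64
end

# Outer constructors delegating to the default (all-fields) constructor:
ClassA() = ClassA("", 0.0)
ClassA(str) = ClassA(str, 0.0)

a = ClassA("hello")    # ClassA("hello", 0.0)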





[julia-users] strategy design pattern

2016-04-06 Thread Igor Cerovsky
Hi,

While exploring Julia, I'm curious what the best way would be to implement 
the strategy design pattern in Julia. I'm posting a C++ example for which I 
would like to write an equivalent in Julia.
Thanks.

#include <iostream>
#include <vector>
#include <functional>

template<typename T>
void foo_t(const T& t)
{
  std::cout << "I'm using foo_t... \n";
  for (auto a : t)  std::cout << a << '\n';
}

template<typename T>
void bar_t(const T& t)
{
  std::cout << "I'm using bar_t... \n";
  for (auto a : t)  std::cout << a+10 << '\n';
}

template<typename T>
class alg_t
{
public:
  template<typename F>
  explicit alg_t(F strategy) : strategy(std::move(strategy)) { }

  void run(const T& t) { strategy(t); }

private:
  std::function<void(const T&)> strategy;
};

int main()
{
  using VcD = std::vector<double>;
  VcD vec{1., 2., 3.};

  alg_t<VcD> alg_foo(foo_t<VcD>);
  alg_foo.run(vec);

  alg_t<VcD> alg_bar(bar_t<VcD>);
  alg_bar.run(vec);

  return 0;
}
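
For reference, one way this might look in Julia (a sketch of my own, not an 
answer from the list): functions are first-class values, so the strategy can 
be stored in a field, and making the field type a parameter plays the role of 
the C++ template (on 0.5, where each function has its own type, this keeps 
the call concretely typed).

# The type parameter F stands in for the template parameter.
immutable Alg{F}
    strategy::F
end

run_strategy(alg::Alg, t) = alg.strategy(t)

function foo_t(t)
    println("I'm using foo_t...")
    for a in t
        println(a)
    end
end

function bar_t(t)
    println("I'm using bar_t...")
    for a in t
        println(a + 10)
    end
end

vec = [1.0, 2.0, 3.0]
run_strategy(Alg(foo_t), vec)
run_strategy(Alg(bar_t), vec)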





Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-04-04 Thread Igor Cerovsky
Thank you all for the suggestions and troubleshooting. I'll try the 0.4.5 
Julia version later and post the results.

On Thursday, 24 March 2016 14:31:29 UTC+1, Yichao Yu wrote:
>
> On Thu, Mar 24, 2016 at 9:26 AM, Erik Schnetter <schn...@gmail.com 
> > wrote: 
> > Using only SSE2 instead of AVX is slower by a factor of two. 
> > 
> > You can try a newer version of Julia (0.4.5?) that should build 
> > against LLVM 3.7.1 instead of LLVM 3.3. 
>
> IIRC the 0.4.* binaries should be built with llvm 3.3. 
>
> > 
> > -erik 
> > 
> > On Thu, Mar 24, 2016 at 8:54 AM, Yichao Yu <yyc...@gmail.com 
> > wrote: 
> >> On Thu, Mar 24, 2016 at 3:22 AM, Igor Cerovsky 
> >> <igor.c...@2bridgz.com > wrote: 
> >>> I cannot find ymm instructions in the assembly code. Does 
> >>> libLLVM-3.3 support the AVX2 instruction set? 
> >> 
> >> Disabled due to llvm bug. The nightly binary is probably also compiled 
> >> without it. 
> > 
> > 
> > 
> > -- 
> > Erik Schnetter <schn...@gmail.com > 
> > http://www.perimeterinstitute.ca/personal/eschnetter/ 
>


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-24 Thread Igor Cerovsky
I cannot find ymm instructions in the assembly code. Does 
libLLVM-3.3 support the AVX2 instruction set?


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-24 Thread Igor Cerovsky
The assembler code is attached.

versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Igor

On Wednesday, 23 March 2016 16:05:20 UTC+1, Erik Schnetter wrote:
>
> I was using a "CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz". 
>
> -erik 
>
> On Wed, Mar 23, 2016 at 10:53 AM, Igor Cerovsky 
> <igor.c...@2bridgz.com > wrote: 
> > Thanks, Erik. I've thought there is something deeper in the LLVM. 
> > Since I'm quite new to Julia, I'll follow your suggestions and send you 
> > some outputs. 
> > What processor were you running the benchmarks on? 
> > 
> > On 23 March 2016 at 15:42, Erik Schnetter <schn...@gmail.com 
> > wrote: 
> >> 
> >> I get a time ratio (bc / bb) of 1.1 
> >> 
> >> It could be that you're just having bad luck with the particular 
> >> optimization decisions that LLVM makes for the combined code, or with 
> >> the parameters (sizes) for this benchmark. Maybe the performance 
> >> difference changes for different matrix sizes? There's a million 
> >> things you can try, e.g. starting Julia with the "-O" option, or using 
> >> a different LLVM version. What would really help is gather more 
> >> detailed information, e.g. by looking at the disassembled loop kernels 
> >> (to see whether something is wrong), or using a profiler to see where 
> >> the time is spent (Julia has a built-in profiler), or gathering 
> >> statistics about floating point instructions executed and cache 
> >> operations (that requires an external tool). 
> >> 
> >> The disassembled code is CPU-specific and also depends on the LLVM 
> >> version. I'd be happy to have a quick glance at it if you create a 
> >> listing (with `@code_native`) and e.g. put it up as a gist 
> >> . I'd also need your CPU type (`versioninfo()` in 
> >> Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though. 
> >> 
> >> -erik 
> >> 
> >> On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky 
> >> <igor.c...@2bridgz.com > wrote: 
> >> > I've attached two notebooks, you can check the comparisons. 
> >> > The first one is to compare rank1updatede! and rank1updateb! 
> functions. 
> >> > The 
> >> > Julia to BLAS equivalent comparison gives ratio 1.13, what is nice. 
> The 
> >> > same 
> >> > applies to mygemv vs Blas.gemv. 
> >> > Combining the same routines into the mgs algorithm in the very first 
> >> > post, 
> >> > the resulting performance is mgs / mgs_blas is 2.6 on my computer i7 
> >> > 6700HQ 
> >> > (that is important to mention, because on older processors the 
> >> > difference is 
> >> > not that big, it similar to comparing the routines rank1update and 
> >> > BLAS.ger). This is something what I'm trying to figure out why? 
> >> > 
> >> > 
> >> > On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote: 
> >> >> 
> >> >> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky 
> >> >> <igor.c...@2bridgz.com> wrote: 
> >> >> > The factor ~20% I've mentioned just because it is something what 
> I've 
> >> >> > commonly observed, and of course can vary, and isn't that 
> important. 
> >> >> > 
> >> >> > What bothers me is: why the performance drops 2-times, when I 
> combine 
> >> >> > two 
> >> >> > routines where each one alone causes performance drop 0.2-times? 
> >> >> 
> >> >> I looked at the IJulia notebook you posted, but it wasn't obvious 
> >> >> which routines you mean. Can you point to exactly the source codes 
> you 
> >> >> are comparing? 
> >> >> 
> >> >> -erik 
> >> >> 
> >> >> > In other words I have routines foo() and bar() and their 
> equivalents 
> >> >> > in 
> >> >> > BLAS 
> >> >> > fooblas() barblas(); where 
> >> >> > @elapsed foo() / @elapsed fooblas() ~= 1.2 
> >> >> > The same for bar. Consider following pseudo-code 
> >> >> >   for k in 1:N 
> >> >> > foo()  # my Julia implementation of a BLAS function for example gemv

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-23 Thread Igor Cerovsky
Thanks, Erik. I've thought there was something deeper in LLVM.
Since I'm quite new to Julia, I'll follow your suggestions and send you
some outputs.
What processor were you running the benchmarks on?

On 23 March 2016 at 15:42, Erik Schnetter <schnet...@gmail.com> wrote:

> I get a time ratio (bc / bb) of 1.1
>
> It could be that you're just having bad luck with the particular
> optimization decisions that LLVM makes for the combined code, or with
> the parameters (sizes) for this benchmark. Maybe the performance
> difference changes for different matrix sizes? There's a million
> things you can try, e.g. starting Julia with the "-O" option, or using
> a different LLVM version. What would really help is gather more
> detailed information, e.g. by looking at the disassembled loop kernels
> (to see whether something is wrong), or using a profiler to see where
> the time is spent (Julia has a built-in profiler), or gathering
> statistics about floating point instructions executed and cache
> operations (that requires an external tool).
>
> The disassembled code is CPU-specific and also depends on the LLVM
> version. I'd be happy to have a quick glance at it if you create a
> listing (with `@code_native`) and e.g. put it up as a gist
> . I'd also need your CPU type (`versioninfo()` in
> Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though.
>
> -erik
>
> On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky
> <igor.cerov...@2bridgz.com> wrote:
> > I've attached two notebooks, you can check the comparisons.
> > The first one is to compare rank1updatede! and rank1updateb! functions.
> The
> > Julia to BLAS equivalent comparison gives ratio 1.13, what is nice. The
> same
> > applies to mygemv vs Blas.gemv.
> > Combining the same routines into the mgs algorithm in the very first
> post,
> > the resulting performance is mgs / mgs_blas is 2.6 on my computer i7
> 6700HQ
> > (that is important to mention, because on older processors the
> difference is
> > not that big, it similar to comparing the routines rank1update and
> > BLAS.ger). This is something what I'm trying to figure out why?
> >
> >
> > On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
> >>
> >> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
> >> <igor.c...@2bridgz.com> wrote:
> >> > The factor ~20% I've mentioned just because it is something what I've
> >> > commonly observed, and of course can vary, and isn't that important.
> >> >
> >> > What bothers me is: why the performance drops 2-times, when I combine
> >> > two
> >> > routines where each one alone causes performance drop 0.2-times?
> >>
> >> I looked at the IJulia notebook you posted, but it wasn't obvious
> >> which routines you mean. Can you point to exactly the source codes you
> >> are comparing?
> >>
> >> -erik
> >>
> >> > In other words I have routines foo() and bar() and their equivalents
> in
> >> > BLAS
> >> > fooblas() barblas(); where
> >> > @elapsed foo() / @elapsed fooblas() ~= 1.2
> >> > The same for bar. Consider following pseudo-code
> >> >   for k in 1:N
> >> > foo()  # my Julia implementation of a BLAS function for example
> gemv
> >> > bar()  # my Julia implementation of a BLAS function for example
> ger
> >> >   end
> >> > end
> >> >
> >> >
> >> > function foobarblas()
> >> >   for k in 1:N
> >> > fooblas()  # this is equivalent of foo in BLAS for example gemv
> >> > barblas()  # this is equivalent of bar in BLAS for example ger
> >> >   end
> >> > end
> >> > then @elapsed foobar() / @elapsed foobarblas() ~= 2.6
> >> >
> >> >
> >> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
> >> >>
> >> >> The architecture-specific, manual BLAS optimizations don't just give
> >> >> you an additional 20%. They can improve speed by a factor of a few.
> >> >>
> >> >> If you see a factor of 2.6, then that's probably to be accepted,
> >> >> unless to really look into the details (generated assembler code,
> >> >> measure cache misses, introduce manual vectorization and loop
> >> >> unrolling, etc.) And you'll have to repeat that analysis if you're
> >> >> using a different system.
> >> >>
> >> >> -erik
> >> >>
> >> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky wrote:

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-23 Thread Igor Cerovsky
I've attached two notebooks; you can check the comparisons. 
The first one compares the rank1update! and rank1updateb! functions. The 
Julia to BLAS comparison gives a ratio of 1.13, which is nice. The same 
applies to mygemv! vs BLAS.gemv. 
Combining the same routines into the mgs algorithm from the very first post, 
the resulting performance ratio mgs / mgs_blas is 2.6 on *my computer*, an 
i7-6700HQ (that is important to mention, because on older processors the 
difference is not that big; it is similar to comparing the routines 
rank1update! and BLAS.ger!). This is what I'm trying to figure out: why?


On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
>
> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky 
> <igor.c...@2bridgz.com > wrote: 
> > The factor ~20% I've mentioned just because it is something what I've 
> > commonly observed, and of course can vary, and isn't that important. 
> > 
> > What bothers me is: why the performance drops 2-times, when I combine 
> two 
> > routines where each one alone causes performance drop 0.2-times? 
>
> I looked at the IJulia notebook you posted, but it wasn't obvious 
> which routines you mean. Can you point to exactly the source codes you 
> are comparing? 
>
> -erik 
>
> > In other words I have routines foo() and bar() and their equivalents in 
> BLAS 
> > fooblas() barblas(); where 
> > @elapsed foo() / @elapsed fooblas() ~= 1.2 
> > The same for bar. Consider following pseudo-code 
> >   for k in 1:N 
> > foo()  # my Julia implementation of a BLAS function for example gemv 
> > bar()  # my Julia implementation of a BLAS function for example ger 
> >   end 
> > end 
> > 
> > 
> > function foobarblas() 
> >   for k in 1:N 
> > fooblas()  # this is equivalent of foo in BLAS for example gemv 
> > barblas()  # this is equivalent of bar in BLAS for example ger 
> >   end 
> > end 
> > then @elapsed foobar() / @elapsed foobarblas() ~= 2.6 
> > 
> > 
> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote: 
> >> 
> >> The architecture-specific, manual BLAS optimizations don't just give 
> >> you an additional 20%. They can improve speed by a factor of a few. 
> >> 
> >> If you see a factor of 2.6, then that's probably to be accepted, 
> >> unless to really look into the details (generated assembler code, 
> >> measure cache misses, introduce manual vectorization and loop 
> >> unrolling, etc.) And you'll have to repeat that analysis if you're 
> >> using a different system. 
> >> 
> >> -erik 
> >> 
> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky 
> >> <igor.c...@2bridgz.com> wrote: 
> >> > Well, maybe the subject of the post is confusing. I've tried to write 
> an 
> >> > algorithm which runs approximately as fast as using BLAS functions, 
> but 
> >> > using pure Julia implementation. Sure, we know, that BLAS is highly 
> >> > optimized, I don't wanted to beat BLAS, jus to be a bit slower, let 
> us 
> >> > say 
> >> > ~1.2-times. 
> >> > 
> >> > If I take a part of the algorithm, and run it separately all works 
> fine. 
> >> > Consider code below: 
> >> > function rank1update!(A, x, y) 
> >> > for j = 1:size(A, 2) 
> >> > @fastmath @inbounds @simd for i = 1:size(A, 1) 
> >> > A[i,j] += 1.1 * y[j] * x[i] 
> >> > end 
> >> > end 
> >> > end 
> >> > 
> >> > function rank1updateb!(A, x, y) 
> >> > R = BLAS.ger!(1.1, x, y, A) 
> >> > end 
> >> > 
> >> > Here BLAS is ~1.2-times faster. 
> >> > However, calling it together with 'mygemv!' in the loop (see code in 
> >> > original post), the performance drops to ~2.6 times slower then using 
> >> > BLAS 
> >> > functions (gemv, ger) 
> >> > 
> >> > 
> >> > 
> >> > 
> >> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote: 
> >> >> 
> >> >> I'm not sure what the expected result here is. BLAS is designed to 
> be 
> >> >> as 
> >> >> fast as possible at matrix multiply. I'd be more concerned if you 
> write 
> >> >> straightforward loop code and beat BLAS, since that means the BLAS 
> is 
> >> >> badly 
> >> >> mistuned. 
> >> >> 
> >> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <igor.c...@2bridgz.com> wrote:

Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-22 Thread Igor Cerovsky
I mentioned the factor of ~20% just because it is something I've commonly 
observed; it can of course vary, and isn't that important.

What bothers me is: why does the performance drop 2-fold when I combine two 
routines, each of which alone costs only ~20%? 
In other words, I have routines foo() and bar() and their equivalents in 
BLAS, fooblas() and barblas(), where 
*@elapsed foo() / @elapsed fooblas() ~= 1.2 *
The same holds for bar. Consider the following pseudo-code:
function foobar()
  for k in 1:N
    foo()  # my Julia implementation of a BLAS function, for example gemv
    bar()  # my Julia implementation of a BLAS function, for example ger
  end
end

function foobarblas()
  for k in 1:N
    fooblas()  # the BLAS equivalent of foo, for example gemv
    barblas()  # the BLAS equivalent of bar, for example ger
  end
end
then *@elapsed foobar() / @elapsed foobarblas() ~= 2.6*


On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>
> The architecture-specific, manual BLAS optimizations don't just give 
> you an additional 20%. They can improve speed by a factor of a few. 
>
> If you see a factor of 2.6, then that's probably to be accepted, 
> unless to really look into the details (generated assembler code, 
> measure cache misses, introduce manual vectorization and loop 
> unrolling, etc.) And you'll have to repeat that analysis if you're 
> using a different system. 
>
> -erik 
>
> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky 
> <igor.c...@2bridgz.com > wrote: 
> > Well, maybe the subject of the post is confusing. I've tried to write an 
> > algorithm which runs approximately as fast as using BLAS functions, but 
> > using pure Julia implementation. Sure, we know, that BLAS is highly 
> > optimized, I don't wanted to beat BLAS, jus to be a bit slower, let us 
> say 
> > ~1.2-times. 
> > 
> > If I take a part of the algorithm, and run it separately all works fine. 
> > Consider code below: 
> > function rank1update!(A, x, y) 
> > for j = 1:size(A, 2) 
> > @fastmath @inbounds @simd for i = 1:size(A, 1) 
> > A[i,j] += 1.1 * y[j] * x[i] 
> > end 
> > end 
> > end 
> > 
> > function rank1updateb!(A, x, y) 
> > R = BLAS.ger!(1.1, x, y, A) 
> > end 
> > 
> > Here BLAS is ~1.2-times faster. 
> > However, calling it together with 'mygemv!' in the loop (see code in 
> > original post), the performance drops to ~2.6 times slower then using 
> BLAS 
> > functions (gemv, ger) 
> > 
> > 
> > 
> > 
> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote: 
> >> 
> >> I'm not sure what the expected result here is. BLAS is designed to be 
> as 
> >> fast as possible at matrix multiply. I'd be more concerned if you write 
> >> straightforward loop code and beat BLAS, since that means the BLAS is 
> badly 
> >> mistuned. 
> >> 
> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <igor.c...@2bridgz.com> 
> >> wrote: 
> >>> 
> >>> Thanks Steven, I've thought there is something more behind... 
> >>> 
> >>> I shall note that that I forgot to mention matrix dimensions, which is 
> >>> 1000 x 1000. 
> >>> 
> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote: 
> >>>> 
> >>>> You need a lot more than just fast loops to match the performance of 
> an 
> >>>> optimized BLAS.See e.g. this notebook for some comments on the 
> related 
> >>>> case of matrix multiplication: 
> >>>> 
> >>>> 
> >>>> 
> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>  
> >> 
> >> 
> > 
>
>
>
> -- 
> Erik Schnetter <schn...@gmail.com > 
> http://www.perimeterinstitute.ca/personal/eschnetter/ 
>


Re: [julia-users] Re: performance of loops Julia vs BLAS

2016-03-21 Thread Igor Cerovsky
Well, maybe the subject of the post is confusing. I've tried to write an 
algorithm which runs approximately as fast as one using BLAS functions, but 
in a pure Julia implementation. Sure, we know that BLAS is highly optimized; 
I didn't want to beat BLAS, just to be a bit slower, let us say ~1.2-times. 

If I take a part of the algorithm and run it separately, all works fine. 
Consider the code below:
function rank1update!(A, x, y)
    for j = 1:size(A, 2)
        @fastmath @inbounds @simd for i = 1:size(A, 1)
            A[i,j] += 1.1 * y[j] * x[i]
        end
    end
end

function rank1updateb!(A, x, y)
    R = BLAS.ger!(1.1, x, y, A)
end

Here BLAS is ~1.2-times faster.
However, calling it together with mygemv! in the loop (see the code in the 
original post), the performance drops to ~2.6-times slower than using the 
BLAS functions (gemv, ger).




On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>
> I'm not sure what the expected result here is. BLAS is designed to be as 
> fast as possible at matrix multiply. I'd be more concerned if you write 
> straightforward loop code and beat BLAS, since that means the BLAS is badly 
> mistuned.
>
> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <igor.c...@2bridgz.com 
> > wrote:
>
>> Thanks Steven, I've thought there is something more behind...
>>
>> I shall note that that I forgot to mention matrix dimensions, which is 
>> 1000 x 1000.
>>
>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>>>
>>> You need a lot more than just fast loops to match the performance of an 
> >>> optimized BLAS. See e.g. this notebook for some comments on the related 
>>> case of matrix multiplication:
>>>
>>>
>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>>
>>
>

[julia-users] Re: performance of loops Julia vs BLAS

2016-03-21 Thread Igor Cerovsky
Thanks Steven, I've thought there was something more behind it...

I should note that I forgot to mention the matrix dimensions, which are 1000 
x 1000.

On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>
> You need a lot more than just fast loops to match the performance of an 
> optimized BLAS. See e.g. this notebook for some comments on the related 
> case of matrix multiplication:
>
>
> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>


[julia-users] performance of loops Julia vs BLAS

2016-03-21 Thread Igor Cerovsky
Hi,

Trying to write both a custom implementation of the Gram-Schmidt algorithm 
and one using BLAS functions, I got more than 2-times slower performance for 
Julia 0.4.3 on my computer with an Intel i7-6700HQ (on an older i7-5500 
processor the gap is only ~1.2-times). The code below is a bit longer, but 
the slowdown only shows up in the whole context; profiling parts of the 
algorithm separately, I got only slightly different performance.

Custom implementation:

# Subtract the outer product A[:, k] * R[k, :] from the trailing columns of DST.
function rank1update!(DST, A, R, k)
rows, cols = size(A)
for j = k+1:cols
@simd for i = 1:rows
@inbounds DST[i,j] -= A[i, k] * R[k, j]
end
end
end

# DST[k, j] = alpha * dot(A[:, k], A[:, j]) for the trailing columns j > k.
function mygemv!(DST, A, k, alpha)
rows, cols = size(A)
for j in k+1:cols
s = 0.0
@simd for i in 1:rows
@inbounds s += A[i, k] * A[i, j]
end  
DST[k, j] = s * alpha
end
end

# Modified Gram-Schmidt: returns Q with mutually orthogonal columns and a unit
# upper triangular R such that M ≈ Q * R.
function mgsf(M)
rows, cols = size(M)
Q = copy(M)
R = eye(cols)

for k in 1:cols
alpha = 1.0 / sumabs2(sub(Q, :, k))
mygemv!(R, Q, k, alpha)
rank1update!(Q, Q, R, k)
end
Q, R
end

Implementation using BLAS functions:
function mgs_blas(M)
cols = size(M, 2)
Q = copy(M)
R = eye(cols)

for k in 1:cols
q_k = sub(Q, :, k)
Q_sub = sub(Q, :, k+1:cols)
R_sub = sub(R, k, k+1:cols)

alpha = 1.0 / sumabs2(q_k)
R[k, k+1:cols] = BLAS.gemv('T', alpha, Q_sub, q_k)
BLAS.ger!(-1.0,  q_k, vec(R_sub), Q_sub)
end

Q, R
end
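
Before timing, the two versions can be cross-checked against each other (a 
quick sketch of my own; T stands for the 1000 x 1000 test matrix):

T = rand(1000, 1000)
Qc, Rc = mgsf(T)
Qb, Rb = mgs_blas(T)
vecnorm(Qc * Rc - T)  # ~0: the factorization reproduces T
vecnorm(Qc - Qb)      # ~0: both implementations agree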

And the results; using BLAS, the performance gain is ~2.6-times:

# custom implementation
Q2, R2 = @time mgsf(T);

  0.714916 seconds (4.99 k allocations: 15.411 MB, 0.08% gc time)


# implementation using BLAS functions 

Q5, R5 = @time mgs_blas(T);

  0.339278 seconds (16.45 k allocations: 23.521 MB, 0.76% gc time)



A hint: looking at the performance graph in the Task Manager, it seems BLAS 
uses more cores.
The question that remains is: what is going on?

Thanks for any explanation.
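
One way to test the more-cores hypothesis (my suggestion, assuming the stock 
multithreaded OpenBLAS): pin BLAS to a single thread and time both versions 
again.

BLAS.set_num_threads(1)  # single-threaded BLAS for a like-for-like comparison
Q2, R2 = @time mgsf(T);
Q5, R5 = @time mgs_blas(T);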



Re: [julia-users] broadcast performance is slow

2016-03-21 Thread Igor Cerovsky
Stefan, thanks.

On Friday, 18 March 2016 15:11:22 UTC+1, Stefan Karpinski wrote:
>
> This was one of the motivating issues for the function redesign which is 
> now on Julia master. If you try the latest development version of Julia, 
> these should both be much faster.
>
> On Fri, Mar 18, 2016 at 8:50 AM, Igor Cerovsky <igor.c...@2bridgz.com 
> > wrote:
>
>> Hello,
>>
>> While writing code for unit normal scaling, I've found a big difference 
>> related to where the function used in broadcast! is defined, *globally* vs. 
>> *locally*. Consider the functions below:
>>
>> function scun2!(A)
>> shift = mean( A, 1)
>> stretch = std(A, 1)
>> 
>> f(a, b, c) = (a - b) / c   # defined locally
>> broadcast!(f, A, A, shift, stretch)
>> 
>> shift, stretch
>> end
>>
>> f_scun(a, b, c) = (a - b) / c  # defined globally
>> function scun3!(A)
>> shift = mean( A, 1)
>> stretch = std(A, 1)
>> 
>> broadcast!(f_scun, A, A, shift, stretch)
>> 
>> shift, stretch
>> end
>>
>> Resulting performance is:
>>
>> R2 = copy(T)
>>
>> @time sh2, sc2 = scun2!(R2);
>>
>>   0.035527 seconds (19.51 k allocations: 967.273 KB)
>>
>>
>> R3 = copy(T)
>>
>> @time sh3, sc3 = scun3!(R3);
>>
>>   0.009705 seconds (54 allocations: 17.547 KB)
>>
>>
>> How can it be explained that when f_scun is defined outside the function, 
>> the performance is 3.6-times better (the number of allocations is also much 
>> larger)? I'm using Julia 0.4.3.
>>
>> Thank you,
>> Igor
>>
>>
>>
>>
>>
>

[julia-users] broadcast performance is slow

2016-03-19 Thread Igor Cerovsky
Hello,

While writing code for unit normal scaling, I've found a big difference 
related to where the function used in broadcast! is defined, *globally* vs. 
*locally*. Consider the functions below:

function scun2!(A)
shift = mean( A, 1)
stretch = std(A, 1)

f(a, b, c) = (a - b) / c   # defined locally
broadcast!(f, A, A, shift, stretch)

shift, stretch
end

f_scun(a, b, c) = (a - b) / c  # defined globally
function scun3!(A)
shift = mean( A, 1)
stretch = std(A, 1)

broadcast!(f_scun, A, A, shift, stretch)

shift, stretch
end

Resulting performance is:

R2 = copy(T)

@time sh2, sc2 = scun2!(R2);

  0.035527 seconds (19.51 k allocations: 967.273 KB)


R3 = copy(T)

@time sh3, sc3 = scun3!(R3);

  0.009705 seconds (54 allocations: 17.547 KB)


How can it be explained that when f_scun is defined outside the function, the 
performance is 3.6-times better (the number of allocations is also much 
larger)? I'm using Julia 0.4.3.

Thank you,
Igor
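
On 0.4, one workaround (a sketch of my own, not from the thread) is to skip 
the higher-order broadcast! call entirely and write the loops out, so the 
kernel can specialize:

function scun4!(A)
    shift = mean(A, 1)
    stretch = std(A, 1)

    # Explicit loops avoid passing a function through broadcast!,
    # which is what defeats specialization on 0.4.
    for j in 1:size(A, 2)
        @inbounds for i in 1:size(A, 1)
            A[i, j] = (A[i, j] - shift[j]) / stretch[j]
        end
    end

    shift, stretch
end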