[julia-users] Re: help understanding different ways of wrapping functions

2016-09-28 Thread K leo
Thanks so much for the tips.  The culprit is the keyword argument 
(xRat=0.).  Declaring its type made the wrapped code twice as fast, but 
still much slower than the inline code.  Making it positional made the 
wrapped code only a little slower than the inline code, a big improvement.
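
For reference, the three variants compared above can be sketched as follows. 
This is only an illustration written for Julia 1.x (so mutable struct rather 
than the 0.5 type keyword); the names m_kw, m_kw_typed, m_pos and result are 
made up for the sketch and are not part of the demo code.

```julia
# Sketch for Julia 1.x; all lowercase names are hypothetical, not from the demo.
mutable struct Tech
    a::Vector{Float64}
    c::Vector{Int}
end

# Variant 1: untyped keyword argument, like the original M_CPS.
function m_kw(i::Int, csign::Int, tech::Tech; xRat=0.0)
    tech.a[i] = csign == tech.c[i] ? tech.a[i] + 2.0 * xRat : 0.0
    nothing
end

# Variant 2: the same keyword argument with a declared type.
function m_kw_typed(i::Int, csign::Int, tech::Tech; xRat::Float64=0.0)
    tech.a[i] = csign == tech.c[i] ? tech.a[i] + 2.0 * xRat : 0.0
    nothing
end

# Variant 3: xRat as an ordinary positional argument.
function m_pos(i::Int, csign::Int, tech::Tech, xRat::Float64)
    tech.a[i] = csign == tech.c[i] ? tech.a[i] + 2.0 * xRat : 0.0
    nothing
end

# Run one variant over a small fixed Tech, using the same loop nest as the demo.
function result(f)
    tech = Tech(zeros(4), [0, 1, 0, 1])
    for j in 1:10, xRat in 1.0:20.0, i in eachindex(tech.a)
        f(i, 1, tech, xRat)
    end
    tech.a
end

r_kw    = result((i, cs, t, x) -> m_kw(i, cs, t; xRat=x))
r_typed = result((i, cs, t, x) -> m_kw_typed(i, cs, t; xRat=x))
r_pos   = result(m_pos)
println(r_kw == r_typed == r_pos)
```

All three compute the same values; only the calling convention differs, which 
is where the timing difference I saw on 0.5 came from.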

On Wednesday, September 28, 2016 at 2:50:40 PM UTC+8, Gunnar Farnebäck 
wrote:
>
> It's normal that manually inlined code of this kind is faster than wrapped 
> code unless the compiler manages to see the full inlining potential. In 
> this case the huge memory allocations for the wrapped solutions indicate 
> that it's nowhere near doing that at all. I doubt it will take you all the 
> way, but start by modifying your inner M_CPS function to take only 
> positional arguments, or by declaring the type of the keyword argument as 
> suggested in the performance tips section of the manual.

Re: [julia-users] Re: help understanding different ways of wrapping functions

2016-09-28 Thread Mauro
On Wed, 2016-09-28 at 08:50, Gunnar Farnebäck wrote:
> It's normal that manually inlined code of this kind is faster than wrapped
> code unless the compiler manages to see the full inlining potential. In
> this case the huge memory allocations for the wrapped solutions indicate
> that it's nowhere near doing that at all. I doubt it will take you all the
> way, but start by modifying your inner M_CPS function to take only
> positional arguments, or by declaring the type of the keyword argument as
> suggested in the performance tips section of the manual.

Even annotated keywords are slower than normal, positional ones (except
when their default value is used, as far as I recall).
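
A rough way to check this on your own Julia version is to compare allocations 
for a keyword call and a positional call. The sketch below is written for 
Julia 1.x, and scale_kw, scale_pos, sum_kw and sum_pos are hypothetical 
helpers invented for the illustration. On versions later than the 0.5 
discussed here the gap may be much smaller or absent, so measure rather than 
assume.

```julia
# Hypothetical helpers, not from the thread.
scale_kw(x::Float64; factor::Float64=2.0) = factor * x    # annotated keyword
scale_pos(x::Float64, factor::Float64=2.0) = factor * x   # positional with default

# Sum using the keyword form, passing the keyword explicitly on every call.
function sum_kw(xs::Vector{Float64}, factor::Float64)
    s = 0.0
    for x in xs
        s += scale_kw(x; factor=factor)
    end
    s
end

# Same computation using the positional form.
function sum_pos(xs::Vector{Float64}, factor::Float64)
    s = 0.0
    for x in xs
        s += scale_pos(x, factor)
    end
    s
end

xs = collect(1.0:1000.0)
sum_kw(xs, 3.0); sum_pos(xs, 3.0)   # warm up so compilation is not measured
println("keyword:    ", @allocated(sum_kw(xs, 3.0)), " bytes")
println("positional: ", @allocated(sum_pos(xs, 3.0)), " bytes")
```

The numbers printed depend on the Julia version and on how much the compiler 
inlines, so no expected output is given.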


[julia-users] Re: help understanding different ways of wrapping functions

2016-09-28 Thread Gunnar Farnebäck
It's normal that manually inlined code of this kind is faster than wrapped 
code unless the compiler manages to see the full inlining potential. In 
this case the huge memory allocations for the wrapped solutions indicate 
that it's nowhere near doing that at all. I doubt it will take you all the 
way, but start by modifying your inner M_CPS function to take only 
positional arguments, or by declaring the type of the keyword argument as 
suggested in the performance tips section of the manual.
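
As a minimal sketch of the positional approach, assuming Julia 1.x syntax and 
using made-up names (m_cps, loop_inline!, loop_wrapped!) rather than the ones 
in the demo, the wrapped loop could look like this. The @inline annotation is 
only a hint to the compiler, not a guarantee.

```julia
# Sketch for Julia 1.x; names are hypothetical, not from the demo code.
mutable struct Tech
    a::Vector{Float64}
    c::Vector{Int}
end

# Inner kernel taking positional arguments only.
@inline function m_cps(i::Int, csign::Int, tech::Tech, xRat::Float64)
    if csign == tech.c[i]
        tech.a[i] += 2.0 * xRat
    else
        tech.a[i] = 0.0
    end
    nothing
end

# Manually inlined reference loop.
function loop_inline!(tech::Tech, csign::Int)
    for j in 1:10, xRat in 1.0:20.0
        for i in eachindex(tech.a)
            if csign == tech.c[i]
                tech.a[i] += 2.0 * xRat
            else
                tech.a[i] = 0.0
            end
        end
    end
    tech
end

# Wrapped loop. Julia normally specializes on a function argument that is
# called in the body; writing it as the type parameter F makes that explicit
# and also covers wrappers that only pass the function along.
function loop_wrapped!(kernel::F, tech::Tech, csign::Int) where {F}
    for j in 1:10, xRat in 1.0:20.0
        for i in eachindex(tech.a)
            kernel(i, csign, tech, xRat)
        end
    end
    tech
end

c = rand(0:1, 1000)
t1 = loop_inline!(Tech(zeros(1000), copy(c)), 1)
t2 = loop_wrapped!(m_cps, Tech(zeros(1000), copy(c)), 1)
println(t1.a == t2.a)
```

Both loops perform the same updates in the same order, so the results should 
match; whether the timings match depends on how much the compiler inlines.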

On Wednesday, 28 September 2016 at 06:29:37 UTC+2, K leo wrote:
>
> I tested a few different ways of wrapping functions.  It looks like 
> different ways of wrapping have slightly different costs.  But the most 
> confusing part for me is that putting everything inline looks much faster 
> than wrapping things up.  I would understand this in other languages, but 
> I thought Julia advocates simple wrapping.  Can anyone help explain what 
> is happening below, and how I can do the most efficient wrapping in the 
> demo code?
>
> Demo code is included below.
>
> julia> versioninfo()
> Julia Version 0.5.0
> Commit 3c9d753 (2016-09-19 18:14 UTC)
> Platform Info:
>   System: Linux (x86_64-pc-linux-gnu)
>   CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblas64_
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>
> julia> testFunc()
> calling LoopCP (everything inline)
>   0.097556 seconds (2.10 k allocations: 290.625 KB)
> elapsed time (ns): 97555896
> bytes allocated:   297600
> pool allocs:   2100
> [0.0,4200.0,0.0,0.0,4200.0,4200.0,4200.0,4200.0,0.0,4200.0,4200.0]
>
> calling LoopCP0 (slightly wrapped)
>   4.173830 seconds (49.78 M allocations: 2.232 GB, 5.83% gc time)
> elapsed time (ns): 4173830495
> gc time (ns):  243516584
> bytes allocated:   2396838538
> pool allocs:   49783357
> GC pauses: 104
> full collections:  1
> [4200.0,0.0,4200.0,4200.0,0.0,0.0,0.0,0.0,4200.0,0.0,0.0]
>
> calling LoopCP1 (wrapped one way)
>   5.274723 seconds (59.59 M allocations: 2.378 GB, 3.62% gc time)
> elapsed time (ns): 5274722983
> gc time (ns):  191036337
> bytes allocated:   2553752638
> pool allocs:   59585834
> GC pauses: 112
> [8400.0,0.0,8400.0,8400.0,0.0,0.0,0.0,0.0,8400.0,0.0,0.0]
>
> calling LoopCP2 (wrapped another way)
>   5.212895 seconds (59.58 M allocations: 2.378 GB, 3.60% gc time)
> elapsed time (ns): 5212894550
> gc time (ns):  187696529
> bytes allocated:   2553577600
> pool allocs:   59582100
> GC pauses: 111
> [0.0,8400.0,0.0,0.0,8400.0,8400.0,8400.0,8400.0,0.0,8400.0,8400.0]
>
> const dim=1000
>
> type Tech
>     a::Array{Float64,1}
>     c::Array{Int,1}
>
>     function Tech()
>         this = new()
>         this.a = zeros(Float64, dim)
>         this.c = rand([0,1;], dim)
>         this
>     end
> end
>
> function LoopCP(csign::Int, tech::Tech)
>     for j=1:10
>         for xRat in [1.:20.;]
>             @inbounds for i = 1:dim
>                 if csign == tech.c[i]
>                     tech.a[i] += 2.*xRat
>                 else
>                     tech.a[i] = 0.
>                 end
>             end
>         end #
>     end
>     nothing
> end
>
> function M_CPS(i::Int, csign::Int, tech::Tech; xRat=0.)
>     if csign == tech.c[i]
>         tech.a[i] += 2.*xRat
>     else
>         tech.a[i] = 0.
>     end
>     nothing
> end
>
> function LoopCP0(csign::Int, tech::Tech)
>     for j=1:10
>         for xRat in [1.:20.;]
>             @inbounds for i = 1:dim
>                 M_CPS(i, csign, tech, xRat=xRat)
>             end
>         end #
>     end
>     nothing
> end
>
> function MoleculeWrapS(csign::Int, tech::Tech, molecule::Function, xRat=0.)
>     @inbounds for i = 1:dim
>         molecule(i, csign, tech; xRat=xRat)
>     end
>     nothing
> end
>
> function LoopRunnerM1(csign::Int, tech::Tech, molecule::Function)
>     for j=1:10
>         for xRat in [1.:20.;]
>             MoleculeWrapS(csign, tech, molecule, xRat)
>         end #
>     end
>     nothing
> end
>
> LoopCP1(csign::Int, tech::Tech) = LoopRunnerM1(csign, tech, M_CPS)
>
> WrapCPS(csign::Int, tech::Tech, xRat=0.) =
>     MoleculeWrapS(csign, tech, M_CPS, xRat)
>
> function LoopRunnerM2(csign::Int, tech::Tech, loop::Function)
>     for j=1:10
>         for xRat in [1.:20.;]
>             loop(csign, tech, xRat)
>         end #
>     end
>     nothing
> end
>
> LoopCP2(csign::Int, tech::Tech) = LoopRunnerM2(csign, tech, WrapCPS)
>
> function testFunc()
>     tech = Tech()
>     nloops = 100
>
>     println("calling LoopCP (everything inline)")
>     tech.a = zeros(tech.a)