[julia-users] Re: help understanding different ways of wrapping functions
Thanks so much for the tips. The culprit is the keyword argument (xRat=0.). Declaring its type made the wrapped code twice as fast, but still far slower than the inline code. Making it positional instead made the wrapped code only a little slower than the inline code, which is a big improvement.

On Wednesday, September 28, 2016 at 2:50:40 PM UTC+8, Gunnar Farnebäck wrote:
>
> It's normal that manually inlined code of this kind is faster than wrapped
> code unless the compiler manages to see the full inlining potential. In
> this case the huge memory allocations for the wrapped solutions indicate
> that it's nowhere near doing that at all. I doubt it will take you all the
> way but start with modifying your inner M_CPS function to only take
> positional arguments or declaring the type of the keyword argument as
> suggested in the performance tips section of the manual.
>
> On Wednesday, 28 September 2016 at 06:29:37 UTC+2, K leo wrote:
>>
>> I tested a few different ways of wrapping functions. It looks like different
>> ways of wrapping have slightly different costs. But the most confusing to
>> me is that putting everything inline looks much faster than wrapping things
>> up. I would understand this in other languages, but I thought Julia
>> advocates simple wrapping. Can anyone help explain what is happening
>> below, and how I can do the most efficient wrapping in the demo code?
>>
>> Demo code is included below.
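The positional-argument change described at the top of this message can be sketched roughly as follows. This is not the exact code from the thread: it assumes Julia 1.x syntax (`mutable struct` instead of the 0.5-era `type`), renames the constant to `DIM`, and iterates over a range instead of building `[1.:20.;]` on every pass. Only the signature and call of `M_CPS` reflect the fix itself.

```julia
# Sketch for Julia 1.x; the thread itself ran on Julia 0.5.
const DIM = 1000

mutable struct Tech
    a::Vector{Float64}
    c::Vector{Int}
end

# Zero-argument constructor mirroring the thread's Tech() (assumed equivalent).
Tech() = Tech(zeros(Float64, DIM), rand(0:1, DIM))

# xRat is now an ordinary positional argument instead of a keyword,
# so the call in the hot loop is a plain, fully typed method call.
function M_CPS(i::Int, csign::Int, tech::Tech, xRat::Float64)
    if csign == tech.c[i]
        tech.a[i] += 2.0 * xRat
    else
        tech.a[i] = 0.0
    end
    nothing
end

function LoopCP0(csign::Int, tech::Tech)
    for j in 1:10
        for xRat in 1.0:20.0          # a range, so no array is allocated here
            @inbounds for i in 1:DIM
                M_CPS(i, csign, tech, xRat)
            end
        end
    end
    nothing
end

tech = Tech()
LoopCP0(1, tech)         # first call includes compilation time
@time LoopCP0(1, tech)   # second call times the loop itself
```

Timing the second call rather than the first keeps compilation out of the measurement, which matters when comparing the wrapped and inline variants.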
Re: [julia-users] Re: help understanding different ways of wrapping functions
On Wed, 2016-09-28 at 08:50, Gunnar Farnebäck wrote:
> It's normal that manually inlined code of this kind is faster than wrapped
> code unless the compiler manages to see the full inlining potential. In
> this case the huge memory allocations for the wrapped solutions indicate
> that it's nowhere near doing that at all. I doubt it will take you all the
> way but start with modifying your inner M_CPS function to only take
> positional arguments or declaring the type of the keyword argument as
> suggested in the performance tips section of the manual.

Even annotated keywords are slower than normal, positional ones (except when their default value is used, as far as I recall).

> On Wednesday, 28 September 2016 at 06:29:37 UTC+2, K leo wrote:
>>
>> I tested a few different ways of wrapping functions. It looks like different
>> ways of wrapping have slightly different costs. But the most confusing to
>> me is that putting everything inline looks much faster than wrapping things
>> up. I would understand this in other languages, but I thought Julia
>> advocates simple wrapping. Can anyone help explain what is happening
>> below, and how I can do the most efficient wrapping in the demo code?
>>
>> Demo code is included below.
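If a keyword interface is still wanted, one common pattern is to accept the keyword once in an outer function and forward it to a positional kernel that owns the loop, so no keyword call happens per element. The sketch below assumes Julia 1.x; `update!` and `update_kw!` are hypothetical names, not functions from the thread, and keyword-call overhead may differ between Julia versions, so it is worth timing both forms on your own setup.

```julia
# Inner kernel: positional arguments only. The whole loop lives here,
# so the per-element work never goes through keyword handling.
function update!(a::Vector{Float64}, c::Vector{Int}, csign::Int, xRat::Float64)
    @inbounds for i in eachindex(a, c)
        a[i] = (c[i] == csign) ? a[i] + 2.0 * xRat : 0.0
    end
    return a
end

# Outer wrapper: the typed keyword is processed once per call,
# then forwarded positionally to the kernel.
update_kw!(a, c, csign; xRat::Float64 = 0.0) = update!(a, c, csign, xRat)

a = zeros(4)
c = [0, 1, 1, 0]
update_kw!(a, c, 1; xRat = 3.0)   # a is now [0.0, 6.0, 6.0, 0.0]
```

The design choice is simply to pay the keyword cost at the boundary of the API instead of inside the innermost loop.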
[julia-users] Re: help understanding different ways of wrapping functions
It's normal that manually inlined code of this kind is faster than wrapped code unless the compiler manages to see the full inlining potential. In this case the huge memory allocations for the wrapped solutions indicate that it's nowhere near doing that at all. I doubt it will take you all the way, but start by modifying your inner M_CPS function to take only positional arguments, or by declaring the type of the keyword argument as suggested in the performance tips section of the manual.

On Wednesday, 28 September 2016 at 06:29:37 UTC+2, K leo wrote:
>
> I tested a few different ways of wrapping functions. It looks like different
> ways of wrapping have slightly different costs. But the most confusing to
> me is that putting everything inline looks much faster than wrapping things
> up. I would understand this in other languages, but I thought Julia
> advocates simple wrapping. Can anyone help explain what is happening
> below, and how I can do the most efficient wrapping in the demo code?
>
> Demo code is included below.
>
> julia> versioninfo()
> Julia Version 0.5.0
> Commit 3c9d753 (2016-09-19 18:14 UTC)
> Platform Info:
>   System: Linux (x86_64-pc-linux-gnu)
>   CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblas64_
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>
> julia> testFunc()
> calling LoopCP (everything inline)
>   0.097556 seconds (2.10 k allocations: 290.625 KB)
> elapsed time (ns): 97555896
> bytes allocated: 297600
> pool allocs: 2100
> [0.0,4200.0,0.0,0.0,4200.0,4200.0,4200.0,4200.0,0.0,4200.0,4200.0]
>
> calling LoopCP0 (slightly wrapped)
>   4.173830 seconds (49.78 M allocations: 2.232 GB, 5.83% gc time)
> elapsed time (ns): 4173830495
> gc time (ns): 243516584
> bytes allocated: 2396838538
> pool allocs: 49783357
> GC pauses: 104
> full collections: 1
> [4200.0,0.0,4200.0,4200.0,0.0,0.0,0.0,0.0,4200.0,0.0,0.0]
>
> calling LoopCP1 (wrapped one way)
>   5.274723 seconds (59.59 M allocations: 2.378 GB, 3.62% gc time)
> elapsed time (ns): 5274722983
> gc time (ns): 191036337
> bytes allocated: 2553752638
> pool allocs: 59585834
> GC pauses: 112
> [8400.0,0.0,8400.0,8400.0,0.0,0.0,0.0,0.0,8400.0,0.0,0.0]
>
> calling LoopCP2 (wrapped another way)
>   5.212895 seconds (59.58 M allocations: 2.378 GB, 3.60% gc time)
> elapsed time (ns): 5212894550
> gc time (ns): 187696529
> bytes allocated: 2553577600
> pool allocs: 59582100
> GC pauses: 111
> [0.0,8400.0,0.0,0.0,8400.0,8400.0,8400.0,8400.0,0.0,8400.0,8400.0]
>
> const dim=1000
>>
>> type Tech
>>     a::Array{Float64,1}
>>     c::Array{Int,1}
>>
>>     function Tech()
>>         this = new()
>>         this.a = zeros(Float64, dim)
>>         this.c = rand([0,1;], dim)
>>         this
>>     end
>> end
>>
>> function LoopCP(csign::Int, tech::Tech)
>>     for j=1:10
>>         for xRat in [1.:20.;]
>>             @inbounds for i = 1:dim
>>                 if csign == tech.c[i]
>>                     tech.a[i] += 2.*xRat
>>                 else
>>                     tech.a[i] = 0.
>>                 end
>>             end
>>         end #
>>     end
>>     nothing
>> end
>>
>> function M_CPS(i::Int, csign::Int, tech::Tech; xRat=0.)
>>     if csign == tech.c[i]
>>         tech.a[i] += 2.*xRat
>>     else
>>         tech.a[i] = 0.
>>     end
>>     nothing
>> end
>>
>> function LoopCP0(csign::Int, tech::Tech)
>>     for j=1:10
>>         for xRat in [1.:20.;]
>>             @inbounds for i = 1:dim
>>                 M_CPS(i, csign, tech, xRat=xRat)
>>             end
>>         end #
>>     end
>>     nothing
>> end
>>
>> function MoleculeWrapS(csign::Int, tech::Tech, molecule::Function, xRat=0.)
>>     @inbounds for i = 1:dim
>>         molecule(i, csign, tech; xRat=xRat)
>>     end
>>     nothing
>> end
>>
>> function LoopRunnerM1(csign::Int, tech::Tech, molecule::Function)
>>     for j=1:10
>>         for xRat in [1.:20.;]
>>             MoleculeWrapS(csign, tech, molecule, xRat)
>>         end #
>>     end
>>     nothing
>> end
>>
>> LoopCP1(csign::Int, tech::Tech) = LoopRunnerM1(csign, tech, M_CPS)
>>
>> WrapCPS(csign::Int, tech::Tech, xRat=0.) = MoleculeWrapS(csign, tech, M_CPS, xRat)
>>
>> function LoopRunnerM2(csign::Int, tech::Tech, loop::Function)
>>     for j=1:10
>>         for xRat in [1.:20.;]
>>             loop(csign, tech, xRat)
>>         end #
>>     end
>>     nothing
>> end
>>
>> LoopCP2(csign::Int, tech::Tech) = LoopRunnerM2(csign, tech, WrapCPS)
>>
>> function testFunc()
>>     tech = Tech()
>>     nloops = 100
>>
>>     println("calling LoopCP (everything inline)")
>>     tech.a = zeros(tech.a)