On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky <igor.cerov...@2bridgz.com> wrote:
> The factor ~20% I've mentioned just because it is something I've
> commonly observed; it can of course vary, and isn't that important.
>
> What bothers me is: why does the performance drop 2-fold when I combine two
> routines, when each one alone causes only a 0.2-fold (~20%) slowdown?
I looked at the IJulia notebook you posted, but it wasn't obvious which
routines you mean. Can you point to exactly the source code you are
comparing?

-erik

> In other words, I have routines foo() and bar() and their BLAS equivalents
> fooblas() and barblas(), where
>
>     @elapsed foo() / @elapsed fooblas() ~= 1.2
>
> The same holds for bar. Consider the following pseudo-code:
>
>     function foobar()
>         for k in 1:N
>             foo()  # my Julia implementation of a BLAS function, for example gemv
>             bar()  # my Julia implementation of a BLAS function, for example ger
>         end
>     end
>
>     function foobarblas()
>         for k in 1:N
>             fooblas()  # the BLAS equivalent of foo, for example gemv
>             barblas()  # the BLAS equivalent of bar, for example ger
>         end
>     end
>
> Then @elapsed foobar() / @elapsed foobarblas() ~= 2.6.
>
>
> On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>>
>> The architecture-specific, manual BLAS optimizations don't just give
>> you an additional 20%. They can improve speed by a factor of a few.
>>
>> If you see a factor of 2.6, then that's probably to be accepted,
>> unless you really look into the details (generated assembler code,
>> measure cache misses, introduce manual vectorization and loop
>> unrolling, etc.). And you'll have to repeat that analysis if you're
>> using a different system.
>>
>> -erik
>>
>> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
>> <igor.c...@2bridgz.com> wrote:
>> > Well, maybe the subject of the post is confusing. I've tried to write an
>> > algorithm which runs approximately as fast as the BLAS functions, but
>> > using a pure Julia implementation. Sure, we know that BLAS is highly
>> > optimized; I didn't want to beat BLAS, just to be a bit slower, let us
>> > say ~1.2-times.
>> >
>> > If I take a part of the algorithm and run it separately, all works fine.
>> > Consider the code below:
>> >
>> >     function rank1update!(A, x, y)
>> >         for j = 1:size(A, 2)
>> >             @fastmath @inbounds @simd for i = 1:size(A, 1)
>> >                 A[i,j] += 1.1 * y[j] * x[i]
>> >             end
>> >         end
>> >     end
>> >
>> >     function rank1updateb!(A, x, y)
>> >         R = BLAS.ger!(1.1, x, y, A)
>> >     end
>> >
>> > Here BLAS is ~1.2-times faster.
>> > However, calling it together with 'mygemv!' in the loop (see code in the
>> > original post), the performance drops to ~2.6-times slower than using the
>> > BLAS functions (gemv, ger).
>> >
>> >
>> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>> >>
>> >> I'm not sure what the expected result here is. BLAS is designed to be as
>> >> fast as possible at matrix multiplication. I'd be more concerned if you
>> >> wrote straightforward loop code and beat BLAS, since that would mean the
>> >> BLAS is badly mistuned.
>> >>
>> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <igor.c...@2bridgz.com>
>> >> wrote:
>> >>>
>> >>> Thanks Steven, I thought there was something more behind it...
>> >>>
>> >>> I should note that I forgot to mention the matrix dimensions, which are
>> >>> 1000 x 1000.
>> >>>
>> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>> >>>>
>> >>>> You need a lot more than just fast loops to match the performance of an
>> >>>> optimized BLAS. See e.g.
>> >>>> this notebook for some comments on the related case of matrix
>> >>>> multiplication:
>> >>>>
>> >>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>
>>
>> --
>> Erik Schnetter <schn...@gmail.com>
>> http://www.perimeterinstitute.ca/personal/eschnetter/

--
Erik Schnetter <schnet...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/
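
For reference, a minimal, self-contained sketch of the kind of comparison discussed in this thread: the pure-Julia rank-1 update from the quoted code timed against BLAS, alone and combined with a gemv-style update in a loop. The mygemv! below is only an illustrative stand-in (the original post's version is not shown in this thread), the iteration count N is arbitrary, and the matrix size follows the 1000 x 1000 case mentioned above.

    # Sketch only: compare a pure-Julia gemv+ger loop with the BLAS equivalents.
    using LinearAlgebra   # provides the BLAS module on Julia >= 0.7

    # Rank-1 update, as in the quoted code: A += 1.1 * x * y'
    function rank1update!(A, x, y)
        for j = 1:size(A, 2)
            @fastmath @inbounds @simd for i = 1:size(A, 1)
                A[i, j] += 1.1 * y[j] * x[i]
            end
        end
    end

    # Hypothetical stand-in for the thread's mygemv!: y = A * x, written as loops.
    function mygemv!(y, A, x)
        fill!(y, 0.0)
        for j = 1:size(A, 2)
            @fastmath @inbounds @simd for i = 1:size(A, 1)
                y[i] += A[i, j] * x[j]
            end
        end
    end

    # Pure-Julia combined loop.
    function combined!(A, x, y, N)
        for _ in 1:N
            mygemv!(y, A, x)
            rank1update!(A, x, y)
        end
    end

    # BLAS combined loop doing the analogous operations.
    function combinedblas!(A, x, y, N)
        for _ in 1:N
            BLAS.gemv!('N', 1.0, A, x, 0.0, y)   # y = A * x
            BLAS.ger!(1.1, x, y, A)              # A += 1.1 * x * y'
        end
    end

    n, N = 1000, 10
    A = rand(n, n); x = rand(n); y = rand(n)

    # Warm up once to exclude compilation time, then time both variants.
    combined!(copy(A), x, copy(y), 1); combinedblas!(copy(A), x, copy(y), 1)
    t_julia = @elapsed combined!(copy(A), x, copy(y), N)
    t_blas  = @elapsed combinedblas!(copy(A), x, copy(y), N)
    println("pure Julia / BLAS time ratio: ", t_julia / t_blas)

The reported ratio is what the thread is debating: each routine on its own may be only ~1.2-times slower than BLAS, while the combined loop can show a larger gap, which depends on cache behavior, vectorization, and the particular BLAS build on a given machine.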