[algogeeks] Re: optimisation

2012-03-06 Thread Gene
This is a nice paper. I think that for matrix multiplication it ends up saying pretty much the same as we've been discussing. The OP said serial code, and vector code isn't serial. However, it's easy to try vectorization these days. The latest versions of gcc will do a very reasonable job with the
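Gene's point about gcc auto-vectorization can be illustrated with a stock loop the compiler vectorizes readily under -O3. This sketch is not from the thread; the `restrict` qualifiers are the key assumption that tells the compiler the arrays don't alias, which it needs in order to vectorize safely:

```c
/* Compile with: gcc -O3 (add -fopt-info-vec to see what gets vectorized).
   restrict promises x and y don't overlap, enabling vectorization. */
void saxpy(int n, float a, float *restrict y, const float *restrict x)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```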

Re: [algogeeks] Re: optimisation

2012-03-05 Thread Arun Vishwanathan
@Gene: If all matrices are N x N and my L1 cache size is L1, then if I need a block of A and a block of B to fit in cache, would I need 2 x (blocksize)^2 x 8 (bytes per data item, for example) = L1? Or would I need 3 (including the C block) x (blocksize)^2 x 8 = L1? I'm confused. I tried

[algogeeks] Re: optimisation

2012-03-05 Thread Gene
You can get an initial guess with math. The rest you'll need to determine by experiment. The math will be Block Dimension = sqrt( L1 / sizeof(FLOAT) / K ) . K could be 2 or 3 depending on the loop order. A reason you can't predict perfectly is that interrupt handlers can add to cache load in

Re: [algogeeks] Re: optimisation

2012-03-05 Thread Sairam Ravu
Here is the nice link for writing fast matrix-matrix multiplication. http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100 Apart from this we can vectorize the code and also we can do unrolling to get very good performance. -- With love and regards, Sairam Ravu I M.Tech(CS) Sri

[algogeeks] Re: optimisation

2012-02-29 Thread Gene
For big matrices, using all the caches well is very important. The usual approach is block tiling as you say. You want two blocks to fit nicely in the L1 cache. The most highly optimized schemes will have a hierarchy of tiles where two tiles at the second level will fit in the L2 cache, etc.
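The block-tiling scheme Gene describes at the L1 level can be sketched as below. This is a minimal illustration, not a tuned implementation: B_DIM is an arbitrary placeholder that would really come from the L1-size formula discussed in this thread, and a production version would add the L2/L3 tile hierarchy Gene mentions:

```c
/* Blocked (tiled) C = A * B for row-major N x N matrices.
   The tile pair being multiplied is meant to fit in L1 together. */
#define B_DIM 16  /* illustrative only; tune from cache size */

static void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;

    for (int ii = 0; ii < n; ii += B_DIM)
        for (int kk = 0; kk < n; kk += B_DIM)
            for (int jj = 0; jj < n; jj += B_DIM)
                /* Multiply one tile of A by one tile of B,
                   accumulating into the matching tile of C. */
                for (int i = ii; i < ii + B_DIM && i < n; i++)
                    for (int k = kk; k < kk + B_DIM && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + B_DIM && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```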

Re: [algogeeks] Re: optimisation

2012-02-29 Thread Arun Vishwanathan
@all: Thanks a lot On Wed, Feb 29, 2012 at 9:02 AM, Gene gene.ress...@gmail.com wrote: For big matrices, using all the caches well is very important. The usual approach is block tiling as you say. You want two blocks to fit nicely in the L1 cache. The most highly optimized schemes will

[algogeeks] Re: optimisation

2012-02-27 Thread Dave
@Arun: I'm assuming that you are optimizing for large matrices. Try making your basic operation a 4x4 matrix multiplication, written out (i.e., without loops), where the 4x4 matrices are subarrays of the operand arrays. This should give you good cache utilization and data reuse, as each element of
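Dave's suggestion of a written-out 4x4 basic operation might look like the kernel below. A sketch only: Dave proposes unrolling all the loops, but here just the k dimension is written out to keep the example short; the point is that each loaded element of A and B is reused across several multiply-adds:

```c
/* Accumulate one 4x4 subarray product: C4 += A4 * B4, where A, B, C
   point at the top-left corner of 4x4 blocks inside row-major N x N
   matrices. The inner dot product is unrolled by hand. */
static void kernel4x4(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            C[i * n + j] += A[i * n + 0] * B[0 * n + j]
                          + A[i * n + 1] * B[1 * n + j]
                          + A[i * n + 2] * B[2 * n + j]
                          + A[i * n + 3] * B[3 * n + j];
}
```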

[algogeeks] Re: optimisation

2012-02-27 Thread Dave
@Arun: The problem with Strassen usually isn't overflow, but instability. The error bounds will be based on the largest element, rather than on each individual element. You won't want to recurse all the way to 1x1 blocks, as the bookkeeping becomes more expensive as recursion depth increases.

[algogeeks] Re: Optimisation to reduce time...

2011-07-03 Thread rajeevrvis
@sunny Ok, thanks. On Jul 3, 12:58 pm, sunny agrawal sunny816.i...@gmail.com wrote: You should have posted the problem link. I think you are trying this one: http://www.codechef.com/problems/MULTQ3/ Use RMQ or Binary Indexed Trees. Brute force won't work
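For reference, the Binary Indexed Tree sunny mentions has a very compact core. A minimal sketch of the basic structure only (point update, prefix sum, 1-indexed); note MULTQ3 itself involves range updates, so solving it would need more than this, e.g. a segment tree with lazy propagation:

```c
/* Fenwick (Binary Indexed) tree: bit[i] covers a range of size
   (i & -i) ending at i. BIT_MAX is an arbitrary illustrative size. */
#define BIT_MAX 1024
static long long bit[BIT_MAX + 1];

/* Add delta at position i (1-indexed). */
static void bit_add(int i, long long delta)
{
    for (; i <= BIT_MAX; i += i & -i)
        bit[i] += delta;
}

/* Sum of positions 1..i. */
static long long bit_sum(int i)
{
    long long s = 0;
    for (; i > 0; i -= i & -i)
        s += bit[i];
    return s;
}
```

Both operations run in O(log n), which is why brute force loses to this on large query counts.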