This is a nice paper. I think that for matrix multiplication it ends
up saying pretty much the same as we've been discussing.
The OP said serial code. Vector code isn't serial. However it's
easy to try vectorization these days. The latest versions of gcc will
do a very reasonable job with the
@Gene: if all matrices are of N x N , and my size of L1 cache is L1, then
If I need a block of A and B to fit in cache would I need it as 2 x
(blocksize )^2 x 8 ( bytes per data item-for example) = L1 ?? or would I
need it as 3 ( including C block ) x (blocksize)^2 x 8 = L1 ? Am confused.
I tried
You can get an initial guess with math. The rest you'll need to
determine by experiment. The math will be
Block Dimension = sqrt( L1 / sizeof(FLOAT) / K ) .
K could be 2 or 3 depending on the loop order. A reason you can't
predict perfectly is that interrupt handlers can add to cache load in
Here is the nice link for writing fast matrix-matrix multiplication.
http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100
Apart from this we can vectorize the code and also we can do
unrolling to get very good performance.
--
With love and regards,
Sairam Ravu
I M.Tech(CS)
Sri
For big matrices, using all the caches well is very important. The
usual approach is block tiling as you say. You want two blocks to fit
nicely in the L1 cache. The most highly optimized schemes will have a
hierarchy of tiles where two tiles at the second level will fit in the
L2 cache, etc.
@all: Thanks a lot
On Wed, Feb 29, 2012 at 9:02 AM, Gene gene.ress...@gmail.com wrote:
For big matrices, using all the caches well is very important. The
usual approach is block tiling as you say. You want two blocks to fit
nicely in the L1 cache. The most highly optimized schemes will
@Arun: I'm assuming that you are optimizing for large matrices. Try
making your basic operation a 4x4 matrix multiplication, written out
(i.e., without loops), where the 4x4 matrices are subarrays of the
operand arrays. This should give you good cache utilization and data
reuse, as each element of
@Arun: The problem with Strassen usually isn't overflow, but
instability. The error bounds will be based on the largest element,
rather than on each individual element. You won't want to recurse all
the way to 1x1 blocks, as the bookkeeping becomes more expensive as
recursion depth increases.
@sunny Ok , Thanks
On Jul 3, 12:58 pm, sunny agrawal sunny816.i...@gmail.com wrote:
You should have posted the problem link
i think u are trying this one. http://www.codechef.com/problems/MULTQ3/
http://www.codechef.com/problems/MULTQ3/use RMQ or Binary Indexed Trees.
Brute Force won't work