[algogeeks] Re: optimisation

2012-03-06 Thread Gene
This is a nice paper.  I think that for matrix multiplication it ends
up saying pretty much the same as we've been discussing.

The OP said "serial" code, and vector code isn't serial.  Still, it's
easy to try vectorization these days: the latest versions of gcc will
do a very reasonable job with the right optimization flags.  Be aware,
though, that a binary built with vector code for one machine can easily
fail on a different one.  Unlike the basic x86 instruction set, the
SIMD extensions come in several levels and branches of "standards."
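As a concrete illustration (the function and the flags below are my own example, not from the thread), a plain stride-1 loop like this is the kind of code gcc's auto-vectorizer handles well:

```c
#include <stddef.h>

/* Compile with something like:
 *   gcc -O3 -march=native -c saxpy.c
 * -march=native enables the SIMD level of the build machine, which is
 * exactly why the resulting binary may not run on an older CPU. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Simple stride-1 loop with no aliasing and no early exit,
       so the compiler can emit SSE/AVX code for it. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Adding `-fopt-info-vec` to the compile line makes gcc report which loops it vectorized.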



On Mar 6, 1:47 am, Sairam Ravu  wrote:
> Here is a nice link on writing fast matrix-matrix multiplication:
> http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100
>
> Apart from this, we can vectorize the code and do loop unrolling to
> get very good performance.
>
> --
> With love and regards,
> Sairam Ravu
> I M.Tech(CS)
> Sri Sathya Sai Institute of Higher Learning
> "To live life, you must think it, measure it, experiment with it,
> dance it, paint it, draw it, and calculate it"

-- 
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to algogeeks@googlegroups.com.
To unsubscribe from this group, send email to 
algogeeks+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/algogeeks?hl=en.



Re: [algogeeks] Re: optimisation

2012-03-05 Thread Sairam Ravu
Here is a nice link on writing fast matrix-matrix multiplication:
http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100

Apart from this, we can vectorize the code and do loop unrolling to
get very good performance.




-- 
With love and regards,
Sairam Ravu
I M.Tech(CS)
Sri Sathya Sai Institute of Higher Learning
"To live life, you must think it, measure it, experiment with it,
dance it, paint it, draw it, and calculate it"




[algogeeks] Re: optimisation

2012-03-05 Thread Gene
You can get an initial guess with math.  The rest you'll need to
determine by experiment.  The math will be

Block Dimension = sqrt( L1 / sizeof(FLOAT) / K ).

K could be 2 or 3 depending on the loop order.  A reason you can't
predict perfectly is that interrupt handlers can add to cache load in
unpredictable ways. You might find it interesting to turn off your
network and see if you can get further improvement by making blocks a
little bigger than you could with the network turned on.


On Mar 5, 2:08 pm, Arun Vishwanathan  wrote:
> @Gene: if all matrices are N x N and my L1 cache size is L1, then to fit
> blocks of A and B in cache, do I need 2 x (blocksize)^2 x 8 (bytes per
> data item, for example) = L1, or 3 x (blocksize)^2 x 8 = L1 (including
> the C block)? I'm confused. I tried a sample and got a reasonably good
> speedup with block size 32 (for matrix dimensions 512, 1024, etc.) on my
> 16 KB L1 and 256 KB L2. Any comments or inferences?
>
> On Wed, Feb 29, 2012 at 9:31 AM, Arun Vishwanathan
> wrote:
>
> > @all: Thanks a lot
>
> > On Wed, Feb 29, 2012 at 9:02 AM, Gene  wrote:
>
> >> For big matrices, using all the caches well is very important.  The
> >> usual approach is block tiling as you say.  You want two blocks to fit
> >> nicely in the L1 cache.  The most highly optimized schemes will have a
> >> hierarchy of tiles where two tiles at the second level will fit in the
> >> L2 cache, etc. Additional levels can be based on memory interfaces,
> >> interleaving, page size, and even cylinder size on disk (for really
> >> huge matrices). The idea is _never_ to generate more cache misses than
> >> you need to: a miss can cost a factor of 10 or more in performance
> >> on that operation.
>
> >> Multiplying within the innermost blocks should also consider prefetch:
> >> you'll get best performance touching locations in contiguous ascending
> >> order.  The inner loops are
>
> >> for i = 1 to ma
> >>  for j = 1 to nb
> >>    for k = 1 to na
> >>      r[i,j] += a[i,k] * b[k,j]
>
> >> This ignores that r[i,j] needs to be zeroed out somewhere.  But with
> >> this assumption, the loops can be juxtaposed in any order without
> >> changing the result. You should explore the 6 possible orderings for
> >> the best one.  Of course you also have to zero out the sums in the
> >> best possible manner.
>
> >> FWIW, a GPU will normally outperform a general purpose CPU with ease
> >> on this problem.  Since even cell phones are getting GPUs these days,
> >> tweaking single-processor matrix code is a dying art.
>
> >> Cheers.
>
> >> On Feb 27, 6:57 pm, Arun Vishwanathan  wrote:
> >> > Hi all,
>
> >> > We have this challenge to write the fastest-executing serial matrix
> >> > multiplication code. I have tried using matrix transpose (in C, for
> >> > row major) and loop unrolling, and was able to obtain a little
> >> > speedup. Does anyone have any hints/papers that I could read and try
> >> > to speed up further? I had tried a bit of block tiling but was not
> >> > successful.
>
> >> > Thanks
> >> > Arun
>
>
> > --
> >  "People often say that motivation doesn't last. Well, neither does
> > bathing - that's why we recommend it daily."
>
> --
>  "People often say that motivation doesn't last. Well, neither does bathing
> - that's why we recommend it daily."




Re: [algogeeks] Re: optimisation

2012-03-05 Thread Arun Vishwanathan
@Gene: if all matrices are N x N and my L1 cache size is L1, then to fit
blocks of A and B in cache, do I need 2 x (blocksize)^2 x 8 (bytes per
data item, for example) = L1, or 3 x (blocksize)^2 x 8 = L1 (including
the C block)? I'm confused. I tried a sample and got a reasonably good
speedup with block size 32 (for matrix dimensions 512, 1024, etc.) on my
16 KB L1 and 256 KB L2. Any comments or inferences?


On Wed, Feb 29, 2012 at 9:31 AM, Arun Vishwanathan
wrote:

> @all: Thanks a lot
>
>
> On Wed, Feb 29, 2012 at 9:02 AM, Gene  wrote:
>
>> For big matrices, using all the caches well is very important.  The
>> usual approach is block tiling as you say.  You want two blocks to fit
>> nicely in the L1 cache.  The most highly optimized schemes will have a
>> hierarchy of tiles where two tiles at the second level will fit in the
>> L2 cache, etc. Additional levels can be based on memory interfaces,
>> interleaving, page size, and even cylinder size on disk (for really
>> huge matrices). The idea is _never_ to generate more cache misses than
>> you need to: a miss can cost a factor of 10 or more in performance
>> on that operation.
>>
>> Multiplying within the innermost blocks should also consider prefetch:
>> you'll get best performance touching locations in contiguous ascending
>> order.  The inner loops are
>>
>> for i = 1 to ma
>>   for j = 1 to nb
>>     for k = 1 to na
>>       r[i,j] += a[i,k] * b[k,j]
>>
>> This ignores that r[i,j] needs to be zeroed out somewhere.  But with
>> this assumption, the loops can be juxtaposed in any order without
>> changing the result. You should explore the 6 possible orderings for
>> the best one.  Of course you also have to zero out the sums in the
>> best possible manner.
>>
>> FWIW, a GPU will normally outperform a general purpose CPU with ease
>> on this problem.  Since even cell phones are getting GPUs these days,
>> tweaking single-processor matrix code is a dying art.
>>
>> Cheers.
>>
>> On Feb 27, 6:57 pm, Arun Vishwanathan  wrote:
>> > Hi all,
>> >
>> > We have this challenge to write the fastest-executing serial matrix
>> > multiplication code. I have tried using matrix transpose (in C, for
>> > row major) and loop unrolling, and was able to obtain a little
>> > speedup. Does anyone have any hints/papers that I could read and try
>> > to speed up further? I had tried a bit of block tiling but was not
>> > successful.
>> >
>> > Thanks
>> > Arun
>>
>
>
> --
>  "People often say that motivation doesn't last. Well, neither does
> bathing - that's why we recommend it daily."
>
>


-- 
 "People often say that motivation doesn't last. Well, neither does bathing
- that's why we recommend it daily."




Re: [algogeeks] Re: optimisation

2012-02-29 Thread Arun Vishwanathan
@all: Thanks a lot

On Wed, Feb 29, 2012 at 9:02 AM, Gene  wrote:

> For big matrices, using all the caches well is very important.  The
> usual approach is block tiling as you say.  You want two blocks to fit
> nicely in the L1 cache.  The most highly optimized schemes will have a
> hierarchy of tiles where two tiles at the second level will fit in the
> L2 cache, etc. Additional levels can be based on memory interfaces,
> interleaving, page size, and even cylinder size on disk (for really
> huge matrices). The idea is _never_ to generate more cache misses than
> you need to: a miss can cost a factor of 10 or more in performance
> on that operation.
>
> Multiplying within the innermost blocks should also consider prefetch:
> you'll get best performance touching locations in contiguous ascending
> order.  The inner loops are
>
> for i = 1 to ma
>   for j = 1 to nb
>     for k = 1 to na
>       r[i,j] += a[i,k] * b[k,j]
>
> This ignores that r[i,j] needs to be zeroed out somewhere.  But with
> this assumption, the loops can be juxtaposed in any order without
> changing the result. You should explore the 6 possible orderings for
> the best one.  Of course you also have to zero out the sums in the
> best possible manner.
>
> FWIW, a GPU will normally outperform a general purpose CPU with ease
> on this problem.  Since even cell phones are getting GPUs these days,
> tweaking single-processor matrix code is a dying art.
>
> Cheers.
>
> On Feb 27, 6:57 pm, Arun Vishwanathan  wrote:
> > Hi all,
> >
> > We have this challenge to write the fastest-executing serial matrix
> > multiplication code. I have tried using matrix transpose (in C, for
> > row major) and loop unrolling, and was able to obtain a little
> > speedup. Does anyone have any hints/papers that I could read and try
> > to speed up further? I had tried a bit of block tiling but was not
> > successful.
> >
> > Thanks
> > Arun
>


-- 
 "People often say that motivation doesn't last. Well, neither does bathing
- that's why we recommend it daily."




[algogeeks] Re: optimisation

2012-02-29 Thread Gene
For big matrices, using all the caches well is very important.  The
usual approach is block tiling as you say.  You want two blocks to fit
nicely in the L1 cache.  The most highly optimized schemes will have a
hierarchy of tiles where two tiles at the second level will fit in the
L2 cache, etc. Additional levels can be based on memory interfaces,
interleaving, page size, and even cylinder size on disk (for really
huge matrices). The idea is _never_ to generate more cache misses than
you need to: a miss can cost a factor of 10 or more in performance
on that operation.

Multiplying within the innermost blocks should also consider prefetch:
you'll get best performance touching locations in contiguous ascending
order.  The inner loops are

for i = 1 to ma
  for j = 1 to nb
for k = 1 to na
  r[i,j] += a[i,k] * b[k,j]

This ignores that r[i,j] needs to be zeroed out somewhere.  But with
this assumption, the loops can be juxtaposed in any order without
changing the result. You should explore the 6 possible orderings for
the best one.  Of course you also have to zero out the sums in the
best possible manner.
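For row-major C arrays, the winner of that exploration is usually the i-k-j ordering, since the inner loop then touches r and b in contiguous ascending order. A minimal, untiled sketch (my own illustration, not code from the thread):

```c
#include <string.h>

/* i-k-j loop order: the inner j loop walks r[i][..] and b[k][..]
 * contiguously, which suits hardware prefetch on row-major layouts.
 * Matrices are n x n, stored flat in row-major order. */
void matmul_ikj(int n, const double *a, const double *b, double *r)
{
    memset(r, 0, (size_t)n * n * sizeof(double)); /* zero the sums once */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];            /* invariant in j */
            for (int j = 0; j < n; j++)
                r[i * n + j] += aik * b[k * n + j];
        }
}
```

The same body works as the per-block kernel inside a tiled scheme, with the loops bounded by the block edges instead of n.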

FWIW, a GPU will normally outperform a general purpose CPU with ease
on this problem.  Since even cell phones are getting GPUs these days,
tweaking single-processor matrix code is a dying art.

Cheers.

On Feb 27, 6:57 pm, Arun Vishwanathan  wrote:
> Hi all,
>
> We have this challenge to write the fastest-executing serial matrix
> multiplication code. I have tried using matrix transpose (in C, for
> row major) and loop unrolling, and was able to obtain a little
> speedup. Does anyone have any hints/papers that I could read and try
> to speed up further? I had tried a bit of block tiling but was not
> successful.
>
> Thanks
> Arun




[algogeeks] Re: optimisation

2012-02-27 Thread Dave
@Arun: The problem with Strassen usually isn't overflow, but
instability. The error bounds will be based on the largest element,
rather than on each individual element. You won't want to recurse all
the way to 1x1 blocks, as the bookkeeping becomes more expensive as
recursion depth increases.

Dave

On Feb 27, 10:12 pm, atul anand  wrote:
> Strassen multiplication, but it may cause overflow
>
> On Tue, Feb 28, 2012 at 5:27 AM, Arun Vishwanathan
> wrote:
>
> > Hi all,
>
> > We have this challenge to write the fastest-executing serial matrix
> > multiplication code. I have tried using matrix transpose (in C, for
> > row major) and loop unrolling, and was able to obtain a little
> > speedup. Does anyone have any hints/papers that I could read and try
> > to speed up further? I had tried a bit of block tiling but was not
> > successful.
>
> > Thanks
> > Arun
>




[algogeeks] Re: optimisation

2012-02-27 Thread Dave
@Arun: I'm assuming that you are optimizing for large matrices. Try
making your basic operation a 4x4 matrix multiplication, written out
(i.e., without loops), where the 4x4 matrices are subarrays of the
operand arrays. This should give you good cache utilization and data
reuse, as each element of the two 4x4 operand matrices will be used 4
times. You will have to clean up around the edges if the matrix
dimensions are not multiples of 4.

Dave
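A sketch of Dave's idea (the function name and stride parameters are my own): hold the 16 partial sums of a 4x4 block of C in locals so they can sit in registers; each loaded element of A and B is then reused 4 times.

```c
/* One written-out 4x4 block update: C[0..3][0..3] += A * B, where
 * lda/ldb/ldc are the row strides (leading dimensions) of the full
 * matrices. The 16 accumulators live in locals, and each a_i / b_j
 * load is used 4 times. */
void block4x4(const double *a, int lda, const double *b, int ldb,
              double *c, int ldc)
{
    double c00 = 0, c01 = 0, c02 = 0, c03 = 0;
    double c10 = 0, c11 = 0, c12 = 0, c13 = 0;
    double c20 = 0, c21 = 0, c22 = 0, c23 = 0;
    double c30 = 0, c31 = 0, c32 = 0, c33 = 0;

    for (int k = 0; k < 4; k++) {
        double a0 = a[0 * lda + k], a1 = a[1 * lda + k];
        double a2 = a[2 * lda + k], a3 = a[3 * lda + k];
        double b0 = b[k * ldb + 0], b1 = b[k * ldb + 1];
        double b2 = b[k * ldb + 2], b3 = b[k * ldb + 3];

        c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
        c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
        c20 += a2 * b0; c21 += a2 * b1; c22 += a2 * b2; c23 += a2 * b3;
        c30 += a3 * b0; c31 += a3 * b1; c32 += a3 * b2; c33 += a3 * b3;
    }
    c[0 * ldc + 0] += c00; c[0 * ldc + 1] += c01;
    c[0 * ldc + 2] += c02; c[0 * ldc + 3] += c03;
    c[1 * ldc + 0] += c10; c[1 * ldc + 1] += c11;
    c[1 * ldc + 2] += c12; c[1 * ldc + 3] += c13;
    c[2 * ldc + 0] += c20; c[2 * ldc + 1] += c21;
    c[2 * ldc + 2] += c22; c[2 * ldc + 3] += c23;
    c[3 * ldc + 0] += c30; c[3 * ldc + 1] += c31;
    c[3 * ldc + 2] += c32; c[3 * ldc + 3] += c33;
}
```

The outer loops then step i, j, and k in multiples of 4, with a scalar cleanup path for the edges Dave mentions when the dimensions are not multiples of 4.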

On Feb 27, 5:57 pm, Arun Vishwanathan  wrote:
> Hi all,
>
> We have this challenge to write the fastest-executing serial matrix
> multiplication code. I have tried using matrix transpose (in C, for
> row major) and loop unrolling, and was able to obtain a little
> speedup. Does anyone have any hints/papers that I could read and try
> to speed up further? I had tried a bit of block tiling but was not
> successful.
>
> Thanks
> Arun




[algogeeks] Re: Optimisation to reduce time...

2011-07-03 Thread rajeevrvis
@sunny: Ok, thanks.

On Jul 3, 12:58 pm, sunny agrawal  wrote:
> You should have posted the problem link.
> I think you are trying this one.
>
> Use RMQ or Binary Indexed Trees.
> Brute force won't work.
>
> On Sun, Jul 3, 2011 at 1:17 PM, rajeevrvis wrote:
>
> > Hi, here is the code. I want to optimize it to run faster.
> > Can anyone help me?
>
> > #include <stdio.h>
> > int main(void)
> > {
> >   int n, q, a[10] = {0}, b[10], c[10], d[10], i, count, j;
> >   scanf("%d%d", &n, &q);
> >   for (i = 0; i < q; i++)
> >     scanf("%d%d%d", &b[i], &c[i], &d[i]);
> >   for (i = 0; i < q; i++)
> >   {
> >     if (b[i] == 0)
> >     {
> >       for (j = c[i]; j <= d[i]; j++)
> >         a[j] = a[j] + 1;
> >     }
> >     else
> >     {
> >       count = 0;
> >       for (j = c[i]; j <= d[i]; j++)
> >         if (a[j] % 3 == 0)
> >           count++;
> >       printf("%d\n", count);
> >     }
> >   }
> >   return 0;
> > }
>
> >  Regards
>
> > rajeevrvis
>
>
> --
> Sunny Aggrawal
> B-Tech IV year,CSI
> Indian Institute Of Technology,Roorkee
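For reference, Sunny's suggestion can be sketched as a segment tree with lazy propagation (a close cousin of the RMQ/BIT approach he names; everything below, including MAXN, is my own illustrative code, not from the thread). Each node keeps counts of the three residues mod 3 over its range; a range "+1" update just rotates the counts, and a query reads off the count of residue 0:

```c
#include <string.h>

#define MAXN 100000
static int cnt[4 * MAXN][3]; /* cnt[node][r]: positions with value = r (mod 3) */
static int lazy[4 * MAXN];   /* pending additions on this subtree, mod 3 */

void build(int node, int lo, int hi)
{
    cnt[node][0] = hi - lo + 1;        /* all values start at 0 */
    cnt[node][1] = cnt[node][2] = 0;
    lazy[node] = 0;
    if (lo < hi) {
        int mid = (lo + hi) / 2;
        build(2 * node, lo, mid);
        build(2 * node + 1, mid + 1, hi);
    }
}

static void shift(int node, int v)     /* add v to every value in node */
{
    int s[3];
    for (int t = 0; t < 3; t++)
        s[(t + v) % 3] = cnt[node][t]; /* rotate residue counts by v */
    memcpy(cnt[node], s, sizeof s);
    lazy[node] = (lazy[node] + v) % 3;
}

static void push(int node)             /* propagate pending adds down */
{
    if (lazy[node]) {
        shift(2 * node, lazy[node]);
        shift(2 * node + 1, lazy[node]);
        lazy[node] = 0;
    }
}

void update(int node, int lo, int hi, int l, int r) /* +1 on [l, r] */
{
    if (r < lo || hi < l) return;
    if (l <= lo && hi <= r) { shift(node, 1); return; }
    push(node);
    int mid = (lo + hi) / 2;
    update(2 * node, lo, mid, l, r);
    update(2 * node + 1, mid + 1, hi, l, r);
    for (int t = 0; t < 3; t++)
        cnt[node][t] = cnt[2 * node][t] + cnt[2 * node + 1][t];
}

int query(int node, int lo, int hi, int l, int r) /* #{a[j] % 3 == 0} */
{
    if (r < lo || hi < l) return 0;
    if (l <= lo && hi <= r) return cnt[node][0];
    push(node);
    int mid = (lo + hi) / 2;
    return query(2 * node, lo, mid, l, r)
         + query(2 * node + 1, mid + 1, hi, l, r);
}
```

Both operations are O(log n), so q queries cost O(q log n) instead of the O(nq) of the brute-force loops above.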
