[sage-devel] Re: slightly OT: new M4RI library
On Tuesday 20 May 2008, Bill Hart wrote:
> Martin, On the M4RI website you say that M4R inversion is asymptotically
> fast with time complexity n^3/log_2(n), but Strassen's method has
> complexity n^log2(7), which I would call asymptotically fast. Do you just
> mean asymptotically faster than the classical algorithm?

Yes, I'll fix that.

> By the way, I wrote some code ages ago for computing the row echelon form
> of sparse matrices over GF2 which essentially inverts 6000x6000 sparse
> matrices in about a second.

Excellent, is the code available somewhere? How dense are these matrices?

My plan for M4RI is:
 - polish what we have for multiplication now, find appropriate/acceptable parameters for a wide range of platforms, make sure it builds everywhere etc.
 - during dev1 Greg and I are going to work on Strassen + M4RI/echelonisation. This is the actual application we are most interested in.
 - Once that is done, I want to look into sparse matrices (which are probably much more relevant for my work than dense matrices), so I'd be interested in looking into your old code. I'm not sure this should go under the roof of the M4RI library, but if it does then this probably warrants a name change :-)

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]

To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
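For readers following along, the n^3/log_2(n) bound comes from the Method of the Four Russians table trick: all 2^k GF(2) linear combinations of k rows are precomputed in Gray-code order, so each table entry costs a single row XOR, and a lookup then eliminates k columns at once. A minimal sketch of the table construction (my own illustration, not M4RI's actual API; rows are shrunk to a single 64-bit word):

```c
#include <stdint.h>

/* Hypothetical illustration (not M4RI's actual API): build the table of
   all 2^k GF(2) linear combinations of k rows, each row a single 64-bit
   word here. Walking the subsets in Gray-code order means each entry
   differs from the previous one by exactly one row, so the whole table
   costs about 2^k row XORs instead of roughly k * 2^k. */
void build_gray_table(const uint64_t rows[], int k, uint64_t table[]) {
    table[0] = 0;                     /* empty combination */
    uint64_t prev = 0;                /* Gray code of the previous step */
    for (uint64_t i = 1; i < (1ULL << k); i++) {
        uint64_t gray = i ^ (i >> 1); /* Gray code of i */
        uint64_t diff = gray ^ prev;  /* exactly one bit set */
        int bit = 0;
        while (!((diff >> bit) & 1))
            bit++;                    /* which row toggled */
        table[gray] = table[prev] ^ rows[bit];
        prev = gray;
    }
}
```

After the call, `table[mask]` is the XOR of all `rows[j]` with bit j set in `mask`, which is exactly what the multiplication loop looks up per k-bit chunk.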
[sage-devel] Re: slightly OT: new M4RI library
My code is not downloadable anywhere that I am aware of. I used to use it in my quadratic sieve before I switched to Jason P's block Lanczos code (block Lanczos solves a more restricted problem, but that is fine for the QS). My old code is some of the earliest mathematical C code that I wrote, so it can certainly be improved, but I can make it available at some point.

The crossover with block Lanczos was at about 6000x6000. The 6000x6000 matrices had something like 12 nonzero entries in each row, so quite sparse. Of course the matrix quickly becomes dense as the algorithm proceeds. I was also taking advantage of the fact that the left hand side of the matrix was denser than the right, which is probably not the situation arising in your research, Martin. At any rate, it might make a reasonable base case for an implementation of something.

Bill.

On 20 May, 09:47, Martin Albrecht [EMAIL PROTECTED] wrote:
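Since sparse matrices over GF(2) may come back up: one common representation is a sorted array of the nonzero column indices of each row, where row addition over GF(2) becomes a symmetric-difference merge (indices present in both rows cancel, since 1 + 1 = 0). A hedged sketch of that merge (my own illustration, not Bill's actual code):

```c
#include <stddef.h>

/* Hypothetical sketch: add two sparse GF(2) rows, each stored as a
   sorted array of nonzero column indices. The result is the symmetric
   difference of the index sets; `out` must have room for na + nb
   entries. Returns the number of nonzero entries in the result. */
size_t sparse_row_add(const int *a, size_t na,
                      const int *b, size_t nb, int *out) {
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])      out[n++] = a[i++];
        else if (b[j] < a[i]) out[n++] = b[j++];
        else { i++; j++; }    /* equal indices cancel mod 2 */
    }
    while (i < na) out[n++] = a[i++];
    while (j < nb) out[n++] = b[j++];
    return n;
}
```

This runs in O(na + nb) per row addition, which is why sparse elimination stays cheap only while rows stay sparse, matching Bill's remark that the matrix fills in as the algorithm proceeds.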
[sage-devel] Re: slightly OT: new M4RI library
Martin, does your machine speed up at all with any of the SSE code?

I spent considerable time yesterday trying to optimise the combine3 function I wrote (note: I didn't submit this improved code). Whilst I did speed it up considerably by removing %'s and /'s, changing the function prototype to accept rows directly, and making lots of other minor improvements, the overall result was still much slower than the generic C code. If the SSE code doesn't ever speed it up (and it might not be faster to use SSE when there is at least one load/store per arithmetic operation), then it might be worth throwing all the SSE code out completely, since it just pollutes what is otherwise very nice looking code.

I did check, by compiling the code to assembler without assembling it (the -S option in gcc), that it is actually using SSE instructions and that it is doing so in what I would consider an optimal way. So it really is SSE itself that is slow, not just inefficiencies introduced by the compiler.

One possible avenue for exploration is the gf2x library by Zimmermann, Brent, et al. They use SSE in their code for polynomials over GF2, apparently to good effect. We could look to see if there is some trick to using it efficiently. The difference there, however, is probably that the polynomials get quite long and multiple arithmetic operations need to be done for each store/load. I don't know if any of their tricks can help us for that reason.

Bill.

On 19 May, 02:00, Martin Albrecht [EMAIL PROTECTED] wrote:
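For reference, the kind of three-operand combine being discussed computes a[i] ^= b[i] ^ c[i] across a row of machine words. A portable sketch (an illustration, not the actual M4RI combine3; the SSE2 path does the same XORs two 64-bit words at a time and compiles out cleanly on machines without SSE2, matching the suggestion that SSE support stay optional):

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical sketch of a combine3-style routine:
   dst[i] ^= b[i] ^ c[i] over nwords 64-bit words. */
void combine3(uint64_t *dst, const uint64_t *b,
              const uint64_t *c, size_t nwords) {
    size_t i = 0;
#ifdef __SSE2__
    /* Process two 64-bit words per iteration with 128-bit XORs.
       Unaligned loads/stores are used so the sketch works for any
       row pointer; aligned variants need 16-byte-aligned rows. */
    for (; i + 2 <= nwords; i += 2) {
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i));
        __m128i vd = _mm_loadu_si128((const __m128i *)(dst + i));
        vd = _mm_xor_si128(vd, _mm_xor_si128(vb, vc));
        _mm_storeu_si128((__m128i *)(dst + i), vd);
    }
#endif
    /* Generic C tail (and the whole loop without SSE2). */
    for (; i < nwords; i++)
        dst[i] ^= b[i] ^ c[i];
}
```

Note the arithmetic density here: one XOR per load, which is exactly the load/store-bound regime where, as Bill observes, SSE may not win anything.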
[sage-devel] Re: slightly OT: new M4RI library
I had a look at gf2x. The reason they can make use of SSE is the shl and shr capabilities. Doing a shift 128 bits at a time is more efficient, since one operation can be done instead of quite a few using general purpose registers.

Bill.

On 19 May, 13:52, Bill Hart [EMAIL PROTECTED] wrote:
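To make the shift trick concrete: SSE2 has no single full 128-bit bit shift, but a per-lane shift plus a carry between the two 64-bit lanes still replaces several general-purpose-register operations. A portable sketch of the carry logic (my own illustration; with SSE2 the same steps would be `_mm_slli_epi64` for the per-lane shift, `_mm_srli_epi64` plus `_mm_slli_si128` for the carry, and `_mm_or_si128` to merge):

```c
#include <stdint.h>

/* A 128-bit value as two 64-bit lanes; lo holds bits 0..63. */
typedef struct { uint64_t lo, hi; } u128;

/* Shift left by s bits, valid for 0 < s < 64 (the shift amount used
   for bit-aligning GF(2)[x] polynomial words is always in this range
   after whole-word shifts are handled separately). */
u128 shl128(u128 x, unsigned s) {
    u128 r;
    r.hi = (x.hi << s) | (x.lo >> (64 - s)); /* carry lo -> hi */
    r.lo = x.lo << s;
    return r;
}
```

In gf2x-style polynomial code this runs over long operand arrays, so the per-shift cost is amortised over many words, unlike the one-XOR-per-load pattern in the matrix combine.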
[sage-devel] Re: slightly OT: new M4RI library
Bill, I do get a small speed-up on the Core2Duo for SSE2, but I'm not sure it is worth the trouble (I agree that it makes the otherwise pretty looking code ugly). I have some timings (for an old implementation) here: http://trac.sagemath.org/sage_trac/ticket/3204#comment:2

My guess is that SSE2 is slower on the Opteron because SSE2 is basically an Intel thing and only provided by AMD for compatibility reasons. There are several reports of SSE2 being slow on the Opteron, and I guess the SSE2 integer operations were not a focus for speed, since MMX/SSE is mainly about floating point. One thing I noticed on the Opteron is that putting the code in mzd_combine vs. putting the same code directly in the calling function made a huge difference. I blamed it on better cache prefetching support, but that was probably premature.

My proposal:
 - This evening I'll update my code with your 8 Gray tables and check the performance on the C2D.
 - Then I'll re-introduce SSE2 and check whether it makes a worthwhile difference; if not, we drop SSE2 from the multiplication.

Martin

PS: I tried a quick and dirty OpenMP (which is cool, btw) based parallel implementation of Strassen-Winograd yesterday and it gives - as is - a speedup of 1.8 or so (so not optimal yet). But comparing that with Magma feels like cheating; first we should aim for better speed with the same resources, and then we switch to parallel implementations for even better times.
Anyway, one week ago I wouldn't have believed that I could do a 10^4 x 10^4 matrix multiplication in 1.7 seconds on my notebook :-)
[sage-devel] Re: slightly OT: new M4RI library
You seemed to be getting up to 8% at points there. That's definitely worth it. I'll be interested to see this evening how it comes out, though I recommend optimising my combine3 function (which I suppose should now be combine8), even including it inline rather than having it in a separate file. Of course on the Opteron, SSE should be switched off, since it is definitely slower by about 5%-10% even with careful optimisation.

Bill.

On 19 May, 14:23, Martin Albrecht [EMAIL PROTECTED] wrote:
[sage-devel] Re: slightly OT: new M4RI library
On Monday 19 May 2008, Bill Hart wrote:
> I recommend optimising my combine3 function (which I suppose should now
> be combine8), even including it inline rather than having it in a
> separate file.

Okay, so a good compromise is to remove all SSE2 stuff from the main function _mzd_mul_m4rm_impl and put it into a static inline _mzd_combine8 function which is specifically tailored to this particular application. Thus the code still looks relatively pretty/elegant, but we can have SSE2 support.

Martin
[sage-devel] Re: slightly OT: new M4RI library
On Monday 19 May 2008, Bill Hart wrote:
> I'll be interested to see this evening how it comes out, though I
> recommend optimising my combine3 function (which I suppose should now be
> combine8), even including it inline rather than having it in a separate
> file.

Bill, some progress report for the C2D. I incorporated your changes with the following modifications:

 - if (x1==0 && x2==0 && x3==0 && x8==8) ... I removed this one since it seems rather unlikely that this is ever true.
 - I added a #define called GRAY8 which defines whether 8 or 4 tables ought to be used. I figured this might be handy for machines with a smaller L1.
 - I added the correct a_nc%k values back in.
 - We don't need to make sure that the allocated buffers are correctly aligned, since we try to allocate with _mm_malloc. If that is not available we should probably just use posix_memalign.

The speed-up on the C2D (similar to the Opteron) is considerable (the last column is a parallel toy implementation):

64-bit Debian/GNU Linux, 2.33Ghz Core2Duo, cutoff=2^12 (*)

Matrix Dimension     Magma    M4RI     M4RI (OpenMP, walltime)
10,000 x 10,000       2.920    2.270    1.470
16,384 x 16,384      11.140    9.130    5.540
20,000 x 20,000      20.370   16.110   11.800
32,000 x 32,000      75.510   64.340   40.040

Amazingly, M4RM alone (without Strassen-Winograd) now beats Magma up to 2*10^4 x 2*10^4 in this configuration:

sage: A = random_matrix(GF(2),10^3, 10^3);
sage: B = random_matrix(GF(2),10^3, 10^3);
sage: magma(A._multiply_m4rm(B)) == magma(A)*magma(B)
True
sage: A = random_matrix(GF(2),2*10^4, 2*10^4);
sage: B = random_matrix(GF(2),2*10^4, 2*10^4);
sage: time A._multiply_m4rm(B)
CPU times: user 18.32 s, sys: 0.10 s, total: 18.42 s
Wall time: 18.65

64-bit Debian/GNU Linux, 1.8Ghz Opteron (sage.math), cutoff=2^11

Matrix Dimension     Magma 2.13-5 (64-bit)    M4RI-20080518 (64-bit)
10,000 x 10,000        4.190                     4.290
16,384 x 16,384       15.360                    15.230
20,000 x 20,000       29.530                    28.640
32,000 x 32,000      103.970                   114.620

That does seem to roughly match what you reported. I'll now look into SSE2 support.

Cheers,
Martin

(*) Note: Magma is 64-bit but not optimised for C2D.
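The alignment point above can be made concrete. A minimal fallback along the lines Martin suggests (an assumption of mine, not M4RI's actual allocator): use posix_memalign when _mm_malloc is unavailable, so rows land on the 16-byte boundaries that aligned SSE2 loads and stores require:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical sketch of the fallback discussed: return a 16-byte
   aligned buffer of nbytes bytes, or NULL on failure. On platforms
   with <xmmintrin.h> one could use _mm_malloc(nbytes, 16) /
   _mm_free instead; this version is freed with plain free(). */
void *alloc_aligned16(size_t nbytes) {
    void *p = NULL;
    if (posix_memalign(&p, 16, nbytes) != 0)
        return NULL; /* allocation failed */
    return p;
}
```

A matrix library would call this per row (or per row block), so every row pointer satisfies the alignment assumption the SSE2 code paths rely on.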
[sage-devel] Re: slightly OT: new M4RI library
Martin,

On the M4RI website you say that M4R inversion is asymptotically fast with time complexity n^3/log_2(n), but Strassen's method has complexity n^log2(7), which I would call asymptotically fast. Do you just mean asymptotically faster than the classical algorithm?

By the way, I wrote some code ages ago for computing the row echelon form of sparse matrices over GF2 which essentially inverts 6000x6000 sparse matrices in about a second.

Bill.

On 19 May, 19:58, Martin Albrecht [EMAIL PROTECTED] wrote:
[sage-devel] Re: slightly OT: new M4RI library
I sorted out what was wrong. In my combine code I was combining operands in the wrong order. This meant that a whole lot of zeroes ended up where they shouldn't be, hence the if (x) thing working. So my times were unfortunately completely screwed up.

I went back to a fresh tarball with the original code and applied my changes one by one. Only two changes actually improved the time. First I give Magma times vs. your (Martin's) code:

                 Magma     Martin
1x1:             2.940s    3.13s
16384x16384:     9.250s   12.96s
2x2:            16.57s    22.43s
32000x32000:    59.1s     90.2s

Now I give times for my code with a single Gray table. The legend below indicates what the three times mean:

                 0)        1)        2)
1x1:             7.76s     6.70s     4.40s
16384x16384:    44.6s     37.3s     18.3s
2x2:            53.3s     45.9s     31.2s
32000x32000:     ---      194s      134s

0) cutoff = 2048, original code, no SSE
1) cutoff = 3600, CACHE BLOCK A (256, 768)
2) cutoff = 7200, fixed k = 6

So it is clear that two Gray tables is much better than one. The only thing puzzling me now is how you get away with k = 7 with two Gray tables, whereas I was using k = 6 with one table. Also, you stick with BLOCK = 768, whereas I found it optimal to switch to 256 for 16384x16384. However, if I make any additional changes to your code it just slows down. It must have to do with this copying out that you did. That must significantly affect cache performance.

Anyhow, I'm now going to wipe all the versions I have (except my working one with an optimal single Gray table implementation) and just start from your code as a base for further improvements. We are still up to 50% slower than Magma on the Opteron!!

Bill.

On 18 May, 02:21, Bill Hart [EMAIL PROTECTED] wrote:
> I managed to get the modified version from the spkg. Nice code!!
> Unfortunately it is not as fast on my Opteron. So more work tomorrow for
> me to try and get it down to the same times as I have with my version.
> Here are the times, all on my Opteron. Note your C2D version was optimal
> at a cutoff of 2048, not 7200 as for my code.
>
> Now I am worried that maybe my code is actually broken somehow and still
> passing the test code. I'll carefully make the changes to your code
> tomorrow to see if that is the case.
>
>               Magma    C2D-M4RI:2048   AMD-M4RI:7200   AMD-M4RI:2048
> 1x1:           2.940s    3.13s           3.442s          4.132s
> 16384x16384:   9.250s   12.96s          11.47s          11.80s
> 2x2:          16.57s    22.43s          19.3s           26.05s
> 32000x32000:  59.05s    90.20s          71.9s           71.8s
>
> Bill.
>
> On 18 May, 01:58, Bill Hart [EMAIL PROTECTED] wrote:
> > On 18 May, 00:40, Martin Albrecht [EMAIL PROTECTED] wrote:
> > > My version is here:
> > > http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg
> > > (this needs an updated patch for Sage) and here:
> > > http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz
> > > (which is the raw source).
> >
> > This pure C version seems to be the old version, before you made either
> > of the two big changes.
> >
> > Bill.
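One way to reason about the k and table-count tuning above (a back-of-envelope model of my own, not a formula from the thread): the Gray tables must stay resident in the L1 data cache while a block is processed, and t tables of 2^k rows of `width` 64-bit words occupy t * 2^k * width * 8 bytes. A trivial helper makes the trade-off visible:

```c
#include <stddef.h>

/* Back-of-envelope helper (an assumption of mine, not part of M4RI):
   bytes occupied by `tables` Gray tables, each holding 2^k rows of
   `width` 64-bit words. For k = 6 and width = 32 (a 2048-column
   block), one table is 64 * 32 * 8 = 16 KiB; raising k to 7 doubles
   that, which is one plausible reason a larger k only pays off when
   the tables still fit comfortably in L1. */
size_t gray_table_bytes(size_t tables, unsigned k, size_t width) {
    return tables * ((size_t)1 << k) * width * 8;
}
```

Against, say, a 64 KiB L1 data cache, one k = 6 table of width 32 uses a quarter of the cache, while eight k = 8 tables of the same width would need 512 KiB, so they compete with the matrix rows themselves.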
[sage-devel] Re: slightly OT: new M4RI library
Hi,

first, I recorded the different speed-ups in a small table for an overview in the attachment (I think we've come a long way :-)).

To disable the copying out one needs to edit

  /* we copy the matrix first since it is only constant memory overhead
     and improves data locality, if you remove it make sure there are no
     speed regressions */
  /* C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE); */
  packedmatrix *Cbar = mzd_init(C->nrows, C->ncols);
  Cbar = _mzd_mul_m4rm_impl(Cbar, A, B, 0, FALSE);
  mzd_copy(C, Cbar);
  mzd_free(Cbar);
  return C;

in strassen.c to

  /* we copy the matrix first since it is only constant memory overhead
     and improves data locality, if you remove it make sure there are no
     speed regressions */
  C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE);
  return C;

This disables the copying out.

Martin

PS: If I find some time later today I'll make some changes such that SSE2 can be used more often, i.e. align each row at a 16-byte border if HAVE_SSE2 is set.

[attachment: timings.html]

                Magma 2.14-10   M4RI-Orig   CacheFriendly M4RM   TwoTables   CacheFriendly, k=6, cutoff=7200
1x1              2.94            8           4.529                3.13         4.4
16000x16000      9.25           43          20                   12.96        18.3
2x2             16.57           59          32.34                22.43        31.2
32000x32000     59.1            --          --                   90.2        134
[sage-devel] Re: slightly OT: new M4RI library
If two Gray tables is better than one, then you can't have enough of a good thing, right? So I made 3 Gray tables now. The files I modified are in: http://sage.math.washington.edu/home/wbhart/m4ri3gray/

There are three main modifications:

1) Make all matrices SSE aligned if we have SSE2, and make all rows of all matrices aligned. This required a fix in brilliantrussian.c, since it makes some assumptions about how the data is stored in matrices.

2) Introduce functions combine2 and combine3 which use SSE to do a[i] ^= b[i] ^ c[i] and a[i] ^= b[i] ^ c[i] ^ d[i].

3) Code to use three Gray tables. I had to remove the a_nc%k's since these were causing segfaults for reasons I don't understand.

Here are the Magma times, your old times and the new times on my Opteron:

                 Magma     Old       New
1x1:             2.940s    3.13s     2.82s
16384x16384:     9.250s   12.96s    11.25s
2x2:            16.57s    22.43s    19.39s
32000x32000:    59.1s     90.2s     81.56s

Note I do not use the new combine2 and combine3 functions, as they slow it down on my machine. I cannot believe how useless SSE seems to be! I've commented out the three lines that use combine3 in brilliantrussian.c in case you want to try it with SSE on your C2D.

Bill.

On 18 May, 17:36, Martin Albrecht [EMAIL PROTECTED] wrote:
[sage-devel] Re: slightly OT: new M4RI library
The copying out makes a 50% difference (it's better with copying) to the speed of 16384x16384, but no difference to 1x1 or 2x2. That's weird.

Bill.
[sage-devel] Re: slightly OT: new M4RI library
I tried copying out the input matrices to the M4RM routine, but only when the rows aren't all contiguous in memory. This didn't speed anything up. Of course the reason is that the second matrix is only read a handful of times, to construct the Gray tables, which are then used extensively. The first matrix is read out of order anyway by M4RM, so there's no point making all its rows contiguous.

It's hard to see how to get that last bit we need to beat Magma. What I don't quite understand now is the fact that we are beating Magma all the way up to 1x1, which is surely past their crossover to Strassen. But we start losing for large matrices when we use Strassen. I checked that the addition is not slower than Magma (it's not; it's up to 5 times faster). The only trick I have left to try is to use twice the number of Gray tables but make them half the width, though that seems like cheating, since we should already have a fast enough base case by now!

Bill.

On 18 May, 20:09, Bill Hart [EMAIL PROTECTED] wrote:
[sage-devel] Re: slightly OT: new M4RI library
Well, apparently there are speed gains right up to 8 Gray tables, though I really have no idea why that is. Here are the new times:

             Magma    Old      New
10000x10000: 2.940s   3.13s    2.32s
16384x16384: 9.250s   12.96s   9.17s
20000x20000: 16.57s   22.43s   16.49s
32000x32000: 59.1s    90.2s 81.6s  65.7s

So now we beat Magma all the way up to 20000x20000. I'll put the adjusted code in the directory http://sage.math.washington.edu/home/wbhart/m4ri3gray/ in just a moment. I also found a memory leak in my code, which I fixed. Bill. On 18 May, 19:19, Bill Hart [EMAIL PROTECTED] wrote: If two Gray tables are better than one, then you can't have enough of a good thing, right? So I made 3 Gray tables now. The files I modified are in: http://sage.math.washington.edu/home/wbhart/m4ri3gray/ There are three main modifications: 1) Make all matrices SSE-aligned if we have SSE2, and make all rows of all matrices aligned. This required a fix in brilliantrussian.c, since it makes some assumptions about how the data is stored in matrices. 2) Introduce functions combine2 and combine3, which use SSE to do a[i] ^= b[i] ^ c[i] and a[i] ^= b[i] ^ c[i] ^ d[i]. 3) Code to use three Gray tables. I had to remove the a_nc%k's since these were causing segfaults for reasons I don't understand. Here are the Magma times, your old times and the new times on my Opteron:

             Magma    Old      New
10000x10000: 2.940s   3.13s    2.82s
16384x16384: 9.250s   12.96s   11.25s
20000x20000: 16.57s   22.43s   19.39s
32000x32000: 59.1s    90.2s    81.56s

Note I do not use the new combine2 and combine3 functions, as they slow it down on my machine. I cannot believe how useless SSE seems to be! I've commented out the three lines that use combine3 in brilliantrussian.c in case you want to try it with SSE on your C2D. Bill.
[sage-devel] Re: slightly OT: new M4RI library
On Monday 19 May 2008, Bill Hart wrote: Well, apparently there are speed gains right up to 8 Gray tables, though I really have no idea why that is. So now we beat Magma all the way up to 20000x20000. I also found a memory leak in my code which I fixed. Bill. Cool, I'll take a look tomorrow. It seems we're closing in on the classical algorithm, though. Although you don't seem to decrease k anymore. As for the copying and 2^14 vs. 10^4: maybe the extra rows/columns cost too many cycles. I'll try valgrind/callgrind/cachegrind on the code to see if that is true. Also, maybe for 2^14 x 2^14 the submatrices are nicely aligned on cache lines. I'm wildly speculating here. Martin
[sage-devel] Re: slightly OT: new M4RI library
I figured out why 8 tables is optimal on this machine. The L1 cache is 128KB, and at 2048x2048, 8 Gray tables with k = 5 fit exactly in half the cache, which is probably optimal. Bill. On 19 May, 02:00, Martin Albrecht [EMAIL PROTECTED] wrote: Cool, I'll take a look tomorrow. It seems we're closing in on the classical algorithm, though. Although you don't seem to decrease k anymore. I'm wildly speculating here. Martin
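The arithmetic behind this checks out directly: 8 tables of 2^5 rows, each row 2048 bits wide, occupy 8 × 32 × 256 bytes = 65536 bytes = 64KB, exactly half of a 128KB L1 cache. A small sketch (the helper name is ours):

```c
/* Memory footprint of the Gray tables: `tables` tables of 2^k rows,
   each row `row_bits` bits wide. Helper name is ours, for illustration. */
static unsigned long gray_table_bytes(unsigned tables, unsigned k,
                                      unsigned row_bits)
{
    return (unsigned long)tables * (1ul << k) * (row_bits / 8ul);
}
```

For Bill's configuration, gray_table_bytes(8, 5, 2048) is 65536 bytes, half of the 128KB Opteron L1.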
[sage-devel] Re: slightly OT: new M4RI library
Here's another idea which should speed things up a bit. For 10000x10000 we currently use k = 6. Instead of this, we could use k = 5 and make two Gray tables simultaneously. This will still fit in cache. Instead of doing 6 bits at a time, we can then do 10 bits at a time. We'd load the appropriate line from the first Gray table, then the appropriate one from the second and xor them, then xor with the output matrix. This should decrease the number of loads and stores considerably. Moreover, the SSE instructions will then be much more efficient, as the ratio of arithmetic instructions to loads and stores is higher. Of course one could also do 16 bits at a time by using 4 tables, but I think this might actually get slower again, since you've only increased the amount of work done by 60% but had a 30% increase in instructions. Bill. On 17 May, 17:45, Bill Hart [EMAIL PROTECTED] wrote: Martin, The test code still passes if you change RADIX to 128. I've no idea how it passes, but it does. Shame the results are not correct, because this speeds the code up by a factor of 2. I notice that in the SSE code, you check to see if alignment can be achieved, otherwise it doesn't use SSE. But this introduces an unpredictable branch. Also, where there are three operands, you can't use SSE2, because the likelihood of all three being aligned is too small. I think a better idea would be to explicitly force all matrices and all rows to be 128-bit aligned if the matrices are wide enough to benefit from SSE2. Then the combine function can always use SSE2 and there will be no need to check for alignment. I experimented with interleaving MMX and GPR XORs, but this doesn't speed anything up. There are more instructions emitted and the time stays about the same. The only way interleaving the MMX and GPR code would speed things up is if there was more computation going on in the registers and less memory loading and storing, I think. Bill.
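A scalar sketch of the combining step in the two-table scheme: one row from each k = 5 table is XORed into the output row in a single pass, so each output word costs two XORs against three loads and one store, instead of two separate passes costing four loads, two stores and two XORs. This is our illustration, modelled on the combine2 idea that appears elsewhere in this thread, not the M4RI source.

```c
#include <stdint.h>
#include <stddef.h>

/* One-pass combine of one row from each of two Gray tables into the
   output row of width `wide` words. Per word: two XORs, three loads
   (dst, t0_row, t1_row) and one store. Names are ours. */
static void combine2(uint64_t *dst, const uint64_t *t0_row,
                     const uint64_t *t1_row, size_t wide)
{
    for (size_t i = 0; i < wide; i++)
        dst[i] ^= t0_row[i] ^ t1_row[i];
}
```

An SSE2 version would do the same word-wise XORs on 128-bit registers, which is where the improved arithmetic-to-memory ratio pays off.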
On 17 May, 15:45, Bill Hart [EMAIL PROTECTED] wrote: Hi Martin, Here is another 10% improvement. In the loop at the bottom of mzd_combine you can explicitly unroll by a factor of 8:

word *end = b1_ptr + wide;
register word *end8 = end - 8;
while (b1_ptr < end8) {
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
}
while (b1_ptr < end) {
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
}

I did this in combination with changing the crossover for 10000x10000 from 3600 to 7200. Bill. On 17 May, 09:40, Martin Albrecht [EMAIL PROTECTED] wrote: On Saturday 17 May 2008, Bill Hart wrote: In going from 5000x5000 to 10000x10000, Magma's time increases by a factor of less than 4. That is impossible. Strassen will never help us there. They must be doing something else. Probably something clever. Bill. I was stuck there too yesterday. Maybe only at 10000x10000 the pipeline gets fully utilised? Martin PS: If we run out of ideas we can simply go for parallelism; that should help on sage.math ;-)
[sage-devel] Re: slightly OT: new M4RI library
On Saturday 17 May 2008, Bill Hart wrote: Martin, The test code still passes if you change RADIX to 128. I've no idea how it passes, but it does. Shame the results are not correct, because this speeds the code up by a factor of 2. Since all routines use the RADIX and I only check whether their results match, they are all wrong in the same way and it isn't detected. I should add a test with known answers, I suppose. I notice that in the SSE code, you check to see if alignment can be achieved, otherwise it doesn't use SSE. But this introduces an unpredictable branch. Also, where there are three operands, you can't use SSE2, because the likelihood of all three being aligned is too small. I think a better idea would be to explicitly force all matrices and all rows to be 128-bit aligned if the matrices are wide enough to benefit from SSE2. Then the combine function can always use SSE2 and there will be no need to check for alignment. I'll try that. I experimented with interleaving MMX and GPR XORs, but this doesn't speed anything up. There are more instructions emitted and the time stays about the same. The only way interleaving the MMX and GPR code would speed things up is if there was more computation going on in the registers and less memory loading and storing, I think. I came to the same conclusion (but my code might not have been as good as yours). I improved other areas of the code (e.g. use naive multiplication rather than M4RM if B->ncols < RADIX, since it is faster, etc.) I can forward you my newest tarball (but the speed improvements aren't really noticeable yet).
Martin
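Martin's point about tests with known answers can be illustrated as follows: check a naive GF(2) multiplication against a product worked out by hand, so that a systematic error shared by all routines (such as a wrong RADIX) is still caught even when the routines agree with each other. This is a hypothetical sketch, not the M4RI test suite; rows are bitmasks with bit j holding column j.

```c
#include <stdint.h>
#include <stddef.h>

/* Naive GF(2) matrix multiplication for n <= 64 columns, used here as
   the subject of a known-answer test: row i of each matrix is a
   uint64_t bitmask with bit j = entry (i,j). */
static void gf2_mul_naive(uint64_t *C, const uint64_t *A,
                          const uint64_t *B, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        C[i] = 0;
        for (size_t j = 0; j < n; j++)
            if ((A[i] >> j) & 1)
                C[i] ^= B[j]; /* add row j of B when A[i][j] = 1 */
    }
}
```

With A = [[1,1],[0,1]] and B = [[1,0],[1,1]], the product over GF(2) is [[0,1],[1,1]] by hand; asserting against such a fixed value catches errors that match-each-other comparisons miss.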
[sage-devel] Re: slightly OT: new M4RI library
On Saturday 17 May 2008, Martin Albrecht wrote: I think a better idea would be to explicitly force all matrices and all rows to be 128-bit aligned if the matrices are wide enough to benefit from SSE2. Then the combine function can always use SSE2 and there will be no need to check for alignment. That doesn't seem to make a noticeable difference for me (on C2D). However, I realised that the multiplications are faster where the target matrix is a real matrix rather than a window (which has bad data locality). Copying everything over seems not like a good idea, but it at least indicates an area for improvement. Okay, if I only copy when we cross over to M4RM then the memory overhead is constant (~ cutoff^2) and the performance still improves.

Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

Matrix Dimension   Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000    2.920                    3.610
16,384 x 16,384    11.140                   12.120
20,000 x 20,000    20.370                   24.390
32,000 x 32,000    74.290                   94.910

New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

Matrix Dimension   Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000    2.920                    2.990
16,384 x 16,384    11.140                   11.750
20,000 x 20,000    20.370                   21.180
32,000 x 32,000    74.290                   86.570

On Opteron things don't look this way, but I think sage.math is pretty heavily used right now, such that my benchmarks there are not very telling. Martin
[sage-devel] Re: slightly OT: new M4RI library
That's looking good. Would you like me to run it on an unburdened Opteron to see how it goes? If you like, you can send me a tarball and I'll try it out. I think our best bet for a significant improvement now is the idea of using two Gray tables of half the size simultaneously. I also realised it possibly improves the cache performance for the A matrix too. I was casually wondering whether Magma might use a highly optimised Winograd algorithm instead of the naive algorithm. But over GF2 I think it probably actually takes longer, since it basically replaces n^2 full-length scalar multiplies by n^2 half-length ones and 2*n^2 half-row additions, plus a pile of other overhead. Bill. On 17 May, 20:32, Martin Albrecht [EMAIL PROTECTED] wrote: Okay, if I only copy when we cross over to M4RM then the memory overhead is constant (~ cutoff^2) and the performance still improves.
[sage-devel] Re: slightly OT: new M4RI library
Yet another idea. Suppose we do not combine entire rows in the Gray table, but only half rows. Once half a row is bigger than a single cache line (512 bits on the Opteron), we may as well work with half rows. This allows us to work with twice as many rows at once in the Gray tables (each of half the size). This means that we are dealing with twice as many bits from rows of A as usual and twice as many rows of B as usual, but we need to do it all again for the second half of the rows. So we get twice the work done in the same amount of cache space. Combined with the idea of using two Gray tables of 2^5 combinations of rows instead of a single table of 2^6 combinations, this would equate to dealing with 20 bits of each row of A at a time and 20 rows of B at a time. With this scheme there would then be 4 arithmetic operations in SSE registers, 5 loads and 1 store when combining rows from Gray tables, instead of about 6.6 loads, 3.3 stores and 3.3 arithmetic operations, changing the ratio of loads/stores to arithmetic ops from 2.7 to 1.5. This is another example where copying the data (the half rows) out and reordering it so it has better locality would probably make a big difference. That sort of thing always works exceptionally well on AMD chips. Bill. On 17 May, 21:05, Bill Hart [EMAIL PROTECTED] wrote: That's looking good. I think our best bet for a significant improvement now is the idea of using two Gray tables of half the size simultaneously.
[sage-devel] Re: slightly OT: new M4RI library
Martin, Here's a really unusual thing. Perhaps you can confirm this. I get a 20% improvement if I add: if (x) { } in the three obvious places in the function _mzd_mul_m4rm_impl. This stops it mzd_combining the zero row. But I don't understand why this works. The time should be only 1.5% better, since k = 6 and there are 2^k rows in the table, only one of which is zero. Could it be that your coinflip function is not quite random? Anyway, I'm down to 3.40s for 10000x10000 with this change. Test functions still pass. Bill. On 17 May, 22:05, Bill Hart [EMAIL PROTECTED] wrote: Yet another idea. Suppose we do not combine entire rows in the Gray table, but only half rows. Once half a row is bigger than a single cache line (512 bits on the Opteron), we may as well work with half rows.
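The guard Bill describes can be sketched like this: when the k bits read from A are all zero, they select row 0 of the Gray table, which is the all-zero row, so the combine can be skipped outright. The function name and table layout are ours, not the M4RI source.

```c
#include <stdint.h>
#include <stddef.h>

/* XOR table row x (stored at table + x*wide) into dst, skipping the
   combine entirely when x == 0: row 0 of the Gray table is the zero
   row, so XORing it in would be a no-op. Names are ours. */
static void add_table_row(uint64_t *dst, const uint64_t *table,
                          uint64_t x, size_t wide)
{
    if (x) { /* the guard: row 0 contributes nothing */
        const uint64_t *src = table + x * wide;
        for (size_t i = 0; i < wide; i++)
            dst[i] ^= src[i];
    }
}
```

With random input only 1 row in 2^k is skipped, which is why the observed 20% speedup is so surprising; padding zeros at the ends of rows would raise the hit rate of the guard.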
[sage-devel] Re: slightly OT: new M4RI library
I suppose that this might be due to the ends of rows all being zero, as they aren't a multiple of 64 bits long. But I checked for 16384x16384 and we are nearly down to the speed of Magma there too. I just don't get it. The coinflip has to be broken, I think. Bill. On 17 May, 22:40, Bill Hart [EMAIL PROTECTED] wrote: Martin, Here's a really unusual thing. Perhaps you can confirm this. I get a 20% improvement if I add: if (x) { } in the three obvious places in the function _mzd_mul_m4rm_impl.
[sage-devel] Re: slightly OT: new M4RI library
I checked the coinflip and it is definitely fine. There is no greater probability of 6 zeroes in a row than there ought to be. So the speedup I just reported is quite a mystery. Bill. On 17 May, 22:57, Bill Hart [EMAIL PROTECTED] wrote: I suppose that this might be due to the ends of rows all being zero, as they aren't a multiple of 64 bits long.
With this scheme, there would then be 4 arithmetic operations in SSE registers, 5 loads and 1 store when combining rows from Gray tables, instead of about 6.6 loads, 3.3 stores and 3.3 arithmetic operations, changing the ratio of loads/stores to arithmetic ops from 2.7 to 1.5. This is another example where copying the data (the half rows) out and reordering it so it has better locality would probably make a big difference. That sort of thing always works exceptionally well on AMD chips. Bill.

On 17 May, 21:05, Bill Hart [EMAIL PROTECTED] wrote: That's looking good. Would you like me to run it on an unburdened Opteron to see how it goes? If you like you can send me a tarball and I'll try it out. I think our best bet for a significant improvement now is the idea of using two Gray tables of half the size simultaneously. I also realised it possibly improves the cache performance for the A matrix too. I was casually wondering whether Magma might use a highly optimised Winograd's algorithm instead of the naive algorithm. But over GF2 I think it probably actually takes longer, since it basically replaces n^2 full length scalar multiplies by n^2 half length ones and 2*n^2 half row additions, plus a pile of other overhead. Bill.

On 17 May, 20:32, Martin Albrecht [EMAIL PROTECTED] wrote: On Saturday 17 May 2008, Martin Albrecht wrote: I think a better idea would be to explicitly force all matrices and all rows to be 128-bit aligned if the matrices are wide enough to benefit from SSE2. Then the combine function can always use SSE2 and there will be no need to check for alignment. That doesn't seem to make a noticeable difference for me (on C2D). However, I realised that the multiplications where the target matrix is a window rather than a real matrix have bad data locality. Copying everything over doesn't seem like a good idea, but it at least indicates an area for improvement.
Okay, if I only copy when we cross over to M4RM then the memory overhead is constant (~ cutoff^2) and the performance still improves.

Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

  Matrix Dimension    Magma 2.14-13 (64-bit)    M4RI-20080517 (64-bit)
  10,000 x 10,000             2.920                      3.610
  16,384 x 16,384            11.140                     12.120
  20,000 x 20,000            20.370                     24.390
  32,000 x 32,000            74.290                     94.910

New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

  Matrix Dimension    Magma 2.14-13 (64-bit)    M4RI-20080517 (64-bit)
  10,000 x 10,000             2.920                      2.990
  16,384 x 16,384            11.140                     11.750
  20,000 x 20,000            20.370                     21.180
  32,000 x 32,000            74.290                     86.570
[sage-devel] Re: slightly OT: new M4RI library
Woot!!

On 17 May, 23:46, Martin Albrecht [EMAIL PROTECTED] wrote:

Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

  Matrix Dimension    Magma 2.14-13 (64-bit)    M4RI-20080517 (64-bit)
  10,000 x 10,000             2.920                      3.610
  16,384 x 16,384            11.140                     12.120
  20,000 x 20,000            20.370                     24.390
  32,000 x 32,000            74.290                     94.910

New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

  Matrix Dimension    Magma 2.14-13 (64-bit)    M4RI-20080517 (64-bit)
  10,000 x 10,000             2.920                      2.990
  16,384 x 16,384            11.140                     11.750
  20,000 x 20,000            20.370                     21.180
  32,000 x 32,000            74.290                     86.570

If you take this + Bill's idea: For 10000x10000 we currently use k = 6. Instead of this, we could use k = 5 and make two Gray tables simultaneously. This will still fit in cache. Instead of doing 6 bits at a time, we can then do 10 bits at a time. We'd load the appropriate line from the first Gray table, then the appropriate one from the second and xor them, then xor with the output matrix. This should decrease the number of loads and stores considerably. Moreover, the SSE instructions will then be much more efficient as the ratio of arithmetic instructions to loads and stores is higher. Of course one could also do 16 bits at a time, by doing 4 tables, but I think this might actually get slower again, since you've only increased the amount of work done by 60% but had a 30% increase in instructions.
You get (on the C2D):

sage: B = random_matrix(GF(2), 3.2*10^4, 3.2*10^4)
sage: A = random_matrix(GF(2), 3.2*10^4, 3.2*10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 75.82 s, sys: 0.22 s, total: 76.04 s
Wall time: 76.31
sage: A = random_matrix(GF(2), 2*10^4, 2*10^4)
sage: B = random_matrix(GF(2), 2*10^4, 2*10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 19.14 s, sys: 0.09 s, total: 19.24 s
Wall time: 19.29
sage: B = random_matrix(GF(2), 2^14, 2^14)
sage: A = random_matrix(GF(2), 2^14, 2^14)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 10.62 s, sys: 0.05 s, total: 10.67 s
Wall time: 10.70
sage: B = random_matrix(GF(2), 10^4, 10^4)
sage: A = random_matrix(GF(2), 10^4, 10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 2.73 s, sys: 0.02 s, total: 2.75 s
Wall time: 2.76

i.e. the speed of my current Magma install on the same computer (mind you, this one might not be optimised for the C2D but for the Opteron, I don't know). The times above don't have SSE2 yet.

I guess documenting Bill's tricks well

 - process the rows of A in blocks
 - use two rather than one Gray code table

is in order, since M4RM now looks quite different from the original algorithm. I'll do that tomorrow. Again, thanks Bill!

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]

--~--~-~--~~~---~--~~
To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
-~--~~~~--~~--~--~---
[sage-devel] Re: slightly OT: new M4RI library
I suppose that this might be due to the ends of rows all being zero as they aren't a multiple of 64 bits long. But I checked for 16384x16384 and we are nearly down to the speed of Magma there too. I just don't get it. The coinflip has to be broken I think.

If one uses M4RI with the new patch from within Sage another PRBG is used, but coinflip should be fine. I don't see these speedups (but I have two Gray code tables, and this warrants more if's).

Hi, I think we might consider merging our two forks again? Or do you also have the two Gray code tables? Are your timings on the Opteron? Because then things look really good, since mine are on the C2D. Exciting times,

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]
[sage-devel] Re: slightly OT: new M4RI library
I don't have the two Gray code tables, so it would be good to get your version. Also my code is currently a mess, so it would be good to clean it up by merging with a cleaner version (yours). Tomorrow I'll check carefully what I've changed and try to merge the ideas, if there are any you don't have which definitely improve performance on the Opteron. The speedups I am seeing from the ifs are possibly a feature of the Opteron cache algorithms. It is very sensitive when things just begin to fall out of cache, as they certainly are here. Not combining with the zero row just nudges things closer in to the cache boundary, since it never has to read that row. I have checked and the speedups are quite reproducible, and they definitely come from the ifs, though I am now using a crossover with Strassen of 7200!! Bill.

On 18 May, 00:12, Martin Albrecht [EMAIL PROTECTED] wrote: I suppose that this might be due to the ends of rows all being zero as they aren't a multiple of 64 bits long. But I checked for 16384x16384 and we are nearly down to the speed of Magma there too. I just don't get it. The coinflip has to be broken I think. If one uses M4RI with the new patch from within Sage another PRBG is used, but coinflip should be fine. I don't see these speedups (but I have two Gray code tables, and this warrants more if's). Hi, I think we might consider merging our two forks again? Or do you also have the two Gray code tables? Are your timings on the Opteron? Because then things look really good, since mine are on the C2D. Exciting times, Martin
[sage-devel] Re: slightly OT: new M4RI library
P.S: yes, all my times are on a 2.8GHz Opteron. Cpuinfo says:

[EMAIL PROTECTED]:~/m4ri-20080514/testsuite$ cat /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 15
model       : 65
model name  : Dual-Core AMD Opteron(tm) Processor 2220
stepping    : 3
cpu MHz     : 1000.000
cache size  : 1024 KB
snip
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
snip

The 1000.000 there refers to the FSB. Bill.

On 18 May, 00:12, Martin Albrecht [EMAIL PROTECTED] wrote: [...]
[sage-devel] Re: slightly OT: new M4RI library
On Sunday 18 May 2008, Bill Hart wrote: I don't have the two Gray code tables, so it would be good to get your version. Also my code is currently a mess, so it would be good to clean it up by merging with a cleaner version (yours). Tomorrow I'll check carefully what I've changed and try to merge the ideas, if there are any you don't have which definitely improve performance on the Opteron. The speedups I am seeing from the ifs are possibly a feature of the Opteron cache algorithms. It is very sensitive when things just begin to fall out of cache, as they certainly are here. Not combining with the zero row just nudges things closer in to the cache boundary, since it never has to read that row. I have checked and the speedups are quite reproducible, and they definitely come from the ifs, though I am now using a crossover with Strassen of 7200!!

I'm using a crossover of 2048 here, so maybe our improvements are orthogonal? Even more puzzling, I'd expect that my crossover should be bigger than yours. (On a side note: my code changes how the crossover is used. Your version: 'size <= cutoff'; my version: '|cutoff - size| is minimal', which should give actual cutoffs closer to the desired values.)

My version is here: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg (this needs an updated patch for Sage) and here: http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz (which is the raw source). Those don't have SSE2 yet, but it doesn't seem to make that much of a difference anyway. I'll add that back before doing an official release. However, unfortunately I'll probably have limited/no time tomorrow to commit.
Martin

PS: To give at least some indication that my code still does the right thing, a 'known answer' test:

sage: A = random_matrix(GF(2), 10^3, 10^3)
sage: B = random_matrix(GF(2), 10^3, 10^3)
sage: (A*B)._magma_() == A._magma_() * B._magma_()
True
sage: (A._multiply_strassen(B,cutoff=256))._magma_() == A._magma_() * B._magma_()
True

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]
[sage-devel] Re: slightly OT: new M4RI library
Here are the times I get with the different cutoffs.

                 Magma     M4RI:7200   M4RI:2048
10000x10000:     2.940s     3.442s      4.132s
16384x16384:     9.250s    11.47s      11.80s
20000x20000:    16.57s     19.3s       26.05s
32000x32000:    59.05s     71.9s       71.8s

So it seems when there is not an exact cut, the higher cutoff is substantially better. Don't know why that is. Tomorrow I'll see if there is anything I have that speeds up your code. I'm hopeful we'll be within about 5% on the Opteron by then. The other ideas I outlined above should push us 10-15% ahead of Magma if we end up implementing them, I think. Of course one can go too crazy with optimisation. Bill.

On 18 May, 00:40, Martin Albrecht [EMAIL PROTECTED] wrote: On Sunday 18 May 2008, Bill Hart wrote: [...] I'm using a crossover of 2048 here, so maybe our improvements are orthogonal? Even more puzzling, I'd expect that my crossover should be bigger than yours. (On a side note: my code changes how the crossover is used. Your version: 'size <= cutoff'; my version: '|cutoff - size| is minimal', which should give actual cutoffs closer to the desired values.) My version is here: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg (this needs an updated patch for Sage) and here: http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz (which is the raw source). Those don't have SSE2 yet, but it doesn't seem to make that much of a difference anyway. I'll add that back before doing an official release. However, unfortunately I'll probably have limited/no time tomorrow to commit. Martin
[sage-devel] Re: slightly OT: new M4RI library
On May 17, 2008, at 8:38 PM, Bill Hart wrote: Of course one can go too crazy with optimisation. No surely that never happens around here. david
[sage-devel] Re: slightly OT: new M4RI library
On 18 May, 00:40, Martin Albrecht [EMAIL PROTECTED] wrote: My version is here: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg (this needs an updated patch for Sage) and here: http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz (which is the raw source). This pure C version seems to be the old version, before you made either of the two big changes. Bill.
[sage-devel] Re: slightly OT: new M4RI library
I managed to get the modified version from the spkg. Nice code!! Unfortunately it is not as fast on my Opteron. So more work tomorrow for me to try to get it down to the same times as I have with my version. Here are the times, all on my Opteron. Note your C2D version was optimal at a cutoff of 2048, not 7200 as for my code. Now I am worried that maybe my code is actually broken somehow and still passing the test code. I'll carefully make the changes to your code tomorrow to see if that is the case.

                 Magma     C2D-M4RI:2048   AMD-M4RI:7200   AMD-M4RI:2048
10000x10000:     2.940s     3.13s           3.442s          4.132s
16384x16384:     9.250s    12.96s          11.47s          11.80s
20000x20000:    16.57s     22.43s          19.3s           26.05s
32000x32000:    59.05s     90.20s          71.9s           71.8s

Bill.

On 18 May, 01:58, Bill Hart [EMAIL PROTECTED] wrote: On 18 May, 00:40, Martin Albrecht [EMAIL PROTECTED] wrote: My version is here: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg (this needs an updated patch for Sage) and here: http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz (which is the raw source). This pure C version seems to be the old version, before you made either of the two big changes. Bill.
[sage-devel] Re: slightly OT: new M4RI library
I made some changes to the original code so it would use the cache a bit better. The test code seems to pass, so I don't think I've screwed anything up. The time for 16384x16384 on my machine is now 20s, so a factor of 2.15x faster. The time for 2000x2000 also seems to be the same time as Magma now. Hopefully I didn't just mistime things before, and this is a real improvement. I am still trying things out. Bill.

On 16 May, 01:41, Bill Hart [EMAIL PROTECTED] wrote: I think it might just be possible to get down to the speed of Magma with a highly optimised classical multiplication routine. At 3600x3600 one clearly has to do 3600x3600 scalar products of a row by a column. We'll assume one of the matrices has been transposed to facilitate this. If we use SSE2 then we can operate 128 bits at a time. There are 16 SSE registers. The basic idea would be to load 4 SSE registers from different rows of matrix A and 2 from different columns of matrix B. We compute the scalar products of all 8 combinations of rows x columns simultaneously. For this we need 2 temporary registers and 8 registers to hold the running totals. So all in all we need 4+2+2+8 = 16 registers. We only need to do an AND and an XOR to do the multiplication and addition required. Caching becomes irrelevant if we choose a large selection of rows from A and a large selection of columns from B and do all the possible scalar products of rows by columns before moving to the next part. Assuming the Opteron can do the memory loads at the same time as doing arithmetic not depending on those loads, careful instruction scheduling should get everything down to the cost of an SSE AND and an SSE XOR per 128x128 bit pair in the classical algorithm. That puts the entire algorithm at pretty close to what Magma is doing timewise. Another option is to not use SSE and just use the 64-bit integer registers. The disadvantage is one has to load things 64 bits at a time.
But the advantage is the Opteron can do three arithmetic operations per cycle if properly scheduled. That gets 50% more work done than the SSE registers, which can only do 1 x 128-bit operation per cycle. Of course I'm making the assumption here that the Opteron can indeed do loads at the same time as arithmetic. If not, then there is no way Magma can be using classical multiplication out to 3600x3600, since there are simply too many operations to perform in the number of cycles available. Bill.

On 16 May, 00:03, Martin Albrecht [EMAIL PROTECTED] wrote: On Thursday 15 May 2008, Bill Hart wrote: Here is the graph of Magma times: http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png The crossover is not clear. The change from a smooth curve to a squiggly line is about 3600. So presumably that is it, but the graph also seems to change character at about 6200 or 7000 as well. One of these changes may be cache related. At 3600, the total data for all three matrices is almost 5MB and the cache on my machine is 1024KB. But if Magma is using classical multiplication, then this is pretty much irrelevant anyway, since you can keep the working data within the cache for quite a while during the algorithm.

On the other hand: a squiggly line is what one would expect for Strassen-Winograd, due to the extra rows/columns that have to be taken care of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel once gave a talk about his implementation and stated that they don't use classical block multiplication (I saw some slides with that remark).
Martin
[sage-devel] Re: slightly OT: new M4RI library
1x1 is down to 4.5s now, so nearly a 2x speedup. 2x2 is down to 32.0s, so again nearly a 2x speedup. Bill.

On 16 May, 13:53, Bill Hart [EMAIL PROTECTED] wrote: I made some changes to the original code so it would use the cache a bit better. The test code seems to pass, so I don't think I've screwed anything up. The time for 16384x16384 on my machine is now 20s, so a factor of 2.15x faster. [...]

On 16 May, 01:41, Bill Hart [EMAIL PROTECTED] wrote: [...]

On 16 May, 00:03, Martin Albrecht [EMAIL PROTECTED] wrote: [...]
[sage-devel] Re: slightly OT: new M4RI library
I tried cache blocking matrix B, but the times are exactly the same. I think the processor is happy to keep loading B one row at a time sequentially, and since it is only done a handful of times in the algorithm, it accounts for little of the runtime. It might go faster in combination with storing the Gray tables, but the optimal block size seems to be about 100 (*k) for the matrix B, which would necessitate storage of 100 Gray tables, which is an awful lot of memory. I think it is probably best to avoid this kind of memory usage. So I am not sure where the other factor of 2 will come from. Bill.

On 16 May, 14:37, Bill Hart [EMAIL PROTECTED] wrote: Here is the modified _mzd_mul_m4rm_impl function which gives this speedup: http://sage.math.washington.edu/home/wbhart/m4rmul.c I used a crossover of 3600, and I indicate at the top of this file how constants should be changed to get the speedups for the various values of n. I didn't put any code in to automatically choose the correct values. I am presuming that the test_multiplication code actually tests this function, otherwise maybe my code is just rubbish. The basic idea is to cache block the A matrix into groups of BLOCK rows. If one also cache blocked the B matrix and left the computed Gray tables in memory, instead of writing over them and recalculating them all the time as I currently do, one could probably get the other factor of 2 that we need. Note I've turned SSE off in this function, but left it on in _mzd_combine since it makes no real difference there. Bill.

On 16 May, 14:16, Bill Hart [EMAIL PROTECTED] wrote: [...]

On 16 May, 13:53, Bill Hart [EMAIL PROTECTED] wrote: [...]

On 16 May, 01:41, Bill Hart [EMAIL PROTECTED] wrote: [...]
[sage-devel] Re: slightly OT: new M4RI library
P.S: I also tried cache hints, but no luck. They just slow it down. Bill. On 16 May, 15:59, Bill Hart [EMAIL PROTECTED] wrote: I tried cache blocking matrix B, but the times are exactly the same. I think the processor is happy to keep loading B one row at a time sequentially, and since it is only done a handful of times in the algorithm, it accounts for little of the runtime. It might go faster in combination with storing the gray tables, but the optimal block size seems to be about 100 (*k) for the matrix B, which would necessitate storage of 100 gray tables, which is an awful lot of memory. Probably I think it is best to avoid this kind of memory usage. So I am not sure where the other factor of 2 will come from. Bill. On 16 May, 14:37, Bill Hart [EMAIL PROTECTED] wrote: Here is the modified _mzd_mul_m4rm_impl function which gives this speedup: http://sage.math.washington.edu/home/wbhart/m4rmul.c I used a crossover of 3600 and I indicate at the top of this file how constants should be changed to get the speedups for the various values of n. I didn't put any code in to automatically choose the correct values. I am presuming that the test_multiplication code actually tests this function, otherwise maybe my code is just rubbish. The basic idea is to cache block the A matrix into groups of BLOCK rows. If one also cache blocked the B matrix and left the computed gray tables in memory instead of writing over them and recalculating them all the time, as I currently do, one could probably get the other factor of 2 that we need. Note I've turned SSE off in this function, but left it on in _mzd_combine since it makes no real difference there. Bill. On 16 May, 14:16, Bill Hart [EMAIL PROTECTED] wrote: 1x1 is down to 4.5s now, so nearly a 2x speedup. 2x2 is down to 32.0s, so again nearly a 2x speedup. Bill. On 16 May, 13:53, Bill Hart [EMAIL PROTECTED] wrote: I made some changes to the original code so it would use the cache a bit better. 
The test code seems to pass, so I don't think I've screwed anything up. The time for 16384x16384 on my machine is now 20s, so a factor of 2.15x faster. The time for 2000x2000 also seems to be the same time as Magma now. Hopefully I didn't just mistime things before, and this is a real improvement. I am still trying things out. Bill. On 16 May, 01:41, Bill Hart [EMAIL PROTECTED] wrote: I think it might just be possible to get down to the speed of Magma with a highly optimised classical multiplication routine. At 3600x3600 one clearly has to do 3600x3600 scalar products of a row by a column. We'll assume one of the matrices has been transposed to facilitate this. If we use SSE2 then we can operate 128 bits at a time. There are 16 SSE registers. The basic idea would be to load 4 SSE registers from different rows of matrix A and 2 from different columns of matrix B. We compute the scalar products of all 8 combinations of rows x columns simultaneously. For this we need 2 temporary registers and 8 registers to hold the running totals. So all in all we need 4+2+2+8 = 16 registers. We only need to do an AND and an XOR to do the multiplication and addition required. Caching becomes irrelevant if we choose a large selection of rows from A and a large selection of columns from B and do all the possible scalar products of rows by columns before moving to the next part. Assuming the Opteron can do the memory loads at the same time as doing arithmetic not depending on those loads, careful instruction scheduling should get everything down to the cost of an SSE AND and an SSE XOR per 128x128 bit pair in the classical algorithm. That puts the entire algorithm at pretty close to what Magma is doing timewise. Another option is to not use SSE and just use the 64 bit integer registers. The disadvantage is one has to load things 64 bits at a time. But the advantage is the Opteron can do three arithmetic operations per cycle if properly scheduled. 
That gets 50% more work done than the SSE registers, which can only do 1 x 128 bit operation per cycle. Of course I'm making the assumption here that the Opteron can indeed do loads at the same time as arithmetic. If not, then there is no way Magma can be using classical multiplication out to 3600x3600 since there are simply too many operations to perform in the number of cycles available. Bill. On 16 May, 00:03, Martin Albrecht [EMAIL PROTECTED] wrote: On Thursday 15 May 2008, Bill Hart wrote: Here is the graph of Magma times: http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png The crossover is not clear. The change from a smooth curve to a squiggly line is about 3600. So presumably
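Bill's classical-multiplication sketch boils down to one primitive: the GF(2) scalar product of a bit-packed row of A with a bit-packed column of the transposed B. A minimal, hedged illustration of that primitive with plain 64-bit words rather than SSE registers (the function name is illustrative, not part of the M4RI API):

```c
#include <stdint.h>

/* GF(2) scalar product of two bit-packed vectors: AND each word pair,
   XOR-accumulate the results (XOR preserves the parity of the total
   popcount), then take the parity of the accumulated word. */
static int gf2_dot(const uint64_t *row, const uint64_t *col, int nwords) {
    uint64_t acc = 0;
    for (int i = 0; i < nwords; i++)
        acc ^= row[i] & col[i];            /* multiply-and-add over GF(2) */
    return __builtin_popcountll(acc) & 1;  /* parity of the 1-bits */
}
```

The SSE2 scheme described above keeps eight such accumulators live at once across 128-bit registers; the arithmetic per word pair is still just one AND and one XOR.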
[sage-devel] Re: slightly OT: new M4RI library
Here are the times I get for Magma vs M4RI now. Note the crossover between the two programs is now above about 5000 and M4RI beats Magma below that point. This suggests the remaining factor of 2 is in the Strassen-Winograd function. Probably Strassen-Winograd is falling out of L2 cache (the previous adjustments I made were to prevent the M4R algorithm falling out of L1 cache). The other possibility is that Magma combines the two algorithms so that there is even greater usage of the Gray code tables. This would be an ugly hack, but could work.

4x4: Magma: 112.6s M4RI: 232.4s
2x2: Magma: 16.40s M4RI: 32.34s
1x1: Magma: 2.750s M4RI: 4.529s
5000x5000: Magma: 0.700s M4RI: 0.672s
2500x2500: Magma: 0.13s M4RI: 0.079s
1250x1250: Magma: 0.015s M4RI: 0.012s
625x625: Magma: 0.0030s M4RI: 0.0023s
312x312: Magma: 0.0014s M4RI: 0.00032s
156x156: Magma: 0.001s M4RI: 0.0001s

If I get some time I'll look into this. Did those changes work for you Martin? Bill. On 16 May, 16:24, Martin Albrecht [EMAIL PROTECTED] wrote: On Friday 16 May 2008, Bill Hart wrote: P.S: I also tried cache hints, but no luck. They just slow it down. Bill. Bill, thanks so much for looking into this! It is very much appreciated. I'm going to read/try your patch right away! Martin PS: I have a slight delay in my replies because my provider's spam filter consistently flags you and mabshoff as spam for some reason. -- name: Martin Albrecht _pgp:http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www:http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED] --~--~-~--~~~---~--~~ To post to this group, send email to sage-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/sage-devel URLs: http://www.sagemath.org -~--~~~~--~~--~--~---
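The "Gray code tables" the thread keeps returning to are the heart of the M4RM multiplication trick: precompute all 2^k linear combinations of k rows of B, then each row of A costs n/k table lookups instead of n row-XORs. A toy, hedged sketch of the idea for a single-word 64x64 case (this illustrates the algorithm only; `m4rm_mul` and `K` are made-up names, not M4RI's `_mzd_mul_m4rm_impl`):

```c
#include <stdint.h>
#include <string.h>

#define N 64   /* toy dimension: one 64-bit word per row */
#define K 4    /* bits of A consumed per table lookup    */

/* C = A*B over GF(2), each row packed into one 64-bit word.
   For every K-bit chunk of A's rows, build the table of all 2^K
   XOR-combinations of the corresponding K rows of B; each entry
   costs one XOR (the real code walks a Gray code to the same end). */
static void m4rm_mul(uint64_t c[N], const uint64_t a[N], const uint64_t b[N]) {
    uint64_t table[1u << K];
    memset(c, 0, N * sizeof(uint64_t));
    for (int chunk = 0; chunk < N; chunk += K) {
        table[0] = 0;
        for (unsigned i = 1; i < (1u << K); i++) {
            unsigned low = i & (i - 1);   /* i with its lowest set bit cleared */
            unsigned bit = i ^ low;       /* that lowest set bit               */
            int j = 0;
            while (!((bit >> j) & 1)) j++;
            table[i] = table[low] ^ b[chunk + j];   /* one XOR per entry */
        }
        for (int r = 0; r < N; r++) {
            unsigned idx = (unsigned)(a[r] >> chunk) & ((1u << K) - 1);
            c[r] ^= table[idx];           /* K columns of A handled at once */
        }
    }
}
```

With k = 4 as here, each row of A costs 16 lookups instead of 64 row-XORs; the cache-blocking discussed above is about keeping these tables resident in L1.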
[sage-devel] Re: slightly OT: new M4RI library
On Friday 16 May 2008, Bill Hart wrote: Here are the times I get for Magma vs M4RI now. Note the crossover between the two programs is now above about 5000 and M4RI beats Magma below that point. This suggests the remaining factor of 2 is in the Strassen-Winograd function. Probably Winograd-Strassen is falling out of L2 cache (the previous adjustments I made were to prevent the M4R algorithm falling out of L1 cache). The other possibility is that Magma combines the two algorithms so that there is even greater usage of the Gray code tables. This would be an ugly hack, but could work. Are you suggesting Magma uses M4RM too? I'd doubt that, since they don't state that anyway. Probably I'm just misunderstanding you. 4x4: Magma: 112.6s M4RI: 232.4s 2x2: Magma: 16.40s M4RI: 32.34s 1x1: Magma: 2.750s M4RI: 4.529s 5000x5000: Magma: 0.700s M4RI: 0.672s 2500x2500: Magma: 0.13s M4RI: 0.079s 1250x1250: Magma: 0.015s M4RI: 0.012s 625x625: Magma: 0.0030s M4RI: 0.0023s 312x312: Magma: 0.0014s M4RI: 0.00032s 156x156: Magma: 0.001s M4RI: 0.0001s If I get some time I'll look into this. Did those changes work for you Martin? Bill. Yes, the change worked like a charm. I made some changes (the fixed k is replaced by a k that depends on the new block dimensions etc.) and it is much much faster now. I'm working on re-introducing SSE2 now to see if it at least on the Core2Duo makes the world a better place. Btw. all the stuff I wrote about the L2 cache size C2D vs. Opteron was bollocks. The reason I beat Magma so badly earlier was that I used a 32-bit Magma and compared it with a 64-bit version of M4RI. To state it clearly: We don't beat Magma anywhere. 
Anyhow, here are the times:

64-bit Debian/GNU Linux, 2.33 GHz Core2Duo

Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000      2.920                    4.130
16,384 x 16,384      11.140                   15.740
20,000 x 20,000      20.370                   28.950

64-bit Debian/GNU Linux, 1.8 GHz Opteron (sage.math)

Matrix Dimension     Magma 2.13-5 (64-bit)    M4RI-20080517 (64-bit)
10,000 x 10,000      3.930                    7.860
16,384 x 16,384      16.230                   104.77???
20,000 x 20,000      27.080                   56.420

(My university today finally granted me access to Magma 2.14 for my notebook.) As you can see there is an odd (reproducible) spike at 2^14 x 2^14. I cannot explain that for now, it might just be a bug. I have a similar spike in another run on another AMD (Athlon X2) machine:

n:  2048, cutoff: 1024, speedup: 1.06, m4rm:  0.01 strassen: 0.01
n:  3072, cutoff: 1536, speedup: 0.87, m4rm:  0.04 strassen: 0.05
n:  4096, cutoff: 2048, speedup: 1.13, m4rm:  0.15 strassen: 0.13
n:  5120, cutoff: 2560, speedup: 1.35, m4rm:  0.39 strassen: 0.29
n:  6144, cutoff: 3072, speedup: 1.20, m4rm:  0.66 strassen: 0.55
n:  7168, cutoff: 3584, speedup: 1.87, m4rm:  1.64 strassen: 0.88
n:  8192, cutoff: 4096, speedup: 4.48, m4rm:  6.07 strassen: 1.35
n:  9216, cutoff: 4608, speedup: 2.94, m4rm:  8.90 strassen: 3.02
n: 10240, cutoff: 5120, speedup: 4.68, m4rm: 12.99 strassen: 2.78
n: 11264, cutoff: 5632, speedup: 4.84, m4rm: 18.78 strassen: 3.88
n: 12288, cutoff: 6144, speedup: 4.65, m4rm: 24.40 strassen: 5.24
n: 13312, cutoff: 6656, speedup: 3.32, m4rm: 30.92 strassen: 9.33

I'm investigating this one too and playing around with the parameters. Since the C2D times are considerably better I also bet it is L2 related, we'll see. Thanks again! 
Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
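Martin's change of replacing the fixed k by one that depends on the block dimensions can be sketched roughly as follows. This is only a guess at the shape of such a heuristic, not the actual M4RI code; the 64 KB L1 size and 8-byte table entry are assumptions:

```c
/* Integer log2: position of the highest set bit. */
static int ilog2(long n) {
    int k = 0;
    while (n >>= 1) k++;
    return k;
}

/* Heuristic table width k for M4RM: the n^3/log n bound is attained
   around k = log2(n), but the 2^k-entry Gray table must stay inside
   L1 cache, so shrink k until it fits (8-byte entries assumed). */
static int choose_k(long n, long l1_bytes) {
    int k = ilog2(n);
    while (k > 1 && (8L << k) > l1_bytes) k--;
    return k > 0 ? k : 1;
}
```

For n = 16384 and a 64 KB L1 this caps k at 13 rather than the "ideal" 14, which is the kind of dimension-dependent choice the post describes.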
[sage-devel] Re: slightly OT: new M4RI library
Hi Martin, That spike is weird. Basically I got closer to 2x the time of Magma for 16384x16384, but you need different parameters than for 1x1 or 2x2 since the size of the M4R matrices will be different than in either of those cases. For 16384x16384 my notes say that I used k = 6 and BLOCK = 256. One might also need to fiddle with the cutoff. I think I used 3600, but at various times I had the cutoff set lower. It would be awfully surprising if there wasn't a set of parameters that dropped this time right down. It is possible that the L1 cache is smaller on sage.math than on the Opteron I was using. Perhaps my parameters don't apply on that machine. Bill. On 16 May, 23:05, Martin Albrecht [EMAIL PROTECTED] wrote: [...] As you can see there is an odd (reproducible) spike at 2^14 x 2^14. I cannot explain that for now, it might just be a bug. [...]
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, William Stein wrote: Btw. I don't have access to Magma 2.14 which I believe to be the fastest in linear algebra over GF(2). In version 2.14 they added SSE2 support too. So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear about the results. I don't know what else to compare with except Magma. Here are Magma V2.14-9 timings on sage.math:

[EMAIL PROTECTED]:~/magma64/bin$ ./magma
Magma V2.14-9 Wed May 14 2008 17:33:23 on sage [Seed = 793568142]
Type ? for help. Type Ctrl-D to quit.
A := RandomMatrix(GF(2),10^4,10^4);
B := RandomMatrix(GF(2),10^4,10^4);
time C := A*B;
Time: 4.020
time C := A*B;
Time: 3.980
time C := A*B;
Time: 3.990
A := RandomMatrix(GF(2),2^14,2^14);
B := RandomMatrix(GF(2),2^14,2^14);
time C := A*B;
Time: 15.450
time C := A*B;
Time: 15.420

This doesn't seem much different from 2.13, which is good for my purposes. Once I've fixed the OSX issue mentioned below, could you run the benchmark on the C2D? I want to see if they can make better use of the 4MB L2 cache. I tried your new code up at #3204 under OS X and get this:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x7465a00: non-page-aligned, non-allocated pointer being freed
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
...
CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
Wall time: 16.31

Maybe you're doing something wrong? Likely, though my Valgrind log is clean under Linux. 
I'll try under OSX. Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
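The OSX "non-aligned pointer being freed" errors above are the classic symptom of releasing alignment-adjusted memory through the wrong call. Purely as a hedged illustration of the usual fix, not a claim about what the actual M4RI bug was (the function name is made up):

```c
#include <stdlib.h>

/* SSE2 loads/stores want 16-byte-aligned rows.  posix_memalign
   guarantees the alignment and, unlike hand-offset malloc tricks,
   its result may be released with an ordinary free(). */
static void *alloc_rows_aligned(size_t bytes) {
    void *p = NULL;
    return posix_memalign(&p, 16, bytes) == 0 ? p : NULL;
}
```

Darwin's allocator is stricter than glibc's about freeing adjusted pointers, which would explain a bug that only surfaces on OS X while the Linux Valgrind log stays clean.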
[sage-devel] Re: slightly OT: new M4RI library
Maybe you're doing something wrong? This bug is fixed in: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080515.p0.spkg Cheers, Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote: Hi Martin, The cycle counter in your bench code seems to give random values on the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has Magma 2-14.10 on it. Anyhow, here are the Magma times:

A := RandomMatrix(GF(2),10^4,10^4);
B := RandomMatrix(GF(2),10^4,10^4);
t := Cputime(); C := A*B; Cputime(t);
2.940

A := RandomMatrix(GF(2),2*10^4,2*10^4);
B := RandomMatrix(GF(2),2*10^4,2*10^4);
t := Cputime(); C := A*B; Cputime(t);
16.570

A := RandomMatrix(GF(2),2^14,2^14);
B := RandomMatrix(GF(2),2^14,2^14);
t := Cputime(); C := A*B; Cputime(t);
9.250

The corresponding times for your code are approximately (hand timed; adjusted for generation of random matrices):

10^4: 8s
2*10^4: 59s
2^14: 43s

I think it is polynomials over GF2 which Magma 2-14.x is much faster at rather than linear algebra. Bill. Hi Bill, I cannot explain the random behaviour of the cpucycles counter. The cpucycles library is by Dan Bernstein so I'd suspect it is carefully coded. Some potential sources of variance are: dynticks (i.e. the kernel doesn't have a fixed tick rate, this is done to save power), CPU frequency scaling and other applications. On the other hand, I suspect I don't need to tell you this, since you have probably benchmarked more code than I have. To account for the variance I have the small run_bench.py script in the testsuite directory. Run it like this: ./run_bench.py -t mul -n 1 -c 3200 It'll run the experiment 17 times (17 for no particular reason) and return the median runtime. -n is the size and -c the cutoff. 
Cheers, Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
[sage-devel] Re: slightly OT: new M4RI library
On Thu, May 15, 2008 at 10:42 AM, Martin Albrecht [EMAIL PROTECTED] wrote: [...] To account for the variance I have the small run_bench.py script in the testsuite directory. Run it like this: ./run_bench.py -t mul -n 1 -c 3200 It'll run the experiment 17 times (17 for no particular reason) and return the median runtime. -n is the size and -c the cutoff. Median? Shouldn't you take the minimum? Are there any good papers on benchmarking? By the way -- present for you: http://m4ri.sagemath.org/ To update that page you have to ssh to sagemath.org. 
William
[sage-devel] Re: slightly OT: new M4RI library
It'll run the experiment 17 times (17 for no particular reason) and return the median runtime. -n is the size and -c the cutoff. Median? Shouldn't you take the minimum? Are there any good papers on benchmarking? I'm using cpucycles from: http://www.ecrypt.eu.org/ebats/cpucycles.html which advises: Notes on accuracy Benchmarking tools are encouraged to record several timings of a function: call cpucycles(), function(), cpucycles(), function(), etc., and then print one line reporting the differences between successive cpucycles() results. The median of several differences is much more stable than the average. Cycle counts continue to increase while other programs are running, while the operating system is handling an interruption such as a network packet, etc. This won't affect the median of several timings of a fast function---the function usually won't be interrupted---but it can affect the median of several timings of a slow function. Hopefully a benchmarking machine isn't running other programs. On dual-CPU systems (and dual-core systems such as the Athlon 64 X2), the CPUs often don't have synchronized cycle counters, so a process that switches CPUs can have its cycle counts jump forwards or backwards. I've never seen this affect the median of several timings. Some CPUs dynamically reduce CPU speed to save power, but deliberately keep their cycle counters running at full speed, the idea being that measuring time is more important than measuring cycles. Hopefully a benchmarking machine won't enter power-saving mode. Cycle counts are occasionally off by a multiple of 2^32 on some CPUs, as discussed below. I've never seen this affect the median of several timings. The estimate returned by cpucycles_persecond() may improve accuracy after cpucycles() has been called repeatedly. I just followed that. I guess if one takes the minimum or not depends on the data. If there are (much) easier cases then it is probably not best to take the minimum. 
However, I guess for matrix multiplication the times don't really depend on the entries of the matrices so I could use the minimum or just output all (minimum, average, median). By the way -- present for you: http://m4ri.sagemath.org/ To update that page you have to ssh to sagemath.org. Cool, thanks! Any chance that I could get write access to the home directory so that I can upload my SSH public key? Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
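The median-of-several-runs policy quoted from the cpucycles notes is easy to state in code. A small, hedged harness sketch (function names are illustrative; clock() stands in for the cpucycles() counter, with run_bench.py's 17 repeats as the intended argument):

```c
#include <stdlib.h>
#include <time.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples: sort and take the middle one.  Robust against
   a single run being interrupted by the OS, which would drag a mean
   (or poison a lone measurement) upward. */
static double median_of(double *t, int n) {
    qsort(t, n, sizeof t[0], cmp_double);
    return t[n / 2];
}

/* Time fn() an odd number of times and report the median delta. */
static double median_seconds(void (*fn)(void), int repeats, double *scratch) {
    for (int i = 0; i < repeats; i++) {
        clock_t start = clock();
        fn();
        scratch[i] = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return median_of(scratch, repeats);
}
```

Taking the minimum instead, as William suggests, amounts to assuming all noise is additive; for input-independent workloads like this multiplication, minimum and median should largely agree.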
[sage-devel] Re: slightly OT: new M4RI library
Hi Martin, Here is a run that illustrates the problem. Am I doing something wrong?

[EMAIL PROTECTED]:~/m4ri-20080514/testsuite time ./bench_multiplication 5000 2048
n:  5000, cutoff:  2048, cpu cycles: 2670143764

real    0m2.768s
user    0m2.760s
sys     0m0.008s

[EMAIL PROTECTED]:~/m4ri-20080514/testsuite time ./bench_multiplication 5000 3000
n:  5000, cutoff:  3000, cpu cycles: 746344039

real    0m3.277s
user    0m3.252s
sys     0m0.024s

Bill.
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote: Hi Martin, Here is a run that illustrates the problem. Am I doing something wrong? No, I was stupid. The cpucycles are printed as %u but they should be printed as %llu since they are longer than an int. I've attached the fixed C file (since it is 1KB). -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]

#include <stdio.h>
#include <stdlib.h>
#include "cpucycles.h"
#include "m4ri.h"

int main(int argc, char **argv) {
  int n, cutoff;
  unsigned long long t;

  m4ri_build_all_codes();

  if (argc != 3) {
    m4ri_die("Parameters n and cutoff expected.\n");
  }
  n = atoi(argv[1]);
  cutoff = atoi(argv[2]);
  if (n <= 0) {
    m4ri_die("Parameter n must be > 0\n");
  }
  if (cutoff <= 0) {
    m4ri_die("Parameter cutoff must be > 0\n");
  }

  packedmatrix *A = mzd_init(n, n);
  packedmatrix *B = mzd_init(n, n);
  mzd_randomize(A);
  mzd_randomize(B);

  t = cpucycles();
  packedmatrix *C = mzd_mul_strassen(NULL, A, B, cutoff);
  printf("n: %5d, cutoff: %5d, cpu cycles: %llu\n", n, cutoff, cpucycles() - t);

  mzd_free(A);
  mzd_free(B);
  mzd_free(C);
  m4ri_destroy_all_codes();
}
[sage-devel] Re: slightly OT: new M4RI library
Martin, Do you think Magma uses naive multiplication for its base case? This can be ridiculously fast, especially over GF2. I note for example that Magma's base case is about 6 times faster than M4RI at 1000x1000. Is it possible that the naive multiplication can just be optimised with a far better constant and that the crossover with m4r is above the crossover with Strassen-Winograd? I also note that the M4RI SSE2 code doesn't seem faster than the generic C code on the Opteron, in fact it seems the other way around. Very strange. Bill.
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote: Do you think Magma uses naive multiplication for its base case? [...] I also note that the M4RI SSE2 code doesn't seem faster than the generic C code on the Opteron, in fact it seems the other way around. Very strange. I observed the same behaviour but have no explanation besides stuff like this: http://www.theinquirer.net/en/inquirer/news/2003/05/02/sse2-makes-opterons-slower-than-athlon-xps How do you know what the Magma base case is, can you tell Magma somehow to print its crossover? I'll look into plugging in naive multiplication as the base case (after optimisation) to see if your idea is right, though I doubt that the crossover to M4RM is beyond the crossover to Strassen-Winograd. Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
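For reference, the "naive multiplication" being discussed as a base-case candidate can still be word-packed, so that each set bit in a row of A costs one whole-row XOR of B rather than a bit operation. A hedged sketch with illustrative names (not M4RI's actual base case):

```c
#include <stdint.h>
#include <string.h>

/* Naive GF(2) multiply, rows bit-packed into 64-bit words:
   for each set bit j in row r of A, XOR row j of B into row r of C.
   No tables, no recursion: a plausible Strassen base case. */
static void naive_mul(uint64_t *c, const uint64_t *a, const uint64_t *b,
                      int n, int words_per_row) {
    memset(c, 0, (size_t)n * words_per_row * sizeof(uint64_t));
    for (int r = 0; r < n; r++)
        for (int j = 0; j < n; j++)
            if ((a[r * words_per_row + j / 64] >> (j % 64)) & 1)
                for (int w = 0; w < words_per_row; w++)
                    c[r * words_per_row + w] ^= b[j * words_per_row + w];
}
```

Compared with M4RM this does n row-XORs per row instead of n/k table lookups, but its inner loop is branch-light and streams B sequentially, which is why a well-tuned constant could plausibly win at small sizes.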
[sage-devel] Re: slightly OT: new M4RI library
Oh, actually I have no idea where Magma's crossover is. I'll wager it is somewhere between 4000x4000 and 6000x6000, but let's not speculate. I'll try and work it out with some timings. Bill.
[sage-devel] Re: slightly OT: new M4RI library
Here is the graph of Magma times: http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png The crossover is not clear. The change from a smooth curve to a squiggly line is about 3600. So presumably that is it, but the graph also seems to change character at about 6200 or 7000 as well. One of these changes may be cache related. At 3600, the total data for all three matrices is almost 5mb and the cache on my machine is 1024kb. But if Magma is using classical multiplication, then this is pretty much irrelevant anyway, since you can keep the working data within the cache for quite a while during the algorithm. Bill. On 15 May, 22:41, Bill Hart [EMAIL PROTECTED] wrote: Oh, actually I have no idea where Magma's crossover is. I'll wager it is somewhere between 4000x4000 and 6000x6000, but let's not speculate. I'll try and work it out with some timings. Bill. On 15 May, 22:23, Martin Albrecht [EMAIL PROTECTED] wrote: On Thursday 15 May 2008, Bill Hart wrote: Martin, Do you think Magma uses naive multiplication for its base case? This can be ridicoulously fast, especially over GF2. I note for example that Magma's base case is about 6 times faster than M4RI at 1000x1000. Is it possible that the naive multiplication can just be optimised with a far better constant and that the crossover with m4r is above the crossover with Strassen-Winograd? I also note that the M4RI SSE2 code doesn't seem faster than the generic C code on the Opteron, in fact it seems the other way around. Very strange. I observed the same behaviour but have no explanation besides stuff like this: http://www.theinquirer.net/en/inquirer/news/2003/05/02/sse2-makes-opt... How do you know what the Magma base case is, can you tell Magma somehow to print its crossover? I'll look into plugging in naive multiplication as the base case (after optimisation) to see if your idea is right, though I doubt that the crossover to M4RM is beyond the crossover to Strassen-Winograd. 
Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]

To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote: [quoted message snipped]

On the other hand: a squiggly line is what one would expect for Strassen-Winograd, due to the extra rows/columns that have to be taken care of. In any case, 3600 seems rather late for a 1 MB L2 cache. I think Allan Steel once gave a talk about his implementation and stated that they don't use classical block multiplication (I saw some slides with that remark).

Martin
[sage-devel] Re: slightly OT: new M4RI library
I think it might just be possible to get down to the speed of Magma with a highly optimised classical multiplication routine. At 3600x3600 one clearly has to compute 3600x3600 scalar products of a row by a column. We'll assume one of the matrices has been transposed to facilitate this.

If we use SSE2 then we can operate on 128 bits at a time, and there are 16 SSE registers. The basic idea would be to load 4 SSE registers from different rows of matrix A and 2 from different columns of matrix B, and compute the scalar products of all 8 row-by-column combinations simultaneously. For this we need 2 temporary registers and 8 registers to hold the running totals, so all in all we need 4 + 2 + 2 + 8 = 16 registers. We only need an AND and an XOR to do the multiplication and addition required. Caching becomes irrelevant if we choose a large selection of rows from A and a large selection of columns from B and do all the possible scalar products of rows by columns before moving to the next part.

Assuming the Opteron can do the memory loads at the same time as arithmetic not depending on those loads, careful instruction scheduling should get everything down to the cost of an SSE AND and an SSE XOR per 128x128-bit pair in the classical algorithm. That puts the entire algorithm pretty close to what Magma is doing timewise.

Another option is not to use SSE and just use the 64-bit integer registers. The disadvantage is that one has to load things 64 bits at a time, but the advantage is that the Opteron can do three arithmetic operations per cycle if properly scheduled. That gets 50% more work done than the SSE registers, which can only do one 128-bit operation per cycle.

Of course I'm making the assumption here that the Opteron can indeed do loads at the same time as arithmetic. If not, then there is no way Magma can be using classical multiplication out to 3600x3600, since there are simply too many operations to perform in the number of cycles available.

Bill.
On 16 May, 00:03, Martin Albrecht [EMAIL PROTECTED] wrote: [quoted message snipped]
[sage-devel] Re: slightly OT: new M4RI library
Btw. I don't have access to Magma 2.14, which I believe to be the fastest in linear algebra over GF(2). In version 2.14 they added SSE2 support too. So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear about the results. I don't know what else to compare with except Magma.

Here are Magma V2.14-9 timings on sage.math:

[EMAIL PROTECTED]:~/magma64/bin$ ./magma
Magma V2.14-9    Wed May 14 2008 17:33:23 on sage    [Seed = 793568142]
Type ? for help.  Type Ctrl-D to quit.
A := RandomMatrix(GF(2),10^4,10^4);
B := RandomMatrix(GF(2),10^4,10^4);
time C := A*B;
Time: 4.020
time C := A*B;
Time: 3.980
time C := A*B;
Time: 3.990
A := RandomMatrix(GF(2),2^14,2^14);
B := RandomMatrix(GF(2),2^14,2^14);
time C := A*B;
Time: 15.450
time C := A*B;
Time: 15.420

Any suggestions? Thoughts? Rants?

Allan Steel is a really amazing programmer.

I tried your new code up at #3204 under OS X and get this:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x7465a00: non-page-aligned, non-allocated pointer being freed
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
...
CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
Wall time: 16.31

Maybe you're doing something wrong?
-- William