Oh gods of the cpu cycle ... or hi there,

(this e-mail contains details on a particular implementation for 
GF(2) linear algebra, feel free to ignore it if that doesn't get you going)

I've just submitted a new (much improved) version of the M4RI library for 
inclusion in Sage:

  http://trac.sagemath.org/sage_trac/ticket/3204

M4RI is a library by Gregory Bard and myself for fast arithmetic with dense 
matrices over GF(2). M4RI stands for the "Method of the Four Russians" 
inversion which is an algorithm for matrix reduction first published by Greg. 
The website is here:

  http://sage.math.washington.edu/home/malb/m4ri/

This new version has among other things two new features:
 - SSE2 support where available (XOR two 128-bit words in one instruction)
 - Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four 
Russians" multiplication or Kronrod's algorithm)

I'm writing to [sage-devel] because I observed some odd performance 
characteristics and hope to pick your brains on how to improve performance:

At least in my experience, enabling SSE2 results in slower code on AMD CPUs. I 
think this is rumoured across the net, but has anyone else witnessed this 
behaviour? I'm only using 128-bit integer XOR intrinsics. On my Core2Duo 
notebook SSE2 speeds up the computation at least up to the size of the L2 
cache.
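
For context, the SSE2 use is along these lines — a hypothetical sketch, not the actual M4RI code: XOR one row into another, combining two 64-bit words per 128-bit instruction where SSE2 is available, with a scalar fallback.

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* XOR row src into row dst, n machine words long.  With SSE2, each
   _mm_xor_si128 combines two 64-bit words at once; the scalar loop
   handles the odd tail word and non-SSE2 builds. */
static void row_xor(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
#ifdef __SSE2__
    for (; i + 2 <= n; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
#endif
    for (; i < n; i++)
        dst[i] ^= src[i];
}
```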

One would expect that the M4RM + Strassen implementation would have an easy 
time beating Magma since it uses a better algorithm. Like Magma, it does 
Strassen-Winograd down to the cutoff, but then it switches to the M4RM 
algorithm (O(n^3/log_2(n))) rather than naive multiplication (O(n^3)). This 
should give a nice constant speed-up. So far this theory seems to match 
reality on a Core2Duo notebook with 4MB of L2 cache:
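
The hybrid can be sketched like this — a toy version using classical Strassen for brevity (M4RI uses the Winograd variant with the Clement et al. schedule), on byte-per-entry matrices with n a power of two. Over GF(2) addition and subtraction are both XOR, so all the sign bookkeeping in the combination step disappears.

```c
#include <stdlib.h>

/* C = A ^ B elementwise (addition over GF(2)); n x n matrices of 0/1
   bytes with row strides la, lb, lc. */
static void madd(int n, const unsigned char *A, int la,
                 const unsigned char *B, int lb, unsigned char *C, int lc)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i*lc + j] = A[i*la + j] ^ B[i*lb + j];
}

/* base case: naive cubic multiply over GF(2) (M4RI uses M4RM here) */
static void mul_base(int n, const unsigned char *A, int la,
                     const unsigned char *B, int lb, unsigned char *C, int lc)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            unsigned char s = 0;
            for (int k = 0; k < n; k++)
                s ^= A[i*la + k] & B[k*lb + j];
            C[i*lc + j] = s;
        }
}

/* Strassen for n a power of two: recurse until n <= cutoff, then fall
   back to the base-case multiply.  Over GF(2), +/- are both XOR. */
static void strassen(int n, const unsigned char *A, int la,
                     const unsigned char *B, int lb,
                     unsigned char *C, int lc, int cutoff)
{
    if (n <= cutoff) { mul_base(n, A, la, B, lb, C, lc); return; }
    int h = n / 2;
    const unsigned char *A11 = A,          *A12 = A + h,
                        *A21 = A + h*la,   *A22 = A + h*la + h;
    const unsigned char *B11 = B,          *B12 = B + h,
                        *B21 = B + h*lb,   *B22 = B + h*lb + h;
    unsigned char *C11 = C,        *C12 = C + h,
                  *C21 = C + h*lc, *C22 = C + h*lc + h;
    unsigned char *T1 = malloc(h*h), *T2 = malloc(h*h), *M[7];
    for (int i = 0; i < 7; i++) M[i] = malloc(h*h);

    madd(h, A11, la, A22, la, T1, h); madd(h, B11, lb, B22, lb, T2, h);
    strassen(h, T1, h, T2, h, M[0], h, cutoff);   /* M1=(A11+A22)(B11+B22) */
    madd(h, A21, la, A22, la, T1, h);
    strassen(h, T1, h, B11, lb, M[1], h, cutoff); /* M2=(A21+A22)B11 */
    madd(h, B12, lb, B22, lb, T2, h);
    strassen(h, A11, la, T2, h, M[2], h, cutoff); /* M3=A11(B12+B22) */
    madd(h, B21, lb, B11, lb, T2, h);
    strassen(h, A22, la, T2, h, M[3], h, cutoff); /* M4=A22(B21+B11) */
    madd(h, A11, la, A12, la, T1, h);
    strassen(h, T1, h, B22, lb, M[4], h, cutoff); /* M5=(A11+A12)B22 */
    madd(h, A21, la, A11, la, T1, h); madd(h, B11, lb, B12, lb, T2, h);
    strassen(h, T1, h, T2, h, M[5], h, cutoff);   /* M6=(A21+A11)(B11+B12) */
    madd(h, A12, la, A22, la, T1, h); madd(h, B21, lb, B22, lb, T2, h);
    strassen(h, T1, h, T2, h, M[6], h, cutoff);   /* M7=(A12+A22)(B21+B22) */

    /* combine; all signs from the classical formulas vanish over GF(2) */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            C11[i*lc+j] = M[0][i*h+j] ^ M[3][i*h+j] ^ M[4][i*h+j] ^ M[6][i*h+j];
            C12[i*lc+j] = M[2][i*h+j] ^ M[4][i*h+j];
            C21[i*lc+j] = M[1][i*h+j] ^ M[3][i*h+j];
            C22[i*lc+j] = M[0][i*h+j] ^ M[1][i*h+j] ^ M[2][i*h+j] ^ M[5][i*h+j];
        }
    free(T1); free(T2);
    for (int i = 0; i < 7; i++) free(M[i]);
}
```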

----------------------------------------------------------------------
| SAGE Version 3.0.1, Release Date: 2008-05-05                       |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s
Wall time: 4.56

Magma V2.13-5     Wed May 14 2008 21:55:31 on road     [Seed = 2246483032]
Type ? for help.  Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
7.170

sage: A = random_matrix(GF(2),2*10^4,2*10^4)
sage: B = random_matrix(GF(2),2*10^4,2*10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s
Wall time: 33.59

> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
49.990

Finally, a perfect cutting example (no extra rows/columns to take care of):

sage: A = random_matrix(GF(2),1<<14,1<<14)
sage: B = random_matrix(GF(2),1<<14,1<<14)
sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200
CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s
Wall time: 16.96

> A:= RandomMatrix(GF(2),2^14,2^14);
> B:= RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
24.060

Note however that this comparison is NOT FAIR since this version of Magma is 
not optimised for the C2D chip!

However, on AMD CPUs the times don't look that nice. Both on sage.math and on 
a 

   AMD Athlon(tm) 64 X2 Dual Core Processor 6400+

the performance is way way behind Magma (or what I'd expect Magma to perform 
like), e.g. on sage.math:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=1500)
CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s
Wall time: 14.62

Magma V2.13-5     Wed May 14 2008 14:07:18 on sage     [Seed = 4255048009]
Type ? for help.  Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
3.890 <<--------- They even beat the C2D time!

sage: A = random_matrix(GF(2),2*10^4,2*10^4)
sage: B = random_matrix(GF(2),2*10^4,2*10^4)
sage: time C = A._multiply_strassen(B,cutoff=1500)
CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s
Wall time: 100.06

> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
27.470

# perfect cutoff
sage: A = random_matrix(GF(2),1<<14,1<<14)
sage: B = random_matrix(GF(2),1<<14,1<<14)
sage: time C = A._multiply_strassen(B,cutoff=1<<10)
CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s
Wall time: 100.45
sage: time C = A._multiply_strassen(B,cutoff=1<<9)
CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s
Wall time: 76.87
sage: time C = A._multiply_strassen(B,cutoff=1<<8)
CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s
Wall time: 52.78

> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
14.970

I guess the bad performance of the M4RI library is partly related to the 
smaller L2 cache (AMD: 1MB, C2D: 4MB), which forces us to set the cutoff 
lower (since we want all three matrices in C = A*B to fit in L2 cache). 
Also, it seems that Magma is almost always ~3.5 times faster than our 
implementation. I am slowly running out of ideas for what to improve in 
order to close the gap, so if anyone has some ideas please let me hear them.

It might be worth noting that we use the same cutting strategy as Sage in 
strassen.pyx, which ignores uneven rows/columns at every recursion step and 
deals with the extra rows/columns after the Strassen-Winograd formula (see 
strassen.pyx for details). We also use the improved operation schedule from 
the paper of Clement et al. Though I don't know whether the cutting strategy 
is optimal, we also get beaten in a perfect cutting scenario, so I guess it 
is not that important for now. Anyway, I'd happily provide profiling data if 
anyone wants to take a look and squeeze a couple of cycles out. If not, then 
this e-mail is just a rather long status report :-)

Btw. I don't have access to Magma 2.14, which I believe to be the fastest at 
linear algebra over GF(2); in version 2.14 they added SSE2 support too. 
So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear 
about the results. I don't know what else to compare with except Magma. Any 
suggestions? Thoughts? Rants?

Cheers,
Martin

-- 
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]

