Hi Martin,

The cycle counter in your bench code seems to give random values on
the 2.4GHz Opteron (running SUSE 9 Linux) that I have access to,
which has Magma 2.14-10 on it.

Anyhow, here are the Magma times:

> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
2.940
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
16.570
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
9.250

The corresponding times for your code are approximately (hand timed;
adjusted for generation of random matrices):

10^4: 8s
2*10^4: 59s
2^14: 43s

I think it is polynomials over GF(2) that Magma 2.14 is much faster
at, rather than linear algebra.

Bill.

On 14 May, 22:52, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> Oh gods of the cpu cycle ... or hi there,
>
> (this e-mail contains details on a particular implementation for
> GF(2) linear algebra, feel free to ignore it if that doesn't get you going)
>
> I've just submitted a new (much improved) version of the M4RI library for
> inclusion in Sage:
>
>  http://trac.sagemath.org/sage_trac/ticket/3204
>
> M4RI is a library by Gregory Bard and myself for fast arithmetic with dense
> matrices over GF(2). M4RI stands for the "Method of the Four Russians"
> inversion which is an algorithm for matrix reduction first published by Greg.
> The website is here:
>
>  http://sage.math.washington.edu/home/malb/m4ri/
>
> This new version has among other things two new features:
>  - SSE2 support where supported (XOR of two 128-bit words in one instruction)
>  - Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four
> Russians" multiplication or Kronrod's algorithm)
>
> I'm writing to [sage-devel] because I observed some odd performance
> characteristics and hope to pick your brains on how to improve performance:
>
> At least in my experience enabling SSE2 results in slower code on AMD CPUs. I
> think this is rumoured across the net but did anyone else also witness this
> behaviour? I'm only using 128-bit integer XOR intrinsics. On my Core2Duo
> notebook SSE2 speeds up the computation at least up to the size of the L2
> cache.
>
> One would expect that the M4RM + Strassen implementation would have an easy
> time beating Magma, since it uses a better algorithm. Like Magma, it performs
> Strassen-Winograd down to the cutoff, but then it switches to the M4RM
> algorithm (O(n^3/log_2(n))) rather than naive multiplication (O(n^3)). This
> should give a nice constant speed-up. So far this theory seems to match
> reality on a Core2Duo notebook with 4MB L2 cache:
>
> ----------------------------------------------------------------------
> | SAGE Version 3.0.1, Release Date: 2008-05-05                       |
> | Type notebook() for the GUI, and license() for information.        |
> ----------------------------------------------------------------------
> sage: A = random_matrix(GF(2),10^4,10^4)
> sage: B = random_matrix(GF(2),10^4,10^4)
> sage: time C = A._multiply_strassen(B,cutoff=3200)
> CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s
> Wall time: 4.56
>
> Magma V2.13-5     Wed May 14 2008 21:55:31 on road     [Seed = 2246483032]
> Type ? for help.  Type <Ctrl>-D to quit.
> > A := RandomMatrix(GF(2),10^4,10^4);
> > B := RandomMatrix(GF(2),10^4,10^4);
> > t := Cputime(); C := A*B; Cputime(t);
>
> 7.170
>
> sage: A = random_matrix(GF(2),2*10^4,2*10^4)
> sage: B = random_matrix(GF(2),2*10^4,2*10^4)
> sage: time C = A._multiply_strassen(B,cutoff=3200)
> CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s
> Wall time: 33.59
>
> > A := RandomMatrix(GF(2),2*10^4,2*10^4);
> > B := RandomMatrix(GF(2),2*10^4,2*10^4);
> > t := Cputime(); C := A*B; Cputime(t);
>
> 49.990
>
> Finally, a perfect cutting example (no extra rows/columns to take care of):
>
> sage: A = random_matrix(GF(2),1<<14,1<<14)
> sage: B = random_matrix(GF(2),1<<14,1<<14)
> sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200
> CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s
> Wall time: 16.96
>
> > A := RandomMatrix(GF(2),2^14,2^14);
> > B := RandomMatrix(GF(2),2^14,2^14);
> > t := Cputime(); C := A*B; Cputime(t);
>
> 24.060
>
> Note however that this comparison is NOT FAIR since this version of Magma is
> not optimised for the C2D chip!
>
> However, on AMD CPUs the times don't look that nice. Both on sage.math and on
> a
>
>    AMD Athlon(tm) 64 X2 Dual Core Processor 6400+
>
> the performance is way way behind Magma (or what I'd expect Magma to perform
> like), e.g. on sage.math:
>
> sage: A = random_matrix(GF(2),10^4,10^4)
> sage: B = random_matrix(GF(2),10^4,10^4)
> sage: time C = A._multiply_strassen(B,cutoff=1500)
> CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s
> Wall time: 14.62
>
> Magma V2.13-5     Wed May 14 2008 14:07:18 on sage     [Seed = 4255048009]
> Type ? for help.  Type <Ctrl>-D to quit.
> > A := RandomMatrix(GF(2),10^4,10^4);
> > B := RandomMatrix(GF(2),10^4,10^4);
> > t := Cputime(); C := A*B; Cputime(t);
>
> 3.890 <<--------- They even beat the C2D time!
>
> sage: A = random_matrix(GF(2),2*10^4,2*10^4)
> sage: B = random_matrix(GF(2),2*10^4,2*10^4)
> sage: time C = A._multiply_strassen(B,cutoff=1500)
> CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s
> Wall time: 100.06
>
> > B := RandomMatrix(GF(2),2*10^4,2*10^4);
> > A := RandomMatrix(GF(2),2*10^4,2*10^4);
> > t := Cputime(); C := A*B; Cputime(t);
>
> 27.470
>
> # perfect cutoff
> sage: A = random_matrix(GF(2),1<<14,1<<14)
> sage: B = random_matrix(GF(2),1<<14,1<<14)
> sage: time C = A._multiply_strassen(B,cutoff=1<<10)
> CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s
> Wall time: 100.45
> sage: time C = A._multiply_strassen(B,cutoff=1<<9)
> CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s
> Wall time: 76.87
> sage: time C = A._multiply_strassen(B,cutoff=1<<8)
> CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s
> Wall time: 52.78
>
> > A := RandomMatrix(GF(2),2^14,2^14);
> > B := RandomMatrix(GF(2),2^14,2^14);
> > t := Cputime(); C := A*B; Cputime(t);
>
> 14.970
>
> I guess the bad performance of the M4RI library is partly related to the
> smaller L2 cache (AMD: 1MB, C2D: 4MB), which forces us to set the cutoff
> lower (since we want all three matrices in C = A*B to fit in L2 cache).
> Also, it seems that Magma is almost always ~3.5 times faster than our
> implementation. I am slowly running out of ideas for what to improve in
> order to close the gap, so if anyone has some ideas please let me hear
> them. It might be worth noting that we are using the same cutting strategy
> as Sage in strassen.pyx, which ignores uneven rows/columns at every
> recursion step and deals with the extra rows/columns after the
> Strassen-Winograd formula (see strassen.pyx for details). We are also
> using the improved operation schedule from the paper by Clement et al.
> Though I don't know whether the cutting strategy is optimal, we also get
> beaten in the perfect cutting scenario, so I guess it is not that
> important for now. Anyway, I'd happily provide profiling data if anyone
> wants to take a look and squeeze a couple of cycles out. If not, then
> this e-mail is just a rather long status report :-)
>
> Btw. I don't have access to Magma 2.14 which I believe to be the fastest in
> linear algebra over GF(2). In version 2.14 they added SSE2 support too.
> So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear
> about the results. I don't know what else to compare with except Magma. Any
> suggestions? Thoughts? Rants?
>
> Cheers,
> Martin
>
> --
> name: Martin Albrecht
> _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> _www:http://www.informatik.uni-bremen.de/~malb
> _jab: [EMAIL PROTECTED]