Oh gods of the cpu cycle ... or hi there, (this e-mail contains details on a particular implementation for GF(2) linear algebra, feel free to ignore it if that doesn't get you going)
I've just submitted a new (much improved) version of the M4RI library for inclusion in Sage: http://trac.sagemath.org/sage_trac/ticket/3204 M4RI is a library by Gregory Bard and myself for fast arithmetic with dense matrices over GF(2). M4RI stands for the "Method of the Four Russians" inversion which is an algorithm for matrix reduction first published by Greg. The website is here: http://sage.math.washington.edu/home/malb/m4ri/ This new version has among other things two new features: - SSE2 support where support (XOR two 128-bit words in one instruction) - Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four Russians" multiplication or Kronrod's algorithm) I'm writing to [sage-devel] because I observed some odd performance characteristics and hope to pick your brains on how to improve performance: At least in my experience enabling SSE2 results in slower code on AMD CPUs. I think this is rumoured across the net but did anyone else also witness this behaviour? I'm only using 128-bit integer XOR instrinsics. On my Core2Duo notebook SSE2 speeds up the computation at least up to the size of the L2 cache. One would expect that the M4RM + Strassen implementation would have an easy time beating Magma since it uses a better algorithm. Same as Magma it does Strassen-Winograd down to the cutoff but then it switches to the M4RM algorithm (O(n^3/log_2(n))) rather than naive multiplication (O(n^3)). This should give a nice constant speed-up. This theory so far seems to match reality on a Core2Duo notebook with 4MB L2 cache: ---------------------------------------------------------------------- | SAGE Version 3.0.1, Release Date: 2008-05-05 | | Type notebook() for the GUI, and license() for information. | ---------------------------------------------------------------------- sage: A = random_matrix(GF(2),10^4,10^4) sage: B = random_matrix(GF(2),10^4,10^4) sage: time C = A._multiply_strassen(B,cutoff=3200) CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s Wall time: 4.56 Magma V2.13-5 Wed May 14 2008 21:55:31 on road [Seed = 2246483032] Type ? for help. Type <Ctrl>-D to quit. > A := RandomMatrix(GF(2),10^4,10^4); > B := RandomMatrix(GF(2),10^4,10^4); > t := Cputime(); C := A*B; Cputime(t); 7.170 sage: A = random_matrix(GF(2),2*10^4,2*10^4) sage: B = random_matrix(GF(2),2*10^4,2*10^4) sage: time C = A._multiply_strassen(B,cutoff=3200) CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s Wall time: 33.59 > A := RandomMatrix(GF(2),2*10^4,2*10^4); > B := RandomMatrix(GF(2),2*10^4,2*10^4); > t := Cputime(); C := A*B; Cputime(t); 49.990 Finally, a perfect cutting example (no extra rows/columns to take care of): sage: A = random_matrix(GF(2),1<<14,1<<14) sage: B = random_matrix(GF(2),1<<14,1<<14) sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200 CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s Wall time: 16.96 > A:= RandomMatrix(GF(2),2^14,2^14); > B:= RandomMatrix(GF(2),2^14,2^14); > t := Cputime(); C := A*B; Cputime(t); 24.060 Note however that this comparison is NOT FAIR since this version of Magma is not optimised for the C2D chip! However, on AMD CPUs the times don't look that nice. Both on sage.math and on a AMD Athlon(tm) 64 X2 Dual Core Processor 6400+ the performance is way way behind Magma (or what I'd expect Magma to perform like), e.g. on sage.math: sage: A = random_matrix(GF(2),10^4,10^4) sage: B = random_matrix(GF(2),10^4,10^4) sage: time C = A._multiply_strassen(B,cutoff=1500) CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s Wall time: 14.62 Magma V2.13-5 Wed May 14 2008 14:07:18 on sage [Seed = 4255048009] Type ? for help. Type <Ctrl>-D to quit. > A := RandomMatrix(GF(2),10^4,10^4); > B := RandomMatrix(GF(2),10^4,10^4); > t := Cputime(); C := A*B; Cputime(t); 3.890 <<--------- They even beat the C2D time! sage: A = random_matrix(GF(2),2*10^4,2*10^4) sage: B = random_matrix(GF(2),2*10^4,2*10^4) sage: time C = A._multiply_strassen(B,cutoff=1500) CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s Wall time: 100.06 > B := RandomMatrix(GF(2),2*10^4,2*10^4); > A := RandomMatrix(GF(2),2*10^4,2*10^4); > t := Cputime(); C := A*B; Cputime(t); 27.470 # perfect cutoff sage: A = random_matrix(GF(2),1<<14,1<<14) sage: B = random_matrix(GF(2),1<<14,1<<14) sage: time C = A._multiply_strassen(B,cutoff=1<<10) CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s Wall time: 100.45 sage: time C = A._multiply_strassen(B,cutoff=1<<9) CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s Wall time: 76.87 sage: time C = A._multiply_strassen(B,cutoff=1<<8) CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s Wall time: 52.78 > A := RandomMatrix(GF(2),2^14,2^14); > B := RandomMatrix(GF(2),2^14,2^14); > t := Cputime(); C := A*B; Cputime(t); 14.970 I guess the bad performance of the M4RI library is partly related to the smaller L2 cache (AMD: 1MB, C2D: 4MB) which forces us to set the cutoff lower (since we want all three C = A*B submatrices to fit in L2 cache). Also it seems that Magma is almost always ~ 3.5 times faster than our implementation. I am slowly running out of ideas what to improve in order to close to the gap so if anyone has some ideas please let me hear them. It might be worth noting that we are using the same cutting strategy as Sage in strassen.pyx which ignores uneven rows/columns at every recursion step and deals with the extra rows/columns after the Strassen-Winograd formula (see strassen.pyx for details). We are also using the improved operation schedule from paper of Clement et al.Though I don't know if the cutting strategy is optimal, we also get beaten in a perfect cutting scenario, so I guess it is not that important for now. Anyway, I'd happily provide profiling data if anyone wants to take a look and squeeze a couple of cycles out. If not, then this e-mail is jut a quite long status report :-) Btw. I don't have access to Magma 2.14 which I believe to be the fastest in linear algebra over GF(2). In version 2.14 they added SSE2 support too. So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear about the results. I don't know what else to compare with except Magma. Any suggestions? Thoughts? Rants? Cheers, Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED] --~--~---------~--~----~------------~-------~--~----~ To post to this group, send email to sage-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/sage-devel URLs: http://www.sagemath.org -~----------~----~----~----~------~----~------~--~---