[sage-devel] Re: slightly OT: new M4RI library
On Tuesday 20 May 2008, Bill Hart wrote:
> Martin, On the M4RI website you say that M4R inversion is asymptotically
> fast with time complexity n^3/log_2(n), but Strassen's method has
> complexity n^log2(7), which I would call asymptotically fast. Do you just
> mean asymptotically faster than the classical algorithm?

Yes, I'll fix that.

> By the way, I wrote some code ages ago for computing the row echelon form
> of sparse matrices over GF2 which essentially inverts 6000x6000 sparse
> matrices in about a second.

Excellent, is the code available somewhere? How dense are these matrices?

My plan for M4RI is:
 - polish what we have for multiplication now, find appropriate/acceptable parameters for a wide range of platforms, make sure it builds everywhere etc.
 - during dev1 Greg and I are going to work on Strassen + M4RI/echelonisation. This is the actual application we are most interested in.
 - Once that is done, I want to look into sparse matrices (which are probably much more relevant for my work than dense matrices), so I'd be interested in looking into your old code. I'm not sure this should go under the roof of the M4RI library, but if it does then this probably warrants a name change :-)

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]

To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
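For readers following along, the n^3/log_2(n) bound comes from the Method of the Four Russians table trick: all 2^k GF(2) linear combinations of k rows are precomputed in Gray-code order, so each table entry costs a single row XOR, and a lookup then eliminates k columns at once. A minimal sketch of the table construction (my own illustration, not M4RI's actual API; rows are shrunk to a single 64-bit word):

```c
#include <stdint.h>

/* Hypothetical illustration (not M4RI's actual API): build the table of
   all 2^k GF(2) linear combinations of k rows, each row a single 64-bit
   word here. Walking the subsets in Gray-code order means each entry
   differs from the previous one by exactly one row, so the whole table
   costs about 2^k row XORs instead of roughly k * 2^k. */
void build_gray_table(const uint64_t rows[], int k, uint64_t table[]) {
    table[0] = 0;                     /* empty combination */
    uint64_t prev = 0;                /* Gray code of the previous step */
    for (uint64_t i = 1; i < (1ULL << k); i++) {
        uint64_t gray = i ^ (i >> 1); /* Gray code of i */
        uint64_t diff = gray ^ prev;  /* exactly one bit set */
        int bit = 0;
        while (!((diff >> bit) & 1))
            bit++;                    /* which row toggled */
        table[gray] = table[prev] ^ rows[bit];
        prev = gray;
    }
}
```

After the call, `table[mask]` is the XOR of all `rows[j]` with bit j set in `mask`, which is exactly what the multiplication loop looks up per k-bit chunk.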
[sage-devel] Re: slightly OT: new M4RI library
My code is not downloadable anywhere that I am aware of. I used to use it in my quadratic sieve before I switched to Jason P's block Lanczos code (block Lanczos solves a more restricted problem, but that is fine for the QS). My old code is some of the earliest mathematical C code that I wrote, so it can certainly be improved, but I can make it available at some point.

The crossover with block Lanczos was at about 6000x6000. The 6000x6000 matrices had something like 12 nonzero entries in each row, so quite sparse. Of course the matrix quickly becomes dense as the algorithm proceeds. I was also taking advantage of the fact that the left hand side of the matrix was denser than the right, which is probably not the situation arising in your research, Martin. At any rate, it might make a reasonable base case for an implementation of something.

Bill.

On 20 May, 09:47, Martin Albrecht [EMAIL PROTECTED] wrote:
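Since sparse matrices over GF(2) may come back up: one common representation is a sorted array of the nonzero column indices of each row, where row addition over GF(2) becomes a symmetric-difference merge (indices present in both rows cancel, since 1 + 1 = 0). A hedged sketch of that merge (my own illustration, not Bill's actual code):

```c
#include <stddef.h>

/* Hypothetical sketch: add two sparse GF(2) rows, each stored as a
   sorted array of nonzero column indices. The result is the symmetric
   difference of the index sets; `out` must have room for na + nb
   entries. Returns the number of nonzero entries in the result. */
size_t sparse_row_add(const int *a, size_t na,
                      const int *b, size_t nb, int *out) {
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])      out[n++] = a[i++];
        else if (b[j] < a[i]) out[n++] = b[j++];
        else { i++; j++; }    /* equal indices cancel mod 2 */
    }
    while (i < na) out[n++] = a[i++];
    while (j < nb) out[n++] = b[j++];
    return n;
}
```

This runs in O(na + nb) per row addition, which is why sparse elimination stays cheap only while rows stay sparse, matching Bill's remark that the matrix fills in as the algorithm proceeds.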
[sage-devel] Re: slightly OT: new M4RI library
Martin, does your machine speed up at all with any of the SSE code?

I spent considerable time yesterday trying to optimise the combine3 function I wrote (note: I didn't submit this improved code). Whilst I did speed it up considerably by removing %'s and /'s, changing the function prototype to accept rows directly, and making lots of other minor improvements, the overall result was still much slower than the generic C code. If the SSE code doesn't ever speed it up (and it might not be faster to use SSE when there is at least one load/store per arithmetic operation), then it might be worth throwing all the SSE code out completely, since it just pollutes what is otherwise very nice looking code.

I did check, by compiling the code to assembler without assembling it (the -S option in gcc), that it is actually using SSE instructions and that it is doing so in what I would consider an optimal way. So it really is SSE itself that is slow, not just inefficiencies introduced by the compiler.

One possible avenue for exploration is the gf2x library by Zimmermann, Brent, et al. They use SSE in their code for polynomials over GF2, apparently to good effect. We could look to see if there is some trick to using it efficiently. The difference there, however, is probably that the polynomials get quite long and multiple arithmetic operations need to be done for each store/load. I don't know if any of their tricks can help us for that reason.

Bill.

On 19 May, 02:00, Martin Albrecht [EMAIL PROTECTED] wrote:
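For reference, the kind of three-operand combine being discussed computes a[i] ^= b[i] ^ c[i] across a row of machine words. A portable sketch (an illustration, not the actual M4RI combine3; the SSE2 path does the same XORs two 64-bit words at a time and compiles out cleanly on machines without SSE2, matching the suggestion that SSE support stay optional):

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical sketch of a combine3-style routine:
   dst[i] ^= b[i] ^ c[i] over nwords 64-bit words. */
void combine3(uint64_t *dst, const uint64_t *b,
              const uint64_t *c, size_t nwords) {
    size_t i = 0;
#ifdef __SSE2__
    /* Process two 64-bit words per iteration with 128-bit XORs.
       Unaligned loads/stores are used so the sketch works for any
       row pointer; aligned variants need 16-byte-aligned rows. */
    for (; i + 2 <= nwords; i += 2) {
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i));
        __m128i vd = _mm_loadu_si128((const __m128i *)(dst + i));
        vd = _mm_xor_si128(vd, _mm_xor_si128(vb, vc));
        _mm_storeu_si128((__m128i *)(dst + i), vd);
    }
#endif
    /* Generic C tail (and the whole loop without SSE2). */
    for (; i < nwords; i++)
        dst[i] ^= b[i] ^ c[i];
}
```

Note the arithmetic density here: one XOR per load, which is exactly the load/store-bound regime where, as Bill observes, SSE may not win anything.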
[sage-devel] Re: slightly OT: new M4RI library
I had a look at gf2x. The reason they can make use of SSE is the shl and shr capabilities. Doing a shift 128 bits at a time is more efficient, since one operation can be done instead of quite a few using general purpose registers.

Bill.

On 19 May, 13:52, Bill Hart [EMAIL PROTECTED] wrote:
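To make the shift trick concrete: SSE2 has no single full 128-bit bit shift, but a per-lane shift plus a carry between the two 64-bit lanes still replaces several general-purpose-register operations. A portable sketch of the carry logic (my own illustration; with SSE2 the same steps would be `_mm_slli_epi64` for the per-lane shift, `_mm_srli_epi64` plus `_mm_slli_si128` for the carry, and `_mm_or_si128` to merge):

```c
#include <stdint.h>

/* A 128-bit value as two 64-bit lanes; lo holds bits 0..63. */
typedef struct { uint64_t lo, hi; } u128;

/* Shift left by s bits, valid for 0 < s < 64 (the shift amount used
   for bit-aligning GF(2)[x] polynomial words is always in this range
   after whole-word shifts are handled separately). */
u128 shl128(u128 x, unsigned s) {
    u128 r;
    r.hi = (x.hi << s) | (x.lo >> (64 - s)); /* carry lo -> hi */
    r.lo = x.lo << s;
    return r;
}
```

In gf2x-style polynomial code this runs over long operand arrays, so the per-shift cost is amortised over many words, unlike the one-XOR-per-load pattern in the matrix combine.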
[sage-devel] Re: slightly OT: new M4RI library
Bill, I do get a small speed-up on the Core2Duo for SSE2, but I'm not sure it is worth the trouble (I agree that it makes the otherwise pretty looking code ugly). I have some timings (for an old implementation) here: http://trac.sagemath.org/sage_trac/ticket/3204#comment:2

My guess is that SSE2 is slower on the Opteron because SSE2 is basically an Intel thing and only provided by AMD for compatibility reasons. There are several reports of SSE2 being slow on the Opteron, and I guess the SSE2 integer operations were not a focus for speed, since MMX/SSE is mainly about floating point. One thing I noticed on the Opteron is that putting the code in mzd_combine vs. putting the same code directly in the calling function made a huge difference. I blamed it on better cache prefetching support, but that was probably premature.

My proposal:
 - This evening I'll update my code with your 8 Gray tables and check the performance on the C2D.
 - Then I'll re-introduce SSE2 and check whether it makes a worthwhile difference; if not, we drop SSE2 from the multiplication.

Martin

PS: I tried a quick and dirty OpenMP (which is cool, btw) based parallel implementation of Strassen-Winograd yesterday and it gives - as is - a speedup of 1.8 or so (so not optimal yet). But comparing that with Magma feels like cheating; first we should aim for better speed with the same resources, and then we switch to parallel implementations for even better times.
Anyway, one week ago I wouldn't have believed that I could do a 10^4 x 10^4 matrix multiplication in 1.7 seconds on my notebook :-)
[sage-devel] Re: slightly OT: new M4RI library
You seemed to be getting up to 8% at points there. That's definitely worth it. I'll be interested to see this evening how it comes out, though I recommend optimising my combine3 function (which I suppose should now be combine8), even including it inline rather than having it in a separate file. Of course on the Opteron, SSE should be switched off, since it is definitely slower by about 5%-10% even with careful optimisation.

Bill.

On 19 May, 14:23, Martin Albrecht [EMAIL PROTECTED] wrote:
[sage-devel] Re: slightly OT: new M4RI library
On Monday 19 May 2008, Bill Hart wrote:
> I recommend optimising my combine3 function (which I suppose should now
> be combine8), even including it inline rather than having it in a
> separate file.

Okay, so a good compromise is to remove all SSE2 stuff from the main function _mzd_mul_m4rm_impl and put it into a static inline _mzd_combine8 function which is specifically tailored to this particular application. Thus the code still looks relatively pretty/elegant, but we can have SSE2 support.

Martin
[sage-devel] Re: slightly OT: new M4RI library
On Monday 19 May 2008, Bill Hart wrote:
> I'll be interested to see this evening how it comes out, though I
> recommend optimising my combine3 function (which I suppose should now be
> combine8), even including it inline rather than having it in a separate
> file.

Bill, some progress report for the C2D. I incorporated your changes with the following modifications:

 - if (x1==0 && x2==0 && x3==0 && x8==8) ... I removed this one since it seems rather unlikely that this is ever true.
 - I added a #define called GRAY8 which defines whether 8 or 4 tables ought to be used. I figured this might be handy for machines with a smaller L1.
 - I added the correct a_nc%k values back in.
 - We don't need to make sure that the allocated buffers are correctly aligned, since we try to allocate with _mm_malloc. If that is not available we should probably just use posix_memalign.

The speed-up on the C2D (similar to the Opteron) is considerable (the last column is a parallel toy implementation):

64-bit Debian/GNU Linux, 2.33Ghz Core2Duo, cutoff=2^12 (*)

Matrix Dimension     Magma    M4RI     M4RI (OpenMP, walltime)
10,000 x 10,000       2.920    2.270    1.470
16,384 x 16,384      11.140    9.130    5.540
20,000 x 20,000      20.370   16.110   11.800
32,000 x 32,000      75.510   64.340   40.040

Amazingly, M4RM alone (without Strassen-Winograd) now beats Magma up to 2*10^4 x 2*10^4 in this configuration:

sage: A = random_matrix(GF(2),10^3, 10^3);
sage: B = random_matrix(GF(2),10^3, 10^3);
sage: magma(A._multiply_m4rm(B)) == magma(A)*magma(B)
True
sage: A = random_matrix(GF(2),2*10^4, 2*10^4);
sage: B = random_matrix(GF(2),2*10^4, 2*10^4);
sage: time A._multiply_m4rm(B)
CPU times: user 18.32 s, sys: 0.10 s, total: 18.42 s
Wall time: 18.65

64-bit Debian/GNU Linux, 1.8Ghz Opteron (sage.math), cutoff=2^11

Matrix Dimension     Magma 2.13-5 (64-bit)    M4RI-20080518 (64-bit)
10,000 x 10,000        4.190                     4.290
16,384 x 16,384       15.360                    15.230
20,000 x 20,000       29.530                    28.640
32,000 x 32,000      103.970                   114.620

That does seem to roughly match what you reported. I'll now look into SSE2 support.

Cheers,
Martin

(*) Note: Magma is 64-bit but not optimised for C2D.
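The alignment point above can be made concrete. A minimal fallback along the lines Martin suggests (an assumption of mine, not M4RI's actual allocator): use posix_memalign when _mm_malloc is unavailable, so rows land on the 16-byte boundaries that aligned SSE2 loads and stores require:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical sketch of the fallback discussed: return a 16-byte
   aligned buffer of nbytes bytes, or NULL on failure. On platforms
   with <xmmintrin.h> one could use _mm_malloc(nbytes, 16) /
   _mm_free instead; this version is freed with plain free(). */
void *alloc_aligned16(size_t nbytes) {
    void *p = NULL;
    if (posix_memalign(&p, 16, nbytes) != 0)
        return NULL; /* allocation failed */
    return p;
}
```

A matrix library would call this per row (or per row block), so every row pointer satisfies the alignment assumption the SSE2 code paths rely on.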
[sage-devel] Re: slightly OT: new M4RI library
Martin,

On the M4RI website you say that M4R inversion is asymptotically fast with time complexity n^3/log_2(n), but Strassen's method has complexity n^log2(7), which I would call asymptotically fast. Do you just mean asymptotically faster than the classical algorithm?

By the way, I wrote some code ages ago for computing the row echelon form of sparse matrices over GF2 which essentially inverts 6000x6000 sparse matrices in about a second.

Bill.

On 19 May, 19:58, Martin Albrecht [EMAIL PROTECTED] wrote:
[sage-devel] Re: slightly OT: new M4RI library
I sorted out what was wrong. In my combine code I was combining operands in the wrong order. This meant that a whole lot of zeroes ended up where they shouldn't be, hence the if (x) thing working. So my times were unfortunately completely screwed up.

I went back to a fresh tarball with the original code and applied my changes one by one. Only two changes actually improved the time. First I give Magma times vs. your (Martin's) code:

                 Magma     Martin
1x1:             2.940s    3.13s
16384x16384:     9.250s   12.96s
2x2:            16.57s    22.43s
32000x32000:    59.1s     90.2s

Now I give times for my code with a single Gray table. The legend below indicates what the three times mean:

                 0)        1)        2)
1x1:             7.76s     6.70s     4.40s
16384x16384:    44.6s     37.3s     18.3s
2x2:            53.3s     45.9s     31.2s
32000x32000:     ---      194s      134s

0) cutoff = 2048, original code, no SSE
1) cutoff = 3600, CACHE BLOCK A (256, 768)
2) cutoff = 7200, fixed k = 6

So it is clear that two Gray tables is much better than one. The only thing puzzling me now is how you get away with k = 7 with two Gray tables, whereas I was using k = 6 with one table. Also, you stick with BLOCK = 768, whereas I found it optimal to switch to 256 for 16384x16384. However, if I make any additional changes to your code it just slows down. It must have to do with this copying out that you did. That must significantly affect cache performance.

Anyhow, I'm now going to wipe all the versions I have (except my working one with an optimal single Gray table implementation) and just start from your code as a base for further improvements. We are still up to 50% slower than Magma on the Opteron!!

Bill.

On 18 May, 02:21, Bill Hart [EMAIL PROTECTED] wrote:
> I managed to get the modified version from the spkg. Nice code!!
> Unfortunately it is not as fast on my Opteron. So more work tomorrow for
> me to try and get it down to the same times as I have with my version.
> Here are the times, all on my Opteron. Note your C2D version was optimal
> at a cutoff of 2048, not 7200 as for my code.
>
> Now I am worried that maybe my code is actually broken somehow and still
> passing the test code. I'll carefully make the changes to your code
> tomorrow to see if that is the case.
>
>               Magma    C2D-M4RI:2048   AMD-M4RI:7200   AMD-M4RI:2048
> 1x1:           2.940s    3.13s           3.442s          4.132s
> 16384x16384:   9.250s   12.96s          11.47s          11.80s
> 2x2:          16.57s    22.43s          19.3s           26.05s
> 32000x32000:  59.05s    90.20s          71.9s           71.8s
>
> Bill.
>
> On 18 May, 01:58, Bill Hart [EMAIL PROTECTED] wrote:
> > On 18 May, 00:40, Martin Albrecht [EMAIL PROTECTED] wrote:
> > > My version is here:
> > > http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg
> > > (this needs an updated patch for Sage) and here:
> > > http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz
> > > (which is the raw source).
> >
> > This pure C version seems to be the old version, before you made either
> > of the two big changes.
> >
> > Bill.
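One way to reason about the k and table-count tuning above (a back-of-envelope model of my own, not a formula from the thread): the Gray tables must stay resident in the L1 data cache while a block is processed, and t tables of 2^k rows of `width` 64-bit words occupy t * 2^k * width * 8 bytes. A trivial helper makes the trade-off visible:

```c
#include <stddef.h>

/* Back-of-envelope helper (an assumption of mine, not part of M4RI):
   bytes occupied by `tables` Gray tables, each holding 2^k rows of
   `width` 64-bit words. For k = 6 and width = 32 (a 2048-column
   block), one table is 64 * 32 * 8 = 16 KiB; raising k to 7 doubles
   that, which is one plausible reason a larger k only pays off when
   the tables still fit comfortably in L1. */
size_t gray_table_bytes(size_t tables, unsigned k, size_t width) {
    return tables * ((size_t)1 << k) * width * 8;
}
```

Against, say, a 64 KiB L1 data cache, one k = 6 table of width 32 uses a quarter of the cache, while eight k = 8 tables of the same width would need 512 KiB, so they compete with the matrix rows themselves.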
[sage-devel] Re: slightly OT: new M4RI library
Hi,

first, I recorded the different speed-ups in a small table for an overview in the attachment (I think we've come a long way :-)).

To disable the copying out one needs to edit

  /* we copy the matrix first since it is only constant memory overhead
     and improves data locality, if you remove it make sure there are no
     speed regressions */
  /* C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE); */
  packedmatrix *Cbar = mzd_init(C->nrows, C->ncols);
  Cbar = _mzd_mul_m4rm_impl(Cbar, A, B, 0, FALSE);
  mzd_copy(C, Cbar);
  mzd_free(Cbar);
  return C;

in strassen.c to

  /* we copy the matrix first since it is only constant memory overhead
     and improves data locality, if you remove it make sure there are no
     speed regressions */
  C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE);
  return C;

This disables the copying out.

Martin

PS: If I find some time later today I'll make some changes such that SSE2 can be used more often, i.e. align each row at a 16-byte border if HAVE_SSE2 is set.

[attachment: timings.html]

                Magma 2.14-10   M4RI-Orig   CacheFriendly M4RM   TwoTables   CacheFriendly, k=6, cutoff=7200
1x1              2.94            8           4.529                3.13         4.4
16000x16000      9.25           43          20                   12.96        18.3
2x2             16.57           59          32.34                22.43        31.2
32000x32000     59.1            --          --                   90.2        134
[sage-devel] Re: slightly OT: new M4RI library
If two Gray tables is better than one, then you can't have enough of a good thing, right? So I made 3 Gray tables now. The files I modified are in: http://sage.math.washington.edu/home/wbhart/m4ri3gray/

There are three main modifications:

1) Make all matrices SSE aligned if we have SSE2, and make all rows of all matrices aligned. This required a fix in brilliantrussian.c, since it makes some assumptions about how the data is stored in matrices.

2) Introduce functions combine2 and combine3 which use SSE to do a[i] ^= b[i] ^ c[i] and a[i] ^= b[i] ^ c[i] ^ d[i].

3) Code to use three Gray tables. I had to remove the a_nc%k's since these were causing segfaults for reasons I don't understand.

Here are the Magma times, your old times and the new times on my Opteron:

                 Magma     Old       New
1x1:             2.940s    3.13s     2.82s
16384x16384:     9.250s   12.96s    11.25s
2x2:            16.57s    22.43s    19.39s
32000x32000:    59.1s     90.2s     81.56s

Note I do not use the new combine2 and combine3 functions, as they slow it down on my machine. I cannot believe how useless SSE seems to be! I've commented out the three lines that use combine3 in brilliantrussian.c in case you want to try it with SSE on your C2D.

Bill.

On 18 May, 17:36, Martin Albrecht [EMAIL PROTECTED] wrote:
[sage-devel] Re: slightly OT: new M4RI library
The copying out makes a 50% difference (it's better with copying) to the speed of 16384x16384, but no difference to 1x1 or 2x2. That's weird.

Bill.
[sage-devel] Re: slightly OT: new M4RI library
I tried copying out the input matrices to the M4RM routine, but only when the rows aren't all contiguous in memory. This didn't speed anything up. Of course the reason is that the second matrix is only read a handful of times, to construct the Gray tables, which are then used extensively. The first matrix is read out of order anyway by M4RM, so there's no point making all its rows contiguous.

It's hard to see how to get that last bit we need to beat Magma. What I don't quite understand now is the fact that we are beating Magma all the way up to 1x1, which is surely past their crossover to Strassen. But we start losing for large matrices when we use Strassen. I checked that the addition is not slower than Magma (it's not; it's up to 5 times faster). The only trick I have left to try is to use twice the number of Gray tables but make them half the width, though that seems like cheating, since we should already have a fast enough base case by now!

Bill.

On 18 May, 20:09, Bill Hart [EMAIL PROTECTED] wrote:
[sage-devel] Re: slightly OT: new M4RI library
Well, apparently there are speed gains right up to 8 Gray tables, though I really have no idea why that is. Here are the new times:

             Magma    Old      New
10000x10000: 2.940s   3.13s    2.32s
16384x16384: 9.250s   12.96s   9.17s
20000x20000: 16.57s   22.43s   16.49s
32000x32000: 59.1s    90.2s 81.6s  65.7s

So now we beat Magma all the way up to 20000x20000. I'll put the adjusted code in the directory http://sage.math.washington.edu/home/wbhart/m4ri3gray/ in just a moment. I also found a memory leak in my code, which I fixed. Bill. On 18 May, 19:19, Bill Hart [EMAIL PROTECTED] wrote: If two Gray tables are better than one, then you can't have enough of a good thing, right? So I made 3 Gray tables now. The files I modified are in: http://sage.math.washington.edu/home/wbhart/m4ri3gray/ There are three main modifications: 1) Make all matrices SSE-aligned if we have SSE2, and make all rows of all matrices aligned. This required a fix in brilliantrussian.c, since it makes some assumptions about how the data is stored in matrices. 2) Introduce functions combine2 and combine3, which use SSE to do a[i] ^= b[i] ^ c[i] and a[i] ^= b[i] ^ c[i] ^ d[i]. 3) Code to use three Gray tables. I had to remove the a_nc%k's since these were causing segfaults for reasons I don't understand. Here are the Magma times, your old times and the new times on my Opteron:

             Magma    Old      New
10000x10000: 2.940s   3.13s    2.82s
16384x16384: 9.250s   12.96s   11.25s
20000x20000: 16.57s   22.43s   19.39s
32000x32000: 59.1s    90.2s    81.56s

Note I do not use the new combine2 and combine3 functions, as they slow it down on my machine. I cannot believe how useless SSE seems to be! I've commented out the three lines that use combine3 in brilliantrussian.c in case you want to try it with SSE on your C2D. Bill.
[sage-devel] Re: slightly OT: new M4RI library
On Monday 19 May 2008, Bill Hart wrote: Well, apparently there are speed gains right up to 8 Gray tables, though I really have no idea why that is. So now we beat Magma all the way up to 20000x20000. I also found a memory leak in my code which I fixed. Bill. Cool, I'll take a look tomorrow. It seems we're closing in on the classical algorithm, though. Although you don't seem to decrease k anymore. As for the copying and 2^14 vs. 10^4: maybe the extra rows/columns cost too many cycles. I'll try valgrind/callgrind/cachegrind on the code to see if that is true. Also, maybe for 2^14 x 2^14 the submatrices are nicely aligned on cache lines. I'm wildly speculating here. Martin
[sage-devel] Re: slightly OT: new M4RI library
I figured out why 8 tables is optimal on this machine. The L1 cache is 128KB, and at 2048x2048, 8 Gray tables with k = 5 fit exactly in half the cache, which is probably optimal. Bill. On 19 May, 02:00, Martin Albrecht [EMAIL PROTECTED] wrote: Cool, I'll take a look tomorrow. It seems we're closing in on the classical algorithm, though. Although you don't seem to decrease k anymore. I'm wildly speculating here. Martin
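The arithmetic behind this checks out directly: 8 tables of 2^5 rows, each row 2048 bits wide, occupy 8 × 32 × 256 bytes = 65536 bytes = 64KB, exactly half of a 128KB L1 cache. A small sketch (the helper name is ours):

```c
/* Memory footprint of the Gray tables: `tables` tables of 2^k rows,
   each row `row_bits` bits wide. Helper name is ours, for illustration. */
static unsigned long gray_table_bytes(unsigned tables, unsigned k,
                                      unsigned row_bits)
{
    return (unsigned long)tables * (1ul << k) * (row_bits / 8ul);
}
```

For Bill's configuration, gray_table_bytes(8, 5, 2048) is 65536 bytes, half of the 128KB Opteron L1.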
[sage-devel] Re: slightly OT: new M4RI library
Here's another idea which should speed things up a bit. For 10000x10000 we currently use k = 6. Instead of this, we could use k = 5 and make two Gray tables simultaneously. This will still fit in cache. Instead of doing 6 bits at a time, we can then do 10 bits at a time. We'd load the appropriate line from the first Gray table, then the appropriate one from the second and xor them, then xor with the output matrix. This should decrease the number of loads and stores considerably. Moreover, the SSE instructions will then be much more efficient, as the ratio of arithmetic instructions to loads and stores is higher. Of course one could also do 16 bits at a time by using 4 tables, but I think this might actually get slower again, since you've only increased the amount of work done by 60% but had a 30% increase in instructions. Bill. On 17 May, 17:45, Bill Hart [EMAIL PROTECTED] wrote: Martin, The test code still passes if you change RADIX to 128. I've no idea how it passes, but it does. Shame the results are not correct, because this speeds the code up by a factor of 2. I notice that in the SSE code, you check to see if alignment can be achieved, otherwise it doesn't use SSE. But this introduces an unpredictable branch. Also, where there are three operands, you can't use SSE2, because the likelihood of all three being aligned is too small. I think a better idea would be to explicitly force all matrices and all rows to be 128-bit aligned if the matrices are wide enough to benefit from SSE2. Then the combine function can always use SSE2 and there will be no need to check for alignment. I experimented with interleaving MMX and GPR XORs, but this doesn't speed anything up. There are more instructions emitted and the time stays about the same. The only way interleaving the MMX and GPR code would speed things up is if there was more computation going on in the registers and less memory loading and storing, I think. Bill.
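A scalar sketch of the combining step in the two-table scheme: one row from each k = 5 table is XORed into the output row in a single pass, so each output word costs two XORs against three loads and one store, instead of two separate passes costing four loads, two stores and two XORs. This is our illustration, modelled on the combine2 idea that appears elsewhere in this thread, not the M4RI source.

```c
#include <stdint.h>
#include <stddef.h>

/* One-pass combine of one row from each of two Gray tables into the
   output row of width `wide` words. Per word: two XORs, three loads
   (dst, t0_row, t1_row) and one store. Names are ours. */
static void combine2(uint64_t *dst, const uint64_t *t0_row,
                     const uint64_t *t1_row, size_t wide)
{
    for (size_t i = 0; i < wide; i++)
        dst[i] ^= t0_row[i] ^ t1_row[i];
}
```

An SSE2 version would do the same word-wise XORs on 128-bit registers, which is where the improved arithmetic-to-memory ratio pays off.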
On 17 May, 15:45, Bill Hart [EMAIL PROTECTED] wrote: Hi Martin, Here is another 10% improvement. In the loop at the bottom of mzd_combine you can explicitly unroll by a factor of 8:

word *end = b1_ptr + wide;
register word *end8 = end - 8;
while (b1_ptr < end8) {
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
}
while (b1_ptr < end) {
  *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
}

I did this in combination with changing the crossover for 10000x10000 from 3600 to 7200. Bill. On 17 May, 09:40, Martin Albrecht [EMAIL PROTECTED] wrote: On Saturday 17 May 2008, Bill Hart wrote: In going from 5000x5000 to 10000x10000, Magma's time increases by a factor of less than 4. That is impossible. Strassen will never help us there. They must be doing something else. Probably something clever. Bill. I was stuck there too yesterday. Maybe only at 10000x10000 the pipeline gets fully utilised? Martin PS: If we run out of ideas we can simply go for parallelism; that should help on sage.math ;-)
[sage-devel] Re: slightly OT: new M4RI library
On Saturday 17 May 2008, Bill Hart wrote: Martin, The test code still passes if you change RADIX to 128. I've no idea how it passes, but it does. Shame the results are not correct, because this speeds the code up by a factor of 2. Since all routines use the RADIX and I only check whether their results match, they are all wrong in the same way and it isn't detected. I should add a test with known answers, I suppose. I notice that in the SSE code, you check to see if alignment can be achieved, otherwise it doesn't use SSE. But this introduces an unpredictable branch. Also, where there are three operands, you can't use SSE2, because the likelihood of all three being aligned is too small. I think a better idea would be to explicitly force all matrices and all rows to be 128-bit aligned if the matrices are wide enough to benefit from SSE2. Then the combine function can always use SSE2 and there will be no need to check for alignment. I'll try that. I experimented with interleaving MMX and GPR XORs, but this doesn't speed anything up. There are more instructions emitted and the time stays about the same. The only way interleaving the MMX and GPR code would speed things up is if there was more computation going on in the registers and less memory loading and storing, I think. I came to the same conclusion (but my code might not have been as good as yours). I improved other areas of the code (e.g. use naive multiplication rather than M4RM if B->ncols < RADIX, since it is faster, etc.) I can forward you my newest tarball (but the speed improvements aren't really noticeable yet).
Martin
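Martin's point about tests with known answers can be illustrated as follows: check a naive GF(2) multiplication against a product worked out by hand, so that a systematic error shared by all routines (such as a wrong RADIX) is still caught even when the routines agree with each other. This is a hypothetical sketch, not the M4RI test suite; rows are bitmasks with bit j holding column j.

```c
#include <stdint.h>
#include <stddef.h>

/* Naive GF(2) matrix multiplication for n <= 64 columns, used here as
   the subject of a known-answer test: row i of each matrix is a
   uint64_t bitmask with bit j = entry (i,j). */
static void gf2_mul_naive(uint64_t *C, const uint64_t *A,
                          const uint64_t *B, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        C[i] = 0;
        for (size_t j = 0; j < n; j++)
            if ((A[i] >> j) & 1)
                C[i] ^= B[j]; /* add row j of B when A[i][j] = 1 */
    }
}
```

With A = [[1,1],[0,1]] and B = [[1,0],[1,1]], the product over GF(2) is [[0,1],[1,1]] by hand; asserting against such a fixed value catches errors that match-each-other comparisons miss.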
[sage-devel] Re: slightly OT: new M4RI library
On Saturday 17 May 2008, Martin Albrecht wrote: I think a better idea would be to explicitly force all matrices and all rows to be 128-bit aligned if the matrices are wide enough to benefit from SSE2. Then the combine function can always use SSE2 and there will be no need to check for alignment. That doesn't seem to make a noticeable difference for me (on C2D). However, I realised that the multiplications are faster where the target matrix is a real matrix rather than a window (which has bad data locality). Copying everything over seems not like a good idea, but it at least indicates an area for improvement. Okay, if I only copy when we cross over to M4RM then the memory overhead is constant (~ cutoff^2) and the performance still improves.

Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

Matrix Dimension   Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000    2.920                    3.610
16,384 x 16,384    11.140                   12.120
20,000 x 20,000    20.370                   24.390
32,000 x 32,000    74.290                   94.910

New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

Matrix Dimension   Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000    2.920                    2.990
16,384 x 16,384    11.140                   11.750
20,000 x 20,000    20.370                   21.180
32,000 x 32,000    74.290                   86.570

On Opteron things don't look this way, but I think sage.math is pretty heavily used right now, such that my benchmarks there are not very telling. Martin
[sage-devel] Re: slightly OT: new M4RI library
That's looking good. Would you like me to run it on an unburdened Opteron to see how it goes? If you like, you can send me a tarball and I'll try it out. I think our best bet for a significant improvement now is the idea of using two Gray tables of half the size simultaneously. I also realised it possibly improves the cache performance for the A matrix too. I was casually wondering whether Magma might use a highly optimised Winograd algorithm instead of the naive algorithm. But over GF2 I think it probably actually takes longer, since it basically replaces n^2 full-length scalar multiplies by n^2 half-length ones and 2*n^2 half-row additions, plus a pile of other overhead. Bill. On 17 May, 20:32, Martin Albrecht [EMAIL PROTECTED] wrote: Okay, if I only copy when we cross over to M4RM then the memory overhead is constant (~ cutoff^2) and the performance still improves.
[sage-devel] Re: slightly OT: new M4RI library
Yet another idea. Suppose we do not combine entire rows in the Gray table, but only half rows. Once half a row is bigger than a single cache line (512 bits on the Opteron), we may as well work with half rows. This allows us to work with twice as many rows at once in the Gray tables (each of half the size). This means that we are dealing with twice as many bits from rows of A as usual and twice as many rows of B as usual, but we need to do it all again for the second half of the rows. So we get twice the work done in the same amount of cache space. Combined with the idea of using two Gray tables of 2^5 combinations of rows instead of a single table of 2^6 combinations, this would equate to dealing with 20 bits of each row of A at a time and 20 rows of B at a time. With this scheme there would then be 4 arithmetic operations in SSE registers, 5 loads and 1 store when combining rows from Gray tables, instead of about 6.6 loads, 3.3 stores and 3.3 arithmetic operations, changing the ratio of loads/stores to arithmetic ops from 2.7 to 1.5. This is another example where copying the data (the half rows) out and reordering it so it has better locality would probably make a big difference. That sort of thing always works exceptionally well on AMD chips. Bill. On 17 May, 21:05, Bill Hart [EMAIL PROTECTED] wrote: That's looking good. I think our best bet for a significant improvement now is the idea of using two Gray tables of half the size simultaneously.
[sage-devel] Re: slightly OT: new M4RI library
Martin, Here's a really unusual thing. Perhaps you can confirm this. I get a 20% improvement if I add: if (x) { } in the three obvious places in the function _mzd_mul_m4rm_impl. This stops it mzd_combining the zero row. But I don't understand why this works. The time should be only 1.5% better, since k = 6 and there are 2^k rows in the table, only one of which is zero. Could it be that your coinflip function is not quite random? Anyway, I'm down to 3.40s for 10000x10000 with this change. Test functions still pass. Bill. On 17 May, 22:05, Bill Hart [EMAIL PROTECTED] wrote: Yet another idea. Suppose we do not combine entire rows in the Gray table, but only half rows. Once half a row is bigger than a single cache line (512 bits on the Opteron), we may as well work with half rows.
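The guard Bill describes can be sketched like this: when the k bits read from A are all zero, they select row 0 of the Gray table, which is the all-zero row, so the combine can be skipped outright. The function name and table layout are ours, not the M4RI source.

```c
#include <stdint.h>
#include <stddef.h>

/* XOR table row x (stored at table + x*wide) into dst, skipping the
   combine entirely when x == 0: row 0 of the Gray table is the zero
   row, so XORing it in would be a no-op. Names are ours. */
static void add_table_row(uint64_t *dst, const uint64_t *table,
                          uint64_t x, size_t wide)
{
    if (x) { /* the guard: row 0 contributes nothing */
        const uint64_t *src = table + x * wide;
        for (size_t i = 0; i < wide; i++)
            dst[i] ^= src[i];
    }
}
```

With random input only 1 row in 2^k is skipped, which is why the observed 20% speedup is so surprising; padding zeros at the ends of rows would raise the hit rate of the guard.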
[sage-devel] Re: slightly OT: new M4RI library
I suppose that this might be due to the ends of rows all being zero, as they aren't a multiple of 64 bits long. But I checked for 16384x16384 and we are nearly down to the speed of Magma there too. I just don't get it. The coinflip has to be broken, I think. Bill. On 17 May, 22:40, Bill Hart [EMAIL PROTECTED] wrote: Martin, Here's a really unusual thing. Perhaps you can confirm this. I get a 20% improvement if I add: if (x) { } in the three obvious places in the function _mzd_mul_m4rm_impl.
[sage-devel] Re: slightly OT: new M4RI library
I checked the coinflip and it is definitely fine. There is no greater probability of 6 zeroes in a row than there ought to be. So the speedup I just reported is quite a mystery. Bill. On 17 May, 22:57, Bill Hart [EMAIL PROTECTED] wrote: I suppose that this might be due to the ends of rows all being zero, as they aren't a multiple of 64 bits long.
With this scheme, there would then be 4 arithmetic operations in SSE registers, 5 loads and 1 store when combining rows from Gray tables, instead of about 6.6 loads, 3.3 stores and 3.3 arithmetic operations, changing the ratio of loads/stores to arithmetic ops from 2.7 to 1.5. This is another example where copying the data (the half rows) out and reordering it so it has better locality would probably make a big difference. That sort of thing always works exceptionally well on AMD chips. Bill.

On 17 May, 21:05, Bill Hart [EMAIL PROTECTED] wrote: That's looking good. Would you like me to run it on an unburdened Opteron to see how it goes? If you like you can send me a tarball and I'll try it out. I think our best bet for a significant improvement now is the idea of using two Gray tables of half the size simultaneously. I also realised it possibly improves the cache performance for the A matrix too. I was casually wondering whether Magma might use a highly optimised Winograd's algorithm instead of the naive algorithm. But over GF2 I think it probably actually takes longer, since it basically replaces n^2 full length scalar multiplies by n^2 half length ones and 2*n^2 half row additions, plus a pile of other overhead. Bill.

On 17 May, 20:32, Martin Albrecht [EMAIL PROTECTED] wrote: On Saturday 17 May 2008, Martin Albrecht wrote: I think a better idea would be to explicitly force all matrices and all rows to be 128-bit aligned if the matrices are wide enough to benefit from SSE2. Then the combine function can always use SSE2 and there will be no need to check for alignment. That doesn't seem to make a noticeable difference for me (on C2D). However, I realised that the multiplications where the target matrix is a window rather than a real matrix have bad data locality. Copying everything over doesn't seem like a good idea, but it at least indicates an area for improvement.
Okay, if I only copy when we cross over to M4RM then the memory overhead is constant (~ cutoff^2) and the performance still improves.

Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

  Matrix Dimension    Magma 2.14-13 (64-bit)    M4RI-20080517 (64-bit)
  10,000 x 10,000             2.920                      3.610
  16,384 x 16,384            11.140                     12.120
  20,000 x 20,000            20.370                     24.390
  32,000 x 32,000            74.290                     94.910

New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

  Matrix Dimension    Magma 2.14-13 (64-bit)    M4RI-20080517 (64-bit)
  10,000 x 10,000             2.920                      2.990
  16,384 x 16,384            11.140                     11.750
  20,000 x 20,000            20.370                     21.180
  32,000 x 32,000            74.290                     86.570
[sage-devel] Re: slightly OT: new M4RI library
Woot!!

On 17 May, 23:46, Martin Albrecht [EMAIL PROTECTED] wrote:

Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

  Matrix Dimension    Magma 2.14-13 (64-bit)    M4RI-20080517 (64-bit)
  10,000 x 10,000             2.920                      3.610
  16,384 x 16,384            11.140                     12.120
  20,000 x 20,000            20.370                     24.390
  32,000 x 32,000            74.290                     94.910

New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

  Matrix Dimension    Magma 2.14-13 (64-bit)    M4RI-20080517 (64-bit)
  10,000 x 10,000             2.920                      2.990
  16,384 x 16,384            11.140                     11.750
  20,000 x 20,000            20.370                     21.180
  32,000 x 32,000            74.290                     86.570

If you take this + Bill's idea: For 10000x10000 we currently use k = 6. Instead of this, we could use k = 5 and make two Gray tables simultaneously. This will still fit in cache. Instead of doing 6 bits at a time, we can then do 10 bits at a time. We'd load the appropriate line from the first Gray table, then the appropriate one from the second and xor them, then xor with the output matrix. This should decrease the number of loads and stores considerably. Moreover, the SSE instructions will then be much more efficient as the ratio of arithmetic instructions to loads and stores is higher. Of course one could also do 16 bits at a time, by doing 4 tables, but I think this might actually get slower again, since you've only increased the amount of work done by 60% but had a 30% increase in instructions.
You get (on the C2D):

sage: B = random_matrix(GF(2), 3.2*10^4, 3.2*10^4)
sage: A = random_matrix(GF(2), 3.2*10^4, 3.2*10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 75.82 s, sys: 0.22 s, total: 76.04 s
Wall time: 76.31
sage: A = random_matrix(GF(2), 2*10^4, 2*10^4)
sage: B = random_matrix(GF(2), 2*10^4, 2*10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 19.14 s, sys: 0.09 s, total: 19.24 s
Wall time: 19.29
sage: B = random_matrix(GF(2), 2^14, 2^14)
sage: A = random_matrix(GF(2), 2^14, 2^14)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 10.62 s, sys: 0.05 s, total: 10.67 s
Wall time: 10.70
sage: B = random_matrix(GF(2), 10^4, 10^4)
sage: A = random_matrix(GF(2), 10^4, 10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 2.73 s, sys: 0.02 s, total: 2.75 s
Wall time: 2.76

i.e. the speed of my current Magma install on the same computer (mind you, this one might not be optimised for the C2D but for the Opteron, I don't know). The times above don't have SSE2 yet.

I guess documenting Bill's tricks well

 - process the rows of A in blocks
 - use two rather than one Gray code table

is in order, since M4RM now looks quite different from the original algorithm. I'll do that tomorrow. Again, thanks Bill!

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]

--~--~-~--~~~---~--~~
To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
-~--~~~~--~~--~--~---
[sage-devel] Re: slightly OT: new M4RI library
I suppose that this might be due to the ends of rows all being zero as they aren't a multiple of 64 bits long. But I checked for 16384x16384 and we are nearly down to the speed of Magma there too. I just don't get it. The coinflip has to be broken I think.

If one uses M4RI with the new patch from within Sage another PRBG is used, but coinflip should be fine. I don't see these speedups (but I have two Gray code tables, and this warrants more if's).

Hi, I think we might consider merging our two forks again? Or do you also have the two Gray code tables? Are your timings on the Opteron? Because then things look really good, since mine are on the C2D. Exciting times,

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]
[sage-devel] Re: slightly OT: new M4RI library
I don't have the two Gray code tables, so it would be good to get your version. Also my code is currently a mess, so it would be good to clean it up by merging with a cleaner version (yours). Tomorrow I'll check carefully what I've changed and try to merge the ideas, if there are any you don't have which definitely improve performance on the Opteron. The speedups I am seeing from the ifs are possibly a feature of the Opteron cache algorithms. It is very sensitive when things just begin to fall out of cache, as they certainly are here. Not combining with the zero row just nudges things closer in to the cache boundary, since it never has to read that row. I have checked and the speedups are quite reproducible, and they definitely come from the ifs, though I am now using a crossover with Strassen of 7200!! Bill.

On 18 May, 00:12, Martin Albrecht [EMAIL PROTECTED] wrote: I suppose that this might be due to the ends of rows all being zero as they aren't a multiple of 64 bits long. But I checked for 16384x16384 and we are nearly down to the speed of Magma there too. I just don't get it. The coinflip has to be broken I think. If one uses M4RI with the new patch from within Sage another PRBG is used, but coinflip should be fine. I don't see these speedups (but I have two Gray code tables, and this warrants more if's). Hi, I think we might consider merging our two forks again? Or do you also have the two Gray code tables? Are your timings on the Opteron? Because then things look really good, since mine are on the C2D. Exciting times, Martin
[sage-devel] Re: slightly OT: new M4RI library
P.S: yes, all my times are on a 2.8GHz Opteron. Cpuinfo says:

[EMAIL PROTECTED]:~/m4ri-20080514/testsuite$ cat /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 15
model       : 65
model name  : Dual-Core AMD Opteron(tm) Processor 2220
stepping    : 3
cpu MHz     : 1000.000
cache size  : 1024 KB
snip
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
snip

The 1000.000 there refers to the FSB. Bill.

On 18 May, 00:12, Martin Albrecht [EMAIL PROTECTED] wrote: [...]
[sage-devel] Re: slightly OT: new M4RI library
On Sunday 18 May 2008, Bill Hart wrote: I don't have the two Gray code tables, so it would be good to get your version. Also my code is currently a mess, so it would be good to clean it up by merging with a cleaner version (yours). Tomorrow I'll check carefully what I've changed and try to merge the ideas, if there are any you don't have which definitely improve performance on the Opteron. The speedups I am seeing from the ifs are possibly a feature of the Opteron cache algorithms. It is very sensitive when things just begin to fall out of cache, as they certainly are here. Not combining with the zero row just nudges things closer in to the cache boundary, since it never has to read that row. I have checked and the speedups are quite reproducible, and they definitely come from the ifs, though I am now using a crossover with Strassen of 7200!!

I'm using a crossover of 2048 here, so maybe our improvements are orthogonal? Even more puzzling, I'd expect that my crossover should be bigger than yours. (On a side note: my code changes how the crossover is used. Your version: 'size <= cutoff'; my version: '|cutoff - size| is minimal', which should give actual cutoffs closer to the desired values.)

My version is here: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg (this needs an updated patch for Sage) and here: http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz (which is the raw source). Those don't have SSE2 yet, but it doesn't seem to make that much of a difference anyway. I'll add that back before doing an official release. However, unfortunately I'll probably have limited/no time tomorrow to commit.
Martin

PS: To give at least some indication that my code still does the right thing, a 'known answer' test:

sage: A = random_matrix(GF(2), 10^3, 10^3)
sage: B = random_matrix(GF(2), 10^3, 10^3)
sage: (A*B)._magma_() == A._magma_() * B._magma_()
True
sage: (A._multiply_strassen(B,cutoff=256))._magma_() == A._magma_() * B._magma_()
True

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]
[sage-devel] Re: slightly OT: new M4RI library
Here are the times I get with the different cutoffs.

                 Magma     M4RI:7200   M4RI:2048
10000x10000:     2.940s     3.442s      4.132s
16384x16384:     9.250s    11.47s      11.80s
20000x20000:    16.57s     19.3s       26.05s
32000x32000:    59.05s     71.9s       71.8s

So it seems when there is not an exact cut, the higher cutoff is substantially better. Don't know why that is. Tomorrow I'll see if there is anything I have that speeds up your code. I'm hopeful we'll be within about 5% on the Opteron by then. The other ideas I outlined above should push us 10-15% ahead of Magma if we end up implementing them, I think. Of course one can go too crazy with optimisation. Bill.

On 18 May, 00:40, Martin Albrecht [EMAIL PROTECTED] wrote: On Sunday 18 May 2008, Bill Hart wrote: [...] I'm using a crossover of 2048 here, so maybe our improvements are orthogonal? Even more puzzling, I'd expect that my crossover should be bigger than yours. (On a side note: my code changes how the crossover is used. Your version: 'size <= cutoff'; my version: '|cutoff - size| is minimal', which should give actual cutoffs closer to the desired values.) My version is here: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg (this needs an updated patch for Sage) and here: http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz (which is the raw source). Those don't have SSE2 yet, but it doesn't seem to make that much of a difference anyway. I'll add that back before doing an official release. However, unfortunately I'll probably have limited/no time tomorrow to commit. Martin
[sage-devel] Re: slightly OT: new M4RI library
On May 17, 2008, at 8:38 PM, Bill Hart wrote: Of course one can go too crazy with optimisation. No surely that never happens around here. david
[sage-devel] Re: slightly OT: new M4RI library
On 18 May, 00:40, Martin Albrecht [EMAIL PROTECTED] wrote: My version is here: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg (this needs an updated patch for Sage) and here: http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz (which is the raw source). This pure C version seems to be the old version, before you made either of the two big changes. Bill.
[sage-devel] Re: slightly OT: new M4RI library
I managed to get the modified version from the spkg. Nice code!! Unfortunately it is not as fast on my Opteron. So more work tomorrow for me to try to get it down to the same times as I have with my version. Here are the times, all on my Opteron. Note your C2D version was optimal at a cutoff of 2048, not 7200 as for my code. Now I am worried that maybe my code is actually broken somehow and still passing the test code. I'll carefully make the changes to your code tomorrow to see if that is the case.

                 Magma     C2D-M4RI:2048   AMD-M4RI:7200   AMD-M4RI:2048
10000x10000:     2.940s     3.13s           3.442s          4.132s
16384x16384:     9.250s    12.96s          11.47s          11.80s
20000x20000:    16.57s     22.43s          19.3s           26.05s
32000x32000:    59.05s     90.20s          71.9s           71.8s

Bill.

On 18 May, 01:58, Bill Hart [EMAIL PROTECTED] wrote: On 18 May, 00:40, Martin Albrecht [EMAIL PROTECTED] wrote: My version is here: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg (this needs an updated patch for Sage) and here: http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz (which is the raw source). This pure C version seems to be the old version, before you made either of the two big changes. Bill.
[sage-devel] Re: slightly OT: new M4RI library
I made some changes to the original code so it would use the cache a bit better. The test code seems to pass, so I don't think I've screwed anything up. The time for 16384x16384 on my machine is now 20s, so a factor of 2.15x faster. The time for 2000x2000 also seems to be the same time as Magma now. Hopefully I didn't just mistime things before, and this is a real improvement. I am still trying things out. Bill.

On 16 May, 01:41, Bill Hart [EMAIL PROTECTED] wrote: I think it might just be possible to get down to the speed of Magma with a highly optimised classical multiplication routine. At 3600x3600 one clearly has to do 3600x3600 scalar products of a row by a column. We'll assume one of the matrices has been transposed to facilitate this. If we use SSE2 then we can operate 128 bits at a time. There are 16 SSE registers. The basic idea would be to load 4 SSE registers from different rows of matrix A and 2 from different columns of matrix B. We compute the scalar products of all 8 combinations of rows x columns simultaneously. For this we need 2 temporary registers and 8 registers to hold the running totals. So all in all we need 4+2+2+8 = 16 registers. We only need to do an AND and an XOR to do the multiplication and addition required. Caching becomes irrelevant if we choose a large selection of rows from A and a large selection of columns from B and do all the possible scalar products of rows by columns before moving to the next part. Assuming the Opteron can do the memory loads at the same time as doing arithmetic not depending on those loads, careful instruction scheduling should get everything down to the cost of an SSE AND and an SSE XOR per 128x128 bit pair in the classical algorithm. That puts the entire algorithm at pretty close to what Magma is doing timewise. Another option is to not use SSE and just use the 64-bit integer registers. The disadvantage is one has to load things 64 bits at a time.
But the advantage is the Opteron can do three arithmetic operations per cycle if properly scheduled. That gets 50% more work done than the SSE registers, which can only do 1 x 128-bit operation per cycle. Of course I'm making the assumption here that the Opteron can indeed do loads at the same time as arithmetic. If not, then there is no way Magma can be using classical multiplication out to 3600x3600, since there are simply too many operations to perform in the number of cycles available. Bill.

On 16 May, 00:03, Martin Albrecht [EMAIL PROTECTED] wrote: On Thursday 15 May 2008, Bill Hart wrote: Here is the graph of Magma times: http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png The crossover is not clear. The change from a smooth curve to a squiggly line is about 3600. So presumably that is it, but the graph also seems to change character at about 6200 or 7000 as well. One of these changes may be cache related. At 3600, the total data for all three matrices is almost 5MB and the cache on my machine is 1024KB. But if Magma is using classical multiplication, then this is pretty much irrelevant anyway, since you can keep the working data within the cache for quite a while during the algorithm.

On the other hand: a squiggly line is what one would expect for Strassen-Winograd, due to the extra rows/columns that have to be taken care of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel once gave a talk about his implementation and stated that they don't use classical block multiplication (I saw some slides with that remark).
Martin
[sage-devel] Re: slightly OT: new M4RI library
1x1 is down to 4.5s now, so nearly a 2x speedup. 2x2 is down to 32.0s, so again nearly a 2x speedup. Bill.

On 16 May, 13:53, Bill Hart [EMAIL PROTECTED] wrote: I made some changes to the original code so it would use the cache a bit better. The test code seems to pass, so I don't think I've screwed anything up. The time for 16384x16384 on my machine is now 20s, so a factor of 2.15x faster. [...]

On 16 May, 01:41, Bill Hart [EMAIL PROTECTED] wrote: [...]

On 16 May, 00:03, Martin Albrecht [EMAIL PROTECTED] wrote: [...]
[sage-devel] Re: slightly OT: new M4RI library
I tried cache blocking matrix B, but the times are exactly the same. I think the processor is happy to keep loading B one row at a time sequentially, and since it is only done a handful of times in the algorithm, it accounts for little of the runtime. It might go faster in combination with storing the Gray tables, but the optimal block size seems to be about 100 (*k) for the matrix B, which would necessitate storage of 100 Gray tables, which is an awful lot of memory. I think it is probably best to avoid this kind of memory usage. So I am not sure where the other factor of 2 will come from. Bill.

On 16 May, 14:37, Bill Hart [EMAIL PROTECTED] wrote: Here is the modified _mzd_mul_m4rm_impl function which gives this speedup: http://sage.math.washington.edu/home/wbhart/m4rmul.c I used a crossover of 3600, and I indicate at the top of this file how constants should be changed to get the speedups for the various values of n. I didn't put any code in to automatically choose the correct values. I am presuming that the test_multiplication code actually tests this function, otherwise maybe my code is just rubbish. The basic idea is to cache block the A matrix into groups of BLOCK rows. If one also cache blocked the B matrix and left the computed Gray tables in memory, instead of writing over them and recalculating them all the time as I currently do, one could probably get the other factor of 2 that we need. Note I've turned SSE off in this function, but left it on in _mzd_combine since it makes no real difference there. Bill.

On 16 May, 14:16, Bill Hart [EMAIL PROTECTED] wrote: [...]

On 16 May, 13:53, Bill Hart [EMAIL PROTECTED] wrote: [...]

On 16 May, 01:41, Bill Hart [EMAIL PROTECTED] wrote: [...]
[sage-devel] Re: slightly OT: new M4RI library
P.S: I also tried cache hints, but no luck. They just slow it down. Bill. On 16 May, 15:59, Bill Hart [EMAIL PROTECTED] wrote: I tried cache blocking matrix B, but the times are exactly the same. I think the processor is happy to keep loading B one row at a time sequentially, and since it is only done a handful of times in the algorithm, it accounts for little of the runtime. It might go faster in combination with storing the gray tables, but the optimal block size seems to be about 100 (*k) for the matrix B, which would necessitate storage of 100 gray tables, which is an awful lot of memory. Probably I think it is best to avoid this kind of memory usage. So I am not sure where the other factor of 2 will come from. Bill. On 16 May, 14:37, Bill Hart [EMAIL PROTECTED] wrote: Here is the modified _mzd_mul_m4rm_impl function which gives this speedup: http://sage.math.washington.edu/home/wbhart/m4rmul.c I used a crossover of 3600 and I indicate at the top of this file how constants should be changed to get the speedups for the various values of n. I didn't put any code in to automatically choose the correct values. I am presuming that the test_multiplication code actually tests this function, otherwise maybe my code is just rubbish. The basic idea is to cache block the A matrix into groups of BLOCK rows. If one also cache blocked the B matrix and left the computed gray tables in memory instead of writing over them and recalculating them all the time, as I currently do, one could probably get the other factor of 2 that we need. Note I've turned SSE off in this function, but left it on in _mzd_combine since it makes no real difference there. Bill. On 16 May, 14:16, Bill Hart [EMAIL PROTECTED] wrote: 1x1 is down to 4.5s now, so nearly a 2x speedup. 2x2 is down to 32.0s, so again nearly a 2x speedup. Bill. On 16 May, 13:53, Bill Hart [EMAIL PROTECTED] wrote: I made some changes to the original code so it would use the cache a bit better. 
The test code seems to pass, so I don't think I've screwed anything up. The time for 16384x16384 on my machine is now 20s, so a factor of 2.15x faster. The time for 2000x2000 also seems to be the same time as Magma now. Hopefully I didn't just mistime things before, and this is a real improvement. I am still trying things out. Bill. On 16 May, 01:41, Bill Hart [EMAIL PROTECTED] wrote: I think it might just be possible to get down to the speed of Magma with a highly optimised classical multiplication routine. At 3600x3600 one clearly has to do 3600x3600 scalar products of a row by a column. We'll assume one of the matrices has been transposed to facilitate this. If we use SSE2 then we can operate 128 bits at a time. There are 16 SSE registers. The basic idea would be to load 4 SSE registers from different rows of matrix A and 2 from different columns of matrix B. We compute the scalar products of all 8 combinations of rows x columns simultaneously. For this we need 2 temporary registers and 8 registers to hold the running totals. So all in all we need 4+2+2+8 = 16 registers. We only need to do an AND and an XOR to do the multiplication and addition required. Caching becomes irrelevant if we choose a large selection of rows from A and a large selection of columns from B and do all the possible scalar products of rows by columns before moving to the next part. Assuming the Opteron can do the memory loads at the same time as doing arithmetic not depending on those loads, careful instruction scheduling should get everything down to the cost of an SSE AND and an SSE XOR per 128x128 bit pair in the classical algorithm. That puts the entire algorithm at pretty close to what Magma is doing timewise. Another option is to not use SSE and just use the 64 bit integer registers. The disadvantage is one has to load things 64 bits at a time. But the advantage is the Opteron can do three arithmetic operations per cycle if properly scheduled. 
That gets 50% more work done than the SSE registers, which can only do 1 x 128 bit operation per cycle. Of course I'm making the assumption here that the Opteron can indeed do loads at the same time as arithmetic. If not, then there is no way Magma can be using classical multiplication out to 3600x3600 since there are simply too many operations to perform in the number of cycles available. Bill. On 16 May, 00:03, Martin Albrecht [EMAIL PROTECTED] wrote: On Thursday 15 May 2008, Bill Hart wrote: Here is the graph of Magma times: http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png The crossover is not clear. The change from a smooth curve to a squiggly line is about 3600. So presumably
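Bill's classical-multiplication sketch boils down to one primitive: the GF(2) scalar product of a bit-packed row of A with a bit-packed column of the transposed B. A minimal, hedged illustration of that primitive with plain 64-bit words rather than SSE registers (the function name is illustrative, not part of the M4RI API):

```c
#include <stdint.h>

/* GF(2) scalar product of two bit-packed vectors: AND each word pair,
   XOR-accumulate the results (XOR preserves the parity of the total
   popcount), then take the parity of the accumulated word. */
static int gf2_dot(const uint64_t *row, const uint64_t *col, int nwords) {
    uint64_t acc = 0;
    for (int i = 0; i < nwords; i++)
        acc ^= row[i] & col[i];            /* multiply-and-add over GF(2) */
    return __builtin_popcountll(acc) & 1;  /* parity of the 1-bits */
}
```

The SSE2 scheme described above keeps eight such accumulators live at once across 128-bit registers; the arithmetic per word pair is still just one AND and one XOR.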
[sage-devel] Re: slightly OT: new M4RI library
Here are the times I get for Magma vs M4RI now. Note the crossover between the two programs is now above about 5000 and M4RI beats Magma below that point. This suggests the remaining factor of 2 is in the Strassen-Winograd function. Probably Strassen-Winograd is falling out of L2 cache (the previous adjustments I made were to prevent the M4R algorithm falling out of L1 cache). The other possibility is that Magma combines the two algorithms so that there is even greater usage of the Gray code tables. This would be an ugly hack, but could work.

4x4: Magma: 112.6s M4RI: 232.4s
2x2: Magma: 16.40s M4RI: 32.34s
1x1: Magma: 2.750s M4RI: 4.529s
5000x5000: Magma: 0.700s M4RI: 0.672s
2500x2500: Magma: 0.13s M4RI: 0.079s
1250x1250: Magma: 0.015s M4RI: 0.012s
625x625: Magma: 0.0030s M4RI: 0.0023s
312x312: Magma: 0.0014s M4RI: 0.00032s
156x156: Magma: 0.001s M4RI: 0.0001s

If I get some time I'll look into this. Did those changes work for you Martin? Bill. On 16 May, 16:24, Martin Albrecht [EMAIL PROTECTED] wrote: On Friday 16 May 2008, Bill Hart wrote: P.S: I also tried cache hints, but no luck. They just slow it down. Bill. Bill, thanks so much for looking into this! It is very much appreciated. I'm going to read/try your patch right away! Martin PS: I have a slight delay in my replies because my provider's spam filter consistently flags you and mabshoff as spam for some reason. -- name: Martin Albrecht _pgp:http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www:http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED] --~--~-~--~~~---~--~~ To post to this group, send email to sage-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/sage-devel URLs: http://www.sagemath.org -~--~~~~--~~--~--~---
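The "Gray code tables" the thread keeps returning to are the heart of the M4RM multiplication trick: precompute all 2^k linear combinations of k rows of B, then each row of A costs n/k table lookups instead of n row-XORs. A toy, hedged sketch of the idea for a single-word 64x64 case (this illustrates the algorithm only; `m4rm_mul` and `K` are made-up names, not M4RI's `_mzd_mul_m4rm_impl`):

```c
#include <stdint.h>
#include <string.h>

#define N 64   /* toy dimension: one 64-bit word per row */
#define K 4    /* bits of A consumed per table lookup    */

/* C = A*B over GF(2), each row packed into one 64-bit word.
   For every K-bit chunk of A's rows, build the table of all 2^K
   XOR-combinations of the corresponding K rows of B; each entry
   costs one XOR (the real code walks a Gray code to the same end). */
static void m4rm_mul(uint64_t c[N], const uint64_t a[N], const uint64_t b[N]) {
    uint64_t table[1u << K];
    memset(c, 0, N * sizeof(uint64_t));
    for (int chunk = 0; chunk < N; chunk += K) {
        table[0] = 0;
        for (unsigned i = 1; i < (1u << K); i++) {
            unsigned low = i & (i - 1);   /* i with its lowest set bit cleared */
            unsigned bit = i ^ low;       /* that lowest set bit               */
            int j = 0;
            while (!((bit >> j) & 1)) j++;
            table[i] = table[low] ^ b[chunk + j];   /* one XOR per entry */
        }
        for (int r = 0; r < N; r++) {
            unsigned idx = (unsigned)(a[r] >> chunk) & ((1u << K) - 1);
            c[r] ^= table[idx];           /* K columns of A handled at once */
        }
    }
}
```

With k = 4 as here, each row of A costs 16 lookups instead of 64 row-XORs; the cache-blocking discussed above is about keeping these tables resident in L1.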
[sage-devel] Re: slightly OT: new M4RI library
On Friday 16 May 2008, Bill Hart wrote: Here are the times I get for Magma vs M4RI now. Note the crossover between the two programs is now above about 5000 and M4RI beats Magma below that point. This suggests the remaining factor of 2 is in the Strassen-Winograd function. Probably Winograd-Strassen is falling out of L2 cache (the previous adjustments I made were to prevent the M4R algorithm falling out of L1 cache). The other possibility is that Magma combines the two algorithms so that there is even greater usage of the Gray code tables. This would be an ugly hack, but could work. Are you suggesting Magma uses M4RM too? I'd doubt that, since they don't state that anyway. Probably I'm just misunderstanding you. 4x4: Magma: 112.6s M4RI: 232.4s 2x2: Magma: 16.40s M4RI: 32.34s 1x1: Magma: 2.750s M4RI: 4.529s 5000x5000: Magma: 0.700s M4RI: 0.672s 2500x2500: Magma: 0.13s M4RI: 0.079s 1250x1250: Magma: 0.015s M4RI: 0.012s 625x625: Magma: 0.0030s M4RI: 0.0023s 312x312: Magma: 0.0014s M4RI: 0.00032s 156x156: Magma: 0.001s M4RI: 0.0001s If I get some time I'll look into this. Did those changes work for you Martin? Bill. Yes, the change worked like a charm. I made some changes (the fixed k is replaced by a k that depends on the new block dimensions etc.) and it is much much faster now. I'm working on re-introducing SSE2 now to see if it at least on the Core2Duo makes the world a better place. Btw. all the stuff I wrote about the L2 cache size C2D vs. Opteron was bollocks. The reason I beat Magma so badly earlier was that I used a 32-bit Magma and compared it with a 64-bit version of M4RI. To state it clearly: We don't beat Magma anywhere. 
Anyhow, here are the times:

64-bit Debian/GNU Linux, 2.33 GHz Core2Duo

Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000      2.920                    4.130
16,384 x 16,384      11.140                   15.740
20,000 x 20,000      20.370                   28.950

64-bit Debian/GNU Linux, 1.8 GHz Opteron (sage.math)

Matrix Dimension     Magma 2.13-5 (64-bit)    M4RI-20080517 (64-bit)
10,000 x 10,000      3.930                    7.860
16,384 x 16,384      16.230                   104.77???
20,000 x 20,000      27.080                   56.420

(My university today finally granted me access to Magma 2.14 for my notebook.) As you can see there is an odd (reproducible) spike at 2^14 x 2^14. I cannot explain that for now, it might just be a bug. I have a similar spike in another run on another AMD (Athlon X2) machine:

n:  2048, cutoff: 1024, speedup: 1.06, m4rm:  0.01 strassen: 0.01
n:  3072, cutoff: 1536, speedup: 0.87, m4rm:  0.04 strassen: 0.05
n:  4096, cutoff: 2048, speedup: 1.13, m4rm:  0.15 strassen: 0.13
n:  5120, cutoff: 2560, speedup: 1.35, m4rm:  0.39 strassen: 0.29
n:  6144, cutoff: 3072, speedup: 1.20, m4rm:  0.66 strassen: 0.55
n:  7168, cutoff: 3584, speedup: 1.87, m4rm:  1.64 strassen: 0.88
n:  8192, cutoff: 4096, speedup: 4.48, m4rm:  6.07 strassen: 1.35
n:  9216, cutoff: 4608, speedup: 2.94, m4rm:  8.90 strassen: 3.02
n: 10240, cutoff: 5120, speedup: 4.68, m4rm: 12.99 strassen: 2.78
n: 11264, cutoff: 5632, speedup: 4.84, m4rm: 18.78 strassen: 3.88
n: 12288, cutoff: 6144, speedup: 4.65, m4rm: 24.40 strassen: 5.24
n: 13312, cutoff: 6656, speedup: 3.32, m4rm: 30.92 strassen: 9.33

I'm investigating this one too and playing around with the parameters. Since the C2D times are considerably better I also bet it is L2 related, we'll see. Thanks again! 
Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
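Martin's change of replacing the fixed k by one that depends on the block dimensions can be sketched roughly as follows. This is only a guess at the shape of such a heuristic, not the actual M4RI code; the 64 KB L1 size and 8-byte table entry are assumptions:

```c
/* Integer log2: position of the highest set bit. */
static int ilog2(long n) {
    int k = 0;
    while (n >>= 1) k++;
    return k;
}

/* Heuristic table width k for M4RM: the n^3/log n bound is attained
   around k = log2(n), but the 2^k-entry Gray table must stay inside
   L1 cache, so shrink k until it fits (8-byte entries assumed). */
static int choose_k(long n, long l1_bytes) {
    int k = ilog2(n);
    while (k > 1 && (8L << k) > l1_bytes) k--;
    return k > 0 ? k : 1;
}
```

For n = 16384 and a 64 KB L1 this caps k at 13 rather than the "ideal" 14, which is the kind of dimension-dependent choice the post describes.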
[sage-devel] Re: slightly OT: new M4RI library
Hi Martin, That spike is weird. Basically I got closer to 2x the time of Magma for 16384x16384, but you need different parameters than for 1x1 or 2x2 since the size of the M4R matrices will be different than in either of those cases. For 16384x16384 my notes say that I used k = 6 and BLOCK = 256. One might also need to fiddle with the cutoff. I think I used 3600, but at various times I had the cutoff set lower. It would be awfully surprising if there wasn't a set of parameters that dropped this time right down. It is possible that the L1 cache is smaller on sage.math than on the Opteron I was using. Perhaps my parameters don't apply on that machine. Bill. On 16 May, 23:05, Martin Albrecht [EMAIL PROTECTED] wrote: [...] As you can see there is an odd (reproducible) spike at 2^14 x 2^14. I cannot explain that for now, it might just be a bug. [...]
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, William Stein wrote: Btw. I don't have access to Magma 2.14 which I believe to be the fastest in linear algebra over GF(2). In version 2.14 they added SSE2 support too. So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear about the results. I don't know what else to compare with except Magma. Here are Magma V2.14-9 timings on sage.math:

[EMAIL PROTECTED]:~/magma64/bin$ ./magma
Magma V2.14-9 Wed May 14 2008 17:33:23 on sage [Seed = 793568142]
Type ? for help. Type Ctrl-D to quit.
A := RandomMatrix(GF(2),10^4,10^4);
B := RandomMatrix(GF(2),10^4,10^4);
time C := A*B;
Time: 4.020
time C := A*B;
Time: 3.980
time C := A*B;
Time: 3.990
A := RandomMatrix(GF(2),2^14,2^14);
B := RandomMatrix(GF(2),2^14,2^14);
time C := A*B;
Time: 15.450
time C := A*B;
Time: 15.420

This doesn't seem much different from 2.13, which is good for my purposes. Once I've fixed the OSX issue mentioned below, could you run the benchmark on the C2D? I want to see if they can make better use of the 4MB L2 cache. I tried your new code up at #3204 under OS X and get this:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x7465a00: non-page-aligned, non-allocated pointer being freed
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
...
CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
Wall time: 16.31

Maybe you're doing something wrong? Likely, though my Valgrind log is clean under Linux. 
I'll try under OSX. Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
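The OSX "non-aligned pointer being freed" errors above are the classic symptom of releasing alignment-adjusted memory through the wrong call. Purely as a hedged illustration of the usual fix, not a claim about what the actual M4RI bug was (the function name is made up):

```c
#include <stdlib.h>

/* SSE2 loads/stores want 16-byte-aligned rows.  posix_memalign
   guarantees the alignment and, unlike hand-offset malloc tricks,
   its result may be released with an ordinary free(). */
static void *alloc_rows_aligned(size_t bytes) {
    void *p = NULL;
    return posix_memalign(&p, 16, bytes) == 0 ? p : NULL;
}
```

Darwin's allocator is stricter than glibc's about freeing adjusted pointers, which would explain a bug that only surfaces on OS X while the Linux Valgrind log stays clean.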
[sage-devel] Re: slightly OT: new M4RI library
Maybe you're doing something wrong? This bug is fixed in: http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080515.p0.spkg Cheers, Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote: Hi Martin, The cycle counter in your bench code seems to give random values on the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has Magma 2-14.10 on it. Anyhow, here are the Magma times:

A := RandomMatrix(GF(2),10^4,10^4);
B := RandomMatrix(GF(2),10^4,10^4);
t := Cputime(); C := A*B; Cputime(t);
2.940

A := RandomMatrix(GF(2),2*10^4,2*10^4);
B := RandomMatrix(GF(2),2*10^4,2*10^4);
t := Cputime(); C := A*B; Cputime(t);
16.570

A := RandomMatrix(GF(2),2^14,2^14);
B := RandomMatrix(GF(2),2^14,2^14);
t := Cputime(); C := A*B; Cputime(t);
9.250

The corresponding times for your code are approximately (hand timed; adjusted for generation of random matrices):

10^4: 8s
2*10^4: 59s
2^14: 43s

I think it is polynomials over GF2 which Magma 2-14.x is much faster at rather than linear algebra. Bill. Hi Bill, I cannot explain the random behaviour of the cpucycles counter. The cpucycles library is by Dan Bernstein so I'd suspect it is carefully coded. Some potential sources of variance are: dynticks (i.e. the kernel doesn't have a fixed tick rate, this is done to save power), CPU frequency scaling and other applications. On the other hand, I suspect I don't need to tell you this, since you have probably benchmarked more code than I have. To account for the variance I have the small run_bench.py script in the testsuite directory. Run it like this: ./run_bench.py -t mul -n 1 -c 3200 It'll run the experiment 17 times (17 for no particular reason) and return the median runtime. -n is the size and -c the cutoff. 
Cheers, Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
[sage-devel] Re: slightly OT: new M4RI library
On Thu, May 15, 2008 at 10:42 AM, Martin Albrecht [EMAIL PROTECTED] wrote: [...] To account for the variance I have the small run_bench.py script in the testsuite directory. Run it like this: ./run_bench.py -t mul -n 1 -c 3200 It'll run the experiment 17 times (17 for no particular reason) and return the median runtime. -n is the size and -c the cutoff. Median? Shouldn't you take the minimum? Are there any good papers on benchmarking? By the way -- present for you: http://m4ri.sagemath.org/ To update that page you have to ssh to sagemath.org. 
William
[sage-devel] Re: slightly OT: new M4RI library
It'll run the experiment 17 times (17 for no particular reason) and return the median runtime. -n is the size and -c the cutoff. Median? Shouldn't you take the minimum? Are there any good papers on benchmarking? I'm using cpucycles from: http://www.ecrypt.eu.org/ebats/cpucycles.html which advises: Notes on accuracy Benchmarking tools are encouraged to record several timings of a function: call cpucycles(), function(), cpucycles(), function(), etc., and then print one line reporting the differences between successive cpucycles() results. The median of several differences is much more stable than the average. Cycle counts continue to increase while other programs are running, while the operating system is handling an interruption such as a network packet, etc. This won't affect the median of several timings of a fast function---the function usually won't be interrupted---but it can affect the median of several timings of a slow function. Hopefully a benchmarking machine isn't running other programs. On dual-CPU systems (and dual-core systems such as the Athlon 64 X2), the CPUs often don't have synchronized cycle counters, so a process that switches CPUs can have its cycle counts jump forwards or backwards. I've never seen this affect the median of several timings. Some CPUs dynamically reduce CPU speed to save power, but deliberately keep their cycle counters running at full speed, the idea being that measuring time is more important than measuring cycles. Hopefully a benchmarking machine won't enter power-saving mode. Cycle counts are occasionally off by a multiple of 2^32 on some CPUs, as discussed below. I've never seen this affect the median of several timings. The estimate returned by cpucycles_persecond() may improve accuracy after cpucycles() has been called repeatedly. I just followed that. I guess if one takes the minimum or not depends on the data. If there are (much) easier cases then it is probably not best to take the minimum. 
However, I guess for matrix multiplication the times don't really depend on the entries of the matrices so I could use the minimum or just output all (minimum, average, median). By the way -- present for you: http://m4ri.sagemath.org/ To update that page you have to ssh to sagemath.org. Cool, thanks! Any chance that I could get write access to the home directory so that I can upload my SSH public key? Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
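The median-of-several-runs policy quoted from the cpucycles notes is easy to state in code. A small, hedged harness sketch (function names are illustrative; clock() stands in for the cpucycles() counter, with run_bench.py's 17 repeats as the intended argument):

```c
#include <stdlib.h>
#include <time.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples: sort and take the middle one.  Robust against
   a single run being interrupted by the OS, which would drag a mean
   (or poison a lone measurement) upward. */
static double median_of(double *t, int n) {
    qsort(t, n, sizeof t[0], cmp_double);
    return t[n / 2];
}

/* Time fn() an odd number of times and report the median delta. */
static double median_seconds(void (*fn)(void), int repeats, double *scratch) {
    for (int i = 0; i < repeats; i++) {
        clock_t start = clock();
        fn();
        scratch[i] = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return median_of(scratch, repeats);
}
```

Taking the minimum instead, as William suggests, amounts to assuming all noise is additive; for input-independent workloads like this multiplication, minimum and median should largely agree.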
[sage-devel] Re: slightly OT: new M4RI library
Hi Martin, Here is a run that illustrates the problem. Am I doing something wrong?

[EMAIL PROTECTED]:~/m4ri-20080514/testsuite time ./bench_multiplication 5000 2048
n:  5000, cutoff:  2048, cpu cycles: 2670143764

real    0m2.768s
user    0m2.760s
sys     0m0.008s

[EMAIL PROTECTED]:~/m4ri-20080514/testsuite time ./bench_multiplication 5000 3000
n:  5000, cutoff:  3000, cpu cycles: 746344039

real    0m3.277s
user    0m3.252s
sys     0m0.024s

Bill.
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote: Hi Martin, Here is a run that illustrates the problem. Am I doing something wrong? No, I was stupid. The cpucycles are printed as %u but they should be printed as %llu since they are longer than an int. I've attached the fixed C file (since it is 1KB). -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]

#include <stdio.h>
#include <stdlib.h>
#include "cpucycles.h"
#include "m4ri.h"

int main(int argc, char **argv) {
  int n, cutoff;
  unsigned long long t;

  m4ri_build_all_codes();

  if (argc != 3) {
    m4ri_die("Parameters n and cutoff expected.\n");
  }
  n = atoi(argv[1]);
  cutoff = atoi(argv[2]);
  if (n <= 0) {
    m4ri_die("Parameter n must be > 0\n");
  }
  if (cutoff <= 0) {
    m4ri_die("Parameter cutoff must be > 0\n");
  }

  packedmatrix *A = mzd_init(n, n);
  packedmatrix *B = mzd_init(n, n);
  mzd_randomize(A);
  mzd_randomize(B);

  t = cpucycles();
  packedmatrix *C = mzd_mul_strassen(NULL, A, B, cutoff);
  printf("n: %5d, cutoff: %5d, cpu cycles: %llu\n", n, cutoff, cpucycles() - t);

  mzd_free(A);
  mzd_free(B);
  mzd_free(C);
  m4ri_destroy_all_codes();
}
[sage-devel] Re: slightly OT: new M4RI library
Martin, Do you think Magma uses naive multiplication for its base case? This can be ridiculously fast, especially over GF2. I note for example that Magma's base case is about 6 times faster than M4RI at 1000x1000. Is it possible that the naive multiplication can just be optimised with a far better constant and that the crossover with m4r is above the crossover with Strassen-Winograd? I also note that the M4RI SSE2 code doesn't seem faster than the generic C code on the Opteron, in fact it seems the other way around. Very strange. Bill.
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote: Do you think Magma uses naive multiplication for its base case? [...] I also note that the M4RI SSE2 code doesn't seem faster than the generic C code on the Opteron, in fact it seems the other way around. Very strange. I observed the same behaviour but have no explanation besides stuff like this: http://www.theinquirer.net/en/inquirer/news/2003/05/02/sse2-makes-opterons-slower-than-athlon-xps How do you know what the Magma base case is, can you tell Magma somehow to print its crossover? I'll look into plugging in naive multiplication as the base case (after optimisation) to see if your idea is right, though I doubt that the crossover to M4RM is beyond the crossover to Strassen-Winograd. Martin -- name: Martin Albrecht _pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99 _www: http://www.informatik.uni-bremen.de/~malb _jab: [EMAIL PROTECTED]
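For reference, the "naive multiplication" being discussed as a base-case candidate can still be word-packed, so that each set bit in a row of A costs one whole-row XOR of B rather than a bit operation. A hedged sketch with illustrative names (not M4RI's actual base case):

```c
#include <stdint.h>
#include <string.h>

/* Naive GF(2) multiply, rows bit-packed into 64-bit words:
   for each set bit j in row r of A, XOR row j of B into row r of C.
   No tables, no recursion: a plausible Strassen base case. */
static void naive_mul(uint64_t *c, const uint64_t *a, const uint64_t *b,
                      int n, int words_per_row) {
    memset(c, 0, (size_t)n * words_per_row * sizeof(uint64_t));
    for (int r = 0; r < n; r++)
        for (int j = 0; j < n; j++)
            if ((a[r * words_per_row + j / 64] >> (j % 64)) & 1)
                for (int w = 0; w < words_per_row; w++)
                    c[r * words_per_row + w] ^= b[j * words_per_row + w];
}
```

Compared with M4RM this does n row-XORs per row instead of n/k table lookups, but its inner loop is branch-light and streams B sequentially, which is why a well-tuned constant could plausibly win at small sizes.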
[sage-devel] Re: slightly OT: new M4RI library
Oh, actually I have no idea where Magma's crossover is. I'll wager it is somewhere between 4000x4000 and 6000x6000, but let's not speculate. I'll try and work it out with some timings. Bill.
[sage-devel] Re: slightly OT: new M4RI library
Here is the graph of Magma times: http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png The crossover is not clear. The change from a smooth curve to a squiggly line is about 3600. So presumably that is it, but the graph also seems to change character at about 6200 or 7000 as well. One of these changes may be cache related. At 3600, the total data for all three matrices is almost 5mb and the cache on my machine is 1024kb. But if Magma is using classical multiplication, then this is pretty much irrelevant anyway, since you can keep the working data within the cache for quite a while during the algorithm. Bill. On 15 May, 22:41, Bill Hart [EMAIL PROTECTED] wrote: Oh, actually I have no idea where Magma's crossover is. I'll wager it is somewhere between 4000x4000 and 6000x6000, but let's not speculate. I'll try and work it out with some timings. Bill. On 15 May, 22:23, Martin Albrecht [EMAIL PROTECTED] wrote: On Thursday 15 May 2008, Bill Hart wrote: Martin, Do you think Magma uses naive multiplication for its base case? This can be ridicoulously fast, especially over GF2. I note for example that Magma's base case is about 6 times faster than M4RI at 1000x1000. Is it possible that the naive multiplication can just be optimised with a far better constant and that the crossover with m4r is above the crossover with Strassen-Winograd? I also note that the M4RI SSE2 code doesn't seem faster than the generic C code on the Opteron, in fact it seems the other way around. Very strange. I observed the same behaviour but have no explanation besides stuff like this: http://www.theinquirer.net/en/inquirer/news/2003/05/02/sse2-makes-opt... How do you know what the Magma base case is, can you tell Magma somehow to print its crossover? I'll look into plugging in naive multiplication as the base case (after optimisation) to see if your idea is right, though I doubt that the crossover to M4RM is beyond the crossover to Strassen-Winograd. 
Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: [EMAIL PROTECTED]

To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
[sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote: [quoted message snipped]

On the other hand: a squiggly line is what one would expect for Strassen-Winograd, due to the extra rows/columns that have to be taken care of. In any case, 3600 seems rather late for a 1 MB L2 cache. I think Allan Steel once gave a talk about his implementation and stated that they don't use classical block multiplication (I saw some slides with that remark).

Martin
[sage-devel] Re: slightly OT: new M4RI library
I think it might just be possible to get down to the speed of Magma with a highly optimised classical multiplication routine. At 3600x3600 one clearly has to compute 3600x3600 scalar products of a row by a column. We'll assume one of the matrices has been transposed to facilitate this.

If we use SSE2 then we can operate on 128 bits at a time, and there are 16 SSE registers. The basic idea would be to load 4 SSE registers from different rows of matrix A and 2 from different columns of matrix B, and compute the scalar products of all 8 row-by-column combinations simultaneously. For this we need 2 temporary registers and 8 registers to hold the running totals, so all in all we need 4 + 2 + 2 + 8 = 16 registers. We only need an AND and an XOR to do the multiplication and addition required. Caching becomes irrelevant if we choose a large selection of rows from A and a large selection of columns from B and do all the possible scalar products of rows by columns before moving to the next part.

Assuming the Opteron can do the memory loads at the same time as arithmetic not depending on those loads, careful instruction scheduling should get everything down to the cost of an SSE AND and an SSE XOR per 128x128-bit pair in the classical algorithm. That puts the entire algorithm pretty close to what Magma is doing timewise.

Another option is not to use SSE and just use the 64-bit integer registers. The disadvantage is that one has to load things 64 bits at a time, but the advantage is that the Opteron can do three arithmetic operations per cycle if properly scheduled. That gets 50% more work done than the SSE registers, which can only do one 128-bit operation per cycle.

Of course I'm making the assumption here that the Opteron can indeed do loads at the same time as arithmetic. If not, then there is no way Magma can be using classical multiplication out to 3600x3600, since there are simply too many operations to perform in the number of cycles available.

Bill.
On 16 May, 00:03, Martin Albrecht [EMAIL PROTECTED] wrote: [quoted message snipped]
[sage-devel] Re: slightly OT: new M4RI library
Btw. I don't have access to Magma 2.14, which I believe to be the fastest in linear algebra over GF(2). In version 2.14 they added SSE2 support too. So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear about the results. I don't know what else to compare with except Magma.

Here are Magma V2.14-9 timings on sage.math:

[EMAIL PROTECTED]:~/magma64/bin$ ./magma
Magma V2.14-9    Wed May 14 2008 17:33:23 on sage    [Seed = 793568142]
Type ? for help.  Type Ctrl-D to quit.
A := RandomMatrix(GF(2),10^4,10^4);
B := RandomMatrix(GF(2),10^4,10^4);
time C := A*B;
Time: 4.020
time C := A*B;
Time: 3.980
time C := A*B;
Time: 3.990
A := RandomMatrix(GF(2),2^14,2^14);
B := RandomMatrix(GF(2),2^14,2^14);
time C := A*B;
Time: 15.450
time C := A*B;
Time: 15.420

Any suggestions? Thoughts? Rants?

Allan Steel is a really amazing programmer.

I tried your new code up at #3204 under OS X and get this:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x7465a00: non-page-aligned, non-allocated pointer being freed
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
...
CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
Wall time: 16.31

Maybe you're doing something wrong?
-- William