I had a look at gf2x. The reason they can make use of SSE is because
of the 128-bit shl and shr capabilities. Shifting 128 bits at a time is
more efficient, since a single SSE operation replaces quite a few
instructions using general purpose registers.

Bill.

On 19 May, 13:52, Bill Hart <[EMAIL PROTECTED]> wrote:
> Martin,
>
> does your machine speed up at all with any of the SSE code? I spent
> considerable time yesterday trying to optimise the combine3 function I
> wrote (note: I didn't submit this improved code). Whilst I did speed
> it up considerably by removing %'s and /'s and changing the function
> prototype to accept rows directly and lots of other minor
> improvements, the overall result was still much slower than the
> generic C code.
>
> If the SSE code doesn't ever speed it up (and it might not be faster
> to use SSE when there is at least one load/store per arithmetic
> operation), then it might be worth throwing all the SSE code out
> completely, since it just pollutes what is otherwise very nice looking
> code.
>
> I did check by compiling the code to assembler without assembling it
> (the -S option in gcc), that it is actually using SSE instructions and
> that it is doing this in what I would consider an optimal way. So it
> really is SSE itself that is slow, not just inefficiencies introduced
> by the compiler.
>
> One possible avenue for exploration is the gf2x library by Zimmermann,
> Brent, et al. They use SSE in their code for polynomials over GF(2),
> apparently to good effect. We could look to see if there is some trick
> to using it efficiently. The difference there, however, is probably that
> the polynomials get quite long and multiple arithmetic operations need
> to be done for each store/load. I don't know if any of their tricks
> can help us for that reason.
>
> Bill.
>
> On 19 May, 02:00, Martin Albrecht <[EMAIL PROTECTED]>
> wrote:
>
> > On Monday 19 May 2008, Bill Hart wrote:
>
> > > Well, apparently there are speed gains right up to 8 Gray tables,
> > > though I really have no idea why that is.
>
> > > Here are the new times:
>
> > >                Magma    Old      New
> > > 10000x10000:   2.940s   3.13s    2.32s
> > > 16384x16384:   9.250s   12.96s   9.17s
> > > 20000x20000:   16.57s   22.43s   16.49s
> > > 32000x32000:   59.1s    90.2s 81.6s 65.7s
>
> > > So now we beat Magma all the way up to 20000x20000.
>
> > > I'll put the adjusted code in the directory:
>
> > > http://sage.math.washington.edu/home/wbhart/m4ri3gray/
>
> > > in just a moment. I also found a memory leak in my code which I fixed.
>
> > > Bill.
>
> > Cool, I'll take a look tomorrow. It seems we're closing in on the classical
> > algorithm, though. Also, you don't seem to decrease k anymore.
>
> > As for the copying and 2^14 vs. 10^4: Maybe the extra rows/columns cost too
> > many cycles. I'll try valgrind/callgrind/cachegrind on the code to see if
> > that is true. Also maybe for 2^14 x 2^14 the submatrices are nicely aligned
> > on cache lines. I'm wildly speculating here.
>
> > Martin
>
> > --
> > name: Martin Albrecht
> > _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > _www: http://www.informatik.uni-bremen.de/~malb
> > _jab: [EMAIL PROTECTED]
--~--~---------~--~----~------------~-------~--~----~
To post to this group, send email to sage-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sage-devel
URLs: http://www.sagemath.org
-~----------~----~----~----~------~----~------~--~---
