Martin,

The test code still passes if you change RADIX to 128. I've no idea
how it passes, but it does. Shame the results are not correct, because
this speeds the code up by a factor of 2.

I notice that in the SSE code, you check whether alignment can be
achieved and fall back to the non-SSE code otherwise. But this
introduces an unpredictable branch. Also, where there are three
operands, you can't use SSE2 because the likelihood of all three being
aligned is too small.

I think a better idea would be to explicitly force all matrices and
all rows to be 128-bit aligned if the matrices are wide enough to
benefit from SSE2. Then the combine function can always use SSE2 and
there will be no need to check for alignment.
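
Something like the following sketch is what I have in mind. The names
padded_width, alloc_rows and combine_aligned are made up for
illustration and are not in the current code. It assumes 64-bit words
and C11 aligned_alloc, and it falls back to a plain loop when the
compiler does not define __SSE2__.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

typedef uint64_t word; /* assumption: one word is 64 bits */

/* Hypothetical helper: words per row, rounded up to an even count,
 * so that every row occupies a multiple of 16 bytes. */
static size_t padded_width(size_t wide)
{
    return (wide + 1) & ~(size_t)1;
}

/* Hypothetical allocator: one zeroed block holding nrows rows.
 * The base is 16-byte aligned and the stride is a multiple of
 * 16 bytes, so every row start is 16-byte aligned too. */
static word *alloc_rows(size_t nrows, size_t wide)
{
    size_t bytes = nrows * padded_width(wide) * sizeof(word);
    if (bytes == 0)
        bytes = 16; /* keep the size a non-zero multiple of 16 */
    word *p = aligned_alloc(16, bytes);
    if (p != NULL)
        memset(p, 0, bytes);
    return p;
}

/* dst = a ^ b over one padded row. There is no runtime alignment
 * branch; the assert only documents the precondition. */
static void combine_aligned(word *dst, const word *a, const word *b,
                            size_t wide)
{
    size_t n = padded_width(wide);
    assert((((uintptr_t)dst | (uintptr_t)a | (uintptr_t)b) % 16) == 0);
#ifdef __SSE2__
    for (size_t i = 0; i < n; i += 2) {
        __m128i x = _mm_load_si128((const __m128i *)(a + i));
        __m128i y = _mm_load_si128((const __m128i *)(b + i));
        _mm_store_si128((__m128i *)(dst + i), _mm_xor_si128(x, y));
    }
#else
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] ^ b[i];
#endif
}
```

The cost is at most one extra word of padding per row, and the padding
words get XORed along with the real data, so they need to stay zero
(or be ignored) everywhere else.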

I experimented with interleaving MMX and GPR XORs, but this doesn't
speed anything up. There are more instructions emitted and the time
stays about the same. The only way interleaving the MMX and GPR code
would speed things up is if there were more computation going on in
the registers and less memory loading and storing, I think.

Bill.

On 17 May, 15:45, Bill Hart <[EMAIL PROTECTED]> wrote:
> Hi Martin,
>
> Here is another 10% improvement. In the loop at the bottom of
> mzd_combine you can explicitly unroll by a factor of 8:
>
>     word * end = b1_ptr + wide;
>     register word * end8 = end - 8;
>     while (b1_ptr < end8)
>     {
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>     }
>     while (b1_ptr < end)
>     {
>          *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
>     }
>
> I did this in combination with changing the crossover for 10000x10000
> from 3600 to 7200.
>
> Bill.
>
> On 17 May, 09:40, Martin Albrecht <[EMAIL PROTECTED]>
> wrote:
>
> > On Saturday 17 May 2008, Bill Hart wrote:
>
> > > In going from 5000x5000 to 10000x10000 Magma's time increases by a
> > > factor of less than 4. That is impossible. Strassen will never help us
> > > there. They must be doing something else. Probably something clever.
>
> > > Bill.
>
> > I was stuck there too yesterday. Maybe only at 10000x10000 the
> > pipeline gets fully utilised?
>
> > Martin
>
> > PS: If we run out of ideas we can simply go for parallelism, that should help
> > on sage.math ;-)
>
> > --
> > name: Martin Albrecht
> > _pgp:http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> > _www:http://www.informatik.uni-bremen.de/~malb
> > _jab: [EMAIL PROTECTED]