Ciao Paul,

I prepared a better patch to the tune/speed program (forget the experimental
one I sent yesterday); it is attached.

Before the patch, mpn_mul seems noticeably slower than mpn_mul_n:

$ tune/speed -s 800000 mpn_mul_n mpn_mul mpn_mul_n mpn_mul
overhead 0.000000002 secs, precision 10000 units of 3.12e-10 secs, CPU freq 3205.77 MHz
            mpn_mul_n       mpn_mul     mpn_mul_n       mpn_mul
800000    0.646153000   0.673501000  #0.643274000   0.686486000


After the patch, which only changes the way tune/speed allocates memory for
the operands, the results are comparable:

$ tune/speed -s 800000 mpn_mul_n mpn_mul mpn_mul_n mpn_mul
overhead 0.000000002 secs, precision 10000 units of 3.12e-10 secs, CPU freq 3200.20 MHz
            mpn_mul_n       mpn_mul     mpn_mul_n       mpn_mul
800000    0.644460000   0.649850000   0.634180000  #0.631246000

There is a side effect: to measure the speed of an unbalanced multiplication,
e.g. ###### x ##, you previously used

tune/speed -s ## mpn_mul.######

Now the roles of the two parameters are swapped, and you have to write

tune/speed -s ###### mpn_mul.##

The transposed version of the matrix of times I suggested in the previous
message can now be obtained with the following:

$ tune/speed -s 800000-1200000 -t 100000 mpn_mul.400000 mpn_mul.500000 \
    mpn_mul.600000 mpn_mul.700000 mpn_mul.800000 mpn_mul_n
overhead 0.000000002 secs, precision 10000 units of 3.12e-10 secs, CPU freq 3200.23 MHz
        mul.400000 mul.500000 mul.600000 mul.700000 mul.800000 mpn_mul_n
800000  #0.430677   0.433753   0.515757   0.535098   0.630629  0.645156
900000  #0.431647   0.521545   0.532850   0.638642   0.644031  0.642488
1000000 #0.522817   0.527930   0.633221   0.646514   0.648290  0.708614
1100000 #0.516791   0.648199   0.640584   0.651306   0.681567  0.857438
1200000  0.647544  #0.640084   0.652030   0.675864   0.690255  0.950390

There are still problems of non-monotonicity (1200000 x 500000 is slightly
faster than both the more unbalanced 1200000 x 400000 and the less
unbalanced 1100000 x 500000), but at least we have isolated the issue.

If the other developers do not object to the changed meaning of the .<r>
parameter of mpn_mul, this patch can be applied to the main repo...

Opinions?

Best regards,
m

-- 
http://bodrato.it/software/combinatorics.html
diff -r 52470639dd75 tune/README
--- a/tune/README	Tue Feb 05 10:49:00 2013 +0100
+++ b/tune/README	Thu Feb 07 09:48:17 2013 +0100
@@ -287,10 +287,10 @@
 
 EXAMPLE COMPARISONS - MULTIPLICATION
 
-mul_basecase takes a ".<r>" parameter which is the first (larger) size
-parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,
+mul_basecase takes a ".<r>" parameter which is the second (smaller) size
+parameter.  For example to show speeds for 3x3 up to 20x3 in cycles,
 
-        ./speed -s 1-15 -c mpn_mul_basecase.20
+        ./speed -s 3-20 -c mpn_mul_basecase.3
 
 mul_basecase with no parameter does an NxN multiply, so for example to show
 speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
diff -r 52470639dd75 tune/speed.h
--- a/tune/speed.h	Tue Feb 05 10:49:00 2013 +0100
+++ b/tune/speed.h	Thu Feb 07 09:48:17 2013 +0100
@@ -1022,7 +1022,7 @@
 /* For mpn_mul, mpn_mul_basecase, xsize=r, ysize=s->size. */
 #define SPEED_ROUTINE_MPN_MUL(function)					\
   {									\
-    mp_ptr    wp, xp;							\
+    mp_ptr    wp;							\
     mp_size_t size1;							\
     unsigned  i;							\
     double    t;							\
@@ -1030,22 +1030,21 @@
 									\
     size1 = (s->r == 0 ? s->size : s->r);				\
 									\
-    SPEED_RESTRICT_COND (s->size >= 1);					\
-    SPEED_RESTRICT_COND (size1 >= s->size);				\
+    SPEED_RESTRICT_COND (size1 >= 1);					\
+    SPEED_RESTRICT_COND (s->size >= size1);				\
 									\
     TMP_MARK;								\
     SPEED_TMP_ALLOC_LIMBS (wp, size1 + s->size, s->align_wp);		\
-    SPEED_TMP_ALLOC_LIMBS (xp, size1, s->align_xp);			\
-									\
-    speed_operand_src (s, xp, size1);					\
-    speed_operand_src (s, s->yp, s->size);				\
+									\
+    speed_operand_src (s, s->xp, s->size);				\
+    speed_operand_src (s, s->yp, size1);				\
     speed_operand_dst (s, wp, size1 + s->size);				\
     speed_cache_fill (s);						\
 									\
     speed_starttime ();							\
     i = s->reps;							\
     do									\
-      function (wp, xp, size1, s->yp, s->size);				\
+      function (wp, s->xp, s->size, s->yp, size1);			\
     while (--i != 0);							\
     t = speed_endtime ();						\
 									\