Package: release.debian.org
Severity: normal
User: release.debian....@packages.debian.org
Usertags: unblock
Hello release team,

Please unblock package mlucas. This upload should fix the RC bug <http://bugs.debian.org/cgi-bin/860662> by splitting the big test into smaller ones, so that running the test suite no longer exhausts system resources. The diff is attached below:
diff -Nru mlucas-14.1/debian/changelog mlucas-14.1/debian/changelog
--- mlucas-14.1/debian/changelog	2015-08-27 22:42:36.000000000 +0800
+++ mlucas-14.1/debian/changelog	2017-04-24 16:16:28.000000000 +0800
@@ -1,3 +1,11 @@
+mlucas (14.1-2) unstable; urgency=medium
+
+  * RC bug fix release (Closes: #860662), split big test into smaller ones
+    to avoid exhausting system resources.
+  * Backport fix for undefined behavior from upstream.
+
+ -- Alex Vong <alexvong1...@gmail.com>  Mon, 24 Apr 2017 16:16:28 +0800
+
 mlucas (14.1-1) unstable; urgency=low
 
   * Initial release (Closes: #786656)
diff -Nru mlucas-14.1/debian/patches/0001-fixes-undefined-behaviour.patch mlucas-14.1/debian/patches/0001-fixes-undefined-behaviour.patch
--- mlucas-14.1/debian/patches/0001-fixes-undefined-behaviour.patch	1970-01-01 08:00:00.000000000 +0800
+++ mlucas-14.1/debian/patches/0001-fixes-undefined-behaviour.patch	2017-04-24 16:16:28.000000000 +0800
@@ -0,0 +1,657 @@
+From f4c2fb2f7f771bf696d277140d267f6f03577f49 Mon Sep 17 00:00:00 2001
+From: Alex Vong <alexvong1...@gmail.com>
+Date: Wed, 27 Jul 2016 19:52:35 +0800
+Subject: [PATCH] Fixes undefined behaviour.
+
+Description: This fixes undefined behaviour (array out of bound) in
+ the fermat test code reported by gcc's
+ `-Waggressive-loop-optimizations'.
+Forwarded: yes
+Author: Ernst W. Mayer <ewma...@aol.com>
+
+* src/radix1008_main_carry_loop.h: Fix undefined behaviour.
+* src/radix1024_main_carry_loop.h: Likewise.
+* src/radix128_main_carry_loop.h: Likewise.
+* src/radix224_main_carry_loop.h: Likewise.
+* src/radix240_main_carry_loop.h: Likewise.
+* src/radix256_main_carry_loop.h: Likewise.
+* src/radix32_main_carry_loop.h: Likewise.
+* src/radix4032_main_carry_loop.h: Likewise.
+* src/radix56_main_carry_loop.h: Likewise.
+* src/radix60_main_carry_loop.h: Likewise.
+* src/radix64_main_carry_loop.h: Likewise.
+* src/radix960_main_carry_loop.h: Likewise.
+---
+ src/radix1008_main_carry_loop.h | 21 ++++++++++-----------
+ src/radix1024_main_carry_loop.h |  6 +++---
+ src/radix128_main_carry_loop.h  | 10 +++++-----
+ src/radix224_main_carry_loop.h  | 17 ++++++++---------
+ src/radix240_main_carry_loop.h  | 19 ++++++++++---------
+ src/radix256_main_carry_loop.h  | 10 +++++-----
+ src/radix32_main_carry_loop.h   |  6 +++---
+ src/radix4032_main_carry_loop.h | 21 ++++++++++-----------
+ src/radix56_main_carry_loop.h   | 10 +++++-----
+ src/radix60_main_carry_loop.h   | 12 ++++++------
+ src/radix64_main_carry_loop.h   |  6 +++---
+ src/radix960_main_carry_loop.h  | 22 ++++++++++++++--------
+ 12 files changed, 82 insertions(+), 78 deletions(-)
+
+diff --git a/src/radix1008_main_carry_loop.h b/src/radix1008_main_carry_loop.h
+index 25cdc2c..525d29c 100644
+--- a/src/radix1008_main_carry_loop.h
++++ b/src/radix1008_main_carry_loop.h
+@@ -389,14 +389,14 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */
+ // icycle[ic],icycle[ic+1],icycle[ic+2],icycle[ic+3], jcycle[ic],kcycle[ic],lcycle[ic] of the non-looped version with
+ // icycle[ic],icycle[jc],icycle[kc],icycle[lc], jcycle[ic],kcycle[ic],lcycle[ic] :
+ ic = 0; jc = 1; kc = 2; lc = 3;
+- while(tm0 < isrt2,two) // Can't use l for loop index here since need it for byte offset in carry macro call
+ { /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */
+ //See "Sep 2014" note in 32-bit SSE2 version of this code below
+ k1 = icycle[ic]; k5 = jcycle[ic]; k6 = kcycle[ic]; k7 = lcycle[ic];
+ k2 = icycle[jc];
+ k3 = icycle[kc];
+ k4 = icycle[lc];
+- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here.
++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_hiacc(tm0,tmp,l,tm1,0x1f80, 0x7e0,0xfc0,0x17a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, add0,p1,p2,p3); + tm0 += 8; tm1++; tmp += 8; l -= 0xc0; +@@ -417,7 +417,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k2 = icycle[jc]; + k3 = icycle[kc]; + k4 = icycle[lc]; +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_loacc(tm0,tmp,tm1,0x1f80, 0x7e0,0xfc0,0x17a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, add0,p1,p2,p3); + tm0 += 8; tm1++; +@@ -447,15 +447,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + ic = 0; jc = 1; + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + l = ODD_RADIX; // Need to stick this #def into an intvar to work around [error: invalid lvalue in asm input for constraint 'm'] +- while(tm1 < isrt2) { ++ while((int)(tmp-cy_r) < RADIX) { + //See "Sep 2014" note in 32-bit SSE2 version of this code below + k1 = icycle[ic]; + k2 = jcycle[ic]; + k3 = icycle[jc]; + k4 = jcycle[jc]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
+- tm2 += (-((int)(tm1-cy_r)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += (-((int)((tmp-cy_r)>>1)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2,k3,k4, tm2,p1); + tm1 += 4; tmp += 2; + MOD_ADD32(ic, 2, ODD_RADIX, ic); +@@ -468,16 +468,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + ic = 0; // ic = idx into [i|j]cycle mini-arrays, gets incremented (mod ODD_RADIX) between macro calls + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + l = ODD_RADIX << 4; // 32-bit version needs preshifted << 4 input value +- while(tm1 < isrt2) { ++ while((int)(tmp-cy_r) < RADIX) { + //Sep 2014: Even with reduced-register version of the 32-bit Fermat-mod carry macro, + // GCC runs out of registers on this one, without some playing-around-with-alternate code-sequences ... + // Pulling the array-refs out of the carry-macro call like so solves the problem: + k1 = icycle[ic]; + k2 = jcycle[ic]; +- // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += (-(l&0x10)) & p2; +- tm2 += (-(l&0x01)) & p1; // Added offset cycles among p0,1,2,3 ++ // Each SSE2 carry macro call also processes 1 prefetch of main-array data ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
++ tm2 += p1*((int)(tmp-cy_r)&0x3); // Added offset cycles among p0,1,2,3 + SSE2_fermat_carry_norm_errcheck(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2, tm2); + tm1 += 2; tmp++; + MOD_ADD32(ic, 1, ODD_RADIX, ic); +diff --git a/src/radix1024_main_carry_loop.h b/src/radix1024_main_carry_loop.h +index 6b2e8ae..d43b47c 100644 +--- a/src/radix1024_main_carry_loop.h ++++ b/src/radix1024_main_carry_loop.h +@@ -384,7 +384,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #if (OS_BITS == 32) + for(l = 0; l < RADIX; l++) { // RADIX loop passes + // Each SSE2 carry macro call also processes 1 prefetch of main-array data +- add0 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,... ++ add0 = a + j1 + pfetch_dist + poff[l>>2]; // poff[] = p0,4,8,... + add0 += (-(l&0x10)) & p2; + add0 += (-(l&0x01)) & p1; + SSE2_fermat_carry_norm_pow2_errcheck (tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, add0); +@@ -393,7 +393,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #else // 64-bit SSE2 + for(l = 0; l < RADIX>>1; l++) { // RADIX/2 loop passes + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- add0 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,... ++ add0 = a + j1 + pfetch_dist + poff[l>>1]; // poff[] = p0,4,8,... 
+ add0 += (-(l&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_pow2_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, add0,p2); + tm1 += 4; tmp += 2; +@@ -427,7 +427,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + SSE2_RADIX_64_DIF( FALSE, thr_id, + 4, // set = trailz(N) - trailz(64) + // Input pointer; no offsets array in pow2-radix case: +- s1p00 + (jt<<1), 0x0, ++ (double *)(s1p00 + (jt<<1)), 0x0, + // Intermediates-storage pointer: + vd00, + // Outputs: Base address plus index offsets: +diff --git a/src/radix128_main_carry_loop.h b/src/radix128_main_carry_loop.h +index ff92238..24cb836 100644 +--- a/src/radix128_main_carry_loop.h ++++ b/src/radix128_main_carry_loop.h +@@ -571,7 +571,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #if (OS_BITS == 32) + for(l = 0; l < RADIX; l++) { // RADIX loop passes + // Each SSE2 carry macro call also processes 1 prefetch of main-array data +- tm2 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,... ++ tm2 = (vec_dbl *)( + pfetch_dist + poff[l>>2]); // poff[] = p0,4,8,... + tm2 += (-(l&0x10)) & p02; + tm2 += (-(l&0x01)) & p01; + SSE2_fermat_carry_norm_pow2_errcheck (tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, tm2); +@@ -580,7 +580,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #else // 64-bit SSE2 + for(l = 0; l < RADIX>>1; l++) { // RADIX/2 loop passes + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[l>>1]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
+ tm2 += (-(l&0x1)) & p02; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_pow2_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, tm2,p01); + tm1 += 4; tmp += 2; +@@ -592,7 +592,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + // Can't use l as loop index here, since it gets used in the Fermat-mod carry macro (as are k1,k2); + ntmp = 0; addr = cy_r; addi = cy_i; + for(m = 0; m < RADIX>>2; m++) { +- jt = j1 + poff[m]; jp = j2 + poff[m]; // poff[] = p04,08,...,60 ++ jt = j1 + poff[m]; jp = j2 + poff[m]; // poff[] = p04,08,... + fermat_carry_norm_pow2_errcheck(a[jt ],a[jp ],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; + fermat_carry_norm_pow2_errcheck(a[jt+p01],a[jp+p01],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; + fermat_carry_norm_pow2_errcheck(a[jt+p02],a[jp+p02],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; +@@ -634,8 +634,8 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + k1 = reverse(l,8)<<1; + tm2 = s1p00 + k1; + #if (OS_BITS == 32) +- add1 = (vec_dbl*)tm1+ 2; add2 = (vec_dbl*)tm1+ 4; add3 = (vec_dbl*)tm1+ 6; add4 = (vec_dbl*)tm1+ 8; add5 = (vec_dbl*)tm1+10; add6 = (vec_dbl*)tm1+12; add7 = (vec_dbl*)tm1+14; +- add8 = (vec_dbl*)tm1+16; add9 = (vec_dbl*)tm1+18; adda = (vec_dbl*)tm1+20; addb = (vec_dbl*)tm1+22; addc = (vec_dbl*)tm1+24; addd = (vec_dbl*)tm1+26; adde = (vec_dbl*)tm1+28; addf = (vec_dbl*)tm1+30; ++ add1 = (double*)(tm1+ 2); add2 = (double*)(tm1+ 4); add3 = (double*)(tm1+ 6); add4 = (double*)(tm1+ 8); add5 = (double*)(tm1+10); add6 = (double*)(tm1+12); add7 = (double*)(tm1+14); ++ add8 = (double*)(tm1+16); add9 = (double*)(tm1+18); adda = (double*)(tm1+20); addb = (double*)(tm1+22); addc = (double*)(tm1+24); addd = (double*)(tm1+26); adde = (double*)(tm1+28); addf = (double*)(tm1+30); + SSE2_RADIX16_DIF_0TWIDDLE (tm2,OFF1,OFF2,OFF3,OFF4, tmp,two, 
tm1,add1,add2,add3,add4,add5,add6,add7,add8,add9,adda,addb,addc,addd,adde,addf); + #else + SSE2_RADIX16_DIF_0TWIDDLE_B(tm2,OFF1,OFF2,OFF3,OFF4, tmp,two, tm1); +diff --git a/src/radix224_main_carry_loop.h b/src/radix224_main_carry_loop.h +index 1ad55e5..ead8a83 100644 +--- a/src/radix224_main_carry_loop.h ++++ b/src/radix224_main_carry_loop.h +@@ -398,7 +398,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k3 = icycle[kc]; + k4 = icycle[lc]; + // Each AVX carry macro call also processes 4 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_hiacc(tm0,tmp,l,tm1,0x700, 0xe0,0x1c0,0x2a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p1,p2,p3); + tm0 += 8; tm1++; tmp += 8; l -= 0xc0; +@@ -420,7 +420,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k3 = icycle[kc]; + k4 = icycle[lc]; + // Each AVX carry macro call also processes 4 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_loacc(tm0,tmp,tm1,0x700, 0xe0,0x1c0,0x2a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p1,p2,p3); + tm0 += 8; tm1++; +@@ -448,15 +448,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... 
*/ + ic = 0; jc = 1; + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + l = ODD_RADIX; // Need to stick this #def into an intvar to work around [error: invalid lvalue in asm input for constraint 'm'] +- while(tm1 < isrt2) { ++ while((int)(tmp-cy_r) < RADIX) { + //See "Sep 2014" note in 32-bit SSE2 version of this code below + k1 = icycle[ic]; + k2 = jcycle[ic]; + int k3 = icycle[jc]; + int k4 = jcycle[jc]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += (-((int)(tm1-cy_r)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += (-((int)((tmp-cy_r)>>1)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2,k3,k4, tm2,p1); + tm1 += 4; tmp += 2; + MOD_ADD32(ic, 2, ODD_RADIX, ic); +@@ -470,16 +470,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + // Need to stick this #def into an intvar to work around [error: invalid lvalue in asm input for constraint 'm'] + l = ODD_RADIX << 4; // 32-bit version needs preshifted << 4 input value +- while(tm1 < isrt2) { ++ while((int)(tmp-cy_r) < RADIX) { + //Sep 2014: Even with reduced-register version of the 32-bit Fermat-mod carry macro, + // GCC runs out of registers on this one, without some playing-around-with-alternate code-sequences ... 
+ // Pulling the array-refs out of the carry-macro call like so solves the problem: + k1 = icycle[ic]; + k2 = jcycle[ic]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += (-(l&0x10)) & p2; +- tm2 += (-(l&0x01)) & p1; // Added offset cycles among p0,1,2,3 ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += p1*((int)(tmp-cy_r)&0x3); // Added offset cycles among p0,1,2,3 + SSE2_fermat_carry_norm_errcheck(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2, tm2); + tm1 += 2; tmp++; + MOD_ADD32(ic, 1, ODD_RADIX, ic); +diff --git a/src/radix240_main_carry_loop.h b/src/radix240_main_carry_loop.h +index 2278d29..6f8e0f4 100644 +--- a/src/radix240_main_carry_loop.h ++++ b/src/radix240_main_carry_loop.h +@@ -608,14 +608,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + // icycle[ic],icycle[jc],icycle[kc],icycle[lc], jcycle[ic],kcycle[ic],lcycle[ic] : + ic = 0; jc = 1; kc = 2; lc = 3; + while(tm0 < s1pef) // Can't use l for loop index here since need it for byte offset in carry macro call +- { ++ { // NB: (int)(tmp-cy_r) < RADIX (as used for SSE2 build) no good here, since just 1 vec_dbl increment ++ // per 4 Re+Im-carries; but (int)(tmp-cy_r) < (RADIX>>1) would work + //See "Sep 2014" note in 32-bit SSE2 version of this code below + k1 = icycle[ic]; k5 = jcycle[ic]; k6 = kcycle[ic]; k7 = lcycle[ic]; + k2 = icycle[jc]; + k3 = icycle[kc]; + k4 = icycle[lc]; + // Each AVX carry macro call also processes 4 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_hiacc(tm0,tmp,l,tm1,0x780, 0x1e0,0x3c0,0x5a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p1,p2,p3); + tm0 += 8; tm1++; tmp += 8; l -= 0xc0; +@@ -691,7 +692,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k3 = icycle[kc]; + k4 = icycle[lc]; + // Each AVX carry macro call also processes 4 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_loacc(tm0,tmp,tm1,0x780, 0x1e0,0x3c0,0x5a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p1,p2,p3); + tm0 += 8; tm1++; +@@ -722,15 +723,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + ic = 0; jc = 1; + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + l = ODD_RADIX; // Need to stick this #def into an intvar to work around [error: invalid lvalue in asm input for constraint 'm'] +- while(tm1 < s1pef) { ++ while((int)(tmp-cy_r) < RADIX) { + //See "Sep 2014" note in 32-bit SSE2 version of this code below + k1 = icycle[ic]; + k2 = jcycle[ic]; + int k3 = icycle[jc]; + int k4 = jcycle[jc]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
+- tm2 += (-((int)(tm1-cy_r)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += (-((int)((tmp-cy_r)>>1)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2,k3,k4, tm2,p1); + tm1 += 4; tmp += 2; + MOD_ADD32(ic, 2, ODD_RADIX, ic); +@@ -744,15 +745,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + // Need to stick this #def into an intvar to work around [error: invalid lvalue in asm input for constraint 'm'] + l = ODD_RADIX << 4; // 32-bit version needs preshifted << 4 input value +- while(tm1 <= s1pef) { ++ while((int)(tmp-cy_r) < RADIX) { + //Sep 2014: Even with reduced-register version of the 32-bit Fermat-mod carry macro, + // GCC runs out of registers on this one, without some playing-around-with-alternate code-sequences ... + // Pulling the array-refs out of the carry-macro call like so solves the problem: + k1 = icycle[ic]; + k2 = jcycle[ic]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += plo[(int)(tm1-cy_r)&0x3]; // Added offset cycles among p0,1,2,3 ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
++ tm2 += p1*((int)(tmp-cy_r)&0x3); // Added offset cycles among p0,1,2,3 + SSE2_fermat_carry_norm_errcheck(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2, tm2); + tm1 += 2; tmp++; + MOD_ADD32(ic, 1, ODD_RADIX, ic); +diff --git a/src/radix256_main_carry_loop.h b/src/radix256_main_carry_loop.h +index d439f24..aff7f38 100644 +--- a/src/radix256_main_carry_loop.h ++++ b/src/radix256_main_carry_loop.h +@@ -558,7 +558,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #if (OS_BITS == 32) + for(l = 0; l < RADIX; l++) { // RADIX loop passes + // Each SSE2 carry macro call also processes 1 prefetch of main-array data +- add0 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,... ++ add0 = a + j1 + pfetch_dist + poff[l>>2]; // poff[] = p0,4,8,... + add0 += (-(l&0x10)) & p02; + add0 += (-(l&0x01)) & p01; + SSE2_fermat_carry_norm_pow2_errcheck (tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, add0); +@@ -567,7 +567,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #else // 64-bit SSE2 + for(l = 0; l < RADIX>>1; l++) { // RADIX/2 loop passes + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- add0 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,... ++ add0 = a + j1 + pfetch_dist + poff[l>>1]; // poff[] = p0,4,8,... 
+ add0 += (-(l&0x1)) & p02; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_pow2_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, add0,p01); + tm1 += 4; tmp += 2; +@@ -579,7 +579,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + // Can't use l as loop index here, since it gets used in the Fermat-mod carry macro (as are k1,k2): + ntmp = 0; addr = cy_r; addi = cy_i; + for(m = 0; m < RADIX>>2; m++) { +- jt = j1 + poff[m]; jp = j2 + poff[m]; // poff[] = p04,08,...,60 ++ jt = j1 + poff[m]; jp = j2 + poff[m]; // poff[] = p04,08,... + fermat_carry_norm_pow2_errcheck(a[jt ],a[jp ],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; + fermat_carry_norm_pow2_errcheck(a[jt+p01],a[jp+p01],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; + fermat_carry_norm_pow2_errcheck(a[jt+p02],a[jp+p02],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; +@@ -629,8 +629,8 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + k1 = reverse(l,16)<<1; + tm2 = s1p00 + k1; + #if (OS_BITS == 32) +- add1 = (vec_dbl*)tmp+ 2; add2 = (vec_dbl*)tmp+ 4; add3 = (vec_dbl*)tmp+ 6; add4 = (vec_dbl*)tmp+ 8; add5 = (vec_dbl*)tmp+10; add6 = (vec_dbl*)tmp+12; add7 = (vec_dbl*)tmp+14; +- add8 = (vec_dbl*)tmp+16; add9 = (vec_dbl*)tmp+18; adda = (vec_dbl*)tmp+20; addb = (vec_dbl*)tmp+22; addc = (vec_dbl*)tmp+24; addd = (vec_dbl*)tmp+26; adde = (vec_dbl*)tmp+28; addf = (vec_dbl*)tmp+30; ++ add1 = (double*)(tmp+ 2); add2 = (double*)(tmp+ 4); add3 = (double*)(tmp+ 6); add4 = (double*)(tmp+ 8); add5 = (double*)(tmp+10); add6 = (double*)(tmp+12); add7 = (double*)(tmp+14); ++ add8 = (double*)(tmp+16); add9 = (double*)(tmp+18); adda = (double*)(tmp+20); addb = (double*)(tmp+22); addc = (double*)(tmp+24); addd = (double*)(tmp+26); adde = (double*)(tmp+28); addf = (double*)(tmp+30); + SSE2_RADIX16_DIF_0TWIDDLE (tm2,OFF1,OFF2,OFF3,OFF4, isrt2,two, 
tmp,add1,add2,add3,add4,add5,add6,add7,add8,add9,adda,addb,addc,addd,adde,addf); + #else + SSE2_RADIX16_DIF_0TWIDDLE_B(tm2,OFF1,OFF2,OFF3,OFF4, isrt2,two, tmp); +diff --git a/src/radix32_main_carry_loop.h b/src/radix32_main_carry_loop.h +index 5337009..3f0d0a0 100644 +--- a/src/radix32_main_carry_loop.h ++++ b/src/radix32_main_carry_loop.h +@@ -291,7 +291,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #if (OS_BITS == 32) + for(l = 0; l < RADIX; l++) { // RADIX loop passes + // Each SSE2 carry macro call also processes 1 prefetch of main-array data +- tm2 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,... ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[l>>2]); // poff[] = p0,4,8,... + tm2 += (-(l&0x10)) & p02; + tm2 += (-(l&0x01)) & p01; + SSE2_fermat_carry_norm_pow2_errcheck (tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, tm2); +@@ -300,7 +300,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #else // 64-bit SSE2 + for(l = 0; l < RADIX>>1; l++) { // RADIX/2 loop passes + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[l>>1]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
+ tm2 += (-(l&0x1)) & p02; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_pow2_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, tm2,p01); + tm1 += 4; tmp += 2; +@@ -312,7 +312,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + // Can't use l as loop index here, since it gets used in the Fermat-mod carry macro (as are k1,k2); + ntmp = 0; addr = cy_r; addi = cy_i; + for(m = 0; m < RADIX>>2; m++) { +- jt = j1 + poff[m]; jp = j2 + poff[m]; // poff[] = p04,08,...,60 ++ jt = j1 + poff[m]; jp = j2 + poff[m]; // poff[] = p04,08,... + fermat_carry_norm_pow2_errcheck(a[jt ],a[jp ],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; + fermat_carry_norm_pow2_errcheck(a[jt+p01],a[jp+p01],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; + fermat_carry_norm_pow2_errcheck(a[jt+p02],a[jp+p02],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; +diff --git a/src/radix4032_main_carry_loop.h b/src/radix4032_main_carry_loop.h +index 3e68bb2..ac02d50 100644 +--- a/src/radix4032_main_carry_loop.h ++++ b/src/radix4032_main_carry_loop.h +@@ -371,7 +371,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k2 = icycle[jc]; + k3 = icycle[kc]; + k4 = icycle[lc]; +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_hiacc(tm0,tmp,l,tm1,0x7e00, 0x1f80,0x3f00,0x5e80, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p1,p2,p3); + tm0 += 8; tm1++; tmp += 8; l -= 0xc0; +@@ -386,14 +386,13 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... 
*/ + tm0 = s1p00; tmp = base_negacyclic_root; // tmp *not* incremented between macro calls in loacc version + tm1 = cy_r; // tm2 = cy_i; *** replace with literal-byte-offset in macro call to save a reg + ic = 0; jc = 1; kc = 2; lc = 3; +- for(l = 0; l < RADIX>>2; l++) // RADIX/4 loop passes +- { ++ for(l = 0; l < RADIX>>2; l++) { // RADIX/4 loop passes + //See "Sep 2014" note in 32-bit SSE2 version of this code below + k1 = icycle[ic]; k5 = jcycle[ic]; k6 = kcycle[ic]; k7 = lcycle[ic]; + k2 = icycle[jc]; + k3 = icycle[kc]; + k4 = icycle[lc]; +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_loacc(tm0,tmp,tm1,0x7e00, 0x1f80,0x3f00,0x5e80, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p1,p2,p3); + tm0 += 8; tm1++; +@@ -423,15 +422,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + ic = 0; jc = 1; + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + l = ODD_RADIX; // Need to stick this #def into an intvar to work around [error: invalid lvalue in asm input for constraint 'm'] +- while(tm1 < cy_r) { ++ while((int)(tmp-cy_r) < RADIX) { + //See "Sep 2014" note in 32-bit SSE2 version of this code below + k1 = icycle[ic]; + k2 = jcycle[ic]; + int k3 = icycle[jc]; + int k4 = jcycle[jc]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
+- tm2 += (-((int)(tm1-cy_r)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += (-((int)((tmp-cy_r)>>1)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2,k3,k4, tm2,p1); + tm1 += 4; tmp += 2; + MOD_ADD32(ic, 2, ODD_RADIX, ic); +@@ -444,15 +443,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + ic = 0; // ic = idx into [i|j]cycle mini-arrays, gets incremented (mod ODD_RADIX) between macro calls + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + l = ODD_RADIX << 4; // 32-bit version needs preshifted << 4 input value +- while(tm1 < cy_r) { ++ while((int)(tmp-cy_r) < RADIX) { + //Sep 2014: Even with reduced-register version of the 32-bit Fermat-mod carry macro, + // GCC runs out of registers on this one, without some playing-around-with-alternate code-sequences ... + // Pulling the array-refs out of the carry-macro call like so solves the problem: + k1 = icycle[ic]; + k2 = jcycle[ic]; + // Each SSE2 carry macro call also processes 1 prefetch of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += plo[(int)(tm1-cy_r)&0x3]; // Added offset cycles among p0,1,2,3 ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
++ tm2 += p1*((int)(tmp-cy_r)&0x3); // Added offset cycles among p0,1,2,3 + SSE2_fermat_carry_norm_errcheck(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2, tm2); + tm1 += 2; tmp++; + MOD_ADD32(ic, 1, ODD_RADIX, ic); +@@ -531,7 +530,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + // the leading pow2-shift arg = trailz(N) - trailz(64) = 0: + SSE2_RADIX_64_DIF( FALSE, thr_id, + 0, +- tmp,t_offsets, ++ (double *)tmp,t_offsets, + s1p00, // tmp-storage + a+jt,io_offsets + ); tmp += 2; +diff --git a/src/radix56_main_carry_loop.h b/src/radix56_main_carry_loop.h +index 7e6ba9f..6e395fa 100644 +--- a/src/radix56_main_carry_loop.h ++++ b/src/radix56_main_carry_loop.h +@@ -434,7 +434,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k3 = icycle[kc]; + k4 = icycle[lc]; + // Each AVX carry macro call also processes 4 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_loacc(tm0,tmp,tm1,0x1c0, 0xe0,0x1c0,0x2a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p01,p02,p03); + tm0 += 8; tm1++; +@@ -469,8 +469,8 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + int k3 = icycle[jc]; + int k4 = jcycle[jc]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
+- tm2 += (-((int)(tm1-cy_r)&0x1)) & p02; // Base-addr incr by extra p2 on odd-index passes ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += (-((int)((tmp-cy_r)>>1)&0x1)) & p02; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2,k3,k4, tm2,p01); + tm1 += 4; tmp += 2; + MOD_ADD32(ic, 2, ODD_RADIX, ic); +@@ -491,8 +491,8 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k1 = icycle[ic]; + k2 = jcycle[ic]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += p01*((int)(tm1-cy_r)&0x3); // Added offset cycles among p0,1,2,3 ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += p01*((int)(tmp-cy_r)&0x3); // Added offset cycles among p0,1,2,3 + SSE2_fermat_carry_norm_errcheck(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2, tm2); + tm1 += 2; tmp++; + MOD_ADD32(ic, 1, ODD_RADIX, ic); +diff --git a/src/radix60_main_carry_loop.h b/src/radix60_main_carry_loop.h +index 187ec3f..d4ad69b 100644 +--- a/src/radix60_main_carry_loop.h ++++ b/src/radix60_main_carry_loop.h +@@ -424,7 +424,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k3 = icycle[kc]; + k4 = icycle[lc]; + // Each AVX carry macro call also processes 4 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_hiacc(tm0,tmp,l,tm1,0x1e0, 0x1e0,0x3c0,0x5a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p01,p02,p03); + tm0 += 8; tm1++; tmp += 8; l -= 0xc0; +@@ -446,7 +446,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k3 = icycle[kc]; + k4 = icycle[lc]; + // Each AVX carry macro call also processes 4 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_loacc(tm0,tmp,tm1,0x1e0, 0x1e0,0x3c0,0x5a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p01,p02,p03); + tm0 += 8; tm1++; +@@ -483,8 +483,8 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + int k3 = icycle[jc]; + int k4 = jcycle[jc]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += (-((int)(tm1-cy_r)&0x1)) & p02; // Base-addr incr by extra p2 on odd-index passes ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. 
++ tm2 += (-((int)((tmp-cy_r)>>1)&0x1)) & p02; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2,k3,k4, tm2,p01); + tm1 += 4; tmp += 2; + MOD_ADD32(ic, 2, ODD_RADIX, ic); +@@ -505,8 +505,8 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k1 = icycle[ic]; + k2 = jcycle[ic]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += p01*((int)(tm1-cy_r)&0x3); // Added offset cycles among p0,1,2,3 ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += p01*((int)(tmp-cy_r)&0x3); // Added offset cycles among p0,1,2,3 + SSE2_fermat_carry_norm_errcheck(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2, tm2); + tm1 += 2; tmp++; + MOD_ADD32(ic, 1, ODD_RADIX, ic); +diff --git a/src/radix64_main_carry_loop.h b/src/radix64_main_carry_loop.h +index ce3e4af..57bea3d 100644 +--- a/src/radix64_main_carry_loop.h ++++ b/src/radix64_main_carry_loop.h +@@ -464,7 +464,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #if (OS_BITS == 32) + for(l = 0; l < RADIX; l++) { // RADIX loop passes + // Each SSE2 carry macro call also processes 1 prefetch of main-array data +- tm2 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,... ++ tm2 = a + j1 + pfetch_dist + poff[l>>2]; // poff[] = p0,4,8,... 
+ tm2 += (-(l&0x10)) & p02; + tm2 += (-(l&0x01)) & p01; + SSE2_fermat_carry_norm_pow2_errcheck (tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, tm2); +@@ -473,7 +473,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + #else // 64-bit SSE2 + for(l = 0; l < RADIX>>1; l++) { // RADIX/2 loop passes + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[l]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = a + j1 + pfetch_dist + poff[l>>1]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + tm2 += (-(l&0x1)) & p02; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_pow2_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,half_arr,sign_mask,add1,add2, tm2,p01); + tm1 += 4; tmp += 2; +@@ -485,7 +485,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee + // Can't use l as loop index here, since it gets used in the Fermat-mod carry macro (as are k1,k2); + ntmp = 0; addr = cy_r; addi = cy_i; + for(m = 0; m < RADIX>>2; m++) { +- jt = j1 + poff[m]; jp = j2 + poff[m]; // poff[] = p04,08,...,60 ++ jt = j1 + poff[m]; jp = j2 + poff[m]; // poff[] = p04,08,... + fermat_carry_norm_pow2_errcheck(a[jt ],a[jp ],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; + fermat_carry_norm_pow2_errcheck(a[jt+p01],a[jp+p01],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; + fermat_carry_norm_pow2_errcheck(a[jt+p02],a[jp+p02],*addr,*addi,ntmp,NRTM1,NRT_BITS); ntmp += NDIVR; ++addr; ++addi; +diff --git a/src/radix960_main_carry_loop.h b/src/radix960_main_carry_loop.h +index cb4cc15..f900a77 100644 +--- a/src/radix960_main_carry_loop.h ++++ b/src/radix960_main_carry_loop.h +@@ -589,16 +589,22 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... 
*/ + // Oct 2014: Try getting most of the LOACC speedup with better accuracy by breaking the complex-roots-of-(-1) + // chaining into 2 or more equal-sized subchains, each starting with 'fresh' (unchained) complex roots: + #if (LOACC == 0) ++ #warning LOACC = 0 + #define NFOLD (const int)0 + #elif (LOACC == 1) ++ #warning LOACC = 1 + #define NFOLD (const int)1 + #elif (LOACC == 2) ++ #warning LOACC = 2 + #define NFOLD (const int)2 + #elif (LOACC == 3) ++ #warning LOACC = 3 + #define NFOLD (const int)3 + #elif (LOACC == 4) ++ #warning LOACC = 4 + #define NFOLD (const int)4 + #elif (LOACC == 5) ++ #warning LOACC = 5 + #define NFOLD (const int)5 + #else + #error If LOACC defined for build of radix960_ditN_cy_dif1.c, must be given value 0,1,2,3,4 or 5! +@@ -650,7 +656,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + k3 = icycle[kc]; + k4 = icycle[lc]; + // Each AVX carry macro call also processes 4 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. + /* vvvvvvvvvvvvvvv [1,2,3]*ODD_RADIX; assumed << l2_sz_vd on input: */ + SSE2_fermat_carry_norm_errcheck_X4_loacc(tm0,tmp,tm1,0x1e00, 0x1e0,0x3c0,0x5a0, half_arr,sign_mask,k1,k2,k3,k4,k5,k6,k7, tm2,p1,p2,p3); + tm0 += 8; tm1++; +@@ -681,15 +687,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... 
*/ + ic = 0; jc = 1; + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + l = ODD_RADIX; // Need to stick this #def into an intvar to work around [error: invalid lvalue in asm input for constraint 'm'] +- while(tm1 < x00) { ++ while((int)(tmp-cy_r) < RADIX) { + //See "Sep 2014" note in 32-bit SSE2 version of this code below + k1 = icycle[ic]; + k2 = jcycle[ic]; + k3 = icycle[jc]; + k4 = jcycle[jc]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += (-((int)(tm1-cy_r)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += (-((int)((tmp-cy_r)>>1)&0x1)) & p2; // Base-addr incr by extra p2 on odd-index passes + SSE2_fermat_carry_norm_errcheck_X2(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2,k3,k4, tm2,p1); + tm1 += 4; tmp += 2; + MOD_ADD32(ic, 2, ODD_RADIX, ic); +@@ -703,15 +709,15 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + tm1 = s1p00; tmp = cy_r; // <*** Again rely on contiguity of cy_r,i here *** + // Need to stick this #def into an intvar to work around [error: invalid lvalue in asm input for constraint 'm'] + l = ODD_RADIX << 4; // 32-bit version needs preshifted << 4 input value +- while(tm1 < x00) { ++ while((int)(tmp-cy_r) < RADIX) { + //Sep 2014: Even with reduced-register version of the 32-bit Fermat-mod carry macro, + // GCC runs out of registers on this one, without some playing-around-with-alternate code-sequences ... 
+ // Pulling the array-refs out of the carry-macro call like so solves the problem: + k1 = icycle[ic]; + k2 = jcycle[ic]; + // Each SSE2 carry macro call also processes 2 prefetches of main-array data +- tm2 = a + j1 + pfetch_dist + poff[(int)(tm1-cy_r)]; // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. +- tm2 += plo[(int)(tm1-cy_r)&0x3]; // Added offset cycles among p0,1,2,3 ++ tm2 = (vec_dbl *)(a + j1 + pfetch_dist + poff[(int)(tmp-cy_r)>>2]); // poff[] = p0,4,8,...; (tm1-cy_r) acts as a linear loop index running from 0,...,RADIX-1 here. ++ tm2 += p1*((int)(tmp-cy_r)&0x3); // Added offset cycles among p0,1,2,3 + SSE2_fermat_carry_norm_errcheck(tm1,tmp,NRT_BITS,NRTM1,idx_offset,idx_incr,l,half_arr,sign_mask,add1,add2,k1,k2, tm2); + tm1 += 2; tmp++; + MOD_ADD32(ic, 1, ODD_RADIX, ic); +@@ -982,7 +988,7 @@ for(k=1; k <= khi; k++) /* Do n/(radix(1)*nwt) outer loop executions... */ + // the leading pow2-shift arg = trailz(N) - trailz(64) = 0: + SSE2_RADIX_64_DIF( FALSE, thr_id, + 0, +- tmp,t_offsets, ++ (double *)tmp,t_offsets, + s1p00, // tmp-storage + a+jt,dif_o_offsets + ); tmp += 2; +-- +2.12.2 + diff -Nru mlucas-14.1/debian/patches/0001-Split-big-test-into-smaller-ones.patch mlucas-14.1/debian/patches/0001-Split-big-test-into-smaller-ones.patch --- mlucas-14.1/debian/patches/0001-Split-big-test-into-smaller-ones.patch 1970-01-01 08:00:00.000000000 +0800 +++ mlucas-14.1/debian/patches/0001-Split-big-test-into-smaller-ones.patch 2017-04-24 16:16:28.000000000 +0800 @@ -0,0 +1,33 @@ +From 35e426b2718af92558df61718f405c69e03bf10d Mon Sep 17 00:00:00 2001 +From: Alex Vong <alexvong1...@gmail.com> +Date: Mon, 24 Apr 2017 14:09:01 +0800 +Subject: [PATCH] Split big test into smaller ones. + +Description: Split big test into smaller ones to avoid exhausting + system resources. This fix is inspired by that of + https://bugs.debian.org/860664. 
+Bug-Debian: https://bugs.debian.org/860662
+Forwarded: yes
+Author: Alex Vong <alexvong1...@gmail.com>
+
+* scripts/self_test.test: Split big test.
+---
+ scripts/self_test.test | 10 ++++++++--
+ 1 file changed, 8 insertions(+), 2 deletions(-)
+
+--- a/scripts/self_test.test
++++ b/scripts/self_test.test
+@@ -29,5 +29,11 @@
+ # Export MLUCAS_PATH so that mlucas.cfg stays in the build directory
+ export MLUCAS_PATH
+
+-# Do self-test
+-exec "$MLUCAS_PATH"mlucas -s m
++# List of `medium' exponents
++exponent_ls='20000047 22442237 24878401 27309229 29735137 32156581 34573867 36987271 39397201 44207087 49005071 53792327 58569809 63338459 68098843 72851621 77597293 87068977 96517019 105943723 115351063 124740697 134113933 143472073'
++
++# Run self-test on `medium' exponents
++for exponent in $exponent_ls
++do
++    "$MLUCAS_PATH"mlucas -m "$exponent" -iters 100
++done
diff -Nru mlucas-14.1/debian/patches/series mlucas-14.1/debian/patches/series
--- mlucas-14.1/debian/patches/series	2015-08-28 03:58:09.000000000 +0800
+++ mlucas-14.1/debian/patches/series	2017-04-24 16:16:28.000000000 +0800
@@ -1 +1,3 @@
 0001-Add-copyright-info-of-generated-files.patch
+0001-Split-big-test-into-smaller-ones.patch
+0001-fixes-undefined-behaviour.patch
diff -Nru mlucas-14.1/debian/README.Debian mlucas-14.1/debian/README.Debian
--- mlucas-14.1/debian/README.Debian	2015-08-27 22:53:38.000000000 +0800
+++ mlucas-14.1/debian/README.Debian	2017-04-24 16:16:28.000000000 +0800
@@ -13,6 +13,14 @@
 flag. However, the parser will not reject unsupported arguments. Using
 unsupported arguments for -iters flag may trigger strange behaviour.
 
+On system with limited resources, the self-test for medium exponents
+'mlucas -s m' may fail with 'pthread_create:: Cannot allocate memory'. See
+<https://bugs.debian.org/860662> for details. The current fix is to run
+self-test on each exponent one by one instead. However, this is unsatisfactory
+since it does not prevent the user from running the self-test for medium
+exponents and getting an error.
+
 See BUGS section in mlucas(1) for details.
 
+ -- Alex Vong <alexvong1...@gmail.com>  Thu, 27 Aug 2017 22:04:58 +0800
+
 -- Alex Vong <alexvong1...@gmail.com>  Thu, 27 Aug 2015 22:04:58 +0800
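For reviewers who want to see the shape of the test change without reading the hunk: the new self_test.test logic can be sketched standalone as below. This is only an illustration, not the packaged script; the exponent list is truncated to three entries, and the mlucas invocations are echoed rather than executed (the binary and MLUCAS_PATH are assumed from the patch context).

```shell
#!/bin/sh
# Sketch of the split self-test from the patch above.
# Assumption: MLUCAS_PATH points at the directory holding the mlucas
# binary; here we only print each command instead of running it.
MLUCAS_PATH=${MLUCAS_PATH:-./}

# Truncated subset of the `medium' exponent list in scripts/self_test.test
exponent_ls='20000047 22442237 24878401'

# One short run (-iters 100) per exponent, replacing the single big
# `mlucas -s m' run, so one exponent exhausting memory no longer
# takes down the whole self-test
for exponent in $exponent_ls
do
    echo "${MLUCAS_PATH}mlucas -m $exponent -iters 100"
done
```

The key point for #860662 is that each iteration is an independent short-lived process, so thread/memory resources are released between exponents instead of accumulating across one long run.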
Feel free to ask for more details.

Cheers,
Alex

unblock mlucas/14.1-2

-- System Information:
Debian Release: 9.0
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.0-2-amd64 (SMP w/2 CPU cores)
Locale: LANG=zh_TW.UTF-8, LC_CTYPE=zh_TW.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)