We have proven, with a straight translation of Daniel Bernstein's C code to 
golang, that the Sieve of Eratosthenes (SoE) eratspeed runs at about the same 
speed as the Sieve of Atkin (SoA) primespeed for the same range, with the same 
optimizations and with John Barham's multi-threading for primespeed disabled 
(set numprocs = 1 in sieve.go); however, golang primespeed is slower with array 
bounds checks enabled due to the complexity of the array operations for the 
prime square free operations.  For both I eliminated the use of the "two" 
shift look-up table, as it caused an extra bounds check, and tuned the inner 
loops in exactly the same way to minimize tight-loop time.
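As a rough illustration of the "two" table change (the function names and the
bit-per-index layout below are my own sketch, not lifted from the eratspeed
translation), the innermost culling loop goes from a mask look-up to a plain
shift, removing one array access per composite marked:

package main

import "fmt"

// "two" is a single-bit mask look-up table in the style of the original C
// code; indexing it is a second array access inside the innermost loop.
var two = [8]uint8{1, 2, 4, 8, 16, 32, 64, 128}

// cullWithTable marks every step-th bit of the packed buffer via the table.
func cullWithTable(buf []uint8, start, step int) {
	for k := start; k < len(buf)*8; k += step {
		buf[k>>3] |= two[k&7]
	}
}

// cullWithShift computes the mask with a shift instead; k&7 is always in
// 0..7, so one array access (and its potential bounds check) disappears
// from the tight loop.
func cullWithShift(buf []uint8, start, step int) {
	for k := start; k < len(buf)*8; k += step {
		buf[k>>3] |= 1 << uint(k&7)
	}
}

func main() {
	a := make([]uint8, 16)
	b := make([]uint8, 16)
	cullWithTable(a, 9, 7)
	cullWithShift(b, 9, 7)
	fmt.Println(a)
	fmt.Println(b) // identical to a
}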

Now that already shows that the SoA is inferior for page-segmented sieving to 
1,000,000,000 (one billion): it has only about 258,000,000 operations versus 
about 405,000,000 for SoE eratspeed yet runs no faster, meaning that the 
individual SoA operations are more complex, which is obvious.  However, 
Bernstein crippled eratspeed to only 2/3/5 wheel factorization, where it is 
easily possible to use a 2/3/5/7 wheel plus pre-culling of the 11/13/17 primes 
to reduce the count to about 263,000,000 operations, or about 35% fewer.
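A minimal sketch of what pre-culling by 11/13/17 looks like (the bit-per-number
layout here is illustrative only; eratspeed indexes wheel residues, and the
pre-culled primes themselves have to be treated as prime separately):

package main

import "fmt"

// makePrecullPattern marks every multiple of the given primes in a pattern
// whose period in numbers is the product of the primes; sizing it to that
// many *bytes* (eight full periods in bits) lets it tile any segment with
// plain byte copies.
func makePrecullPattern(primes []int) []uint8 {
	period := 1
	for _, p := range primes {
		period *= p // 11*13*17 = 2431
	}
	pat := make([]uint8, period) // period bytes = 8*period bits
	for _, p := range primes {
		for k := 0; k < 8*period; k += p {
			pat[k>>3] |= 1 << uint(k&7)
		}
	}
	return pat
}

func main() {
	pat := makePrecullPattern([]int{11, 13, 17})
	// Seeding a segment is now one copy loop instead of three culling
	// passes; this assumes the segment starts at number 0 - a real sieve
	// copies the pattern rotated by the segment's starting offset.
	seg := make([]uint8, 4096)
	for i := range seg {
		seg[i] = pat[i%len(pat)]
	}
	fmt.Printf("pattern: %d bytes; first segment bytes: %v\n", len(pat), seg[:4])
}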

The golang FasterEratspeed that implements this maximum wheel factorization is 
posted on the Go Playground at https://play.golang.org/p/8Bo04bnT5F and runs 
about 33% faster than the previous standard wheel-factorized version, just as 
the ratio of operations suggests it should; also, because it minimizes array 
accesses, it doesn't matter much whether array bounds checking is enabled or 
not.  This is almost as fast as the highly optimized C++ primesieve 
(http://primesieve.org/) run single-threaded on my cache-access-speed 
bottle-necked machine, but on faster machines it will be slower than 
primesieve, as much as twice as slow on high-end Intel desktop CPUs.

Actually, this implementation isn't maximally factorized, as in some languages 
one is also able to pre-cull by 19, but with golang the overhead of creating 
and handling the much larger pattern array roughly cancels the few percent 
gain.
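A back-of-the-envelope calculation shows why (the ~5% figure is a rough
estimate of the remaining multiples of 19, not a measurement):

package main

import "fmt"

// The pre-cull pattern's period is the product of the pre-culled primes, so
// adding 19 multiplies the pattern (and the cost of building and copying it)
// by 19, while only about 1 in 19 of the remaining candidates is removed.
func main() {
	without := 11 * 13 * 17 // 2,431
	with := without * 19    // 46,189
	fmt.Println("pattern period without 19:", without)
	fmt.Println("pattern period with 19:   ", with)
	fmt.Printf("pattern is %dx larger for roughly a %.1f%% cull reduction\n",
		with/without, 100.0/19)
}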

So, in this implementation we are coming very close to C/C++ speeds and faster 
than C#/Java; however, I learned much from the exercise and have found that 
very slight changes in the formulation of the tight inner loop(s) can slow the 
code by at least a factor of two and sometimes by almost three.  The best 
optimizations aren't always obvious, and it is time consuming to try the 
different formulations to find the fastest code.  Sometimes adding instructions 
actually reduces the execution time, as the compiler will perhaps change the 
use of registers in a way that causes fewer stalls or bottlenecks.  For this 
particular implementation and golang version 1.7beta1, it seems that this 
particular loop is optimized quite effectively in spite of the array bounds 
check, and even better without it.
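To give a flavour of the kind of reformulation meant (this is a sketch, not the
actual FasterEratspeed loop; which variant wins on a given Go version has to be
confirmed by benchmarking and by reading the assembly with go build
-gcflags=-S):

package main

import "fmt"

// cullA keeps the bit index in the loop condition; the compiler must relate
// k < len(buf)*32 to the word index k>>5 to drop the bounds check.
func cullA(buf []uint32, start, step int) {
	for k := start; k < len(buf)*32; k += step {
		buf[k>>5] |= 1 << uint(k&31)
	}
}

// cullB tests the word index directly against len(buf) just before the
// store, which is the easiest shape for bounds-check elimination.
func cullB(buf []uint32, start, step int) {
	for k := start; ; k += step {
		w := k >> 5
		if w >= len(buf) {
			return
		}
		buf[w] |= 1 << uint(k&31)
	}
}

func main() {
	a := make([]uint32, 1<<12)
	b := make([]uint32, 1<<12)
	cullA(a, 49, 7)
	cullB(b, 49, 7)
	same := true
	for i := range a {
		if a[i] != b[i] {
			same = false
			break
		}
	}
	fmt.Println("both formulations mark the same bits:", same)
}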

That should be enough to prove that the SoA is just an interesting intellectual 
exercise: for page-segmented sieves such as these it has more complexity per 
operation and is thus slower than the maximally factorized SoE, and it also 
does not live up to its theoretical O(n) asymptotic complexity, as with 
increasing range there is an ever-increasing cost in code overhead, whereas the 
SoE comes very close to maintaining its theoretical O(n log log n) performance.
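For a feel of how slowly that log log n factor grows over practical ranges (a
quick illustration, not a benchmark):

package main

import (
	"fmt"
	"math"
)

// The per-candidate work in the SoE's O(n log log n) bound barely changes
// across several orders of magnitude of range, which is why a well-implemented
// page-segmented SoE looks nearly linear in practice.
func main() {
	for _, n := range []float64{1e8, 1e9, 1e10, 1e11} {
		fmt.Printf("n = %.0e  log log n = %.2f\n", n, math.Log(math.Log(n)))
	}
}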
