Twinprimes generator that showcases Nim

jzakiya Tue, 27 Mar 2018 21:30:05 +0200

It's been a long and interesting journey, but I've finally gotten this to a 
point where I feel I can release it.


This is a `twinprimes` generator, that I did an initial version of back in 
2014|5 in C++. Over time, I did a C++ parallel version using OpenMP. Then a few 
months ago, at the beginning of 2018, I started looking at this again (I can't 
remember the specifics of why) and said, man, I can do better than that now. I 
had been reimplementing the straight SSoZ in Nim, and trying to figure out the 
best architecture. It turns out in this exercise, I figured out how to use 
better math to allow me to create a much simpler, and MANY TIMES faster 
architecture for parallel implementation.

What you see here maybe the `fastest twinnprimes generator` if this other 
program that claimed to be the fastest really is.

There is a program called `primesieve` 
([https://primesieve.org/)](https://primesieve.org/\)), which is open source, 
written in C++, whose author says its the fastest programs that generates 
primes. It is a very nice, and well written, program, and I've used it to learn 
how to do good C++ programming. In the past, I've written him about my 
primesieve method(s) (SoZ, SSoZ). It's easy to download and run. I just get the 
console version of the compiled version.

So I use `primesieve` as the reference to check out my methods|techniques. 
After some head scratching, and a few `Aha` moments, `I've finally figured out 
how to bang on Nim` to make my program faster than `primesieve` for generating 
`twinprimes`.

Below are time comparisons between `primesieve` and mine - `twinprimes_ssoz`. 
You see as the numbers get bigger then `twinprimes_ssoz` becomes increasingly 
faster. It's optimum math now (finally) married to an optimum implementation.

To get these times, I ran both programs on a `quiet` system, where I shutdown 
and rebooted, with no other apps loaded other than a terminal, to run both in, 
and an editor to record their times. If you have other programs that operate in 
the background that use threads (e.g. browsers) or eat up memory, the 
(relative) times will be different|slower for both.

To run `primesieve` to generate twinprimes from a console do e.g.: `$ 
./primesieve 7e9 -c2`

To run `twinprimes_ssoz` just do: `$ ./twinprimes_ssoz<cr>` and then enter 
number as: `7000000000`

It was compiled as follows:
    
    
    $ nim c --cc:gcc --d:release --threads:on --gc:none twinprimes_ssoz.nim
    
    

Below are the times I got (in seconds) on my system: System76 laptop, with an 
Intel I7 6700HQ cpu, 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory, run 
on a 64-bit PCLinuxOS Linux distro, compiled using gcc 4.9.2, with Nim 0.17.2. 
Here both programs just count the number of twinprimes (`twinprimes_ssoz` also 
prints the last twinprime value).
    
    
    Input Number   |  twinprimes_ssoz  |  primesieve
    --------------------------------------------------
     < = 1e6       |        0.000      |      0.000
    5e6            |        0.001      |      0.001
    1e7            |        0.001      |      0.001
    5e7            |        0.003      |      0.004
    1e8            |        0.005      |      0.005
    5e8            |        0.025      |      0.024
    1e9            |        0.054      |      0.050
    5e9            |        0.222      |      0.262
    1e10           |        0.483      |      0.572
    5e10           |        2.693      |      3.112
    1e11           |        5.581      |      6.475
    5e11           |       29.371      |     35.675
    1e12           |       63.274      |     75.638
    5e12           |      408.348      |    448.066
    1e13           |      914.001      |    953.423
    

**Value to Nim**

This code can be used to benefit|showcase Nim in various ways.

  * show that Nim can be a player in the numerical analysis arena, particularly 
for parallel algorithms
  * can be used as a standard benchmark to evaluate new versions of Nim for 
improvements|regressions
  * can be used to talk stuff to to C++, et al, programmers 



**Future Development**

After taking some rest, I plan to implement this new architecture inn 
C++|OpenMP so I can compare performance. I'll report the results afterwards. I 
also plan to update my paper `The Segmented Sieve of Zakiya (SSoZ)` (released 
in 2104) to include the new math, architecture, and coding improvements.

Below are references for learning about the SoZ and SSoZ.

`The Segmented Sieve of Zakiya (SSoZ)`

[https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ](https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ)

`Primes Utils Handbook`

[https://www.scribd.com/document/266461408/Primes-Utils-Handbook](https://www.scribd.com/document/266461408/Primes-Utils-Handbook)

But I encourage, implore, welcome, people to beat on the code to improve it and 
make it faster. What idioms are faster than the ones I used, etc. Also, I 
thought I saw mentioned herein that OpenMP can be used with Nim. If so, it 
would be interesting if using it in Nim would make the code faster. Whoever 
knows how to do that, go for it!

Below is the code and its gist (it's 317 loc, with ~60 separate loc of 
comments, compare that to `primesieve`'s code size):

[https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e](https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e)
    
    
    #[
     This Nim source file is a multiple threaded implementation to perform an
     extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.
     
     Based on the size of input value N, it dynamically selects the best PG to
     use from PGs P5, P7, P11, and P13, and then sets the optimum segment size.
     
     This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
     2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. I suspect parameter
     tuning may have to be done on other hadware systems (ARM, PowerPC, etc) to
     achieve optimum performance on them. It was tested on various Linux 64 bit
     distros, native and in Virtual Box, using 8 or 4 threads, or 16|4GB of mem.
     
     At time of writing, verified using Nim: 0.17.2, 0.18.0
     
     The code was compiled using these compiler directives|flags. Not usng GC
     produces smaller executable, and maybe little faster. Try w/wo to see
     difference on your system. For optimum performance use gcc over clang.
     
     $ nim c --cc:gcc --d:release --threads:on --gc:none twinprimes_ssoz.nim
     
     Then run executable: $ ./twinprimes_ssoz <cr>, and enter value for N.
     As coded, input values cover the range: 13 -- 2^64-1
     
     This source file, and updates, will be available here:
     https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
     
     Related code, papers, and tutorials, are available here:
     https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
     https://www.scribd.com/document/266461408/Primes-Utils-Handbook
     
     Use of this code is free subject to acknowledgment of copyright.
     Copyright (c) 2017 Jabari Zakiya -- jzakiya at gmail dot com
     Version Date: 2018/03/18
     Version Date: 2018/03/27
     
     This code is provided under the terms of the
     GNU General Public License Version 3, GPLv3, or greater.
     License copy/terms are here:  http://www.gnu.org/licenses/
    ]#
    
    import math                   # for sqrt function
    import strutils, typetraits   # for number input
    import times, os              # for timing code execution
    import osproc                 # for getting threads count
    import threadpool             # for parallel processing
    {.experimental.}              # required to use 'parallel' (<= 0.18.x)
    
    # Compute modular inverse a^(-1) of a to base b, e.g. a*(a^-1) mod b = 1
    proc modinv(a0, b0: int): int =
      var (a, b, x0) = (a0, b0, 0)
      result = 1
      if b == 1: return
      while a > 1:
        let q = a div b
        a = a mod b
        swap a, b
        result -= q * x0
        swap x0, result
      if result < 0: result += b0
    
    # Create constant parameters for given PG at compile time
    proc genPGparameters(prime: int): (int, int, int, seq[int], seq[int], 
seq[int]) =
      echo("generating parameters for P", prime)
      let primes = [2, 3, 5, 7, 11, 13, 17, 19, 23]
      var modpg = 1
      for prm in primes: (modpg *= prm; if prm == prime: break)  # PG's mdoulus
      
      var residues: seq[int] = @[]    # generate PG's residue values here
      var pc = 1
      var inc = 4
      while pc < modpg: (pc += inc; inc = inc xor 0b110; if gcd(modpg, pc) == 
1: residues.add(pc))
      let rescnt = residues.len       # PG's residues count
      
      var restwins: seq[int] = @[]    # extract twinpair residues here
      var j = 0
      while j < rescnt-1:
        if residues[j]+2 == residues[j+1]:
          restwins.add residues[j]; restwins.add residues[j+1]; j += 1
        j += 1
      let rescntp = restwins.len     # twinpair residues count
      
      var inverses: seq[int] = @[]   # create PG's residues inverses here
      for res in residues: inverses.add(modinv(res,modpg))
      
      result = (modpg, rescnt, rescntp, residues, restwins, inverses)
    
    # Generate at compile time the parameters for PGs P5-P13
    const parametersp5  = genPGparameters(5)
    const parametersp7  = genPGparameters(7)
    const parametersp11 = genPGparameters(11)
    const parametersp13 = genPGparameters(13)
    
    # Global parameters
    var
      pcnt = 0             # number of primes from r1..sqrt(N)
      num = 0'u64          # adjusted (odd) input value
      primecnt = 0'u64     # number of primes <= N
      nextp: seq[uint64]   # table of regroups vals for primes multiples
      primes: seq[int]     # list of primes r1..sqrt(N)
      seg: seq[uint8]      # segment byte array to perform ssoz
      KB = 0               # segment size for each seg restrack
      cnts: seq[uint]      # hold twinprime counts for seg bytes
      pos: seq[int]        # convert residue val to its residues index val
                           # faster than `residues.find(residue)`
      modpg:    int        # PG's modulus value
      rescnt:   int        # PG's residues count
      rescntp:  int        # PG's twinprimes residues count
      residues: seq[int]   # PG's list of residues
      restwins: seq[int]   # PG's list of twinpair residues
      resinvrs: seq[int]   # PG's list of residues inverses
      Bn:       int        # segment size factor for PG and input number
    
    # Select at runtime best PG and segment size factor to use for input value.
    # These are good estimates derived from PG data profiling. Can be improved.
    proc selectPG(num: uint) =
      if num < 10_000_000:
        (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp5
        Bn = 16
      elif num < 1_100_000_000'u:
        (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp7
        Bn = 32
      elif num < 35_500_000_000'u:
        (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp11
        Bn = 64
      else:
        (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp13
        if num > 2_500_000_000_000'u: Bn = 384
        elif num > 250_000_000_000'u: Bn = 196
        else: Bn = 96
      cnts = newSeq[uint](rescntp div 2)           # twinprime sums for seg 
bytes
      pos  = newSeq[int](modpg)                    # create modpg size array to
      for i in 0..rescnt-1: pos[residues[i]-2] = i # convert residue val -> indx
    
    # Compute the list of primes r1..sqrt(input_num), and store in global
    # 'primes' array, and store their count in global var 'pcnt'.
    # Any algorithm (fast|small) is usable. Here the SoZ for the PG is used.
    proc sozpg(val: int | int64) =      # Use PG to finds primes upto val
      let md = modpg                    # PG's modulus value
      let rscnt = rescnt                # PG's residue count
      let res = residues                # PG's residues list
      
      let num = (val-1) or 1            # if val even then subtract 1
      var k = num div md                # compute its residue group value
      var modk = md * k                 # compute the resgroup base value
      var r = 0                         # from 1st residue in num's resgroup
      while num >= modk+res[r]: r += 1  # find last pc val|position <= num
      let maxpcs = k * rscnt + r        # max number of prime candidates <= num
      var prms = newSeq[bool](maxpcs)   # array of prime candidates set False
      
      let sqrtN =int(sqrt float64(num)) # compute integer sqrt of input num
      modk = 0; r = -1; k = 0           # initialize sieve parameters
      
      # mark the multiples of the primes r1..sqrtN in 'prms'
      for prm in prms:                  # from list of pc positions in prms
        r += 1; if r == rscnt: (r = 0; modk += md; k += 1)
        if prm: continue                # if pc not prime go to next location
        let prm_r = res[r]              # if prime save its residue value
        let prime = modk + prm_r        # numerate the prime value
        if  prime > sqrtN: break        # we're finished if it's > sqrtN
        let prmstep = prime * rscnt     # prime's stepsize to mark its mults
        for ri in res:                  # mark prime's multiples in prms
          let prod = prm_r * ri - 2     # compute cross-product for r|ri pair
          # compute resgroup val of 1st prime multiple for each restrack
          # starting there, mark all prime multiples on restrack upto end of 
prms
          var prm_mult = (k*(prime + ri) + prod div md)*rscnt + pos[prod mod md]
          while prm_mult < maxpcs: prms[prm_mult] = true; prm_mult += prmstep
      
      # prms now contains the nonprime positions for the prime candidates r1..N
      # extract primes into global var 'primes' and count into global var 'pcnt'
      primes = @[]                      # create empty dynamic array for primes
      modk = 0; r = -1                  # initialize loop parameters
      for prm in prms:                  # numerate|store primes from pcs list
        r += 1; if r == rscnt: (r = 0; modk += md)
        if not prm: primes.add(modk + res[r]) # put prime in global 'primes' 
list
      pcnt = primes.len                       # set global count of primes
    
    # For 'nextp' 'row_index' restrack for given residue, for primes 
r1..sqrt(N),
    # init each col w/1st prime multiple resgroup for each prime along restrack.
    proc resinit(row_indx, res: int) {.gcsafe.} =
     {.gcsafe.}:                              # for given residue 'res'
        let row = row_indx * pcnt             # along its restrack row in 
'nextp'
        for j, prime in primes:               # for each primes r1..sqrt(N)
          let k = (prime-2) div modpg         # find the resgroup it's in
          let r = (prime-2) mod modpg + 2     # and its residue value
          let ri = (res*resinvrs[pos[r-2]]-2) mod modpg + 2 # compute the ri 
for r
          let prod = r * ri - 2               # compute residues cross-product
          # compute|store 1st prime mult resgroup at col j for prime, for 'res'
          nextp[row + j] = uint(k*(prime + ri) + prod div modpg)
    
    # Print twinprimes for given twinpair for given segment slice.
    # Primes will not be displayed in sorted order, collect|sort later for that.
    proc printprms(Kn: int, Ki: uint64, indx: int)= # starting at seg resgroup 
Ki
      var modk = Ki * modpg.uint         # compute its base num
      let s_row = indx * KB              # set seg row index for given twinpair
      let res = restwins[indx * 2 + 1]   # get upper twinpair residue value
      for k in 0..Kn-1:                  # then for each of Kn seg resgroups
        if int(seg[s_row + k]) == 0:     # if both lsb bits in byte are prime
          if modk + res.uint <= num:     # and upper twinprime <= num
            echo(modk+res.uint-1)        # print one twinprime mid val per line
            #write(stdout,modk+res.uint-1," ") # print twnprms mid val
        modk += modpg.uint
    
    # First init twinpair's 1st prime mults for their residues rows in 'nextp'.
    # Perform prime sieve on selected twinpair restracks in seg of Kn resgroups.
    # Set lsbs in bytes to '1' along restrack for nonprime pc resgroups where 
each
    # of the Kn resgroup bytes along the restrack represent its prime 
candidates.
    # Update 'nextp' array for that restrack for next segment slices 
accordingly.
    # Then compute twinprimes count for segment, store in 'cnts' array for row.
    proc twins_sieve(Kmax: uint, indx, Ks: int) {.gcsafe.} =
     {.gcsafe.}:
      let i = indx * 2                     # lower lsb seg row index for 
twinpair
      resinit(i,   restwins[i])            # init 1st prime mults for lower lsb
      resinit(i+1, restwins[i+1])          # init 1st prime mults for upper lsb
      var sum = 0'u                        # init primes cnt for this seg byte 
row
      var Ki  = 0'u                        # 1st resgroup seg val for each slice
      let s_row = indx * KB                # set seg row address for this 
twinpair
      while Ki < Kmax:                     # for Ks resgroup size slices upto 
Kmax
        let Kn = min(Ks, int(Kmax-Ki))     # set segment slice resgroup length
        for b in s_row..s_row+KB-1: seg[b]=0 # set all seg restrack bits to 
prime
        for r in 0..1:                     # for 2 lsbs for twinpair bits in 
byte
          let row = (i + r) * pcnt         # set address to its restrack in 
'nextp'
          let biti = uint8(1 shl r)        # set residue track bit mask
          for j in 0..pcnt-1:              # for each prime index r1..sqrt(N)
            if nextp[row + j] < Kn.uint:   # if 1st mult resgroup is within 
'seg'
              var k = int(nextp[row + j])  # starting from this resgroup in 
'seg'
              let prime = primes[j]        # for this prime
              while k < Kn:                # for each primenth byte to end of 
'seg'
                seg[s_row+k] = seg[s_row+k] or biti # mark restrack bit nonprime
                k += prime                 # compute next prime multiple 
resgroup
              nextp[row + j] = uint(k-Kn)  # save 1st resgroup in next eligible 
seg
            else: nextp[row+j] -= Kn.uint  # do if 1st mult resgroup not within 
seg
        for k in 0..Kn-1: (if seg[s_row+k] == 0: sum += 1) # sum bytes with 
twins
        #printprms(Kn, Ki, indx)            # display twintprimes for this 
twinpair
        Ki += Ks.uint                      # set 1st resgroup val of next seg 
slice
      cnts[indx] = sum                     # save prime count for this seg 
twinpair
    
    # Perform sieve on twinpair residues for segment of Kn resgroups in 
parallel.
    # Then add seg twinprimes count for each seg tp restrack total to 
'primecnt'.
    proc segsieve(Kmax: uint) =          # perform sieve on all segments
      let Ks = KB                        # make default seg size immutable
      let twinpairs = rescntp div 2      # for number of seg twinpair byte rows
      parallel:                          # perform in parallel
        for indx in 0..twinpairs-1:      # for nextp|seg twinpair row indexes,
          spawn twins_sieve(Kmax,indx,Ks)# sieve the selected twinpair restracks
      sync()                             # then count the twinprimes in the seg
      for sum in cnts: primecnt += sum   # update 'primecnt' w/seg twinprime 
cnts
    
    proc twinprimes_ssoz() =             # Use selected Pn prime generator for 
SSoZ
      stdout.write "Enter integer number: "
      let val = stdin.readline.parseBiggestUInt # find primes <= val 
(13..2^64-1)
      
      echo("threads = ", countProcessors())
      let ts = epochTime()             # start timing sieve setup execution
      
      num = uint64((val-1) or 1)       # if val even subtract 1
      selectPG(num.uint)               # select PG and seg factor for input 
number
      let modpg = modpg.uint           # to reduce casting hell noise
      var k = num div modpg            # compute its resgroup
      var modk = modpg * k.uint        # compute its base num
      let Kmax = (num-2) div modpg + 1 # maximum number of resgroups for num
      
      let B = Bn * 1024                # set seg size to optimize for selected 
PG
      KB = if Kmax.int < B: (Kmax.int + 1) else: B  # min num of segment 
resgroups
      seg = newSeq[uint8]((rescntp div 2)*KB) # create [twinpairs x KB] bytes 
array
      
      echo("segment is [", (rescntp div 2), " x ", KB, "] bytes array")
      
      # This is not necessary for running the program but provides information
      # to determine the 'efficiency' of the used PG: (num of primes)/(num of 
pcs)
      # The closer the ratio is to '1' the higher the PG's 'efficiency'.
      var r = 0                        # starting with first residue
      while num.uint >= modk+restwins[r].uint: r += 1 # find last tp index <= 
num
      let maxpcs = k*rescntp.uint + r.uint  # maximum number of twinprime pcs
      let maxpairs = maxpcs div 2      # maximum number of twinpair candidates
      
      echo("twinprime candidates = ", maxpairs, "; resgroups = ", Kmax)
      
      let sqrtN=int(sqrt float64(num)) # compute integer sqrt of input num
      sozpg(sqrtN)                     # compute pcnt and primes <= sqrt(num)
      
      nextp = newSeq[uint64](rescntp*pcnt) # create size of global 'nextp' array
      
      echo("created|initialized nextp[", rescntp, " x ", pcnt, "] array")
      let te = (epochTime()-ts)        # sieve setup time
      echo("setup time = ", te.formatFloat(ffDecimal, 3), " secs")
      
      primecnt = if modpg > 210'u: 3 else: 2   # for 1st tps (3,5)|(5,7)|(11,13)
      
      echo("perform Twin Prime Segmented SoZ") # start doing ssoz now
      
      let t1 = epochTime()             # start timing ssoz sieve execution
      segsieve(Kmax.uint)              # perform total twinprime ssoz sieve
      
      let Kn = min(KB, (Kmax.int mod KB)) # set number of resgroups in last 
slice
      var lprime = 0'u64               # to store last twinprime value <= num
      modk = (Kmax-1).uint * modpg     # set mod for last resgroup in last 
segment
      k = uint(Kn-1)                   # set val for last resgroup in last 
segment
      let lasti = rescntp div 2 - 1    # set val for last seg twinpair row index
      r = lasti                        # starting from last resgroup twinpair 
byte
      while true:                      # step backwards from end of last 
resgroup
        var row_i = r * KB             # set 'seg' byte row for twinpair 
restrack
        if int(seg[row_i + k.int]) == 0: # if both twinpair bits in byte are 
prime
          lprime = modk + restwins[r*2 + 1].uint # numerate the upper twinprime 
val
          if lprime <= num: break      # if its <= num its the last prime, so 
exit
          primecnt -= 1                # else reduce primecnt, keep backtracking
                                       # reduce restrack, next resgroup if 
needed
        r -= 1; if r < 0: (r = lasti; modk -= modpg;  k -= 1)
      
      let t2 = (epochTime()-t1)        # sieve execution time
      echo()
      echo("sieve time = ", t2.formatFloat(ffDecimal, 3), " secs")
      echo("last segment = ", Kn, " resgroups; segment slices = ", Kmax div 
KB.uint+1)
      echo("total twins = ", primecnt, "; last twin = ", lprime-1, "+/-1")
      echo("total time = ", (t2+te).formatFloat(ffDecimal, 3), " secs\n")
    
    twinprimes_ssoz()

Twinprimes generator that showcases Nim

Reply via email to