Compiling with **--threadAnalysis:off** did allow the program to compile and 
run with no segfaults, and it produces the correct results, but it's 2x slower 
than the serial version.

In the code snippet, the **next**, **seg**, and **primes** arrays (seqs) and 
the constant **rescnt** are globals. **primes** is only read from. Each 
**residue_sieve** thread reads and writes its own independent chunk of 
**seg**. The problem may be with **next**, which every thread reads and 
writes, though no two threads ever touch the same location (each thread 
uses different indices).
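
To make the access pattern concrete, here is a minimal sketch of what I mean by "each thread owns its own chunk". It is not my actual code: **Task**, **fillChunk**, **data**, **NumThreads**, and **ChunkLen** are made-up names, and it assumes a fixed-size global array (not a seq) that is passed to each thread as a raw pointer instead of being accessed as a global inside the thread proc.

```nim
# Sketch only; on older compilers build with: nim c --threads:on example.nim
const
  NumThreads = 4      # hypothetical thread count
  ChunkLen = 1000     # hypothetical chunk size per thread

type
  # Raw pointer plus the half-open range [start, start + count) a thread owns.
  Task = tuple[buf: ptr UncheckedArray[int], start, count: int]

proc fillChunk(t: Task) {.thread.} =
  # Touches only indices inside this thread's own range.
  for i in t.start ..< t.start + t.count:
    t.buf[i] = i * 2

var
  data: array[NumThreads * ChunkLen, int]
  threads: array[NumThreads, Thread[Task]]

let base = cast[ptr UncheckedArray[int]](addr data[0])
for k in 0 ..< NumThreads:
  createThread(threads[k], fillChunk,
               (buf: base, start: k * ChunkLen, count: ChunkLen))
joinThreads(threads)   # wait for every chunk to be written
```

Because the thread proc never names a global directly, this shape should not need the thread analysis turned off. Whether it would run any faster than the serial version for my sieve is exactly what I don't know.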

Is there **detailed documentation**, with a non-trivial example, that shows 
how to actually write working parallel code?

If I need to re-architect the algorithm, I want to really understand what 
needs to be done before I undertake that task.
