Compiling with **\--threadAnalysis:off** did allow the program to compile and run without segfaults, and it produces the correct results, but it is 2x slower than the serial version.
In the code snippet, the **next**, **seg**, and **primes** arrays (seqs) and the constant **rescnt** are globals. **primes** is only read from. Each thread of **residue_sieve** reads and writes its own independent chunk of **seg**. The problem may be with **next**, which every thread reads and writes, though never at the same location: each thread works on a different part of it.

Is there **detailed documentation** with a non-trivial example showing how to actually create working parallel code? If I need to re-architect the algorithm, I first want to really understand the details of what needs to be done before I undertake that task.