Jim Meyering wrote:
...
> With that, I've solved at least part of the problem.
> The segfault (and other strangeness we've witnessed)
> arises because each "node" struct is stored on the stack,
> and its address ends up being used by another thread after
> the thread that owns the stack in question has been "joined".
>
> My solution is to use the heap instead of the stack.
> However, for today I'm out of time and I have not yet found a
> way to free these newly-malloc'd "node" buffers.
>
> To test this, I've done the following:
>
>   gensort -a 10000 > gensort-10k
>   for i in $(seq 2000); do printf '% 4d\n' $i; valgrind --quiet src/sort -S 100K \
>     --parallel=2 gensort-10k > k; test $(wc -c < k) = 1000000 || break; done
>   for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
>     --parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done
>
> Without the patch, the first would show errors for more than 50% of
> the runs and the second would rarely get to i=100 without generating
> a core file.  With the patch, both complete error-free (not counting
> leaks).

FYI, while preparing a test, I've found that the latter loop (the one without valgrind) passes all 2000 iterations when sort is compiled with -g -O2, yet fails in at least 10 of the 2000 when compiled with -ggdb3.
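
For anyone following along without the patch in front of them, here is a minimal sketch of the stack-versus-heap lifetime issue described in the quoted text. It is not the sort.c code: the "node" fields and the names worker, node_spawn and node_finish are invented for illustration. The idea is only that a struct handed to another thread has to outlive the stack frame that created it, which a heap allocation guarantees as long as it is freed only after the last user is done.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a "node"; the real struct in sort.c differs.  */
struct node
{
  int lo;
  int hi;
  long sum;
};

/* Thread body: sum the integers in [lo, hi) into the node.  */
static void *
worker (void *arg)
{
  struct node *n = arg;
  assert (n != NULL);
  for (int i = n->lo; i < n->hi; i++)
    n->sum += i;
  return NULL;
}

/* Start a worker on a heap-allocated node and return immediately.
   Had the node been a local variable here, the worker would be left
   holding a pointer into a stack frame that no longer exists.  */
struct node *
node_spawn (int lo, int hi, pthread_t *tid)
{
  struct node *n = malloc (sizeof *n);
  if (n == NULL)
    return NULL;
  n->lo = lo;
  n->hi = hi;
  n->sum = 0;
  if (pthread_create (tid, NULL, worker, n) != 0)
    {
      free (n);
      return NULL;
    }
  return n;
}

/* Wait for the worker, read its result, and only then free the node.  */
long
node_finish (struct node *n, pthread_t tid)
{
  pthread_join (tid, NULL);
  long sum = n->sum;
  free (n);
  return sum;
}
```

The free in node_finish is safe in this sketch only because exactly one thread uses each node and it has been joined first. When a node's address can be reached from several threads, working out the point at which no one still holds it is the harder part.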