There are parallel build issues on the MinGW/MSYS, Cygwin, and MinGW-w64/MSYS2 Windows platforms (see below), so I have looked for race conditions caused by ill-considered CMake logic in our build system concerning target and file dependencies. I have not found any instances of that, although the timing conditions of my tests may simply not have been right to trigger such a race condition.

An example of such ill-considered logic is two separate custom targets that file-depend, directly or indirectly, on the same custom command. Under these conditions the custom command gets executed once for each target, which is normally not an issue for non-parallel builds or for parallel builds where the two custom targets happen to be executed at separate times. However, when by chance the two custom targets are executed at the same time in a parallel build, the resulting race condition will cause substantial build issues. We have lots of custom commands and custom targets in our build, so historically we have had a lot of such race conditions to deal with. But as far as I know (hah!) CMake code review of the various custom commands and targets that are generated has by now found them all.

Furthermore, here is a run-time test to discover whether any such race conditions are occurring. N.B. such race conditions critically depend on timing (e.g., the N value used in the -jN make option, the hardware platform, etc.), so this is a necessary but not sufficient test that we have gotten rid of all such race conditions via CMake code review.

It turns out that even if race conditions that occur for a given timing environment do not cause obvious build issues, they are still easy to spot. Each time a custom command is executed at build time (whenever its OUTPUT files are non-existent or older than its file depends), there is a corresponding build message

Generating <list of OUTPUT files from the custom command>

So that message will be duplicated if two separate custom targets attempt to build the custom command at the same time, i.e., whenever a race condition occurs for the given timing, regardless of the symptoms (or lack of symptoms) from such a race.

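As a minimal sketch of the duplicate-message idea described above (this is not the set of commands used for the results reported in this message), the helper below counts repeated "Generating" lines in a captured build log. The function name dup_generating and the assumption that each message starts with an optional "[ NN%] " progress prefix are hypothetical choices made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PLplot build system): print every
# "Generating ..." build message that occurs more than once in a build log.
dup_generating() {
    # $1 is the captured build output, e.g. the file written by
    # "make -j4 <target>" with stdout and stderr redirected to it.
    grep 'Generating ' "$1" |
        sed -e 's/^\[[^]]*\] *//' |   # drop an assumed leading "[ NN%] " prefix
        sort |
        uniq -c |                     # prefix each distinct message with its count
        awk '$1 > 1 {                 # keep only messages seen at least twice
            count = $1
            sub(/^[ \t]*[0-9]+[ \t]+/, "")
            print count ": " $0
        }'
}
```

Calling dup_generating on a saved log prints one "count: message" line per repeated message. Any line reported this way is only a candidate: distinct custom commands in different directories that happen to produce an OUTPUT file of the same name emit identical messages, so each hit still has to be checked by hand.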
I have created the following set of commands to test for duplicate messages of that form for both the test_noninteractive target and the all target.

# test_noninteractive target

# Do this test for an absolutely fresh build, i.e., cmake with
# -DBUILD_TEST=ON option in an initially empty build tree followed by
make -j4 test_noninteractive >& test_noninteractive.out

# Find essentially all occurrences of commands being run by looking for
# anything with "ing" with some exceptions.  Sort the result once
# without duplicates being eliminated and sort once with duplicates
# eliminated.
grep ing test_noninteractive.out |\
sed -e 's?\[.*\] ??g' |\
grep -vE 'Scanning|Differing|Missing' |\
grep -v bindings |\
grep -v "using command" |\
grep -v "Testing" |\
grep -v "switching to" |\
sort -u >| unique_builds.out

grep ing test_noninteractive.out |\
sed -e 's?\[.*\] ??g' |\
grep -vE 'Scanning|Differing|Missing' |\
grep -v bindings |\
grep -v "using command" |\
grep -v "Testing" |\
grep -v "switching to" |\
sort >| builds.out

# Compare the two results to see whether anything extra in builds.out
# compared to unique_builds.out
diff -au unique_builds.out builds.out |less

It turns out there is some extra data consisting of duplicate instances of

Generating tclIndex

and duplicate instances of

Generating x00

and similarly for almost every other x?? file. These all turned out to be false alarms. The two tclIndex files are actually created by distinct custom commands in bindings/tcl and examples/tcl. And similarly the x?? files are actually created by distinct custom commands in examples/python and examples/tcl.

The one peculiarity that I found from the above test is that

Generating x01
Generating x25
Generating x26

did not have duplicate instances.
However, I double checked that two files were created in each case (one in examples/tcl and one in examples/python), so the conclusion must be that I have turned up a minor bug in CMake (3.0.2) where it does not always emit a Generating <filename> message when a custom command generates an OUTPUT file at run time.

# all target

I also did the equivalent test for the "all" target starting from an initially empty build tree, with similar good results concerning the lack of race conditions for the given timing conditions. There were duplicate messages for plgrid.tcl, plplot.tcl, tclIndex (4 instances instead of the 2 that occurred for the test_noninteractive target), x??, and x??.tcl. Further analysis showed these were all false alarms due to a file with the same name being created by different custom commands in different directories.

So my conclusion is that there is no evidence of parallel build race conditions for the "test_noninteractive" or "all" targets in the core build tree for the given timing conditions, which is a limited but still satisfactory conclusion. Note that at some point I should do a similar test for the test_interactive target in the build tree and also for the all, test_noninteractive, and test_interactive targets for the installed examples, but I don't have time for that now.

The big motivation for the above experiment to look for parallel race conditions is the following parallel build issues that show up in all our current comprehensive tests on various Windows platforms:

1. For my comprehensive test on MinGW/MSYS in the last release using parallel builds with -j4, I ran into the following message intermittently (i.e., just once, and the repeat test did not encounter the problem):

make.exe error: "INTERNAL: Exiting with 1 jobserver tokens available; should be 4!"

That, along with the historical MSYS make.exe hangs for parallel builds (see <https://sourceforge.net/p/mingw/bugs/1950/>), led me to the conclusion that for the most reliable results you should never use parallel builds on the MinGW/MSYS platform. Arjen's recent test of MinGW/MSYS followed that recommendation and (probably as a result) did not have any build or ctest problems. However, the above bug report now claims the hang issue has finally been solved, so it might be worth trying parallel builds experimentally on this platform when we test it again some time in the next release cycle.

2. Greg Jung has also recently encountered a similar warning message for -j4:

make: INTERNAL: Exiting with 3 jobserver tokens available; should be 4!

on MinGW-w64/MSYS2. That build warning was accompanied by parallel ctest just hanging until Greg killed the job. Since then, Greg has had no further trouble with the issue if he uses the

--ctest_command "ctest" --build_command "make" --traditional_build_command "make"

options to the comprehensive_test.sh script, i.e., drops both the parallel build and parallel ctest on the MinGW-w64/MSYS2 platform.

3. Both Greg and Arjen have experienced the

make: INTERNAL: Exiting with 3 jobserver tokens available; should be 4!

warning on Cygwin. However, that does not seem to interfere with parallel builds, and parallel ctest works fine on that platform as well.

My conclusions are as follows:

* PLplot has no parallel build race conditions for the test_noninteractive and all targets that I can detect for the given timing conditions. This test is necessary but not sufficient to prove the absence of race conditions, so we are still really relying on our CMake code review here to keep us out of race condition trouble for all timing conditions.

* Cygwin, MSYS, and MSYS2 make all emit warnings for parallel builds with -j4 consisting of messages similar to

make: INTERNAL: Exiting with 3 jobserver tokens available; should be 4!

Until the issue that causes these warning messages is addressed (most likely a Windows make.exe bug, but it could also be a race condition for the given timings in each case), all parallel build results on Cygwin, MinGW/MSYS, and MinGW-w64/MSYS2 should be viewed with some suspicion.

* Parallel ctest hangs on MinGW-w64/MSYS2. It is currently not clear whether this is strictly a ctest issue or a byproduct of some parallel build issue that occurred. Therefore, more experimentation on this platform post-release, e.g., with a non-parallel build but parallel ctest, will be needed to help sort out the true cause of this issue.

Alan
__________________________
Alan W. Irwin

Astronomical research affiliation with Department of Physics and Astronomy, University of Victoria (astrowww.phys.uvic.ca).

Programming affiliations with the FreeEOS equation-of-state implementation for stellar interiors (freeeos.sf.net); the Time Ephemerides project (timeephem.sf.net); PLplot scientific plotting software package (plplot.sf.net); the libLASi project (unifont.org/lasi); the Loads of Linux Links project (loll.sf.net); and the Linux Brochure Project (lbproject.sf.net).
__________________________

Linux-powered Science
__________________________

_______________________________________________
Plplot-devel mailing list
Plplot-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/plplot-devel