There are parallel build issues on the MinGW/MSYS, Cygwin, and
MinGW-w64/MSYS2 Windows platforms (see below), so I have looked for
race conditions caused by ill-considered CMake logic in our build
system concerning target and file dependencies.  I have not found
any instances of that (although the timing conditions of my tests
may just not be right to trigger such a race condition).

An example of such ill-considered logic is two separate custom
targets that file-depend directly or indirectly on the same custom
command. Under these conditions, the custom command gets executed once
for each target, which is normally not an issue for non-parallel builds
or for parallel builds where the two custom targets happen to be executed
at separate times. However, when by chance the two custom targets are
executed at the same time in a parallel build, the resulting race
condition will cause substantial build issues.  We have lots of custom
commands and custom targets in our build so historically we have had a
lot of such race conditions to deal with.  But as far as I know (hah!)
CMake code review of the various custom commands and targets that are
generated has by now found them all.
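
To make the pattern concrete, here is a minimal sketch that writes a
tiny demonstration project.  Everything in it (the race_demo directory,
the risky_* and safe_* target names, and the generated file names) is
hypothetical and is not taken from the PLplot build system.  The first
pair of targets shows the risky layout described above; the second pair
shows the usual remedy, where only one custom target file-depends on the
OUTPUT and the other target depends on that target instead.

```shell
# Write a small demonstration project; all names are hypothetical.
mkdir -p race_demo
cat > race_demo/CMakeLists.txt <<'EOF'
# Adjust the minimum version to suit the CMake you are testing with.
cmake_minimum_required(VERSION 3.5)
project(race_demo NONE)

# Risky: two independent custom targets file-depend on the same OUTPUT,
# so a parallel build may run this custom command twice at the same time.
add_custom_command(
  OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/risky.txt
  COMMAND ${CMAKE_COMMAND} -E touch ${CMAKE_CURRENT_BINARY_DIR}/risky.txt
)
add_custom_target(risky_a ALL DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/risky.txt)
add_custom_target(risky_b ALL DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/risky.txt)

# Safer: one custom target owns the OUTPUT, and the other target depends
# on that target, so the custom command belongs to a single target.
add_custom_command(
  OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/safe.txt
  COMMAND ${CMAKE_COMMAND} -E touch ${CMAKE_CURRENT_BINARY_DIR}/safe.txt
)
add_custom_target(safe_a ALL DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/safe.txt)
add_custom_target(safe_b ALL)
add_dependencies(safe_b safe_a)
EOF
```

Configuring race_demo in an empty build tree and building it with a
parallel make should show whether "Generating risky.txt" is reported
more than once, although (as with the real build) that depends on
timing.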

Furthermore, here is a run-time test to discover whether any such race
conditions are occurring.  N.B. such race conditions critically depend
on timing (e.g., the N value used in the -jN make option, the hardware
platform, etc.) so this is a necessary but not sufficient test that we
have gotten rid of all such race conditions via CMake code review.

It turns out that even if race conditions that occur for a given
timing environment do not cause obvious build issues, they are still
easy to spot because each time a custom command is executed at build
time (whenever its OUTPUT files are non-existent or older than its
file depends), there is a corresponding build message

Generating <list of OUTPUT files from the custom command>.

So that message will be duplicated if two separate custom targets
attempt to build the custom command at the same time, i.e., whenever
a race condition occurs for the given timing regardless of the symptoms
(or lack of symptoms) from such a race.

I have created the following set of commands to test for
duplicate messages of that form for both the test_noninteractive target
and the all target

# test_noninteractive target

# Do this test for an absolutely fresh build, i.e., cmake with
# -DBUILD_TEST=ON option in an initially empty build tree followed by
make -j4 test_noninteractive >& test_noninteractive.out
# Find essentially all occurrences of commands being run by looking for
# anything with "ing" with some exceptions.  Sort the result once
# without duplicates being eliminated and sort once with duplicates
# eliminated.
grep ing test_noninteractive.out |\
sed -e 's?\[.*\] ??g' |\
grep -vE 'Scanning|Differing|Missing' |\
grep -v bindings |\
grep -v "using command" |\
grep -v "Testing" |\
grep -v "switching to" |\
sort -u >| unique_builds.out

grep ing test_noninteractive.out |\
sed -e 's?\[.*\] ??g' |\
grep -vE 'Scanning|Differing|Missing' |\
grep -v bindings |\
grep -v "using command" |\
grep -v "Testing" |\
grep -v "switching to" |\
sort >| builds.out

# Compare the two results to see whether there is anything extra in
# builds.out compared to unique_builds.out
diff -au unique_builds.out builds.out |less
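
A more compact variant of the same check can be sketched as below.  It
is narrower than the commands above because it looks only at
"Generating" messages rather than at every line containing "ing", and
it assumes the log lines carry the usual "[ NN%] " progress prefix.  The
function name duplicate_generating is a hypothetical helper, not
something that exists in the PLplot tree.

```shell
# Hypothetical helper: list "Generating ..." messages that appear more
# than once in a captured make log, each preceded by its count.
duplicate_generating() {
    # $1 is the captured make output, e.g. test_noninteractive.out
    grep 'Generating' "$1" |
        sed -e 's/^\[[^]]*\] //' |  # strip the "[ NN%] " progress prefix
        sort |
        uniq -c |                   # prefix each distinct line with its count
        awk '$1 > 1'                # keep only lines seen more than once
}
```

For example, "duplicate_generating test_noninteractive.out" prints
nothing when every "Generating" message is unique.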

It turns out there is some extra data consisting of
duplicate instances of

Generating tclIndex

and duplicate instances of

Generating x00

and similarly for almost every other x?? file.

These all turned out to be false alarms.  The two tclIndex files are
actually created by distinct custom commands in bindings/tcl and
examples/tcl. And similarly the x?? files are actually created by
distinct custom commands in examples/python and examples/tcl.
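
One way to check such false alarms is to list where files with the
duplicated names actually live in the build tree.  The sketch below
assumes the generated files keep the names shown in the "Generating"
messages; locate_outputs is a hypothetical helper name.

```shell
# Hypothetical helper: for each file name given, list every regular file
# with that name under a build tree, so that duplicate messages can be
# attributed to distinct directories.
locate_outputs() {
    build_tree=$1
    shift
    for name in "$@"; do
        printf '%s:\n' "$name"
        find "$build_tree" -type f -name "$name" | sort
    done
}
```

Run from the top of the build tree as, e.g., "locate_outputs . tclIndex
x00".  A name that is found in two different directories is consistent
with two distinct custom commands rather than one command run twice.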

The one peculiarity that I found from the above test is that

Generating x01
Generating x25
Generating x26

did not have duplicate instances.  However, I double checked that two
files were created in each case (one in examples/tcl and one in
examples/python) so the conclusion must be I have turned up a minor
bug in CMake (3.0.2) where it does not always emit a

Generating <filename>

message when a custom command generates an OUTPUT file at run time.

# all target

I also did the equivalent test for the "all" target starting from
an initially empty build tree with similar good results concerning
the lack of race conditions for the given timing conditions.

There were duplicate messages for

plgrid.tcl, plplot.tcl, tclIndex (4 instances instead of the 2 that
occurred for the test_noninteractive target), x??, and x??.tcl.

In all cases, further analysis showed these were all false alarms due to a
file with the same name being created by different custom commands in
different directories.

So my conclusion is there is no evidence of parallel build race
conditions for the "test_noninteractive" or "all" targets in the core
build tree for the given timing conditions which is a limited but
still satisfactory conclusion. Note at some point I should do a
similar test for the test_interactive target in the build tree and
also for the all, test_noninteractive, and test_interactive targets
for the installed examples, but I don't have time for that now.

The big motivation for the above experiment to look for parallel race
conditions is the following parallel build issues that show up in
all our current comprehensive tests on various Windows platforms:

1. For my comprehensive test on MinGW/MSYS in the last release using
parallel builds with -j4, I ran into the following message intermittently (i.e.,
just once, and the repeat test did not encounter the problem)

make.exe error: "INTERNAL: Exiting with 1 jobserver tokens available;
should be 4!"

That, along with the historical MSYS make.exe hangs for parallel builds (see
<https://sourceforge.net/p/mingw/bugs/1950/>), led me to the
conclusion that for the most reliable results you should never use
parallel builds on the MinGW/MSYS platform. Arjen's recent test of
MinGW/MSYS followed that recommendation and (probably as a result) did
not have any build or ctest problems.  However, the above bug report
now claims the hang issue has finally been solved so it might be
worth it to experimentally try parallel builds on this platform when
we test it again some time in the next release cycle.

2. Greg Jung has also recently encountered a similar warning message
for -j4:

make: INTERNAL: Exiting with 3 jobserver tokens available; should be 4!

on MinGW-w64/MSYS2.  That build warning was accompanied by parallel
ctest just hanging until Greg killed the job.  Since then, Greg has had
no further trouble with the issue if he uses the

--ctest_command "ctest"
--build_command "make"
--traditional_build_command "make"

options to the comprehensive_test.sh script, i.e., drops both the
parallel build and parallel ctest on the MinGW-w64/MSYS2 platform.

3. Both Greg and Arjen have experienced the

make: INTERNAL: Exiting with 3 jobserver tokens available; should be 4!

warning on Cygwin.  However, that does not seem to interfere with
parallel builds, and parallel ctest works fine on that platform as well.

My conclusions are as follows:

* PLplot has no parallel build race conditions for the
   test_noninteractive and all targets that I can detect for the given
   timing conditions.  Passing this test is necessary but not
   sufficient to prove there are no race conditions, so we are still
   really relying on our CMake code review here to keep us out of race
   condition trouble for all timing conditions.

* Cygwin, MSYS, and MSYS2 make all emit warnings for parallel builds
   with -j4 consisting of messages similar to

make: INTERNAL: Exiting with 3 jobserver tokens available; should be 4!

Until the issue (most likely a Windows make.exe bug, but it could also
be a race condition for the given timings in each case) that causes
these warning messages is addressed, all parallel build results on
Cygwin, MinGW/MSYS, and MinGW-w64/MSYS2 should be viewed with some
suspicion.

* Parallel ctest hangs on MinGW-w64/MSYS2.  It is currently not clear
if this is strictly a ctest issue or a byproduct of some parallel
build issue that occurred.  Therefore, more experimentation on this
platform post-release, e.g., with non-parallel build but parallel
ctest, will be needed to help sort out the true cause of this issue.

Alan
__________________________
Alan W. Irwin

Astronomical research affiliation with Department of Physics and Astronomy,
University of Victoria (astrowww.phys.uvic.ca).

Programming affiliations with the FreeEOS equation-of-state
implementation for stellar interiors (freeeos.sf.net); the Time
Ephemerides project (timeephem.sf.net); PLplot scientific plotting
software package (plplot.sf.net); the libLASi project
(unifont.org/lasi); the Loads of Linux Links project (loll.sf.net);
and the Linux Brochure Project (lbproject.sf.net).
__________________________

Linux-powered Science
__________________________
