BTW, timeouts can be caused by contention from a stupid number of ranks/tMPI threads hammering a single GPU (especially with 2 threads/core with HT), but I'm not sure the tests are ever executed with such a huge rank count.
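[Editorial sketch] Szilárd's point can be made concrete with a quick back-of-the-envelope count; the core/GPU numbers below are made-up assumptions for illustration, not probed from the machine in question:

```shell
# Rough sketch of the oversubscription described above: with HT enabled,
# every hardware thread can become a rank/tMPI thread, and with a single
# GPU in the box they all contend for it.
ncores=16            # physical cores (assumption)
threads_per_core=2   # 2 with Hyper-Threading enabled
ngpus=1              # the tests run against a single GPU
total_threads=$((ncores * threads_per_core))
ranks_per_gpu=$((total_threads / ngpus))
echo "up to ${ranks_per_gpu} ranks could contend for one GPU"
```

With these numbers a naive launch could put 32 ranks on one GPU, which is the kind of contention that can push a GPU test past its timeout.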
-- Szilárd

On Thu, Feb 8, 2018 at 2:40 PM, Mark Abraham <mark.j.abra...@gmail.com> wrote:
> Hi,
>
> On Thu, Feb 8, 2018 at 2:15 PM Alex <nedoma...@gmail.com> wrote:
>
> > Mark and Peter,
> >
> > Thanks for commenting. I was told that all CUDA tests passed, but I will
> > double-check how many of those were actually run. Also, we never
> > rebooted the box after the CUDA install, and finally we had a bunch of
> > GROMACS (2016.4) jobs running, because we didn't want to interrupt the
> > postdoc's work... All of those were with -nb cpu, though. Could those
> > factors have affected our regression tests?
>
> Can't say. You observed timeouts, which could be consistent with drivers
> or runtimes getting stuck. However, the other mdrun processes may have
> set thread affinity by default, and any process that does that will
> interfere with how effectively any others run, such as the tests. Sharing
> a node is difficult to do well, and doing anything else with a node
> running GROMACS is asking for trouble unless you have manually kept the
> tasks apart. Just don't.
>
> Mark
>
> > It will really suck if these are hardware-related...
> >
> > Thanks,
> >
> > Alex
> >
> > On 2/8/2018 3:03 AM, Mark Abraham wrote:
> > > Hi,
> > >
> > > Or leftovers of the drivers that are now mismatching. That has caused
> > > timeouts for us.
> > >
> > > Mark
> > >
> > > On Thu, Feb 8, 2018 at 10:55 AM Peter Kroon <p.c.kr...@rug.nl> wrote:
> > >
> > >> Hi,
> > >>
> > >> With changing failures like this I would start to suspect the
> > >> hardware as well. Mark's suggestion of looking at simpler test
> > >> programs than GMX is a good one :)
> > >>
> > >> Peter
> > >>
> > >> On 08-02-18 09:10, Mark Abraham wrote:
> > >>> Hi,
> > >>>
> > >>> That suggests that your new CUDA installation is differently
> > >>> incomplete. Do its samples or test programs run?
> > >>>
> > >>> Mark
> > >>>
> > >>> On Thu, Feb 8, 2018 at 1:20 AM Alex <nedoma...@gmail.com> wrote:
> > >>>
> > >>>> Update: we seem to have had a hiccup with an orphan CUDA install
> > >>>> and that was causing issues. After wiping everything off and
> > >>>> rebuilding, the errors from the initial post disappeared. However,
> > >>>> two tests failed during regression:
> > >>>>
> > >>>> 95% tests passed, 2 tests failed out of 39
> > >>>>
> > >>>> Label Time Summary:
> > >>>> GTest           = 170.83 sec (33 tests)
> > >>>> IntegrationTest = 125.00 sec (3 tests)
> > >>>> MpiTest         =   4.90 sec (3 tests)
> > >>>> UnitTest        =  45.83 sec (30 tests)
> > >>>>
> > >>>> Total Test time (real) = 1225.65 sec
> > >>>>
> > >>>> The following tests FAILED:
> > >>>>      9 - GpuUtilsUnitTests (Timeout)
> > >>>>     32 - MdrunTests (Timeout)
> > >>>> Errors while running CTest
> > >>>> CMakeFiles/run-ctest-nophys.dir/build.make:57: recipe for target
> > >>>> 'CMakeFiles/run-ctest-nophys' failed
> > >>>> make[3]: *** [CMakeFiles/run-ctest-nophys] Error 8
> > >>>> CMakeFiles/Makefile2:1160: recipe for target
> > >>>> 'CMakeFiles/run-ctest-nophys.dir/all' failed
> > >>>> make[2]: *** [CMakeFiles/run-ctest-nophys.dir/all] Error 2
> > >>>> CMakeFiles/Makefile2:971: recipe for target
> > >>>> 'CMakeFiles/check.dir/rule' failed
> > >>>> make[1]: *** [CMakeFiles/check.dir/rule] Error 2
> > >>>> Makefile:546: recipe for target 'check' failed
> > >>>> make: *** [check] Error 2
> > >>>>
> > >>>> Any ideas? I can post the complete log, if needed.
> > >>>>
> > >>>> Thank you,
> > >>>>
> > >>>> Alex
> > >>>> --
> > >>>> Gromacs Users mailing list
> > >>>>
> > >>>> * Please search the archive at
> > >>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > >>>> posting!
> > >>>>
> > >>>> * Can't post?
> > >>>> Read http://www.gromacs.org/Support/Mailing_Lists
> > >>>>
> > >>>> * For (un)subscribe requests visit
> > >>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > >>>> or send a mail to gmx-users-requ...@gromacs.org.
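[Editorial note] For anyone reproducing the failure above, the two timed-out tests can be re-run in isolation rather than repeating the whole `make check`; a minimal sketch using standard ctest options (the build-directory path is an assumption about your tree):

```shell
# Re-run only the two failed GROMACS tests, from the CMake build directory.
# -R selects tests whose names match the regex; --output-on-failure prints
# the full log of any test that fails; --timeout raises the per-test limit
# in case the timeouts were marginal.
cd ~/gromacs/build   # adjust to your build directory (assumption)
ctest -R 'GpuUtilsUnitTests|MdrunTests' --output-on-failure --timeout 600
```

Running them alone, on an otherwise idle node, helps separate genuine GPU/driver problems from contention with other jobs.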