I am rebooting the box and kicking out all the jobs until we figure this out.

Thanks!

Alex


On 2/8/2018 7:27 AM, Szilárd Páll wrote:
BTW, timeouts can be caused by contention from a stupid number of ranks/tMPI
threads hammering a single GPU (especially with 2 threads/core with HT),
but I'm not sure if the tests are ever executed with such a huge rank count.

--
Szilárd

On Thu, Feb 8, 2018 at 2:40 PM, Mark Abraham <mark.j.abra...@gmail.com>
wrote:

Hi,

On Thu, Feb 8, 2018 at 2:15 PM Alex <nedoma...@gmail.com> wrote:

Mark and Peter,

Thanks for commenting. I was told that all CUDA tests passed, but I will
double check on how many of those were actually run. Also, we never
rebooted the box after the CUDA install, and finally we had a bunch of
GROMACS (2016.4) jobs running, because we didn't want to interrupt a
postdoc's work... All of those were with -nb cpu though. Could those
factors have affected our regression tests?

Can't say. You observed timeouts, which could be consistent with drivers or
runtimes getting stuck. However, the other mdrun processes may have by
default set thread affinity, and any process that does that will interfere
with how effectively any others run, such as the tests. Sharing a node is
difficult to do well, and doing anything else with a node running GROMACS
is asking for trouble unless you have manually managed keeping the tasks
apart. Just don't.

Mark
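
The point about thread affinity can be checked directly on Linux. The following is a minimal sketch, not part of GROMACS: it assumes you already know the PIDs of the running mdrun jobs, and the helper names are hypothetical. It reports which pairs of processes are allowed to run on the same cores. Note that an unpinned process is allowed on every core, so an overlap only shows that the jobs are not being kept apart, not that they are actually contending.

```python
import os
from itertools import combinations


def overlapping_cores(affinities):
    """Map each pair of PIDs to the set of CPU ids both are allowed to use.

    `affinities` is a dict {pid: iterable of CPU ids}. Pairs whose sets do
    not intersect are omitted from the result.
    """
    clashes = {}
    for pid_a, pid_b in combinations(sorted(affinities), 2):
        shared = set(affinities[pid_a]) & set(affinities[pid_b])
        if shared:
            clashes[(pid_a, pid_b)] = shared
    return clashes


def read_affinities(pids):
    """Ask the kernel for each PID's allowed-CPU set (Linux only)."""
    return {pid: os.sched_getaffinity(pid) for pid in pids}
```

For example, overlapping_cores(read_affinities([pid1, pid2])) returns an empty dict when the two jobs are confined to disjoint cores.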


It will really suck, if these are hardware-related...

Thanks,

Alex


On 2/8/2018 3:03 AM, Mark Abraham wrote:
Hi,

Or leftovers of the drivers that are now mismatching. That has caused
timeouts for us.

Mark

On Thu, Feb 8, 2018 at 10:55 AM Peter Kroon <p.c.kr...@rug.nl> wrote:

Hi,


with changing failures like this I would start to suspect the hardware
as well. Mark's suggestion of looking at simpler test programs than GMX
is a good one :)


Peter


On 08-02-18 09:10, Mark Abraham wrote:
Hi,

That suggests that your new CUDA installation is differently incomplete.
Do its samples or test programs run?

Mark

On Thu, Feb 8, 2018 at 1:20 AM Alex <nedoma...@gmail.com> wrote:

Update: we seem to have had a hiccup with an orphan CUDA install and that
was causing issues. After wiping everything off and rebuilding, the errors
from the initial post disappeared. However, two tests failed during
regression:

95% tests passed, 2 tests failed out of 39

Label Time Summary:
GTest              = 170.83 sec (33 tests)
IntegrationTest    = 125.00 sec (3 tests)
MpiTest            =   4.90 sec (3 tests)
UnitTest           =  45.83 sec (30 tests)

Total Test time (real) = 1225.65 sec

The following tests FAILED:
     9 - GpuUtilsUnitTests (Timeout)
    32 - MdrunTests (Timeout)
Errors while running CTest
CMakeFiles/run-ctest-nophys.dir/build.make:57: recipe for target 'CMakeFiles/run-ctest-nophys' failed
make[3]: *** [CMakeFiles/run-ctest-nophys] Error 8
CMakeFiles/Makefile2:1160: recipe for target 'CMakeFiles/run-ctest-nophys.dir/all' failed
make[2]: *** [CMakeFiles/run-ctest-nophys.dir/all] Error 2
CMakeFiles/Makefile2:971: recipe for target 'CMakeFiles/check.dir/rule' failed
make[1]: *** [CMakeFiles/check.dir/rule] Error 2
Makefile:546: recipe for target 'check' failed
make: *** [check] Error 2

Any ideas? I can post the complete log, if needed.

Thank you,

Alex
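
When the failing tests change from run to run, as they do in this thread, it can help to pull the failed-test list out of each saved log and compare runs. A minimal sketch, assuming the log contains the "The following tests FAILED:" block in the format shown in the message above; the function name is hypothetical:

```python
import re

# Matches lines such as "    9 - GpuUtilsUnitTests (Timeout)".
FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\(([^)]+)\)\s*$")


def failed_tests(ctest_output):
    """Return (number, name, reason) tuples from a CTest summary.

    Reading starts after the "The following tests FAILED" header and stops
    at the first line that does not look like a failed-test entry.
    """
    results = []
    in_block = False
    for line in ctest_output.splitlines():
        if "The following tests FAILED" in line:
            in_block = True
            continue
        if in_block:
            match = FAILED_LINE.match(line)
            if match is None:
                break
            results.append((int(match.group(1)), match.group(2), match.group(3)))
    return results
```

A single failing test can then be rerun on its own from the build directory with ctest -R <name> --output-on-failure, which prints that test's output instead of only the summary.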
--
Gromacs Users mailing list

* Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
or
send a mail to gmx-users-requ...@gromacs.org.
