Christof,

Don't use "-ffpe-trap=invalid,zero,overflow" on the pdlaiect.f file. This
file implements checks for special corner cases (division by NaN and by 0)
and will always trigger the trap if you enable FPE trapping.
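
If you still want FPE trapping on the rest of the library, a rough sketch
(assuming a gfortran build of the ScaLAPACK trunk sources; the exact file
path may differ in your tree) is to compile everything with the trap flags
except that one file, which relies on Inf/NaN arithmetic:

    # everything else: FCFLAGS="-O1 -g -ffpe-trap=invalid,zero,overflow"
    # pdlaiect.f only: compile without the trap flags
    gfortran -O1 -g -c SRC/pdlaiect.f -o SRC/pdlaiect.o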

I talked with some of the ScaLAPACK developers, and their suspicion is
that this looks like an uninitialized local variable somewhere. Because
different MPI versions do different memory allocations and stack
manipulations, the local variables might inherit different values, and in
some cases these values might be erroneous (in your example the printed
values should not be NaN).
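
One way to smoke out such a variable (just a sketch, using standard gfortran
and valgrind options; adjust to your build system) is to rebuild the tester
with poisoned initial values, so an uninitialized variable fails the same way
every run instead of depending on what the MPI library left on the stack:

    # rebuild ScaLAPACK and the testers with, e.g.:
    #   FCFLAGS="-O0 -g -finit-real=snan -finit-integer=-99999999 -fcheck=all -Wall"
    # and/or run one failing case under valgrind:
    mpirun -np 4 valgrind --track-origins=yes ./xdsyevr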

Moreover, the eigenvalue tester is a pretty sensitive piece of code. I
would strongly suggest you send an email with your findings to the
ScaLAPACK mailing list.

  George.



On Wed, Nov 23, 2016 at 9:41 AM, Christof Koehler <
christof.koeh...@bccms.uni-bremen.de> wrote:

> Hello everybody,
>
> As promised, I started testing on my laptop (which has only two physical
> cores, in case that matters).
>
> As I discovered, the story is not as simple as I assumed. I was focusing
> on xdsyevr when testing on the workstation and overlooked the others.
>
> On the cluster the only test which throws errors is xdsyevr with 2.0.1.
> With 1.10.4 everything is fine; I have double-checked this by now.
>
> On the workstation I get "136 tests completed and failed." in xcheevr
> with 1.10.4, which I had overlooked. With 2.0.1 I get "136 tests
> completed and failed" in xdsyevr and xssyevr.
>
> On the laptop I am not sure yet; I ran out of battery power. But it
> looked similar to the workstation: failures with both versions.
>
> So, there is certainly a factor unrelated to OpenMPI. It might even be
> that these failures are complete noise. I will try to investigate this
> further. If some list member has a good idea how to test and what to
> look for, I would appreciate a hint. Also, perhaps someone could try to
> replicate this.
>
> Thank you for your help so far.
>
> Best Regards
>
> Christof
>
>
> On Tue, Nov 22, 2016 at 10:35:57PM +0900, Gilles Gouaillardet wrote:
> > Christof,
> >
> > out of curiosity, could you try
> > mpirun --mca coll ^tuned ...
> > and see if it helps?
> >
> > Cheers,
> >
> > Gilles
> >
> >
> > On Tue, Nov 22, 2016 at 7:21 PM, Christof Koehler
> > <christof.koeh...@bccms.uni-bremen.de> wrote:
> > > Hello again,
> > >
> > > I tried to replicate the situation on the workstation at my desk,
> > > running Ubuntu 14.04 (gcc 4.8.4) and with the OS-supplied LAPACK and
> > > BLAS libraries.
> > >
> > > With OpenMPI 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed
> > > and failed." with the "IL, IU, VL or VU altered by PDSYEVR" message
> > > but reasonable-looking numbers as described before.
> > >
> > > With 1.10 I get "136 tests completed and passed residual checks."
> > > instead, as observed before.
> > >
> > > So this is likely not an Omni-Path problem but something else in 2.0.1.
> > >
> > > I should perhaps clarify that I am using the current revision 206
> > > from the ScaLAPACK trunk (svn co
> > > https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk), but if I
> > > remember correctly I had very similar problems with the 2.0.2
> > > release tarball.
> > >
> > > Both MPIs were built with
> > > ./configure --with-hwloc=internal --enable-static
> > > --enable-orterun-prefix-by-default
> > >
> > >
> > > Best Regards
> > >
> > > Christof
> > >
> > > On Fri, Nov 18, 2016 at 11:25:06AM -0700, Howard Pritchard wrote:
> > >> Hi Christof,
> > >>
> > >> Thanks for trying out 2.0.1.  Sorry that you're hitting problems.
> > >> Could you try to run the tests using the 'ob1' PML in order to
> > >> bypass PSM2?
> > >>
> > >> mpirun --mca pml ob1 (all the rest of the args)
> > >>
> > >> and see if you still observe the failures?
> > >>
> > >> Howard
> > >>
> > >>
> > >> 2016-11-18 9:32 GMT-07:00 Christof Köhler <
> > >> christof.koeh...@bccms.uni-bremen.de>:
> > >>
> > >> > Hello everybody,
> > >> >
> > >> > I am observing failures in the xdsyevr (and xssyevr) ScaLAPACK
> > >> > self tests when running on one or two nodes with OpenMPI 2.0.1.
> > >> > With 1.10.4 no failures are observed. Also, with MVAPICH2 2.2 no
> > >> > failures are observed. The other testers appear to be working
> > >> > with all MPIs mentioned (have to triple check again). I somehow
> > >> > overlooked the failures below at first.
> > >> >
> > >> > The system is an Intel Omni-Path system (newest Intel driver
> > >> > release 10.2), i.e. we are using the PSM2 MTL, I believe.
> > >> >
> > >> > I built the OpenMPIs with gcc 6.2 and the following identical
> > >> > options:
> > >> > ./configure  FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
> > >> > --with-psm2 --with-tm --with-hwloc=internal --enable-static
> > >> > --enable-orterun-prefix-by-default
> > >> >
> > >> > The ScaLAPACK build is also with gcc 6.2 and OpenBLAS 0.2.19,
> > >> > using "-O1 -g" as FCFLAGS and CCFLAGS, identical for all tests;
> > >> > only the wrapper compiler changes.
> > >> >
> > >> > With OpenMPI 1.10.4 I see on a single node
> > >> >
> > >> >  mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > >> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> > >> > ./xdsyevr
> > >> > 136 tests completed and passed residual checks.
> > >> >     0 tests completed without checking.
> > >> >     0 tests skipped for lack of memory.
> > >> >     0 tests completed and failed.
> > >> >
> > >> > With OpenMPI 1.10.4 I see on two nodes
> > >> >
> > >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > >> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> > >> > ./xdsyevr
> > >> >   136 tests completed and passed residual checks.
> > >> >     0 tests completed without checking.
> > >> >     0 tests skipped for lack of memory.
> > >> >     0 tests completed and failed.
> > >> >
> > >> > With OpenMPI 2.0.1 I see on a single node
> > >> >
> > >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > >> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> > >> > ./xdsyevr
> > >> > 32 tests completed and passed residual checks.
> > >> >     0 tests completed without checking.
> > >> >     0 tests skipped for lack of memory.
> > >> >   104 tests completed and failed.
> > >> >
> > >> > With OpenMPI 2.0.1 I see on two nodes
> > >> >
> > >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > >> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> > >> > ./xdsyevr
> > >> >    32 tests completed and passed residual checks.
> > >> >     0 tests completed without checking.
> > >> >     0 tests skipped for lack of memory.
> > >> >   104 tests completed and failed.
> > >> >
> > >> > A typical failure looks like this in the output
> > >> >
> > >> > IL, IU, VL or VU altered by PDSYEVR
> > >> >    500   1   1   1   8   Y     0.26    -1.00  0.19E-02   15.    FAILED
> > >> >    500   1   2   1   8   Y     0.29    -1.00  0.79E-03   3.9    PASSED   EVR
> > >> > IL, IU, VL or VU altered by PDSYEVR
> > >> >    500   1   1   2   8   Y     0.52    -1.00  0.82E-03   2.5    FAILED
> > >> >    500   1   2   2   8   Y     0.41    -1.00  0.79E-03   2.3    PASSED   EVR
> > >> >    500   2   2   2   8   Y     0.18    -1.00  0.78E-03   3.0    PASSED   EVR
> > >> > IL, IU, VL or VU altered by PDSYEVR
> > >> >    500   4   1   4   8   Y     0.09    -1.00  0.95E-03   4.1    FAILED
> > >> >    500   4   4   1   8   Y     0.11    -1.00  0.91E-03   2.8    PASSED   EVR
> > >> >
> > >> >
> > >> > The variable OMP_NUM_THREADS=1 is set to stop OpenBLAS from
> > >> > threading. We see similar problems with the Intel 2016 compilers,
> > >> > but I believe gcc is a good baseline.
> > >> >
> > >> > Any ideas? For us this is a real problem in that we do not know
> > >> > whether this indicates a network (transport) issue in the Intel
> > >> > software stack (libpsm2, hfi1 kernel module) which might affect
> > >> > our production codes, or whether this is an OpenMPI issue. We
> > >> > have some other problems I might ask about later on this list,
> > >> > but nothing which yields such a nice reproducer, and these other
> > >> > problems might well be application related.
> > >> >
> > >> > Best Regards
> > >> >
> > >> > Christof
> > >> >
> > >> > --
> > >> > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > >> > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > >> > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > >> > 28359 Bremen
> > >> >
> > >> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> > >> >
> > >
> > > --
> > > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > > 28359 Bremen
> > >
> > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> > >
>
> --
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
