Yeah, I suspected linear dependence, but I was puzzled that the error occurred on one machine and not the other. Even on the machine where it failed, it failed for some runs and passed for others. That suggests the vector norm is almost zero in some cases (i.e., the runs that survive) and exactly zero in others (i.e., the runs that fail). I'll use -bv_orthog_block chol to see if the error persists.
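To make the "almost zero" versus "exactly zero" distinction concrete, here is a minimal sketch of a single Gram-Schmidt step in plain Python. It is not SLEPc's code: the helper names (dot, gs_step) and the test vectors are made up for illustration, and the division by zero is mimicked by hand because Python raises an exception where C would return inf.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def gs_step(q, v):
    """Remove from v its component along the unit vector q, then normalize.

    Returns (norm of the residual, normalized residual).
    """
    c = dot(q, v)
    r = [vi - c * qi for vi, qi in zip(v, q)]
    norm = math.sqrt(dot(r, r))
    # Python raises on 1.0/0.0, whereas IEEE 754 arithmetic in C gives inf.
    # Mimic the C behaviour so the non-finite values become visible.
    alpha = math.inf if norm == 0.0 else 1.0 / norm
    return norm, [alpha * ri for ri in r]

q = [1.0, 0.0, 0.0]

# Exactly dependent: the residual is exactly zero, so scaling gives 0 * inf = NaN.
norm_exact, v_exact = gs_step(q, [2.0, 0.0, 0.0])

# Nearly dependent: the residual is tiny but nonzero, so scaling stays finite.
norm_near, v_near = gs_step(q, [2.0, 1e-12, 0.0])

print("exact:", norm_exact, v_exact)
print("near: ", norm_near, v_near)
# A NaN never compares equal to itself, which is the kind of property a
# "same value on all processes" check can trip over.
print("NaN == NaN:", v_exact[0] == v_exact[0])
```

In the nearly dependent case the step completes, but the resulting direction is dominated by rounding noise, so a run that survives is not necessarily a run that is correct.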
Thanks a ton, Jose.

Regards,
Bikash

On Thu, Jan 28, 2016 at 5:18 AM, Jose E. Roman <[email protected]> wrote:

>
> > On 28 Jan 2016, at 9:13, Bikash Kanungo <[email protected]> wrote:
> >
> > Hi Jose,
> >
> > Here is the complete error message:
> >
> > [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > [0]PETSC ERROR: Invalid argument
> > [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
> > [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > [0]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
> > [0]PETSC ERROR: Unknown Name on a intel-openmpi_ib named comet-03-60.sdsc.edu by bikashk Thu Jan 28 00:09:17 2016
> > [0]PETSC ERROR: Configure options CFLAGS="-fPIC -xcore-avx2" FFLAGS="-fPIC -xcore-avx2" CXXFLAGS="-fPIC -xcore-avx2" --prefix=/opt/petsc/intel/openmpi_ib --with-mpi=true --download-pastix=../pastix_5.2.2.12.tar.bz2 --download-ptscotch=../scotch_6.0.0_esmumps.tar.gz --with-blas-lib="-Wl,--start-group /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_sequential.a /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm" --with-lapack-lib="-Wl,--start-group /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_sequential.a /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm" --with-superlu_dist-include=/opt/superlu/intel/openmpi_ib/include --with-superlu_dist-lib="-L/opt/superlu/intel/openmpi_ib/lib -lsuperlu" --with-parmetis-dir=/opt/parmetis/intel/openmpi_ib --with-metis-dir=/opt/parmetis/intel/openmpi_ib --with-mpi-dir=/opt/openmpi/intel/ib --with-scalapack-dir=/opt/scalapack/intel/openmpi_ib --download-mumps=../MUMPS_4.10.0-p3.tar.gz --download-blacs=../blacs-dev.tar.gz --download-fblaslapack=../fblaslapack-3.4.2.tar.gz --with-pic=true --with-shared-libraries=1 --with-hdf5=true --with-hdf5-dir=/opt/hdf5/intel/openmpi_ib --with-debugging=false
> > [0]PETSC ERROR: #1 BVScaleColumn() line 380 in /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvops.c
> > [0]PETSC ERROR: #2 BVOrthogonalize_GS() line 474 in /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvorthog.c
> > [0]PETSC ERROR: #3 BVOrthogonalize() line 535 in /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvorthog.c
> > [comet-03-60:27927] *** Process received signal ***
> > [comet-03-60:27927] Signal: Aborted (6)
> >
> >
>
> Here are some comments:
> - These kinds of errors appear only in debugging mode. I don't know why you
>   are getting them since you have --with-debugging=false
> - The flag -xcore-avx2 enables fused multiply-add (FMA) instructions,
>   which means you get slightly more accurate floating-point results. This
>   could explain why you get different behaviour with/without this flag.
> - The argument of BVScaleColumn() is guaranteed to be the same in all
>   processes, so the only explanation is that it has become a NaN. [Note that
>   in petsc-master (and hence petsc-3.7) NaN's no longer trigger this error.]
> - My conclusion is that your column vectors of the BV object are not
>   linearly independent, so eventually the vector norm is (almost) zero. The
>   error will appear only if the computed value is exactly zero.
>
> In summary: BVOrthogonalize() is new in SLEPc, and it is not very well
> tested. In particular, linearly dependent vectors are not handled well. For
> the next release I will add code to take into account rank-deficient BV's.
> In the meantime, you may want to try running with '-bv_orthog_block chol'
> (it uses a different orthogonalization algorithm).
>
> Jose
>
>

--
Bikash S. Kanungo
PhD Student
Computational Materials Physics Group
Mechanical Engineering
University of Michigan
