Unfortunately it is hard to know what is going on. A Google search on "Error while reading PMI socket" indicates that this message only means the MPI job failed; it is not specific about the cause. Some suggestions:
a) Try mpiexec (slightly different arguments); you just edit parallel_options (a sketch is given below). https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
b) Try an older version of mvapich2 if it is on the system.
c) Do you have to launch mpdboot for your system? https://wiki.calculquebec.ca/w/MVAPICH2/en
d) Talk to a sys-admin, particularly the one who set up mvapich.
e) Do "cat *.error"; maybe something else went wrong, or it is not mpi's fault but a user error.
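For (a), as a minimal sketch only: assuming Hydra's mpiexec is installed alongside your current launcher (the path below is guessed from your present WIEN_MPIRUN and may differ on your cluster), the line in parallel_options would become something like

setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -n _NP_ -f _HOSTS_ _EXEC_"

Hydra's mpiexec takes -n (or -np) for the process count and -f for the host file, whereas the launcher in your error trace, mpirun_rsh, uses -np and -hostfile; hence the "slightly different arguments". Check "mpiexec -help" on your system before trusting these flags.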
___________________________
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody else has thought"
Albert Szent-Gyorgi

On Apr 28, 2015 10:17 PM, "lung Fermin" <[email protected]> wrote:

> Thanks for Prof. Marks' comment.
>
> 1. In the previous email, I missed copying the line
>
> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
>
> It was in the parallel_options. Sorry about that.
>
> 2. I have checked that the running program was lapw1c_mpi. Besides, when the mpi calculation was done on a single node for some other system, the results are consistent with the literature. So I believe that the mpi code has been set up and compiled properly.
>
> Would there be something wrong with my options in siteconfig? Do I have to set some command to bind the job? Any other possible cause of the error?
>
> Any suggestions or comments would be appreciated. Thanks.
>
> Regards,
> Fermin
>
> ----------------------------------------------------------------------------------------------------
>
> You appear to be missing the line
>
> setenv WIEN_MPIRUN=...
>
> This is set up when you run siteconfig, and provides the information on how mpi is run on your system.
>
> N.B., did you set up and compile the mpi code?
>
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what nobody else has thought"
> Albert Szent-Gyorgi
>
> On Apr 28, 2015 4:22 AM, "lung Fermin" <[email protected]> wrote:
>
> > Dear Wien2k community,
> >
> > I am trying to perform a calculation on a system of ~100 in-equivalent atoms using mpi + k-point parallelization on a cluster. Everything goes fine when the program is run on a single node. However, if I perform the calculation across different nodes, the following error occurs. How to solve this problem? I am a newbie to mpi programming, any help would be appreciated. Thanks.
> >
> > The error message (MVAPICH2 2.0a):
> > ---------------------------------------------------------------------------------------------------
> > Warning: no access to tty (Bad file descriptor).
> > Thus no job control in this shell.
> > z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> > number of processors: 32
> > LAPW0 END
> > [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13 aborted: Error while reading a PMI socket (4)
> > [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546) terminated with signal 9 -> abort job
> > [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
> > [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12. MPI process died?
> > [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454) terminated with signal 9 -> abort job
> > [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2 aborted: MPI process error (1)
> > [cli_15]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15
> >
> > stop error
> > ------------------------------------------------------------------------------------------------------
> >
> > The .machines file:
> > #
> > 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> > 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> > granularity:1
> > extrafine:1
> > --------------------------------------------------------------------------------------------------------
> > The parallel_options:
> >
> > setenv TASKSET "no"
> > setenv USE_REMOTE 0
> > setenv MPI_REMOTE 1
> > setenv WIEN_GRANULARITY 1
> >
> > --------------------------------------------------------------------------------------------------------
> >
> > Thanks.
> >
> > Regards,
> > Fermin

