A guess: you are linking the wrong version of BLACS. You need -lmkl_blacs_intelmpi_XX, where "XX" matches your system (lp64 or ilp64). I have seen the wrong BLACS give exactly this error.
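Concretely, a sketch of the ScaLAPACK/BLACS part of the link line, assuming the 64-bit LP64 interface and a recent MKL (check the exact library names against your own install):

  -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64   (Intel MPI)
  -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64    (Open MPI)
  -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64     (SGI MPT)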
Use http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/
For reference, with openmpi it is _openmpi_ instead of _intelmpi_, and
similarly for sgi.

2012/1/22 Paul Fons <paul-fons at aist.go.jp>:
>
> Hi,
> I have Wien2K running on a cluster of linux boxes, each with 32 cores and
> connected by 10Gb ethernet. I have compiled Wien2K with the 3.174 version
> of the compiler (I learned the hard way that bugs in the newer versions of
> the Intel compiler lead to crashes in Wien2K). I have also installed
> Intel's MPI. First, single-process Wien2K, let's say for the TiC case,
> works fine. It also works fine when I use a .machines file like
>
> granularity:1
> localhost:1
> localhost:1
>    (24 times)
>
> This file leads to parallel execution without error. I can vary the number
> of processes by changing the number of localhost:1 lines in the file, and
> everything still works fine. When I try to use MPI with a single process,
> it works as well:
>
> 1:localhost:1
>
> starting parallel lapw1 at Mon Jan 23 06:49:16 JST 2012
>
> -> starting parallel LAPW1 jobs at Mon Jan 23 06:49:16 JST 2012
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 22417
> LAPW1 END
> [1] + Done ( cd $PWD; $t $exe ${def}_$loop.def; rm
> -f .lock_$lockfile[$p] ) >> .time1_$loop
> localhost(111) 179.004u 4.635s 0:32.73 561.0% 0+0k 0+26392io 0pf+0w
> Summary of lapw1para:
> localhost k=111 user=179.004 wallclock=32.73
> 179.167u 4.791s 0:35.61 516.5% 0+0k 0+26624io 0pf+0w
>
> Changing the .machines file to use more than one MPI process (the same
> form of error occurs for more than 2),
>
> 1:localhost:2
>
> leads to a run-time error in the MPI subsystem:
>
> starting parallel lapw1 at Mon Jan 23 06:51:04 JST 2012
> -> starting parallel LAPW1 jobs at Mon Jan 23 06:51:04 JST 2012
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 22673
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7ed20c) failed
> MPI_Comm_size(76).: Invalid communicator
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7ed20c) failed
> MPI_Comm_size(76).: Invalid communicator
> [1] + Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
> localhost localhost(111) APPLICATION TERMINATED WITH THE EXIT STRING:
> Hangup (signal 1)
> 0.037u 0.036s 0:00.06 100.0% 0+0k 0+0io 0pf+0w
> TiC.scf1_1: No such file or directory.
> Summary of lapw1para:
> localhost k=0 user=111 wallclock=0
> 0.105u 0.168s 0:03.21 8.0% 0+0k 0+216io 0pf+0w
>
> I have properly sourced the appropriate Intel runtime environment. For
> example, compiling (mpiifort) and running the Fortran 90 MPI test program
> from Intel produces:
>
> mpirun -np 32 /home/paulfons/mpitest/testf90
>  Hello world: rank            0  of           32  running on  asccmp177
>  Hello world: rank            1  of           32  running on    (32 times)
>
> Does anyone have any suggestions as to what to try next? I am not sure how
> to debug things from here. I have about 512 nodes that I can use for
> larger calculations, which can only be accessed via MPI (the ssh setup
> works fine as well, by the way). It would be great to figure out what is
> wrong.
>
> Thanks.
>
> Dr. Paul Fons
> Functional Nano-phase-change Research Team
> Team Leader
> Nanodevice Innovation Research Center (NIRC)
> National Institute for Advanced Industrial Science & Technology
> METI
>
> AIST Central 4, Higashi 1-1-1
> Tsukuba, Ibaraki JAPAN 305-8568
>
> tel. +81-298-61-5636
> fax. +81-298-61-2939
>
> email: paul-fons at aist.go.jp
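To confirm which BLACS actually ended up in your mpi binaries, something
along these lines should work (a sketch, assuming the usual $WIENROOT
layout; if MKL was linked statically the ldd check will show nothing and
you have to look at the Makefile instead):

  ldd $WIENROOT/lapw1_mpi | grep -i blacs       # dynamically linked case
  grep -i blacs $WIENROOT/SRC_lapw1/Makefile    # libraries siteconfig wrote into the Makefile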
--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
1-847-491-3996
"Research is to see what everybody else has seen, and to think what nobody
else has thought" Albert Szent-Gyorgi