A guess: you are linking against the wrong version of BLACS. You need a
-lmkl_blacs_intelmpi_XX
where "XX" is the correct suffix for your system. I have seen this give the same error.

Use http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

For reference, with Open MPI it is _openmpi_ instead of _intelmpi_, and
similarly for SGI MPT.
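
For illustration only (lp64 interface, typical MKL library names; take the
exact line the advisor prints for your MKL version), the ScaLAPACK/BLACS part
of the parallel link line might look like

  Intel MPI:
    -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 \
    -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
  Open MPI:
    -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_intel_lp64 \
    -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

i.e. only the BLACS library changes with the MPI flavor.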

2012/1/22 Paul Fons <paul-fons at aist.go.jp>:
>
> Hi,
> I have Wien2K running on a cluster of Linux boxes, each with 32 cores and
> connected by 10Gb ethernet. I compiled Wien2K with the 3.174 version of the
> Intel compiler (I learned the hard way that bugs in the newer versions of the
> Intel compiler lead to crashes in Wien2K). I have also installed Intel's MPI.
> First, single-process Wien2K, let's say for the TiC case, works fine.
> It also works fine when I use a .machines file like
>
> granularity:1
> localhost:1
> localhost:1
>    (24 times).
>
> This file leads to parallel execution without error. I can vary the number
> of processes by changing the number of localhost:1 lines in the file, and
> everything still works fine. When I try to use MPI to communicate with one
> process, it works as well.
>
> 1:localhost:1
>
> starting parallel lapw1 at Mon Jan 23 06:49:16 JST 2012
>
> ->  starting parallel LAPW1 jobs at Mon Jan 23 06:49:16 JST 2012
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 22417
>  LAPW1 END
> [1]  + Done                          ( cd $PWD; $t $exe ${def}_$loop.def; rm
> -f .lock_$lockfile[$p] ) >> .time1_$loop
>      localhost(111) 179.004u 4.635s 0:32.73 561.0%    0+0k 0+26392io 0pf+0w
>    Summary of lapw1para:
>    localhost   k=111   user=179.004    wallclock=32.73
> 179.167u 4.791s 0:35.61 516.5%        0+0k 0+26624io 0pf+0w
>
>
> Changing the machine file to use more than one process (the same form of
> error occurs for more than 2)
>
> 1:localhost:2
>
> leads to a runtime error in the MPI subsystem.
>
> starting parallel lapw1 at Mon Jan 23 06:51:04 JST 2012
> ->  starting parallel LAPW1 jobs at Mon Jan 23 06:51:04 JST 2012
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 22673
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7ed20c) failed
> MPI_Comm_size(76).: Invalid communicator
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7ed20c) failed
> MPI_Comm_size(76).: Invalid communicator
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>      localhost localhost(111) APPLICATION TERMINATED WITH THE EXIT STRING:
> Hangup (signal 1)
> 0.037u 0.036s 0:00.06 100.0%  0+0k 0+0io 0pf+0w
> TiC.scf1_1: No such file or directory.
>    Summary of lapw1para:
>    localhost   k=0     user=111        wallclock=0
> 0.105u 0.168s 0:03.21 8.0%    0+0k 0+216io 0pf+0w
>
>
> I have properly sourced the appropriate runtime environment for the Intel
> system. For example, compiling (with mpiifort) and running the Fortran 90 MPI
> test program from Intel produces:
>
>
>
> mpirun -np 32 /home/paulfons/mpitest/testf90
>  Hello world: rank            0  of           32  running on
>  asccmp177
>  Hello world: rank            1  of           32  running on    (32 times)
>
>
> Does anyone have any suggestions as to what to try next? I am not sure how
> to debug things from here. I have about 512 nodes that I can use for larger
> calculations that can only be accessed by MPI (the ssh setup works fine as
> well, by the way). It would be great to figure out what is wrong.
>
> Thanks.
>
> Dr. Paul Fons
> Functional Nano-phase-change Research Team
> Team Leader
> Nanodevice Innovation Research Center (NIRC)
> National Institute for Advanced Industrial Science & Technology
> METI
>
> AIST Central 4, Higashi 1-1-1
> Tsukuba, Ibaraki JAPAN 305-8568
>
> tel. +81-298-61-5636
> fax. +81-298-61-2939
>
> email: paul-fons at aist.go.jp
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Györgyi
