Hi,

Thank you so much. Well, the memory is enough. As I said, the jobs run and the whole process actually completes without any complaint about memory, but the jobs do not end correctly. I first tried to solve this using the following algorithm (a rough sketch is below):

1. All processes except root wait for a message from root just before MPI_Finalize is called.
2. When root reaches this point, it sends a message to every other process to release them from the blocking call.

This is effectively a barrier.
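For illustration, a minimal sketch of this kind of hand-rolled barrier follows; the function name release_barrier and the tag DONE_TAG are made up for the example, not the actual code:

    #include <mpi.h>

    // Hand-rolled "barrier": every non-root rank blocks in MPI_Recv until
    // root says it is done; root then releases the others one by one.
    // DONE_TAG and release_barrier() are illustrative names only.
    static const int DONE_TAG = 99;

    void release_barrier(int rank, int nprocs)
    {
        int token = 0;
        if (rank != 0)
        {
            // Non-root ranks wait here for the go-ahead from root.
            MPI_Recv(&token, 1, MPI_INT, 0, DONE_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        else
        {
            // Root only gets here after its own work is finished,
            // then it unblocks every other rank.
            for (int p = 1; p < nprocs; ++p)
                MPI_Send(&token, 1, MPI_INT, p, DONE_TAG, MPI_COMM_WORLD);
        }
    }

A single MPI_Barrier(MPI_COMM_WORLD) right before MPI_Finalize would give the same synchronization in one call.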
The solution didn't work at first, but when I added some "cout" lines to report whether each operation completed successfully, it worked perfectly. I think writing to the output introduces a small delay that happens to help here. However, I needed to write these messages anyway, so the problem is solved in a correct way. ;) Now it works, and I think it will keep working in the future too, since the problem I tested it with is gigantic!

Thanks for your help again,

Danesh

Gus Correa wrote:
> Hi Danesh
>
> Make sure you have 700 GB of RAM on the sum of all nodes you are using.
> Otherwise context switching and memory swapping may be the problem.
> MPI doesn't perform well in these conditions (and may break, particularly
> on large problems, I suppose).
>
> A good way to go about it is to look at the physical
> "RAM per core" if those are multi-core machines,
> and compare it to the actual memory per core your program requires.
> You need to leave the system some RAM as well, and use no more than 80% or
> so of the memory.
>
> If you or a system administrator has access to the nodes,
> you can monitor the memory use with "top".
> If you have Ganglia on this cluster, you can use the memory report
> metric as well.
>
> Another possibility is a memory leak, which may be in your program,
> or (less likely) in MPI.
> Note, however, that OpenMPI 1.3.0 and 1.3.1 had this problem (with
> InfiniBand only), which was fixed in 1.3.2:
>
> http://www.open-mpi.org/community/lists/announce/2009/04/0030.php
> https://svn.open-mpi.org/trac/ompi/ticket/1853
>
> If you are using 1.3.0 or 1.3.1, upgrade to 1.3.2.
>
> I hope this helps.
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Danesh Daroui wrote:
>> Dear all,
>>
>> I am not sure if this is the right forum to ask this question, so sorry
>> if I am wrong. I am using ScaLAPACK in my code, and MPI of course (OMPI),
>> in an electromagnetic solver program running on a cluster. I get very
>> strange behavior when I use a large number of processors to run my code
>> on very large problems. In these cases the program finishes
>> successfully, but it stays until the wall time exceeds the limit and the
>> job is terminated by the queue manager (I use qsub to submit a job). This
>> happens when, for example, I use more than 80 processors for a problem
>> which needs more than 700 GB of memory. For smaller problems everything
>> is OK and all output files are generated correctly, while when this
>> happens, the output files are empty. I am almost sure that there is a
>> synchronization problem and some processes fail to reach the
>> finalization point while others are done.
>>
>> My code is written in C++, and in "main" I call a routine called
>> "Solver".
>> My "Solver" function looks like this:
>>
>> Solver()
>> {
>>     for (std::vector<double>::iterator ti=times.begin();
>>          ti!=times.end(); ++ti)
>>     {
>>         Stopwatch iwatch, dwatch, twatch;
>>
>>         // some ScaLAPACK operations
>>
>>         if (iamroot())
>>         {
>>             // some operations only for the root process
>>         }
>>     }
>>
>>     blacs::gridexit(ictxt);
>>     blacs::exit(1);
>> }
>>
>> and my "main" function, which calls "Solver", looks like this:
>>
>> int main()
>> {
>>     // some preparatory operations
>>
>>     Solver();
>>     if (rank==0)
>>         std::cout << "Total execution time: " << time.tick()
>>                   << " s\n" << std::flush;
>>
>>     err=MPI_Finalize();
>>
>>     if (MPI_SUCCESS!=err)
>>     {
>>         std::cerr << "MPI_Finalize failed: " << err << "\n";
>>         return err;
>>     }
>>
>>     return 0;
>> }
>>
>> I did put a "blacs::barrier(ictxt, 'A')" at the end of the "Solver"
>> routine, before the call to "blacs::exit(1)", to make sure that all
>> processes arrive there before MPI_Finalize, but that didn't solve the
>> problem. Do you have any idea where the problem is?
>>
>> Thanks in advance,
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Danesh Daroui
Ph.D Student
Lulea University of Technology
http://www.ltu.se
danesh.dar...@ltu.se
+46-704-399847
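For reference, a minimal sketch of the shutdown order being discussed above, written against the C BLACS interface (Cblacs_gridexit / Cblacs_exit) rather than the blacs:: wrapper used in the quoted code; the explicit MPI_Barrier is the kind of synchronization point being suggested, not a confirmed fix:

    #include <mpi.h>

    // C BLACS interface; the quoted code uses a thin C++ wrapper instead.
    extern "C" {
        void Cblacs_gridexit(int ictxt);
        void Cblacs_exit(int notdone);
    }

    void shutdown(int ictxt)
    {
        Cblacs_gridexit(ictxt);        // release the BLACS grid/context
        Cblacs_exit(1);                // non-zero: BLACS must NOT call MPI_Finalize itself
        MPI_Barrier(MPI_COMM_WORLD);   // make every rank reach this point first
        MPI_Finalize();
    }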