Hi,

Thank you so much. Well, the memory is enough. As I said, the jobs run
and the whole computation actually completes without any complaints about
memory, but they do not end correctly. I first tried to solve this with
the following algorithm:

1. all processes except root wait for a message from root before calling
MPI_Finalize
2. when root reaches this point, it sends a message to every other process
to release it from the blocking receive

This is effectively a barrier. The solution didn't work at first, but when
I added some "cout" lines to report whether each operation had completed
successfully, it worked perfectly. I think writing to the output introduces
a delay that helps here. However, I needed to print these messages anyway,
so the problem was solved in a "correct" way. ;) Now it works, and I think
it will keep working in the future, since the problem I tested with is
gigantic!
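
For reference, here is a minimal sketch of that handshake (only the MPI
calls; the message tag, the variable names, and the placement around the
real work are illustrative and not taken from my actual code):

#include <mpi.h>

int main(int argc, char** argv)
{
        MPI_Init(&argc, &argv);

        int rank = 0, nprocs = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // ... the actual solver work would go here ...

        int token = 0;
        if (rank == 0)
        {
                // Root has reached the end: release every other rank.
                for (int dest = 1; dest < nprocs; ++dest)
                        MPI_Send(&token, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
        else
        {
                // Non-root ranks block here until root sends the release.
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
}

As far as I understand, a single MPI_Barrier(MPI_COMM_WORLD) before
MPI_Finalize would have a similar effect, and would also make root wait
for the other ranks.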

Thanks for your help again,

Danesh


Gus Correa wrote:
> Hi Danesh
>
> Make sure you have 700 GB of RAM in total across the nodes you are using.
> Otherwise context switching and memory swapping may be the problem.
> MPI doesn't perform well under these conditions (and may break, particularly
> on large problems, I suppose).
>
> A good way to go about it is to look at the physical
> "RAM per core", if those are multi-core machines,
> and compare it to the actual memory per core your program requires.
> You also need to leave the system some RAM, so use no more than 80% or
> so of the memory.
>
> If you or a system administrator has access to the nodes,
> you can monitor the memory use with "top".
> If you have Ganglia on this cluster, you can use the memory report
> metric also.
>
> Another possibility is a memory leak, which may be in your program,
> or (less likely) in MPI.
> Note, however, that Open MPI 1.3.0 and 1.3.1 had this problem (with
> InfiniBand only), which was fixed in 1.3.2:
>
> http://www.open-mpi.org/community/lists/announce/2009/04/0030.php
> https://svn.open-mpi.org/trac/ompi/ticket/1853
>
> If you are using 1.3.0 or 1.3.1, upgrade to 1.3.2.
>
> I hope this helps.
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Danesh Daroui wrote:
>> Dear all,
>>
>> I am not sure if this is the right forum to ask this question, so sorry if
>> I am wrong. I am using ScaLAPACK in my code and of course MPI (OMPI) in
>> an electromagnetic solver program running on a cluster. I see very
>> strange behavior when I use a large number of processors to run my code
>> on very large problems. In these cases the program finishes its work
>> successfully, but it then hangs until the wall time exceeds the limit and
>> the job is terminated by the queue manager (I use qsub to submit the job).
>> This happens when, for example, I use more than 80 processors for a problem
>> which needs more than 700 GB of memory. For smaller problems everything is
>> OK and all output files are generated correctly, whereas when this
>> happens, the output files are empty. I am almost sure that there is a
>> synchronization problem and some processes fail to reach the
>> finalization point while others are done.
>>
>> My code is written in C++, and in the "main" function I call a routine
>> named "Solver". My Solver function looks like this:
>>
>> Solver()
>> {
>>         for (std::vector<double>::iterator ti = times.begin();
>>              ti != times.end(); ++ti)
>>         {
>>                 Stopwatch iwatch, dwatch, twatch;
>>
>>                 // some ScaLAPACK operations
>>
>>                 if (iamroot())
>>                 {
>>                         // some operation only for root process
>>                 }
>>         }
>>
>>         blacs::gridexit(ictxt);
>>         blacs::exit(1);
>> }
>>
>> and my "main" function, which calls "Solver", looks like this:
>>
>>
>> int main()
>> {
>>         // some preparing operations
>>
>>         Solver();
>>         if (rank == 0)
>>                 std::cout << "Total execution time: " << time.tick()
>>                           << " s\n" << std::flush;
>>
>>         err = MPI_Finalize();
>>
>>         if (MPI_SUCCESS != err)
>>         {
>>                 std::cerr << "MPI_Finalize failed: " << err << "\n";
>>                 return err;
>>         }
>>
>>         return 0;
>> }
>>
>> I did put a "blacs::barrier(ictxt, 'A')" at the end of the "Solver" routine,
>> before calling "blacs::exit(1)", to make sure that all processes arrive
>> there before MPI_Finalize, but that didn't solve the problem. Do you have
>> any idea where the problem is?
>>
>> Thanks in advance,
>>
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


-- 
Danesh Daroui
Ph.D Student
Lulea University of Technology
http://www.ltu.se

danesh.dar...@ltu.se
+46-704-399847
