On Nov 4, 2012, at 7:05 AM, George Markomanolis <geo...@markomanolis.com> wrote:

> Dear all,
> 
> I am trying to execute an experiment by oversubscribing the nodes. I have 
> several clusters available (I can use up to 8-10 different clusters during 
> one execution) and around 1,300 cores in total. I am executing the EP 
> benchmark from the NAS suite, which means there are not a lot of MPI 
> messages, just some collective MPI calls.
> 
> The number of MPI processes per node depends on the available memory of 
> each node. Thus in the machinefile I have declared one node 13 times if I 
> want 13 MPI processes on it. Is that correct?

You *can* do it that way, or you could just use "slots=13" for that node in the 
file, and list it only once.
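
For example (the hostname here is made up), these two machinefiles are 
equivalent:

  node01
  node01
  [... node01 listed 13 times in total ...]

versus

  node01 slots=13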

> Giving a machinefile of 32768 nodes when I want to execute 32768 processes, 
> does Open MPI behave as if there is no oversubscription?

Yes, it should - I assume you mean "slots" and not "nodes" in the above 
statement, since you indicate that you listed each node multiple times to set 
the number of slots on that node.

> If yes, how can I give a machinefile where there is a different number of MPI 
> processes on each node? The maximum number of MPI processes that I have on a 
> node is 388.

Just assign the number of slots on each node to be the number of processes you 
want on that node, for example:
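
  node01    slots=13
  node02    slots=48
  bignode01 slots=388

(the hostnames are just placeholders) - mpirun will then start 13 processes on 
the first node, 48 on the second, and 388 on the third.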

> 
> My problem is that I can execute 16384 processes but not 32768. In the first 
> case I need around 3 minutes for the execution, but in the second case, even 
> after 7 hours the benchmark does not even start. There is no error; I am just 
> cancelling the job myself, but I am assuming that something is wrong because 
> 7 hours is too much. I have to say that I executed the instance of 16384 
> processes without any problem. I added some debug info to the benchmark and I 
> can see that the execution is delayed during MPI_Init; it never passes this 
> point. For the instance of 16384 processes I need around 2 minutes to finish 
> the MPI_Init call. I am checking the memory of all the nodes and there is at 
> least 0.5GB of free memory on each node.
> 
> I know about the parameter mpi_yield_when_idle, but I have read that it will 
> not improve performance if there are not a lot of MPI messages. I tried it 
> anyway and nothing changed. I also tried mpi_preconnect_mpi just in case, but 
> again nothing. Could you please indicate a reason why this is happening?

You indicated that these jobs are actually spanning multiple clusters - true? 
If so, when you cross that 16384 boundary, do you also cross clusters? Is it 
possible one or more of the additional clusters is blocking communications?
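
One way to narrow that down (just a sketch - the small machinefile and the 
hello-world binary are placeholders) is to run a trivial two-process job that 
spans only the suspect pair of clusters, with the BTL debug output turned up, 
and check whether the TCP connections ever get established:

  mpirun --mca btl tcp,self --mca btl_base_verbose 30 \
      -machinefile machines.two_clusters -np 2 ./hello

If that hangs the same way, the problem is the inter-cluster path rather than 
the scale of the job.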

> 
> Moreover, I used just one node with 48GB of memory to execute 2048 MPI 
> processes without any problem; of course, I just had to wait a lot.
> 
> I am using Open MPI v1.4.1 and all the clusters are 64-bit.
> 
> I execute the benchmark with the following command:
> mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_exclude ib0,lo,myri0 
> -machinefile machines -np 32768 ep.D.32768

You could just leave off the "-np N" part of the command line - we'll assign 
one process to every slot specified in the machinefile.
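
In other words, with your machinefile, something like:

  mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_exclude ib0,lo,myri0 \
      -machinefile machines ep.D.32768

will launch as many processes as there are slots in the file.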


> 
> Best regards,
> George Markomanolis
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

