[OMPI users] benchmark - mpi_reduce() called only once but takes long time - proportional to calculation time

2009-11-25 Thread Qing Pang

Dear users,

I'm running the popular Calculate-PI program on a 2-node setup running 
Ubuntu 8.10 and Open MPI 1.3.3 (with default settings). Password-less SSH 
is set up, but there is no cluster management software such as a network 
file system, network time protocol, resource manager, or scheduler. The 
two nodes are connected through TCP/IP only.


When I benchmark the program, the time spent in MPI_Reduce() is 
proportional to the number of intervals (n) used in the calculation. For 
example, when n = 1,000,000, MPI_Reduce costs 15.65 milliseconds; when 
n = 1,000,000,000, MPI_Reduce costs 15526 milliseconds.


This confuses me - in this Calc-PI program, MPI_Reduce is used only once. 
No matter how many intervals are used, MPI_Reduce is called just once, 
after both nodes have computed their partial results, to merge them. So 
the time spent in MPI_Reduce (although it might be slow over the TCP/IP 
connection) should be roughly constant. But that is obviously not what I 
see.


Has anyone seen a similar problem before? I'm not sure how MPI_Reduce() 
works internally. Does it matter that I don't have a network file system, 
network time protocol, resource manager, scheduler, etc. installed?


Below is the program - I did feed "n" to it more than once to warm it up.

#include "mpi.h"
#include 
#include 

int main(int argc, char *argv[])   
{   
  int numprocs, myid, rc;

  double ACCUPI = 3.1415926535897932384626433832795;
  double mypi, pi, h, sum, x;
  int n, i;
  double starttime, endtime;
  double time,told,bcasttime,reducetime,comptime,totaltime;

  rc = MPI_Init(&argc,&argv);
  if (rc != MPI_SUCCESS) {
 printf("Error starting MPI program. Terminating.\n");
 MPI_Abort(MPI_COMM_WORLD, rc);
  }
  MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);

  while (1) {
 if (myid == 0) {
printf("Enter the number of intervals: (0 quits) \n");
scanf("%d",&n);
starttime = MPI_Wtime();
 }

 time = MPI_Wtime();
 MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

 told = time;
 time = MPI_Wtime();
 bcasttime = time - told;

 if (n == 0)
break;
 else {
h = 1.0/(double)n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
x = h*((double)i - 0.5);
sum += (4.0/(1.0 + x*x));
}
mypi = sum*h;

told = time;
time = MPI_Wtime();
comptime = time - told;

MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

told = time;
time = MPI_Wtime();
reducetime = time - told;

if (myid == 0) {
   totaltime = MPI_Wtime() - starttime;
   printf("\nElapsed time (total): %f 
milliseconds\n",totaltime*1000);
   printf("Elapsed time (Bcast):  %f milliseconds 
(%5.2f%%)\n",bcasttime*1000,bcasttime*100/totaltime);
   printf("Elapsed time (Reduce): %f milliseconds 
(%5.2f%%)\n",reducetime*1000,reducetime*100/totaltime);
   printf("Elapsed time (Comput): %f milliseconds 
(%5.2f%%)\n",comptime*1000,comptime*100/totaltime);
   printf("\nApproximated pi is %.16f, Error is %.4e\n", pi, 
fabs(pi - ACCUPI));

}
 }
  }

  MPI_Finalize();   
}




Re: [OMPI users] benchmark - mpi_reduce() called only once but takes long time - proportional to calculation time

2009-11-25 Thread Eugene Loh
Your processes are probably running asynchronously.  You could perhaps 
try tracing program execution and look at the timeline.  E.g., 
http://www.open-mpi.org/faq/?category=perftools#free-tools .  Or, where 
you have MPI_Wtime calls, just capture those timestamps on each process 
and dump the results at the end of your run.  Or, report timings for all 
ranks instead of just for rank 0.
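
For instance, here is a minimal sketch (mine, not part of the original 
program) of the "report timings for all ranks" idea. It would be dropped 
into the loop after the reducetime measurement in the program above, and 
it assumes an extra #include <stdlib.h> for malloc:

/* Sketch only: collect each rank's three phase timers at rank 0 and print them. */
double mytimes[3] = { bcasttime, comptime, reducetime };
double *alltimes = NULL;
if (myid == 0)
   alltimes = malloc(3 * numprocs * sizeof(double));
MPI_Gather(mytimes, 3, MPI_DOUBLE, alltimes, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if (myid == 0) {
   int r;
   for (r = 0; r < numprocs; r++)
      printf("rank %d: Bcast %f ms, Comput %f ms, Reduce %f ms\n",
             r, alltimes[3*r]*1000, alltimes[3*r+1]*1000, alltimes[3*r+2]*1000);
   free(alltimes);
}

If one rank's compute time is much larger than another's, the slower rank's 
extra compute time will show up as Reduce time on the faster rank.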


Put another way, rank 0 must broadcast n.  So, no one starts computation 
until they get the Bcast result.  Rank 0 probably starts its 
computations before anyone else does.  So, it gets to the Reduce before 
anyone else does, but it can't exit until other ranks have finished 
their computations.  So, the Reduce time on rank 0 includes some amount 
of other ranks' compute times.


Yet another approach is to insert MPI_Barrier calls at each phase of the 
program so that the various phases are synchronized.  This adds some 
overhead to the program, but helps simplify interpretation of the timing 
results.
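
As a rough sketch (again mine, not the original code), the loop body could 
be restructured like this, with an MPI_Barrier before each timestamp so that 
each phase timer reflects the slowest rank's phase rather than waiting time:

MPI_Barrier(MPI_COMM_WORLD);   /* everyone enters the Bcast phase together */
time = MPI_Wtime();
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
told = time;  time = MPI_Wtime();  bcasttime = time - told;

/* ... check n == 0 and do the local computation of mypi as before ... */

MPI_Barrier(MPI_COMM_WORLD);   /* wait until every rank has finished computing */
told = time;  time = MPI_Wtime();  comptime = time - told;

MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
told = time;  time = MPI_Wtime();  reducetime = time - told;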





Re: [OMPI users] benchmark - mpi_reduce() called only once but takes long time - proportional to calculation time

2009-12-04 Thread Qing Pang
Thank you so much! It is a synchronization issue. In my case, one node 
actually runs slower than the other. Adding MPI_Barrier() helps to 
straighten things out.

Thank you for your help!




