Jeff, All,

testing our well-known example of the registered memory problem (see http://www.open-mpi.org/community/lists/users/2012/02/18565.php) on a freshly installed 1.6.1rc2, we found out that "Fall back to send/receive semantics" still does not always work. However, the behaviour has changed:

1.5.3 and older: MPI processes hang and block the IB interface(s) forever.

1.6.1rc2: a) MPI processes run through (if the chunk size is less than 8 GB), with or without a warning; or
          b) MPI processes die (if the chunk size is more than 8 GB).
Note that the same program which dies in (b) runs fine over IPoIB (-mca btl ^openib). However, the performance is very bad in this case: some 1100 seconds instead of about a minute.

Reproducing: compile the attached file and run it on nodes with >= 24 GB of RAM with
    log_num_mtt     : 20
    log_mtts_per_seg: 3
(= 32 GB registerable memory, our default values):
$ mpiexec ....<one proc per node> .... a.out 1080000000 1080000001
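
For reference, a minimal sketch of the arithmetic behind the "32 GB" figure, assuming the usual mlx4 formula (registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size) and 4 KiB pages:

#include <cstdio>

int main()
{
  const long long log_num_mtt      = 20;    // our default mlx4_core setting
  const long long log_mtts_per_seg = 3;     // our default mlx4_core setting
  const long long page_size        = 4096;  // assumed 4 KiB pages

  // registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page_size
  const long long reg_mem = (1LL << log_num_mtt) * (1LL << log_mtts_per_seg) * page_size;
  printf("registerable memory: %lld bytes = %lld GiB\n",
         reg_mem, reg_mem / (1LL << 30));   // prints 32 GiB
  return 0;
}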

Well, we know about the need to raise the value of one of these parameters, but I wanted to let you know that your workaround for the problem is still not 100% perfect, but only 99%.


Best,
Paul Kapinos


P.S.: A note about the informative warning:
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.
....
  Registerable memory:     32768 MiB
  Total memory:            98293 MiB
--------------------------------------------------------------------------
On nodes with 24 GB of RAM this warning did not come up, although the maximum size of registerable memory (32 GB) is only about 1.3x the RAM, whereas in
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
at least 2x the RAM size is recommended.

Shouldn't this warning come out in all cases where registerable memory < 2x RAM?
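
Just to make the question concrete, this is the kind of check I would expect (a sketch only; the helper name is made up and I have not looked at how the openib BTL actually implements the warning threshold -- perhaps the current check is "registerable < total", which would explain the behaviour on the 24 GB nodes):

#include <cstdio>

// Hypothetical helper; the real openib BTL code certainly looks different.
static bool should_warn_low_reg_mem(long long registerable_mib, long long total_mib)
{
  // FAQ recommendation: registerable memory should be at least 2x physical RAM
  return registerable_mib < 2 * total_mib;
}

int main()
{
  // 96 GB node: warning is printed (as observed above)
  printf("96 GB node: warn = %d\n", should_warn_low_reg_mem(32768, 98293));
  // 24 GB node: a strict 2x rule would warn here as well, but in practice it does not
  printf("24 GB node: warn = %d\n", should_warn_low_reg_mem(32768, 24576));
  return 0;
}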




On 07/28/12 04:20, Jeff Squyres wrote:
- A bunch of changes to eliminate hangs on OpenFabrics-based networks.
   Users with Mellanox hardware are ***STRONGLY ENCOURAGED*** to check
   their registered memory kernel module settings to ensure that the OS
   will allow registering more than 8GB of memory.  See this FAQ item
   for details:

   http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

   - Fall back to send/receive semantics if registered memory is
     unavailable for RDMA.


--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915
// #include "mpi.h" nach oben, sonst Fehlermeldung
//    /opt/intel/impi/4.0.3.008/intel64/include/mpicxx.h(93): catastrophic error: #error directive: "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"
//  #error "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include <math.h>

using namespace std;







void startMPI( int &rank, int &procCount, int & argc, char ** argv )
{
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &procCount);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  
}

void cleanUp( double *& vector)   // take the pointer by reference so the caller's pointer is reset
{
    delete [] vector;
    vector = 0;
}

// we want to stay MPI-standard conformant (counts are int)
//void initVector( double * vector, unsigned long &length, double val)
void initVector( double * vector, int &length, double val)
{
    for( int i=0; i<length; ++i )
    {
      vector[i] = val;
    }
}

/*
**  Executed by the master only!
**  Check the input:
**  initialize the vector length with the value read from the command line
*/
// we want to stay MPI-standard conformant (counts are int)
//void input( int & argc, char ** argv, unsigned long &length)
void input( int & argc, char ** argv, int &length, int &block)
{
    length = 10000000;
    block = 1000;
    if( argc > 1)
    {
      length = atol(argv[1]);
    }
    if( argc > 2)
    {
      block = atol(argv[2]);
    }
}

/*
**  Test output (optional)
*/
// we want to stay MPI-standard conformant (counts are int)
//void printVector( double * vector, unsigned long &length, int &proc, int &count)
void printVector( double * vector, int &length, int &proc, int &count)
{
  printf("process %i:\t[ ", proc);
  for(long i=0;i<length;++i)
  {
      if( i%count==0 && i>0)
      {
	printf("\n\t\t  ");
      }
     printf(" %5.1f", vector[i] );
  }
  printf(" ] \n");
}
// we want to stay MPI-standard conformant (counts are int)
//void checkResult( double &checkSum, unsigned long &bufferLength, const double &testVal , int &procCount, const double &epsilon)
void checkResult( double &checkSum, int &bufferLength, const double &testVal , int &procCount, const double &epsilon)
{
    double targetVal = bufferLength * testVal * procCount;
    double diff = (targetVal >= checkSum)? (targetVal - checkSum):(targetVal - checkSum)*(-1);
    
    if(diff < epsilon)
    {
        printf("#####  Test ok!  #####\n");
    }
    else
    {
        printf("difference occured: %lf \n", diff);
    }
    printf("[SOLL | IST] = ( %0.2lf | %0.2lf)\n", targetVal, checkSum );
}


//void checkResultVector( double &checkSum, int &bufferLength, const double &testVal , int &procCount, const double &epsilon)
//{}




int main( int argc,  char * argv[])
{
  const double TEST_VAL = 0.1;
  const double EPSILON = pow(10, -9);

  int rank, procCount;
  // we want to stay MPI-standard conformant (counts are int)
  // unsigned long size=0;
  int size=0;
  int block=0;
  int tot, sz0;
  double startTime, endTime;
  double localResult=0, checkSum=0;
  
  //--------------------------------------------------------------
  startMPI( rank, procCount, argc, argv);   // initialize the MPI environment
  // MPI_Wtime must come AFTER startMPI, otherwise this error message appears:
  //     Attempting to use an MPI routine before initializing MPI

  startTime = MPI_Wtime();

  printf("process %i starts test \n", rank);

  
  if(0==rank)
  {
    printf("Epsilon = %0.10lf\n", EPSILON);
    input(argc, argv, size, block);
  }
  MPI_Bcast(&size, 1, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Bcast(&block, 1, MPI_INT, 0, MPI_COMM_WORLD);
  double * vector = new double[size];
  
  if(0==rank)
  {
    initVector(vector, size, TEST_VAL);
  }
  
  //printVector(vector, size, rank, procCount );
  MPI_Bcast(vector, size, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  //printVector(vector, size, rank, procCount );
  
  // executed by all processes:
  for(int i=0; i<size;++i)
  {
    localResult+=vector[i];
  }
  printf("#######  process %i yields partial sum: %0.1lf\n", rank, localResult);
  //printf("#######  process %i yields check sum: %0.1lf\n", rank, checkSum);
  
  MPI_Reduce( &localResult, &checkSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );

  double * vector_empf = new double[size];   // receive buffer for the reduced result

  printf("size, block: %i %i \n ", size, block);
  if (size < block)
    {
      // reduce the whole vector in one single MPI_Reduce call
      printf("MPI_Reduce in einem!\n ");
      MPI_Reduce( &vector[0], &vector_empf[0], size, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    }
  else
    {
      // reduce the vector piecewise, in chunks of at most 'block' doubles
      tot = 0;
      do
        {
          sz0 = block;
          if (tot + sz0 > size)  sz0 = size - tot;   // last chunk may be shorter
//          if(0==rank) printf("MPI_Reduce: tot %i\n ", tot);
          // fflush(stdout);

          MPI_Reduce( &vector[tot], &vector_empf[tot], sz0, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );

          tot = tot + sz0;
          MPI_Barrier(MPI_COMM_WORLD);
        }
      while (tot < size);
    }





//   if(0==rank) {
//     MPI_Reduce( MPI_IN_PLACE, &vector, size, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
//   } else {
//     MPI_Reduce( &vector, NULL, size, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
//   }
// int MPI_Reduce(void *sendbuf, void *recvbuf, int count,  MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)







  MPI_Barrier(MPI_COMM_WORLD);
  

  if(0==rank)
  { 
    checkResult( checkSum, size, TEST_VAL, procCount, EPSILON );
    endTime = MPI_Wtime();
    printf("Elapsed time: %lf\n", (endTime - startTime));

    localResult=0;
    for(int i=0; i<size;++i)    {
      localResult+=vector_empf[i];
//      if (vector_empf[i] == TEST_VAL) printf("UNCHANGED %i\n ", i);    
    }
    printf("Master's Summe: %lf \n", localResult);


  }
  
//  cleanUp(vector);


  MPI_Finalize();

  //--------------------------------------------------------------

} // main
process 5 starts test 
process 0 starts test 
Epsilon = 0.0000000010
process 4 starts test 
process 1 starts test 
process 2 starts test 
process 3 starts test 
#######  process 4 yields partial sum: 108000000.4
size, block: 1080000000 10000 
#######  process 5 yields partial sum: 108000000.4
#######  process 2 yields partial sum: 108000000.4
size, block: 1080000000 10000 
#######  process 1 yields partial sum: 108000000.4
size, block: 1080000000 10000 
#######  process 3 yields partial sum: 108000000.4
#######  process 0 yields partial sum: 108000000.4
size, block: 1080000000 10000 
size, block: 1080000000 10000 
size, block: 1080000000 10000 
 difference occured: 2.597129 
[SOLL | IST] = ( 648000000.00 | 648000002.60)
Elapsed time: 52.758014
Master's Summe: 647999996.699953 
  
rank: 2  VmPeak:        17105056 kB

rank: 1  VmPeak:        17105068 kB
   
rank: 3  VmPeak:        17220544 kB

rank: 4  VmPeak:        17220440 kB

rank: 5  VmPeak:        17220444 kB

rank: 0  VmPeak:        17678540 kB
process 0 starts test 
Epsilon = 0.0000000010
process 2 starts test 
process 3 starts test 
process 4 starts test 
process 1 starts test 
process 5 starts test 
#######  process 2 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 0 yields partial sum: 108000000.4
#######  process 5 yields partial sum: 108000000.4
#######  process 3 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 1 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 4 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:             linuxbmc0372.rz.RWTH-Aachen.DE
Local device:           mlx4_0
Queue pair type:        Reliable connected (RC)
--------------------------------------------------------------------------
[linuxbmc0372.rz.RWTH-Aachen.DE:12898] *** An error occurred in MPI_Barrier
[linuxbmc0372.rz.RWTH-Aachen.DE:12898] *** on communicator MPI_COMM_WORLD
[linuxbmc0372.rz.RWTH-Aachen.DE:12898] *** MPI_ERR_OTHER: known error not in 
list
[linuxbmc0372.rz.RWTH-Aachen.DE:12898] *** MPI_ERRORS_ARE_FATAL: your MPI job 
will now abort
 
rank: 2  VmPeak:        25775548 kB

rank: 4  VmPeak:        25775552 kB

rank: 3  VmPeak:        25775664 kB

rank: 0  VmPeak:        17338008 kB
--------------------------------------------------------------------------
mpiexec has exited due to process rank 2 with PID 12897 on
node linuxbmc0372 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------

rank: 1  VmPeak:        25775792 kB

rank: 5  VmPeak:        17104304 kB
Failure executing command /opt/MPI/openmpi-1.6.1rc2/linux/intel/bin/mpiexec -x  
LD_LIBRARY_PATH -x  PATH -x  OMP_NUM_THREADS -x  MPI_NAME -mca 
oob_tcp_if_include ib0 -mca btl_tcp_if_include ib0 --hostfile 
/tmp/pk224850/cluster_52665/hostfile-39693 -np 6 memusage a.out 1080000000 
1080000001
process 1 starts test 
process 0 starts test 
Epsilon = 0.0000000010
process 4 starts test 
process 2 starts test 
process 3 starts test 
process 5 starts test 
#######  process 1 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 2 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 0 yields partial sum: 108000000.4
#######  process 3 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 4 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 5 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
 difference occured: 2.597129 
[SOLL | IST] = ( 648000000.00 | 648000002.60)
Elapsed time: 1105.989871
Master's Summe: 647999996.699953 
     
rank: 5  VmPeak:        16944988 kB

rank: 2  VmPeak:        25382652 kB

rank: 0  VmPeak:        16945144 kB

rank: 3  VmPeak:        25382680 kB

rank: 4  VmPeak:        25382652 kB

rank: 1  VmPeak:        25382680 kB
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              linuxbmc0246.rz.RWTH-Aachen.DE
  Registerable memory:     32768 MiB
  Total memory:            98293 MiB
--------------------------------------------------------------------------
process 0 starts test 
Epsilon = 0.0000000010
process 1 starts test 
process 4 starts test 
process 2 starts test 
process 3 starts test 
process 5 starts test 
[cluster.rz.RWTH-Aachen.DE:25617] 5 more processes have sent help message 
help-mpi-btl-openib.txt / reg mem limit low
[cluster.rz.RWTH-Aachen.DE:25617] Set MCA parameter "orte_base_help_aggregate" 
to 0 to see all help / error messages
#######  process 5 yields partial sum: 108000000.4
#######  process 0 yields partial sum: 108000000.4
#######  process 2 yields partial sum: 108000000.4
size, block: 1080000000 10000 
#######  process 3 yields partial sum: 108000000.4
#######  process 1 yields partial sum: 108000000.4
size, block: 1080000000 10000 
size, block: 1080000000 10000 
size, block: 1080000000 10000 
size, block: 1080000000 10000 
#######  process 4 yields partial sum: 108000000.4
size, block: 1080000000 10000 
 difference occured: 2.597129 
[SOLL | IST] = ( 648000000.00 | 648000002.60)
Elapsed time: 51.110225
Master's Summe: 647999996.699953 
     
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              linuxbmc0221.rz.RWTH-Aachen.DE
  Registerable memory:     32768 MiB
  Total memory:            98293 MiB
--------------------------------------------------------------------------
process 5 starts test 
process 2 starts test 
process 3 starts test 
process 1 starts test 
process 0 starts test 
Epsilon = 0.0000000010
process 4 starts test 
[cluster.rz.RWTH-Aachen.DE:28988] 5 more processes have sent help message 
help-mpi-btl-openib.txt / reg mem limit low
[cluster.rz.RWTH-Aachen.DE:28988] Set MCA parameter "orte_base_help_aggregate" 
to 0 to see all help / error messages
#######  process 5 yields partial sum: 108000000.4
#######  process 1 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 2 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 3 yields partial sum: 108000000.4
#######  process 0 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 4 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:             linuxbmc0191.rz.RWTH-Aachen.DE
Local device:           mlx4_0
Queue pair type:        Reliable connected (RC)
--------------------------------------------------------------------------
[linuxbmc0191.rz.RWTH-Aachen.DE:21884] *** An error occurred in MPI_Barrier
[linuxbmc0191.rz.RWTH-Aachen.DE:21884] *** on communicator MPI_COMM_WORLD
[linuxbmc0191.rz.RWTH-Aachen.DE:21884] *** MPI_ERR_OTHER: known error not in 
list
[linuxbmc0191.rz.RWTH-Aachen.DE:21884] *** MPI_ERRORS_ARE_FATAL: your MPI job 
will now abort
  [cluster.rz.RWTH-Aachen.DE:28988] 1 more process has sent help message 
help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
[cluster.rz.RWTH-Aachen.DE:28988] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: linuxbmc0055.rz.RWTH-Aachen.DE
PID:  32263

This process may still be running and/or consuming resources.

--------------------------------------------------------------------------
   --------------------------------------------------------------------------
mpiexec has exited due to process rank 2 with PID 13153 on
node linuxbmc0105 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
 [cluster.rz.RWTH-Aachen.DE:28988] 3 more processes have sent help message 
help-odls-default.txt / odls-default:could-not-kill
Failure executing command /opt/MPI/openmpi-1.6.1rc2/linux/intel/bin/mpiexec -x  
LD_LIBRARY_PATH -x  PATH -x  OMP_NUM_THREADS -x  MPI_NAME -mca 
oob_tcp_if_include ib0 -mca btl_tcp_if_include ib0 --hostfile 
/tmp/pk224850/cluster_52665/hostfile-28863 -np 6 a.out 1080000000 1080000001
MPI_Reduce cannot handle buffers which are bigger than some 8 GB.
Neither in one piece, nor (surprise, surprise!) in pieces.

First parameter: number of doubles to send
Second parameter: chunk size


OK: (less than 8 GB)
$ mpiexec -np 2 -m 1  memusage a.out 1010000000 10000
$ mpiexec -np 2 -m 1  memusage a.out 1010000000 1010000001

OK: (more than 8 GB) since 1.6.1rc2
$ mpiexec -np 2 -m 1  memusage a.out 1080000000 10000

NOK: (more than 8 GB in one piece; the same program runs fine, but slowly, with -mca btl ^openib)
$ mpiexec -np 2 -m 1  memusage a.out 1080000000 1080000001
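
For orientation, the buffer sizes behind the OK/NOK boundary, computed from the arguments above (that the relevant limit sits near 8 GiB is my reading of the behaviour, not a documented number):

#include <cstdio>

int main()
{
  const long long ok_doubles  = 1010000000LL;   // "OK" case above
  const long long nok_doubles = 1080000000LL;   // "NOK" case above

  // each MPI_DOUBLE occupies 8 bytes
  printf("OK  buffer: %.2f GiB\n", ok_doubles  * 8.0 / (1LL << 30));  // ~7.52 GiB
  printf("NOK buffer: %.2f GiB\n", nok_doubles * 8.0 / (1LL << 30));  // ~8.05 GiB
  return 0;
}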
