[OMPI devel] assert in opal_datatype_is_contiguous_memory_layout
Hi all,

(Sorry, I have sent this to the "users" list but I should have sent it to the "devel" list instead. Sorry for the mess...)

I have attached a very small example which raises an assertion. The problem arises from a process which does not have any element to write in a file (and therefore in the MPI_File_set_view)...

You can see this "bug" with Open MPI 1.6.3, 1.6.4 and 1.7.0 configured with:

./configure --enable-mem-debug --enable-mem-profile --enable-memchecker --with-mpi-param-check --enable-debug

Just compile the given example (idx_null.cc) as-is with

mpicxx -o idx_null idx_null.cc

and run with 3 processes:

mpirun -n 3 idx_null

You can modify the example by commenting out "#define WITH_ZERO_ELEMNT_BUG" to see that everything goes well when all processes have something to write.

There is no "bug" if you use Open MPI 1.6.3 (and higher) without the debugging options. Also, everything works well with mpich-3.0.3 configured with:

./configure --enable-g=yes

So, is this a wrong "assert" in Open MPI? Is there a real problem with using this example in a "release" build?

Thanks,

Eric

#include "mpi.h"
#include <cstdio>
#include <iostream>

using namespace std;

void abortOnError(int ierr) {
  if (ierr != MPI_SUCCESS) {
    printf("ERROR Returned by MPI: %d\n",ierr);
    char* lCharPtr = new char[MPI_MAX_ERROR_STRING];
    int lLongueur = 0;
    MPI_Error_string(ierr,lCharPtr, &lLongueur);
    printf("ERROR_string Returned by MPI: %s\n",lCharPtr);
    MPI_Abort( MPI_COMM_WORLD, 1 );
  }
}

// This main is showing how to have an assertion raised if you try
// to create a MPI_File_set_view with some process holding no data
#define WITH_ZERO_ELEMNT_BUG

int main(int argc, char *argv[])
{
  int rank, size, i;
  MPI_Datatype lTypeIndexIntWithExtent, lTypeIndexIntWithoutExtent;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (size != 3) {
    printf("Please run with 3 processes.\n");
    MPI_Finalize();
    return 1;
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int  displacement[3];
  int* buffer     = 0;
  int  lTailleBuf = 0;

  if (rank == 0) {
    lTailleBuf = 3;
    displacement[0] = 0;
    displacement[1] = 4;
    displacement[2] = 5;
    buffer = new int[lTailleBuf];
    for (i=0; i
    // [a chunk of the attachment was swallowed by the list archive here: the end
    //  of this loop, the setup for ranks 1 and 2, and the beginning of the
    //  MPI_File_open call, which ends as follows]
    ("temp"), MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &lFile ));

  MPI_Type_create_indexed_block(lTailleBuf, 1, displacement, MPI_INT, &lTypeIndexIntWithoutExtent);
  MPI_Type_commit(&lTypeIndexIntWithoutExtent);

  // Here we compute the total number of int to write to resize the type:
  // we exchange the total number of ints written at each call because we must
  // compute the correct "extent" of the type.  In other words, each process
  // writes only a small part of the file, but must advance its local write
  // pointer far enough so that it does not overwrite the other processes' data.
  int lTailleGlobale = 0;
  printf("[%d] Local size : %d \n",rank,lTailleBuf);
  MPI_Allreduce( &lTailleBuf, &lTailleGlobale, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
  printf("[%d] MPI_AllReduce : %d \n",rank,lTailleGlobale);

  // We now modify the extent of the type "type_without_extent"
  MPI_Type_create_resized( lTypeIndexIntWithoutExtent, 0, lTailleGlobale*sizeof(int), &lTypeIndexIntWithExtent );
  MPI_Type_commit(&lTypeIndexIntWithExtent);

  abortOnError(MPI_File_set_view( lFile, 0, MPI_INT, lTypeIndexIntWithExtent, const_cast<char*>("native"), MPI_INFO_NULL));

  for (int i =0; i<2;++i) {
    abortOnError(MPI_File_write_all( lFile, buffer, lTailleBuf, MPI_INT, MPI_STATUS_IGNORE));
    MPI_Offset lOffset,lSharedOffset;
    MPI_File_get_position(lFile, &lOffset);
    MPI_File_get_position_shared(lFile, &lSharedOffset);
    printf("[%d] Offset after write : %d int: Local: %ld Shared: %ld \n",rank,lTailleBuf,lOffset,lSharedOffset);
  }

  abortOnError(MPI_File_close( &lFile ));
  abortOnError(MPI_Type_free(&lTypeIndexIntWithExtent));
  abortOnError(MPI_Type_free(&lTypeIndexIntWithoutExtent));
  MPI_Finalize();
  return 0;
}
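For reference, the pattern that triggers the assert can be distilled into a few lines. The sketch below is only a rough illustration of that pattern, not the original idx_null.cc: the variable names, the displacement values and the choice of which rank is empty are made up.

#include <mpi.h>
#include <stdio.h>

/* Rough illustration (not the original idx_null.cc): every rank sets a file
 * view built from an indexed block, but rank 0 contributes zero elements. */
int main(int argc, char** argv)
{
    int rank, nlocal, nglobal, value, disp[1] = {0};
    MPI_File fh;
    MPI_Datatype idx, view;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    nlocal = (rank == 0) ? 0 : 1;        /* rank 0 has nothing to write */
    if (rank != 0) disp[0] = rank - 1;
    value  = rank;

    /* every rank needs the global count to compute the resized extent */
    MPI_Allreduce(&nlocal, &nglobal, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Type_create_indexed_block(nlocal, 1, disp, MPI_INT, &idx);
    MPI_Type_commit(&idx);
    MPI_Type_create_resized(idx, 0, (MPI_Aint)(nglobal * sizeof(int)), &view);
    MPI_Type_commit(&view);

    MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &fh);
    /* the debug-build assert reportedly fires here for the rank whose view is empty */
    MPI_File_set_view(fh, 0, MPI_INT, view, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, &value, nlocal, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&view);
    MPI_Type_free(&idx);
    MPI_Finalize();
    return 0;
}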
[OMPI devel] Simplified: Misuse or bug with nested types?
Hi,

I have sent a previous message showing something that I think is a bug (or maybe a misuse, but...). I reworked the example to simplify it: it is now almost half the number of lines of code and the structures are simpler... but it still shows the wrong behaviour.

Briefly, we construct different MPI_Datatypes and nest them into a final type which is a: {MPI_LONG,{{MPI_LONG,MPI_CHAR}*2}}

Here is the output from OpenMPI 1.6.3:

Rank 0 send this:
i: 0 => {{0},{{3,%},{7,5}}}
i: 1 => {{1},{{3,%},{7,5}}}
i: 2 => {{2},{{3,%},{7,5}}}
i: 3 => {{3},{{3,%},{7,5}}}
i: 4 => {{4},{{3,%},{7,5}}}
i: 5 => {{5},{{3,%},{7,5}}}
MPI_Recv returned success and everything in MPI_Status is correct after receive. Rank 1 received this:
i: 0 => {{0},{{3,%},{-999,$}}} *** ERROR
i: 1 => {{1},{{3,%},{-999,$}}} *** ERROR
i: 2 => {{2},{{3,%},{-999,$}}} *** ERROR
i: 3 => {{3},{{3,%},{-999,$}}} *** ERROR
i: 4 => {{4},{{3,%},{-999,$}}} *** ERROR
i: 5 => {{5},{{3,%},{-999,$}}} *** ERROR

Here is the expected output, obtained with mpich-3.0.3:

Rank 0 send this:
i: 0 => {{0},{{3,%},{7,5}}}
i: 1 => {{1},{{3,%},{7,5}}}
i: 2 => {{2},{{3,%},{7,5}}}
i: 3 => {{3},{{3,%},{7,5}}}
i: 4 => {{4},{{3,%},{7,5}}}
i: 5 => {{5},{{3,%},{7,5}}}
MPI_Recv returned success and everything in MPI_Status is correct after receive. Rank 1 received this:
i: 0 => {{0},{{3,%},{7,5}}} OK
i: 1 => {{1},{{3,%},{7,5}}} OK
i: 2 => {{2},{{3,%},{7,5}}} OK
i: 3 => {{3},{{3,%},{7,5}}} OK
i: 4 => {{4},{{3,%},{7,5}}} OK
i: 5 => {{5},{{3,%},{7,5}}} OK

Is it related to the bug reported here: http://www.open-mpi.org/community/lists/devel/2013/04/12267.php ?

Thanks,

Eric
Re: [OMPI devel] Simplified: Misuse or bug with nested types?
Sorry, here is the attachment...

Eric

On 04/23/2013 09:54 AM, Eric Chamberland wrote:
> [...quoted text clipped...]

#include "mpi.h"
#include <iostream>

//**
//
// This example is showing a problem with nested types!
// It works perfectly with mpich-3.0.3 but seems to do a wrong transmission
// with openmpi 1.6.3, 1.6.4, 1.7.0 and 1.7.1
//
// The basic problem seems to arise with a vector of PALong_2Pairs which is a
// MPI nested type constructed like this:
//--
// Struct        | is composed of
//--
// PAPairLC      | {long, char}
// PALong_2Pairs | {long,{PAPairLC,PAPairLC}}
//--
//
//**

using namespace std;

//! Function to abort on any MPI error:
void abortOnError(int ierr) {
  if (ierr != MPI_SUCCESS) {
    std::cerr << "ERROR Returned by MPI: " << ierr << std::endl;
    char* lCharPtr = new char[MPI_MAX_ERROR_STRING];
    int lLongueur = 0;
    MPI_Error_string(ierr,lCharPtr, &lLongueur);
    std::cerr << "ERROR_string Returned by MPI: " << lCharPtr << std::endl;
    MPI_Abort( MPI_COMM_WORLD, 1 );
  }
}

// a constant:
#define FIRST_CHAR 32

//*
//
// PAPairLC is a pair: {long, char}
//
//*
class PAPairLC
{
public:
  PAPairLC() : aLong(-999), aChar(FIRST_CHAR+4) {}

  long aLong;
  char aChar;

  static MPI_Datatype asMPIDatatype;
  static MPI_Datatype& reqMPIDatatype() { return asMPIDatatype; }

  void print(std::ostream& pOS) { pOS << "{" << aLong << "," << aChar << "}"; }

  static void createMPIDatatype() {
    PAPairLC lPAType;
    MPI_Datatype lTypes[2];
    lTypes[0] = MPI_LONG;
    lTypes[1] = MPI_CHAR;
    MPI_Aint lDeplacements[2];
    MPI_Aint lPtrBase = 0;
    MPI_Get_address(&lPAType, &lPtrBase);
    MPI_Get_address(&lPAType.aLong, &lDeplacements[0]);
    MPI_Get_address(&lPAType.aChar, &lDeplacements[1]);
    // Compute the "displacement" from lPtrBase
    lDeplacements[0] -= lPtrBase;
    lDeplacements[1] -= lPtrBase;
    int lBlocLen[2] = {1,1};
    abortOnError(MPI_Type_create_struct(2, lBlocLen, lDeplacements, lTypes, &asMPIDatatype));
    abortOnError(MPI_Type_commit(&asMPIDatatype));
  }
};
MPI_Datatype PAPairLC::asMPIDatatype = MPI_DATATYPE_NULL;

//*
//
// PALong_2Pairs is a struct of: {long, PAPairLC[2]}
//
//*
class PALong_2Pairs
{
public:
  PALong_2Pairs() {}

  long aFirst;
  PAPairLC a2Pairs[2];

  static MPI_Datatype asMPIDatatype;
  static MPI_Datatype& reqMPIDatatype() { r
  // [the rest of the attachment is truncated in the archive]
Re: [OMPI devel] Simplified: Misuse or bug with nested types?
Another piece of information: I just tested the example with Intel MPI 4.0.1.007 and it works correctly... So the problem seems to be only with OpenMPI... which is the default distribution we use... :-/

Is my example code too long?

Eric

On 2013-04-23 09:55, Eric Chamberland wrote:
> Sorry, here is the attachment...
>
> [...quoted text clipped...]
Re: [OMPI devel] Simplified: Misuse or bug with nested types?
Hi Jeff,

thanks for your answer! You inserted a doubt in my mind... and gave me hope... :-)

So I did some modifications to the code to help everyone:

1- it's now in "C"... :-)
2- Concerning your remark about the arbitrary address: I am now using the "offsetof" macro defined in "stddef.h" to compute the offset (displacement) needed to create the datatype
3- I have simplified and reduced (again) the number of lines needed to reproduce the error... see "nested_bug.c" attached to this mail...

Output with openmpi 1.6.3:

Rank 0 send this: {{1},{{2,3},{4,5}}}
Rank 1 received this: {{1},{{2,3},{4199789,15773951}}} *** ERROR

Expected output (still ok with mpich 3.0.3 and intel mpi 4):

Rank 0 send this: {{1},{{2,3},{4,5}}}
Rank 1 received this: {{1},{{2,3},{4,5}}} OK

Thanks!

Eric

On 2013-04-23 18:03, Jeff Squyres (jsquyres) wrote:
> Sorry for the delay. My C++ is a bit rusty, but this does not seem correct to me.
>
> You're making the datatypes relative to an arbitrary address (&lPtrBase) in a
> static method on each class. You really need the datatypes to be relative to
> each instance's *this* pointer. Doing so allows MPI to read/write the data
> relative to the specific instance of the objects that you're trying to
> send/receive.
>
> Make sense?
>
> On Apr 23, 2013, at 5:01 PM, Eric Chamberland wrote:
> [...quoted text clipped...]

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/**
//
// This example is showing a problem with nested types!
// It works perfectly with mpich-3.0.3 but seems to do a wrong transmission
// with openmpi 1.6.3, 1.6.4, 1.7.0 and 1.7.1
//
// The basic problem seems to arise with a vector of PALong_2Pairs which is a
// MPI nested type constructed like this:
//--
// Struct        | is composed of
//--
// PAPairLI      | {long, int}
// PALong_2Pairs | {long,{PAPairLI,PAPairLI}}
//--
//
*/

/*! Function to abort on any MPI error: */
void abortOnError(int ierr) {
  if (ierr != MPI_SUCCESS) {
    printf("ERROR Returned by MPI: %d\n",ierr);
    char* lCharPtr = malloc(sizeof(char)*MPI_MAX_ERROR_STRING);
    int lLongueur = 0;
    MPI_Error_string(ierr,lCharPtr, &lLongueur);
    printf("ERROR_string Returned by MPI: %s\n",lCharPtr);
    MPI_Abort( MPI_
    /* [the rest of the attachment is truncated in the archive] */
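Since the attachment is cut off by the archive, here is a self-contained sketch of the offsetof-based construction discussed above. It is a reconstruction of the idea only, not the actual nested_bug.c: the struct and function names are made up. Note the explicit MPI_Type_create_resized calls; resizing each struct type to sizeof() of the corresponding C struct keeps the MPI extent in sync with whatever padding the compiler adds, which matters as soon as the type is nested or used in arrays.

#include <mpi.h>
#include <stddef.h>

/* Reconstruction of the idea behind nested_bug.c (names are made up). */
typedef struct { long aLong; int anInt; } PairLI;               /* {long,int}             */
typedef struct { long aFirst; PairLI a2Pairs[2]; } Long2Pairs;  /* {long,{PairLI,PairLI}} */

static MPI_Datatype createPairLIType(void)
{
    MPI_Datatype raw, typ;
    int          blocklen[2] = { 1, 1 };
    MPI_Aint     disp[2]     = { offsetof(PairLI, aLong), offsetof(PairLI, anInt) };
    MPI_Datatype types[2]    = { MPI_LONG, MPI_INT };

    MPI_Type_create_struct(2, blocklen, disp, types, &raw);
    /* force the extent to sizeof(PairLI) so the two elements of
       Long2Pairs::a2Pairs are laid out exactly like the C array */
    MPI_Type_create_resized(raw, 0, (MPI_Aint)sizeof(PairLI), &typ);
    MPI_Type_commit(&typ);
    MPI_Type_free(&raw);
    return typ;
}

static MPI_Datatype createLong2PairsType(MPI_Datatype pairType)
{
    MPI_Datatype raw, typ;
    int          blocklen[2] = { 1, 2 };
    MPI_Aint     disp[2]     = { offsetof(Long2Pairs, aFirst), offsetof(Long2Pairs, a2Pairs) };
    MPI_Datatype types[2]    = { MPI_LONG, pairType };

    MPI_Type_create_struct(2, blocklen, disp, types, &raw);
    MPI_Type_create_resized(raw, 0, (MPI_Aint)sizeof(Long2Pairs), &typ);
    MPI_Type_commit(&typ);
    MPI_Type_free(&raw);
    return typ;
}

A rank can then send/receive an array of Long2Pairs by passing the outer type and the element count, exactly like the 6-element vector shown in the first report.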
Re: [OMPI devel] Simplified: Misuse or bug with nested types?
Hi Paul,

okay, I have compiled the sources from the trunk and it works fine now... Sorry to have reported a duplicate...

Will it be in the next 1.6.X release?

Thanks,

Eric

On 2013-04-23 20:46, Paul Hargrove wrote:
> Eric,
>
> Are you testing against the Open MPI svn trunk? I ask because on April 9
> George committed a fix for the bug reported by Thomas Jahns:
> http://www.open-mpi.org/community/lists/devel/2013/04/12268.php
>
> -Paul
>
> On Tue, Apr 23, 2013 at 5:35 PM, Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
> [...quoted text clipped...]
[OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi,

I am testing the 2.X release candidate for the first time and I get a segmentation violation using MPI_File_write_all_end(MPI_File fh, const void *buf, MPI_Status *status).

The "special" thing may be that in the faulty test cases there are processes that haven't written anything, so they have a zero-length buffer and the second parameter (buf) passed is a null pointer. Until now this was a valid call; has it changed?

Thanks,

Eric

FWIW: We have been using our test suite (~2000 nightly tests) successfully with openmpi-1.{6,8,10}.* and MPICH for many years...
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi,

On 08/07/16 12:52 PM, Edgar Gabriel wrote:
> The default MPI I/O library has changed in the 2.x release to OMPIO for

ok, I am now doing I/O on my own hard drive... but I can test over NFS easily. For Lustre, I will have to produce a reduced example out of our test suite...

> most file systems. I can look into that problem, any chance to get access
> to the testsuite that you mentioned?

Yikes! Sounds interesting, but difficult to realize... Our in-house code is not public... :/

I have however proposed (to myself) to add a nightly compilation of openmpi (see http://www.open-mpi.org/community/lists/users/2016/06/29515.php) so I can report problems before releases are made...

Anyway, I will work on a little script to automate the MPI+PETSc+InHouseCode combination so I can give you feedback as soon as you propose a patch for me to test... I hope this will be convenient enough for you...

Thanks!

Eric

> Thanks
> Edgar
>
> On 7/8/2016 11:32 AM, Eric Chamberland wrote:
> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
On 08/07/16 01:44 PM, Edgar Gabriel wrote:
> ok, but just to be able to construct a test case, basically what you are doing is
>
> MPI_File_write_all_begin (fh, NULL, 0, some datatype);
> MPI_File_write_all_end (fh, NULL, &status),
>
> is this correct?

Yes, but with 2 processes: rank 0 writes something, but not rank 1...

Other info: rank 0 didn't wait for rank 1 after MPI_File_write_all_end, so it continued to the next MPI_File_write_all_begin with a different datatype but on the same file...

thanks!

Eric
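Written out, the two-rank pattern described above only takes a few lines. This is a hedged sketch of that pattern, not the actual test from the suite (the file name and the written value are made up):

#include <mpi.h>

/* Sketch of the reported pattern (run with 2 ranks): rank 0 writes one int,
 * rank 1 participates in the split collective with count 0 and a NULL buffer. */
int main(int argc, char** argv)
{
    int rank, value = 42;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &fh);

    if (rank == 0) {
        MPI_File_write_all_begin(fh, &value, 1, MPI_INT);
        MPI_File_write_all_end(fh, &value, &status);
    } else {
        /* zero-length contribution: buf is NULL and count is 0 */
        MPI_File_write_all_begin(fh, NULL, 0, MPI_INT);
        MPI_File_write_all_end(fh, NULL, &status);   /* 2.0.0rc4 reportedly crashes here */
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}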
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi Edgar,

I just saw that your patch got into ompi/master... any chance it goes into ompi-release/v2.x before rc5?

thanks,

Eric

On 08/07/16 03:14 PM, Edgar Gabriel wrote:
> I think I found the problem, I filed a pr towards master, and if that passes
> I will file a pr for the 2.x branch. Thanks!
>
> Edgar
>
> On 7/8/2016 1:14 PM, Eric Chamberland wrote:
> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi Howard,

ok, I will wait for 2.0.1rcX... ;)

I've put in place a script to download/compile OpenMPI+PETSc(3.7.2) and our code from the git repos. Now I am in a somewhat uncomfortable situation where neither the ompi-release.git nor the ompi.git repo is working for me. The first gives me the errors with MPI_File_write_all_end I reported, but the latter gives me errors like these:

[lorien:106919] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file ess_singleton_module.c at line 167
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:106919] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

So, for my continuous integration of OpenMPI I am in a no man's land... :(

Thanks anyway for the follow-up!

Eric

On 13/07/16 07:49 AM, Howard Pritchard wrote:
> Hi Eric,
>
> Thanks very much for finding this problem. We decided in order to have a
> reasonably timely release, that we'd triage issues and turn around a new RC
> if something drastic appeared. We want to fix this issue (and it will be
> fixed), but we've decided to defer the fix for this issue to a 2.0.1 bug fix
> release.
>
> Howard
>
> 2016-07-12 13:51 GMT-06:00 Eric Chamberland <eric.chamberl...@giref.ulaval.ca>:
> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi,

FYI: I've tested SHA e28951e, from a git clone launched around 01h19:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.13.01h19m30s_config.log

Eric

On 13/07/16 04:01 PM, Pritchard Jr., Howard wrote:
> Jeff,
>
> I think this was fixed in PR 1227 on v2.x
>
> Howard
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi Gilles,

On 13/07/16 08:01 PM, Gilles Gouaillardet wrote:
> Eric,
>
> OpenMPI 2.0.0 has been released, so the fix should land into the v2.x branch
> shortly.

ok, thanks again.

> If I understand correctly, your script downloads/compiles OpenMPI and then
> downloads/compiles PETSc.

More precisely, for OpenMPI I am cloning https://github.com/open-mpi/ompi.git and for Petsc, I just compile the latest release proven stable with our code, which is now 3.7.2.

> If this is correct, and for the time being, feel free to patch Open MPI v2.x
> before compiling it; the fix can be downloaded at
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1263.patch

Ok, but I think it is already included in the master of the clone I get... :)

Cheers,

Eric

> Cheers,
>
> Gilles
>
> On 7/14/2016 3:37 AM, Eric Chamberland wrote:
> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Thanks Ralph,

It is now *much* better: all sequential executions are working... ;) but I still have issues with a lot of parallel tests (but not all). The SHA tested last night was c3c262b.

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.14.01h20m32s_config.log

Here is the backtrace for most of these issues:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt': free(): invalid pointer: 0x7f9ab09c6020 ***
=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f9ab019b77f]
/lib64/libc.so.6(+0x78026)[0x7f9ab01a1026]
/lib64/libc.so.6(+0x78d53)[0x7f9ab01a1d53]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x172a1)[0x7f9aa3df32a1]
/opt/openmpi-2.x_opt/lib/libmpi.so.0(MPI_Request_free+0x4c)[0x7f9ab0761dac]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adaf9)[0x7f9ab7fa2af9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f9ab7f9dc35]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4574e7)[0x7f9ab7f4c4e7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecDestroy+0x648)[0x7f9ab7ef28ca]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_Z15GIREFVecDestroyRP6_p_Vec+0xe)[0x7f9abc9746de]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN12VecteurPETScD1Ev+0x31)[0x7f9abca8bfa1]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD2Ev+0x20c)[0x7f9abc9a013c]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD0Ev+0x9)[0x7f9abc9a01f9]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Formulation.so(_ZN10ProblemeGDD2Ev+0x42)[0x7f9abeeb94e2]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4159b9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9ab014ab25]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4084dc]

The very same code and tests all work well with openmpi-1.{8.4,10.2} and the same version of PETSc...

And the segfault with MPI_File_write_all_end seems gone... Thanks to Edgar! :)

Btw, I am wondering when I should report a bug or not, since I am "blindly" cloning around 01h20 am each day, independently of the "status" of the master... I don't want to bother anyone on this list with annoying bug reports... So tell me what you would like please...

Thanks,

Eric

On 13/07/16 08:36 PM, Ralph Castain wrote:
> Fixed on master
>
> On Jul 13, 2016, at 12:47 PM, Jeff Squyres (jsquyres) wrote:
>> I literally just noticed that this morning (that singleton was broken on
>> master), but hadn't gotten to bisecting / reporting it yet...
>>
>> I also haven't tested 2.0.0. I really hope singletons aren't broken then...
>> /me goes to test 2.0.0...
>>
>> Whew -- 2.0.0 singletons are fine. :-)
>>
>> On Jul 13, 2016, at 3:01 PM, Ralph Castain wrote:
>>> Hmmm…I see where the singleton on master might be broken - will check later today
>>>
>>> On Jul 13, 2016, at 11:37 AM, Eric Chamberland wrote:
>>> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi Edgar,

just to tell you that I tested your fix that has been merged into ompi-release/v2.x (9ba667815) and it works! :)

Thanks!

Eric

On 12/07/16 04:30 PM, Edgar Gabriel wrote:
> I think the decision was made to postpone the patch to 2.0.1, since the
> release of 2.0.0 is imminent.
>
> Thanks
> Edgar
>
> On 7/12/2016 2:51 PM, Eric Chamberland wrote:
> [...quoted text clipped...]
[OMPI devel] OpenMPI 2.0 and Petsc 3.7.2
Hi,

has someone tried OpenMPI 2.0 with Petsc 3.7.2?

I am having some errors with petsc; maybe someone has them too?

Here are the configure logs for PETSc:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_configure.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_RDict.log

And for OpenMPI:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_config.log

(in fact, I am testing the ompi-release branch, a sort of petsc-master branch, since I need the commit 9ba6678156).

For a set of parallel tests, 104 out of 124 pass.

And the typical error:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.dev': free(): invalid pointer:
=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f80eb11677f]
/lib64/libc.so.6(+0x78026)[0x7f80eb11c026]
/lib64/libc.so.6(+0x78d53)[0x7f80eb11cd53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f80ea8f9d60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16628)[0x7f80df0ea628]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16c50)[0x7f80df0eac50]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f80eb7029dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f80eb702ad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adc6d)[0x7f80f2fa6c6d]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f80f2fa1c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0xa9d0f5)[0x7f80f35960f5]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(MatDestroy+0x648)[0x7f80f35c2588]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x10bf0f4)[0x7f80f3bb80f4]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCDestroy+0x5d1)[0x7f80f3a79fd9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPDestroy+0x7b6)[0x7f80f3d1a334]

a similar one:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProbFluideIncompressible.dev': free(): invalid pointer: 0x7f382a7c5bc0 ***
=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f3829f1c77f]
/lib64/libc.so.6(+0x78026)[0x7f3829f22026]
/lib64/libc.so.6(+0x78d53)[0x7f3829f22d53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f38296ffd60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16628)[0x7f381deab628]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16c50)[0x7f381deabc50]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f382a5089dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f382a508ad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adc6d)[0x7f3831dacc6d]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f3831da7c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x9f4755)[0x7f38322f3755]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(MatDestroy+0x648)[0x7f38323c8588]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x4e2)[0x7f383287f87a]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCDestroy+0x5d1)[0x7f383287ffd9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPDestroy+0x7b6)[0x7f3832b20334]

another one:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.MortierDiffusion.dev': free(): invalid pointer: 0x7f67b6d37bc0 ***
=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f67b648e77f]
/lib64/libc.so.6(+0x78026)[0x7f67b6494026]
/lib64/libc.so.6(+0x78d53)[0x7f67b6494d53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f67b5c71d60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x1adae)[0x7f67aa4cddae]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x1b4ca)[0x7f67aa4ce4ca]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f67b6a7a9dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f67b6a7aad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adb09)[0x7f67be31eb09]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f67be319c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+
[the message is truncated in the archive]
Re: [OMPI devel] [petsc-users] OpenMPI 2.0 and Petsc 3.7.2
Ok, here are the 2 points answered:

#1) Got the valgrind output... here is the fatal free operation:

==107156== Invalid free() / delete / delete[] / realloc()
==107156==    at 0x4C2A37C: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==107156==    by 0x1E63CD5F: opal_free (malloc.c:184)
==107156==    by 0x27622627: mca_pml_ob1_recv_request_fini (pml_ob1_recvreq.h:133)
==107156==    by 0x27622C4F: mca_pml_ob1_recv_request_free (pml_ob1_recvreq.c:90)
==107156==    by 0x1D3EF9DC: ompi_request_free (request.h:362)
==107156==    by 0x1D3EFAD5: PMPI_Request_free (prequest_free.c:59)
==107156==    by 0x14AE3B9C: VecScatterDestroy_PtoP (vpscat.c:219)
==107156==    by 0x14ADEB74: VecScatterDestroy (vscat.c:1860)
==107156==    by 0x14A8D426: VecDestroy_MPI (pdvec.c:25)
==107156==    by 0x14A33809: VecDestroy (vector.c:432)
==107156==    by 0x10A2A5AB: GIREFVecDestroy(_p_Vec*&) (girefConfigurationPETSc.h:115)
==107156==    by 0x10BA9F14: VecteurPETSc::detruitObjetPETSc() (VecteurPETSc.cc:2292)
==107156==    by 0x10BA9D0D: VecteurPETSc::~VecteurPETSc() (VecteurPETSc.cc:287)
==107156==    by 0x10BA9F48: VecteurPETSc::~VecteurPETSc() (VecteurPETSc.cc:281)
==107156==    by 0x1135A57B: PPReactionsAppuiEL3D::~PPReactionsAppuiEL3D() (PPReactionsAppuiEL3D.cc:216)
==107156==    by 0xCD9A1EA: ProblemeGD::~ProblemeGD() (in /home/mefpp_ericc/depots_prepush/GIREF/lib/libgiref_dev_Formulation.so)
==107156==    by 0x435702: main (Test.ProblemeGD.icc:381)
==107156==  Address 0x1d6acbc0 is 0 bytes inside data symbol "ompi_mpi_double"
--107156-- REDIR: 0x1dda2680 (libc.so.6:__GI_stpcpy) redirected to 0x4c2f330 (__GI_stpcpy)
==107156==
==107156== Process terminating with default action of signal 6 (SIGABRT): dumping core
==107156==    at 0x1DD520C7: raise (in /lib64/libc-2.19.so)
==107156==    by 0x1DD53534: abort (in /lib64/libc-2.19.so)
==107156==    by 0x1DD4B145: __assert_fail_base (in /lib64/libc-2.19.so)
==107156==    by 0x1DD4B1F1: __assert_fail (in /lib64/libc-2.19.so)
==107156==    by 0x27626D12: mca_pml_ob1_send_request_fini (pml_ob1_sendreq.h:221)
==107156==    by 0x276274C9: mca_pml_ob1_send_request_free (pml_ob1_sendreq.c:117)
==107156==    by 0x1D3EF9DC: ompi_request_free (request.h:362)
==107156==    by 0x1D3EFAD5: PMPI_Request_free (prequest_free.c:59)
==107156==    by 0x14AE3C3C: VecScatterDestroy_PtoP (vpscat.c:225)
==107156==    by 0x14ADEB74: VecScatterDestroy (vscat.c:1860)
==107156==    by 0x14A8D426: VecDestroy_MPI (pdvec.c:25)
==107156==    by 0x14A33809: VecDestroy (vector.c:432)
==107156==    by 0x10A2A5AB: GIREFVecDestroy(_p_Vec*&) (girefConfigurationPETSc.h:115)
==107156==    by 0x10BA9F14: VecteurPETSc::detruitObjetPETSc() (VecteurPETSc.cc:2292)
==107156==    by 0x10BA9D0D: VecteurPETSc::~VecteurPETSc() (VecteurPETSc.cc:287)
==107156==    by 0x10BA9F48: VecteurPETSc::~VecteurPETSc() (VecteurPETSc.cc:281)
==107156==    by 0x1135A57B: PPReactionsAppuiEL3D::~PPReactionsAppuiEL3D() (PPReactionsAppuiEL3D.cc:216)
==107156==    by 0xCD9A1EA: ProblemeGD::~ProblemeGD() (in /home/mefpp_ericc/depots_prepush/GIREF/lib/libgiref_dev_Formulation.so)
==107156==    by 0x435702: main (Test.ProblemeGD.icc:381)

#2) The run with -vecscatter_alltoall works...!

As an "end user", should I ever modify these VecScatterCreate options? How do they change the performance of the code on large problems?

Thanks,

Eric

On 25/07/16 02:57 PM, Matthew Knepley wrote:
> On Mon, Jul 25, 2016 at 11:33 AM, Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
> [...quoted text clipped...]
>
> It appears that the fault happens when freeing the VecScatter we build for
> MatMult, which contains Request structures for the ISends and IRecvs. These
> look like internal OpenMPI errors to me since the Request should be opaque.
>
> I would try at least two things:
>
> 1) Run under valgrind.
>
> 2) Switch the VecScatter implementation. All the options are here,
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecScatterCreate.html#VecScatterCreate
> but maybe use alltoall.
>
> Thanks,
>
> Matt
>
> [...quoted text clipped...]
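For context, the PETSc frames in the trace (VecScatterDestroy_PtoP calling MPI_Request_free) correspond to releasing persistent communication requests. The snippet below is my own minimal illustration of that usage pattern, not PETSc code; it only shows that freeing inactive persistent requests with MPI_Request_free is legal MPI.

#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, peer;
    double sendbuf = 1.0, recvbuf = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                       /* run with exactly 2 ranks */

    /* create persistent requests (inactive until started) */
    MPI_Send_init(&sendbuf, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(&recvbuf, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* use them any number of times... */
    MPI_Startall(2, reqs);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ...and release them; this is the call that ends up in
       mca_pml_ob1_recv_request_free()/send_request_free() in the trace */
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);

    MPI_Finalize();
    return 0;
}

Since the invalid free reported by valgrind lands inside the data symbol "ompi_mpi_double", the corruption appears to be below the MPI API, which is consistent with Matthew's remark that the Request should be opaque to PETSc and the application.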
Re: [OMPI devel] OpenMPI 2.0 and Petsc 3.7.2
Hi Gilles,

On 25/07/16 10:38 PM, Gilles Gouaillardet wrote:
> Eric,
>
> where can your test case be downloaded ? how many nodes and tasks do you
> need to reproduce the bug ?

Sadly, it is in our in-house code and it requires the whole source code, which isn't public... :/

I have this bug with 20 parallel tests from our 124-test database, running with 2 to 10 processes (but 2 for most of them). The bug happens at the very end of the execution (FE resolution + exports), when everything gets destroyed, including the PETSc stuff.

Unfortunately, running "make test" and "make testexamples" at the end of the petsc installation doesn't trigger the bug... :/

> fwiw, currently there are two Open MPI repositories
> - https://github.com/open-mpi/ompi
>   there is only one branch and it is the 'master' branch; today, this can be
>   seen as Open MPI 3.0 pre alpha
> - https://github.com/open-mpi/ompi-release
>   the default branch is 'v2.x'; today, this can be seen as Open MPI 2.0.1
>   pre alpha

I tested both... I reported the error also for the "master" of ompi, and they seem related to me, see: https://github.com/open-mpi/ompi/issues/1875

Thanks,

Eric

> Cheers,
>
> Gilles
>
> On 7/26/2016 3:33 AM, Eric Chamberland wrote:
> [...quoted text clipped...]
[OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi,

It is the third time this has happened in the last 10 days. While running the nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:

[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

But I can't reproduce the problem right now... i.e.: if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...

Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)

Here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt

Thanks,

Eric
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Other relevant info: I never saw this problem with OpenMPI 1.6.5, 1.8.4 and 1.10.[3,4], which run the same test suite...

thanks,

Eric

On 13/09/16 11:35 AM, Eric Chamberland wrote:
> [...quoted text clipped...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
On 13/09/16 12:11 PM, Pritchard Jr., Howard wrote:
> Hello Eric,
>
> Is the failure seen with the same two tests? Or is it random which tests
> fail? If it's not random, would you be able to post

No, the tests that failed were different ones...

> the tests to the list?
>
> Also, if possible, it would be great if you could test against a master
> snapshot:
>
> https://www.open-mpi.org/nightly/master/

Yes I can, but since the bug appears only from time to time, I don't think I can get relevant info from a single run on master; I will have to wait, let's say, 10 or 15 days before it crashes... and that may be hard since master is less stable than the release branch and will have normal failures... :/

Eric
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
On 14/09/16 01:36 AM, Gilles Gouaillardet wrote: Eric, can you please provide more information on how your tests are launched ? Yes! do you mpirun -np 1 ./a.out or do you simply ./a.out For all sequential tests, we do ./a.out. do you use a batch manager ? if yes, which one ? No. do you run one test per job ? or multiple tests per job ? On this automatic compilation, up to 16 tests are launched together. how are these tests launched ? For sequential ones, the special thing is that they are launched via python Popen call, which launches "time" which launches the code. So the "full" commande line is: /usr/bin/time -v -o /users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien/Time.Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier.txt /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.Laplacien.opt mpi_v=2 verbose=True Beowulf=False outilMassif=False outilPerfRecord=False verifValgrind=False outilPerfStat=False outilCallgrind=False RepertoireDestination=/users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien RepertoireTest=/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/TestValidation/Ressources/opt/Test.Laplacien Prefixe=Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier do the test that crashes use MPI_Comm_spawn ? i am surprised by the process name [[9325,5754],0], which suggests there MPI_Comm_spawn was called 5753 times (!) can you also run hostname on the 'lorien' host ? [eric@lorien] Scripts (master $ u+1)> hostname lorien if you configure'd Open MPI with --enable-debug, can you Yes. export OMPI_MCA_plm_base_verbose 5 then run one test and post the logs ? Hmmm, strange? [lorien:93841] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [lorien:93841] plm:base:set_hnp_name: initial bias 93841 nodename hash 1366255883 [lorien:93841] plm:base:set_hnp_name: final jobfam 22260 [lorien:93841] [[22260,0],0] plm:rsh_setup on agent ssh : rsh path NULL [lorien:93841] [[22260,0],0] plm:base:receive start comm [lorien:93841] [[22260,0],0] plm:base:launch [22260,1] registered [lorien:93841] [[22260,0],0] plm:base:launch job [22260,1] is not a dynamic spawn [lorien:93841] [[22260,0],0] plm:base:receive stop comm ~ from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325) the discrepancy could be explained by the use of a batch manager and/or a full hostname i am unaware of. orte_plm_base_set_hnp_name() generate a 16 bits job family from the (32 bits hash of the) hostname and the mpirun (32 bits ?) pid. so strictly speaking, it is possible two jobs launched on the same node are assigned the same 16 bits job family. the easiest way to detect this could be to - edit orte/mca/plm/base/plm_base_jobid.c and replace OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output, "plm:base:set_hnp_name: final jobfam %lu", (unsigned long)jobfam)); with OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output, "plm:base:set_hnp_name: final jobfam %lu", (unsigned long)jobfam)); configure Open MPI with --enable-debug and rebuild and then export OMPI_MCA_plm_base_verbose=4 and run your tests. when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint to a conflict. Does this gives the same output as with export OMPI_MCA_plm_base_verbose=5 without the patch? 
If so, because everything is automated, applying a patch is "harder" for me than doing a simple export OMPI_MCA_plm_base_verbose=5, so maybe I could just add OMPI_MCA_plm_base_verbose=5 to all tests and wait until it hangs?

Thanks!

Eric

> Cheers, Gilles
>
> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>> Hi,
>> It is the third time this has happened in the last 10 days. While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>> But I can't reproduce the problem right now... i.e. if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...
>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)
>> Here are the build logs:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>> Thanks,
>> Eric
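Gilles's explanation above (a 16-bit job family derived from a 32-bit hash of the hostname and the launching pid) is the key to the suspected collision. Below is a simplified, self-contained sketch; the hash and the reduction are toy stand-ins, not the actual code in orte/mca/plm/base/plm_base_jobid.c. It only illustrates why two different pids on the same node can end up with the same 16-bit job family:

/* Illustration only: NOT the real orte_plm_base_set_hnp_name() algorithm.
 * It just shows why folding (hostname hash, pid) down to 16 bits can
 * produce the same "job family" for two different processes on one node. */
#include <stdio.h>
#include <stdint.h>

/* hypothetical 32-bit string hash, a stand-in for the real hash function */
static uint32_t toy_hash(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33u + (uint32_t)*s++;
    return h;
}

/* hypothetical reduction of (hostname hash, pid) to a 16-bit job family */
static uint16_t toy_jobfam(const char *hostname, uint32_t pid)
{
    return (uint16_t)((toy_hash(hostname) ^ pid) & 0xffffu);
}

int main(void)
{
    /* two different pids on the same node that land on the same 16 bits:
     * they differ only above bit 15, so the folded value is identical */
    uint32_t pid_a = 142766, pid_b = 142766 + 0x10000;
    printf("jobfam(a) = %u\n", (unsigned)toy_jobfam("lorien", pid_a));
    printf("jobfam(b) = %u\n", (unsigned)toy_jobfam("lorien", pid_b));
    return 0;
}

Both calls print the same value, which is exactly the kind of conflict the increased OMPI_MCA_plm_base_verbose output is meant to expose.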
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Lucky! Since each run has a specific TMP, I still have it on disk.

For the faulty run, the TMP variable was TMP=/tmp/tmp.wOv5dkNaSI, and into $TMP I have openmpi-sessions-40031@lorien_0, and into this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
1841
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
total 68
drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
...

If I do:

lsof | grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.

nothing...

What else may I check?

Eric

On 14/09/16 08:47 AM, Joshua Ladd wrote:
> Hi, Eric
> I **think** this might be related to the following: https://github.com/pmix/master/pull/145
> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
> Best, Josh
>
> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet wrote:
>> Eric, can you please provide more information on how your tests are launched? [...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
> Eric, do you mean you have a unique $TMP per a.out?
No.
> Or a unique $TMP per "batch" of runs?
Yes. I was happy because each nightly batch has its own TMP, so I can check afterwards for problems related to a specific night without interference from another nightly batch of tests... if a bug ever happens... ;)
> In the first case, my understanding is that conflicts cannot happen...
> Once you hit the bug, can you please please post the output of the failed a.out, and run egrep 'jobfam|stop' on all your logs, so we might spot a conflict.
OK, I will launch it manually later today, but it will be automatic tonight (with export OMPI_MCA_plm_base_verbose=5).

Thanks!

Eric

> Cheers, Gilles
>
> On Wednesday, September 14, 2016, Eric Chamberland wrote:
>> Lucky! Since each run has a specific TMP, I still have it on disk. [...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Ok, one test segfaulted, *but* I can't tell if it is the *same* bug, because there has been a segfault:

stderr: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

stdout:
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------

openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx------    3 cmpbib bib    250 Sep 14 13:34 .
drwxrwxrwt  356 root   root 61440 Sep 14 13:45 ..
...
drwx------ 1848 cmpbib bib  45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
srw-rw-r--    1 cmpbib bib      0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
(this is the faulty test)

full egrep: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
config.log: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
ompi_info: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt

Maybe it aborted (instead of giving the other message) while handling the error, because of export OMPI_MCA_plm_base_verbose=5?

Thanks,

Eric

On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
> Eric, do you mean you have a unique $TMP per a.out? Or a unique $TMP per "batch" of runs? In the first case, my understanding is that conflicts cannot happen... [...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
> Eric,
> a bug has been identified, and a patch is available at https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying a patch does not fit your test workflow, it might be easier for you to update it and mpirun -np 1 ./a.out instead of ./a.out
> basically, increasing verbosity runs some extra code, which includes sprintf. so yes, it is possible to crash an app by increasing verbosity, by running into a bug that is hidden under normal operation. my intuition suggests this is quite unlikely... if you can get a core file and a backtrace, we will soon find out

Damn! I did get one, but it got erased last night when the automatic process started again (which erases all directories before starting)... :/

I would like to put core files in a user-specific directory, but it seems that has to be a system-wide configuration... :/ I will work around this by changing the "pwd" to a path outside the erased directory... So as of tonight I should be able to retrieve core files even after I relaunch the process.

Thanks for all the support!

Eric

> Cheers, Gilles
>
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>> Ok, one test segfaulted, *but* I can't tell if it is the *same* bug, because there has been a segfault: [...]
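Eric's workaround above (changing the working directory so core files land outside the tree that gets erased) can be done directly in the test binary or its launcher. A minimal sketch, assuming a hypothetical persistent directory and that kernel.core_pattern has not redirected core files elsewhere:

/* Minimal sketch of the core-file workaround: raise RLIMIT_CORE and
 * chdir() to a directory outside the tree that the nightly run erases.
 * The path below is hypothetical. */
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    const char *core_dir = "/tmp/kept_cores";   /* hypothetical persistent dir */
    struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };

    mkdir(core_dir, 0700);                      /* EEXIST is fine here */
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        perror("setrlimit(RLIMIT_CORE)");
    /* core files go to the current working directory by default,
     * unless kernel.core_pattern says otherwise */
    if (chdir(core_dir) != 0)
        perror("chdir");

    MPI_Init(&argc, &argv);
    /* ... test body ... */
    MPI_Finalize();
    return 0;
}

The chdir() trick only helps because cores default to the current working directory; a cluster-wide core_pattern would still need the system-wide configuration Eric mentions.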
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi,

I know the pull request has not (yet) been merged, but here is a somewhat "different" output from a single sequential test (automatically) launched without mpirun last night:

[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 1366255883
[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received unexpected process identifier [[41545,0],0] from [[39075,0],0]
[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0]
[lorien:172229] [[39075,0],0] plm:base:receive stop comm

Unfortunately, I didn't get any core dump (???).

The line:
[lorien:172218] Signal code: Invalid permissions (2)
is curious, or not?

As usual, here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt

Will PR #1376 prevent or fix this too?

Thanks again!

Eric

On 15/09/16 09:32 AM, Eric Chamberland wrote:
> Hi Gilles,
> On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
>> Eric, a bug has been identified, and a patch is available at https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch [...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi Gilles,

Just to mention that since PR 2091 has been merged into 2.0.x, I haven't had any failures! Since 2.0.0 and 2.0.1 aren't usable for us, the next version should be a good one... So will there be a 2.0.2 release, or will it go to 2.1.0 directly?

Thanks,

Eric

On 16/09/16 10:01 AM, Gilles Gouaillardet wrote:
> Eric,
> I expect the PR will fix this bug. The crash occurs after the unexpected process identifier error, and this error should not happen in the first place. So at this stage, I would not worry too much about that crash (to me, it is undefined behavior anyway).
> Cheers, Gilles
>
> On Friday, September 16, 2016, Eric Chamberland wrote:
>> Hi, I know the pull request has not (yet) been merged, but here is a somewhat "different" output from a single sequential test (automatically) launched without mpirun last night: [...]
[OMPI devel] Bug on branch v2.x since October 3
Hi,

Since commit 18f23724a, our nightly base test is broken on the v2.x branch. Strangely, on branch v3.x it broke the same day with 2fd9510b4b44, but was repaired some days later (can't tell exactly, but at the latest it was fixed with fa3d92981a).

I get segmentation faults or deadlocks in many cases. Could this be related to issue 5842? (https://github.com/open-mpi/ompi/issues/5842)

Here is an example of a backtrace for a deadlock:

#4
#5 0x7f9dc9151d17 in sched_yield () from /lib64/libc.so.6
#6 0x7f9dccee in opal_progress () at runtime/opal_progress.c:243
#7 0x7f9dbe53cf78 in ompi_request_wait_completion (req=0x46ea000) at ../../../../ompi/request/request.h:392
#8 0x7f9dbe53e162 in mca_pml_ob1_recv (addr=0x7f9dd64a6b30 long, long, PAType*, std::__debug::vectorstd::allocator >&)::slValeurs>, count=3, datatype=0x7f9dca61e2c0 , src=1, tag=32767, comm=0x7f9dca62a840 , status=0x7ffcf4f08170) at pml_ob1_irecv.c:129
#9 0x7f9dca35f3c4 in PMPI_Recv (buf=0x7f9dd64a6b30 long, long, PAType*, std::__debug::vectorstd::allocator >&)::slValeurs>, count=3, type=0x7f9dca61e2c0 , source=1, tag=32767, comm=0x7f9dca62a840 , status=0x7ffcf4f08170) at precv.c:77
#10 0x7f9dd6261d06 in assertionValeursIdentiquesSurTousLesProcessus (pComm=0x7f9dca62a840 , pRang=0, pNbProcessus=2, pValeurs=0x7f9dd5a94da0 girefSynchroniseGroupeProcessusModeDebugImpl(PAGroupeProcessus const&, char const*, int)::slDonnees>, pRequetes=std::__debug::vector of length 1, capacity 1 = {...}) at /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/src/commun/Parallele/mpi_giref.cc:332

And some information about the configuration:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2018.10.17.02h16m02s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2018.10.17.02h16m02s_ompi_info_all.txt

Thanks,

Eric
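The deadlocked frame #10 is a consistency check that blocks in MPI_Recv until every rank reaches it. The GIREF implementation is not shown in the thread; the sketch below is only a hypothetical illustration of that kind of check, and of where a receive hangs forever if one rank crashes or never reaches the call:

/* Hypothetical sketch of a "values must be identical on all processes"
 * check, of the kind the backtrace suggests.  This is NOT the GIREF code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

static void assert_same_on_all_ranks(MPI_Comm comm, const long *vals, int n)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        long *other = malloc(n * sizeof(long));
        for (int src = 1; src < size; ++src) {
            /* blocks here forever if rank 'src' never posts the matching send */
            MPI_Recv(other, n, MPI_LONG, src, 0, comm, MPI_STATUS_IGNORE);
            if (memcmp(other, vals, n * sizeof(long)) != 0)
                fprintf(stderr, "rank %d disagrees with rank 0\n", src);
        }
        free(other);
    } else {
        MPI_Send((void *)vals, n, MPI_LONG, 0, 0, comm);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    long v[3] = { 1, 2, 3 };
    assert_same_on_all_ranks(MPI_COMM_WORLD, v, 3);
    MPI_Finalize();
    return 0;
}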
Re: [OMPI devel] Bug on branch v2.x since October 3
ok, thanks a lot! :)

Eric

On 17/10/18 01:32 PM, Nathan Hjelm via devel wrote:
> Ah yes, 18f23724a broke things so we had to fix the fix. Didn't apply it to the v2.x branch. Will open a PR to bring it over.
> -Nathan
>
> On Oct 17, 2018, at 11:28 AM, Eric Chamberland wrote:
>> Hi, since commit 18f23724a, our nightly base test is broken on the v2.x branch. [...]
Re: [OMPI devel] 1.8.4rc2 now available for testing
On 12/11/2014 05:45 AM, Ralph Castain wrote:
> ... by the reporters. Still, I would appreciate a fairly thorough testing as this is expected to be the last 1.8 series release for some time.

Is it relevant to report valgrind leaks? Maybe they are "normal", maybe not, I don't know. If they are normal, maybe suppressions should be added to .../share/openmpi/openmpi-valgrind.supp before the release?

Here is a simple test case ;-) :

cat mpi_init_finalize.c
#include "mpi.h"

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}

mpicc -o mpi_init_finalize mpi_init_finalize.c

mpiexec -np 1 valgrind -v --suppressions=/opt/openmpi-1.8.4rc2/share/openmpi/openmpi-valgrind.supp --gen-suppressions=all --leak-check=full --leak-resolution=high --show-reachable=yes --error-limit=no --num-callers=24 --track-fds=yes --log-file=valgrind_out.n%q{OMPI_COMM_WORLD_RANK} ./mpi_init_finalize

Running with 2 processes generates some more:

mpiexec -np 2 --log-file=valgrind_out_2proc.n%q{OMPI_COMM_WORLD_RANK} ./mpi_init_finalize

which results in the files attached...

Thanks,

Eric

valgrind_out.tgz
Description: application/compressed-tar
Re: [OMPI devel] 1.8.4rc2 now available for testing
On 12/12/2014 11:38 AM, Jeff Squyres (jsquyres) wrote:
> Did you configure OMPI with --enable-memchecker?
No, only "--prefix=".

Eric
Re: [OMPI devel] 1.8.4rc2 now available for testing
On 12/12/2014 01:12 PM, Ralph Castain wrote:
> I just checked it with --enable-memchecker --with-valgrind and found that many of these are legitimate leaks. We can take a look at them, though as I said, perhaps it may wait for 1.8.5 as I wouldn't hold up 1.8.4 for it.

Wait! When end developers of other software valgrind their code, they find leaks from Open MPI and then ask themselves: "Did I misuse MPI?" So they have to look around, into the FAQ, and find this:

http://www.open-mpi.org/faq/?category=debugging#valgrind_clean

and tell themselves: "Fine, now with this suppression file, I am sure the remaining leaks are my fault!" and try to find why these leaks remain in their code... then, not understanding what is wrong... they ask the list to see if it is normal or not... ;-)

May I suggest giving suppressions names like "real_leak_to_be_fixed_in_next_release_#", so at least you won't forget to fix them, and the rest of us won't be upset about a supposed misuse of the library? Or maybe put them into another suppression file? But list them somewhere: that would really help us!

Thanks,

Eric

ps: we valgrind our code every night to be able to detect new leaks or defects as soon as possible...
[OMPI devel] 1.8.4rc3: WARNING: No loopback interface was found
Hi,

I first saw this message using 1.8.4rc3:

--------------------------------------------------------------------------
WARNING: No loopback interface was found. This can cause problems when we spawn processes as they are likely to be unable to connect back to their host daemon. Sadly, it may take awhile for the connect attempt to fail, so you may experience a significant hang time.

You may wish to ctrl-c out of your job and activate loopback support on at least one interface before trying again.
--------------------------------------------------------------------------

I have compiled it in "debug" mode... is that the problem?

...but I think I do have a loopback on my host:

ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:25:90:0D:A5:38
          inet addr:132.203.7.22  Bcast:132.203.7.255  Mask:255.255.255.0
          inet6 addr: fe80::225:90ff:fe0d:a538/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:49080380 errors:0 dropped:0 overruns:0 frame:0
          TX packets:67526463 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:35710440484 (34056.1 Mb)  TX bytes:64050625687 (61083.4 Mb)
          Interrupt:16 Memory:faee-faf0

eth1      Link encap:Ethernet  HWaddr 00:25:90:0D:A5:39
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:17 Memory:fafe-fb00

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:3089696 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3089696 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:8421008033 (8030.8 Mb)  TX bytes:8421008033 (8030.8 Mb)

Is that message erroneous?

Thanks,

Eric
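For what it's worth, the presence of a loopback interface can be checked independently of Open MPI with getifaddrs(). This is only a sanity check: Open MPI's own interface discovery may apply additional filtering (for example through MCA interface include/exclude parameters) before deciding that no loopback is usable, so a positive result here does not by itself prove the warning wrong.

/* Standalone check: does the OS report any interface with IFF_LOOPBACK? */
#include <sys/types.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
    struct ifaddrs *ifa_list, *ifa;
    int found = 0;

    if (getifaddrs(&ifa_list) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_flags & IFF_LOOPBACK) {
            /* an interface may be listed once per address family */
            printf("loopback interface found: %s\n", ifa->ifa_name);
            found = 1;
        }
    }
    freeifaddrs(ifa_list);
    if (!found)
        printf("no loopback interface visible via getifaddrs()\n");
    return 0;
}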
Re: [OMPI devel] 1.8.4rc3: WARNING: No loopback interface was found
Forgot this:

ompi_info -all: http://www.giref.ulaval.ca/~ericc/ompi_bug/ompi_info.all.184rc3.txt.gz
config.log: http://www.giref.ulaval.ca/~ericc/ompi_bug/config.184rc3.log.gz

Eric
[OMPI devel] BUG in ADIOI_NFS_WriteStrided
Hi,

I encountered a new bug while testing our collective MPI I/O functionalities over NFS. This is not a big issue for us, but I think someone should have a look at it.

While running with 3 processes, we get this error on rank #0 and rank #2, knowing that rank #1 has nothing to write (0-length size) on this particular PMPI_File_write_all_begin call:

==19211== Syscall param write(buf) points to uninitialised byte(s)
==19211==    at 0x10CB739D: ??? (in /lib64/libpthread-2.17.so)
==19211==    by 0x27438431: ADIOI_NFS_WriteStrided (ad_nfs_write.c:645)
==19211==    by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==    by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==    by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin (write_allb.c:44)
==19211==    by 0x2742A367: mca_io_romio_file_write_all_begin (io_romio_file_write.c:264)
==19211==    by 0x12126520: PMPI_File_write_all_begin (pfile_write_all_begin.c:74)
==19211==    by 0x4D7CFB: SYEnveloppeMessage PAIO::ecritureIndexeParBlocMPI, PtrPorteurConstArete>, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>, FunctorAccesseurPorteurLocalArete> > >(PAGroupeProcessus&, ompi_file_t*, long long, PtrPorteurConst, PtrPorteurConst, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>&, FunctorAccesseurPorteurLocalArete> >&, long, DistributionComposantes&, long, unsigned long, unsigned long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4E9A67: GISLectureEcriture::visiteMaillage(Maillage const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4C79A2: GISLectureEcriture::ecritGISMPI(std::string, GroupeInfoSur const&, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4961AD: main (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211== Address 0x295af060 is 144 bytes inside a block of size 524,288 alloc'd
==19211==    at 0x4C2C27B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==19211==    by 0x2745E78E: ADIOI_Malloc_fn (malloc.c:50)
==19211==    by 0x2743757C: ADIOI_NFS_WriteStrided (ad_nfs_write.c:497)
==19211==    by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==    by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==    by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin (write_allb.c:44)
==19211==    by 0x2742A367: mca_io_romio_file_write_all_begin (io_romio_file_write.c:264)
==19211==    by 0x12126520: PMPI_File_write_all_begin (pfile_write_all_begin.c:74)
==19211==    by 0x4D7CFB: SYEnveloppeMessage PAIO::ecritureIndexeParBlocMPI, PtrPorteurConstArete>, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>, FunctorAccesseurPorteurLocalArete> > >(PAGroupeProcessus&, ompi_file_t*, long long, PtrPorteurConst, PtrPorteurConst, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>&, FunctorAccesseurPorteurLocalArete> >&, long, DistributionComposantes&, long, unsigned long, unsigned long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4E9A67: GISLectureEcriture::visiteMaillage(Maillage const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4C79A2: GISLectureEcriture::ecritGISMPI(std::string, GroupeInfoSur const&, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4961AD: main (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211== Uninitialised value was created by a heap allocation
==19211==    at 0x4C2C27B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==19211==    by 0x2745E78E: ADIOI_Malloc_fn (malloc.c:50)
==19211==    by 0x2743757C: ADIOI_NFS_WriteStrided (ad_nfs_write.c:497)
==19211==    by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==    by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==    by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin (write_allb.c:44)
==19211==    by 0x2742A367: mca_io_romio_file_write_all_begin (io_romio_file_write.c:264)
==19211==    by 0x12126520: PMPI_File_write_all_begin (pfile_write_all_begin.c:74)
==19211==    by 0x4D7CFB: SYEnveloppeMessage PAIO::ecritureIndexeParBlocMPI, PtrPorteurConstArete>, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>, FunctorAccesseurPorteurLocalArete> > >(PAGroupeProcessus&, ompi_file_t*, long long, PtrPorteurConst, PtrPorteurConst, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>&, FunctorAccesseurPorteurLocalArete> >&, long, DistributionComposantes&, long, unsigned long, unsigned long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.op
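The trigger described above (one rank contributing zero elements to a collective write) can be expressed with a much smaller calling pattern than the full GIREF test. The sketch below is hypothetical, and it uses MPI_File_write_at_all rather than the write_all_begin/end pair from the report; it only shows the zero-contribution collective call that every rank, including the empty one, must still make. Whether it reaches the exact ADIOI_NFS_WriteStrided path depends on the file view, the hints, and the filesystem.

/* Hypothetical minimal pattern: a collective write where rank 1
 * contributes zero elements.  Not the GIREF test code. */
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, nvals, i;
    int vals[4];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    nvals = (rank == 1) ? 0 : 4;          /* rank 1 has nothing to write */
    for (i = 0; i < nvals; ++i)
        vals[i] = rank * 100 + i;

    MPI_File_open(MPI_COMM_WORLD, "nfs_test.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* every rank must take part in the collective call, even with count 0 */
    MPI_File_write_at_all(fh, (MPI_Offset)(rank * 4 * (int)sizeof(int)),
                          vals, nvals, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}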
[OMPI devel] BUG in ADIOI_NFS_WriteStrided
AccesseurPorteurLocalArete> > >(PAGroupeProcessus&, ADIOI_FileD*, long long, PtrPorteurConst, PtrPorteurConst, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>&, FunctorAccesseurPorteurLocalArete> >&, long, DistributionComposantes&, long, unsigned long, unsigned long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==    by 0x4DDBFE: GISLectureEcriture::visiteMaillage(Maillage const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==    by 0x4BCB22: GISLectureEcriture::ecritGISMPI(std::string, GroupeInfoSur const&, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==    by 0x48E213: main (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)

Can't tell if it is a big issue or not, but I thought I should mention it to the list.

We run without this valgrind error when I use my local disk partition instead of an NFS partition, or if I run with only 1 process (which always has something to write for each PMPI_File_write_all_begin) and write to an NFS partition.

Have you guys thought about unifying this part of the code? Making it a sub-library? (Please don't flame me... ;-))

Anyway, thanks,

Eric

On 12/19/2014 02:16 PM, Howard Pritchard wrote:
> Hi Eric,
> Does your app also work with MPICH? The romio in Open MPI is getting a bit old, so it would be useful to know if you see the same valgrind error using a recent MPICH.
> Howard
>
> 2014-12-19 9:50 GMT-07:00 Eric Chamberland wrote:
>> Hi, I encountered a new bug while testing our collective MPI I/O functionalities over NFS. [...]
Re: [OMPI devel] [mpich-discuss] BUG in ADIOI_NFS_WriteStrided
On 12/19/2014 09:52 PM, Rob Latham wrote:
> Please don't use NFS for MPI-IO. ROMIO makes a best effort but there's no way to guarantee you won't corrupt a block of data (NFS clients are allowed to cache... arbitrarily, it seems). There are so many good parallel file systems with saner consistency semantics.

Ok. But how can I know the type of filesystem my users will work on? For small jobs, they may have data on NFS and not care too much about read/write speed... and I want only one file format that can be used on any filesystem...

Do you recommend that I disable ROMIO/NFS support when configuring MPICH (how do you ask configure for this)? What other library is recommended for writing distributed data on NFS? Does HDF5, for example, switch from MPI I/O to something else when doing collective I/O on NFS?

I don't want my file-writing function to depend on the final type of filesystem... I expect the library to do a good job for me... and I have chosen MPI I/O to do that job... ;-)

Can't tell anything about how usable NFS is with MPI I/O... I just use it because our nightly tests write results to NFS partitions... as our users may do...

> This looks like maybe a calloc would clean it right up.

Ok, the point is: is there a bug, and can it be fixed (even if it is not recommended to use ROMIO/NFS), or at least tracked?

Thanks!

Eric
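Rob's calloc remark points at the class of fix: the flagged staging buffer is written to the file with "holes" that were never filled, so allocating it zeroed makes every written byte defined. The sketch below is not the ROMIO code, just a plain C illustration of that difference; the file name and sizes are arbitrary.

/* Illustration of the calloc-style fix: a partially filled buffer is
 * written in full, so zero-initialising it keeps valgrind quiet and
 * makes the on-disk bytes deterministic. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define BUFSZ 524288   /* same size as the block valgrind flagged */

int main(void)
{
    /* char *buf = malloc(BUFSZ);       <- holes stay uninitialised   */
    char *buf = calloc(1, BUFSZ);     /* <- holes are defined zeroes  */
    if (buf == NULL)
        return 1;

    /* fill only part of the buffer, as a strided write does */
    memset(buf, 'x', 128);

    int fd = open("strided.out", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) {
        ssize_t n = write(fd, buf, BUFSZ);   /* no "uninitialised byte(s)" now */
        (void)n;
        close(fd);
    }
    free(buf);
    return 0;
}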
Re: [OMPI devel] Open MPI v5.0.x branch created
Hi,

I just checked out the 5.0.x branch and gave it a try. Is it OK to report problems, or shall we wait until an official rc1?

Thanks,

Eric

ps: I have a bug with MPI_File_open...

On 2021-03-11 1:24 p.m., Geoffrey Paulsen via devel wrote:
> Open MPI developers,
> We've created the Open MPI v5.0.x branch today, and are receiving bugfixes. Please cherry-pick any master PRs to v5.0.x once they've been merged to master. We're targeting an aggressive but achievable release date of May 15th.
> If you're in charge of your organization's CI tests, please enable them for v5.0.x PRs. It may be a few days until all of our CI is enabled on v5.0.x.
> Thanks everyone for your continued commitment to Open MPI's success.
> Josh Ladd, Austen Lauria, and Geoff Paulsen - v5.0 RMs

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
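Eric does not say which MPI_File_open call fails, so the following is only a generic smoke-test skeleton of the kind that could accompany such a report against the v5.0.x branch; the file name and flags are placeholders.

/* Minimal MPI_File_open/close smoke test; not the actual failing case. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_File fh;
    int err;

    MPI_Init(&argc, &argv);
    err = MPI_File_open(MPI_COMM_WORLD, "smoke_test.bin",
                        MPI_MODE_CREATE | MPI_MODE_RDWR,
                        MPI_INFO_NULL, &fh);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI_File_open failed: %s\n", msg);
    } else {
        MPI_File_close(&fh);
    }
    MPI_Finalize();
    return 0;
}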