[OMPI devel] assert in opal_datatype_is_contiguous_memory_layout
Hi all,

(Sorry, I have sent this to the "users" list but I should have sent it to the "devel" list instead. Sorry for the mess...)

I have attached a very small example which raises an assertion. The problem arises from a process which does not have any element to write in a file (and therefore in the MPI_File_set_view)...

You can see this "bug" with Open MPI 1.6.3, 1.6.4 and 1.7.0 configured with:

./configure --enable-mem-debug --enable-mem-profile --enable-memchecker --with-mpi-param-check --enable-debug

Just compile the given example (idx_null.cc) as-is with

mpicxx -o idx_null idx_null.cc

and run with 3 processes:

mpirun -n 3 idx_null

You can modify the example by commenting out "#define WITH_ZERO_ELEMNT_BUG" to see that everything goes well when all processes have something to write.

There is no "bug" if you use Open MPI 1.6.3 (and higher) without the debugging options. Also, everything works well with mpich-3.0.3 configured with:

./configure --enable-g=yes

So, is this a wrong "assert" in Open MPI? Is there a real problem with using this example in a "release" build?

Thanks,

Eric

#include "mpi.h"
#include <cstdio>
#include <iostream>

using namespace std;

void abortOnError(int ierr) {
  if (ierr != MPI_SUCCESS) {
    printf("ERROR Returned by MPI: %d\n",ierr);
    char* lCharPtr = new char[MPI_MAX_ERROR_STRING];
    int lLongueur = 0;
    MPI_Error_string(ierr,lCharPtr, &lLongueur);
    printf("ERROR_string Returned by MPI: %s\n",lCharPtr);
    MPI_Abort( MPI_COMM_WORLD, 1 );
  }
}

// This main is showing how to have an assertion raised if you try
// to create a MPI_File_set_view with some process holding no data
#define WITH_ZERO_ELEMNT_BUG

int main(int argc, char *argv[])
{
  int rank, size, i;
  MPI_Datatype lTypeIndexIntWithExtent, lTypeIndexIntWithoutExtent;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (size != 3) {
    printf("Please run with 3 processes.\n");
    MPI_Finalize();
    return 1;
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int  displacement[3];
  int* buffer     = 0;
  int  lTailleBuf = 0;

  if (rank == 0) {
    lTailleBuf = 3;
    displacement[0] = 0;
    displacement[1] = 4;
    displacement[2] = 5;
    buffer = new int[lTailleBuf];
    for (i=0; i
    // [a chunk of the attachment was swallowed by the list archive here: the end
    //  of this loop, the setup for ranks 1 and 2, and the beginning of the
    //  MPI_File_open call, which ends as follows]
    ("temp"), MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &lFile ));

  MPI_Type_create_indexed_block(lTailleBuf, 1, displacement, MPI_INT, &lTypeIndexIntWithoutExtent);
  MPI_Type_commit(&lTypeIndexIntWithoutExtent);

  // Here we compute the total number of int to write to resize the type:
  // we exchange the total number of ints written at each call because we must
  // compute the correct "extent" of the type.  In other words, each process
  // writes only a small part of the file, but must advance its local write
  // pointer far enough so that it does not overwrite the other processes' data.
  int lTailleGlobale = 0;
  printf("[%d] Local size : %d \n",rank,lTailleBuf);
  MPI_Allreduce( &lTailleBuf, &lTailleGlobale, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
  printf("[%d] MPI_AllReduce : %d \n",rank,lTailleGlobale);

  // We now modify the extent of the type "type_without_extent"
  MPI_Type_create_resized( lTypeIndexIntWithoutExtent, 0, lTailleGlobale*sizeof(int), &lTypeIndexIntWithExtent );
  MPI_Type_commit(&lTypeIndexIntWithExtent);

  abortOnError(MPI_File_set_view( lFile, 0, MPI_INT, lTypeIndexIntWithExtent, const_cast<char*>("native"), MPI_INFO_NULL));

  for (int i =0; i<2;++i) {
    abortOnError(MPI_File_write_all( lFile, buffer, lTailleBuf, MPI_INT, MPI_STATUS_IGNORE));
    MPI_Offset lOffset,lSharedOffset;
    MPI_File_get_position(lFile, &lOffset);
    MPI_File_get_position_shared(lFile, &lSharedOffset);
    printf("[%d] Offset after write : %d int: Local: %ld Shared: %ld \n",rank,lTailleBuf,lOffset,lSharedOffset);
  }

  abortOnError(MPI_File_close( &lFile ));
  abortOnError(MPI_Type_free(&lTypeIndexIntWithExtent));
  abortOnError(MPI_Type_free(&lTypeIndexIntWithoutExtent));
  MPI_Finalize();
  return 0;
}
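For reference, the pattern that triggers the assert can be distilled into a few lines. The sketch below is only a rough illustration of that pattern, not the original idx_null.cc: the variable names, the displacement values and the choice of which rank is empty are made up.

#include <mpi.h>
#include <stdio.h>

/* Rough illustration (not the original idx_null.cc): every rank sets a file
 * view built from an indexed block, but rank 0 contributes zero elements. */
int main(int argc, char** argv)
{
    int rank, nlocal, nglobal, value, disp[1] = {0};
    MPI_File fh;
    MPI_Datatype idx, view;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    nlocal = (rank == 0) ? 0 : 1;        /* rank 0 has nothing to write */
    if (rank != 0) disp[0] = rank - 1;
    value  = rank;

    /* every rank needs the global count to compute the resized extent */
    MPI_Allreduce(&nlocal, &nglobal, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Type_create_indexed_block(nlocal, 1, disp, MPI_INT, &idx);
    MPI_Type_commit(&idx);
    MPI_Type_create_resized(idx, 0, (MPI_Aint)(nglobal * sizeof(int)), &view);
    MPI_Type_commit(&view);

    MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &fh);
    /* the debug-build assert reportedly fires here for the rank whose view is empty */
    MPI_File_set_view(fh, 0, MPI_INT, view, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, &value, nlocal, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&view);
    MPI_Type_free(&idx);
    MPI_Finalize();
    return 0;
}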
[OMPI devel] Simplified: Misuse or bug with nested types?
Hi,

I have sent a previous message showing something that I think is a bug (or maybe a misuse, but...). I reworked the example to simplify it: it is now almost half the number of lines of code and the structures are simpler... but it still shows the wrong behaviour.

Briefly, we construct different MPI_Datatypes and nest them into a final type which is a: {MPI_LONG,{{MPI_LONG,MPI_CHAR}*2}}

Here is the output from OpenMPI 1.6.3:

Rank 0 send this:
i: 0 => {{0},{{3,%},{7,5}}}
i: 1 => {{1},{{3,%},{7,5}}}
i: 2 => {{2},{{3,%},{7,5}}}
i: 3 => {{3},{{3,%},{7,5}}}
i: 4 => {{4},{{3,%},{7,5}}}
i: 5 => {{5},{{3,%},{7,5}}}
MPI_Recv returned success and everything in MPI_Status is correct after receive. Rank 1 received this:
i: 0 => {{0},{{3,%},{-999,$}}} *** ERROR
i: 1 => {{1},{{3,%},{-999,$}}} *** ERROR
i: 2 => {{2},{{3,%},{-999,$}}} *** ERROR
i: 3 => {{3},{{3,%},{-999,$}}} *** ERROR
i: 4 => {{4},{{3,%},{-999,$}}} *** ERROR
i: 5 => {{5},{{3,%},{-999,$}}} *** ERROR

Here is the expected output, obtained with mpich-3.0.3:

Rank 0 send this:
i: 0 => {{0},{{3,%},{7,5}}}
i: 1 => {{1},{{3,%},{7,5}}}
i: 2 => {{2},{{3,%},{7,5}}}
i: 3 => {{3},{{3,%},{7,5}}}
i: 4 => {{4},{{3,%},{7,5}}}
i: 5 => {{5},{{3,%},{7,5}}}
MPI_Recv returned success and everything in MPI_Status is correct after receive. Rank 1 received this:
i: 0 => {{0},{{3,%},{7,5}}} OK
i: 1 => {{1},{{3,%},{7,5}}} OK
i: 2 => {{2},{{3,%},{7,5}}} OK
i: 3 => {{3},{{3,%},{7,5}}} OK
i: 4 => {{4},{{3,%},{7,5}}} OK
i: 5 => {{5},{{3,%},{7,5}}} OK

Is it related to the bug reported here: http://www.open-mpi.org/community/lists/devel/2013/04/12267.php ?

Thanks,

Eric
Re: [OMPI devel] Simplified: Misuse or bug with nested types?
Sorry, here is the attachment...

Eric

On 04/23/2013 09:54 AM, Eric Chamberland wrote:
> [...quoted text clipped...]

#include "mpi.h"
#include <iostream>

//**
//
// This example is showing a problem with nested types!
// It works perfectly with mpich-3.0.3 but seems to do a wrong transmission
// with openmpi 1.6.3, 1.6.4, 1.7.0 and 1.7.1
//
// The basic problem seems to arise with a vector of PALong_2Pairs which is a
// MPI nested type constructed like this:
//--
// Struct        | is composed of
//--
// PAPairLC      | {long, char}
// PALong_2Pairs | {long,{PAPairLC,PAPairLC}}
//--
//
//**

using namespace std;

//! Function to abort on any MPI error:
void abortOnError(int ierr) {
  if (ierr != MPI_SUCCESS) {
    std::cerr << "ERROR Returned by MPI: " << ierr << std::endl;
    char* lCharPtr = new char[MPI_MAX_ERROR_STRING];
    int lLongueur = 0;
    MPI_Error_string(ierr,lCharPtr, &lLongueur);
    std::cerr << "ERROR_string Returned by MPI: " << lCharPtr << std::endl;
    MPI_Abort( MPI_COMM_WORLD, 1 );
  }
}

// a constant:
#define FIRST_CHAR 32

//*
//
// PAPairLC is a pair: {long, char}
//
//*
class PAPairLC
{
public:
  PAPairLC() : aLong(-999), aChar(FIRST_CHAR+4) {}

  long aLong;
  char aChar;

  static MPI_Datatype asMPIDatatype;
  static MPI_Datatype& reqMPIDatatype() { return asMPIDatatype; }

  void print(std::ostream& pOS) { pOS << "{" << aLong << "," << aChar << "}"; }

  static void createMPIDatatype() {
    PAPairLC lPAType;
    MPI_Datatype lTypes[2];
    lTypes[0] = MPI_LONG;
    lTypes[1] = MPI_CHAR;
    MPI_Aint lDeplacements[2];
    MPI_Aint lPtrBase = 0;
    MPI_Get_address(&lPAType, &lPtrBase);
    MPI_Get_address(&lPAType.aLong, &lDeplacements[0]);
    MPI_Get_address(&lPAType.aChar, &lDeplacements[1]);
    // Compute the "displacement" from lPtrBase
    lDeplacements[0] -= lPtrBase;
    lDeplacements[1] -= lPtrBase;
    int lBlocLen[2] = {1,1};
    abortOnError(MPI_Type_create_struct(2, lBlocLen, lDeplacements, lTypes, &asMPIDatatype));
    abortOnError(MPI_Type_commit(&asMPIDatatype));
  }
};
MPI_Datatype PAPairLC::asMPIDatatype = MPI_DATATYPE_NULL;

//*
//
// PALong_2Pairs is a struct of: {long, PAPairLC[2]}
//
//*
class PALong_2Pairs
{
public:
  PALong_2Pairs() {}

  long aFirst;
  PAPairLC a2Pairs[2];

  static MPI_Datatype asMPIDatatype;
  static MPI_Datatype& reqMPIDatatype() { r
  // [the rest of the attachment is truncated in the archive]
Re: [OMPI devel] Simplified: Misuse or bug with nested types?
Another piece of information: I just tested the example with Intel MPI 4.0.1.007 and it works correctly... So the problem seems to be only with OpenMPI... which is the default distribution we use... :-/

Is my example code too long?

Eric

On 2013-04-23 09:55, Eric Chamberland wrote:
> Sorry, here is the attachment...
>
> [...quoted text clipped...]
Re: [OMPI devel] Simplified: Misuse or bug with nested types?
Hi Jeff,

thanks for your answer! You inserted a doubt in my mind... and gave me hope... :-)

So I did some modifications to the code to help everyone:

1- it's now in "C"... :-)
2- Concerning your remark about the arbitrary address: I am now using the "offsetof" macro defined in "stddef.h" to compute the offset (displacement) needed to create the datatype
3- I have simplified and reduced (again) the number of lines needed to reproduce the error... see "nested_bug.c" attached to this mail...

Output with openmpi 1.6.3:

Rank 0 send this: {{1},{{2,3},{4,5}}}
Rank 1 received this: {{1},{{2,3},{4199789,15773951}}} *** ERROR

Expected output (still ok with mpich 3.0.3 and intel mpi 4):

Rank 0 send this: {{1},{{2,3},{4,5}}}
Rank 1 received this: {{1},{{2,3},{4,5}}} OK

Thanks!

Eric

On 2013-04-23 18:03, Jeff Squyres (jsquyres) wrote:
> Sorry for the delay. My C++ is a bit rusty, but this does not seem correct to me.
>
> You're making the datatypes relative to an arbitrary address (&lPtrBase) in a
> static method on each class. You really need the datatypes to be relative to
> each instance's *this* pointer. Doing so allows MPI to read/write the data
> relative to the specific instance of the objects that you're trying to
> send/receive.
>
> Make sense?
>
> On Apr 23, 2013, at 5:01 PM, Eric Chamberland wrote:
> [...quoted text clipped...]

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/**
//
// This example is showing a problem with nested types!
// It works perfectly with mpich-3.0.3 but seems to do a wrong transmission
// with openmpi 1.6.3, 1.6.4, 1.7.0 and 1.7.1
//
// The basic problem seems to arise with a vector of PALong_2Pairs which is a
// MPI nested type constructed like this:
//--
// Struct        | is composed of
//--
// PAPairLI      | {long, int}
// PALong_2Pairs | {long,{PAPairLI,PAPairLI}}
//--
//
*/

/*! Function to abort on any MPI error: */
void abortOnError(int ierr) {
  if (ierr != MPI_SUCCESS) {
    printf("ERROR Returned by MPI: %d\n",ierr);
    char* lCharPtr = malloc(sizeof(char)*MPI_MAX_ERROR_STRING);
    int lLongueur = 0;
    MPI_Error_string(ierr,lCharPtr, &lLongueur);
    printf("ERROR_string Returned by MPI: %s\n",lCharPtr);
    MPI_Abort( MPI_
    /* [the rest of the attachment is truncated in the archive] */
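Since the attachment is cut off by the archive, here is a self-contained sketch of the offsetof-based construction discussed above. It is a reconstruction of the idea only, not the actual nested_bug.c: the struct and function names are made up. Note the explicit MPI_Type_create_resized calls; resizing each struct type to sizeof() of the corresponding C struct keeps the MPI extent in sync with whatever padding the compiler adds, which matters as soon as the type is nested or used in arrays.

#include <mpi.h>
#include <stddef.h>

/* Reconstruction of the idea behind nested_bug.c (names are made up). */
typedef struct { long aLong; int anInt; } PairLI;               /* {long,int}             */
typedef struct { long aFirst; PairLI a2Pairs[2]; } Long2Pairs;  /* {long,{PairLI,PairLI}} */

static MPI_Datatype createPairLIType(void)
{
    MPI_Datatype raw, typ;
    int          blocklen[2] = { 1, 1 };
    MPI_Aint     disp[2]     = { offsetof(PairLI, aLong), offsetof(PairLI, anInt) };
    MPI_Datatype types[2]    = { MPI_LONG, MPI_INT };

    MPI_Type_create_struct(2, blocklen, disp, types, &raw);
    /* force the extent to sizeof(PairLI) so the two elements of
       Long2Pairs::a2Pairs are laid out exactly like the C array */
    MPI_Type_create_resized(raw, 0, (MPI_Aint)sizeof(PairLI), &typ);
    MPI_Type_commit(&typ);
    MPI_Type_free(&raw);
    return typ;
}

static MPI_Datatype createLong2PairsType(MPI_Datatype pairType)
{
    MPI_Datatype raw, typ;
    int          blocklen[2] = { 1, 2 };
    MPI_Aint     disp[2]     = { offsetof(Long2Pairs, aFirst), offsetof(Long2Pairs, a2Pairs) };
    MPI_Datatype types[2]    = { MPI_LONG, pairType };

    MPI_Type_create_struct(2, blocklen, disp, types, &raw);
    MPI_Type_create_resized(raw, 0, (MPI_Aint)sizeof(Long2Pairs), &typ);
    MPI_Type_commit(&typ);
    MPI_Type_free(&raw);
    return typ;
}

A rank can then send/receive an array of Long2Pairs by passing the outer type and the element count, exactly like the 6-element vector shown in the first report.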
Re: [OMPI devel] Simplified: Misuse or bug with nested types?
Hi Paul,

okay, I have compiled the sources from the trunk and it works fine now... Sorry to have reported a duplicate...

Will it be in the next 1.6.X release?

Thanks,

Eric

On 2013-04-23 20:46, Paul Hargrove wrote:
> Eric,
>
> Are you testing against the Open MPI svn trunk? I ask because on April 9
> George committed a fix for the bug reported by Thomas Jahns:
> http://www.open-mpi.org/community/lists/devel/2013/04/12268.php
>
> -Paul
>
> On Tue, Apr 23, 2013 at 5:35 PM, Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
> [...quoted text clipped...]
[OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi,

I am testing the 2.X release candidate for the first time and I get a segmentation violation using MPI_File_write_all_end(MPI_File fh, const void *buf, MPI_Status *status).

The "special" thing may be that in the faulty test cases there are processes that haven't written anything, so they have a zero-length buffer and the second parameter (buf) passed is a null pointer. Until now this was a valid call; has it changed?

Thanks,

Eric

FWIW: We have been using our test suite (~2000 nightly tests) successfully with openmpi-1.{6,8,10}.* and MPICH for many years...
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi,

On 08/07/16 12:52 PM, Edgar Gabriel wrote:
> The default MPI I/O library has changed in the 2.x release to OMPIO for

ok, I am now doing I/O on my own hard drive... but I can test over NFS easily. For Lustre, I will have to produce a reduced example out of our test suite...

> most file systems. I can look into that problem, any chance to get access
> to the testsuite that you mentioned?

Yikes! Sounds interesting, but difficult to realize... Our in-house code is not public... :/

I have however proposed (to myself) to add a nightly compilation of openmpi (see http://www.open-mpi.org/community/lists/users/2016/06/29515.php) so I can report problems before releases are made...

Anyway, I will work on a little script to automate the MPI+PETSc+InHouseCode combination so I can give you feedback as soon as you propose a patch for me to test... I hope this will be convenient enough for you...

Thanks!

Eric

> Thanks
> Edgar
>
> On 7/8/2016 11:32 AM, Eric Chamberland wrote:
> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
On 08/07/16 01:44 PM, Edgar Gabriel wrote:
> ok, but just to be able to construct a test case, basically what you are doing is
>
> MPI_File_write_all_begin (fh, NULL, 0, some datatype);
> MPI_File_write_all_end (fh, NULL, &status),
>
> is this correct?

Yes, but with 2 processes: rank 0 writes something, but not rank 1...

Other info: rank 0 didn't wait for rank 1 after MPI_File_write_all_end, so it continued to the next MPI_File_write_all_begin with a different datatype but on the same file...

thanks!

Eric
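Written out, the two-rank pattern described above only takes a few lines. This is a hedged sketch of that pattern, not the actual test from the suite (the file name and the written value are made up):

#include <mpi.h>

/* Sketch of the reported pattern (run with 2 ranks): rank 0 writes one int,
 * rank 1 participates in the split collective with count 0 and a NULL buffer. */
int main(int argc, char** argv)
{
    int rank, value = 42;
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &fh);

    if (rank == 0) {
        MPI_File_write_all_begin(fh, &value, 1, MPI_INT);
        MPI_File_write_all_end(fh, &value, &status);
    } else {
        /* zero-length contribution: buf is NULL and count is 0 */
        MPI_File_write_all_begin(fh, NULL, 0, MPI_INT);
        MPI_File_write_all_end(fh, NULL, &status);   /* 2.0.0rc4 reportedly crashes here */
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}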
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi Edgar,

I just saw that your patch got into ompi/master... any chance it goes into ompi-release/v2.x before rc5?

thanks,

Eric

On 08/07/16 03:14 PM, Edgar Gabriel wrote:
> I think I found the problem, I filed a pr towards master, and if that passes
> I will file a pr for the 2.x branch. Thanks!
>
> Edgar
>
> On 7/8/2016 1:14 PM, Eric Chamberland wrote:
> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi Howard,

ok, I will wait for 2.0.1rcX... ;)

I've put in place a script to download/compile OpenMPI+PETSc(3.7.2) and our code from the git repos. Now I am in a somewhat uncomfortable situation where neither the ompi-release.git nor the ompi.git repo is working for me. The first gives me the errors with MPI_File_write_all_end I reported, but the latter gives me errors like these:

[lorien:106919] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file ess_singleton_module.c at line 167
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:106919] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

So, for my continuous integration of OpenMPI I am in a no man's land... :(

Thanks anyway for the follow-up!

Eric

On 13/07/16 07:49 AM, Howard Pritchard wrote:
> Hi Eric,
>
> Thanks very much for finding this problem. We decided in order to have a
> reasonably timely release, that we'd triage issues and turn around a new RC
> if something drastic appeared. We want to fix this issue (and it will be
> fixed), but we've decided to defer the fix for this issue to a 2.0.1 bug fix
> release.
>
> Howard
>
> 2016-07-12 13:51 GMT-06:00 Eric Chamberland <eric.chamberl...@giref.ulaval.ca>:
> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi,

FYI: I've tested SHA e28951e, from a git clone launched around 01h19:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.13.01h19m30s_config.log

Eric

On 13/07/16 04:01 PM, Pritchard Jr., Howard wrote:
> Jeff,
>
> I think this was fixed in PR 1227 on v2.x
>
> Howard
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi Gilles,

On 13/07/16 08:01 PM, Gilles Gouaillardet wrote:
> Eric,
>
> OpenMPI 2.0.0 has been released, so the fix should land into the v2.x branch
> shortly.

ok, thanks again.

> If I understand correctly, your script downloads/compiles OpenMPI and then
> downloads/compiles PETSc.

More precisely, for OpenMPI I am cloning https://github.com/open-mpi/ompi.git and for Petsc, I just compile the latest release proven stable with our code, which is now 3.7.2.

> If this is correct, and for the time being, feel free to patch Open MPI v2.x
> before compiling it; the fix can be downloaded at
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1263.patch

Ok, but I think it is already included in the master of the clone I get... :)

Cheers,

Eric

> Cheers,
>
> Gilles
>
> On 7/14/2016 3:37 AM, Eric Chamberland wrote:
> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Thanks Ralph,

It is now *much* better: all sequential executions are working... ;) but I still have issues with a lot of parallel tests (but not all). The SHA tested last night was c3c262b.

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.14.01h20m32s_config.log

Here is the backtrace for most of these issues:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt': free(): invalid pointer: 0x7f9ab09c6020 ***
=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f9ab019b77f]
/lib64/libc.so.6(+0x78026)[0x7f9ab01a1026]
/lib64/libc.so.6(+0x78d53)[0x7f9ab01a1d53]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x172a1)[0x7f9aa3df32a1]
/opt/openmpi-2.x_opt/lib/libmpi.so.0(MPI_Request_free+0x4c)[0x7f9ab0761dac]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adaf9)[0x7f9ab7fa2af9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f9ab7f9dc35]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4574e7)[0x7f9ab7f4c4e7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecDestroy+0x648)[0x7f9ab7ef28ca]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_Z15GIREFVecDestroyRP6_p_Vec+0xe)[0x7f9abc9746de]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN12VecteurPETScD1Ev+0x31)[0x7f9abca8bfa1]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD2Ev+0x20c)[0x7f9abc9a013c]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD0Ev+0x9)[0x7f9abc9a01f9]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Formulation.so(_ZN10ProblemeGDD2Ev+0x42)[0x7f9abeeb94e2]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4159b9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9ab014ab25]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4084dc]

The very same code and tests all work well with openmpi-1.{8.4,10.2} and the same version of PETSc...

And the segfault with MPI_File_write_all_end seems gone... Thanks to Edgar! :)

Btw, I am wondering when I should report a bug or not, since I am "blindly" cloning around 01h20 am each day, independently of the "status" of the master... I don't want to bother anyone on this list with annoying bug reports... So tell me what you would like please...

Thanks,

Eric

On 13/07/16 08:36 PM, Ralph Castain wrote:
> Fixed on master
>
> On Jul 13, 2016, at 12:47 PM, Jeff Squyres (jsquyres) wrote:
>> I literally just noticed that this morning (that singleton was broken on
>> master), but hadn't gotten to bisecting / reporting it yet...
>>
>> I also haven't tested 2.0.0. I really hope singletons aren't broken then...
>> /me goes to test 2.0.0...
>>
>> Whew -- 2.0.0 singletons are fine. :-)
>>
>> On Jul 13, 2016, at 3:01 PM, Ralph Castain wrote:
>>> Hmmm…I see where the singleton on master might be broken - will check later today
>>>
>>> On Jul 13, 2016, at 11:37 AM, Eric Chamberland wrote:
>>> [...quoted text clipped...]
Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end
Hi Edgar,

just to tell you that I tested your fix that has been merged into ompi-release/v2.x (9ba667815) and it works! :)

Thanks!

Eric

On 12/07/16 04:30 PM, Edgar Gabriel wrote:
> I think the decision was made to postpone the patch to 2.0.1, since the
> release of 2.0.0 is imminent.
>
> Thanks
> Edgar
>
> On 7/12/2016 2:51 PM, Eric Chamberland wrote:
> [...quoted text clipped...]
[OMPI devel] OpenMPI 2.0 and Petsc 3.7.2
Hi,

has someone tried OpenMPI 2.0 with Petsc 3.7.2?

I am having some errors with petsc; maybe someone has them too?

Here are the configure logs for PETSc:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_configure.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_RDict.log

And for OpenMPI:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.25.01h16m02s_config.log

(in fact, I am testing the ompi-release branch, a sort of petsc-master branch, since I need the commit 9ba6678156).

For a set of parallel tests, 104 out of 124 pass.

And the typical error:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.dev': free(): invalid pointer:
=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f80eb11677f]
/lib64/libc.so.6(+0x78026)[0x7f80eb11c026]
/lib64/libc.so.6(+0x78d53)[0x7f80eb11cd53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f80ea8f9d60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16628)[0x7f80df0ea628]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16c50)[0x7f80df0eac50]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f80eb7029dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f80eb702ad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adc6d)[0x7f80f2fa6c6d]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f80f2fa1c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0xa9d0f5)[0x7f80f35960f5]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(MatDestroy+0x648)[0x7f80f35c2588]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x10bf0f4)[0x7f80f3bb80f4]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPReset+0x502)[0x7f80f3d19779]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x11707f7)[0x7f80f3c697f7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x346)[0x7f80f3a796de]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCDestroy+0x5d1)[0x7f80f3a79fd9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPDestroy+0x7b6)[0x7f80f3d1a334]

a similar one:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProbFluideIncompressible.dev': free(): invalid pointer: 0x7f382a7c5bc0 ***
=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f3829f1c77f]
/lib64/libc.so.6(+0x78026)[0x7f3829f22026]
/lib64/libc.so.6(+0x78d53)[0x7f3829f22d53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f38296ffd60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16628)[0x7f381deab628]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x16c50)[0x7f381deabc50]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f382a5089dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f382a508ad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adc6d)[0x7f3831dacc6d]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f3831da7c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x9f4755)[0x7f38322f3755]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(MatDestroy+0x648)[0x7f38323c8588]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCReset+0x4e2)[0x7f383287f87a]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(PCDestroy+0x5d1)[0x7f383287ffd9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(KSPDestroy+0x7b6)[0x7f3832b20334]

another one:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.MortierDiffusion.dev': free(): invalid pointer: 0x7f67b6d37bc0 ***
=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f67b648e77f]
/lib64/libc.so.6(+0x78026)[0x7f67b6494026]
/lib64/libc.so.6(+0x78d53)[0x7f67b6494d53]
/opt/openmpi-2.x_opt/lib/libopen-pal.so.20(opal_free+0x1f)[0x7f67b5c71d60]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x1adae)[0x7f67aa4cddae]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x1b4ca)[0x7f67aa4ce4ca]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(+0x9f9dd)[0x7f67b6a7a9dd]
/opt/openmpi-2.x_opt/lib/libmpi.so.20(MPI_Request_free+0xf7)[0x7f67b6a7aad6]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adb09)[0x7f67be31eb09]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f67be319c45]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+
[the message is truncated in the archive]
Re: [OMPI devel] [petsc-users] OpenMPI 2.0 and Petsc 3.7.2
Ok, here are the 2 points answered:

#1) Got the valgrind output... here is the fatal free operation:

==107156== Invalid free() / delete / delete[] / realloc()
==107156==    at 0x4C2A37C: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==107156==    by 0x1E63CD5F: opal_free (malloc.c:184)
==107156==    by 0x27622627: mca_pml_ob1_recv_request_fini (pml_ob1_recvreq.h:133)
==107156==    by 0x27622C4F: mca_pml_ob1_recv_request_free (pml_ob1_recvreq.c:90)
==107156==    by 0x1D3EF9DC: ompi_request_free (request.h:362)
==107156==    by 0x1D3EFAD5: PMPI_Request_free (prequest_free.c:59)
==107156==    by 0x14AE3B9C: VecScatterDestroy_PtoP (vpscat.c:219)
==107156==    by 0x14ADEB74: VecScatterDestroy (vscat.c:1860)
==107156==    by 0x14A8D426: VecDestroy_MPI (pdvec.c:25)
==107156==    by 0x14A33809: VecDestroy (vector.c:432)
==107156==    by 0x10A2A5AB: GIREFVecDestroy(_p_Vec*&) (girefConfigurationPETSc.h:115)
==107156==    by 0x10BA9F14: VecteurPETSc::detruitObjetPETSc() (VecteurPETSc.cc:2292)
==107156==    by 0x10BA9D0D: VecteurPETSc::~VecteurPETSc() (VecteurPETSc.cc:287)
==107156==    by 0x10BA9F48: VecteurPETSc::~VecteurPETSc() (VecteurPETSc.cc:281)
==107156==    by 0x1135A57B: PPReactionsAppuiEL3D::~PPReactionsAppuiEL3D() (PPReactionsAppuiEL3D.cc:216)
==107156==    by 0xCD9A1EA: ProblemeGD::~ProblemeGD() (in /home/mefpp_ericc/depots_prepush/GIREF/lib/libgiref_dev_Formulation.so)
==107156==    by 0x435702: main (Test.ProblemeGD.icc:381)
==107156==  Address 0x1d6acbc0 is 0 bytes inside data symbol "ompi_mpi_double"
--107156-- REDIR: 0x1dda2680 (libc.so.6:__GI_stpcpy) redirected to 0x4c2f330 (__GI_stpcpy)
==107156==
==107156== Process terminating with default action of signal 6 (SIGABRT): dumping core
==107156==    at 0x1DD520C7: raise (in /lib64/libc-2.19.so)
==107156==    by 0x1DD53534: abort (in /lib64/libc-2.19.so)
==107156==    by 0x1DD4B145: __assert_fail_base (in /lib64/libc-2.19.so)
==107156==    by 0x1DD4B1F1: __assert_fail (in /lib64/libc-2.19.so)
==107156==    by 0x27626D12: mca_pml_ob1_send_request_fini (pml_ob1_sendreq.h:221)
==107156==    by 0x276274C9: mca_pml_ob1_send_request_free (pml_ob1_sendreq.c:117)
==107156==    by 0x1D3EF9DC: ompi_request_free (request.h:362)
==107156==    by 0x1D3EFAD5: PMPI_Request_free (prequest_free.c:59)
==107156==    by 0x14AE3C3C: VecScatterDestroy_PtoP (vpscat.c:225)
==107156==    by 0x14ADEB74: VecScatterDestroy (vscat.c:1860)
==107156==    by 0x14A8D426: VecDestroy_MPI (pdvec.c:25)
==107156==    by 0x14A33809: VecDestroy (vector.c:432)
==107156==    by 0x10A2A5AB: GIREFVecDestroy(_p_Vec*&) (girefConfigurationPETSc.h:115)
==107156==    by 0x10BA9F14: VecteurPETSc::detruitObjetPETSc() (VecteurPETSc.cc:2292)
==107156==    by 0x10BA9D0D: VecteurPETSc::~VecteurPETSc() (VecteurPETSc.cc:287)
==107156==    by 0x10BA9F48: VecteurPETSc::~VecteurPETSc() (VecteurPETSc.cc:281)
==107156==    by 0x1135A57B: PPReactionsAppuiEL3D::~PPReactionsAppuiEL3D() (PPReactionsAppuiEL3D.cc:216)
==107156==    by 0xCD9A1EA: ProblemeGD::~ProblemeGD() (in /home/mefpp_ericc/depots_prepush/GIREF/lib/libgiref_dev_Formulation.so)
==107156==    by 0x435702: main (Test.ProblemeGD.icc:381)

#2) The run with -vecscatter_alltoall works...!

As an "end user", should I ever modify these VecScatterCreate options? How do they change the performance of the code on large problems?

Thanks,

Eric

On 25/07/16 02:57 PM, Matthew Knepley wrote:
> On Mon, Jul 25, 2016 at 11:33 AM, Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
> [...quoted text clipped...]
>
> It appears that the fault happens when freeing the VecScatter we build for
> MatMult, which contains Request structures for the ISends and IRecvs. These
> look like internal OpenMPI errors to me since the Request should be opaque.
>
> I would try at least two things:
>
> 1) Run under valgrind.
>
> 2) Switch the VecScatter implementation. All the options are here,
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecScatterCreate.html#VecScatterCreate
> but maybe use alltoall.
>
> Thanks,
>
> Matt
>
> [...quoted text clipped...]
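For context, the PETSc frames in the trace (VecScatterDestroy_PtoP calling MPI_Request_free) correspond to releasing persistent communication requests. The snippet below is my own minimal illustration of that usage pattern, not PETSc code; it only shows that freeing inactive persistent requests with MPI_Request_free is legal MPI.

#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, peer;
    double sendbuf = 1.0, recvbuf = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                       /* run with exactly 2 ranks */

    /* create persistent requests (inactive until started) */
    MPI_Send_init(&sendbuf, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(&recvbuf, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* use them any number of times... */
    MPI_Startall(2, reqs);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ...and release them; this is the call that ends up in
       mca_pml_ob1_recv_request_free()/send_request_free() in the trace */
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);

    MPI_Finalize();
    return 0;
}

Since the invalid free reported by valgrind lands inside the data symbol "ompi_mpi_double", the corruption appears to be below the MPI API, which is consistent with Matthew's remark that the Request should be opaque to PETSc and the application.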
Re: [OMPI devel] OpenMPI 2.0 and Petsc 3.7.2
Hi Gilles,

On 25/07/16 10:38 PM, Gilles Gouaillardet wrote:
> Eric,
>
> where can your test case be downloaded ? how many nodes and tasks do you
> need to reproduce the bug ?

Sadly, it is in our in-house code and it requires the whole source code, which isn't public... :/

I have this bug with 20 parallel tests from our 124-test database, running with 2 to 10 processes (but 2 for most of them). The bug happens at the very end of the execution (FE resolution + exports), when everything gets destroyed, including the PETSc stuff.

Unfortunately, running "make test" and "make testexamples" at the end of the petsc installation doesn't trigger the bug... :/

> fwiw, currently there are two Open MPI repositories
> - https://github.com/open-mpi/ompi
>   there is only one branch and it is the 'master' branch; today, this can be
>   seen as Open MPI 3.0 pre alpha
> - https://github.com/open-mpi/ompi-release
>   the default branch is 'v2.x'; today, this can be seen as Open MPI 2.0.1
>   pre alpha

I tested both... I reported the error also for the "master" of ompi, and they seem related to me, see: https://github.com/open-mpi/ompi/issues/1875

Thanks,

Eric

> Cheers,
>
> Gilles
>
> On 7/26/2016 3:33 AM, Eric Chamberland wrote:
> [...quoted text clipped...]
[OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi,

It is the third time this has happened in the last 10 days. While running the nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:

[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

But I can't reproduce the problem right now... i.e.: if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...

Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)

Here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt

Thanks,

Eric
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Other relevant info: I never saw this problem with OpenMPI 1.6.5, 1.8.4 and 1.10.[3,4], which run the same test suite...

thanks,

Eric

On 13/09/16 11:35 AM, Eric Chamberland wrote:
> [...quoted text clipped...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
On 13/09/16 12:11 PM, Pritchard Jr., Howard wrote:
> Hello Eric,
>
> Is the failure seen with the same two tests? Or is it random which tests
> fail? If it's not random, would you be able to post

No, the tests that failed were different ones...

> the tests to the list?
>
> Also, if possible, it would be great if you could test against a master
> snapshot:
>
> https://www.open-mpi.org/nightly/master/

Yes I can, but since the bug appears only from time to time, I don't think I can get relevant info from a single run on master; I will have to wait, let's say, 10 or 15 days before it crashes... and that may be hard since master is less stable than the release branch and will have normal failures... :/

Eric
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
On 14/09/16 01:36 AM, Gilles Gouaillardet wrote: Eric, can you please provide more information on how your tests are launched ? Yes! do you mpirun -np 1 ./a.out or do you simply ./a.out For all sequential tests, we do ./a.out. do you use a batch manager ? if yes, which one ? No. do you run one test per job ? or multiple tests per job ? On this automatic compilation, up to 16 tests are launched together. how are these tests launched ? For sequential ones, the special thing is that they are launched via python Popen call, which launches "time" which launches the code. So the "full" commande line is: /usr/bin/time -v -o /users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien/Time.Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier.txt /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.Laplacien.opt mpi_v=2 verbose=True Beowulf=False outilMassif=False outilPerfRecord=False verifValgrind=False outilPerfStat=False outilCallgrind=False RepertoireDestination=/users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien RepertoireTest=/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/TestValidation/Ressources/opt/Test.Laplacien Prefixe=Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier do the test that crashes use MPI_Comm_spawn ? i am surprised by the process name [[9325,5754],0], which suggests there MPI_Comm_spawn was called 5753 times (!) can you also run hostname on the 'lorien' host ? [eric@lorien] Scripts (master $ u+1)> hostname lorien if you configure'd Open MPI with --enable-debug, can you Yes. export OMPI_MCA_plm_base_verbose 5 then run one test and post the logs ? Hmmm, strange? [lorien:93841] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [lorien:93841] plm:base:set_hnp_name: initial bias 93841 nodename hash 1366255883 [lorien:93841] plm:base:set_hnp_name: final jobfam 22260 [lorien:93841] [[22260,0],0] plm:rsh_setup on agent ssh : rsh path NULL [lorien:93841] [[22260,0],0] plm:base:receive start comm [lorien:93841] [[22260,0],0] plm:base:launch [22260,1] registered [lorien:93841] [[22260,0],0] plm:base:launch job [22260,1] is not a dynamic spawn [lorien:93841] [[22260,0],0] plm:base:receive stop comm ~ from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325) the discrepancy could be explained by the use of a batch manager and/or a full hostname i am unaware of. orte_plm_base_set_hnp_name() generate a 16 bits job family from the (32 bits hash of the) hostname and the mpirun (32 bits ?) pid. so strictly speaking, it is possible two jobs launched on the same node are assigned the same 16 bits job family. the easiest way to detect this could be to - edit orte/mca/plm/base/plm_base_jobid.c and replace OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output, "plm:base:set_hnp_name: final jobfam %lu", (unsigned long)jobfam)); with OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output, "plm:base:set_hnp_name: final jobfam %lu", (unsigned long)jobfam)); configure Open MPI with --enable-debug and rebuild and then export OMPI_MCA_plm_base_verbose=4 and run your tests. when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint to a conflict. Does this gives the same output as with export OMPI_MCA_plm_base_verbose=5 without the patch? 
If so, because everything is automated, applying a patch is "harder" for me than doing a simple export OMPI_MCA_plm_base_verbose=5, so maybe I could just add OMPI_MCA_plm_base_verbose=5 to all tests and wait until it hangs?

Thanks!

Eric

> Cheers, Gilles
>
> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>> Hi,
>> It is the third time this has happened in the last 10 days. While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>> But I can't reproduce the problem right now... i.e. if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...
>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)
>> Here are the build logs:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>> Thanks,
>> Eric
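Gilles's explanation above (a 16-bit job family derived from a 32-bit hash of the hostname and the launching pid) is the key to the suspected collision. Below is a simplified, self-contained sketch; the hash and the reduction are toy stand-ins, not the actual code in orte/mca/plm/base/plm_base_jobid.c. It only illustrates why two different pids on the same node can end up with the same 16-bit job family:

/* Illustration only: NOT the real orte_plm_base_set_hnp_name() algorithm.
 * It just shows why folding (hostname hash, pid) down to 16 bits can
 * produce the same "job family" for two different processes on one node. */
#include <stdio.h>
#include <stdint.h>

/* hypothetical 32-bit string hash, a stand-in for the real hash function */
static uint32_t toy_hash(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33u + (uint32_t)*s++;
    return h;
}

/* hypothetical reduction of (hostname hash, pid) to a 16-bit job family */
static uint16_t toy_jobfam(const char *hostname, uint32_t pid)
{
    return (uint16_t)((toy_hash(hostname) ^ pid) & 0xffffu);
}

int main(void)
{
    /* two different pids on the same node that land on the same 16 bits:
     * they differ only above bit 15, so the folded value is identical */
    uint32_t pid_a = 142766, pid_b = 142766 + 0x10000;
    printf("jobfam(a) = %u\n", (unsigned)toy_jobfam("lorien", pid_a));
    printf("jobfam(b) = %u\n", (unsigned)toy_jobfam("lorien", pid_b));
    return 0;
}

Both calls print the same value, which is exactly the kind of conflict the increased OMPI_MCA_plm_base_verbose output is meant to expose.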
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Lucky! Since each run has a specific TMP, I still have it on disk.

For the faulty run, the TMP variable was TMP=/tmp/tmp.wOv5dkNaSI, and into $TMP I have openmpi-sessions-40031@lorien_0, and into this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
1841
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
total 68
drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
...

If I do:

lsof | grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.

nothing...

What else may I check?

Eric

On 14/09/16 08:47 AM, Joshua Ladd wrote:
> Hi, Eric
> I **think** this might be related to the following: https://github.com/pmix/master/pull/145
> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
> Best, Josh
>
> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet wrote:
>> Eric, can you please provide more information on how your tests are launched? [...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
> Eric, do you mean you have a unique $TMP per a.out?
No.
> Or a unique $TMP per "batch" of runs?
Yes. I was happy because each nightly batch has its own TMP, so I can check afterwards for problems related to a specific night without interference from another nightly batch of tests... if a bug ever happens... ;)
> In the first case, my understanding is that conflicts cannot happen...
> Once you hit the bug, can you please please post the output of the failed a.out, and run egrep 'jobfam|stop' on all your logs, so we might spot a conflict.
OK, I will launch it manually later today, but it will be automatic tonight (with export OMPI_MCA_plm_base_verbose=5).

Thanks!

Eric

> Cheers, Gilles
>
> On Wednesday, September 14, 2016, Eric Chamberland wrote:
>> Lucky! Since each run has a specific TMP, I still have it on disk. [...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Ok, one test segfaulted, *but* I can't tell if it is the *same* bug, because there has been a segfault:

stderr: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

stdout:
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------

openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx------    3 cmpbib bib    250 Sep 14 13:34 .
drwxrwxrwt  356 root   root 61440 Sep 14 13:45 ..
...
drwx------ 1848 cmpbib bib  45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
srw-rw-r--    1 cmpbib bib      0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
(this is the faulty test)

full egrep: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
config.log: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
ompi_info: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt

Maybe it aborted (instead of giving the other message) while handling the error, because of export OMPI_MCA_plm_base_verbose=5?

Thanks,

Eric

On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
> Eric, do you mean you have a unique $TMP per a.out? Or a unique $TMP per "batch" of runs? In the first case, my understanding is that conflicts cannot happen... [...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
> Eric,
> a bug has been identified, and a patch is available at https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying a patch does not fit your test workflow, it might be easier for you to update it and mpirun -np 1 ./a.out instead of ./a.out
> basically, increasing verbosity runs some extra code, which includes sprintf. so yes, it is possible to crash an app by increasing verbosity, by running into a bug that is hidden under normal operation. my intuition suggests this is quite unlikely... if you can get a core file and a backtrace, we will soon find out

Damn! I did get one, but it got erased last night when the automatic process started again (which erases all directories before starting)... :/

I would like to put core files in a user-specific directory, but it seems that has to be a system-wide configuration... :/ I will work around this by changing the "pwd" to a path outside the erased directory... So as of tonight I should be able to retrieve core files even after I relaunch the process.

Thanks for all the support!

Eric

> Cheers, Gilles
>
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>> Ok, one test segfaulted, *but* I can't tell if it is the *same* bug, because there has been a segfault: [...]
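Eric's workaround above (changing the working directory so core files land outside the tree that gets erased) can be done directly in the test binary or its launcher. A minimal sketch, assuming a hypothetical persistent directory and that kernel.core_pattern has not redirected core files elsewhere:

/* Minimal sketch of the core-file workaround: raise RLIMIT_CORE and
 * chdir() to a directory outside the tree that the nightly run erases.
 * The path below is hypothetical. */
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    const char *core_dir = "/tmp/kept_cores";   /* hypothetical persistent dir */
    struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };

    mkdir(core_dir, 0700);                      /* EEXIST is fine here */
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        perror("setrlimit(RLIMIT_CORE)");
    /* core files go to the current working directory by default,
     * unless kernel.core_pattern says otherwise */
    if (chdir(core_dir) != 0)
        perror("chdir");

    MPI_Init(&argc, &argv);
    /* ... test body ... */
    MPI_Finalize();
    return 0;
}

The chdir() trick only helps because cores default to the current working directory; a cluster-wide core_pattern would still need the system-wide configuration Eric mentions.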
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi,

I know the pull request has not (yet) been merged, but here is a somewhat "different" output from a single sequential test (automatically) launched without mpirun last night:

[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 1366255883
[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received unexpected process identifier [[41545,0],0] from [[39075,0],0]
[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0]
[lorien:172229] [[39075,0],0] plm:base:receive stop comm

Unfortunately, I didn't get any core dump (???).

The line:
[lorien:172218] Signal code: Invalid permissions (2)
is curious, or not?

As usual, here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt

Will PR #1376 prevent or fix this too?

Thanks again!

Eric

On 15/09/16 09:32 AM, Eric Chamberland wrote:
> Hi Gilles,
> On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
>> Eric, a bug has been identified, and a patch is available at https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch [...]
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi Gilles,

Just to mention that since PR 2091 has been merged into 2.0.x, I haven't had any failures! Since 2.0.0 and 2.0.1 aren't usable for us, the next version should be a good one... So will there be a 2.0.2 release, or will it go to 2.1.0 directly?

Thanks,

Eric

On 16/09/16 10:01 AM, Gilles Gouaillardet wrote:
> Eric,
> I expect the PR will fix this bug. The crash occurs after the unexpected process identifier error, and this error should not happen in the first place. So at this stage, I would not worry too much about that crash (to me, it is undefined behavior anyway).
> Cheers, Gilles
>
> On Friday, September 16, 2016, Eric Chamberland wrote:
>> Hi, I know the pull request has not (yet) been merged, but here is a somewhat "different" output from a single sequential test (automatically) launched without mpirun last night: [...]
[OMPI devel] Bug on branch v2.x since October 3
Hi,

Since commit 18f23724a, our nightly base test is broken on the v2.x branch. Strangely, on branch v3.x it broke the same day with 2fd9510b4b44, but was repaired some days later (can't tell exactly, but at the latest it was fixed with fa3d92981a).

I get segmentation faults or deadlocks in many cases. Could this be related to issue 5842? (https://github.com/open-mpi/ompi/issues/5842)

Here is an example of a backtrace for a deadlock:

#4
#5 0x7f9dc9151d17 in sched_yield () from /lib64/libc.so.6
#6 0x7f9dccee in opal_progress () at runtime/opal_progress.c:243
#7 0x7f9dbe53cf78 in ompi_request_wait_completion (req=0x46ea000) at ../../../../ompi/request/request.h:392
#8 0x7f9dbe53e162 in mca_pml_ob1_recv (addr=0x7f9dd64a6b30 long, long, PAType*, std::__debug::vectorstd::allocator >&)::slValeurs>, count=3, datatype=0x7f9dca61e2c0 , src=1, tag=32767, comm=0x7f9dca62a840 , status=0x7ffcf4f08170) at pml_ob1_irecv.c:129
#9 0x7f9dca35f3c4 in PMPI_Recv (buf=0x7f9dd64a6b30 long, long, PAType*, std::__debug::vectorstd::allocator >&)::slValeurs>, count=3, type=0x7f9dca61e2c0 , source=1, tag=32767, comm=0x7f9dca62a840 , status=0x7ffcf4f08170) at precv.c:77
#10 0x7f9dd6261d06 in assertionValeursIdentiquesSurTousLesProcessus (pComm=0x7f9dca62a840 , pRang=0, pNbProcessus=2, pValeurs=0x7f9dd5a94da0 girefSynchroniseGroupeProcessusModeDebugImpl(PAGroupeProcessus const&, char const*, int)::slDonnees>, pRequetes=std::__debug::vector of length 1, capacity 1 = {...}) at /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/src/commun/Parallele/mpi_giref.cc:332

And some information about the configuration:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2018.10.17.02h16m02s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2018.10.17.02h16m02s_ompi_info_all.txt

Thanks,

Eric
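The deadlocked frame #10 is a consistency check that blocks in MPI_Recv until every rank reaches it. The GIREF implementation is not shown in the thread; the sketch below is only a hypothetical illustration of that kind of check, and of where a receive hangs forever if one rank crashes or never reaches the call:

/* Hypothetical sketch of a "values must be identical on all processes"
 * check, of the kind the backtrace suggests.  This is NOT the GIREF code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

static void assert_same_on_all_ranks(MPI_Comm comm, const long *vals, int n)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        long *other = malloc(n * sizeof(long));
        for (int src = 1; src < size; ++src) {
            /* blocks here forever if rank 'src' never posts the matching send */
            MPI_Recv(other, n, MPI_LONG, src, 0, comm, MPI_STATUS_IGNORE);
            if (memcmp(other, vals, n * sizeof(long)) != 0)
                fprintf(stderr, "rank %d disagrees with rank 0\n", src);
        }
        free(other);
    } else {
        MPI_Send((void *)vals, n, MPI_LONG, 0, 0, comm);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    long v[3] = { 1, 2, 3 };
    assert_same_on_all_ranks(MPI_COMM_WORLD, v, 3);
    MPI_Finalize();
    return 0;
}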
Re: [OMPI devel] Bug on branch v2.x since October 3
ok, thanks a lot! :)

Eric

On 17/10/18 01:32 PM, Nathan Hjelm via devel wrote:
> Ah yes, 18f23724a broke things so we had to fix the fix. Didn't apply it to the v2.x branch. Will open a PR to bring it over.
> -Nathan
>
> On Oct 17, 2018, at 11:28 AM, Eric Chamberland wrote:
>> Hi, since commit 18f23724a, our nightly base test is broken on the v2.x branch. [...]
Re: [OMPI devel] 1.8.4rc2 now available for testing
On 12/11/2014 05:45 AM, Ralph Castain wrote:
> ... by the reporters. Still, I would appreciate a fairly thorough testing as this is expected to be the last 1.8 series release for some time.

Is it relevant to report valgrind leaks? Maybe they are "normal", maybe not, I don't know. If they are normal, maybe suppressions should be added to .../share/openmpi/openmpi-valgrind.supp before the release?

Here is a simple test case ;-) :

cat mpi_init_finalize.c
#include "mpi.h"

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Finalize();
  return 0;
}

mpicc -o mpi_init_finalize mpi_init_finalize.c

mpiexec -np 1 valgrind -v --suppressions=/opt/openmpi-1.8.4rc2/share/openmpi/openmpi-valgrind.supp --gen-suppressions=all --leak-check=full --leak-resolution=high --show-reachable=yes --error-limit=no --num-callers=24 --track-fds=yes --log-file=valgrind_out.n%q{OMPI_COMM_WORLD_RANK} ./mpi_init_finalize

Running with 2 processes generates some more:

mpiexec -np 2 --log-file=valgrind_out_2proc.n%q{OMPI_COMM_WORLD_RANK} ./mpi_init_finalize

which results in the files attached...

Thanks,

Eric

valgrind_out.tgz
Description: application/compressed-tar
Re: [OMPI devel] 1.8.4rc2 now available for testing
On 12/12/2014 11:38 AM, Jeff Squyres (jsquyres) wrote:
> Did you configure OMPI with --enable-memchecker?
No, only "--prefix=".

Eric
Re: [OMPI devel] 1.8.4rc2 now available for testing
On 12/12/2014 01:12 PM, Ralph Castain wrote:
> I just checked it with --enable-memchecker --with-valgrind and found that many of these are legitimate leaks. We can take a look at them, though as I said, perhaps it may wait for 1.8.5 as I wouldn't hold up 1.8.4 for it.

Wait! When end developers of other software valgrind their code, they find leaks from Open MPI and then ask themselves: "Did I misuse MPI?" So they have to look around, into the FAQ, and find this:

http://www.open-mpi.org/faq/?category=debugging#valgrind_clean

and tell themselves: "Fine, now with this suppression file, I am sure the remaining leaks are my fault!" and try to find why these leaks remain in their code... then, not understanding what is wrong... they ask the list to see if it is normal or not... ;-)

May I suggest giving suppressions names like "real_leak_to_be_fixed_in_next_release_#", so at least you won't forget to fix them, and the rest of us won't be upset about a supposed misuse of the library? Or maybe put them into another suppression file? But list them somewhere: that would really help us!

Thanks,

Eric

ps: we valgrind our code every night to be able to detect new leaks or defects as soon as possible...
[OMPI devel] 1.8.4rc3: WARNING: No loopback interface was found
Hi,

I first saw this message using 1.8.4rc3:

--------------------------------------------------------------------------
WARNING: No loopback interface was found. This can cause problems when we spawn processes as they are likely to be unable to connect back to their host daemon. Sadly, it may take awhile for the connect attempt to fail, so you may experience a significant hang time.

You may wish to ctrl-c out of your job and activate loopback support on at least one interface before trying again.
--------------------------------------------------------------------------

I have compiled it in "debug" mode... is that the problem?

...but I think I do have a loopback on my host:

ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:25:90:0D:A5:38
          inet addr:132.203.7.22  Bcast:132.203.7.255  Mask:255.255.255.0
          inet6 addr: fe80::225:90ff:fe0d:a538/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:49080380 errors:0 dropped:0 overruns:0 frame:0
          TX packets:67526463 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:35710440484 (34056.1 Mb)  TX bytes:64050625687 (61083.4 Mb)
          Interrupt:16 Memory:faee-faf0

eth1      Link encap:Ethernet  HWaddr 00:25:90:0D:A5:39
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:17 Memory:fafe-fb00

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:3089696 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3089696 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:8421008033 (8030.8 Mb)  TX bytes:8421008033 (8030.8 Mb)

Is that message erroneous?

Thanks,

Eric
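For what it's worth, the presence of a loopback interface can be checked independently of Open MPI with getifaddrs(). This is only a sanity check: Open MPI's own interface discovery may apply additional filtering (for example through MCA interface include/exclude parameters) before deciding that no loopback is usable, so a positive result here does not by itself prove the warning wrong.

/* Standalone check: does the OS report any interface with IFF_LOOPBACK? */
#include <sys/types.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
    struct ifaddrs *ifa_list, *ifa;
    int found = 0;

    if (getifaddrs(&ifa_list) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_flags & IFF_LOOPBACK) {
            /* an interface may be listed once per address family */
            printf("loopback interface found: %s\n", ifa->ifa_name);
            found = 1;
        }
    }
    freeifaddrs(ifa_list);
    if (!found)
        printf("no loopback interface visible via getifaddrs()\n");
    return 0;
}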
Re: [OMPI devel] 1.8.4rc3: WARNING: No loopback interface was found
Forgot this:

ompi_info -all: http://www.giref.ulaval.ca/~ericc/ompi_bug/ompi_info.all.184rc3.txt.gz
config.log: http://www.giref.ulaval.ca/~ericc/ompi_bug/config.184rc3.log.gz

Eric
[OMPI devel] BUG in ADIOI_NFS_WriteStrided
Hi,

I encountered a new bug while testing our collective MPI I/O functionalities over NFS. This is not a big issue for us, but I think someone should have a look at it.

While running with 3 processes, we get this error on rank #0 and rank #2, knowing that rank #1 has nothing to write (0-length size) on this particular PMPI_File_write_all_begin call:

==19211== Syscall param write(buf) points to uninitialised byte(s)
==19211==    at 0x10CB739D: ??? (in /lib64/libpthread-2.17.so)
==19211==    by 0x27438431: ADIOI_NFS_WriteStrided (ad_nfs_write.c:645)
==19211==    by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==    by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==    by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin (write_allb.c:44)
==19211==    by 0x2742A367: mca_io_romio_file_write_all_begin (io_romio_file_write.c:264)
==19211==    by 0x12126520: PMPI_File_write_all_begin (pfile_write_all_begin.c:74)
==19211==    by 0x4D7CFB: SYEnveloppeMessage PAIO::ecritureIndexeParBlocMPI, PtrPorteurConstArete>, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>, FunctorAccesseurPorteurLocalArete> > >(PAGroupeProcessus&, ompi_file_t*, long long, PtrPorteurConst, PtrPorteurConst, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>&, FunctorAccesseurPorteurLocalArete> >&, long, DistributionComposantes&, long, unsigned long, unsigned long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4E9A67: GISLectureEcriture::visiteMaillage(Maillage const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4C79A2: GISLectureEcriture::ecritGISMPI(std::string, GroupeInfoSur const&, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4961AD: main (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211== Address 0x295af060 is 144 bytes inside a block of size 524,288 alloc'd
==19211==    at 0x4C2C27B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==19211==    by 0x2745E78E: ADIOI_Malloc_fn (malloc.c:50)
==19211==    by 0x2743757C: ADIOI_NFS_WriteStrided (ad_nfs_write.c:497)
==19211==    by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==    by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==    by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin (write_allb.c:44)
==19211==    by 0x2742A367: mca_io_romio_file_write_all_begin (io_romio_file_write.c:264)
==19211==    by 0x12126520: PMPI_File_write_all_begin (pfile_write_all_begin.c:74)
==19211==    by 0x4D7CFB: SYEnveloppeMessage PAIO::ecritureIndexeParBlocMPI, PtrPorteurConstArete>, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>, FunctorAccesseurPorteurLocalArete> > >(PAGroupeProcessus&, ompi_file_t*, long long, PtrPorteurConst, PtrPorteurConst, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>&, FunctorAccesseurPorteurLocalArete> >&, long, DistributionComposantes&, long, unsigned long, unsigned long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4E9A67: GISLectureEcriture::visiteMaillage(Maillage const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4C79A2: GISLectureEcriture::ecritGISMPI(std::string, GroupeInfoSur const&, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==    by 0x4961AD: main (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211== Uninitialised value was created by a heap allocation
==19211==    at 0x4C2C27B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==19211==    by 0x2745E78E: ADIOI_Malloc_fn (malloc.c:50)
==19211==    by 0x2743757C: ADIOI_NFS_WriteStrided (ad_nfs_write.c:497)
==19211==    by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==    by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==    by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin (write_allb.c:44)
==19211==    by 0x2742A367: mca_io_romio_file_write_all_begin (io_romio_file_write.c:264)
==19211==    by 0x12126520: PMPI_File_write_all_begin (pfile_write_all_begin.c:74)
==19211==    by 0x4D7CFB: SYEnveloppeMessage PAIO::ecritureIndexeParBlocMPI, PtrPorteurConstArete>, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>, FunctorAccesseurPorteurLocalArete> > >(PAGroupeProcessus&, ompi_file_t*, long long, PtrPorteurConst, PtrPorteurConst, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>&, FunctorAccesseurPorteurLocalArete> >&, long, DistributionComposantes&, long, unsigned long, unsigned long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.op
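The trigger described above (one rank contributing zero elements to a collective write) can be expressed with a much smaller calling pattern than the full GIREF test. The sketch below is hypothetical, and it uses MPI_File_write_at_all rather than the write_all_begin/end pair from the report; it only shows the zero-contribution collective call that every rank, including the empty one, must still make. Whether it reaches the exact ADIOI_NFS_WriteStrided path depends on the file view, the hints, and the filesystem.

/* Hypothetical minimal pattern: a collective write where rank 1
 * contributes zero elements.  Not the GIREF test code. */
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, nvals, i;
    int vals[4];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    nvals = (rank == 1) ? 0 : 4;          /* rank 1 has nothing to write */
    for (i = 0; i < nvals; ++i)
        vals[i] = rank * 100 + i;

    MPI_File_open(MPI_COMM_WORLD, "nfs_test.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* every rank must take part in the collective call, even with count 0 */
    MPI_File_write_at_all(fh, (MPI_Offset)(rank * 4 * (int)sizeof(int)),
                          vals, nvals, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}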
[OMPI devel] BUG in ADIOI_NFS_WriteStrided
AccesseurPorteurLocalArete> > >(PAGroupeProcessus&, ADIOI_FileD*, long long, PtrPorteurConst, PtrPorteurConst, FunctorCopieInfosSurDansVectPAType, std::vector*, std::allocatorArete>*> > const>&, FunctorAccesseurPorteurLocalArete> >&, long, DistributionComposantes&, long, unsigned long, unsigned long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==    by 0x4DDBFE: GISLectureEcriture::visiteMaillage(Maillage const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==    by 0x4BCB22: GISLectureEcriture::ecritGISMPI(std::string, GroupeInfoSur const&, std::string const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==    by 0x48E213: main (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)

Can't tell if it is a big issue or not, but I thought I should mention it to the list.

We run without this valgrind error when I use my local disk partition instead of an NFS partition, or if I run with only 1 process (which always has something to write for each PMPI_File_write_all_begin) and write to an NFS partition.

Have you guys thought about unifying this part of the code? Making it a sub-library? (Please don't flame me... ;-))

Anyway, thanks,

Eric

On 12/19/2014 02:16 PM, Howard Pritchard wrote:
> Hi Eric,
> Does your app also work with MPICH? The romio in Open MPI is getting a bit old, so it would be useful to know if you see the same valgrind error using a recent MPICH.
> Howard
>
> 2014-12-19 9:50 GMT-07:00 Eric Chamberland wrote:
>> Hi, I encountered a new bug while testing our collective MPI I/O functionalities over NFS. [...]
Re: [OMPI devel] [mpich-discuss] BUG in ADIOI_NFS_WriteStrided
On 12/19/2014 09:52 PM, Rob Latham wrote:
> Please don't use NFS for MPI-IO. ROMIO makes a best effort but there's no way to guarantee you won't corrupt a block of data (NFS clients are allowed to cache... arbitrarily, it seems). There are so many good parallel file systems with saner consistency semantics.

Ok. But how can I know the type of filesystem my users will work on? For small jobs, they may have data on NFS and not care too much about read/write speed... and I want only one file format that can be used on any filesystem...

Do you recommend that I disable ROMIO/NFS support when configuring MPICH (how do you ask configure for this)? What other library is recommended for writing distributed data on NFS? Does HDF5, for example, switch from MPI I/O to something else when doing collective I/O on NFS?

I don't want my file-writing function to depend on the final type of filesystem... I expect the library to do a good job for me... and I have chosen MPI I/O to do that job... ;-)

Can't tell anything about how usable NFS is with MPI I/O... I just use it because our nightly tests write results to NFS partitions... as our users may do...

> This looks like maybe a calloc would clean it right up.

Ok, the point is: is there a bug, and can it be fixed (even if it is not recommended to use ROMIO/NFS), or at least tracked?

Thanks!

Eric
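Rob's calloc remark points at the class of fix: the flagged staging buffer is written to the file with "holes" that were never filled, so allocating it zeroed makes every written byte defined. The sketch below is not the ROMIO code, just a plain C illustration of that difference; the file name and sizes are arbitrary.

/* Illustration of the calloc-style fix: a partially filled buffer is
 * written in full, so zero-initialising it keeps valgrind quiet and
 * makes the on-disk bytes deterministic. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define BUFSZ 524288   /* same size as the block valgrind flagged */

int main(void)
{
    /* char *buf = malloc(BUFSZ);       <- holes stay uninitialised   */
    char *buf = calloc(1, BUFSZ);     /* <- holes are defined zeroes  */
    if (buf == NULL)
        return 1;

    /* fill only part of the buffer, as a strided write does */
    memset(buf, 'x', 128);

    int fd = open("strided.out", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) {
        ssize_t n = write(fd, buf, BUFSZ);   /* no "uninitialised byte(s)" now */
        (void)n;
        close(fd);
    }
    free(buf);
    return 0;
}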
Re: [OMPI devel] Open MPI v5.0.x branch created
Hi,

I just checked out the 5.0.x branch and gave it a try. Is it OK to report problems, or shall we wait until an official rc1?

Thanks,

Eric

ps: I have a bug with MPI_File_open...

On 2021-03-11 1:24 p.m., Geoffrey Paulsen via devel wrote:
> Open MPI developers,
> We've created the Open MPI v5.0.x branch today, and are receiving bugfixes. Please cherry-pick any master PRs to v5.0.x once they've been merged to master. We're targeting an aggressive but achievable release date of May 15th.
> If you're in charge of your organization's CI tests, please enable them for v5.0.x PRs. It may be a few days until all of our CI is enabled on v5.0.x.
> Thanks everyone for your continued commitment to Open MPI's success.
> Josh Ladd, Austen Lauria, and Geoff Paulsen - v5.0 RMs

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
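Eric does not say which MPI_File_open call fails, so the following is only a generic smoke-test skeleton of the kind that could accompany such a report against the v5.0.x branch; the file name and flags are placeholders.

/* Minimal MPI_File_open/close smoke test; not the actual failing case. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_File fh;
    int err;

    MPI_Init(&argc, &argv);
    err = MPI_File_open(MPI_COMM_WORLD, "smoke_test.bin",
                        MPI_MODE_CREATE | MPI_MODE_RDWR,
                        MPI_INFO_NULL, &fh);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI_File_open failed: %s\n", msg);
    } else {
        MPI_File_close(&fh);
    }
    MPI_Finalize();
    return 0;
}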