There is definitely something wrong with the types.

OMPI_DATATYPE_MAX_PREDEFINED is set to 45, while there are 55 predefined types. When accessing ompi_op_ddt_map[ddt->id] with MPI_REAL8 (ddt->id=54), we read past the end of the array and end up in the ompi_mpi_op_bxor struct.

Depending on various things (padding, uninitialized memory), we may get 0 and not crash. If you're not lucky, you get a random value and crash soon afterwards.
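
To make the out-of-bounds read described above concrete, here is a minimal sketch in C. The names MAX_PREDEFINED, ID_REAL8, ddt_map and lookup_op_type are hypothetical stand-ins, not the actual Open MPI definitions; only the values 45 and 54 come from this report.

```c
/* Hypothetical stand-ins for the real Open MPI constants. */
#define MAX_PREDEFINED 45   /* size of the map, as in the report */
#define ID_REAL8       54   /* id of MPI_REAL8, as in the report */

/* Zero-initialized because it has static storage duration. */
static int ddt_map[MAX_PREDEFINED];

/* Return the mapped op type, or -1 when the id falls outside the map.
 * Without this check, ddt_map[54] would read past the end of the array
 * into whatever object happens to be placed after it. */
static int lookup_op_type(int ddt_id)
{
    if (ddt_id < 0 || ddt_id >= MAX_PREDEFINED) {
        return -1;
    }
    return ddt_map[ddt_id];
}
```

With these assumed values, lookup_op_type(ID_REAL8) returns -1 instead of reading unrelated memory.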

So I extended things a bit, and it seems to fix my problem. I'm not sure all types are now handled; I just added mappings for some that were missing.
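
One possible way to check whether every type is handled is to fill the map with a sentinel before the assignments and scan it afterwards. This is only a sketch: UNMAPPED, map_reset and map_count_unmapped are hypothetical names, and whether an unmapped id should count as an error in Open MPI is an assumption.

```c
#include <stdio.h>

#define MAX_PREDEFINED 55    /* value after the patch */
#define UNMAPPED       (-1)  /* hypothetical sentinel for "no op type" */

/* Mark every entry as unmapped before the real assignments run. */
static void map_reset(int map[], int n)
{
    for (int i = 0; i < n; i++) {
        map[i] = UNMAPPED;
    }
}

/* Report and count the ids that were never assigned an op type. */
static int map_count_unmapped(const int map[], int n)
{
    int missing = 0;
    for (int i = 0; i < n; i++) {
        if (map[i] == UNMAPPED) {
            fprintf(stderr, "datatype id %d has no op mapping\n", i);
            missing++;
        }
    }
    return missing;
}
```

Calling map_count_unmapped() right after the block of assignments would list any ids that still have no mapping.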

Sylvain

diff -r e82b914000bd -r 1a40aee2925c ompi/datatype/ompi_datatype.h
--- a/ompi/datatype/ompi_datatype.h     Thu Dec 03 04:46:31 2009 +0000
+++ b/ompi/datatype/ompi_datatype.h     Fri Dec 04 19:59:26 2009 +0100
@@ -57,7 +57,7 @@
 #define OMPI_DATATYPE_FLAG_DATA_FORTRAN  0xC000
 #define OMPI_DATATYPE_FLAG_DATA_LANGUAGE 0xC000

-#define OMPI_DATATYPE_MAX_PREDEFINED 45
+#define OMPI_DATATYPE_MAX_PREDEFINED 55

 #if OMPI_DATATYPE_MAX_PREDEFINED > OPAL_DATATYPE_MAX_SUPPORTED
 #error Need to increase the number of supported dataypes by OPAL (value OPAL_DATATYPE_MAX_SUPPORTED).
diff -r e82b914000bd -r 1a40aee2925c ompi/op/op.c
--- a/ompi/op/op.c      Thu Dec 03 04:46:31 2009 +0000
+++ b/ompi/op/op.c      Fri Dec 04 19:59:26 2009 +0100
@@ -137,6 +137,14 @@
     ompi_op_ddt_map[OMPI_DATATYPE_MPI_2INTEGER] = OMPI_OP_BASE_TYPE_2INTEGER;
     ompi_op_ddt_map[OMPI_DATATYPE_MPI_LONG_DOUBLE_INT] = OMPI_OP_BASE_TYPE_LONG_DOUBLE_INT;
     ompi_op_ddt_map[OMPI_DATATYPE_MPI_WCHAR] = OMPI_OP_BASE_TYPE_WCHAR;
+    ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER2] = OMPI_OP_BASE_TYPE_INTEGER2;
+    ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER4] = OMPI_OP_BASE_TYPE_INTEGER4;
+    ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER8] = OMPI_OP_BASE_TYPE_INTEGER8;
+    ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER16] = OMPI_OP_BASE_TYPE_INTEGER16;
+    ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL2] = OMPI_OP_BASE_TYPE_REAL2;
+    ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL4] = OMPI_OP_BASE_TYPE_REAL4;
+    ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL8] = OMPI_OP_BASE_TYPE_REAL8;
+    ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL16] = OMPI_OP_BASE_TYPE_REAL16;

     /* Create the intrinsic ops */

diff -r e82b914000bd -r 1a40aee2925c opal/datatype/opal_datatype.h
--- a/opal/datatype/opal_datatype.h     Thu Dec 03 04:46:31 2009 +0000
+++ b/opal/datatype/opal_datatype.h     Fri Dec 04 19:59:26 2009 +0100
@@ -56,7 +56,7 @@
  *
  * XXX TODO Adapt to whatever the OMPI-layer needs
  */
-#define OPAL_DATATYPE_MAX_SUPPORTED  46
+#define OPAL_DATATYPE_MAX_SUPPORTED  56


 /* flags for the datatypes. */

On Fri, 4 Dec 2009, Sylvain Jeaugey wrote:

For the record, and to try to explain why all MTT tests may have missed this "bug", configuring without --enable-debug makes the bug disappear.

Still trying to figure out why.

Sylvain

On Thu, 3 Dec 2009, Sylvain Jeaugey wrote:

Hi list,

I hope this time I won't be the only one to suffer this bug :)

It is very simple indeed: just perform an allreduce with MPI_REAL8 (Fortran) and you should get a crash in ompi/op/op.h:411. Tested with trunk and v1.5; it works fine on v1.3.

From what I understand, in the trunk, MPI_REAL8 now has a fixed id (in ompi/datatype/ompi_datatype_internal.h), but operations do not have an index going as far as 54 (0x36), leading to a crash when looking up op->o_func.intrinsic.fns[ompi_op_ddt_map[ddt->id]] in ompi_op_is_valid() (or, if I disable mpi_param_check, in ompi_op_reduce()).
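
The two-step lookup described above (datatype id, then op type index, then function pointer) can be sketched as follows. Everything here is a hypothetical simplification: struct op_sketch, find_kernel, sum_double and the sizes N_DDT_IDS and N_OP_TYPES are illustrative and do not reflect the real Open MPI structures.

```c
#include <stddef.h>

#define N_DDT_IDS  55  /* hypothetical: number of datatype ids */
#define N_OP_TYPES 8   /* hypothetical: number of op base types */

typedef void (*reduce_fn)(const void *in, void *inout, int count);

struct op_sketch {
    reduce_fn fns[N_OP_TYPES];  /* one kernel per op base type */
};

/* Resolve datatype id -> op type index -> kernel.
 * Returns NULL if either index is out of range or no kernel is set. */
static reduce_fn find_kernel(const struct op_sketch *op,
                             const int ddt_map[N_DDT_IDS], int ddt_id)
{
    if (ddt_id < 0 || ddt_id >= N_DDT_IDS) {
        return NULL;
    }
    int idx = ddt_map[ddt_id];
    if (idx < 0 || idx >= N_OP_TYPES) {
        return NULL;
    }
    return op->fns[idx];
}

/* Example kernel: element-wise sum of doubles, inout[i] += in[i]. */
static void sum_double(const void *in, void *inout, int count)
{
    const double *a = in;
    double *b = inout;
    for (int i = 0; i < count; i++) {
        b[i] += a[i];
    }
}
```

A caller would treat a NULL result as "this op is not valid for this datatype" rather than dereferencing it.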

Here is a reproducer, just in case:
program main
  use mpi
  implicit none
  integer :: ierr
  real(8) :: myreal, realsum
  myreal = 1.0d0  ! initialize the send buffer
  call MPI_INIT(ierr)
  call MPI_ALLREDUCE(myreal, realsum, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierr)
  call MPI_FINALIZE(ierr)
end program main

Does anyone have an idea how to fix this? Or am I doing something wrong?

Thanks for any help,
Sylvain



_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

