There is definitely something wrong with the types.
OMPI_DATATYPE_MAX_PREDEFINED is set to 45, while there are 55 predefined
types. When accessing ompi_op_ddt_map[ddt->id] with MPI_REAL8
(ddt->id=54), we read past the end of the array and into the
ompi_mpi_op_bxor struct.
Depending on various things (padding, uninitialized memory), we may get 0
and not crash. If you're not lucky, you get a random value and crash soon
afterwards.
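To make the failure mode concrete, here is a minimal sketch of a lookup that refuses ids beyond the table instead of reading whatever sits after it. The `demo_` names and the stand-alone table are hypothetical; only the numbers 45 and 54 come from the description above, and this is not the actual OMPI code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: only the values 45 and 54 are taken from this mail. */
#define DEMO_MAX_PREDEFINED 45
#define DEMO_REAL8_ID       54

/* Zero-initialized table with one slot per id the macro claims to cover. */
static int demo_ddt_map[DEMO_MAX_PREDEFINED];

/* Debug-style accessor: abort on an out-of-range id rather than read garbage. */
static int demo_map_get(size_t id)
{
    assert(id < DEMO_MAX_PREDEFINED);
    return demo_ddt_map[id];
}

/* Checked accessor: report an out-of-range id with -1. */
static int demo_map_lookup(size_t id)
{
    if (id >= DEMO_MAX_PREDEFINED) {
        return -1;
    }
    return demo_ddt_map[id];
}
```

With these definitions, demo_map_lookup(DEMO_REAL8_ID) returns -1, because 54 lies beyond the last valid index, 44.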
So I raised both limits and added the missing map entries, which seems to
fix my problem. I'm not sure all types are now handled; I only added some
of those that were not defined.
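One way to find out which ids still lack a mapping would be to fill the table with a sentinel first and count what is left after initialization. This is only a sketch under assumptions: the `cov_` names, the table size of 8 and the three sample entries are made up for illustration and do not reflect the real ompi_op_ddt_map contents.

```c
#include <assert.h>
#include <stddef.h>

#define COV_MAX_PREDEFINED 8      /* hypothetical size, kept small on purpose */
#define COV_UNMAPPED       (-1)   /* sentinel: no op type assigned to this id */

static int cov_map[COV_MAX_PREDEFINED];

/* Store one mapping; the assert catches ids that do not fit in the table. */
static void cov_map_set(size_t id, int op_type)
{
    assert(id < COV_MAX_PREDEFINED);
    cov_map[id] = op_type;
}

/* Mark every slot as unmapped, then fill in the (made-up) known entries. */
static void cov_map_init(void)
{
    for (size_t i = 0; i < COV_MAX_PREDEFINED; ++i) {
        cov_map[i] = COV_UNMAPPED;
    }
    cov_map_set(0, 10);
    cov_map_set(1, 11);
    cov_map_set(5, 12);
}

/* Count the ids that still carry the sentinel after initialization. */
static size_t cov_count_unmapped(void)
{
    size_t n = 0;
    for (size_t i = 0; i < COV_MAX_PREDEFINED; ++i) {
        if (cov_map[i] == COV_UNMAPPED) {
            ++n;
        }
    }
    return n;
}
```

After cov_map_init(), cov_count_unmapped() reports how many ids nobody assigned, which is the kind of check that would tell whether any predefined type was forgotten.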
Sylvain
diff -r e82b914000bd -r 1a40aee2925c ompi/datatype/ompi_datatype.h
--- a/ompi/datatype/ompi_datatype.h Thu Dec 03 04:46:31 2009 +0000
+++ b/ompi/datatype/ompi_datatype.h Fri Dec 04 19:59:26 2009 +0100
@@ -57,7 +57,7 @@
#define OMPI_DATATYPE_FLAG_DATA_FORTRAN 0xC000
#define OMPI_DATATYPE_FLAG_DATA_LANGUAGE 0xC000
-#define OMPI_DATATYPE_MAX_PREDEFINED 45
+#define OMPI_DATATYPE_MAX_PREDEFINED 55
#if OMPI_DATATYPE_MAX_PREDEFINED > OPAL_DATATYPE_MAX_SUPPORTED
#error Need to increase the number of supported dataypes by OPAL (value OPAL_DATATYPE_MAX_SUPPORTED).
diff -r e82b914000bd -r 1a40aee2925c ompi/op/op.c
--- a/ompi/op/op.c Thu Dec 03 04:46:31 2009 +0000
+++ b/ompi/op/op.c Fri Dec 04 19:59:26 2009 +0100
@@ -137,6 +137,14 @@
ompi_op_ddt_map[OMPI_DATATYPE_MPI_2INTEGER] = OMPI_OP_BASE_TYPE_2INTEGER;
ompi_op_ddt_map[OMPI_DATATYPE_MPI_LONG_DOUBLE_INT] = OMPI_OP_BASE_TYPE_LONG_DOUBLE_INT;
ompi_op_ddt_map[OMPI_DATATYPE_MPI_WCHAR] = OMPI_OP_BASE_TYPE_WCHAR;
+ ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER2] = OMPI_OP_BASE_TYPE_INTEGER2;
+ ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER4] = OMPI_OP_BASE_TYPE_INTEGER4;
+ ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER8] = OMPI_OP_BASE_TYPE_INTEGER8;
+ ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER16] = OMPI_OP_BASE_TYPE_INTEGER16;
+ ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL2] = OMPI_OP_BASE_TYPE_REAL2;
+ ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL4] = OMPI_OP_BASE_TYPE_REAL4;
+ ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL8] = OMPI_OP_BASE_TYPE_REAL8;
+ ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL16] = OMPI_OP_BASE_TYPE_REAL16;
/* Create the intrinsic ops */
diff -r e82b914000bd -r 1a40aee2925c opal/datatype/opal_datatype.h
--- a/opal/datatype/opal_datatype.h Thu Dec 03 04:46:31 2009 +0000
+++ b/opal/datatype/opal_datatype.h Fri Dec 04 19:59:26 2009 +0100
@@ -56,7 +56,7 @@
*
* XXX TODO Adapt to whatever the OMPI-layer needs
*/
-#define OPAL_DATATYPE_MAX_SUPPORTED 46
+#define OPAL_DATATYPE_MAX_SUPPORTED 56
/* flags for the datatypes. */
On Fri, 4 Dec 2009, Sylvain Jeaugey wrote:
For the record, and to try to explain why all MTT tests may have missed this
"bug", configuring without --enable-debug makes the bug disappear.
Still trying to figure out why.
Sylvain
On Thu, 3 Dec 2009, Sylvain Jeaugey wrote:
Hi list,
I hope this time I won't be the only one to suffer from this bug :)
It is very simple indeed: just perform an allreduce with MPI_REAL8
(Fortran) and you should get a crash in ompi/op/op.h:411. This happens with
trunk and v1.5; v1.3 works fine.
From what I understand, in the trunk, MPI_REAL8 now has a fixed id (in
ompi/datatype/ompi_datatype_internal.h), but operations do not have an
index going as far as 54 (0x36), leading to a crash when looking for
op->o_func.intrinsic.fns[ompi_op_ddt_map[ddt->id]] in ompi_op_is_valid()
(or, if I disable mpi_param_check, in ompi_op_reduce()).
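The two-step lookup described here (datatype id to op type, then op type to function) can be guarded at each step. The sketch below uses hypothetical `sk_` types and tiny made-up tables; it only illustrates the shape of such a check and is not the real ompi_op_is_valid().

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_DDT      4   /* hypothetical number of datatype ids */
#define SK_MAX_OP_TYPES 3   /* hypothetical number of op base types */

typedef void (*sk_fn_t)(const void *in, void *inout, int count);

/* One function slot per op base type; NULL means "not supported". */
struct sk_op {
    sk_fn_t fns[SK_MAX_OP_TYPES];
};

/* Made-up map from datatype id to op base type; -1 means "no mapping". */
static const int sk_ddt_map[SK_MAX_DDT] = { 0, 1, -1, 2 };

/* Element-wise sum on doubles, standing in for an intrinsic function. */
static void sk_sum_double(const void *in, void *inout, int count)
{
    const double *a = in;
    double *b = inout;
    for (int i = 0; i < count; ++i) {
        b[i] += a[i];
    }
}

/* Return 1 only if both indices are in range and a function is registered. */
static int sk_op_is_valid(const struct sk_op *op, size_t ddt_id)
{
    assert(op != NULL);
    if (ddt_id >= SK_MAX_DDT) {
        return 0;               /* id beyond the map, e.g. 54 */
    }
    int type = sk_ddt_map[ddt_id];
    if (type < 0 || type >= SK_MAX_OP_TYPES) {
        return 0;               /* no op base type for this datatype */
    }
    return op->fns[type] != NULL;
}
```

With an op that only registers sk_sum_double in slot 1, the check accepts datatype id 1 and rejects an id such as 54 before any table is indexed.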
Here is a reproducer, just in case:
program main
use mpi
integer ierr
real(8) myreal, realsum
call MPI_INIT(ierr)
call MPI_ALLREDUCE(myreal, realsum, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierr)
call MPI_FINALIZE(ierr)
stop
end
Does anyone have an idea how to fix this? Or am I doing something wrong?
Thanks for any help,
Sylvain