Dear User Group,
I'm currently trying to run a simple MPI program using NetCDF 4.3.3 and
HDF5 1.8.16 on multiple Xeon Phi nodes.
Each node consists of an Ivy Bridge host and two Intel Xeon Phi
coprocessors.
The important part of the source code is as follows (the whole code is
attached):
// ...
checkNcError(nc_open_par(meshFile, NC_NETCDF4 | NC_MPIIO,
                         MPI_COMM_WORLD, MPI_INFO_NULL, &ncFile));
checkNcError(nc_inq_varid(ncFile, "element_size", &ncVarElemSize));
checkNcError(nc_var_par_access(ncFile, ncVarElemSize, NC_COLLECTIVE));
size_t start[1] = {(size_t) rank};
int elem_size;
checkNcError(nc_get_var1_int(ncFile, ncVarElemSize, start, &elem_size));
printf("Rank %d> Number of elements: %d\n", rank, elem_size);
//...
Here, meshFile is an unstructured grid in NetCDF format containing a
variable element_size with one entry per rank, and checkNcError checks the
return value of each nc call for errors.
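To make the file layout explicit: element_size is a 1-D integer variable with
one entry per partition. The sketch below shows roughly how such a file could
be written serially; the dimension name "partitions" and the element counts
are placeholders, not my actual mesh generator:

#include <stdio.h>
#include <netcdf.h>

/* Sketch only: write an element_size variable with one entry per
 * partition/rank. Dimension name and values are placeholders. */
int main(void) {
    int ncid, dimid, varid;
    const int nparts = 6;
    int elem_size[6] = {3000, 3000, 3000, 3000, 3000, 3000};

    nc_create("cube_example.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "partitions", nparts, &dimid);
    nc_def_var(ncid, "element_size", NC_INT, 1, &dimid, &varid);
    nc_enddef(ncid);
    nc_put_var_int(ncid, varid, elem_size);
    nc_close(ncid);
    return 0;
}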
Running the program on a single node (i.e. one host and two coprocessors,
3 ranks) succeeds:
mpiexec -host host -n 1 prog.host cube_36_10_10_3_1_1.nc : -host
host-mic1 -n 1 prog.mic cube_36_10_10_3_1_1.nc : -host host-mic0 -n 1
prog.mic cube_36_10_10_3_1_1.nc
Reading file: cube_36_10_10_3_1_1.nc
Rank 0> Number of elements: 6000
Rank 2> Number of elements: 6000
Rank 1> Number of elements: 6000
However, executing on two nodes (i.e. 6 ranks with an adjusted mesh file)
fails:
mpiexec -host host1 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host
host1-mic1 -n 1 prog.mic cube_36_10_10_6_1_1.nc : -host host1-mic0 -n 1
prog.mic cube_36_10_10_6_1_1.nc : -host host2 -n 1 prog.host
cube_36_10_10_6_1_1.nc : -host host2-mic1 -n 1 prog.mic
cube_36_10_10_6_1_1.nc : -host host2-mic0 -n 1 prog.mic
cube_36_10_10_6_1_1.nc
Reading file: cube_36_10_10_6_1_1.nc
Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(2434)..................: MPI_Bcast(buf=0x10b3ffc, count=1,
MPI_INT, root=0, comm=0x84000000) failed
MPIR_Bcast_impl(1807).............:
MPIR_Bcast(1835)..................:
I_MPIR_Bcast_intra(2016)..........: Failure during collective
MPIR_Bcast_intra(1665)............: Failure during collective
MPIR_Bcast_intra(1634)............:
MPIR_Bcast_binomial(245)..........:
MPIDI_CH3U_Receive_data_found(131): Message from rank 0 and tag 2
truncated; 24 bytes received but buffer size is 4
Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(2434)..................: MPI_Bcast(buf=0x208bfdc, count=1,
MPI_INT, root=0, comm=0x84000000) failed
MPIR_Bcast_impl(1807).............:
MPIR_Bcast(1835)..................:
I_MPIR_Bcast_intra(2016)..........: Failure during collective
MPIR_Bcast_intra(1665)............: Failure during collective
MPIR_Bcast_intra(1634)............:
MPIR_Bcast_binomial(245)..........:
MPIDI_CH3U_Receive_data_found(131): Message from rank 0 and tag 2
truncated; 24 bytes received but buffer size is 4
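If I read the error correctly, rank 0 broadcasts 24 bytes (6 ints, presumably
one per rank) while the other ranks only post a 4-byte receive buffer. Just to
illustrate the failure mode, here is a standalone sketch that has nothing to
do with HDF5 and only reproduces a count mismatch in MPI_Bcast, which should
give a similar "Message truncated" error:

#include <stdio.h>
#include <mpi.h>

/* Illustration only: root broadcasts 6 ints while the other ranks expect 1,
 * i.e. 24 bytes are sent but only a 4-byte buffer is posted. */
int main(int argc, char* argv[]) {
    int rank;
    int buf[6] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int count = (rank == 0) ? 6 : 1;   /* mismatched count across ranks */
    MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}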
To test the environment for 6 ranks, I executed the program on 6 plain
compute nodes (no coprocessors), which succeeded again:
mpiexec -host host1 -n 1 prog.host cube_36_10_10_6_1_1.nc : -host host2
-n 1 prog.host cube_36_10_10_6_1_1.nc : -host host3 -n 1 prog.host
cube_36_10_10_6_1_1.nc : -host host4 -n 1 prog.host
cube_36_10_10_6_1_1.nc : -host host5 -n 1 prog.host
cube_36_10_10_6_1_1.nc : -host host6 -n 1 prog.host cube_36_10_10_6_1_1.nc
Reading file: cube_36_10_10_6_1_1.nc
Rank 1> Number of elements: 3000
Rank 4> Number of elements: 3000
Rank 3> Number of elements: 3000
Rank 2> Number of elements: 3000
Rank 5> Number of elements: 3000
Rank 0> Number of elements: 3000
Using DDT I was able to determine that the error occurs somewhere inside
H5F_open().
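One thing I still want to rule out is a version mismatch between the host and
MIC builds of the libraries. A small check like the following (using
nc_inq_libvers() and H5get_libversion()) should print the versions each rank
actually links against:

#include <stdio.h>
#include <mpi.h>
#include <netcdf.h>
#include <hdf5.h>

/* Print the NetCDF and HDF5 library versions seen by each rank, to rule
 * out a version mismatch between the host and MIC builds. */
int main(int argc, char* argv[]) {
    int rank;
    unsigned maj, min, rel;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    H5get_libversion(&maj, &min, &rel);
    printf("Rank %d> NetCDF: %s, HDF5: %u.%u.%u\n",
           rank, nc_inq_libvers(), maj, min, rel);

    MPI_Finalize();
    return 0;
}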
I'd be very thankful for any help. Kind regards,
Leo
#include <unistd.h>
#include <stdio.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>
static void checkNcError(int error) {
    /* Print (but do not abort on) any netCDF error. */
    if (error != NC_NOERR)
        printf("Error while reading netCDF file: %s\n", nc_strerror(error));
}

int main(int argc, char* argv[]) {
    char* meshFile = argv[1];
    int rank, size, ncFile, ncVarElemSize;

    MPI_Init(&argc, &argv);                /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* get number of processes */

    if (rank == 0) {
        printf("Reading file: %s\n", meshFile);
    }

    /* Open the mesh file for parallel access and read one value per rank. */
    checkNcError(nc_open_par(meshFile, NC_NETCDF4 | NC_MPIIO, MPI_COMM_WORLD, MPI_INFO_NULL, &ncFile));
    checkNcError(nc_inq_varid(ncFile, "element_size", &ncVarElemSize));
    checkNcError(nc_var_par_access(ncFile, ncVarElemSize, NC_COLLECTIVE));

    size_t start[1] = {(size_t) rank};
    int elem_size;
    checkNcError(nc_get_var1_int(ncFile, ncVarElemSize, start, &elem_size));
    printf("Rank %d> Number of elements: %d\n", rank, elem_size);

    checkNcError(nc_close(ncFile));

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}