Hello,
After a whole day of coding I'm struggling a bit with one small
fragment that seems strange to me.
For testing I have one head node and two worker nodes on localhost.
Here is the code (with debug stuff added, like sleeps, barriers, etc.):
void CImageData::SpreadToNodes()
{
    sleep(5);
    logger->debug("CImageData::SpreadToNodes, w=%d h=%d type=%d",
                  this->width, this->height, this->type);

    logger->debug("head barrier");
    MPI_Barrier(MPI_COMM_WORLD);
    sleep(2);
    MPI_Barrier(MPI_COMM_WORLD);

    // debug 'sync' test
    logger->debug("head send SYNC str");
    char buf[5];
    buf[0] = 'S'; buf[1] = 'Y'; buf[2] = 'N'; buf[3] = 'C';
    for (int nodeId = 1; nodeId < g_NumProcesses; nodeId++)
    {
        MPI_Send(buf, 4, MPI_CHAR, nodeId, DEF_MSG_TAG, MPI_COMM_WORLD);
    }

    logger->debug("head bcast width: %d", this->width);
    MPI_Bcast(&(this->width), 1, MPI_INT, 0, MPI_COMM_WORLD);
    logger->debug("head bcast height: %d", this->height);
    MPI_Bcast(&(this->height), 1, MPI_INT, 0, MPI_COMM_WORLD);
    logger->debug("head bcast type: %d", this->type);
    MPI_Bcast(&(this->type), 1, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    logger->debug("head sleep 10s");
    sleep(10);
    logger->debug("finished CImageData::SpreadToNodes");
}
// this function is declared static:
CImageData *CImageData::ReceiveFromHead()
{
    sleep(5);
    logger->debug("CImageData::ReceiveFromHead");

    MPI_Status status;
    int _width;
    int _height;
    byte _type;

    logger->debug("worker barrier");
    MPI_Barrier(MPI_COMM_WORLD);
    sleep(2);
    MPI_Barrier(MPI_COMM_WORLD);

    char buf[5];
    MPI_Recv(buf, 4, MPI_CHAR, HEAD_NODE, DEF_MSG_TAG, MPI_COMM_WORLD,
             &status);
    logger->debug("worker received sync str: '%c' '%c' '%c' '%c'",
                  buf[0], buf[1], buf[2], buf[3]);

    logger->debug("worker bcast width");
    MPI_Bcast(&(_width), 1, MPI_INT, 0, MPI_COMM_WORLD);
    logger->debug("worker bcast height");
    MPI_Bcast(&(_height), 1, MPI_INT, 0, MPI_COMM_WORLD);
    logger->debug("worker bcast type");
    MPI_Bcast(&(_type), 1, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
    logger->debug("width=%d height=%d type=%d", _width, _height, _type);

    // TODO: create CImageData object, return...
    return NULL;
}
That part of the code gives me the following error:
RANK 0 -> PID 17115
RANK 1 -> PID 17116
RANK 2 -> PID 17117
2007-10-02 19:50:37,829 [17115] DEBUG: CImageData::SpreadToNodes, w=768 h=576 type=1
2007-10-02 19:50:37,829 [17117] DEBUG: CImageData::ReceiveFromHead
2007-10-02 19:50:37,829 [17115] DEBUG: head barrier
2007-10-02 19:50:37,829 [17116] DEBUG: CImageData::ReceiveFromHead
2007-10-02 19:50:37,829 [17116] DEBUG: worker barrier
2007-10-02 19:50:37,829 [17117] DEBUG: worker barrier
2007-10-02 19:50:39,836 [17115] DEBUG: head send SYNC str
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast width: 768
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast height: 576
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast type: 1
2007-10-02 19:50:39,836 [17115] DEBUG: head sleep 10s
2007-10-02 19:50:39,836 [17116] DEBUG: worker received sync str: 'S' 'Y' 'N' 'C'
2007-10-02 19:50:39,836 [17116] DEBUG: worker bcast width
[pc801:17116] *** An error occurred in MPI_Bcast
[pc801:17116] *** on communicator MPI_COMM_WORLD
[pc801:17116] *** MPI_ERR_TRUNCATE: message truncated
[pc801:17116] *** MPI_ERRORS_ARE_FATAL (goodbye)
2007-10-02 19:50:39,836 [17117] DEBUG: worker received sync str: 'S' 'Y' 'N' 'C'
2007-10-02 19:50:39,836 [17117] DEBUG: worker bcast width
[pc801:17117] *** An error occurred in MPI_Bcast
[pc801:17117] *** on communicator MPI_COMM_WORLD
[pc801:17117] *** MPI_ERR_TRUNCATE: message truncated
[pc801:17117] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 17115 on node pc801 exited on signal 15 (Terminated).
Could it be that somewhere before this part the data stream got out of
sync? The project is quite large and there is a lot of communication
between processes before CImageData::SpreadToNodes(), so debugging all
of it could take hours or days; however, the data flow before this
particular fragment seems to be OK. How can it be that MPI_Send/MPI_Recv
delivered a good buffer (the 4 chars of SYNC) while the MPI_Bcast of a
single MPI_INT gets truncated? I ran the code under Valgrind - it didn't
complain and gave exactly the same result. Can I assume that I possibly
have a memory-access error somewhere before this part that corrupted
MPI's internal structures? How exactly does MPI_Bcast work?
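
To check my understanding of how MPI_Bcast calls get matched, I put
together a small standalone test (not from my project, just a
hypothetical sketch) where the root broadcasts more ints than the
workers expect. If I understand the semantics correctly, this should
abort with the same MPI_ERR_TRUNCATE, which would fit the "out of sync
earlier" theory:

// bcast_mismatch.cpp -- hypothetical standalone sketch, not project code.
// The root broadcasts TWO ints but the workers only expect ONE, so the
// workers' receive buffer is smaller than the incoming message. I would
// expect this to die with MPI_ERR_TRUNCATE under the default
// MPI_ERRORS_ARE_FATAL handler.
// Assumed build/run: mpicxx bcast_mismatch.cpp -o bcast_mismatch
//                    mpirun -np 3 ./bcast_mismatch
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        int vals[2] = { 768, 576 };          // root sends 2 ints
        MPI_Bcast(vals, 2, MPI_INT, 0, MPI_COMM_WORLD);
    }
    else
    {
        int width = 0;                       // workers expect only 1 int
        MPI_Bcast(&width, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d got width=%d\n", rank, width);
    }

    MPI_Finalize();
    return 0;
}

So if some collective issued earlier on MPI_COMM_WORLD was never matched
by the workers, my three MPI_Bcast calls could be pairing up with the
wrong calls on the head side, which might explain a truncation even
though the Send/Recv pair right before it looks fine.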
Sorry to disturb you, but I'm a little bit confused.
Thank you & greetings, Marcin