Hello,

After a whole day of coding I'm struggling with one small fragment that behaves strangely. For testing I have one head node and two worker nodes on localhost. Here is the code (with debugging additions such as sleeps and barriers):

void CImageData::SpreadToNodes()
{
   sleep(5);
   logger->debug("CImageData::SpreadToNodes, w=%d h=%d type=%d",
                   this->width, this->height, this->type);

   logger->debug("head barrier");
   MPI_Barrier(MPI_COMM_WORLD);
   sleep(2);
   MPI_Barrier(MPI_COMM_WORLD);

   // debug 'sync' test
   logger->debug("head send SYNC str");
   char buf[5];
   buf[0] = 'S'; buf[1] = 'Y'; buf[2] = 'N'; buf[3] = 'C';
   for (int nodeId = 1; nodeId < g_NumProcesses; nodeId++)
   {
      MPI_Send(buf, 4, MPI_CHAR, nodeId, DEF_MSG_TAG, MPI_COMM_WORLD);
   }

   logger->debug("head bcast width: %d", this->width);
   MPI_Bcast(&(this->width), 1, MPI_INT, 0, MPI_COMM_WORLD);
   logger->debug("head bcast height: %d", this->height);
   MPI_Bcast(&(this->height), 1, MPI_INT, 0, MPI_COMM_WORLD);
   logger->debug("head bcast type: %d", this->type);
   MPI_Bcast(&(this->type), 1, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

   logger->debug("head sleep 10s");
   sleep(10);

   logger->debug("finished CImageData::SpreadToNodes");
}

// this function is declared static:
CImageData *CImageData::ReceiveFromHead()
{
   sleep(5);

   logger->debug("CImageData::ReceiveFromHead");
   MPI_Status status;
   int _width;
   int _height;
   byte _type;

   logger->debug("worker barrier");
   MPI_Barrier(MPI_COMM_WORLD);
   sleep(2);
   MPI_Barrier(MPI_COMM_WORLD);

   char buf[5];
   MPI_Recv(buf, 4, MPI_CHAR, HEAD_NODE, DEF_MSG_TAG, MPI_COMM_WORLD, &status);
   logger->debug("worker received sync str: '%c' '%c' '%c' '%c'",
                   buf[0], buf[1], buf[2], buf[3]);

   logger->debug("worker bcast width");
   MPI_Bcast(&(_width), 1, MPI_INT, 0, MPI_COMM_WORLD);
   logger->debug("worker bcast height");
   MPI_Bcast(&(_height), 1, MPI_INT, 0, MPI_COMM_WORLD);
   logger->debug("worker bcast type");
   MPI_Bcast(&(_type), 1, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

   logger->debug("width=%d height=%d type=%d", _width, _height, _type);

   // TODO: create CImageData object, return...
   return NULL;
}


This part of the code gives me the following error:
RANK 0 -> PID 17115
RANK 1 -> PID 17116
RANK 2 -> PID 17117

2007-10-02 19:50:37,829 [17115] DEBUG: CImageData::SpreadToNodes, w=768 h=576 type=1
2007-10-02 19:50:37,829 [17117] DEBUG: CImageData::ReceiveFromHead
2007-10-02 19:50:37,829 [17115] DEBUG: head barrier
2007-10-02 19:50:37,829 [17116] DEBUG: CImageData::ReceiveFromHead
2007-10-02 19:50:37,829 [17116] DEBUG: worker barrier
2007-10-02 19:50:37,829 [17117] DEBUG: worker barrier
2007-10-02 19:50:39,836 [17115] DEBUG: head send SYNC str
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast width: 768
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast height: 576
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast type: 1
2007-10-02 19:50:39,836 [17115] DEBUG: head sleep 10s
2007-10-02 19:50:39,836 [17116] DEBUG: worker received sync str: 'S' 'Y' 'N' 'C'
2007-10-02 19:50:39,836 [17116] DEBUG: worker bcast width
[pc801:17116] *** An error occurred in MPI_Bcast
[pc801:17116] *** on communicator MPI_COMM_WORLD
[pc801:17116] *** MPI_ERR_TRUNCATE: message truncated
[pc801:17116] *** MPI_ERRORS_ARE_FATAL (goodbye)
2007-10-02 19:50:39,836 [17117] DEBUG: worker received sync str: 'S' 'Y' 'N' 'C'
2007-10-02 19:50:39,836 [17117] DEBUG: worker bcast width
[pc801:17117] *** An error occurred in MPI_Bcast
[pc801:17117] *** on communicator MPI_COMM_WORLD
[pc801:17117] *** MPI_ERR_TRUNCATE: message truncated
[pc801:17117] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 17115 on node pc801 exited on signal 15 (Terminated).


Could the data stream have gotten out of sync somewhere before this part? The project is quite large and there is a lot of communication between processes before CImageData::SpreadToNodes(), so debugging all of it could take hours or days; however, the data flow before this particular fragment seems to be OK. How can MPI_Send/MPI_Recv deliver a correct buffer (4 chars, "SYNC") while an MPI_Bcast of a single MPI_INT is truncated? I ran the code under Valgrind: it did not complain, and the result was exactly the same. Can I assume that a memory access error somewhere before this part has corrupted the MPI structures? How exactly does MPI_Bcast work?
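
To make the "out of sync" question concrete, here is a toy model in plain C++ (no MPI calls at all) of what I suspect might be happening. It assumes that broadcasts on one communicator are paired purely by the order in which each rank calls them, since collectives carry no tag. The names BcastCall, Match and matchBroadcasts are made up for this illustration, and a real MPI implementation may behave differently, so this is only a sketch of the hypothesis, not a description of Open MPI internals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: one broadcast call as seen by a single rank.
struct BcastCall {
    std::string label;   // name used for reading the result
    std::size_t bytes;   // count * size of the datatype passed by that rank
};

// Outcome of pairing the root's k-th broadcast with a worker's k-th one.
enum class Match { Ok, Truncated, TooShort, Unmatched };

// Pair the calls strictly by position. Collectives have no tag, so in this
// model the k-th call on the root meets the k-th call on the worker,
// whatever each side believes it is sending or receiving.
std::vector<Match> matchBroadcasts(const std::vector<BcastCall>& root,
                                   const std::vector<BcastCall>& worker)
{
    std::vector<Match> result;
    const std::size_t n = std::max(root.size(), worker.size());
    for (std::size_t k = 0; k < n; ++k) {
        if (k >= root.size() || k >= worker.size()) {
            result.push_back(Match::Unmatched);  // one side has no partner call
        } else if (root[k].bytes > worker[k].bytes) {
            result.push_back(Match::Truncated);  // receive buffer too small
        } else if (root[k].bytes < worker[k].bytes) {
            result.push_back(Match::TooShort);   // sizes disagree the other way
        } else {
            result.push_back(Match::Ok);
        }
    }
    return result;
}
```

If the head node had issued one extra broadcast earlier (say 16 bytes, a purely hypothetical size) that the workers never called, the root sequence would be {16, 4, 4, 1} bytes against {4, 4, 1} on a worker. In this model the worker's "width" call is then paired with the stale 16-byte broadcast and comes out as Truncated, which is the same place my real run fails, even though the point-to-point "SYNC" message, matched by tag, arrives intact.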

Sorry to bother you, but I'm a little confused.
Thank you & greetings, Marcin
