Hello,
I am still using LAM/MPI on an old cluster and wonder if I can get
some help from this mailing list. Here is the problem. I am using an
18-node cluster; each node has 2 CPUs and each CPU supports up to 2
threads, so I assume I can use 18*4 = 72 processors. When I run the
following code, an error message always pops up for np=30 or np=60,
but it works fine for np=12 and np=1. The error message is always the
same, something like: one of the processors, n15, exited with (0), ip
192......,

Here is the part of the code where n15 exits. All other PEs can
finish writing their files, except PE15. Then I see the error message
about n15, and the file written by PE15 is incomplete. A quick
question here: is PE15 necessarily running on node 15 of the
cluster? I would appreciate it if anyone could share experience in
debugging errors like this.

code:
....
        // my_rank is the processor ID; each PE opens a different file
        sprintf(p_obsfile,"%s%d",obsfile,my_rank);
        if ((fp=fopen(p_obsfile,"w"))==NULL)
                printf("PE_%d: The file %s cannot be opened\n",my_rank,p_obsfile);

        // loc=TotalNum/NumofPE
        for (int id=loc*my_rank;id<loc*(my_rank+1);id++){
                // call a function to calculate U; the function will return
                // the finishing message
                // no communication is needed among processors
                for (int j=0;j<NUM;j++)
                        fprintf (fp, "%f\n",U[j]); //output updated U
        }
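
One thing that may help narrow this down is checking every I/O call
in the write step. In the fragment above, a failed fopen only prints a
message and the loop then still writes to fp. Below is a minimal
sketch of a checked version, not the actual program: the helper name
write_rank_file, the 512-byte path buffer, and the assumption that U
is an array of double are all hypothetical and only for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the original program): writes n values
 * from U to the file "<prefix><rank>" and reports every I/O failure.
 * Assumes U holds doubles. Returns 0 on success, -1 on any error. */
int write_rank_file(const char *prefix, int rank, const double *U, int n)
{
    char path[512];  /* buffer size is an arbitrary choice for this sketch */
    int len = snprintf(path, sizeof path, "%s%d", prefix, rank);
    if (len < 0 || (size_t)len >= sizeof path) {
        fprintf(stderr, "PE_%d: file name too long\n", rank);
        return -1;
    }

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        fprintf(stderr, "PE_%d: cannot open %s: %s\n",
                rank, path, strerror(errno));
        return -1;  /* stop here instead of writing to a NULL stream */
    }

    for (int j = 0; j < n; j++) {
        if (fprintf(fp, "%f\n", U[j]) < 0) {
            fprintf(stderr, "PE_%d: write to %s failed: %s\n",
                    rank, path, strerror(errno));
            fclose(fp);
            return -1;
        }
    }

    /* fclose flushes buffered output, so some write errors
     * (for example a full disk) may only show up at this point. */
    if (fclose(fp) != 0) {
        fprintf(stderr, "PE_%d: closing %s failed: %s\n",
                rank, path, strerror(errno));
        return -1;
    }
    return 0;
}
```

Each PE would call something like write_rank_file(obsfile, my_rank, U, NUM)
and print or abort on a -1 return, so that the failing rank reports why
its file is incomplete. Messages go to stderr because it is normally
unbuffered and so less likely to be lost if the process dies early.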
