Hello, I am still using LAM/MPI on an old cluster and wonder if I can get some help from this mailing list. Here is the problem. I am using an 18-node cluster; each node has 2 CPUs and each CPU supports up to 2 threads, so I assume I can run up to 18*4 = 72 processes. When I run the following code, an error message always pops up for np=30 or np=60, but it works fine for np=12 and np=1. The error message is always the same, something like: one of the processor n15, exit with (0), ip 192......,
Here is the part of the code where n15 exits. All other PEs finish writing their files, except PE15. Then I see the error message about n15, and the file written by PE15 is incomplete. A quick question here: is PE15 necessarily running on node 15 of the cluster? I would appreciate it if anyone could share their experience in debugging errors like this.

code:

    ....
    sprintf(p_obsfile,"%s%d",obsfile,my_rank); //my_rank is processor ID, each PE opens a different file
    if ((fp=fopen(p_obsfile,"w"))==NULL)
        printf("PE_%d: The file %s cannot be opened\n",my_rank,p_obsfile);
    for (int id=loc*my_rank;id<loc*(my_rank+1);id++){ // loc=TotalNum/NumofPE
        //call a function to calculate U, the function will return the finishing message
        // no communication is needed among processors
        for (int j=0;j<NUM;j++)
            fprintf (fp, "%f\n",U[j]); //output updated U
    }
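
One thing worth checking in the snippet above: if fopen fails, the code only prints a message and then still calls fprintf on a NULL fp, which can kill that one process. Below is a minimal sketch of a more defensive version of the write step. The helper name write_rank_file, the 512-byte path buffer, and the use of POSIX gethostname are assumptions made for illustration and are not part of the original program. Every diagnostic includes the host name, so the output would also show which machine a given rank actually runs on.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* gethostname (POSIX, assumed available on the cluster) */

/* Hypothetical helper: writes n values to the file "<prefix><rank>",
 * one value per line. Returns 0 on success and -1 on any I/O failure.
 * Each failure is reported on stderr with the rank and the host name. */
int write_rank_file(const char *prefix, int rank, const double *u, int n)
{
    char path[512];
    char host[256] = "unknown";
    FILE *fp;
    int len, j;

    assert(prefix != NULL && u != NULL);

    /* Look up the host so diagnostics show where this rank is running. */
    if (gethostname(host, sizeof host) != 0)
        strcpy(host, "unknown");
    host[sizeof host - 1] = '\0';

    /* snprintf bounds the write, unlike sprintf. */
    len = snprintf(path, sizeof path, "%s%d", prefix, rank);
    if (len < 0 || (size_t)len >= sizeof path) {
        fprintf(stderr, "PE_%d on %s: file name too long\n", rank, host);
        return -1;
    }

    fp = fopen(path, "w");
    if (fp == NULL) {
        /* Stop here instead of writing through a NULL FILE pointer. */
        fprintf(stderr, "PE_%d on %s: cannot open %s: %s\n",
                rank, host, path, strerror(errno));
        return -1;
    }

    for (j = 0; j < n; j++) {
        if (fprintf(fp, "%f\n", u[j]) < 0) {
            fprintf(stderr, "PE_%d on %s: write to %s failed\n",
                    rank, host, path);
            fclose(fp);
            return -1;
        }
    }

    /* fclose flushes buffered data; a failure here means data was lost. */
    if (fclose(fp) != 0) {
        fprintf(stderr, "PE_%d on %s: closing %s failed: %s\n",
                rank, host, path, strerror(errno));
        return -1;
    }
    return 0;
}
```

In the real program the call would look something like write_rank_file(obsfile, my_rank, U, NUM). Since the original loop writes U once per id into the same file, the open and close would probably need to stay outside that loop; the sketch only covers a single batch. If a rank fails, the host name in the message can be compared against the node list that LAM reports, which should help settle whether PE15 and n15 refer to the same thing.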