A co-worker here was seeing the following MPI error from his job:

[1] Abort: [ldev2:1] Got completion with error, code=1
 at line 2148 in file viacheck.c

After some tracking down, he found that if the program calls system()
[int system(const char *string)], the next MPI call fails.

I have been able to reproduce this with the attached simple "hello" program.

Has anyone seen this type of error?  Here is the output from two runs, the
first with an argument ("x") and the second without:

[EMAIL PROTECTED]:~/ior-test
17:04:04 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello x
ldev1
[0] Abort: [ldev1:0] Got completion with error, code=1
 at line 2148 in file viacheck.c
ldev2
mpirun_rsh: Abort signaled from [0]
done.
[EMAIL PROTECTED]:~/ior-test
17:05:23 > mpirun_rsh -rsh -hostfile hostfile -np 2 ./hello
now = 0.000000
now = 0.000052
now = 0.000094
now = 0.000121
now = 0.000151
now = 0.001072
now = 0.001102
now = 0.001118
now = 0.001141
now = 0.001160
done.

We are running MVAPICH 0.9.7 and the OpenIB trunk, rev 6829.

Thanks,
Ira

Attachment: hello.c
