Hi
I've finally managed to fix the problems I had! Apart from the MPI problem I
reported previously, the same thing happened when submitting single (non-MPI)
jobs: only the client node executed them. This happened because the file
"/var/spool/pbs/server_priv/nodes" didn't include the server's hostname and
because pbs_mom wasn't set to run at startup. After adding the server's
hostname to the file (or adding it with `qmgr`) and starting pbs_mom,
everything seemed to work fine for regular job submission.
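For anyone repeating the nodes-file step, here is a minimal sketch of it. The `ensure_node` function is a hypothetical helper of mine, not a PBS tool; it assumes the nodes file has one host per line with the hostname as the first field. The `chkconfig`/`service` lines in the comments assume a Fedora-style init setup with an init script named pbs_mom, which may differ on other installs.

```shell
#!/bin/sh
# ensure_node FILE HOST
# Append HOST to FILE unless some line already has HOST as its first field.
# (Hypothetical helper for illustration, not part of PBS.)
ensure_node() {
    nodes_file=$1
    host=$2
    # awk exits 0 only when a line whose first field equals HOST was seen.
    if ! awk -v h="$host" '$1 == h { found = 1 } END { exit !found }' "$nodes_file"; then
        printf '%s\n' "$host" >> "$nodes_file"
    fi
}

# Example, run as root on the server (path as in my setup):
#   ensure_node /var/spool/pbs/server_priv/nodes "$(hostname)"
# Then, assuming an init script named pbs_mom exists:
#   chkconfig pbs_mom on
#   service pbs_mom start
```

Calling the function twice with the same hostname leaves the file unchanged the second time, so it should be safe to rerun.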
I hoped that fixing this would also fix the MPI problem, but it didn't. I kept
looking and got a little further. It looks like several things were missing
when I installed OSCAR. I found out that the MPI libraries ("lam-libs.x86_64")
weren't installed on the client node (is this normal? I thought the OSCAR image
had all the needed libraries) and that the LD_LIBRARY_PATH environment variable
wasn't set. I installed the MPI libraries and set the variable, and MPI worked,
BUT only on the client node!
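Setting the variable can be sketched roughly like this. `add_lib_dir` is a hypothetical helper name, and the directory in the example comment is only a guess; the real location of the LAM libraries should be checked on each host (for instance with `rpm -ql lam-libs`).

```shell
#!/bin/sh
# add_lib_dir DIR
# Prepend DIR to LD_LIBRARY_PATH unless it is already listed there.
# (Hypothetical helper for illustration.)
add_lib_dir() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*)
            ;;  # already present, nothing to do
        *)
            # Only add the ':' separator when the variable is non-empty.
            LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Example; /usr/lib64/lam is an assumed path, not a verified one:
#   add_lib_dir /usr/lib64/lam
```

Checking for the directory first keeps the variable from growing each time the snippet is sourced.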
So the problem had to be with LAM/MPI. I searched through the site and found
out that the $PBS_NODEFILE must include ALL computation nodes. I added the
server hostname to a new file (I cannot change the original one since it is
dynamically generated), but still no luck. In the end the problem was that the
LAM/MPI version that comes with Fedora doesn't support the "ssi boot tm"
option, so I just had to change "ssi boot" to "rsh". So one just has to boot
LAM/MPI with the command: "lamboot -ssi boot rsh -v node.file".
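Building the new node file could look roughly like the sketch below. `build_node_file` and the hostname `my-server` are hypothetical names used for illustration; the function copies the generated file and appends the server's hostname only if it is not already listed, leaving the original file untouched.

```shell
#!/bin/sh
# build_node_file PBS_FILE SERVER_HOST OUT_FILE
# Copy PBS_FILE to OUT_FILE and append SERVER_HOST if no line matches it
# exactly. (Hypothetical helper for illustration.)
build_node_file() {
    pbs_file=$1
    server_host=$2
    out_file=$3
    cp "$pbs_file" "$out_file"
    # -x: whole-line match, -F: fixed string, -q: quiet
    if ! grep -qxF "$server_host" "$out_file"; then
        printf '%s\n' "$server_host" >> "$out_file"
    fi
}

# Inside a PBS script this might be used as follows
# ("my-server" stands in for the real server hostname):
#   build_node_file "$PBS_NODEFILE" my-server node.file
#   lamboot -ssi boot rsh -v node.file
```

The server name is passed explicitly because the script runs on an execution host, so `hostname` there would not necessarily return the server.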
In summary:
/var/spool/pbs/server_priv/nodes - must include all execution nodes
pbs_mom - run at startup on every execution host (server included)
lam-libs.x86_64 - install on every host
$LD_LIBRARY_PATH - set to include the MPI libraries
$PBS_NODEFILE file - must include the hostnames of all execution hosts
PBS script - use "lamboot -ssi boot rsh -v node.file" instead of the command
shown in the samples
I hope someone fixes these issues in the next OSCAR release.
FG
PS - some of these fixes may be inaccurate, but this is how I managed to get
the OSCAR cluster working. I would appreciate it if someone from the OSCAR
development team could check them.
_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users