Hopefully the problem is caused by the xauth error output, because that's easy to solve. That warning has always been a thorn in my side. I get it when connecting from Windows ssh clients, which don't seem to pass along X11 forwarding information correctly. The fix is to disable X11 forwarding, which can be done either system-wide or per user. For system-wide, edit the /etc/ssh/ssh_config file and find the line that says:
ForwardX11 yes


Change it to no, or comment it out, as I think the default is no. To do it at the user level, create a ~/.ssh/config file with that same line in it, which will override the global configuration parameter, e.g.

ForwardX11 no
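
If you'd rather script the per-user change, something like the following should work. It is only a rough sketch: the function name disable_x11_forwarding is made up for illustration, and it assumes a POSIX shell and that you want the setting applied to every host (Host *). It refuses to touch a file that already has a ForwardX11 line, since ssh uses the first value it finds and you'd want to edit that by hand.

```shell
#!/bin/sh
# Sketch only: add "ForwardX11 no" to a per-user ssh client config.
# disable_x11_forwarding is a hypothetical helper, not part of OpenSSH.
# Usage: disable_x11_forwarding "$HOME/.ssh/config"
disable_x11_forwarding() {
    cfg="$1"
    mkdir -p "$(dirname "$cfg")"
    touch "$cfg"
    chmod 600 "$cfg"    # keep the config private to the user
    # ssh keywords are case-insensitive and may use "=" as a separator.
    if grep -qi '^[[:space:]]*ForwardX11[[:space:]=]' "$cfg"; then
        echo "ForwardX11 is already set in $cfg; edit it by hand" >&2
        return 1
    fi
    # Assumption: the setting should apply to all hosts.
    printf 'Host *\n    ForwardX11 no\n' >> "$cfg"
}
```

For a one-off test you can skip the config file entirely and run ssh with the -x option, which disables X11 forwarding for that connection only.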

I'm cc'ing the OSCAR users list to keep them in the loop as well.

Jeremy

At 03:29 PM 5/8/2004, Alexander V Shirokov wrote:
Dear Jeremy,

I attended the Beowulf cluster workshop that you presented at
MIT two years ago. Since then I have been using
Beowulf clusters all the time. I have been trying to solve a
problem (a bug) for two weeks already. I am supposed to defend
my PhD in August, and time is short. I would really appreciate your help,
since it would get things moving again. Please help me solve
this problem, if possible.


When I run the code directly, without submitting it to PBSPro via qsub, the program crashes after about 40 timesteps.

When I submit the code to PBSPro with qsub, I get the following
error diagnostics after about 10 timesteps, and the run dies:

1) Standard error PBSPro file int2.pbs.e919:

Warning: No xauth data; using fake authentication data for X11 forwarding.
=>> PBS: job killed: node 17 (node18) requested job die, code 15009

2) File /var/spool/PBS/mom_logs/20040508 on node18:

13:31:17;0008;pbs_mom;Job;919.antares.mit.edu;JOIN JOB as node 17
15:04:46;0004;pbs_mom;Job;919.antares.mit.edu;polling stopped
15:04:46;0008;pbs_mom;Job;919.antares.mit.edu;kill_job

3) File /var/spool/PBS/mom_logs/20040508 on node1:

11:43:54;0008;pbs_mom;Job;790.antares.mit.edu;Started, pid = 13919
13:18:12;0008;pbs_mom;Job;844.antares.mit.edu;Started, pid = 12919
13:31:17;0008;pbs_mom;Job;919.antares.mit.edu;Started, pid = 14043
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;send POLL failed
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;node 17 (node18) requested job die, code 15009
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;kill_job
15:06:48;0080;pbs_mom;Job;919.antares.mit.edu;task 1 terminated
15:06:48;0008;pbs_mom;Job;919.antares.mit.edu;Terminated
15:06:58;0008;pbs_mom;Job;919.antares.mit.edu;kill_job
15:06:58;0100;pbs_mom;Job;919.antares.mit.edu;Obit sent


4) The error messages in the standard output files on these nodes look the same:

p67_5862: p4_error: net_recv read: probable EOF on socket: 1

However on node16, it is
 p64_6016: (5813.998720) net_recv failed for fd = 3
 p64_6016:  p4_error: net_recv read, errno = : 104
on node4 it is
 p16_6446: (5832.189857) net_recv failed for fd = 3
 p16_6446:  p4_error: net_recv read, errno = : 104


Thank you, and I would really appreciate your help.


Regards,
Alex


