Hola Arnau,
this can be the result of a burst of simultaneous connections
from the WNs to the CE. Check "MaxStartups" in the CE's sshd_config
(usually /etc/ssh/sshd_config) and try increasing it to 100; the
default in stock OpenSSH is 10, which can be too low in some situations.
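A minimal sketch for checking what the CE currently has configured. The
helper name get_max_startups is only illustrative, and the path
/etc/ssh/sshd_config is an assumption; adjust it to wherever your sshd
reads its config. It only handles the plain "Keyword value" form.

```shell
# Print the MaxStartups value set in an sshd_config-style file.
# Prints "unset" if the keyword is absent or only present in a comment,
# in which case sshd falls back to its built-in default.
get_max_startups() {
    # sshd keywords are case-insensitive and the first occurrence wins;
    # lines like "#MaxStartups 10" do not match because $1 starts with '#'.
    awk 'tolower($1) == "maxstartups" { print $2; found = 1; exit }
         END { if (!found) print "unset" }' "$1"
}

# Usage (hypothetical path):
#   get_max_startups /etc/ssh/sshd_config
```

After changing the value, sshd has to re-read its configuration before
the new limit takes effect.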
On Thu, 8 Nov 2007, Arnau Bria wrote:
Hi,
a couple of days ago I sent this e-mail to the torque list. I got no
reply, so I decided to post it here too; maybe someone has seen this
error before. Sorry in advance for the cross-posting.
We're getting sporadic errors when a job finishes running on a WN and
has to copy its output to the submitter host.
We've configured ssh on our submit and execution hosts so that no
password is requested, for example:
[EMAIL PROTECTED] ~]# su - ops006
[EMAIL PROTECTED] ~]$ ssh ce07 date
Scientific Linux CERN Release 3.0.8 (SL)
Tue Nov 6 12:09:05 CET 2007
[EMAIL PROTECTED] ~]$
But looking at the job's log on the WN we find:
Oct 24 02:20:48 td237 pbs_mom: req_cpyfile, Unable to copy file [EMAIL PROTECTED]:/home/ops006/.lcgjm/globus-cache-export.Q18475/globus-cache-export.Q18475.gpg to globus-cache-export.Q18475.gpg
and on the PBS server:
[EMAIL PROTECTED] root]# grep 3145425 /var/spool/pbs/server_priv/accounting/200710*
/var/spool/pbs/server_priv/accounting/20071024:10/24/2007 02:18:46;Q;3145425.pbs01.pic.es;queue=gshort
/var/spool/pbs/server_priv/accounting/20071024:10/24/2007 02:21:44;D;3145425.pbs01.pic.es;[EMAIL PROTECTED]
Finally, Maui's log:
[EMAIL PROTECTED] root]# grep 3145425 /var/log/maui.log*
/var/log/maui.log.1:10/24 02:20:44 INFO: job '3145425' loaded:
1 ops006 ops 86400 Idle 0 1193185126 [NONE] [NONE] [NONE]
= 0 >= 0 [slc4] 1193185244
/var/log/maui.log.1:10/24 02:20:44 MRMJobStart(3145425,Msg,SC)
/var/log/maui.log.1:10/24 02:20:44 MPBSJobStart(3145425,base,Msg,SC)
/var/log/maui.log.1:10/24 02:20:44
MPBSJobModify(3145425,Resource_List,Resource,td237.pic.es)
/var/log/maui.log.1:10/24 02:20:44
MPBSJobModify(3145425,Resource_List,Resource,1)
/var/log/maui.log.1:10/24 02:20:44 WARNING: cannot set job
'3145425.pbs01.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001
'Unknown Job Id')
/var/log/maui.log.1:10/24 02:20:44 INFO: job '3145425' successfully started
/var/log/maui.log.1:10/24 02:22:45 INFO: active PBS job 3145425 has been removed from the queue. assuming successful completion
As I mentioned at the beginning of the mail, the errors are sporadic,
but on certain days we find lots of them, on certain WNs. All WNs share
the same configuration, so there is no difference between them apart
from the jobs they are running.
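To see how the errors cluster by day and by node, something like the
following could tally them from the WN syslogs. This is only a sketch:
count_copy_errors is an illustrative name, the log files to pass in
(for example /var/log/messages on each WN) are an assumption, and it
relies on the standard syslog layout "Mon DD HH:MM:SS host ..." seen in
the pbs_mom line above.

```shell
# Count "Unable to copy file" errors from pbs_mom per (month, day, host).
# Arguments: one or more syslog files; output is "count Mon DD host",
# sorted with the highest counts first.
count_copy_errors() {
    # -h suppresses the "filename:" prefix so awk sees the raw syslog fields
    grep -h 'pbs_mom: req_cpyfile, Unable to copy file' "$@" |
        awk '{ n[$1 " " $2 " " $4]++ } END { for (k in n) print n[k], k }' |
        sort -rn
}

# Usage (hypothetical path):
#   count_copy_errors /var/log/messages*
```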
Versions:
in WN:
[EMAIL PROTECTED] ~]# rpm -qa|grep torque
torque-devel-2.1.8-1cri_sl4_1st.i386
torque-mom-2.1.8-1cri_sl4_1st.i386
torque-2.1.8-1cri_sl4_1st.i386
torque-client-2.1.8-1cri_sl4_1st.i386
torque-docs-2.1.8-1cri_sl4_1st.i386
in server:
[EMAIL PROTECTED] root]# rpm -qa|grep torque
torque-gui-2.1.8-1cri_sl3_1st
torque-client-2.1.8-1cri_sl3_1st
torque-server-2.1.8-1cri_sl3_1st
torque-2.1.8-1cri_sl3_1st
TIA,
Arnau
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
--
Best regards,
Valery Mitsyn