Hi,

A couple of days ago I sent this e-mail to the torque list. I got no
reply, so I decided to post here too; maybe someone has seen this error
before.

Sorry in advance for the cross-posting.


We're getting sporadic errors when a job finishes running on a WN and
has to copy its output to the submitter host.

We've configured ssh on our submitter/executor hosts so that no
password is requested, for example:

[EMAIL PROTECTED] ~]# su - ops006
[EMAIL PROTECTED] ~]$ ssh ce07 date
Scientific Linux CERN Release 3.0.8 (SL)
Tue Nov  6 12:09:05 CET 2007
[EMAIL PROTECTED] ~]$

But looking at the job's log on the WN we find:
Oct 24 02:20:48 td237 pbs_mom: req_cpyfile, Unable to copy file
[EMAIL PROTECTED]:/home/ops006/.lcgjm/globus-cache-export.Q18475/globus-cache-export.Q18475.gpg
to globus-cache-export.Q18475.gpg

and on the pbs server:
[EMAIL PROTECTED] root]# grep 3145425 /var/spool/pbs/server_priv/accounting/200710*
/var/spool/pbs/server_priv/accounting/20071024:10/24/2007 02:18:46;Q;3145425.pbs01.pic.es;queue=gshort
/var/spool/pbs/server_priv/accounting/20071024:10/24/2007 02:21:44;D;3145425.pbs01.pic.es;[EMAIL PROTECTED]


finally, maui's log:

[EMAIL PROTECTED] root]# grep 3145425 /var/log/maui.log*
/var/log/maui.log.1:10/24 02:20:44 INFO:     job '3145425' loaded:
1   ops006 ops  86400       Idle   0 1193185126   [NONE] [NONE] [NONE]
>=      0 >= 0 [slc4] 1193185244
/var/log/maui.log.1:10/24 02:20:44 MRMJobStart(3145425,Msg,SC)
/var/log/maui.log.1:10/24 02:20:44 MPBSJobStart(3145425,base,Msg,SC)
/var/log/maui.log.1:10/24 02:20:44
MPBSJobModify(3145425,Resource_List,Resource,td237.pic.es)
/var/log/maui.log.1:10/24 02:20:44
MPBSJobModify(3145425,Resource_List,Resource,1)
/var/log/maui.log.1:10/24 02:20:44 WARNING:  cannot set job
'3145425.pbs01.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001
'Unknown Job Id')
/var/log/maui.log.1:10/24 02:20:44 INFO:     job '3145425' successfully
started
/var/log/maui.log.1:10/24 02:22:45 INFO:     active PBS job
3145425 has been removed from the queue.  assuming successful completion


As I mentioned at the beginning of the mail, the errors are sporadic,
but we find lots of them on certain days, on certain WNs. All WNs share
the same configuration, so there is no difference between them apart
from the jobs that are running.
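
To make the "certain days, certain WNs" pattern easier to see, here is a
minimal sketch of how the failures could be tallied per day and per WN.
It assumes the pbs_mom messages are in syslog format like the line
quoted above; the function name count_copy_errors and the log path in
the usage comment are only illustrative, not something from our setup.

```shell
# Tally "Unable to copy file" failures per day and worker node.
# Assumes syslog-style input on stdin:
#   "Mon DD HH:MM:SS host pbs_mom: req_cpyfile, Unable to copy file ..."
# count_copy_errors is a hypothetical helper name.
count_copy_errors() {
    awk '/req_cpyfile, Unable to copy file/ {
             # key: month, day, host (fields 1, 2 and 4 of a syslog line)
             count[$1 " " $2 " " $4]++
         }
         END {
             for (k in count) print count[k], k
         }' | sort -k1,1nr -k2
}

# Usage (the path is an assumption; use wherever syslog writes on the WN):
#   count_copy_errors < /var/log/messages
```

Each output line is a count followed by month, day and host, with the
highest counts first.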

Versions:
on the WN:
[EMAIL PROTECTED] ~]# rpm -qa|grep torque
torque-devel-2.1.8-1cri_sl4_1st.i386
torque-mom-2.1.8-1cri_sl4_1st.i386
torque-2.1.8-1cri_sl4_1st.i386
torque-client-2.1.8-1cri_sl4_1st.i386
torque-docs-2.1.8-1cri_sl4_1st.i386

on the server:
[EMAIL PROTECTED] root]# rpm -qa|grep torque
torque-gui-2.1.8-1cri_sl3_1st
torque-client-2.1.8-1cri_sl3_1st
torque-server-2.1.8-1cri_sl3_1st
torque-2.1.8-1cri_sl3_1st

TIA,
Arnau
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
