Hi Valery,

Thanks for your quick reply.

We have 3 CEs, and on two of them we have already set MaxStartups to
100 (and we get errors there too).

Do you think we still need to increase this value?
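
For reference, a minimal sketch of how the configured value could be read
back on each CE. The function name max_startups is made up for this
example, and the config path is passed in as an argument because it may be
/etc/sshd_config (as in Valery's mail) or /etc/ssh/sshd_config depending
on the installation. It assumes the "MaxStartups value" form separated by
whitespace, and takes the first uncommented match, since sshd uses the
first value it finds for a keyword.

```shell
# Print the MaxStartups value set in an sshd_config-style file.
# Prints nothing when the directive is absent or commented out.
# $1: path to the config file (no default is assumed here)
max_startups() {
    awk 'tolower($1) == "maxstartups" { print $2; exit }' "$1"
}
```

It would be called as "max_startups /etc/sshd_config" on each CE. An empty
result means the directive is not set and sshd falls back to its built-in
default. A changed value only applies once sshd has reloaded its
configuration.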

Cheers,
Arnau

On Thu, 8 Nov 2007 19:07:54 +0300 (MSK)
Valery Mitsyn wrote:

> Hola Arnau,
> 
> this can be the result of a bunch of simultaneous connections
> from WNs to the CE. Check "MaxStartups" in /etc/sshd_config on the CE
> and try to increase it to 100; the default is 50, which can be
> too low in some situations.
> 
> On Thu, 8 Nov 2007, Arnau Bria wrote:
> 
> > Hi,
> >
> >
> > A couple of days ago I sent this e-mail to the torque list. I got no reply,
> > so I decided to post here too; maybe someone has seen this error
> > before.
> >
> > Sorry in advance for the cross-posting.
> >
> >
> > We're getting sporadic errors when a job finishes running on a WN and
> > has to copy its output to the submitter host.
> >
> > We've configured ssh on our submitter/executor hosts so that no
> > password is requested, so for example:
> >
> > [EMAIL PROTECTED] ~]# su - ops006
> > [EMAIL PROTECTED] ~]$ ssh ce07 date
> > Scientific Linux CERN Release 3.0.8 (SL)
> > Tue Nov  6 12:09:05 CET 2007
> > [EMAIL PROTECTED] ~]$
> >
> > But looking at the job's log on the WN we find:
> > Oct 24 02:20:48 td237 pbs_mom: req_cpyfile, Unable to copy file [EMAIL PROTECTED]:/home/ops006/.lcgjm/globus-cache-export.Q18475/globus-cache-export.Q18475.gpg to globus-cache-export.Q18475.gpg
> >
> > and on the pbs server:
> > [EMAIL PROTECTED] root]# grep 3145425 /var/spool/pbs/server_priv/accounting/200710*
> > /var/spool/pbs/server_priv/accounting/20071024:10/24/2007 02:18:46;Q;3145425.pbs01.pic.es;queue=gshort
> > /var/spool/pbs/server_priv/accounting/20071024:10/24/2007 02:21:44;D;3145425.pbs01.pic.es;[EMAIL PROTECTED]
> >
> >
> > finally, maui's log:
> >
> > [EMAIL PROTECTED] root]# grep 3145425 /var/log/maui.log*
> > /var/log/maui.log.1:10/24 02:20:44 INFO:     job '3145425' loaded: 1   ops006 ops  86400       Idle   0 1193185126   [NONE] [NONE] [NONE] >=      0 >= 0 [slc4] 1193185244
> > /var/log/maui.log.1:10/24 02:20:44 MRMJobStart(3145425,Msg,SC)
> > /var/log/maui.log.1:10/24 02:20:44 MPBSJobStart(3145425,base,Msg,SC)
> > /var/log/maui.log.1:10/24 02:20:44 MPBSJobModify(3145425,Resource_List,Resource,td237.pic.es)
> > /var/log/maui.log.1:10/24 02:20:44 MPBSJobModify(3145425,Resource_List,Resource,1)
> > /var/log/maui.log.1:10/24 02:20:44 WARNING:  cannot set job '3145425.pbs01.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001 'Unknown Job Id')
> > /var/log/maui.log.1:10/24 02:20:44 INFO:     job '3145425' successfully started
> > /var/log/maui.log.1:10/24 02:22:45 INFO:     active PBS job 3145425 has been removed from the queue.  assuming successful completion
> >
> >
> > As I mentioned at the beginning of the mail, the errors are sporadic,
> > but we find lots of them on certain days, on certain WNs. All WNs share
> > the same configuration, so there is no difference between them apart
> > from the jobs they are running.
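
To see how the failures cluster by day and by WN, a tally along these
lines could be run over the collected syslog output. This is only a
sketch: the function name count_copy_failures is invented for the example,
and it assumes the standard syslog layout seen in the pbs_mom line quoted
above (month, day, time, host, then the message).

```shell
# Count pbs_mom copy failures per day and per worker node.
# Reads syslog-style lines on stdin and prints "count month day host",
# ordered by month name, then day, then host.
count_copy_failures() {
    awk '/pbs_mom: req_cpyfile, Unable to copy file/ { n[$1 " " $2 " " $4]++ }
         END { for (k in n) print n[k], k }' |
        sort -k2,2 -k3,3n -k4,4
}
```

For example, "cat logs-from-all-WNs | count_copy_failures", where the input
is whatever file or concatenation holds the WN syslog lines (the file name
here is a placeholder). Days with a burst of failures could then be
compared against the number of jobs finishing at the same time.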
> >
> > Versions:
> > in WN:
> > [EMAIL PROTECTED] ~]# rpm -qa|grep torque
> > torque-devel-2.1.8-1cri_sl4_1st.i386
> > torque-mom-2.1.8-1cri_sl4_1st.i386
> > torque-2.1.8-1cri_sl4_1st.i386
> > torque-client-2.1.8-1cri_sl4_1st.i386
> > torque-docs-2.1.8-1cri_sl4_1st.i386
> >
> > in server:
> > [EMAIL PROTECTED] root]# rpm -qa|grep torque
> > torque-gui-2.1.8-1cri_sl3_1st
> > torque-client-2.1.8-1cri_sl3_1st
> > torque-server-2.1.8-1cri_sl3_1st
> > torque-2.1.8-1cri_sl3_1st
> >
> > TIA,
> > Arnau
> > _______________________________________________
> > mauiusers mailing list
> > mauiusers@supercluster.org
> > http://www.supercluster.org/mailman/listinfo/mauiusers
> >
> 