On 22.11.2012, at 14:42, jan roels wrote:

> I work on an NFS share that is also available on the node. I'm currently
> testing with only one node, so it's unique...

Just be aware that in this case the job script will be sent by SGE to the
execd on the node, which in turn stores it on the NFS server (which might be
the same machine as the master). I'm not sure about the error message: is the
share mounted with "noexec" and/or "all_squash"/"root_squash"? But the error
should be "permission denied" in these cases.

-- Reuti

> 2012/11/22 Reuti <[email protected]>
>
> On 22.11.2012, at 14:29, jan roels wrote:
>
> > I tried it with the root account and with another account... both give
> > the same error.
>
> Is the directory local on "camilla", and is the node name unique?
>
> -- Reuti
>
> > 2012/11/22 Reuti <[email protected]>
> >
> > On 22.11.2012, at 12:30, jan roels wrote:
> >
> > > Hi,
> > >
> > > qstat -j <jobid> didn't show the full error message; this is the full
> > > error message:
> > >
> > > 11/22/2012 12:26:11| main|camilla|E|shepherd of job 76.226 exited with
> > > exit status = 27
> > > 11/22/2012 12:26:11| main|camilla|E|can't open usage file
> > > "active_jobs/76.226/usage" for job 76.226: No such file or directory
> > > 11/22/2012 12:26:11| main|camilla|E|11/22/2012 12:26:10 [0:11412]:
> > > execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76,
> > > "/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such
> > > file or directory
> >
> > Could be a permission problem. Everyone needs read access to this
> > directory, as the job script is executed from there.
> >
> > -- Reuti
> >
> > > 2012/11/22 jan roels <[email protected]>
> > >
> > > Hi,
> > >
> > > Do you guys know what this error could be:
> > >
> > > error reason 2: 11/22/2012 11:12:25 [0:31220]:
> > > execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
> > > error reason 3: 11/22/2012 11:12:25 [0:31221]:
> > > execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
> > >
> > > This goes on as long as it's running...
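The mount options Reuti asks about can be checked with a small helper along
these lines. This is only a sketch: the `check_mount_opts` function and the
sample options string are illustrative, not from the thread. Note that
`noexec` shows up in the client's /proc/mounts, while `root_squash` and
`all_squash` are export options on the NFS server (/etc/exports), so the
function just scans whatever comma-separated options string you hand it:

```shell
#!/bin/sh
# Hypothetical helper: flag mount/export options that can break job
# execution from an NFS-backed spool directory.
check_mount_opts() {
    opts="$1"
    for bad in noexec root_squash all_squash; do
        # Wrap in commas so each option matches only as a whole word.
        case ",$opts," in
            *",$bad,"*) echo "problem: $bad" ;;
        esac
    done
}

# Sample data; on a live client you would feed in the real options, e.g.:
#   check_mount_opts "$(awk '$2 == "/var/spool/gridengine" {print $4}' /proc/mounts)"
check_mount_opts "rw,noexec,root_squash,addr=10.0.0.1"
```

Run against the sample string this reports `noexec` and `root_squash`; an
options string without any of the three flags produces no output.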
> > > and my state went to:
> > >
> > >      69 0.50000 SA         root         Eqw   11/22/2012 09:12:05     1 1-500:1
> > >      69 0.00000 SA         root         qw    11/22/2012 09:12:05     1 501-4200:1
> > >
> > > This is the script I was running:
> > >
> > > #!/bin/bash
> > > #$ -cwd
> > > #$ -N SA
> > > #$ -t 1-4200:1
> > >
> > > /var/software/packages/Mathematica/7.0/Executables/math -run
> > > "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
> > >
> > > Hope somebody can fix the problem.
> > >
> > > Kind regards
> > >
> > > 2012/11/14 Reuti <[email protected]>
> > >
> > > On 14.11.2012, at 10:08, jan roels wrote:
> > >
> > > > I got it working again; there was already a process of execd running
> > > > that needed to be killed, and then I restarted the services.
> > > >
> > > > I'm trying to run a script now:
> > > >
> > > > #!/bin/bash
> > > > #$ -cwd
> > > > #$ -N SA
> > > > #$ -S /bin/sh
> > > > #$ -t 1-4200:
> > >
> > > Don't run scripts as root. If something goes wrong it might trash your
> > > machine(s).
> > >
> > > > /var/software/packages/Mathematica/7.0/Executables/math -run
> > > > "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
> > > >
> > > > but it gives the following output:
> > > >
> > > > stdin: is not a tty
> > >
> > > It's just a warning - unless someone complains I would suggest ignoring it.
> > >
> > > > and this is the output of my qstat -f:
> > > >
> > > > queuename                  qtype resv/used/tot. load_avg arch       states
> > > > ---------------------------------------------------------------------------------
> > > > [email protected]       BIP   0/1/1          0.70     lx26-amd64
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 1
> > > > ---------------------------------------------------------------------------------
> > > > main.q@node0               BIP   0/24/24        27.71    lx26-amd64
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 2
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 3
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 4
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 5
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 6
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 7
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 8
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 9
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 10
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 11
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 12
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 13
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 14
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 15
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 16
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 17
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 18
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 19
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 20
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 21
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 22
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 23
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 24
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 25
> > > >
> > > > ############################################################################
> > > >  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> > > > ############################################################################
> > > >      35 0.50000 SA         root         qw    11/14/2012 09:57:38     1 26-4200:1
> > > >
> > > > root@camilla:/nfs/share/sge# qstat -explain c -j 35
> > > > ==============================================================
> > > > job_number:                 35
> > > > exec_file:                  job_scripts/35
> > > > submission_time:            Wed Nov 14 09:57:38 2012
> > > > owner:                      root
> > > > uid:                        0
> > > > group:                      root
> > > > gid:                        0
> > > > sge_o_home:                 /root
> > > > sge_o_log_name:             root
> > > > sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > > > sge_o_shell:                /bin/bash
> > > > sge_o_workdir:              /nfs/share/sge
> > > > sge_o_host:                 camilla
> > > > account:                    sge
> > > > cwd:                        /nfs/share/sge
> > > > mail_list:                  root@camilla
> > > > notify:                     FALSE
> > > > job_name:                   SA
> > > > jobshare:                   0
> > > > shell_list:                 NONE:/bin/sh
> > > > env_list:
> > > > script_file:                HistDisCaCO31.sh
> > > > job-array tasks:            1-4200:1
> > > > usage    1:                 cpu=00:05:20, mem=105.16135 GBs, io=0.01537, vmem=1.110G, maxvmem=1.110G
> > > > usage    2:                 cpu=00:04:17, mem=179.44371 GBs, io=0.01395, vmem=3.643G, maxvmem=3.643G
> > > > usage    3:                 cpu=00:04:37, mem=191.69532 GBs, io=0.01394, vmem=3.657G, maxvmem=3.657G
> > > > usage    4:                 cpu=00:04:34, mem=188.12645 GBs, io=0.01394, vmem=3.655G, maxvmem=3.655G
> > > > usage    5:                 cpu=00:04:16, mem=180.18292 GBs, io=0.01394, vmem=3.636G, maxvmem=3.636G
> > > > usage    6:                 cpu=00:04:22, mem=183.47616 GBs, io=0.01394, vmem=3.644G, maxvmem=3.644G
> > > > usage    7:                 cpu=00:04:15, mem=179.89624 GBs, io=0.01400, vmem=3.640G, maxvmem=3.640G
> > > > usage    8:                 cpu=00:04:55, mem=207.28643 GBs, io=0.01394, vmem=3.669G, maxvmem=3.669G
> > > > usage    9:                 cpu=00:04:27, mem=184.86707 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   10:                 cpu=00:04:14, mem=179.09446 GBs, io=0.01394, vmem=3.635G, maxvmem=3.635G
> > > > usage   11:                 cpu=00:04:47, mem=195.80372 GBs, io=0.01400, vmem=3.668G, maxvmem=3.668G
> > > > usage   12:                 cpu=00:04:49, mem=203.43895 GBs, io=0.01394, vmem=3.665G, maxvmem=3.665G
> > > > usage   13:                 cpu=00:04:45, mem=196.67175 GBs, io=0.01394, vmem=3.663G, maxvmem=3.663G
> > > > usage   14:                 cpu=00:04:24, mem=185.68047 GBs, io=0.01400, vmem=3.648G, maxvmem=3.648G
> > > > usage   15:                 cpu=00:04:40, mem=195.96253 GBs, io=0.01394, vmem=3.656G, maxvmem=3.656G
> > > > usage   16:                 cpu=00:04:11, mem=179.84016 GBs, io=0.01394, vmem=3.633G, maxvmem=3.633G
> > > > usage   17:                 cpu=00:04:43, mem=196.21689 GBs, io=0.01394, vmem=3.662G, maxvmem=3.662G
> > > > usage   18:                 cpu=00:04:37, mem=197.39875 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   19:                 cpu=00:04:35, mem=191.55982 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   20:                 cpu=00:04:26, mem=191.62928 GBs, io=0.01394, vmem=3.643G, maxvmem=3.643G
> > > > usage   21:                 cpu=00:04:42, mem=197.87398 GBs, io=0.01394, vmem=3.660G, maxvmem=3.660G
> > > > usage   22:                 cpu=00:04:36, mem=193.43107 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
> > > > usage   23:                 cpu=00:04:32, mem=193.12103 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
> > > > usage   24:                 cpu=00:04:25, mem=186.56485 GBs, io=0.01400, vmem=3.644G, maxvmem=3.644G
> > > > usage   25:                 cpu=00:04:51, mem=201.81706 GBs, io=0.01400, vmem=3.669G, maxvmem=3.669G
> > > > scheduling info:            queue instance "main.q@camilla" dropped because it is full
> > > >                             queue instance "main.q@node0" dropped because it is full
> > > >                             All queues dropped because of overload or full
> > > >                             not all array tasks may be started due to 'max_aj_instances'
> > >
> > > The machine is just full.
> > >
> > > -- Reuti
> > >
> > > > You guys know how this can be solved?
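Since every slot on both queue instances is taken and each task peaks near
3.6G vmem, one common approach is to cap how many array tasks run at once
instead of letting the scheduler fill every slot. A hedged sketch of the
submit script from the thread with such a cap added; the `-tc` option is
available in newer Grid Engine versions (check `man qsub` on your
installation first), and the limit of 20 is an arbitrary illustrative value:

```shell
#!/bin/bash
#$ -cwd
#$ -N SA
#$ -t 1-4200:1
#$ -tc 20    # assumption: cap concurrent array tasks below the 25 slots

/var/software/packages/Mathematica/7.0/Executables/math -run \
    "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
```

With the cap in place the remaining tasks simply wait in qw state instead of
driving the nodes into overload.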
> > > >
> > > > 2012/11/13 Reuti <[email protected]>
> > > >
> > > > On 13.11.2012, at 13:42, jan roels wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I followed this tutorial on how to install SGE:
> > > > >
> > > > > http://verahill.blogspot.be/2012/06/setting-up-sun-grid-engine-with-three.html
> > > > >
> > > > > It all went fine on my master node, but on my exec node I have some
> > > > > trouble.
> > > > >
> > > > > First it gave the following error:
> > > > >
> > > > > 11/13/2012 13:44:43| main|node0|E|communication error for
> > > > > "node0/execd/1" running on port 6445: "can't bind socket"
> > > >
> > > > Is there already something running on this port - any older version
> > > > of the execd?
> > > >
> > > > > 11/13/2012 13:44:44| main|node0|E|commlib error: can't bind socket
> > > > > (no additional information available)
> > > > > 11/13/2012 13:45:12| main|node0|C|abort qmaster registration due to
> > > > > communication errors
> > > > > 11/13/2012 13:45:14| main|node0|W|daemonize error: child exited
> > > > > before sending daemonize state
> > > > >
> > > > > but then I killed the process and restarted gridengine-execd, and
> > > > > now I get the following:
> > > > >
> > > > > /etc/init.d/gridengine-exec restart
> > > > >  * Restarting Sun Grid Engine Execution Daemon sge_execd
> > > > > error: can't resolve host name
> > > > > error: can't get configuration from qmaster -- backgrounding
> > > > >
> > > > > What can I do to fix this?
> > > >
> > > > Any firewall on the machines? Ports 6444 and 6445 need to be excluded.
> > > >
> > > > -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
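Reuti's "is there already something running on this port" question can be
answered by grepping a socket listing for the two SGE ports. A minimal
sketch: the listing below is fabricated sample data, and on the node itself
you would pipe in the live output of `netstat -lntp` (or `ss -lntp`) instead:

```shell
#!/bin/sh
# Sketch: spot a leftover daemon on the SGE ports (6444 qmaster, 6445 execd).
# sample_listing is made-up data standing in for `netstat -lntp` output.
sample_listing='tcp 0 0 0.0.0.0:6445 0.0.0.0:* LISTEN 1234/sge_execd
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 900/sshd'

# Print any listener on an SGE port; otherwise report the ports as free.
echo "$sample_listing" | grep -E ':(6444|6445) ' || echo "SGE ports are free"
```

Against the sample data this prints the stale sge_execd line, which is
exactly the "older version of the execd" case: kill that process (and check
firewall rules for 6444/6445) before restarting the daemon.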
