Hi, qstat -j <jobid> didn't show the full error message; this is the full message:
11/22/2012 12:26:11|  main|camilla|E|shepherd of job 76.226 exited with exit status = 27
11/22/2012 12:26:11|  main|camilla|E|can't open usage file "active_jobs/76.226/usage" for job 76.226: No such file or directory
11/22/2012 12:26:11|  main|camilla|E|11/22/2012 12:26:10 [0:11412]: execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76, "/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such file or directory

2012/11/22 jan roels <[email protected]>
> Hi,
>
> Do you guys know what this error could be:
>
> error reason 2: 11/22/2012 11:12:25 [0:31220]:
> execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
> error reason 3: 11/22/2012 11:12:25 [0:31221]:
> execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
>
> This goes on for as long as it's running... and my state went to:
>
> 69 0.50000 SA   root   Eqw   11/22/2012 09:12:05   1 1-500:1
> 69 0.00000 SA   root   qw    11/22/2012 09:12:05   1 501-4200:1
>
> This is the script I was running:
>
> #!/bin/bash
> #$-cwd
> #$-N SA
> #$-t 1-4200:1
>
> /var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
>
> I hope somebody can fix the problem.
>
> Kind regards
>
> 2012/11/14 Reuti <[email protected]>
>> On 14.11.2012 at 10:08, jan roels wrote:
>>
>> > I got it working again; there was already an execd process running that needed to be killed, and then I restarted the services.
>> >
>> > I'm trying to run a script now:
>> >
>> > #!/bin/bash
>> > #$-cwd
>> > #$-N SA
>> > #$-S /bin/sh
>> > #$-t 1-4200:
>>
>> Don't run scripts as root. If something goes wrong it might trash your machine(s).
>>
>> > /var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
>> >
>> > but it gives the following output:
>> >
>> > stdin: is not a tty
>>
>> It's just a warning - unless someone complains I would suggest to ignore it.
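The execvlp() failure above means the execd could not find the spooled copy of the job script on disk at launch time. A minimal check to run on the exec host - a sketch only; the spool path and host name are taken from the error message in this thread, and whether the spool is the actual culprit here is an assumption:

```shell
# Sketch: verify that the execd spool actually contains the job_scripts
# directory the execvlp() call is looking for. The path below comes from
# the log message in this thread; adjust it to your own spool_dir setting.
SPOOL=/var/spool/gridengine/execd/camilla

if [ -d "$SPOOL/job_scripts" ]; then
  RESULT=present
else
  RESULT=missing
fi
echo "job_scripts: $RESULT"
```

If the directory is missing while jobs are being dispatched there, the execd spool directory may not exist, be writable, or be local on that host; checking ownership and permissions on the spool path would be a reasonable next step.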
>> > and this is the output of my qstat -f:
>> >
>> > queuename                      qtype resv/used/tot. load_avg arch       states
>> > ---------------------------------------------------------------------------------
>> > [email protected]              BIP   0/1/1          0.70     lx26-amd64
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 1
>> > ---------------------------------------------------------------------------------
>> > main.q@node0                   BIP   0/24/24        27.71    lx26-amd64
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 2
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 3
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 4
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 5
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 6
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 7
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 8
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 9
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 10
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 11
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 12
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 13
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 14
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 15
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 16
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 17
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 18
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 19
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 20
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 21
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 22
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 23
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 24
>> >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 25
>> >
>> > ############################################################################
>> >  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> > ############################################################################
>> >      35 0.50000 SA         root         qw    11/14/2012 09:57:38     1 26-4200:1
>> >
>> > root@camilla:/nfs/share/sge# qstat -explain c -j 35
>> > ==============================================================
>> > job_number:                 35
>> > exec_file:                  job_scripts/35
>> > submission_time:            Wed Nov 14 09:57:38 2012
>> > owner:                      root
>> > uid:                        0
>> > group:                      root
>> > gid:                        0
>> > sge_o_home:                 /root
>> > sge_o_log_name:             root
>> > sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>> > sge_o_shell:                /bin/bash
>> > sge_o_workdir:              /nfs/share/sge
>> > sge_o_host:                 camilla
>> > account:                    sge
>> > cwd:                        /nfs/share/sge
>> > mail_list:                  root@camilla
>> > notify:                     FALSE
>> > job_name:                   SA
>> > jobshare:                   0
>> > shell_list:                 NONE:/bin/sh
>> > env_list:
>> > script_file:                HistDisCaCO31.sh
>> > job-array tasks:            1-4200:1
>> > usage    1:                 cpu=00:05:20, mem=105.16135 GBs, io=0.01537, vmem=1.110G, maxvmem=1.110G
>> > usage    2:                 cpu=00:04:17, mem=179.44371 GBs, io=0.01395, vmem=3.643G, maxvmem=3.643G
>> > usage    3:                 cpu=00:04:37, mem=191.69532 GBs, io=0.01394, vmem=3.657G, maxvmem=3.657G
>> > usage    4:                 cpu=00:04:34, mem=188.12645 GBs, io=0.01394, vmem=3.655G, maxvmem=3.655G
>> > usage    5:                 cpu=00:04:16, mem=180.18292 GBs, io=0.01394, vmem=3.636G, maxvmem=3.636G
>> > usage    6:                 cpu=00:04:22, mem=183.47616 GBs, io=0.01394, vmem=3.644G, maxvmem=3.644G
>> > usage    7:                 cpu=00:04:15, mem=179.89624 GBs, io=0.01400, vmem=3.640G, maxvmem=3.640G
>> > usage    8:                 cpu=00:04:55, mem=207.28643 GBs, io=0.01394, vmem=3.669G, maxvmem=3.669G
>> > usage    9:                 cpu=00:04:27, mem=184.86707 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
>> > usage   10:                 cpu=00:04:14, mem=179.09446 GBs, io=0.01394, vmem=3.635G, maxvmem=3.635G
>> > usage   11:                 cpu=00:04:47, mem=195.80372 GBs, io=0.01400, vmem=3.668G, maxvmem=3.668G
>> > usage   12:                 cpu=00:04:49, mem=203.43895 GBs, io=0.01394, vmem=3.665G, maxvmem=3.665G
>> > usage   13:                 cpu=00:04:45, mem=196.67175 GBs, io=0.01394, vmem=3.663G, maxvmem=3.663G
>> > usage   14:                 cpu=00:04:24, mem=185.68047 GBs, io=0.01400, vmem=3.648G, maxvmem=3.648G
>> > usage   15:                 cpu=00:04:40, mem=195.96253 GBs, io=0.01394, vmem=3.656G, maxvmem=3.656G
>> > usage   16:                 cpu=00:04:11, mem=179.84016 GBs, io=0.01394, vmem=3.633G, maxvmem=3.633G
>> > usage   17:                 cpu=00:04:43, mem=196.21689 GBs, io=0.01394, vmem=3.662G, maxvmem=3.662G
>> > usage   18:                 cpu=00:04:37, mem=197.39875 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
>> > usage   19:                 cpu=00:04:35, mem=191.55982 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
>> > usage   20:                 cpu=00:04:26, mem=191.62928 GBs, io=0.01394, vmem=3.643G, maxvmem=3.643G
>> > usage   21:                 cpu=00:04:42, mem=197.87398 GBs, io=0.01394, vmem=3.660G, maxvmem=3.660G
>> > usage   22:                 cpu=00:04:36, mem=193.43107 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
>> > usage   23:                 cpu=00:04:32, mem=193.12103 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
>> > usage   24:                 cpu=00:04:25, mem=186.56485 GBs, io=0.01400, vmem=3.644G, maxvmem=3.644G
>> > usage   25:                 cpu=00:04:51, mem=201.81706 GBs, io=0.01400, vmem=3.669G, maxvmem=3.669G
>> > scheduling info:            queue instance "main.q@camilla" dropped because it is full
>> >                             queue instance "main.q@node0" dropped because it is full
>> >                             All queues dropped because of overload or full
>> >                             not all array task may be started due to 'max_aj_instances'
>>
>> The machine is just full.
>>
>> -- Reuti
>>
>> > You guys know how this can be solved?
>> >
>> > 2012/11/13 Reuti <[email protected]>
>> > On 13.11.2012 at 13:42, jan roels wrote:
>> >
>> > > Hi,
>> > >
>> > > I followed this tutorial on how to install SGE:
>> > >
>> > > http://verahill.blogspot.be/2012/06/setting-up-sun-grid-engine-with-three.html
>> > >
>> > > It all went fine on my master node, but on my exec node I have some trouble.
>> > >
>> > > First it gave the following error:
>> > >
>> > > 11/13/2012 13:44:43|  main|node0|E|communication error for "node0/execd/1" running on port 6445: "can't bind socket"
>> >
>> > Is there already something running on this port - any older version of the execd?
>> >
>> > > 11/13/2012 13:44:44|  main|node0|E|commlib error: can't bind socket (no additional information available)
>> > > 11/13/2012 13:45:12|  main|node0|C|abort qmaster registration due to communication errors
>> > > 11/13/2012 13:45:14|  main|node0|W|daemonize error: child exited before sending daemonize state
>> > >
>> > > but then I killed the process and restarted gridengine-execd, and now I get the following:
>> > >
>> > > /etc/init.d/gridengine-exec restart
>> > >  * Restarting Sun Grid Engine Execution Daemon sge_execd
>> > > error: can't resolve host name
>> > > error: can't get configuration from qmaster -- backgrounding
>> > >
>> > > What can I do to fix this?
>> >
>> > Any firewall on the machines? Ports 6444 and 6445 need to be excluded.
>> >
>> > -- Reuti
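The "can't bind socket" error above usually means something was still listening on the execd port. A quick way to see whether the SGE ports are taken - a sketch using ss; the port numbers 6444 (sge_qmaster) and 6445 (sge_execd) are the defaults mentioned in this thread:

```shell
# Sketch: report whether the default SGE ports are already in use on this
# host. 6444 = sge_qmaster, 6445 = sge_execd, as named in the thread.
STATUS=""
for p in 6444 6445; do
  if ss -tln 2>/dev/null | grep -q ":$p "; then
    STATUS="$STATUS $p:used"
  else
    STATUS="$STATUS $p:free"
  fi
done
echo "$STATUS"
```

If a port shows as used, find and stop the stale daemon before restarting gridengine-execd, and, as Reuti notes, make sure no firewall between the nodes blocks these ports.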
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
