On 22.11.2012, at 14:42, jan roels wrote:

> I work on an NFS share that is also available on the node. I'm currently
> testing with only one node, so it's unique...

Just be aware that in this case the job script will be sent by SGE to the
execd on the node, which in turn stores it on the NFS server (which might be
the same machine as the master). I'm not sure about the error message: is the
share mounted with "noexec" and/or "all_squash"/"root_squash"? But the error
should be "permission denied" in these cases.

-- Reuti

> 2012/11/22 Reuti <[email protected]>
>
> On 22.11.2012, at 14:29, jan roels wrote:
>
> > I tried it with the root account and with another account... both give
> > the same error.
>
> Is the directory local on "camilla", and is the node name unique?
>
> -- Reuti
>
> > 2012/11/22 Reuti <[email protected]>
> >
> > On 22.11.2012, at 12:30, jan roels wrote:
> >
> > > Hi,
> > >
> > > qstat -j <jobid> didn't show the full error message; this is the full
> > > error message:
> > >
> > > 11/22/2012 12:26:11| main|camilla|E|shepherd of job 76.226 exited with
> > > exit status = 27
> > > 11/22/2012 12:26:11| main|camilla|E|can't open usage file
> > > "active_jobs/76.226/usage" for job 76.226: No such file or directory
> > > 11/22/2012 12:26:11| main|camilla|E|11/22/2012 12:26:10 [0:11412]:
> > > execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76,
> > > "/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such
> > > file or directory
> >
> > Could be a permission problem. Everyone needs read access to this
> > directory, as the job script is executed from there.
> >
> > -- Reuti
> >
> > > 2012/11/22 jan roels <[email protected]>
> > >
> > > Hi,
> > >
> > > Do you guys know what this error could be:
> > >
> > > error reason 2: 11/22/2012 11:12:25 [0:31220]:
> > > execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
> > > error reason 3: 11/22/2012 11:12:25 [0:31221]:
> > > execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
> > >
> > > This goes on as long as it's running...
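The mount options Reuti asks about can be checked with a small helper along
these lines. This is only a sketch: the `check_mount_opts` function and the
sample options string are illustrative, not from the thread. Note that
`noexec` shows up in the client's /proc/mounts, while `root_squash` and
`all_squash` are export options on the NFS server (/etc/exports), so the
function just scans whatever comma-separated options string you hand it:

```shell
#!/bin/sh
# Hypothetical helper: flag mount/export options that can break job
# execution from an NFS-backed spool directory.
check_mount_opts() {
    opts="$1"
    for bad in noexec root_squash all_squash; do
        # Wrap in commas so each option matches only as a whole word.
        case ",$opts," in
            *",$bad,"*) echo "problem: $bad" ;;
        esac
    done
}

# Sample data; on a live client you would feed in the real options, e.g.:
#   check_mount_opts "$(awk '$2 == "/var/spool/gridengine" {print $4}' /proc/mounts)"
check_mount_opts "rw,noexec,root_squash,addr=10.0.0.1"
```

Run against the sample string this reports `noexec` and `root_squash`; an
options string without any of the three flags produces no output.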
> > > and my state went to:
> > >
> > >      69 0.50000 SA         root         Eqw   11/22/2012 09:12:05     1 1-500:1
> > >      69 0.00000 SA         root         qw    11/22/2012 09:12:05     1 501-4200:1
> > >
> > > This is the script I was running:
> > >
> > > #!/bin/bash
> > > #$ -cwd
> > > #$ -N SA
> > > #$ -t 1-4200:1
> > >
> > > /var/software/packages/Mathematica/7.0/Executables/math -run
> > > "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
> > >
> > > Hope somebody can fix the problem.
> > >
> > > Kind regards
> > >
> > > 2012/11/14 Reuti <[email protected]>
> > >
> > > On 14.11.2012, at 10:08, jan roels wrote:
> > >
> > > > I got it working again; there was already a process of execd running
> > > > that needed to be killed, and then I restarted the services.
> > > >
> > > > I'm trying to run a script now:
> > > >
> > > > #!/bin/bash
> > > > #$ -cwd
> > > > #$ -N SA
> > > > #$ -S /bin/sh
> > > > #$ -t 1-4200:
> > >
> > > Don't run scripts as root. If something goes wrong it might trash your
> > > machine(s).
> > >
> > > > /var/software/packages/Mathematica/7.0/Executables/math -run
> > > > "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
> > > >
> > > > but it gives the following output:
> > > >
> > > > stdin: is not a tty
> > >
> > > It's just a warning - unless someone complains I would suggest ignoring it.
> > >
> > > > and this is the output of my qstat -f:
> > > >
> > > > queuename                  qtype resv/used/tot. load_avg arch       states
> > > > ---------------------------------------------------------------------------------
> > > > [email protected]       BIP   0/1/1          0.70     lx26-amd64
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 1
> > > > ---------------------------------------------------------------------------------
> > > > main.q@node0               BIP   0/24/24        27.71    lx26-amd64
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 2
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 3
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 4
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 5
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 6
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 7
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 8
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 9
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 10
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 11
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 12
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 13
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 14
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 15
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 16
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 17
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 18
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 19
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 20
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 21
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 22
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 23
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 24
> > > >      35 0.50000 SA         root         r     11/14/2012 09:57:47     1 25
> > > >
> > > > ############################################################################
> > > >  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> > > > ############################################################################
> > > >      35 0.50000 SA         root         qw    11/14/2012 09:57:38     1 26-4200:1
> > > >
> > > > root@camilla:/nfs/share/sge# qstat -explain c -j 35
> > > > ==============================================================
> > > > job_number:                 35
> > > > exec_file:                  job_scripts/35
> > > > submission_time:            Wed Nov 14 09:57:38 2012
> > > > owner:                      root
> > > > uid:                        0
> > > > group:                      root
> > > > gid:                        0
> > > > sge_o_home:                 /root
> > > > sge_o_log_name:             root
> > > > sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > > > sge_o_shell:                /bin/bash
> > > > sge_o_workdir:              /nfs/share/sge
> > > > sge_o_host:                 camilla
> > > > account:                    sge
> > > > cwd:                        /nfs/share/sge
> > > > mail_list:                  root@camilla
> > > > notify:                     FALSE
> > > > job_name:                   SA
> > > > jobshare:                   0
> > > > shell_list:                 NONE:/bin/sh
> > > > env_list:
> > > > script_file:                HistDisCaCO31.sh
> > > > job-array tasks:            1-4200:1
> > > > usage    1:                 cpu=00:05:20, mem=105.16135 GBs, io=0.01537, vmem=1.110G, maxvmem=1.110G
> > > > usage    2:                 cpu=00:04:17, mem=179.44371 GBs, io=0.01395, vmem=3.643G, maxvmem=3.643G
> > > > usage    3:                 cpu=00:04:37, mem=191.69532 GBs, io=0.01394, vmem=3.657G, maxvmem=3.657G
> > > > usage    4:                 cpu=00:04:34, mem=188.12645 GBs, io=0.01394, vmem=3.655G, maxvmem=3.655G
> > > > usage    5:                 cpu=00:04:16, mem=180.18292 GBs, io=0.01394, vmem=3.636G, maxvmem=3.636G
> > > > usage    6:                 cpu=00:04:22, mem=183.47616 GBs, io=0.01394, vmem=3.644G, maxvmem=3.644G
> > > > usage    7:                 cpu=00:04:15, mem=179.89624 GBs, io=0.01400, vmem=3.640G, maxvmem=3.640G
> > > > usage    8:                 cpu=00:04:55, mem=207.28643 GBs, io=0.01394, vmem=3.669G, maxvmem=3.669G
> > > > usage    9:                 cpu=00:04:27, mem=184.86707 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   10:                 cpu=00:04:14, mem=179.09446 GBs, io=0.01394, vmem=3.635G, maxvmem=3.635G
> > > > usage   11:                 cpu=00:04:47, mem=195.80372 GBs, io=0.01400, vmem=3.668G, maxvmem=3.668G
> > > > usage   12:                 cpu=00:04:49, mem=203.43895 GBs, io=0.01394, vmem=3.665G, maxvmem=3.665G
> > > > usage   13:                 cpu=00:04:45, mem=196.67175 GBs, io=0.01394, vmem=3.663G, maxvmem=3.663G
> > > > usage   14:                 cpu=00:04:24, mem=185.68047 GBs, io=0.01400, vmem=3.648G, maxvmem=3.648G
> > > > usage   15:                 cpu=00:04:40, mem=195.96253 GBs, io=0.01394, vmem=3.656G, maxvmem=3.656G
> > > > usage   16:                 cpu=00:04:11, mem=179.84016 GBs, io=0.01394, vmem=3.633G, maxvmem=3.633G
> > > > usage   17:                 cpu=00:04:43, mem=196.21689 GBs, io=0.01394, vmem=3.662G, maxvmem=3.662G
> > > > usage   18:                 cpu=00:04:37, mem=197.39875 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   19:                 cpu=00:04:35, mem=191.55982 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
> > > > usage   20:                 cpu=00:04:26, mem=191.62928 GBs, io=0.01394, vmem=3.643G, maxvmem=3.643G
> > > > usage   21:                 cpu=00:04:42, mem=197.87398 GBs, io=0.01394, vmem=3.660G, maxvmem=3.660G
> > > > usage   22:                 cpu=00:04:36, mem=193.43107 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
> > > > usage   23:                 cpu=00:04:32, mem=193.12103 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
> > > > usage   24:                 cpu=00:04:25, mem=186.56485 GBs, io=0.01400, vmem=3.644G, maxvmem=3.644G
> > > > usage   25:                 cpu=00:04:51, mem=201.81706 GBs, io=0.01400, vmem=3.669G, maxvmem=3.669G
> > > > scheduling info:            queue instance "main.q@camilla" dropped because it is full
> > > >                             queue instance "main.q@node0" dropped because it is full
> > > >                             All queues dropped because of overload or full
> > > >                             not all array tasks may be started due to 'max_aj_instances'
> > >
> > > The machine is just full.
> > >
> > > -- Reuti
> > >
> > > > You guys know how this can be solved?
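Since every slot on both queue instances is taken and each task peaks near
3.6G vmem, one common approach is to cap how many array tasks run at once
instead of letting the scheduler fill every slot. A hedged sketch of the
submit script from the thread with such a cap added; the `-tc` option is
available in newer Grid Engine versions (check `man qsub` on your
installation first), and the limit of 20 is an arbitrary illustrative value:

```shell
#!/bin/bash
#$ -cwd
#$ -N SA
#$ -t 1-4200:1
#$ -tc 20    # assumption: cap concurrent array tasks below the 25 slots

/var/software/packages/Mathematica/7.0/Executables/math -run \
    "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
```

With the cap in place the remaining tasks simply wait in qw state instead of
driving the nodes into overload.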
> > > >
> > > > 2012/11/13 Reuti <[email protected]>
> > > >
> > > > On 13.11.2012, at 13:42, jan roels wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I followed this tutorial on how to install SGE:
> > > > >
> > > > > http://verahill.blogspot.be/2012/06/setting-up-sun-grid-engine-with-three.html
> > > > >
> > > > > It all went fine on my master node, but on my exec node I have some
> > > > > trouble.
> > > > >
> > > > > First it gave the following error:
> > > > >
> > > > > 11/13/2012 13:44:43| main|node0|E|communication error for
> > > > > "node0/execd/1" running on port 6445: "can't bind socket"
> > > >
> > > > Is there already something running on this port - any older version
> > > > of the execd?
> > > >
> > > > > 11/13/2012 13:44:44| main|node0|E|commlib error: can't bind socket
> > > > > (no additional information available)
> > > > > 11/13/2012 13:45:12| main|node0|C|abort qmaster registration due to
> > > > > communication errors
> > > > > 11/13/2012 13:45:14| main|node0|W|daemonize error: child exited
> > > > > before sending daemonize state
> > > > >
> > > > > but then I killed the process and restarted gridengine-execd, and
> > > > > now I get the following:
> > > > >
> > > > > /etc/init.d/gridengine-exec restart
> > > > >  * Restarting Sun Grid Engine Execution Daemon sge_execd
> > > > > error: can't resolve host name
> > > > > error: can't get configuration from qmaster -- backgrounding
> > > > >
> > > > > What can I do to fix this?
> > > >
> > > > Any firewall on the machines? Ports 6444 and 6445 need to be excluded.
> > > >
> > > > -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
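Reuti's "is there already something running on this port" question can be
answered by grepping a socket listing for the two SGE ports. A minimal
sketch: the listing below is fabricated sample data, and on the node itself
you would pipe in the live output of `netstat -lntp` (or `ss -lntp`) instead:

```shell
#!/bin/sh
# Sketch: spot a leftover daemon on the SGE ports (6444 qmaster, 6445 execd).
# sample_listing is made-up data standing in for `netstat -lntp` output.
sample_listing='tcp 0 0 0.0.0.0:6445 0.0.0.0:* LISTEN 1234/sge_execd
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 900/sshd'

# Print any listener on an SGE port; otherwise report the ports as free.
echo "$sample_listing" | grep -E ':(6444|6445) ' || echo "SGE ports are free"
```

Against the sample data this prints the stale sge_execd line, which is
exactly the "older version of the execd" case: kill that process (and check
firewall rules for 6444/6445) before restarting the daemon.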
