Re: [gridengine users] Getting many "Core binding: Couldn't determine core binding string for config file!" in exec hosts log files

2011-04-01 Thread William Deegan
Reuti,
On Apr 1, 2011, at 5:07 PM, Reuti wrote:
> On 01.04.2011 at 23:54, William Deegan wrote:
>
>> Greetings,
>>
>> Here's the line in question
>>
>> 04/01/2011 14:49:13| main|hosta|W|Core binding: Couldn't determine core
>> binding string for config file!
>
> Ignore it. I also have it in

Re: [gridengine users] shared nothing install did I do it right?

2011-04-01 Thread William Deegan
Reuti,
On Apr 1, 2011, at 5:12 PM, Reuti wrote:
> On 02.04.2011 at 01:41, William Deegan wrote:
>
>> Greetings,
>>
>> Here's what I did.
>> 1) unpack ge tarballs into /opt/ge on all hosts
>> 2) configure grid master
>> 3) scp /opt/ge/default to all hosts
>> 4) verify ssh works back and forth

Re: [gridengine users] shared nothing install did I do it right?

2011-04-01 Thread Reuti
On 02.04.2011 at 01:41, William Deegan wrote:
> Greetings,
>
> Here's what I did.
> 1) unpack ge tarballs into /opt/ge on all hosts
> 2) configure grid master
> 3) scp /opt/ge/default to all hosts
> 4) verify ssh works back and forth among all hosts as root
Do you need X11 forwarding?
> 5)

Re: [gridengine users] Getting many "Core binding: Couldn't determine core binding string for config file!" in exec hosts log files

2011-04-01 Thread Reuti
On 01.04.2011 at 23:54, William Deegan wrote:
> Greetings,
>
> Here's the line in question
>
> 04/01/2011 14:49:13| main|hosta|W|Core binding: Couldn't determine core
> binding string for config file!
Ignore it. I also have it in 6.2u5 in case no core binding was requested, but binding was

[gridengine users] shared nothing install did I do it right?

2011-04-01 Thread William Deegan
Greetings,
Here's what I did.
1) unpack ge tarballs into /opt/ge on all hosts
2) configure grid master
3) scp /opt/ge/default to all hosts
4) verify ssh works back and forth among all hosts as root
5) run ./start_gui_installer -debug
6) Install all execution hosts
This is shared nothing, so the

[gridengine users] qlogin fails when requesting centos5.5 hosts (except 1)

2011-04-01 Thread William Deegan
Greetings,
New gridengine install with binaries from here: here's the output:
[deegan@hotan2 ~]$ qlogin -l hostname=host12
Your job 1191 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1191 has been successfully scheduled.
Establishing builtin s

Re: [gridengine users] Getting many "Core binding: Couldn't determine core binding string for config file!" in exec hosts log files

2011-04-01 Thread Stephen Dennis
Nope. I am using a binary built at Univa, but it's about the same vintage. qrsh is working kind of OK. Haven't tried qsh or qlogin yet.
# Stephen Dennis : Senior Sales Engineer
# Univa Corporation: univa.com/products/grid

Re: [gridengine users] Getting many "Core binding: Couldn't determine core binding string for config file!" in exec hosts log files

2011-04-01 Thread William Deegan
Stephen,
Have you been using the current master build from here: http://bioteam.net/dag/gridengine-courtesy-binaries/?
Anyone else? qsh I can get working, but qrsh and qlogin fail. (I'll send another email in a minute with details).
-Bill
On Apr 1, 2011, at 3:23 PM, Stephen Dennis wrote:
> Me

Re: [gridengine users] One exec host hangs at /etc/init.d/sgeexecd.BLAH start

2011-04-01 Thread William Deegan
Resolved this. One host had a bad entry for its own hostname in /etc/hosts.
-Bill
On Apr 1, 2011, at 11:41 AM, William Deegan wrote:
> Greetings,
>
> I noticed one node (using Chris's ge-8.0.0.alph binaries) wouldn't take:
> qsh -l hostname=this_host
>
> So I stopped the execd via:
> /etc/init.d/
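The message does not say what exactly was wrong with the /etc/hosts entry. As a rough sketch, assuming one common form of the problem (the host's own name resolving to a loopback address), a check along these lines could be run on each exec host. The function names `is_loopback` and `check_own_hostname` are made up for this illustration and are not part of Grid Engine.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the address lies in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def check_own_hostname() -> str:
    """Resolve this machine's own hostname and report what it maps to."""
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except OSError as exc:
        # The name does not resolve at all; that is also worth fixing.
        return f"{hostname}: does not resolve ({exc})"
    if is_loopback(address):
        # Assumption: a loopback mapping is the kind of bad entry meant here.
        return f"{hostname}: resolves to loopback address {address}, check /etc/hosts"
    return f"{hostname}: resolves to {address}"


if __name__ == "__main__":
    print(check_own_hostname())
```

This only looks at the resolver's answer for the local name; it does not verify that the other hosts in the cluster see the same mapping.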

Re: [gridengine users] Getting many "Core binding: Couldn't determine core binding string for config file!" in exec hosts log files

2011-04-01 Thread Stephen Dennis
Me too. It looks like that string is not in the current master though. Possibly something has been fixed already. Maybe this from today:
-
commit 8c74d05904d05e214768a0686c8bc2259d8c4e31
Author: Daniel Gruber
Date: Fri Apr 1 15:25:07 2011 +0200
Remvoed unwan

[gridengine users] Getting many "Core binding: Couldn't determine core binding string for config file!" in exec hosts log files

2011-04-01 Thread William Deegan
Greetings,
Here's the line in question
04/01/2011 14:49:13| main|hosta|W|Core binding: Couldn't determine core binding string for config file!
Any idea how to resolve this?
Centos 5.5
Thanks,
Bill

[gridengine users] One exec host hangs at /etc/init.d/sgeexecd.BLAH start

2011-04-01 Thread William Deegan
Greetings,
I noticed one node (using Chris's ge-8.0.0.alph binaries) wouldn't take:
qsh -l hostname=this_host
So I stopped the execd via:
/etc/init.d/sgeexecd.BLAH stop
Then tried
/etc/init.d/sgeexecd.BLAH start
And it just sits there. First time I did this I saw:
04/01/2011 10:26:05| main|n

Re: [gridengine users] Does an "exclusive" resource request use the actual usage by accident while submitting?

2011-04-01 Thread Mark Dixon
On Thu, 31 Mar 2011, Reuti wrote: ... Does your problem feel like it's related? Or is this a new issue? In my case only one of these exclusive jobs was running in the cluster (some other jobs running), around 18 exclusive jobs waiting (submitted until the 1st started), and the 19th got "no su

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread Reuti
On 01.04.2011 at 16:57, lars van der bijl wrote:
> core file size (blocks, -c) 0
>
> file locks (-x) unlimited
Fine.
> I think it might be the machine killing them, because we're not putting any
> other limits anywhere, unless it's the application we're running.

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 193056
max locked memory (kbytes, -l) 256
max memory size (kbytes, -m) unlim
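The listing above is `ulimit -a` output from a shell. To see which limits a job actually runs under, the same values can be read from inside the job itself. The following is a minimal sketch using Python's standard `resource` module; the helper names `format_limit` and `current_limits` are invented for this example, and the values are reported in the kernel's units (mostly bytes), not the blocks or kbytes that `ulimit` prints.

```python
import resource


def format_limit(value: int) -> str:
    """Render a raw limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> dict:
    """Collect (soft, hard) limits of the current process for a few resources."""
    names = ["RLIMIT_CORE", "RLIMIT_DATA", "RLIMIT_FSIZE", "RLIMIT_MEMLOCK", "RLIMIT_AS"]
    report = {}
    for name in names:
        limit_id = getattr(resource, name, None)
        if limit_id is None:
            # Not every platform defines every limit; skip the missing ones.
            continue
        soft, hard = resource.getrlimit(limit_id)
        report[name] = (format_limit(soft), format_limit(hard))
    return report


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

Running this as the first step of a job script and comparing it with the interactive `ulimit -a` output would show whether the scheduler applies limits that the login shell does not.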

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread Reuti
Add on: you can check the messages file of the execd on the nodes to see whether anything about the reason was recorded there.
-- Reuti
On 01.04.2011 at 16:39, lars van der bijl wrote:
> The problem is that I don't have any such limits enforced currently on
> submission. The submissions to qsub a

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread Reuti
On 01.04.2011 at 16:39, lars van der bijl wrote:
> The problem is that I don't have any such limits enforced currently on
> submission. The submissions to qsub are hidden from the user, so I know they're
> not adding them. The only thing we have is a load/suspend threshold in the
> grid itself

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread Reuti
On 01.04.2011 at 12:54, lars van der bijl wrote:
> In this case, yes.
>
> However, on the jobs running on our farm we put no memory limits as of yet,
> just the requested number of procs.
>
> Is it the usual behaviour that, if it fails with this code, the subsequent
> dependencies start regardless?

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
Also, is there any way of catching this and raising 100? Once the job is finished and its dependencies start, it's causing major havoc on our system, looking for files that aren't there. Are there other things the grid uses SIGKILL for, not just memory limits?
Lars
On 1 April 2011 11:54, lars va
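One possible way to "catch this and raise 100" is a small wrapper around the real command. The sketch below assumes the Grid Engine convention that a job exiting with status 100 is put into the error state; whether that holds back dependent jobs on a given installation should be verified locally. The names `map_exit_code` and `run_and_map` are hypothetical. Note also that if the scheduler kills the whole process group with SIGKILL, the wrapper itself dies and never gets the chance to exit with 100.

```python
import subprocess
import sys

# Assumption: exit status 100 asks Grid Engine to put the job into the error state.
SGE_ERROR_EXIT = 100


def map_exit_code(returncode: int) -> int:
    """Map a signal death (negative, or shell-style 128 + N) to the error exit code."""
    if returncode < 0 or returncode > 128:
        return SGE_ERROR_EXIT
    return returncode


def run_and_map(argv) -> int:
    """Run the wrapped command and return the exit code the job should report."""
    completed = subprocess.run(argv)
    # subprocess reports death by signal N as returncode -N on POSIX systems.
    return map_exit_code(completed.returncode)


if __name__ == "__main__":
    # Usage (hypothetical file name): python wrapper.py real_command arg1 arg2
    if len(sys.argv) > 1:
        sys.exit(run_and_map(sys.argv[1:]))
```

Ordinary non-zero exit codes are passed through unchanged, so only signal deaths such as the 137 case are turned into 100.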

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
In this case, yes. However, on the jobs running on our farm we put no memory limits as of yet, just the requested number of procs. Is it the usual behaviour that, if it fails with this code, the subsequent dependencies start regardless?
Lars
On 1 April 2011 11:41, Reuti wrote:
> Hi,
>
> On 01.04.2

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread Reuti
Hi,
On 01.04.2011 at 12:33, lars van der bijl wrote:
> Hey everyone.
>
> We're having some issues with jobs being killed with exit status 137.
137 = 128 + 9
$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIG
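The arithmetic in the reply (137 = 128 + 9) follows the usual shell convention that an exit status above 128 means the process was terminated by signal number status minus 128. A small sketch of that decoding, using Python's standard `signal` module; the function name `decode_exit_status` is made up for this example.

```python
import signal


def decode_exit_status(status: int) -> str:
    """Describe a shell-style exit status, treating values above 128 as 128 + signal."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a signal known on this platform.
            name = f"unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited normally with code {status}"


if __name__ == "__main__":
    print(decode_exit_status(137))
```

For 137 this names signal 9, which is SIGKILL on Linux, matching the `kill -l` listing quoted in the message.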

[gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
Hey everyone. We're having some issues with jobs being killed with exit status 137. This causes the task to finish and start its dependent task, which is causing all kinds of havoc. Submitting a job with a very small max memory limit gives me this as an example.
$ qacct -j 21141 ==

Re: [gridengine users] 6.2u5p1 arch.dist script...

2011-04-01 Thread James Abbott
Ah - that explains it, thanks. I just saw the big comment at the top of the arch script explaining how changes must be propagated to arch.dist, and assumed someone forgot! I guess this could bite a few folk though since it detected 2.6 at compile time but 2.4 at runtime (this was on CentOS 5.5/x86