Re: [gridengine users] Unable to find pe_start file

2012-04-04 Thread Ursula Winkler
Reuti wrote: Ursula Winkler wrote: Well, obviously the SGE doesn't find ANY path - the epilog routine is also not found. Has anybody seen such a behaviour before? Can you submit a simple job with: df -h The job script is executed with the same environment as the prolog,...

Re: [gridengine users] Unable to find pe_start file

2012-04-04 Thread Ursula Winkler
Reuti wrote: On 03.04.2012 at 16:33, Ursula Winkler wrote: Well, obviously the SGE doesn't find ANY path - the epilog routine is also not found. Has anybody seen such a behaviour before? Can you submit a simple job with: df -h The job script is executed with the same environment

Re: [gridengine users] Unable to find pe_start file

2012-04-04 Thread Reuti
On 04.04.2012 at 09:12, Ursula Winkler wrote: Reuti wrote: Ursula Winkler wrote: Well, obviously the SGE doesn't find ANY path - the epilog routine is also not found. Has anybody seen such a behaviour before? Can you submit a simple job with: df -h The job script is

Re: [gridengine users] Unable to find pe_start file

2012-04-04 Thread Ursula Winkler
Reuti wrote: This is also some kind of personal taste. Some prefer classic spooling as you can check all the information of a job, as it is just stored as text files. And it even handles a large number of nodes before it gets performance problems. Maybe Chris can make a statement about it

Re: [gridengine users] Resource quota and PEs

2012-04-04 Thread Esztermann, Ansgar
On Mar 28, 2012, at 17:31, Reuti wrote: Hi, On 27.03.2012 at 15:42, Esztermann, Ansgar wrote: Hi everyone, while in general, all users are equal in our installation, I would like some nodes to have a longer maximum runtime for some users. In order to avoid oversubscription, we

Re: [gridengine users] Resource quota and PEs

2012-04-04 Thread Reuti
On 04.04.2012 at 14:28, Esztermann, Ansgar wrote: On Mar 28, 2012, at 17:31, Reuti wrote: Hi, On 27.03.2012 at 15:42, Esztermann, Ansgar wrote: Hi everyone, while in general, all users are equal in our installation, I would like some nodes to have a longer maximum runtime for

[gridengine users] [Fwd: Re: Unable to find pe_start file]

2012-04-04 Thread Ursula Winkler
Reuti wrote: Yes, it expects exactly one argument: $pe_hostfile (besides any number of options prefixed by a dash). So the complete string specified for start_proc_args is limited to this number of characters. To be honest: I have no clue for the cause of this issue, it

Re: [gridengine users] Unable to find pe_start file

2012-04-04 Thread Ursula Winkler
Reuti wrote: Yes, it expects exactly one argument: $pe_hostfile (besides any number of options prefixed by a dash). So the complete string specified for start_proc_args is limited to this number of characters. To be honest: I have no clue for the cause of this issue, it never happened to

Re: [gridengine users] difference between a task reschedule and a task kill in the epilog?

2012-04-04 Thread Reuti
Well, in both cases it is killed of course. You could set loglevel to log_info and search the messages file of the qmaster for entries like: 04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370 rescheduling because: manual/auto rescheduling 04/04/2012
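
A minimal sketch of the search suggested above, assuming the qmaster messages file uses the pipe-separated layout of the quoted sample line (timestamp, thread, host, level, message). The names `parse_line` and `find_reschedules` are hypothetical helpers made up for this illustration and are not part of Grid Engine; the location of the messages file depends on the installation and is not assumed here.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one messages-file line into its five pipe-separated fields.

    Assumption: the layout matches the sample quoted in the thread.
    Lines with a different shape are skipped by returning None.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return Entry(*parts)


# Pattern taken from the wording of the sample entry in the thread.
RESCHEDULE = re.compile(
    r"job (\d+)\.(\d+) failed on host (\S+) rescheduling because: (.*)"
)


def find_reschedules(
    lines: Iterable[str],
) -> Iterator[Tuple[str, int, int, str, str]]:
    """Yield (timestamp, job_id, task_id, host, reason) for reschedule entries."""
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        match = RESCHEDULE.match(entry.message)
        if match:
            job_id, task_id, host, reason = match.groups()
            yield entry.timestamp, int(job_id), int(task_id), host, reason


if __name__ == "__main__":
    sample = [
        "04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host "
        "pc15370 rescheduling because: manual/auto rescheduling",
    ]
    for hit in find_reschedules(sample):
        print(hit)
```

To run it against a real file, pass an open file object to `find_reschedules`; entries for killed rather than rescheduled tasks would need their own pattern, which the thread snippet does not show.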

Re: [gridengine users] difference between a task reschedule and a task kill in the epilog?

2012-04-04 Thread Lars van der bijl
Hey Reuti On 4 April 2012 17:14, Reuti re...@staff.uni-marburg.de wrote: Well, in both cases it is killed of course. You could set loglevel to log_info and search the messages file of the qmaster for entries like: 04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370

Re: [gridengine users] difference between a task reschedule and a task kill in the epilog?

2012-04-04 Thread Reuti
On 04.04.2012 at 17:42, Lars van der bijl wrote: Hey Reuti On 4 April 2012 17:14, Reuti re...@staff.uni-marburg.de wrote: Well, in both cases it is killed of course. You could set loglevel to log_info and search the messages file of the qmaster for entries like: 04/04/2012

Re: [gridengine users] difference between a task reschedule and a task kill in the epilog?

2012-04-04 Thread Lars van der bijl
In our case the application has no checkpointing capabilities. For us a reschedule is just a run from the start on a new host. So a checkpoint with a signal 9 should be enough? On 4 April 2012 17:50, Reuti re...@staff.uni-marburg.de wrote: On 04.04.2012 at 17:42, Lars van der bijl wrote: Hey

Re: [gridengine users] difference between a task reschedule and a task kill in the epilog?

2012-04-04 Thread Reuti
On 04.04.2012 at 18:09, Lars van der bijl wrote: In our case the application has no checkpointing capabilities. For us a reschedule is just a run from the start on a new host. So a checkpoint with a signal 9 should be enough? No, the signal will be sent to create a checkpoint in min_cpu_interval

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-04 Thread Tru Huynh
On Tue, Apr 03, 2012 at 03:19:51PM -0700, Joshua Baker-LePain wrote: .. Yes. We have the SGE commlib errors, and the Open MPI routed:binomial errors. I'm mainly focusing on the SGE problem right now, as I think (hope) that fixing that will also fix the MPI issue. could it be related to

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-04 Thread Joshua Baker-LePain
On Wed, 4 Apr 2012 at 6:33pm, Tru Huynh wrote On Tue, Apr 03, 2012 at 03:19:51PM -0700, Joshua Baker-LePain wrote: Yes. We have the SGE commlib errors, and the Open MPI routed:binomial errors. I'm mainly focusing on the SGE problem right now, as I think (hope) that fixing that will also fix

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-04 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.
I did not know that you can have a shadow master without using classic spooling?? regards On 4/4/2012 4:37 PM, Joshua Baker-LePain wrote: That being said, our SGE directory isn't NFS shared. We use local spool directories and local SGE installations on all the nodes. The only thing that's NFS

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-04 Thread Reuti
On 04.04.2012 at 23:15, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. wrote: I did not know that you can have a shadow master without using classic spooling?? regards http://gridengine.org/pipermail/users/2011-March/000508.html -- Reuti On 4/4/2012 4:37 PM, Joshua Baker-LePain wrote: That being said,

Re: [gridengine users] Parallel jobs failure after OS upgrade

2012-04-04 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.
thx On 4/4/2012 5:23 PM, Reuti wrote: On 04.04.2012 at 23:15, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. wrote: I did not know that you can have a shadow master without using classic spooling?? regards http://gridengine.org/pipermail/users/2011-March/000508.html -- Reuti On 4/4/2012 4:37 PM, Joshua

[gridengine users] Shadow master NFSv4 dependency (was: Parallel jobs failure after OS upgrade)

2012-04-04 Thread Rayson Ho
Note that the dependency on NFSv4 was removed in Grid Engine 2011.11: http://gridscheduler.sourceforge.net/Releases/ReleaseNotesGE2011.11.pdf You can use any version of NFS to back the spool directory. Rayson On Wed, Apr 4, 2012 at 5:23 PM, Reuti re...@staff.uni-marburg.de wrote: On 04.04.2012