On 31.10.2015 at 19:36, Leior Varon wrote:

> Hi Yuri,
> I think it's best that you be the contact with the forum but I think it would 
> be good to answer something similar to the below:
> 
> Hi Reuti,
> 
> Thank you for the reply.
> We also noticed that requesting "-q bar@foo" works around this behavior and 
> it looks like we will need to use this method.
> 
> However, there are many cases where this workaround is hard to implement. For 
> instance, if a program's default flags request -q bar and the user wants to 
> specify the host foo, the default flag has to be removed first.

It should work to rewrite this in a JSV (job submission verifier) so that the 
job requests "-q bar@foo": read the already set "-q bar", append the requested 
host to it, and remove "-l h=foo".

Even without a default for a queue it could be submitted to "-q *@foo".
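
The rewrite described above can be sketched as a plain function. This is only 
an illustration of the logic, not a working JSV script: the function name 
merge_host_into_queue, the dictionary layout for the "-l" requests, and the 
choice to leave queue entries that already contain "@" untouched are all 
assumptions. In a real JSV the values would come from the submitted job's 
parameters.

```python
# Hypothetical helper: fold a "-l h=foo" / "-l hostname=foo" request into
# the queue request, as in "-q bar" + "-l h=foo"  ->  "-q bar@foo".
HOST_KEYS = ("h", "hostname")


def merge_host_into_queue(queue_request, resources):
    """Return (new_queue_request, new_resources).

    queue_request: value of -q as a string ("bar", "bar,baz") or None.
    resources:     dict of -l requests, e.g. {"h": "foo", "mem_free": "2G"}.
    """
    # Find the requested host, if any.
    host = next((resources[k] for k in HOST_KEYS if k in resources), None)
    if host is None:
        # Nothing to rewrite; return the request unchanged.
        return queue_request, dict(resources)

    # Drop the host request from the resource list.
    remaining = {k: v for k, v in resources.items() if k not in HOST_KEYS}

    # Without any queue request, fall back to the wildcard queue "*".
    queues = [q for q in (queue_request or "").split(",") if q] or ["*"]

    # Append the host to every queue that does not already name one.
    merged = ",".join(q if "@" in q else f"{q}@{host}" for q in queues)
    return merged, remaining
```

With a default of "-q bar" and a user request of "-l h=foo", the function 
returns "bar@foo" and an empty resource list; without any queue request it 
returns "*@foo".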

-- Reuti


> 
> We are still looking for additional fixes or workarounds, and we are also 
> interested in any information regarding similar strange behavior.
> 
> Thank you,
> 
> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de] 
> Sent: Friday, October 30, 2015 12:17 PM
> To: Yuri Burmachenko <yur...@mellanox.com>
> Cc: users@gridengine.org; EdaIt <ed...@mellanox.com>; Leior Varon 
> <lei...@mellanox.com>
> Subject: Re: [gridengine users] SoGE 8.1.8 - Qsub issue when using 
> requestable variable and parallel environment - need your help.
> 
> 
>> On 30.10.2015 at 10:40, Yuri Burmachenko <yur...@mellanox.com> wrote:
>> 
>> Hello to distinguished forum members,
>> 
>> Recently we have needed to submit jobs in a way that makes qsub request both 
>> the requestable variable hostname and a parallel environment.
>> 
>> For example, if we submit an ‘xterm’ job:
>> $SGE_ROOT/bin/lx-amd64/qsub -V -cwd -b y -l hostname=host_in_grid -pe somePe 1 xterm
>> 
>> This kind of request results in strange scheduler behavior: the submission 
>> ends up in one of the following states:
>> 
>> 1. The xterm job opens as expected.
>> 2. There is a very long delay and then the xterm opens.
>> 3. The job enters the ‘qw’ state with an error similar to the one below:
>> cannot run because it exceeds limit "/////" in rule "some_rule/1"
>> cannot run in PE "somePe" because it only offers 0 slots
>> 
>> In all of the above states, “host_in_grid” has enough free slots and the 
>> quota rule “some_rule” is not related in any way to the 
>> consumable/requestable variable in the job submission request.
>> If we remove the “some_rule” quota from the SGE quotas, the error picks 
>> another rule and again states that its limit was exceeded.
>> NOTE: the somePe parallel environment has enough free slots; it is initially 
>> defined with 999 slots.
>> 
>> Basically, these “cannot run” messages do not reflect the real reason why 
>> the job can’t run, since all conditions are actually met. This is very 
>> confusing. Why does this happen?
>> 
>> We also found a workaround without the requestable variable “hostname”, 
>> shown below, which ALWAYS works:
>> $SGE_ROOT/bin/lx-amd64/qsub -V -cwd -b y -q host_in_grid -pe testpe 1 xterm
>> 
>> Any ideas why this strange behavior occurs? Is this some kind of bug? 
>> How can this be resolved?
> 
> Unfortunately I have no idea, but I have already observed in former versions 
> that instead of:
> 
> -l h=foo -q bar
> 
> it's better to request:
> 
> -q bar@foo
> 
> Maybe it is similar to the issue you faced.
> 
> -- Reuti
> 


