Re: [gridengine users] Error message- failed receiving gdi request when calling qsub, but job is started

2016-06-22 Thread William Hay
On Wed, Jun 22, 2016 at 08:39:35AM +, sudha.penme...@wipro.com wrote:
> Hi,
> 
> We have added the below qmaster params in the SGE configuration
> 
> qmaster_params   gdi_timeout=240 gdi_retries=-1 cl_ping=true
> 
> Could you let me know the difference between gdi_timeout and gdi_retries? Why
> is there a gdi_retries parameter? Why can't we use gdi_timeout alone to retry
> indefinitely, for example by allowing a value of -1 for gdi_timeout? I don't
> see the specific purpose of the extra gdi_retries parameter.
> 
The difference is described in the manual page.  gdi_timeout specifies how long
to wait between retries; gdi_retries specifies how many times to retry.
The timeout setting prevents you from bombarding a slow server with repeated
requests, while the retries setting ensures that things will progress
even if the odd request gets lost for some reason.  If a single magic
value of gdi_timeout meant "try forever", there would be
no way to specify how long to wait between retries.
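
As a rough illustration of how the two settings interact, here is a minimal
Python sketch. It is not grid engine code: call_with_retries and the attempt
callback are hypothetical names, and the convention that retries=0 means a
single attempt while retries=-1 means retry forever is an assumption made for
the example.

```python
import itertools


def call_with_retries(attempt, timeout, retries):
    """Call attempt(timeout) until it yields a reply or the retries run out.

    attempt  -- hypothetical callable; returns a reply, or None if nothing
                arrived within `timeout` seconds
    timeout  -- how long each attempt may wait (the role of gdi_timeout)
    retries  -- extra attempts after the first; -1 means retry forever
                (the role of gdi_retries)
    Returns (reply, number_of_attempts_made).
    """
    attempts = itertools.count() if retries < 0 else range(retries + 1)
    for n in attempts:
        reply = attempt(timeout)
        if reply is not None:
            return reply, n + 1
    # Illustrative error text only, borrowed from the message in this thread.
    raise TimeoutError("failed receiving gdi request")
```

With two separate values, the wait per attempt and the number of attempts can
be tuned independently, which a single overloaded timeout value could not do.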

> We ask because, when we have NFS latency issues, we receive the error "failed
> receiving gdi request" even though the job is submitted, which causes confusion.
> 
It has been my practice to keep the file system holding the grid engine config
local to the qmaster and export it to the rest of the cluster via NFS,
precisely because the speed with which the qmaster accesses these file systems
matters a lot more than it does for other nodes.  This does mean our current
setup lacks a shadow master, but one of my colleagues is currently setting up
a pair of servers with DRBD so we can support failover in the event of
hardware failure.

William


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Error message- failed receiving gdi request when calling qsub, but job is started

2016-06-22 Thread sudha.penmetsa
Hi,

We have added the below qmaster params in the SGE configuration

qmaster_params   gdi_timeout=240 gdi_retries=-1 cl_ping=true

Could you let me know the difference between gdi_timeout and gdi_retries? Why
is there a gdi_retries parameter? Why can't we use gdi_timeout alone to retry
indefinitely, for example by allowing a value of -1 for gdi_timeout? I don't
see the specific purpose of the extra gdi_retries parameter.

We ask because, when we have NFS latency issues, we receive the error "failed
receiving gdi request" even though the job is submitted, which causes confusion.

Regards,
Sudha

-----Original Message-----
From: William Hay [mailto:w@ucl.ac.uk]
Sent: Wednesday, June 22, 2016 1:31 PM
To: Sudha Padmini Penmetsa (BAS) 
Cc: users@gridengine.org; Jeevan Behara Patnaik (GIS) 
Subject: Re: [gridengine users] Error message- failed receiving gdi request when
calling qsub, but job is started

On Tue, Jun 21, 2016 at 04:12:35PM +, sudha.penme...@wipro.com wrote:
>Hi,
>
>Since this morning, users have intermittently run into a problem on the grid
>when submitting jobs with qsub.
>
>When a job is submitted, qsub displays the error message: "Unable to run job:
>failed receiving gdi request. Exiting"
>
>But when checked later with qstat, the job is running successfully.
>
That sounds like some sort of connectivity problem to me.  The job is 
successfully submitted but the acknowledgement doesn't make it back to the 
client.  I'd try poking at things with qping and checking the network config on 
the qmaster and submit hosts.

William

The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com



Re: [gridengine users] Error message- failed receiving gdi request when calling qsub, but job is started

2016-06-22 Thread William Hay
On Tue, Jun 21, 2016 at 04:12:35PM +, sudha.penme...@wipro.com wrote:
>Hi,
> 
>Since this morning, users have intermittently run into a problem on the grid
>when submitting jobs with qsub.
> 
>When a job is submitted, qsub displays the error message: "Unable to run job:
>failed receiving gdi request. Exiting"
> 
>But when checked later with qstat, the job is running successfully.
> 
That sounds like some sort of connectivity problem to me.  The job is 
successfully submitted
but the acknowledgement doesn't make it back to the client.  I'd try poking at 
things with qping 
and checking the network config on the qmaster and submit hosts.
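
To make that failure mode concrete, here is a small Python simulation. It is
only a sketch: FakeQmaster, submit_over_lossy_link and submit_once are
hypothetical names, nothing here talks to a real qmaster, and the error text
is reused purely for illustration. It shows a request that is processed on the
server while the reply is lost, and a client that checks the job list before
deciding to resubmit.

```python
class FakeQmaster:
    """Toy stand-in for a job server; not the real qmaster."""

    def __init__(self):
        self.jobs = []

    def submit(self, name):
        self.jobs.append(name)
        return len(self.jobs)  # job id, counted from 1


def submit_over_lossy_link(server, name, reply_lost):
    """Deliver the request, then optionally lose the acknowledgement."""
    job_id = server.submit(name)  # the server has accepted the job
    if reply_lost:
        # The client never sees job_id and reports a failure.
        raise ConnectionError("failed receiving gdi request")
    return job_id


def submit_once(server, name, reply_lost=False):
    """Submit a job, treating a lost reply as 'check first, then decide'."""
    try:
        return submit_over_lossy_link(server, name, reply_lost)
    except ConnectionError:
        # Look the job up (the role qstat plays) before resubmitting,
        # so that a lost acknowledgement does not create a duplicate.
        if name in server.jobs:
            return server.jobs.index(name) + 1
        raise
```

In this toy model the job name is assumed to be unique, which is what lets
the client recognise its own job after the error.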

William





Re: [gridengine users] Error message- failed receiving gdi request when calling qsub, but job is started

2016-06-21 Thread Reuti
Hi,

> On 21.06.2016 at 18:12, [...] wrote:
> 
> Hi,
> 
> Since this morning, users have intermittently run into a problem on the grid
> when submitting jobs with qsub.
>  
> When submitting the job, it displays error message: “Unable to run job: 
> failed receiving gdi request. Exiting”

Are all machines running the same version of SGE? Do the jobs later succeed on
exactly the same nodes on which they failed before?

-- Reuti


> But when checked later with qstat, the job is running successfully.
>  
> We tried to find the details in qmaster/messages and also 
> qmaster/schedd/messages, but we could find nothing.
>  
> Could you help us understand what might be causing this odd behavior?
>  
> This has now been observed for at least two users.
>  
> Grid version: N1GE 6.1
>  
> Regards,
> Sudha

