[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-10-05 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943809#comment-14943809
 ] 

Anubhav Dhoot commented on YARN-4185:
-

I don't think option 2 where you restart from 1 makes sense. Its also not a 
goal to minimize the total wait time. The goal should be to minimize the time 
to recover for short intermittent failure while also waiting long enough for 
long failures before giving up. Would it be better for us to ramp up to 10 sec 
exponentially and then do the n retries for 10 sec or do totally n retries 
including the ramp up.

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff that has the same fixed 
> max limit. Today the retry interval is fixed at 10 sec that can be 
> unnecessarily high especially when NMs could rolling restart within a sec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-10-05 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944384#comment-14944384
 ] 

Neelesh Srinivas Salian commented on YARN-4185:
---

[~adhoot], thanks for the clarification.
So, the initial retries can be done with backoff times of 1,2,4,8 that is still 
less then 10 and thus give the opportunity to retry for a short-lived NM 
restart (under 10 seconds)
We can continue to wait 10 seconds of backoff incrementally to accomodate a 
larger failure time.

Thus, the failure times can be under 1,2,4,8,10,10 and so on till the number of 
retries is exhausted.
My only concern is that if the failure lasts longer than the total wait time 
and the number of retries, there won't be a chance to retry.

I'll write up a patch to exhibit this.
Thank you.

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff that has the same fixed 
> max limit. Today the retry interval is fixed at 10 sec that can be 
> unnecessarily high especially when NMs could rolling restart within a sec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-10-05 Thread Anubhav Dhoot
I don't think option 2 where you restart from 1 makes sense. Its also not a
goal to minimize the total wait time. He goal should minimize the time to
recover for short intermittent failure while also waiting  long enough for
long failures before giving up.
On Oct 3, 2015 6:43 PM, "Neelesh Srinivas Salian (JIRA)" 
wrote:

>
> [
> https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942528#comment-14942528
> ]
>
> Neelesh Srinivas Salian commented on YARN-4185:
> ---
>
> Thoughts:
> 1) Using the exponentialBackoffRetry policy will have a progression of
> wait time starting at 1sec per retry assuming it takes a second for the NM
> to come up.
> Hence exponentially, the backoff time increases 2,4,8,16...till 512 as we
> approach 10 retries.
>
> 2) In the current strategy, the wait time is 10 seconds which causes an NM
> that restarted in 1 second to wait for a retry.
>
> 3) In the event of the retries going forward, at the 3rd retry ( the wait
> time is collectively 7 seconds (1+2+4) as per the exponential strategy) and
> (30 (10+10+10) seconds as the current static retry)
>
> 4) If you keep retrying, collectively the waiting static retry has now
> waited for 60 seconds versus 2^6 = 64 seconds in the exponential strategy
> at the 6th retry attempt.
>
> Logic for the Design:
> 1) In the event of retries being default to 10,
>a. I propose after the 3rd attempt, we continue to keep the wait time
> as 4 seconds and continue the same.
>Thus the total time comes up to 1,2,4,4,4,4,4,4,4,4 = 35 seconds.
>b. Versus collectively spending 100 seconds on waiting time in the
> static retry strategy.
>
> 2) Alternatively, the logic could be:
>a. Have the 1st 3 attempts of retry. If further needed, fall back to
> the 1sec start of the same logic.
>   So, it looks like this.. (1,2,4)  (1,2,4)  (1,2,4) (1) for 10
> retries.
>b. Thus we get the 10 retries done in collectively 22 seconds versus
> 100 seconds.
>
> Requesting feedback.
> Thank you.
>
> > Retry interval delay for NM client can be improved from the fixed static
> retry
> >
> ---
> >
> > Key: YARN-4185
> > URL: https://issues.apache.org/jira/browse/YARN-4185
> > Project: Hadoop YARN
> >  Issue Type: Bug
> >Reporter: Anubhav Dhoot
> >Assignee: Neelesh Srinivas Salian
> >
> > Instead of having a fixed retry interval that starts off very high and
> stays there, we are better off using an exponential backoff that has the
> same fixed max limit. Today the retry interval is fixed at 10 sec that can
> be unnecessarily high especially when NMs could rolling restart within a
> sec.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-10-03 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942528#comment-14942528
 ] 

Neelesh Srinivas Salian commented on YARN-4185:
---

Thoughts:
1) Using the exponentialBackoffRetry policy will have a progression of wait 
time starting at 1sec per retry assuming it takes a second for the NM to come 
up.
Hence exponentially, the backoff time increases 2,4,8,16...till 512 as we 
approach 10 retries.

2) In the current strategy, the wait time is 10 seconds which causes an NM that 
restarted in 1 second to wait for a retry.

3) In the event of the retries going forward, at the 3rd retry ( the wait time 
is collectively 7 seconds (1+2+4) as per the exponential strategy) and (30 
(10+10+10) seconds as the current static retry)

4) If you keep retrying, collectively the waiting static retry has now waited 
for 60 seconds versus 2^6 = 64 seconds in the exponential strategy at the 6th 
retry attempt.

Logic for the Design:
1) In the event of retries being default to 10, 
   a. I propose after the 3rd attempt, we continue to keep the wait time as 4 
seconds and continue the same. 
   Thus the total time comes up to 1,2,4,4,4,4,4,4,4,4 = 35 seconds.
   b. Versus collectively spending 100 seconds on waiting time in the static 
retry strategy.

2) Alternatively, the logic could be:
   a. Have the 1st 3 attempts of retry. If further needed, fall back to the 
1sec start of the same logic.
  So, it looks like this.. (1,2,4)  (1,2,4)  (1,2,4) (1) for 10 retries.
   b. Thus we get the 10 retries done in collectively 22 seconds versus 100 
seconds.

Requesting feedback.
Thank you.

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Neelesh Srinivas Salian
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff that has the same fixed 
> max limit. Today the retry interval is fixed at 10 sec that can be 
> unnecessarily high especially when NMs could rolling restart within a sec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-09-30 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938190#comment-14938190
 ] 

Anubhav Dhoot commented on YARN-4185:
-

can we try to reuse the existing values for retries 
(yarn.client.nodemanager-connect. ) and see if we can be mostly compatible? I 
am thinking its fine if its not exactly the same behavior

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff that has the same fixed 
> max limit. Today the retry interval is fixed at 10 sec that can be 
> unnecessarily high especially when NMs could rolling restart within a sec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-09-29 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935452#comment-14935452
 ] 

Neelesh Srinivas Salian commented on YARN-4185:
---

Writing up a patch for this.
Questions I had:
1) This would be included for the NMProxy and a new RetryPolicy setting
exponentialBackoffRetry(5,1000, TimeUnit.MILLISECONDS
What would be the value of the maxRetries?

I see the value set to 5 for NameNodeProxies. Is there an arbitrarily set value 
or does it need to be taken from the conf?
Thank you.

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff that has the same fixed 
> max limit. Today the retry interval is fixed at 10 sec that can be 
> unnecessarily high especially when NMs could rolling restart within a sec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-09-28 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934350#comment-14934350
 ] 

Neelesh Srinivas Salian commented on YARN-4185:
---

I would like to work on this JIRA if no one has already begun.
If yes, could you please assign the JIRA to me?

Thank you.

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> ---
>
> Key: YARN-4185
> URL: https://issues.apache.org/jira/browse/YARN-4185
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff that has the same fixed 
> max limit. Today the retry interval is fixed at 10 sec that can be 
> unnecessarily high especially when NMs could rolling restart within a sec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)