[ https://issues.apache.org/jira/browse/SLIDER-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131415#comment-16131415 ]
Gour Saha commented on SLIDER-1236:
-----------------------------------
Right, the total number of containers is the real key here, not technically the number of apps. I used apps as the count with a standard set of containers per app in mind. We can refer explicitly to the number of containers, although for N containers spread across 10 apps rather than 1, the chances of containers continuing to hit ZK are higher in the 10-app scenario, since 10 AMs (instead of 1) have to recover before all N containers stop hitting ZK.

While experimenting to find an ideal value for HEARTBEAT_IDDLE_INTERVAL_SEC, at one point I brought it down to 10 ms (0.01 sec) and the app came up fine. The agents logged multiple NO_OP heartbeats between every two heartbeats that did something meaningful like INSTALL and START, so the AM-Agent protocol works fine with the reduced wait. However, I think we need to test with an app with 100s of containers (or more) to validate that AM robustness does not degrade; the reduction of the heartbeat interval from 10 sec to 1 sec needs more testing.

Also, as you mentioned, there are several other exception code paths with a potential wait of up to 30 secs which are not addressed. I will create a separate jira to address all the code blocks affected by HEARTBEAT_IDDLE_INTERVAL_SEC, since we need to ensure an app recovers fast in all failure scenarios. Let's keep this jira for only the wait between registration and the first heartbeat, which is what this bug was actually filed for. Let me first back out the change where I reduced HEARTBEAT_IDDLE_INTERVAL_SEC from 10 to 1. What do you think [~billie.rinaldi]?


> Unnecessary 10 second sleep before installation
> -----------------------------------------------
>
>                 Key: SLIDER-1236
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1236
>             Project: Slider
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Gour Saha
>
> Noticed when starting LLAP on a 2-node cluster. Slider AM logs:
> {noformat}
> 2017-05-22 22:04:33,047 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Registration response: RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> ...
> 2017-05-22 22:04:34,946 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Registration response: RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> {noformat}
> Then nothing useful goes on for a while, until:
> {noformat}
> 2017-05-22 22:04:43,099 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Installing LLAP on container_1495490227300_0002_01_000002.
> {noformat}
> If you look at the corresponding logs from both agents, you can see that they both have a gap of pretty much exactly 10 sec.
> After the gap, they talk back to the AM; roughly 30 ms after the end of each container's gap, presumably after hearing from it, the AM starts installing LLAP.
> {noformat}
> INFO 2017-05-22 22:04:33,055 Controller.py:180 - Registered with the server with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:33,055 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Sending heartbeat with response id: 0 and timestamp: 1495490683064. Command(s) in progress: False. Components mapped: False
> INFO 2017-05-22 22:04:34,948 Controller.py:180 - Registered with the server with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:34,948 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:44,959 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:44,960 AgentToggleLogger.py:40 - Sending heartbeat with response id: 0 and timestamp: 1495490684959. Command(s) in progress: False. Components mapped: False
> {noformat}
> I've observed the same on multiple different clusters.
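For reference, below is a minimal sketch of the agent-side register/heartbeat loop being discussed. It is not the actual Slider Controller.py code: the AgentSession class, the transport.register()/heartbeat() helpers, and the response field names are hypothetical stand-ins modeled loosely on the log output. The only point it illustrates is that sleeping HEARTBEAT_IDDLE_INTERVAL_SEC before the very first heartbeat delays the first INSTALL by the full interval, whereas heartbeating immediately after registration and sleeping only between idle (NO_OP) heartbeats removes that gap without changing the steady-state interval.

{noformat}
import time

# Same default as the Slider agent constant under discussion.
HEARTBEAT_IDDLE_INTERVAL_SEC = 10


class AgentSession(object):
    """Hypothetical stand-in for the register/heartbeat loop in Controller.py.

    `transport` is assumed to expose register() and heartbeat(response_id)
    calls mirroring the AM-Agent REST protocol; the dict keys used below are
    illustrative, not the exact Slider wire format.
    """

    def __init__(self, transport):
        self.transport = transport
        self.response_id = 0

    def run(self):
        # Register with the AM; the AM answers with a registration response
        # such as RegistrationResponse{response=OK, responseId=0, ...}.
        self.transport.register()

        while True:
            # Heartbeat immediately after registration (and after finishing
            # commands), rather than sleeping HEARTBEAT_IDDLE_INTERVAL_SEC
            # before the very first heartbeat -- that up-front sleep is the
            # ~10 s gap visible in the agent logs above.
            response = self.transport.heartbeat(self.response_id)
            self.response_id = response.get('responseId', self.response_id + 1)

            commands = response.get('executionCommands') or []
            if commands:
                for cmd in commands:   # e.g. INSTALL, then START
                    self.execute(cmd)
                continue               # report results right away, no idle wait

            # NO_OP heartbeat: nothing to do, so wait the idle interval
            # before polling the AM again.
            time.sleep(HEARTBEAT_IDDLE_INTERVAL_SEC)

    def execute(self, cmd):
        pass  # placeholder for real command handling
{noformat}

With this shape, lowering HEARTBEAT_IDDLE_INTERVAL_SEC to 1 sec (or even 10 ms) only changes how quickly an idle agent polls again; the time from registration to the first INSTALL no longer depends on it, which matches the narrower fix this jira is scoped to.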