[ https://issues.apache.org/jira/browse/SLIDER-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131415#comment-16131415 ]
Gour Saha commented on SLIDER-1236:
-----------------------------------
Right, the total number of containers is the real key here, not technically the number of apps. I used apps as the count with a standard set of containers per app in mind. We can refer explicitly to the number of containers, although for N containers spread across 10 apps rather than 1, the chances of containers continuing to hit ZK are higher in the 10-app scenario, since 10 AMs (instead of 1) have to recover before all N containers stop hitting ZK.

While experimenting to find an ideal value for HEARTBEAT_IDDLE_INTERVAL_SEC, at one point I brought it down to 10 ms (0.01 sec) and the app came up fine. The agents logged multiple NO_OP heartbeats between every two heartbeats that did something meaningful like INSTALL and START, so the AM-Agent protocol works fine with the reduced wait. However, I think we need to test with an app with 100s of containers (or more) to validate that AM robustness does not degrade; the reduction of the heartbeat interval from 10 sec to 1 sec needs more testing.

Also, as you mentioned, there are several other exception code paths with a potential wait of up to 30 secs which are not addressed. I will create a separate jira to address all the code blocks affected by HEARTBEAT_IDDLE_INTERVAL_SEC, since we need to ensure an app recovers fast in all failure scenarios. Let's keep this jira for only the wait between registration and the first heartbeat, which is what this bug was actually filed for. Let me first back out the change where I reduced HEARTBEAT_IDDLE_INTERVAL_SEC from 10 to 1. What do you think [~billie.rinaldi]?


> Unnecessary 10 second sleep before installation
> -----------------------------------------------
>
>                 Key: SLIDER-1236
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1236
>             Project: Slider
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Gour Saha
>
> Noticed when starting LLAP on a 2-node cluster. Slider AM logs:
> {noformat}
> 2017-05-22 22:04:33,047 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Registration response: RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> ...
> 2017-05-22 22:04:34,946 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Registration response: RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> {noformat}
> Then nothing useful goes on for a while, until:
> {noformat}
> 2017-05-22 22:04:43,099 [956937652@qtp-624693846-4] INFO agent.AgentProviderService - Installing LLAP on container_1495490227300_0002_01_000002.
> {noformat}
> If you look at the corresponding logs from both agents, you can see that they both have a gap of pretty much exactly 10 sec.
> After the gap, they talk back to the AM; roughly 30 ms after the end of each container's gap, presumably after hearing from it, the AM starts installing LLAP.
> {noformat}
> INFO 2017-05-22 22:04:33,055 Controller.py:180 - Registered with the server with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:33,055 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Sending heartbeat with response id: 0 and timestamp: 1495490683064. Command(s) in progress: False. Components mapped: False
> INFO 2017-05-22 22:04:34,948 Controller.py:180 - Registered with the server with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:34,948 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:44,959 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:44,960 AgentToggleLogger.py:40 - Sending heartbeat with response id: 0 and timestamp: 1495490684959. Command(s) in progress: False. Components mapped: False
> {noformat}
> I've observed the same on multiple different clusters.
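For reference, below is a minimal sketch of the agent-side register/heartbeat loop being discussed. It is not the actual Slider Controller.py code: the AgentSession class, the transport.register()/heartbeat() helpers, and the response field names are hypothetical stand-ins modeled loosely on the log output. The only point it illustrates is that sleeping HEARTBEAT_IDDLE_INTERVAL_SEC before the very first heartbeat delays the first INSTALL by the full interval, whereas heartbeating immediately after registration and sleeping only between idle (NO_OP) heartbeats removes that gap without changing the steady-state interval.

{noformat}
import time

# Same default as the Slider agent constant under discussion.
HEARTBEAT_IDDLE_INTERVAL_SEC = 10


class AgentSession(object):
    """Hypothetical stand-in for the register/heartbeat loop in Controller.py.

    `transport` is assumed to expose register() and heartbeat(response_id)
    calls mirroring the AM-Agent REST protocol; the dict keys used below are
    illustrative, not the exact Slider wire format.
    """

    def __init__(self, transport):
        self.transport = transport
        self.response_id = 0

    def run(self):
        # Register with the AM; the AM answers with a registration response
        # such as RegistrationResponse{response=OK, responseId=0, ...}.
        self.transport.register()

        while True:
            # Heartbeat immediately after registration (and after finishing
            # commands), rather than sleeping HEARTBEAT_IDDLE_INTERVAL_SEC
            # before the very first heartbeat -- that up-front sleep is the
            # ~10 s gap visible in the agent logs above.
            response = self.transport.heartbeat(self.response_id)
            self.response_id = response.get('responseId', self.response_id + 1)

            commands = response.get('executionCommands') or []
            if commands:
                for cmd in commands:   # e.g. INSTALL, then START
                    self.execute(cmd)
                continue               # report results right away, no idle wait

            # NO_OP heartbeat: nothing to do, so wait the idle interval
            # before polling the AM again.
            time.sleep(HEARTBEAT_IDDLE_INTERVAL_SEC)

    def execute(self, cmd):
        pass  # placeholder for real command handling
{noformat}

With this shape, lowering HEARTBEAT_IDDLE_INTERVAL_SEC to 1 sec (or even 10 ms) only changes how quickly an idle agent polls again; the time from registration to the first INSTALL no longer depends on it, which matches the narrower fix this jira is scoped to.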