[ https://issues.apache.org/jira/browse/SLIDER-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133172#comment-16133172 ]
Billie Rinaldi commented on SLIDER-1236:
----------------------------------------

I think it would be okay to leave in the change of HEARTBEAT_IDDLE_INTERVAL_SEC from 10 to 1 here, but I agree we should open an additional ticket to evaluate failure scenarios further.

> Unnecessary 10 second sleep before installation
> -----------------------------------------------
>
> Key: SLIDER-1236
> URL: https://issues.apache.org/jira/browse/SLIDER-1236
> Project: Slider
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Gour Saha
>
> Noticed when starting LLAP on a 2-node cluster. Slider AM logs:
> {noformat}
> 2017-05-22 22:04:33,047 [956937652@qtp-624693846-4] INFO
> agent.AgentProviderService - Registration response:
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> ...
> 2017-05-22 22:04:34,946 [956937652@qtp-624693846-4] INFO
> agent.AgentProviderService - Registration response:
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> {noformat}
> Then nothing useful goes on for a while, until:
> {noformat}
> 2017-05-22 22:04:43,099 [956937652@qtp-624693846-4] INFO
> agent.AgentProviderService - Installing LLAP on
> container_1495490227300_0002_01_000002.
> {noformat}
> If you look at the corresponding logs from both agents, you can see that they
> both have a gap that's pretty much exactly 10sec.
> After the gap, they talk back to AM; after ~30ms for each container
> (corresponding to the end of its gap), presumably after hearing from it, the
> AM starts installing LLAP.
> {noformat}
> INFO 2017-05-22 22:04:33,055 Controller.py:180 - Registered with the server
> with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:33,055 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Queue result:
> {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Sending heartbeat with
> response id: 0 and timestamp: 1495490683064. Command(s) in progress: False.
> Components mapped: False
> INFO 2017-05-22 22:04:34,948 Controller.py:180 - Registered with the server
> with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:34,948 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:44,959 AgentToggleLogger.py:40 - Queue result:
> {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:44,960 AgentToggleLogger.py:40 - Sending heartbeat with
> response id: 0 and timestamp: 1495490684959. Command(s) in progress: False.
> Components mapped: False
> {noformat}
> I've observed the same on multiple different clusters.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
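
The roughly 10-second gap described in the report can be checked directly from the quoted agent log timestamps. The following is a minimal sketch, not part of Slider: the labels `agent-1` and `agent-2` and the helper `idle_gap_seconds` are hypothetical names introduced for illustration, and the timestamps are copied from the "Response from server = OK" and "Queue result" lines in the quoted logs.

```python
from datetime import datetime

# Timestamp layout used by the agent log lines quoted in the report.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# (registration response logged, next log line after the gap) per agent.
# The agent labels are hypothetical; the timestamps come from the quoted logs.
AGENT_EVENTS = {
    "agent-1": ("2017-05-22 22:04:33,055", "2017-05-22 22:04:43,065"),
    "agent-2": ("2017-05-22 22:04:34,948", "2017-05-22 22:04:44,959"),
}


def idle_gap_seconds(registered_at: str, next_activity_at: str) -> float:
    """Return the seconds elapsed between two log timestamps."""
    start = datetime.strptime(registered_at, LOG_TIME_FORMAT)
    end = datetime.strptime(next_activity_at, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# Compute the idle gap for each agent.
gaps = {name: idle_gap_seconds(*times) for name, times in AGENT_EVENTS.items()}

for name, gap in gaps.items():
    print(f"{name}: idle for {gap:.3f} s after registration")
```

Both gaps land within a few milliseconds of 10 seconds, which matches the report's observation and the default value of 10 discussed for HEARTBEAT_IDDLE_INTERVAL_SEC in the comment.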