[ https://issues.apache.org/jira/browse/SLIDER-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130893#comment-16130893 ]

Billie Rinaldi commented on SLIDER-1236:
----------------------------------------

It might be easiest to make it configurable through the agent.ini file (though 
configuring that file is relatively inconvenient). The number of apps trying to 
reconnect to their AM matters less than the number of containers. Is it still 
okay when the AM of an app with a large number of containers fails? Will the AM 
easily support 10x as many heartbeats when there are many containers? What 
happens if the AM hasn't failed and simply wasn't responding for 3 seconds? (I 
would imagine the AM is robust to unnecessary re-registration, but I'm not 
familiar with that part of the code.)
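
For illustration, here is a minimal sketch of what an agent.ini-driven interval 
could look like on the agent side. The section and key names ([heartbeat] / 
interval_secs) are hypothetical, not the actual Slider agent configuration, and 
the 10-second default reflects the interval implied by the logs below.

{noformat}
# Hypothetical sketch only; the agent.ini section/key names are made up.
import ConfigParser  # Python 2, matching the Slider agent scripts
import time

DEFAULT_HEARTBEAT_SECS = 10  # current interval implied by the agent logs

def read_heartbeat_interval(path='agent.ini'):
    config = ConfigParser.SafeConfigParser()
    config.read(path)
    try:
        return config.getfloat('heartbeat', 'interval_secs')
    except (ConfigParser.NoSectionError, ConfigParser.NoOptionError):
        return DEFAULT_HEARTBEAT_SECS

def heartbeat_loop(send_heartbeat):
    interval = read_heartbeat_interval()
    while True:
        send_heartbeat()
        time.sleep(interval)
{noformat}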

It also appears that there are random sleeps of between 0 and 30 seconds if an 
exception is thrown during a registration or heartbeat (presumably the sleep is 
randomized so that retries are staggered when there are a lot of agents). I 
don't know what would cause the exception, but perhaps that 30-second cap 
should be lowered a bit as well?
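
A sketch of the randomized retry delay being described, with the 30-second cap 
pulled out as a single constant so it could be lowered (or made configurable) 
in one place; the names here are illustrative and do not match the actual agent 
code.

{noformat}
# Illustrative only: randomized sleep after a failed registration or heartbeat,
# staggering retries so many agents don't hammer the AM in lockstep.
import random
import time

MAX_RETRY_SLEEP_SECS = 30  # the cap in question

def retry_forever(attempt_once):
    while True:
        try:
            return attempt_once()
        except Exception:
            # Random delay in [0, MAX_RETRY_SLEEP_SECS) so agents that failed
            # at the same moment retry at different times.
            time.sleep(random.uniform(0, MAX_RETRY_SLEEP_SECS))
{noformat}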

> Unnecessary 10 second sleep before installation
> -----------------------------------------------
>
>                 Key: SLIDER-1236
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1236
>             Project: Slider
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Gour Saha
>
> Noticed when starting LLAP on a 2-node cluster. Slider AM logs:
> {noformat}
> 2017-05-22 22:04:33,047 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Registration response: 
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> ...
> 2017-05-22 22:04:34,946 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Registration response: 
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> {noformat}
> Then nothing useful happens for a while, until:
> {noformat}
> 2017-05-22 22:04:43,099 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Installing LLAP on 
> container_1495490227300_0002_01_000002.
> {noformat}
> The corresponding logs from both agents show that each has a gap of almost 
> exactly 10 seconds.
> After the gap, the agents talk back to the AM; roughly 30 ms after the end of 
> each container's gap, presumably after hearing from that container, the AM 
> starts installing LLAP.
> {noformat}
> INFO 2017-05-22 22:04:33,055 Controller.py:180 - Registered with the server 
> with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:33,055 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Queue result: 
> {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Sending heartbeat with 
> response id: 0 and timestamp: 1495490683064. Command(s) in progress: False. 
> Components mapped: False
> INFO 2017-05-22 22:04:34,948 Controller.py:180 - Registered with the server 
> with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:34,948 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:44,959 AgentToggleLogger.py:40 - Queue result: 
> {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:44,960 AgentToggleLogger.py:40 - Sending heartbeat with 
> response id: 0 and timestamp: 1495490684959. Command(s) in progress: False. 
> Components mapped: False
> {noformat}
> I've observed the same on multiple different clusters.
