[ https://issues.apache.org/jira/browse/SLIDER-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130761#comment-16130761 ]

Gour Saha commented on SLIDER-1236:
-----------------------------------

[~billie.rinaldi] very good point. I tested this scenario as well. Most Slider 
applications, specifically the power-user ones like LLAP that serve end-user 
requests and run queries on the order of milliseconds, are complaining that 
Slider is actually slowing them down. It slows them down during start, and it 
slows them down during recovery on AM failures (this is where the app 
containers are running fine, yet Ambari continues to report that HiveServer2 
Interactive is down). They also point out that other competing frameworks 
start applications on the order of 1-2 secs for a reasonable number of 
containers.
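
The 10 sec gap itself is easy to read off the agent timestamps quoted in the 
description below; a quick back-of-the-envelope check in Python:

{noformat}
from datetime import datetime

# Timestamps copied from the first agent's log in the description below.
fmt = "%Y-%m-%d %H:%M:%S,%f"
registered = datetime.strptime("2017-05-22 22:04:33,055", fmt)
first_heartbeat = datetime.strptime("2017-05-22 22:04:43,065", fmt)

# registration -> first heartbeat: 10.010 seconds, i.e. the fixed delay
print((first_heartbeat - registered).total_seconds())
{noformat}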

I think Slider is being overcautious here. I personally also think 30 secs is 
really high. With a 30 sec delay, 10 apps all trying to fail over to their AMs 
can quickly create the 3 sec scenario. Having said that, I agree it might be 
an issue in problematic clusters where thousands of AMs fail frequently and 
simultaneously. It might be a good idea to expose a cluster-level Slider 
configuration for this property, where admins can set a value that reflects 
the typical number of applications running in the cluster. We can also 
document suggested values for various ranges of running apps. What do you 
think [~billie.rinaldi]?
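
To be concrete, here is a minimal sketch of what such a knob could look like 
on the agent side. The property name agent.first.heartbeat.delay.secs and the 
10 sec fallback are illustrative only, not existing Slider keys:

{noformat}
import time

# Hypothetical property name and default -- illustrative only, not an
# existing Slider configuration key.
FIRST_HEARTBEAT_DELAY_KEY = "agent.first.heartbeat.delay.secs"
DEFAULT_FIRST_HEARTBEAT_DELAY_SECS = 10  # today's hard-coded behavior

def first_heartbeat_delay(cluster_config):
    # Admins set this cluster-wide to reflect the typical number of
    # applications in the cluster; small clusters could drop it to ~1 sec.
    return float(cluster_config.get(FIRST_HEARTBEAT_DELAY_KEY,
                                    DEFAULT_FIRST_HEARTBEAT_DELAY_SECS))

def agent_main_loop(cluster_config, register, heartbeat):
    register()                                         # "Registered with the server"
    time.sleep(first_heartbeat_delay(cluster_config))  # the 10 sec gap today
    while True:
        heartbeat()                                    # "Sending heartbeat ..."
        time.sleep(10)                                 # steady-state interval (illustrative)
{noformat}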

> Unnecessary 10 second sleep before installation
> -----------------------------------------------
>
>                 Key: SLIDER-1236
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1236
>             Project: Slider
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Gour Saha
>
> Noticed when starting LLAP on a 2-node cluster. Slider AM logs:
> {noformat}
> 2017-05-22 22:04:33,047 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Registration response: 
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> ...
> 2017-05-22 22:04:34,946 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Registration response: 
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> {noformat}
> Then nothing useful goes on for a while, until:
> {noformat}
> 2017-05-22 22:04:43,099 [956937652@qtp-624693846-4] INFO  
> agent.AgentProviderService - Installing LLAP on 
> container_1495490227300_0002_01_000002.
> {noformat}
> If you look at the corresponding logs from both agents, you can see that they 
> both have a gap that's pretty much exactly 10 sec.
> After the gap, they talk back to the AM; ~30 ms after the end of each 
> container's gap, presumably upon hearing from that container, the AM starts 
> installing LLAP.
> {noformat}
> INFO 2017-05-22 22:04:33,055 Controller.py:180 - Registered with the server 
> with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:33,055 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Queue result: 
> {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:43,065 AgentToggleLogger.py:40 - Sending heartbeat with 
> response id: 0 and timestamp: 1495490683064. Command(s) in progress: False. 
> Components mapped: False
> INFO 2017-05-22 22:04:34,948 Controller.py:180 - Registered with the server 
> with {u'exitstatus': 0,
> INFO 2017-05-22 22:04:34,948 Controller.py:630 - Response from server = OK
> INFO 2017-05-22 22:04:44,959 AgentToggleLogger.py:40 - Queue result: 
> {'componentStatus': [], 'reports': []}
> INFO 2017-05-22 22:04:44,960 AgentToggleLogger.py:40 - Sending heartbeat with 
> response id: 0 and timestamp: 1495490684959. Command(s) in progress: False. 
> Components mapped: False
> {noformat}
> I've observed the same on multiple different clusters.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
