Can you check the slider agent logs and the application logs in those 
containers to see if they are failing with some exception?

The fishy thing I found in the AM log is messages like the one below saying 
"local-dirs are bad". Can you check what's going on with these dirs?

2018-04-03 18:38:28,200 [AMRM Callback Handler Thread] INFO  
appmaster.SliderAppMaster - onNodesUpdated(1)
2018-04-03 18:38:28,376 [AMRM Callback Handler Thread] INFO  
appmaster.SliderAppMaster - Updated nodes [nodeId { host: "***" port: 45454 } 
httpAddress: "***:8042" rackName: "/EI105" used { memory: 0 virtual_cores: 0 } 
capability { memory: 364544 virtual_cores: 38 } node_state: NS_UNHEALTHY 
health_report: "10/12 local-dirs are bad: 
/grid/9/hadoop/yarn/local,/grid/2/hadoop/yarn/local,/grid/1/hadoop/yarn/local,/grid/5/hadoop/yarn/local,/grid/11/hadoop/yarn/local,/grid/3/hadoop/yarn/local,/grid/8/hadoop/yarn/local,/grid/6/hadoop/yarn/local,/grid/0/hadoop/yarn/local,/grid/7/hadoop/yarn/local;
 10/12 log-dirs are bad: 
/grid/6/hadoop/yarn/log,/grid/8/hadoop/yarn/log,/grid/2/hadoop/yarn/log,/grid/1/hadoop/yarn/log,/grid/5/hadoop/yarn/log,/grid/11/hadoop/yarn/log,/grid/7/hadoop/yarn/log,/grid/9/hadoop/yarn/log,/grid/0/hadoop/yarn/log,/grid/3/hadoop/yarn/log"
 last_health_report_time: 1522798707678]
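
In case it helps, here is a rough Python sketch for checking those dirs on the unhealthy node. It is not part of Slider or YARN; the function names are made up for illustration. It pulls the paths out of a health_report string like the one above and checks each for existence, access permissions, and disk utilization. The 90% threshold is an assumption based on what I believe is the default of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage; adjust it if your cluster overrides that setting.

```python
import os
import shutil

# Assumed threshold; believed to match the YARN default for
# yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
MAX_UTILIZATION_PCT = 90.0


def parse_bad_dirs(health_report):
    """Return the directory paths listed after each 'are bad:' in the report."""
    dirs = []
    for part in health_report.split(";"):
        _, sep, paths = part.partition("are bad:")
        if sep:
            dirs.extend(p.strip() for p in paths.split(",") if p.strip())
    return dirs


def check_dir(path, max_pct=MAX_UTILIZATION_PCT):
    """Return a short status string for one directory (hypothetical helper)."""
    if not os.path.isdir(path):
        return "missing"
    # The current user needs read, write and execute access to the directory.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        return "not accessible"
    usage = shutil.disk_usage(path)
    pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    if pct > max_pct:
        return "disk %.1f%% full" % pct
    return "ok"


if __name__ == "__main__":
    # Shortened sample report; paste the real health_report text here.
    report = ("2/12 local-dirs are bad: "
              "/grid/9/hadoop/yarn/local,/grid/2/hadoop/yarn/local; "
              "1/12 log-dirs are bad: /grid/6/hadoop/yarn/log")
    for d in parse_bad_dirs(report):
        print(d, "->", check_dir(d))
```

Run it on the node reported as NS_UNHEALTHY, as the same user the NodeManager runs as, since the permission check only reflects the current user.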

-Gour

On 4/3/18, 10:49 PM, "David.Serafini" <david.seraf...@target.com> wrote:

    I've attached what I can find.  
    
    
    On 4/3/18, 10:38 PM, Gour Saha <gs...@hortonworks.com> wrote:
    
        Can you share the logs of the dying containers and the AM to debug further?
        
        -Gour
        
        On 4/3/18, 6:49 PM, "David.Serafini" <david.seraf...@target.com> wrote:
        
            I've been using slider 0.91 for a year and it's been very stable lately.
            I built 0.92 to test it and my yarn containers are dying after 10 minutes.
            Slider restarts them successfully, but this isn't acceptable behavior.
            Any thoughts on what could be going on?  
            
            I looked for some kind of release notes for 0.92, but didn't find anything except a list of ticket ids.
            Is there some configuration in my job that I should have changed to use 0.92?
            
            Thanks,
            -david
            
            
            
        
        
    
    
