[ https://issues.apache.org/jira/browse/SLIDER-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235567#comment-14235567 ]
Jonathan Maron commented on SLIDER-629:
---------------------------------------

Enabled the failure window task. Is there a way to recreate this issue? There appear to be multiple code paths for container failure, so it would be easier to analyze a running system failing in this manner to ascertain why the failure count may be incremented and yet not be checked via one of these code paths.

> Slider's count of failure threshold may not be accurate or it could be a logging issue
> --------------------------------------------------------------------------------------
>
>                 Key: SLIDER-629
>                 URL: https://issues.apache.org/jira/browse/SLIDER-629
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster
>    Affects Versions: Slider 0.50
>            Reporter: Sumit Mohanty
>            Assignee: Jonathan Maron
>             Fix For: Slider 0.70
>
>
> One of the long running HBase tests failed with the following error:
> {noformat}
> 2014-11-08 01:07:26,407 [AmExecutor-008] ERROR appmaster.SliderAppMaster - Cluster teardown triggered
> org.apache.slider.core.exceptions.TriggerClusterTeardownException: Unstable Application Instance : - failed with component HBASE_REGIONSERVER failing 8 times (0 in startup); threshold is 5 - last failure: Failure container_1415341585168_0005_01_000008 on host onprem-slider23: http://onprem-slider21:19888/jobhistory/logs/onprem-slider23:45454/container_1415341585168_0005_01_000008/ctx/hadoop
> {noformat}
> However, a total of "9" REGION_SERVERs were created.
> {noformat}
> 2014-11-07 16:00:35,346 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000002, on onprem-slider25:45454,
> 2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000005, on onprem-slider24:45454,
> 2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000007, on onprem-slider22:45454,
> 2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000008, on onprem-slider23:45454,
> 2014-11-07 23:51:20,040 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000009, on onprem-slider22:45454,
> 2014-11-07 23:58:44,810 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000013, on onprem-slider24:45454,
> 2014-11-08 00:12:17,804 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000015, on onprem-slider22:45454,
> 2014-11-08 00:15:57,373 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000018, on onprem-slider25:45454,
> 2014-11-08 01:06:36,771 [AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000020, on onprem-slider25:45454,
> {noformat}
> As the ask was for 4 but 9 were created, there were evidently 5 failures, not the 8 reported in the error.
> Perhaps it's a logging issue. Can we also print the window, e.g. 5 failures in X minutes or hours?
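
The arithmetic in the description (9 assignments against a request for 4, hence 5 replacements) and the suggested "failures in X minutes or hours" report can be sketched in Python as below. This is only an illustration under stated assumptions, not Slider's implementation: the helper names are hypothetical, every assignment beyond the requested count is assumed to replace a failed container, the one-hour and two-hour windows are arbitrary choices, and the log entries are the ones quoted above with the mail wrapping undone.

```python
import re
from datetime import datetime, timedelta

# The nine AppState assignment entries from the issue, one entry per line.
_PREFIX = "[AMRM Callback Handler Thread] INFO state.AppState - Assigning role HBASE_REGIONSERVER to container "
SAMPLE_LOG = "\n".join(
    f"{ts} {_PREFIX}container_1415341585168_0005_01_{cid}, on {host}:45454,"
    for ts, cid, host in [
        ("2014-11-07 16:00:35,346", "000002", "onprem-slider25"),
        ("2014-11-07 16:00:35,347", "000005", "onprem-slider24"),
        ("2014-11-07 16:00:35,347", "000007", "onprem-slider22"),
        ("2014-11-07 16:00:35,347", "000008", "onprem-slider23"),
        ("2014-11-07 23:51:20,040", "000009", "onprem-slider22"),
        ("2014-11-07 23:58:44,810", "000013", "onprem-slider24"),
        ("2014-11-08 00:12:17,804", "000015", "onprem-slider22"),
        ("2014-11-08 00:15:57,373", "000018", "onprem-slider25"),
        ("2014-11-08 01:06:36,771", "000020", "onprem-slider25"),
    ]
)

# Timestamp of the "Cluster teardown triggered" entry in the issue.
TEARDOWN = datetime(2014, 11, 8, 1, 7, 26)

# Non-greedy match so each entry is parsed on its own line.
ASSIGN_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*?"
    r"Assigning role (?P<role>\S+) to container (?P<container>container_\w+)"
)


def parse_assignments(log_text, role):
    """Return (timestamp, container_id) pairs for one role, in log order."""
    found = []
    for match in ASSIGN_RE.finditer(log_text):
        if match.group("role") == role:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            found.append((ts, match.group("container")))
    return found


def replacements(assignments, requested):
    """Assumption: anything beyond the first `requested` assignments is a replacement."""
    return assignments[requested:]


def count_in_window(events, end, window):
    """Count events whose timestamp falls in the half-open interval (end - window, end]."""
    return sum(1 for ts, _ in events if end - window < ts <= end)


if __name__ == "__main__":
    assigned = parse_assignments(SAMPLE_LOG, "HBASE_REGIONSERVER")
    replaced = replacements(assigned, requested=4)
    print(f"assigned={len(assigned)} replacements={len(replaced)}")
    for hours in (1, 2):
        n = count_in_window(replaced, TEARDOWN, timedelta(hours=hours))
        print(f"{n} replacements in the last {hours} hour(s) before teardown")
```

Reporting the count together with its window, as the last loop does, is the kind of message the description asks for; whether the 8 in the exception reflects a different window or a different code path cannot be determined from these logs alone.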
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)