I applied the fix, but the problem is still there.
Actually, the steps to reproduce in SLIDER-439 are different from mine.
What I do is first run the "freeze" command and then kill one node manager. I wait long enough for the node manager to leave the YARN cluster, and then run the "thaw" command to restart the application. However, the instance that was running on the killed node is not able to restart.

Here is part of the log.

14/10/28 18:25:42 INFO mortbay.log: Logging to 
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
14/10/28 18:25:42 INFO zookeeper.ClientCnxn: Session establishment complete on 
server vertica1/172.17.42.1:16433, sessionid = 0x14957f07d6f011f, negotiated 
timeout = 40000
14/10/28 18:25:42 INFO state.ConnectionStateManager: State change: CONNECTED
14/10/28 18:25:42 INFO mortbay.log: jetty-6.1.26
Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.PackagesResourceConfig init
INFO: Scanning for root resource and provider classes in the packages:
  org.apache.slider.server.appmaster.web.rest.agent
Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.ScanningResourceConfig 
logClasses
INFO: Root resource classes found:
  class org.apache.slider.server.appmaster.web.rest.agent.AgentWebServices
Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.ScanningResourceConfig init
INFO: No provider classes found.
Oct 28, 2014 6:25:42 PM 
com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
14/10/28 18:25:43 INFO mortbay.log: Started 
SslSelectChannelConnector@0.0.0.0:46561
14/10/28 18:25:43 INFO mortbay.log: Started 
SslSelectChannelConnector@0.0.0.0:36451
14/10/28 18:25:43 INFO http.HttpRequestLog: Http request log for 
http.requests.slideram is not defined
14/10/28 18:25:43 INFO http.HttpServer2: Added global filter 'safety' 
(class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
14/10/28 18:25:43 INFO http.HttpServer2: Added filter AM_PROXY_FILTER 
(class=org.apache.slider.server.appmaster.web.SliderAmIpFilter) to context 
slideram
14/10/28 18:25:43 INFO http.HttpServer2: Added filter AM_PROXY_FILTER 
(class=org.apache.slider.server.appmaster.web.SliderAmIpFilter) to context 
static
14/10/28 18:25:43 INFO http.HttpServer2: adding path spec: /slideram/*
14/10/28 18:25:43 INFO http.HttpServer2: adding path spec: /ws/*
14/10/28 18:25:43 INFO http.HttpServer2: Jetty bound to port 47481
14/10/28 18:25:43 INFO mortbay.log: jetty-6.1.26
14/10/28 18:25:43 INFO mortbay.log: Extract 
jar:file:/home/rzhang/Slider_Vertica/Linux64/Test/verticadb1000/HDP2_1/hadoop/local/usercache/rzhang/appcache/application_1414519516219_0002/filecache/18/slider.jar!/webapps/slideram
 to /tmp/Jetty_0_0_0_0_47481_slideram____.7p4s4g/webapp
14/10/28 18:25:43 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:47481
14/10/28 18:25:43 INFO webapp.WebApps: Web app /slideram started at 47481
14/10/28 18:25:43 INFO webapp.WebApps: Registered webapp guice modules
14/10/28 18:25:43 INFO appmaster.SliderAppMaster: Connecting to RM at 
46522,address tracking URL=http://vertica2.rzhang.com:47481
14/10/28 18:25:43 INFO agent.AgentUtils: Reading metainfo at 
hdfs://rzhang-HP-ZBook-15:10433/slider/slider_test.zip
14/10/28 18:25:44 INFO tools.SliderUtils: Reading metainfo.xml of size 3193
14/10/28 18:25:44 INFO agent.HeartbeatMonitor: Starting heartbeat monitor with 
interval 60000
14/10/28 18:25:44 INFO state.AppState: Adding new role VERTICA_MASTER
14/10/28 18:25:44 INFO state.AppState: Role VERTICA_MASTER assigned priority 1
14/10/28 18:25:44 INFO state.AppState: Adding new role VERTICA_SLAVE
14/10/28 18:25:44 INFO state.AppState: Role VERTICA_SLAVE assigned priority 2
14/10/28 18:25:44 INFO state.AppState: Role slider-appmaster flexed from 0 to 1
14/10/28 18:25:44 INFO state.AppState: Role VERTICA_SLAVE flexed from 0 to 2
14/10/28 18:25:44 INFO state.AppState: Role VERTICA_MASTER flexed from 0 to 1
14/10/28 18:25:44 INFO state.RoleHistory: loaded history from 
hdfs://rzhang-HP-ZBook-15:10433/user/rzhang/.slider/cluster/slider_test/history/rolehistory-0000014957f14d86.json
14/10/28 18:25:44 INFO appmaster.SliderAppMaster: service instances already 
running: []
14/10/28 18:25:44 INFO curator.RegistryBinderService: registering 
ServiceInstance{name='org-apache-slider', id='slider_test', 
address='172.17.0.3', port=47481, sslPort=null, 
payload=ServiceInstanceData{id='slider_test', serviceType='org-apache-slider'}, 
registrationTimeUTC=1414520744939, serviceType=DYNAMIC, 
uriSpec=org.apache.curator.x.discovery.UriSpec@54515c2}
14/10/28 18:25:45 INFO curator.RegistryBinderService: registration completed 
ServiceInstance{name='org-apache-slider', id='slider_test', 
address='172.17.0.3', port=47481, sslPort=null, 
payload=ServiceInstanceData{id='slider_test', serviceType='org-apache-slider'}, 
registrationTimeUTC=1414520744939, serviceType=DYNAMIC, 
uriSpec=org.apache.curator.x.discovery.UriSpec@54515c2}
14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Chaos monkey disabled
14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Adding Chaos Monkey scheduled 
every 0 seconds (0 hours)
14/10/28 18:25:45 INFO workflow.WorkflowCompositeService: Child service 
completed Service SliderAMProviderService in state SliderAMProviderService: 
STOPPED; current service null; queued service count=0
14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Process has exited with exit 
code 0 mapped to 0 -ignoring
14/10/28 18:25:45 INFO state.AppState: RoleStatus{name='VERTICA_SLAVE', key=2, 
desired=2, actual=0, requested=0, releasing=0, failed=0, started=0, 
startFailed=0, completed=0, failureMessage=''}
14/10/28 18:25:45 INFO state.AppState: VERTICA_SLAVE: Asking for 2 more 
nodes(s) for a total of 2
14/10/28 18:25:45 INFO state.RoleHistory: There're 2 nodes to consider for 
VERTICA_SLAVE
14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for 
container on vertica2.rzhang.com
14/10/28 18:25:45 INFO state.AppState: Container ask is Capability[<memory:1024, 
vCores:1>]Priority[2]
14/10/28 18:25:45 INFO state.RoleHistory: There're 1 nodes to consider for 
VERTICA_SLAVE
14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for 
container on vertica0.rzhang.com
14/10/28 18:25:45 INFO state.AppState: Container ask is Capability[<memory:1024, 
vCores:1>]Priority[2]
14/10/28 18:25:45 INFO state.AppState: RoleStatus{name='VERTICA_MASTER', key=1, 
desired=1, actual=0, requested=0, releasing=0, failed=0, started=0, 
startFailed=0, completed=0, failureMessage=''}
14/10/28 18:25:45 INFO state.AppState: VERTICA_MASTER: Asking for 1 more 
nodes(s) for a total of 1
14/10/28 18:25:45 INFO state.RoleHistory: There're 1 nodes to consider for 
VERTICA_MASTER
14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for 
container on vertica1
14/10/28 18:25:45 INFO state.AppState: Container ask is Capability[<memory:1024, 
vCores:1>]Priority[1]
14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica2.rzhang.com to 
/default-rack
14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica0.rzhang.com to 
/default-rack
14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica1 to /default-rack
14/10/28 18:25:46 INFO impl.AMRMClientImpl: Received new token for : 
vertica0.rzhang.com:54106
14/10/28 18:25:46 INFO impl.AMRMClientImpl: Received new token for : 
vertica2.rzhang.com:41175
14/10/28 18:25:46 INFO appmaster.SliderAppMaster: onContainersAllocated(2)
14/10/28 18:25:46 INFO state.AppState: Assigning role VERTICA_SLAVE to 
container container_1414519516219_0002_01_000002, on vertica0.rzhang.com:54106,
14/10/28 18:25:46 INFO state.AppState: Assigning role VERTICA_SLAVE to 
container container_1414519516219_0002_01_000003, on vertica2.rzhang.com:41175,
14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Diagnostics: 
RoleStatus{name='slider-appmaster', key=0, desired=1, actual=0, requested=0, 
releasing=0, failed=0, started=0, startFailed=0, completed=0, failureMessage=''}
RoleStatus{name='VERTICA_SLAVE', key=2, desired=2, actual=2, requested=0, 
releasing=0, failed=0, started=0, startFailed=0, completed=0, failureMessage=''}
RoleStatus{name='VERTICA_MASTER', key=1, desired=1, actual=0, requested=1, 
releasing=0, failed=0, started=0, startFailed=0, completed=0, failureMessage=''}

14/10/28 18:25:46 INFO agent.AgentProviderService: Build launch context for 
Agent
14/10/28 18:25:46 INFO agent.AgentProviderService: Build launch context for 
Agent
14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_WORK_ROOT set to $PWD
14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_LOG_ROOT set to 
$LOG_DIRS
14/10/28 18:25:46 INFO agent.AgentProviderService: PYTHONPATH set to 
./infra/agent/slider-agent/
14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_WORK_ROOT set to $PWD
14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_LOG_ROOT set to 
$LOG_DIRS
14/10/28 18:25:46 INFO agent.AgentProviderService: PYTHONPATH set to 
./infra/agent/slider-agent/
14/10/28 18:25:46 INFO agent.AgentProviderService: Using 
./infra/agent/slider-agent/agent/main.py for agent.
14/10/28 18:25:46 INFO agent.AgentProviderService: Using 
./infra/agent/slider-agent/agent/main.py for agent.
14/10/28 18:25:46 INFO appmaster.RoleLaunchService: Starting container with 
command: python ./infra/agent/slider-agent/agent/main.py --label 
container_1414519516219_0002_01_000002___VERTICA_SLAVE --zk-quorum 
rzhang-HP-ZBook-15:16433 --zk-reg-path /registry/org-apache-slider/slider_test ;
14/10/28 18:25:46 INFO appmaster.RoleLaunchService: Starting container with 
command: python ./infra/agent/slider-agent/agent/main.py --label 
container_1414519516219_0002_01_000003___VERTICA_SLAVE --zk-quorum 
rzhang-HP-ZBook-15:16433 --zk-reg-path /registry/org-apache-slider/slider_test ;
14/10/28 18:25:46 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
START_CONTAINER for Container container_1414519516219_0002_01_000002
14/10/28 18:25:46 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
START_CONTAINER for Container container_1414519516219_0002_01_000003
14/10/28 18:25:46 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
vertica0.rzhang.com:54106
14/10/28 18:25:46 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
vertica2.rzhang.com:41175
14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Started Container 
container_1414519516219_0002_01_000002
14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Started Container 
container_1414519516219_0002_01_000003
14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Deployed instance of role 
VERTICA_SLAVE onto container_1414519516219_0002_01_000002
14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Registering component 
container_1414519516219_0002_01_000002
14/10/28 18:25:47 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
QUERY_CONTAINER for Container container_1414519516219_0002_01_000002
14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Deployed instance of role 
VERTICA_SLAVE onto container_1414519516219_0002_01_000003
14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Registering component 
container_1414519516219_0002_01_000003
14/10/28 18:25:47 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
QUERY_CONTAINER for Container container_1414519516219_0002_01_000003
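
To make the RoleStatus entries in the log above easier to compare, here is a minimal Python sketch that pulls the counters out of each entry and lists the roles whose "actual" count is below "desired". It only assumes the RoleStatus{...} text format shown in this log. The function names parse_role_status and unsatisfied_roles are made up for illustration and are not part of Slider.

```python
import re

# One RoleStatus{...} entry, as printed by state.AppState in the log above.
ROLE_STATUS = re.compile(r"RoleStatus\{name='(?P<name>[^']+)',\s*(?P<fields>[^}]*)\}")
# Integer counters such as desired=2 or actual=0 (failureMessage='' is skipped).
COUNTER = re.compile(r"(\w+)=(\d+)")


def parse_role_status(log_text):
    """Return {role name: {counter name: int}}, keeping the last entry seen per role."""
    # The mail client wrapped long log lines, so collapse every whitespace
    # run (including newlines) into a single space before matching.
    flat = " ".join(log_text.split())
    roles = {}
    for match in ROLE_STATUS.finditer(flat):
        counters = {key: int(value) for key, value in COUNTER.findall(match.group("fields"))}
        roles[match.group("name")] = counters
    return roles


def unsatisfied_roles(roles):
    """Return only the roles whose actual count is still below the desired count."""
    return {
        name: counters
        for name, counters in roles.items()
        if counters.get("actual", 0) < counters.get("desired", 0)
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python role_status.py slider-am.log
    with open(sys.argv[1]) as handle:
        for name, counters in unsatisfied_roles(parse_role_status(handle.read())).items():
            print(name, counters)
```

Run against the Diagnostics entry at 18:25:46, this would flag VERTICA_MASTER (desired=1, actual=0, requested=1), along with slider-appmaster, which is also printed with actual=0 there.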

Thanks,
Rui



On 10/28/2014 01:47 PM, Sumit Mohanty wrote:
There is a bug fix that went in a few days back -
https://issues.apache.org/jira/browse/SLIDER-439 - that specifically fixed
this issue.

thanks
-Sumit

On Tue, Oct 28, 2014 at 10:36 AM, Rui Zhang <rzh...@vertica.com> wrote:

Hi,

When I killed a node manager manually and restarted the application, it
seems that the instance that previously ran on that node manager was not
able to restart. Why is this? I think YARN should allocate a container on a
different machine for this instance, right?

Thanks,
Rui

--
Rui Zhang
Software engineer Intern
Vertica, an HP Company
rzh...@vertica.com



