I have now tried the latest one with Hadoop 2.6. One of the instances still cannot restart. The steps to reproduce are the same as in my previous email (the only change is "freeze" to "stop" and "thaw" to "start"). I am sure the resources are sufficient.

On 10/28/2014 02:48 PM, Billie Rinaldi wrote:
Do your nodes have enough resources for all of the requested components to
start?

On Tue, Oct 28, 2014 at 11:40 AM, Rui Zhang <rzh...@vertica.com> wrote:

I applied the fix, but it still does not work.
Actually, the steps to reproduce in SLIDER-439 are different from mine.
What I do is first use the "freeze" command and then kill one node manager.
I wait long enough for the node manager to leave the YARN cluster, and then
use the "thaw" command to restart.
However, the instance that was running on the killed node is not able to
restart.

Here is part of the log.

14/10/28 18:25:42 INFO mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
14/10/28 18:25:42 INFO zookeeper.ClientCnxn: Session establishment
complete on server vertica1/172.17.42.1:16433, sessionid =
0x14957f07d6f011f, negotiated timeout = 40000
14/10/28 18:25:42 INFO state.ConnectionStateManager: State change:
CONNECTED
14/10/28 18:25:42 INFO mortbay.log: jetty-6.1.26
Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.PackagesResourceConfig
init
INFO: Scanning for root resource and provider classes in the packages:
   org.apache.slider.server.appmaster.web.rest.agent
Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.ScanningResourceConfig
logClasses
INFO: Root resource classes found:
   class org.apache.slider.server.appmaster.web.rest.agent.AgentWebServices
Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.ScanningResourceConfig
init
INFO: No provider classes found.
Oct 28, 2014 6:25:42 PM 
com.sun.jersey.server.impl.application.WebApplicationImpl
_initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17
AM'
14/10/28 18:25:43 INFO mortbay.log: Started
SslSelectChannelConnector@0.0.0.0:46561
14/10/28 18:25:43 INFO mortbay.log: Started
SslSelectChannelConnector@0.0.0.0:36451
14/10/28 18:25:43 INFO http.HttpRequestLog: Http request log for
http.requests.slideram is not defined
14/10/28 18:25:43 INFO http.HttpServer2: Added global filter 'safety'
(class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
14/10/28 18:25:43 INFO http.HttpServer2: Added filter AM_PROXY_FILTER
(class=org.apache.slider.server.appmaster.web.SliderAmIpFilter) to
context slideram
14/10/28 18:25:43 INFO http.HttpServer2: Added filter AM_PROXY_FILTER
(class=org.apache.slider.server.appmaster.web.SliderAmIpFilter) to
context static
14/10/28 18:25:43 INFO http.HttpServer2: adding path spec: /slideram/*
14/10/28 18:25:43 INFO http.HttpServer2: adding path spec: /ws/*
14/10/28 18:25:43 INFO http.HttpServer2: Jetty bound to port 47481
14/10/28 18:25:43 INFO mortbay.log: jetty-6.1.26
14/10/28 18:25:43 INFO mortbay.log: Extract
jar:file:/home/rzhang/Slider_Vertica/Linux64/Test/verticadb1000/HDP2_1/hadoop/local/usercache/rzhang/appcache/application_1414519516219_0002/filecache/18/slider.jar!/webapps/slideram
to /tmp/Jetty_0_0_0_0_47481_slideram____.7p4s4g/webapp
14/10/28 18:25:43 INFO mortbay.log: Started
SelectChannelConnector@0.0.0.0:47481
14/10/28 18:25:43 INFO webapp.WebApps: Web app /slideram started at 47481
14/10/28 18:25:43 INFO webapp.WebApps: Registered webapp guice modules
14/10/28 18:25:43 INFO appmaster.SliderAppMaster: Connecting to RM at
46522,address tracking URL=http://vertica2.rzhang.com:47481
14/10/28 18:25:43 INFO agent.AgentUtils: Reading metainfo at
hdfs://rzhang-HP-ZBook-15:10433/slider/slider_test.zip
14/10/28 18:25:44 INFO tools.SliderUtils: Reading metainfo.xml of size 3193
14/10/28 18:25:44 INFO agent.HeartbeatMonitor: Starting heartbeat monitor
with interval 60000
14/10/28 18:25:44 INFO state.AppState: Adding new role VERTICA_MASTER
14/10/28 18:25:44 INFO state.AppState: Role VERTICA_MASTER assigned
priority 1
14/10/28 18:25:44 INFO state.AppState: Adding new role VERTICA_SLAVE
14/10/28 18:25:44 INFO state.AppState: Role VERTICA_SLAVE assigned
priority 2
14/10/28 18:25:44 INFO state.AppState: Role slider-appmaster flexed from 0
to 1
14/10/28 18:25:44 INFO state.AppState: Role VERTICA_SLAVE flexed from 0 to
2
14/10/28 18:25:44 INFO state.AppState: Role VERTICA_MASTER flexed from 0
to 1
14/10/28 18:25:44 INFO state.RoleHistory: loaded history from
hdfs://rzhang-HP-ZBook-15:10433/user/rzhang/.slider/cluster/slider_test/history/rolehistory-0000014957f14d86.json
14/10/28 18:25:44 INFO appmaster.SliderAppMaster: service instances
already running: []
14/10/28 18:25:44 INFO curator.RegistryBinderService: registering
ServiceInstance{name='org-apache-slider', id='slider_test',
address='172.17.0.3', port=47481, sslPort=null, 
payload=ServiceInstanceData{id='slider_test',
serviceType='org-apache-slider'}, registrationTimeUTC=1414520744939,
serviceType=DYNAMIC,
uriSpec=org.apache.curator.x.discovery.UriSpec@54515c2}
14/10/28 18:25:45 INFO curator.RegistryBinderService: registration
completed ServiceInstance{name='org-apache-slider', id='slider_test',
address='172.17.0.3', port=47481, sslPort=null, 
payload=ServiceInstanceData{id='slider_test',
serviceType='org-apache-slider'}, registrationTimeUTC=1414520744939,
serviceType=DYNAMIC,
uriSpec=org.apache.curator.x.discovery.UriSpec@54515c2}
14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Chaos monkey disabled
14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Adding Chaos Monkey
scheduled every 0 seconds (0 hours)
14/10/28 18:25:45 INFO workflow.WorkflowCompositeService: Child service
completed Service SliderAMProviderService in state SliderAMProviderService:
STOPPED; current service null; queued service count=0
14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Process has exited with
exit code 0 mapped to 0 -ignoring
14/10/28 18:25:45 INFO state.AppState: RoleStatus{name='VERTICA_SLAVE',
key=2, desired=2, actual=0, requested=0, releasing=0, failed=0, started=0,
startFailed=0, completed=0, failureMessage=''}
14/10/28 18:25:45 INFO state.AppState: VERTICA_SLAVE: Asking for 2 more
nodes(s) for a total of 2
14/10/28 18:25:45 INFO state.RoleHistory: There're 2 nodes to consider for
VERTICA_SLAVE
14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
container on vertica2.rzhang.com
14/10/28 18:25:45 INFO state.AppState: Container ask is
Capability[<memory:1024, vCores:1>]Priority[2]
14/10/28 18:25:45 INFO state.RoleHistory: There're 1 nodes to consider for
VERTICA_SLAVE
14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
container on vertica0.rzhang.com
14/10/28 18:25:45 INFO state.AppState: Container ask is
Capability[<memory:1024, vCores:1>]Priority[2]
14/10/28 18:25:45 INFO state.AppState: RoleStatus{name='VERTICA_MASTER',
key=1, desired=1, actual=0, requested=0, releasing=0, failed=0, started=0,
startFailed=0, completed=0, failureMessage=''}
14/10/28 18:25:45 INFO state.AppState: VERTICA_MASTER: Asking for 1 more
nodes(s) for a total of 1
14/10/28 18:25:45 INFO state.RoleHistory: There're 1 nodes to consider for
VERTICA_MASTER
14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
container on vertica1
14/10/28 18:25:45 INFO state.AppState: Container ask is
Capability[<memory:1024, vCores:1>]Priority[1]
14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica2.rzhang.com to
/default-rack
14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica0.rzhang.com to
/default-rack
14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica1 to
/default-rack
14/10/28 18:25:46 INFO impl.AMRMClientImpl: Received new token for :
vertica0.rzhang.com:54106
14/10/28 18:25:46 INFO impl.AMRMClientImpl: Received new token for :
vertica2.rzhang.com:41175
14/10/28 18:25:46 INFO appmaster.SliderAppMaster: onContainersAllocated(2)
14/10/28 18:25:46 INFO state.AppState: Assigning role VERTICA_SLAVE to
container container_1414519516219_0002_01_000002, on
vertica0.rzhang.com:54106,
14/10/28 18:25:46 INFO state.AppState: Assigning role VERTICA_SLAVE to
container container_1414519516219_0002_01_000003, on
vertica2.rzhang.com:41175,
14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Diagnostics:
RoleStatus{name='slider-appmaster', key=0, desired=1, actual=0,
requested=0, releasing=0, failed=0, started=0, startFailed=0, completed=0,
failureMessage=''}
RoleStatus{name='VERTICA_SLAVE', key=2, desired=2, actual=2, requested=0,
releasing=0, failed=0, started=0, startFailed=0, completed=0,
failureMessage=''}
RoleStatus{name='VERTICA_MASTER', key=1, desired=1, actual=0,
requested=1, releasing=0, failed=0, started=0, startFailed=0, completed=0,
failureMessage=''}

14/10/28 18:25:46 INFO agent.AgentProviderService: Build launch context
for Agent
14/10/28 18:25:46 INFO agent.AgentProviderService: Build launch context
for Agent
14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_WORK_ROOT set to
$PWD
14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_LOG_ROOT set to
$LOG_DIRS
14/10/28 18:25:46 INFO agent.AgentProviderService: PYTHONPATH set to
./infra/agent/slider-agent/
14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_WORK_ROOT set to
$PWD
14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_LOG_ROOT set to
$LOG_DIRS
14/10/28 18:25:46 INFO agent.AgentProviderService: PYTHONPATH set to
./infra/agent/slider-agent/
14/10/28 18:25:46 INFO agent.AgentProviderService: Using
./infra/agent/slider-agent/agent/main.py for agent.
14/10/28 18:25:46 INFO agent.AgentProviderService: Using
./infra/agent/slider-agent/agent/main.py for agent.
14/10/28 18:25:46 INFO appmaster.RoleLaunchService: Starting container
with command: python ./infra/agent/slider-agent/agent/main.py --label
container_1414519516219_0002_01_000002___VERTICA_SLAVE --zk-quorum
rzhang-HP-ZBook-15:16433 --zk-reg-path /registry/org-apache-slider/slider_test
;
14/10/28 18:25:46 INFO appmaster.RoleLaunchService: Starting container
with command: python ./infra/agent/slider-agent/agent/main.py --label
container_1414519516219_0002_01_000003___VERTICA_SLAVE --zk-quorum
rzhang-HP-ZBook-15:16433 --zk-reg-path /registry/org-apache-slider/slider_test
;
14/10/28 18:25:46 INFO impl.NMClientAsyncImpl: Processing Event EventType:
START_CONTAINER for Container container_1414519516219_0002_01_000002
14/10/28 18:25:46 INFO impl.NMClientAsyncImpl: Processing Event EventType:
START_CONTAINER for Container container_1414519516219_0002_01_000003
14/10/28 18:25:46 INFO impl.ContainerManagementProtocolProxy: Opening
proxy : vertica0.rzhang.com:54106
14/10/28 18:25:46 INFO impl.ContainerManagementProtocolProxy: Opening
proxy : vertica2.rzhang.com:41175
14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Started Container
container_1414519516219_0002_01_000002
14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Started Container
container_1414519516219_0002_01_000003
14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Deployed instance of
role VERTICA_SLAVE onto container_1414519516219_0002_01_000002
14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Registering component
container_1414519516219_0002_01_000002
14/10/28 18:25:47 INFO impl.NMClientAsyncImpl: Processing Event EventType:
QUERY_CONTAINER for Container container_1414519516219_0002_01_000002
14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Deployed instance of
role VERTICA_SLAVE onto container_1414519516219_0002_01_000003
14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Registering component
container_1414519516219_0002_01_000003
14/10/28 18:25:47 INFO impl.NMClientAsyncImpl: Processing Event EventType:
QUERY_CONTAINER for Container container_1414519516219_0002_01_000003
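
To make the stuck role easier to spot in a log like the one above, here is a minimal sketch that pulls the RoleStatus entries out of the AM log and lists the roles whose actual count is below the desired count. It is not part of Slider: the function names, the SAMPLE constant, and the choice to skip the slider-appmaster role by default are assumptions made for illustration, and the regular expressions assume the RoleStatus{...} format exactly as it appears in this log.

```python
import re

# Matches one RoleStatus{name='...', ...} entry, format taken from the log above.
ROLE_STATUS = re.compile(r"RoleStatus\{name='([^']+)',([^}]*)\}")
# Matches numeric counters such as desired=1; failureMessage='' is skipped.
COUNTER = re.compile(r"(\w+)=(\d+)")


def parse_role_statuses(log_text):
    """Return {role name: {counter: int}}, keeping the last status seen per role."""
    # Mail clients hard-wrap log lines at spaces, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    statuses = {}
    for name, fields in ROLE_STATUS.findall(flat):
        statuses[name] = {key: int(value) for key, value in COUNTER.findall(fields)}
    return statuses


def under_allocated(statuses, ignore=("slider-appmaster",)):
    """Return (name, counters) pairs for roles with actual < desired, sorted by name.

    The AM's own role is ignored by default (an assumption), since it shows
    actual=0 in this log even though the AM is running.
    """
    return sorted(
        (
            (name, counters)
            for name, counters in statuses.items()
            if name not in ignore and counters["actual"] < counters["desired"]
        ),
        key=lambda item: item[0],
    )


# Diagnostics excerpt copied from the log above.
SAMPLE = """
RoleStatus{name='slider-appmaster', key=0, desired=1, actual=0,
requested=0, releasing=0, failed=0, started=0, startFailed=0, completed=0,
failureMessage=''}
RoleStatus{name='VERTICA_SLAVE', key=2, desired=2, actual=2, requested=0,
releasing=0, failed=0, started=0, startFailed=0, completed=0,
failureMessage=''}
RoleStatus{name='VERTICA_MASTER', key=1, desired=1, actual=0,
requested=1, releasing=0, failed=0, started=0, startFailed=0, completed=0,
failureMessage=''}
"""

if __name__ == "__main__":
    for name, counters in under_allocated(parse_role_statuses(SAMPLE)):
        print(name, "desired:", counters["desired"], "actual:", counters["actual"],
              "requested:", counters["requested"])
```

On the diagnostics excerpt, the only role reported is VERTICA_MASTER, which has one container requested and none allocated. That matches the log: both VERTICA_SLAVE containers are assigned, while no allocation follows the request for a container on vertica1.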

Thanks,
Rui




On 10/28/2014 01:47 PM, Sumit Mohanty wrote:

A bug fix went in a few days back -
https://issues.apache.org/jira/browse/SLIDER-439 - that specifically
fixed this issue.

thanks
-Sumit

On Tue, Oct 28, 2014 at 10:36 AM, Rui Zhang <rzh...@vertica.com> wrote:

  Hi,
When I kill a node manager manually and restart the application, it
seems that an instance that previously ran on that node manager is not able
to restart. Why is this? I think YARN should allocate a container on a
different machine for this instance, right?

Thanks,
Rui

--
Rui Zhang
Software engineer Intern
Vertica, an HP Company
rzh...@vertica.com



