Ah, I think this is happening because you're using OS X. Normally YARN
kills the entire process group (including child processes) when the
container is killed, but it cannot do this on OS X. So the agent process is
living longer than it should.
A question we may want to look into is why the AM is not waiting for
the stop command to be executed before releasing the container.
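
For reference, here is a rough Python sketch (hypothetical, not the
actual NodeManager code) of what Linux-style cleanup does; the killpg
call is the step with no OS X equivalent, which is why the agent
outlives its container:

import os
import signal
import subprocess

# Launch the container's command as the leader of its own process group
# ("sleep 60" stands in for the agent's launch command).
proc = subprocess.Popen(["sleep", "60"], preexec_fn=os.setsid)

# Killing the container signals the whole group, so the agent and
# anything it forked die together. Without this step, only the direct
# child is signalled and the agent lives on.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)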

Billie

On Thu, Jul 7, 2016 at 4:28 PM, Sarthak Kukreti <skuk...@ncsu.edu> wrote:

> Thanks! That was helpful. Strangely, as it turns out, the container
> is released (and cleaned up) even before the STOP command is queued.
> Some more logs:
>
> Slider AppMaster Logs:
> ---------------------
> 2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState - Role
> ConnectD flexed from 2 to 1
> 2016-07-07 15:50:14,148 [AmExecutor-006] WARN  state.AppState -
> Resource requirements of ConnectD normalized from <memory:2000,
> vCores:1> to <memory:2048, vCores:1>
> 2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
> Resetting failure count of slider-appmaster; was 0
> 2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
> Resetting failure count of ConnectD; was 0
> 2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.RoleHistory -
> Resetting failure history
> 2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
> Reviewing RoleStatus{name='ConnectD', key=1, desired=1, actual=2,
> requested=0, releasing=0, failed=0, startFailed=0, started=2,
> completed=0, totalRequested=2, preempted=0, nodeFailed=0,
> failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:2048,
> vCores:1>, isAntiAffinePlacement=false, failureMessage='',
> providerRole=ProviderRole{name='ConnectD', id=1, placementPolicy=1,
> nodeFailureThreshold=3, placementTimeoutSeconds=30,
> labelExpression='null'}} :
> 2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
> ConnectD: Asking for 1 fewer node(s) for a total of 1
> 2016-07-07 15:50:14,150 [AmExecutor-006] INFO  state.AppState -
> Targeting for release: RoleInstance{role='ConnectD',
> id='container_1467829690678_0031_01_000003',
> container=ContainerID=container_1467829690678_0031_01_000003
> nodeID=192.168.1.195:58591 http=192.168.1.195:8042 priority=1073741825
> resource=<memory:2048, vCores:1>, createTime=1467931727650,
> startTime=1467931727656, released=false, roleId=1, host=192.168.1.195,
> hostURL=http://192.168.1.195:8042, state=3, placement=null,
> exitCode=0, command='python ./infra/agent/slider-agent/agent/main.py
> --label container_1467829690678_0031_01_000003___ConnectD --zk-quorum
> localhost:2181 --zk-reg-path
> /registry/users/sarthakk/services/org-apache-slider/kc >
> <LOG_DIR>/slider-agent.out 2>&1 ; ', diagnostics='null', output=null,
> environment=[LANGUAGE="en_US.UTF-8", HADOOP_USER_NAME="sarthakk",
> PYTHONPATH="./infra/agent/slider-agent/", AGENT_LOG_ROOT="<LOG_DIR>",
> SLIDER_PASSPHRASE="QhURcbCISl0zGwdgkbpbIBKAlIJOmsgKkRgmvuq5ytVDlguFa0",
> LC_ALL="en_US.UTF-8", AGENT_WORK_ROOT="$PWD", LANG="en_US.UTF-8"]}
>
> 2016-07-07 15:50:14,975 [AMRM Callback Handler Thread] INFO
> appmaster.SliderAppMaster - onContainersCompleted([1]
> 2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
> appmaster.SliderAppMaster - Container Completion for
> containerID=container_1467829690678_0031_01_000003, state=COMPLETE,
> exitStatus=-100, diagnostics=Container released by application
> 2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
> state.AppState - Container was queued for release :
> container_1467829690678_0031_01_000003
> 2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
> state.AppState - decrementing role count for role ConnectD to 1;
> releasing=0, completed=1
> 2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
> state.RoleHistory - Finished container for node 1, released=true,
> shortlived=false
> 2016-07-07 15:50:14,987 [AMRM Callback Handler Thread] INFO
> state.AppState - Removing node ID
> container_1467829690678_0031_01_000003
> 2016-07-07 15:50:14,988 [AMRM Callback Handler Thread] INFO
> agent.AgentProviderService - Removing component status for label
> container_1467829690678_0031_01_000003___ConnectD
>
> 2016-07-07 15:50:14,988 [AmExecutor-006] INFO
> appmaster.SliderAppMaster - Unregistering component
> container_1467829690678_0031_01_000003
> 2016-07-07 15:50:14,989 [AmExecutor-006] INFO  state.AppState -
> Reviewing RoleStatus{name='ConnectD', key=1, desired=1, actual=1,
> requested=0, releasing=0, failed=0, startFailed=0, started=2,
> completed=1, totalRequested=2, preempted=0, nodeFailed=0,
> failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:2048,
> vCores:1>, isAntiAffinePlacement=false, failureMessage='',
> providerRole=ProviderRole{name='ConnectD', id=1, placementPolicy=1,
> nodeFailureThreshold=3, placementTimeoutSeconds=30,
> labelExpression='null'}} :
> 2016-07-07 15:50:23,715 [430377746@qtp-1726515679-4] INFO
> agent.AgentProviderService - Sending terminate signal to completed
> container (still heartbeating):
> container_1467829690678_0031_01_000003___ConnectD
> 2016-07-07 16:12:43,891 [AmExecutor-006] INFO  state.AppState -
> Resetting failure count of slider-appmaster; was 0
> 2016-07-07 16:12:43,892 [AmExecutor-006] INFO  state.AppState -
> Resetting failure count of ConnectD; was 0
> 2016-07-07 16:12:43,892 [AmExecutor-006] INFO  state.RoleHistory -
> Resetting failure history
>
>
> slider-agent Logs:
> ---------------------
> INFO 2016-07-07 15:50:23,720 ActionQueue.py:170 - Running command:
> {u'roleCommand': u'STOP', u'clusterName': u'kc', u'componentName':
> u'ConnectD', u'hostname': u'192.168.1.195', u'hostLevelParams':
> {u'java_home': u'${JAVA_HOME}', u'container_id':
> u'container_1467829690678_0031_01_000003'}, u'commandType':
> u'EXECUTION_COMMAND', u'roleParams': {u'auto_restart': u'false'},
> u'serviceName': u'kc', u'role': u'ConnectD', u'commandParams':
> {u'record_config': u'true', u'service_package_folder':
> u'${AGENT_WORK_ROOT}/work/app/definition/package', u'script':
> u'scripts/kafka.py', u'schema_version': u'2.0', u'command_timeout':
> u'600', u'script_type': u'PYTHON'}, u'taskId': 4, u'commandId':
> u'4-1', u'containers': [], u'configurations': {u'global':
> {u'security_enabled': u'false', u'app_container_id':
> u'container_1467829690678_0031_01_000003', u'listen_port': u'51670',
> u'app_root': u'${AGENT_WORK_ROOT}/app/install', u'app_log_dir':
> u'${AGENT_LOG_ROOT}', u'kc_version': u'1.0.0', u'app_pid_dir':
> u'${AGENT_WORK_ROOT}/app/run', u'app_container_tag': u'2',
> u'pid_file': u'${AGENT_WORK_ROOT}/app/run/kc.pid', u'app_install_dir':
> u'${AGENT_WORK_ROOT}/app/install', u'app_user': u'sarthakk',
> u'app_input_conf_dir': u'${AGENT_WORK_ROOT}/propagatedconf'},
> u'server': {}}}
> INFO 2016-07-07 15:50:23,720 CustomServiceOrchestrator.py:114 - Base
> dir:
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package
>
> ERROR 2016-07-07 15:50:23,720 CustomServiceOrchestrator.py:169 -
> Caught an exception while executing command: <class
> 'AgentException.AgentException'>: 'Script
>
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package/scripts/kafka.py
> does not exist'
> Traceback (most recent call last):
>   File
> "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
> line 115, in runCommand
>     script_path = self.resolve_script_path(self.base_dir, script,
> script_type)
>   File
> "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
> line 199, in resolve_script_path
>     raise AgentException(message)
> AgentException: 'Script
>
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package/scripts/kafka.py
> does not exist'
> INFO 2016-07-07 15:50:23,721 ActionQueue.py:188 - Stop command received
> INFO 2016-07-07 15:50:23,824 Controller.py:223 - Terminate agent
> command received from AM, stopping the agent ...
> INFO 2016-07-07 15:50:23,824 ProcessHelper.py:39 - Removing pid file
> WARNING 2016-07-07 15:50:23,825 ProcessHelper.py:44 - Unable to remove
> pid file: [Errno 2] No such file or directory:
>
> '/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/infra/run/agent.pid'
> INFO 2016-07-07 15:50:23,825 ProcessHelper.py:46 - Removing temp files
>
> TL;DR: The callback signalling completion (and un-registration) of the
> container arrives around 15:50:14, whereas the call into
> CustomServiceOrchestrator (triggered by the terminate signal from the
> Slider AM) is received around 15:50:23. I guess I am missing how the
> synchronization between termination of a container and cleanup of all
> its resources is handled by Slider (or whether this is handled by YARN
> itself).
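>
> To illustrate what I was expecting (purely a hypothetical sketch, not
> actual Slider code; queue_stop_command and release_container are
> stand-ins): STOP first, release only after the agent acknowledges or a
> timeout expires. In the logs above, the release happens about nine
> seconds *before* the STOP even runs.
>
> import threading
>
> stop_acked = threading.Event()  # set when the agent reports STOP done
>
> def queue_stop_command(container_id):
>     # stand-in for queueing the STOP EXECUTION_COMMAND for the agent
>     print("STOP queued for %s" % container_id)
>
> def release_container(container_id):
>     # stand-in for AMRMClient.releaseAssignedContainer(containerId)
>     print("released %s" % container_id)
>
> def flex_down(container_id):
>     queue_stop_command(container_id)
>     if not stop_acked.wait(timeout=60):
>         pass  # timed out waiting for the agent; force-release anyway
>     release_container(container_id)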
>
> - Sarthak
>
> On Thu, Jul 7, 2016 at 2:48 PM, Billie Rinaldi <billie.rina...@gmail.com>
> wrote:
> > If you look for the container ID in the nodemanager log on the host where
> > the container was running, you should be able to see when the container
> > stopped and was cleaned up. Looks like it even logs when it deletes the
> > container directories.
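> >
> > If it helps, a quick Python helper like the following (hypothetical,
> > just a grep; point it at your nodemanager log) will pull out every
> > line mentioning the container:
> >
> > import sys
> >
> > # Usage: python grep_nm_log.py <nodemanager.log> <container_id>
> > def grep_container(log_path, container_id):
> >     with open(log_path) as log:
> >         for line in log:
> >             if container_id in line:
> >                 sys.stdout.write(line)
> >
> > if __name__ == "__main__":
> >     grep_container(sys.argv[1], sys.argv[2])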
> >
> > On Thu, Jul 7, 2016 at 2:04 PM, Sarthak Kukreti <skuk...@ncsu.edu>
> > wrote:
> >
> >> kafka.py is still present in the filecache directory; it's just the
> >> "container_1467829690678_0022_01_000003" directory that seems to be
> >> deleted before the runCommand() call.
> >>
> >> - Sarthak
> >>
> >> On Thu, Jul 7, 2016 at 12:35 PM, Billie Rinaldi
> >> <billie.rina...@gmail.com> wrote:
> >> > I think that
> >> >
> >>
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition
> >> > is linked to
> >> >
> >>
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip,
> >> > so those are actually the same directory. I am not sure why it is
> >> > saying kafka.py does not exist when stopping the container; it
> >> > definitely should not clean up that directory while a container is
> >> > still running. Can you verify that
> >> > app/definition/package/scripts/kafka.py exists for one of the
> >> > containers that is running?
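> >> >
> >> > Something along these lines (a throwaway check; paths copied from
> >> > the logs above) would confirm whether the symlink and the script
> >> > are still intact inside a running container's directory:
> >> >
> >> > import os
> >> >
> >> > appcache = ("/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/"
> >> >             "appcache/application_1467829690678_0022")
> >> > definition = (appcache +
> >> >     "/container_1467829690678_0022_01_000003/app/definition")
> >> > script = definition + "/package/scripts/kafka.py"
> >> >
> >> > # Should resolve into the filecache copy of the app package.
> >> > print("definition -> %s" % os.path.realpath(definition))
> >> > print("kafka.py exists: %s" % os.path.exists(script))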
> >> >
> >> > On Thu, Jul 7, 2016 at 11:50 AM, Sarthak Kukreti <skuk...@ncsu.edu>
> >> > wrote:
> >> >
> >> >> Hello!
> >> >>
> >> >> I am trying to use Slider to distribute an application over a YARN
> >> >> cluster. While attempting to use "slider flex" to decrease the number
> >> >> of containers allocated for the application (using the kafka
> >> >> app-package as reference), I came across the following error:
> >> >>
> >> >> ERROR 2016-07-07 10:57:36,461 CustomServiceOrchestrator.py:169 -
> >> >> Caught an exception while executing command: <class
> >> >> 'AgentException.AgentException'>: 'Script
> >> >>
> >> >>
> >>
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package/scripts/kafka.py
> >> >> does not exist'
> >> >> Traceback (most recent call last):
> >> >>   File
> >> >>
> >>
> "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
> >> >> line 115, in runCommand
> >> >>     script_path = self.resolve_script_path(self.base_dir, script,
> >> >> script_type)
> >> >>   File
> >> >>
> >>
> "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
> >> >> line 199, in resolve_script_path
> >> >>     raise AgentException(message)
> >> >> AgentException: 'Script
> >> >>
> >> >>
> >>
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package/scripts/kafka.py
> >> >> does not exist'
> >> >>
> >> >>
> >> >> (It seems like the directory is cleaned up before the command runs.)
> >> >>
> >> >> Additionally, I tried adding debug prints in the
> >> >> CustomServiceOrchestrator to see what base directory is used for
> >> >> invoking the script, and found that the base directories for the
> >> >> STATUS and STOP commands differ:
> >> >>
> >> >> STATUS command:
> >> >>
> >> >> INFO 2016-07-07 10:56:31,323 AgentToggleLogger.py:40 - Adding
> >> >> STATUS_COMMAND for service kc of cluster kc to the queue.
> >> >> INFO 2016-07-07 10:56:31,327 CustomServiceOrchestrator.py:114 - Base
> >> >> dir:
> >> >>
> >>
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip/package
> >> >>
> >> >>
> >> >> STOP command:
> >> >>
> >> >> INFO 2016-07-07 10:57:36,455 AgentToggleLogger.py:40 - Adding
> >> >> EXECUTION_COMMAND for service kc of cluster kc to the queue.
> >> >> INFO 2016-07-07 10:57:36,456 Controller.py:251 - Attempting to
> >> >> gracefully stop the application ...
> >> >> INFO 2016-07-07 10:57:36,458 ActionQueue.py:134 - Package received:
> >> >> INFO 2016-07-07 10:57:36,458 ActionQueue.py:140 - Executing command
> >> >> with id = 4-1 for role = Hello of cluster kc
> >> >> INFO 2016-07-07 10:57:36,460 ActionQueue.py:170 - Running command:
> >> >> {u'roleCommand': u'STOP', u'clusterName': u'kc', u'componentName':
> >> >> u'Hello', u'hostname': u'192.168.1.195', u'hostLevelParams':
> >> >> {u'java_home':
> >> >> u'/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home/',
> >> >> u'container_id': u'container_1467829690678_0022_01_000003'},
> >> >> u'commandType': u'EXECUTION_COMMAND', u'roleParams':
> {u'auto_restart':
> >> >> u'false'}, u'serviceName': u'kc', u'role': u'Hello',
> u'commandParams':
> >> >> {u'record_config': u'true', u'service_package_folder':
> >> >> u'${AGENT_WORK_ROOT}/work/app/definition/package', u'script':
> >> >> u'scripts/kafka.py', u'schema_version': u'2.0', u'command_timeout':
> >> >> u'600', u'script_type': u'PYTHON'}, u'taskId': 4, u'commandId':
> >> >> u'4-1', u'containers': [], u'configurations': {u'global':
> >> >> {u'security_enabled': u'false', u'app_container_id':
> >> >> u'container_1467829690678_0022_01_000003', u'listen_port': u'52508',
> >> >> u'app_root': u'${AGENT_WORK_ROOT}/app/install', u'app_log_dir':
> >> >> u'${AGENT_LOG_ROOT}', u'kc_version': u'1.0.0', u'app_pid_dir':
> >> >> u'${AGENT_WORK_ROOT}/app/run', u'app_container_tag': u'2',
> >> >> u'pid_file': u'${AGENT_WORK_ROOT}/app/run/kc.pid',
> u'app_install_dir':
> >> >> u'${AGENT_WORK_ROOT}/app/install', u'app_user': u'sarthakk',
> >> >> u'app_input_conf_dir': u'${AGENT_WORK_ROOT}/propagatedconf'},
> >> >> u'server': {}}}
> >> >> INFO 2016-07-07 10:57:36,461 CustomServiceOrchestrator.py:114 - Base
> >> >> dir:
> >> >>
> >>
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package
> >> >>
> >> >> For some reason, the STOP command attempts to pick up the script
> >> >> from the container-specific location, whereas the STATUS command
> >> >> goes through an entirely different path (though I am not sure if
> >> >> this is the cause of the issue). Any more pointers to debug this
> >> >> would be really helpful.
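> >> >>
> >> >> For what it's worth, my reading of the failing check (paraphrased
> >> >> Python, not the exact CustomServiceOrchestrator source) is simply:
> >> >>
> >> >> import os
> >> >>
> >> >> def resolve_script_path(base_dir, script):
> >> >>     # join the command's base dir with the script name; fail if
> >> >>     # missing, so whichever base dir the STOP command carries
> >> >>     # must still exist when the command finally runs
> >> >>     path = os.path.join(base_dir, script)
> >> >>     if not os.path.exists(path):
> >> >>         raise Exception("Script %s does not exist" % path)
> >> >>     return path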
> >> >>
> >> >> (For reference, platform: OS X, Python 2.7.11)
> >> >>
> >> >> Thank you
> >> >> Sarthak
> >> >>
> >>
>
