[ https://issues.apache.org/jira/browse/YARN-8116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431063#comment-16431063 ]
Wangda Tan commented on YARN-8116:
----------------------------------

[~csingh], thanks for working on the fix. It would be better to include a simple unit test to guard against regression, since this is in the critical path of NM recovery.

> Nodemanager fails with NumberFormatException: For input string: ""
> ------------------------------------------------------------------
>
>                 Key: YARN-8116
>                 URL: https://issues.apache.org/jira/browse/YARN-8116
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Yesha Vora
>            Assignee: Chandni Singh
>            Priority: Critical
>         Attachments: YARN-8116.001.patch
>
>
> Steps followed:
> 1) Update the NodeManager debug delay config
> {code}
> <property>
>   <name>yarn.nodemanager.delete.debug-delay-sec</name>
>   <value>350</value>
> </property>{code}
> 2) Launch the distributed shell application multiple times
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn jar hadoop-yarn-applications-distributedshell-*.jar -shell_command "sleep 120" -num_containers 1 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos/httpd-24-centos7:latest -shell_env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true -jar hadoop-yarn-applications-distributedshell-*.jar{code}
> 3) Restart the NM
> The NodeManager fails to start with the error below.
> {code:title=NM log}
> 2018-03-23 21:32:14,437 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:serviceInit(181)) - ContainersMonitor enabled: true
> 2018-03-23 21:32:14,439 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceInit(130)) - rollingMonitorInterval is set as 3600. The logs will be aggregated every 3600 seconds
> 2018-03-23 21:32:14,455 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED
> java.lang.NumberFormatException: For input string: ""
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>     at java.lang.Long.parseLong(Long.java:601)
>     at java.lang.Long.parseLong(Long.java:631)
>     at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState(NMLeveldbStateStoreService.java:350)
>     at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState(NMLeveldbStateStoreService.java:253)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:365)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:464)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:899)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:960)
> 2018-03-23 21:32:14,458 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(148)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
> 2018-03-23 21:32:14,460 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
> java.lang.NumberFormatException: For input string: ""
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>     at java.lang.Long.parseLong(Long.java:601)
>     at java.lang.Long.parseLong(Long.java:631)
>     at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState(NMLeveldbStateStoreService.java:350)
>     at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState(NMLeveldbStateStoreService.java:253)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:365)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:464)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:899)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:960)
> 2018-03-23 21:32:14,463 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(210)) - Stopping NodeManager metrics system...
> 2018-03-23 21:32:14,464 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
> 2018-03-23 21:32:14,468 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(216)) - NodeManager metrics system stopped.
> 2018-03-23 21:32:14,508 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)) - NodeManager metrics system shutdown complete.{code}
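
The stack trace above shows Long.parseLong being reached from loadContainerState with an empty input string. As a rough illustration of that failure mode and of the kind of check a simple unit test could assert, here is a minimal sketch. It assumes the value is read from the state store as raw bytes and decoded to a string before parsing; that is an assumption about the recovery code, not something the log confirms. The class RecoveryParseSketch and its methods parseStrict and parseOrDefault are hypothetical names, and this is not the code in YARN-8116.001.patch.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: hypothetical names, not the NodeManager recovery code. */
final class RecoveryParseSketch {

  /** Mirrors the failing pattern: parse a stored value with no guard. */
  static long parseStrict(byte[] storedValue) {
    // Throws NumberFormatException when the decoded string is empty.
    return Long.parseLong(new String(storedValue, StandardCharsets.UTF_8));
  }

  /** Hypothetical guard: return a default for a missing, empty, or non-numeric value. */
  static long parseOrDefault(byte[] storedValue, long defaultValue) {
    if (storedValue == null || storedValue.length == 0) {
      return defaultValue;
    }
    try {
      return Long.parseLong(new String(storedValue, StandardCharsets.UTF_8).trim());
    } catch (NumberFormatException e) {
      return defaultValue;
    }
  }

  public static void main(String[] args) {
    byte[] empty = new byte[0];
    try {
      parseStrict(empty);
    } catch (NumberFormatException e) {
      // Same exception type as in the NM log above.
      System.out.println("strict parse failed: " + e.getMessage());
    }
    System.out.println(parseOrDefault(empty, -1L));
    System.out.println(parseOrDefault("350".getBytes(StandardCharsets.UTF_8), -1L));
  }
}
```

Whether falling back to a default is the right recovery behavior, as opposed to skipping the entry or never writing an empty value in the first place, is a decision for the actual patch. The sketch only shows that an empty stored value is enough to trigger the exception and that a regression test can cover that case directly.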