[jira] [Created] (YARN-7646) MR job (based on old version tarball) fails due to incompatible resource request
Junping Du created YARN-7646:

Summary: MR job (based on old version tarball) fails due to incompatible resource request
Key: YARN-7646
URL: https://issues.apache.org/jira/browse/YARN-7646
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Reporter: Junping Du
Priority: Blocker

Even with the quick workaround for HDFS-12920 (setting values without time units in hdfs-site.xml), the job still fails with the following error:
{noformat}
2017-12-12 16:39:13,105 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=-1, maxMemory=8192
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:275)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:240)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:256)
    at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:246)
    at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:217)
    at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:388)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy81.allocate(Unknown Source)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:206)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:783)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:280)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:279)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
It looks like an incompatible change in the communication between the old MR client and the new RM.
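For context, a minimal sketch of the kind of server-side check that produces the error above (the method shape is simplified from the SchedulerUtils frame in the trace, not copied from the Hadoop source):
{code}
// Simplified sketch: an old client that leaves memory unset (-1) trips this check.
private static void validateResourceRequest(ResourceRequest req, Resource maximumResource)
    throws InvalidResourceRequestException {
  long requestedMemory = req.getCapability().getMemorySize();
  long maxMemory = maximumResource.getMemorySize();
  if (requestedMemory < 0 || requestedMemory > maxMemory) {
    throw new InvalidResourceRequestException(
        "Invalid resource request, requested memory < 0, or requested memory > max configured,"
        + " requestedMemory=" + requestedMemory + ", maxMemory=" + maxMemory);
  }
}
{code}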
[jira] [Created] (YARN-7230) Document DockerContainerRuntime for branch-2.8 with proper scope and mark it as an experimental feature
Junping Du created YARN-7230:

Summary: Document DockerContainerRuntime for branch-2.8 with proper scope and mark it as an experimental feature
Key: YARN-7230
URL: https://issues.apache.org/jira/browse/YARN-7230
Project: Hadoop YARN
Issue Type: Bug
Components: documentation
Affects Versions: 2.8.1
Reporter: Junping Du
Priority: Blocker

YARN-5258 documents the new Docker container runtime feature, which has already been checked in to trunk/branch-2. We need a similar document for branch-2.8. However, given that several patches are missing there, we need to define a narrower scope for these features/improvements that matches the patches that actually landed in 2.8. Also, as in YARN-6622, it should be documented as experimental.
[jira] [Created] (YARN-7138) Fix incompatible API change for YarnScheduler introduced by YARN-5221
Junping Du created YARN-7138:

Summary: Fix incompatible API change for YarnScheduler introduced by YARN-5221
Key: YARN-7138
URL: https://issues.apache.org/jira/browse/YARN-7138
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Reporter: Junping Du
Priority: Blocker

The JACC report for 2.8.2 against 2.7.4 indicates that we have an incompatible change in YarnScheduler:
{noformat}
hadoop-yarn-server-resourcemanager-2.7.4.jar, YarnScheduler.class
package org.apache.hadoop.yarn.server.resourcemanager.scheduler
YarnScheduler.allocate ( ApplicationAttemptId p1, List p2, List p3, List p4, List p5 ) [abstract] : Allocation
{noformat}
The root cause is YARN-5221. We should change it back, or work around it by adding back the original API (marked as deprecated if it is no longer used).
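To illustrate the proposed workaround, here is a sketch of restoring the 2.7-era signature as a deprecated method that delegates to the new one (the UpdateContainerRequest parameter is an assumption based on YARN-5221, and default methods require Java 8; on older branches the delegation would live in the implementing classes):
{code}
import java.util.Collections;
import java.util.List;

public interface YarnScheduler {
  // New signature; the updateRequests parameter is assumed from YARN-5221.
  Allocation allocate(ApplicationAttemptId attemptId, List<ResourceRequest> ask,
      List<ContainerId> release, List<String> blacklistAdditions,
      List<String> blacklistRemovals, List<UpdateContainerRequest> updateRequests);

  // Old signature added back for compatibility and marked deprecated.
  @Deprecated
  default Allocation allocate(ApplicationAttemptId attemptId, List<ResourceRequest> ask,
      List<ContainerId> release, List<String> blacklistAdditions,
      List<String> blacklistRemovals) {
    return allocate(attemptId, ask, release, blacklistAdditions, blacklistRemovals,
        Collections.<UpdateContainerRequest>emptyList());
  }
}
{code}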
[jira] [Created] (YARN-7124) CLONE - Log aggregation deletes/renames while file is open
Junping Du created YARN-7124:

Summary: CLONE - Log aggregation deletes/renames while file is open
Key: YARN-7124
URL: https://issues.apache.org/jira/browse/YARN-7124
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.8.2
Reporter: Daryn Sharp
Assignee: Jason Lowe
Priority: Critical

YARN-6288 changed the log aggregation writer to be AutoCloseable. Unfortunately, the try-with-resources block for the writer either renames or deletes the log while it is still open. Assuming the NM's behavior is otherwise correct, deleting open files only produces ominous WARNs in the nodemanager log and increases the rate of logging in the NN when the implicit try-with-resources close fails. These red herrings complicate debugging efforts.
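To illustrate the pitfall in general terms (hypothetical names, not the YARN-6288 code): in a try-with-resources block the implicit close() runs after the body, so a rename or delete performed inside the body acts on a still-open file and the later close() fails noisily.
{code}
// Problematic shape: the rename happens while the writer still holds the file open.
try (LogWriter writer = new LogWriter(fs, tmpPath)) {
  writer.append(containerLogs);
  fs.rename(tmpPath, finalPath);   // file is still open here; the implicit close() fails
}

// Safer shape: close first, then rename.
LogWriter writer = new LogWriter(fs, tmpPath);
try {
  writer.append(containerLogs);
} finally {
  writer.close();
}
fs.rename(tmpPath, finalPath);
{code}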
[jira] [Created] (YARN-7027) Log aggregation finish time should be logged for troubleshooting.
Junping Du created YARN-7027:

Summary: Log aggregation finish time should be logged for troubleshooting.
Key: YARN-7027
URL: https://issues.apache.org/jira/browse/YARN-7027
Project: Hadoop YARN
Issue Type: Sub-task
Components: log-aggregation
Reporter: Junping Du
Assignee: Junping Du

Currently, the RM tracks application log aggregation status in RMApp, and status changes are triggered by NM heartbeats that carry log aggregation reports. Each time a node's log aggregation status moves from an in-progress state (NOT_START, RUNNING, RUNNING_WITH_FAILURE) to a final state (SUCCEEDED, FAILED, TIMEOUT), it triggers an aggregation of the overall log aggregation status via updateLogAggregationStatus(). The whole process logs very little, so we cannot trace log aggregation problems (delays, etc.) from the RM (or NM) logs. We should add more logging here.
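As a sketch of the direction (the method shape and helper names are hypothetical, not the RMApp source), the status-change handler could log each per-node report and the finish time once the status becomes final:
{code}
// Hypothetical placement: log every per-node status change, and record the finish
// time when the status reaches a final state, so delays become traceable.
private void updateLogAggregationStatus(NodeId nodeId, LogAggregationReport report) {
  LOG.info("Log aggregation for " + applicationId + " on " + nodeId
      + " changed to " + report.getLogAggregationStatus());
  if (isFinalState(report.getLogAggregationStatus())) {
    LOG.info("Log aggregation for " + applicationId + " on " + nodeId
        + " finished at " + System.currentTimeMillis());
  }
}
{code}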
[jira] [Resolved] (YARN-1038) LocalizationProtocolPBClientImpl RPC failing
[ https://issues-test.apache.org/jira/browse/YARN-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du resolved YARN-1038.
Resolution: Cannot Reproduce

I don't think trunk has this problem any more, so I am resolving it as Cannot Reproduce.

> LocalizationProtocolPBClientImpl RPC failing
> Key: YARN-1038
> URL: https://issues-test.apache.org/jira/browse/YARN-1038
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0-alpha1
> Reporter: Alejandro Abdelnur
> Priority: Blocker
>
> Trying to run an MR job in trunk is failing with:
> {code}
> 2013-08-06 22:24:21,498 WARN org.apache.hadoop.ipc.Client: interrupted waiting to send rpc request to server
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1279)
>     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>     at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1019)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1372)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1352)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>     at com.sun.proxy.$Proxy25.heartbeat(Unknown Source)
>     at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:250)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:164)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:977)
> {code}
[jira] [Resolved] (YARN-6891) Can kill other user's applications via RM UI
[ https://issues.apache.org/jira/browse/YARN-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du resolved YARN-6891.
Resolution: Duplicate

> Can kill other user's applications via RM UI
> Key: YARN-6891
> URL: https://issues.apache.org/jira/browse/YARN-6891
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Sumana Sathish
> Assignee: Junping Du
> Priority: Critical
>
> In a secured cluster with an unsecured UI, which has the following config:
> {code}
> "hadoop.http.authentication.simple.anonymous.allowed" => "true"
> "hadoop.http.authentication.type" => kerberos
> {code}
> the UI can be accessed without any security setting.
> Also, any user can kill other users' applications via the UI.
[jira] [Created] (YARN-6890) If the UI is not secured, we allow users to kill other users' jobs even when the YARN cluster is secured.
Junping Du created YARN-6890:

Summary: If the UI is not secured, we allow users to kill other users' jobs even when the YARN cluster is secured.
Key: YARN-6890
URL: https://issues.apache.org/jira/browse/YARN-6890
Project: Hadoop YARN
Issue Type: Bug
Reporter: Sumana Sathish
Assignee: Junping Du
Priority: Critical

Configuring SPNEGO for web browsers can be a headache, so many production clusters choose to configure unsecured UI access even for a secured cluster. In this setup, users (logged in as an arbitrary identity) can view other users' jobs, which is expected. However, the kill button (added in YARN-3249 and enabled by default) shouldn't work in this situation.
[jira] [Resolved] (YARN-5007) Remove deprecated constructors of MiniYARNCluster and MiniMRYarnCluster
[ https://issues.apache.org/jira/browse/YARN-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du resolved YARN-5007.
Resolution: Later
Assignee: (was: Andras Bokor)

> Remove deprecated constructors of MiniYARNCluster and MiniMRYarnCluster
> Key: YARN-5007
> URL: https://issues.apache.org/jira/browse/YARN-5007
> Project: Hadoop YARN
> Issue Type: Test
> Reporter: Andras Bokor
> Labels: oct16-easy
> Attachments: YARN-5007.01.patch, YARN-5007.02.patch, YARN-5007.03.patch
>
> MiniYarnCluster has a deprecated constructor which is called by the other constructors, and it causes javac warnings during the build.
[jira] [Created] (YARN-6534) ResourceManager fails because TimelineClient tries to init SSLFactory even when https is not enabled
Junping Du created YARN-6534:

Summary: ResourceManager fails because TimelineClient tries to init SSLFactory even when https is not enabled
Key: YARN-6534
URL: https://issues.apache.org/jira/browse/YARN-6534
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.0.0-alpha3
Reporter: Junping Du
Priority: Blocker

In a non-secured cluster, the RM fails consistently because TimelineServiceV1Publisher tries to initialize TimelineClient with an SSLFactory without any check on whether https is in use.
{noformat}
2017-04-26 21:09:10,683 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1457)) - Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: java.io.FileNotFoundException: /etc/security/clientKeys/all.jks (No such file or directory)
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceInit(TimelineClientImpl.java:131)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractSystemMetricsPublisher.serviceInit(AbstractSystemMetricsPublisher.java:59)
    at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.serviceInit(TimelineServiceV1Publisher.java:67)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:344)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1453)
Caused by: java.io.FileNotFoundException: /etc/security/clientKeys/all.jks (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.loadTrustManager(ReloadingX509TrustManager.java:168)
    at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.<init>(ReloadingX509TrustManager.java:86)
    at org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory.init(FileBasedKeyStoresFactory.java:219)
    at org.apache.hadoop.security.ssl.SSLFactory.init(SSLFactory.java:179)
    at org.apache.hadoop.yarn.client.api.impl.TimelineConnector.getSSLFactory(TimelineConnector.java:176)
    at org.apache.hadoop.yarn.client.api.impl.TimelineConnector.serviceInit(TimelineConnector.java:106)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    ... 11 more
{noformat}
CC [~rohithsharma] and [~gtCarrera9]
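A minimal sketch of the fix direction, assuming the check can hang off YarnConfiguration.useHttps (the placement inside TimelineConnector.serviceInit is simplified):
{code}
// Only build the SSLFactory when the endpoint actually uses https, so a
// non-secured cluster never touches the keystore files.
@Override
protected void serviceInit(Configuration conf) throws Exception {
  if (YarnConfiguration.useHttps(conf)) {
    sslFactory = getSSLFactory(conf);
  }
  super.serviceInit(conf);
}
{code}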
[jira] [Created] (YARN-6336) Jenkins reports YARN new UI build failure
Junping Du created YARN-6336:

Summary: Jenkins reports YARN new UI build failure
Key: YARN-6336
URL: https://issues.apache.org/jira/browse/YARN-6336
Project: Hadoop YARN
Issue Type: Bug
Reporter: Junping Du
Priority: Blocker

In the Jenkins report for YARN-6313 (https://builds.apache.org/job/PreCommit-YARN-Build/15260/artifact/patchprocess/patch-compile-hadoop-yarn-project_hadoop-yarn.txt), we found the following build failure caused by the YARN new UI:
{noformat}
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/target/src/main/webapp/node_modules/ember-cli-htmlbars/node_modules/broccoli-persistent-filter/node_modules/async-disk-cache/node_modules/username/index.js:2
const os = require('os');
^
Use of const in strict mode.
SyntaxError: Use of const in strict mode.
    at Module._compile (module.js:439:25)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/target/src/main/webapp/node_modules/ember-cli-htmlbars/node_modules/broccoli-persistent-filter/node_modules/async-disk-cache/index.js:24:16)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
DEPRECATION: Node v0.10.25 is no longer supported by Ember CLI. Please update to a more recent version of Node
undefined
version: 1.13.15
Could not find watchman, falling back to NodeWatcher for file system events.
Visit http://www.ember-cli.com/user-guide/#watchman for more info.
Building[INFO]
{noformat}
[jira] [Created] (YARN-6294) ATS client should better handle Socket closed case
Junping Du created YARN-6294:

Summary: ATS client should better handle Socket closed case
Key: YARN-6294
URL: https://issues.apache.org/jira/browse/YARN-6294
Project: Hadoop YARN
Issue Type: Bug
Components: timelineclient
Reporter: Sumana Sathish
Assignee: Li Lu

Exception stack:
{noformat}
17/02/06 07:11:30 INFO distributedshell.ApplicationMaster: Container completed successfully., containerId=container_1486362713048_0037_01_02
17/02/06 07:11:30 ERROR distributedshell.ApplicationMaster: Error in RMCallbackHandler:
com.sun.jersey.api.client.ClientHandlerException: java.net.SocketException: Socket closed
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:236)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:185)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:248)
    at com.sun.jersey.api.client.Client.handle(Client.java:648)
    at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
    at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
    at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
    at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPostingObject(TimelineWriter.java:154)
    at org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:115)
    at org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:112)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833)
    at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPosting(TimelineWriter.java:112)
    at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.putEntities(TimelineWriter.java:92)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:346)
    at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishContainerEndEvent(ApplicationMaster.java:1145)
    at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.access$400(ApplicationMaster.java:169)
    at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster$RMCallbackHandler.onContainersCompleted(ApplicationMaster.java:779)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:296)
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:204)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240)
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
    ... 20 more
Exception in thread "AMRM Callback Handler Thread"
{noformat}
[jira] [Resolved] (YARN-6079) simple spelling errors in yarn test code
[ https://issues.apache.org/jira/browse/YARN-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du resolved YARN-6079.
Resolution: Fixed
Hadoop Flags: Reviewed
Fix Version/s: 2.9.0, 3.0.0-alpha2

> simple spelling errors in yarn test code
> Key: YARN-6079
> URL: https://issues.apache.org/jira/browse/YARN-6079
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test
> Reporter: Grant Sohn
> Assignee: vijay
> Priority: Trivial
> Fix For: 2.9.0, 3.0.0-alpha2
> Attachments: YARN-6079.001.patch
>
> charactor -> character
> hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/TestCommonNodeLabelsManager.java: Assert.assertTrue("invalid label charactor should not add to repo", caught);
> expteced -> expected
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java: Assert.fail("Exception is not expteced.");
> Exepected -> Expected
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java: "Exepected AbsoluteUsedCapacity > 0.95, got: "
> expteced -> expected
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java: Assert.fail("Exception is not expteced.");
> macthing -> matching
> hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java: assertEquals("Expected no macthing requests.", matches.size(), 0);
> propogated -> propagated
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java: Assert.assertTrue("Node script time out message not propogated",
> protential -> potential
> hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/BasePBImplRecordsTest.java: LOG.info(String.format("Exclude protential property: %s\n", gsp.propertyName));
> recevied -> received
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java: throw new Exception("Unexpected resource recevied.");
> shouldnt -> shouldn't
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServiceAppsNodelabel.java: fail("resourceInfo object shouldnt be available for finished apps");
> Transistion -> Transition
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java: Assert.fail("Transistion to Active should have failed for refreshAll()");
> Unhelathy -> Unhealthy
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java: Assert.assertEquals("Unhelathy Nodes", initialUnHealthy,
[jira] [Created] (YARN-6071) Fix incompatible API change on AM-RM protocol due to YARN-3866 (trunk only)
Junping Du created YARN-6071:

Summary: Fix incompatible API change on AM-RM protocol due to YARN-3866 (trunk only)
Key: YARN-6071
URL: https://issues.apache.org/jira/browse/YARN-6071
Project: Hadoop YARN
Issue Type: Bug
Reporter: Junping Du
Assignee: Wangda Tan
Priority: Blocker

In YARN-3866, we have an addendum patch that fixes the incompatible API change on branch-2 and branch-2.8. For trunk, we need a similar fix.
[jira] [Created] (YARN-6068) Log aggregation fails when the NM restarts, even with recovery enabled
Junping Du created YARN-6068:

Summary: Log aggregation fails when the NM restarts, even with recovery enabled
Key: YARN-6068
URL: https://issues.apache.org/jira/browse/YARN-6068
Project: Hadoop YARN
Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical

The exception log is as follows:
{noformat}
2017-01-05 19:16:36,352 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(527)) - Aborting log aggregation for application_1483640789847_0001
2017-01-05 19:16:36,352 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(399)) - Aggregation did not complete for application application_1483640789847_0001
2017-01-05 19:16:36,353 WARN application.ApplicationImpl (ApplicationImpl.java:handle(461)) - Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FAILED at RUNNING
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:459)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:64)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1084)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1076)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
    at java.lang.Thread.run(Thread.java:745)
2017-01-05 19:16:36,355 INFO application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application application_1483640789847_0001 transitioned from RUNNING to null
{noformat}
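One plausible fix direction, sketched with the state/event names visible in the log above (the transition handler name is hypothetical, and the real StateMachineFactory wiring in ApplicationImpl is more involved):
{code}
// Teach the application state machine to accept the event while RUNNING
// instead of throwing InvalidStateTransitonException.
.addTransition(ApplicationState.RUNNING, ApplicationState.RUNNING,
    ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED,
    new AppLogHandlingFailedTransition())   // hypothetical handler
{code}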
[jira] [Created] (YARN-5718) TimelineClient (and other places in YARN) shouldn't overwrite HDFS client retry settings, which can cause unexpected behavior
Junping Du created YARN-5718:

Summary: TimelineClient (and other places in YARN) shouldn't overwrite HDFS client retry settings, which can cause unexpected behavior
Key: YARN-5718
URL: https://issues.apache.org/jira/browse/YARN-5718
Project: Hadoop YARN
Issue Type: Bug
Components: timelineclient, resourcemanager
Reporter: Junping Du
Assignee: Junping Du

In one HA cluster, after an NN failover, we noticed that jobs were failing because TimelineClient could not retry the connection to the proper NN. This is because we overwrite HDFS client settings and hard-code the retry policy to enabled, which conflicts with the NN failover case: the HDFS client should fail fast so it can retry against the other NN. We shouldn't assume any retry policy for the HDFS client anywhere in YARN; this should stay consistent with the HDFS settings, which use different retry policies in different deployments. Thus, we should clean up these hard-coded settings in YARN, including FileSystemTimelineWriter, FileSystemRMStateStore and FileSystemNodeLabelsStore.
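To illustrate the anti-pattern being removed (dfs.client.retry.policy.enabled is a real HDFS client key, but the snippet is illustrative rather than a quote from FileSystemTimelineWriter):
{code}
// Anti-pattern: force-enabling the HDFS client retry policy overrides whatever
// the deployment configured; on NN failover the client keeps retrying the dead
// NN instead of failing fast and moving to the other one.
Configuration conf = new Configuration(getConfig());
conf.setBoolean("dfs.client.retry.policy.enabled", true);   // hard-coded overwrite
FileSystem fs = FileSystem.newInstance(conf);

// Fix direction: drop the overwrite and inherit the site configuration as-is.
FileSystem fsFixed = FileSystem.newInstance(getConfig());
{code}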
[jira] [Created] (YARN-5536) Multiple format support (JSON, etc.) for exclude node file in NM graceful decommission with timeout
Junping Du created YARN-5536:

Summary: Multiple format support (JSON, etc.) for exclude node file in NM graceful decommission with timeout
Key: YARN-5536
URL: https://issues.apache.org/jira/browse/YARN-5536
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Junping Du
Priority: Blocker

Per the discussion in YARN-4676, we agreed that formats other than XML should be supported for decommissioning nodes with timeout values.
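Purely as an illustration of what a JSON exclude file with per-host decommission timeouts might look like (the actual schema was still under discussion in YARN-4676; all field names here are hypothetical):
{code}
{
  "hosts": [
    { "host": "node1.example.com", "timeout": 3600 },
    { "host": "node2.example.com" }
  ]
}
{code}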
[jira] [Created] (YARN-5475) Test failure for TestAggregatedLogFormat on trunk
Junping Du created YARN-5475:

Summary: Test failure for TestAggregatedLogFormat on trunk
Key: YARN-5475
URL: https://issues.apache.org/jira/browse/YARN-5475
Project: Hadoop YARN
Issue Type: Bug
Reporter: Junping Du

{noformat}
Tests run: 3, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 1.114 sec <<< FAILURE! - in org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat
testReadAcontainerLogs1(org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat) Time elapsed: 0.012 sec <<< ERROR!
java.io.IOException: Unable to create directory : /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/TestAggregatedLogFormat/testReadAcontainerLogs1/srcFiles/application_1_0001/container_1_0001_01_01/subDir
    at org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat.getOutputStreamWriter(TestAggregatedLogFormat.java:403)
    at org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat.writeSrcFile(TestAggregatedLogFormat.java:382)
    at org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat.testReadAcontainerLog(TestAggregatedLogFormat.java:211)
    at org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat.testReadAcontainerLogs1(TestAggregatedLogFormat.java:185)
{noformat}
[jira] [Created] (YARN-5416) TestRMRestart#testRMRestartWaitForPreviousAMToFinish fails intermittently because it does not wait for the SchedulerApplicationAttempt to be stopped
Junping Du created YARN-5416:

Summary: TestRMRestart#testRMRestartWaitForPreviousAMToFinish fails intermittently because it does not wait for the SchedulerApplicationAttempt to be stopped
Key: YARN-5416
URL: https://issues.apache.org/jira/browse/YARN-5416
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du
Priority: Minor

The test failure stack is:
{noformat}
Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
Tests run: 54, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 385.338 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
testRMRestartWaitForPreviousAMToFinish[0](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 43.134 sec <<< FAILURE!
java.lang.AssertionError: AppAttempt state is not correct (timedout) expected: but was:
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:743)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:86)
    at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:594)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.launchAM(TestRMRestart.java:1008)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:530)
{noformat}
This is due to the same issue that was partially fixed in YARN-4968.
[jira] [Created] (YARN-5311) Document graceful decommission CLI and usage
Junping Du created YARN-5311:

Summary: Document graceful decommission CLI and usage
Key: YARN-5311
URL: https://issues.apache.org/jira/browse/YARN-5311
Project: Hadoop YARN
Issue Type: Sub-task
Components: documentation
Reporter: Junping Du
Assignee: Junping Du
[jira] [Resolved] (YARN-5217) Close FileInputStream in NMWebServices#getLogs in branch-2.8
[ https://issues.apache.org/jira/browse/YARN-5217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du resolved YARN-5217.
Resolution: Duplicate

> Close FileInputStream in NMWebServices#getLogs in branch-2.8
> Key: YARN-5217
> URL: https://issues.apache.org/jira/browse/YARN-5217
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 2.8.0
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Priority: Critical
> Attachments: YARN-5217.branch-2.8.patch
>
> In https://issues.apache.org/jira/browse/YARN-5199, we close the LogReader in AHSWebServices#getStreamingOutput and the FileInputStream in NMWebServices#getLogs. We should do the same thing in branch-2.8.
[jira] [Created] (YARN-5214) Blocking on the synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
Junping Du created YARN-5214:

Summary: Blocking on the synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
Key: YARN-5214
URL: https://issues.apache.org/jira/browse/YARN-5214
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical

In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped, and after a while the node was marked LOST by the RM. From the log, the NM daemon was still running, but jstack shows that the NM's NodeStatusUpdater thread is blocked:

1. The Node Status Updater thread is blocked on 0x8065eae8:
{noformat}
"Node Status Updater" #191 prio=5 os_prio=0 tid=0x7f0354194000 nid=0x26fa waiting for monitor entry [0x7f035945a000]
java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
    - waiting to lock <0x8065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
    at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
2. The actual holder of this lock is DiskHealthMonitor:
{noformat}
"DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x7f0397393000 nid=0x26bd runnable [0x7f035e511000]
java.lang.Thread.State: RUNNABLE
    at java.io.UnixFileSystem.createDirectory(Native Method)
    at java.io.File.mkdir(File.java:1316)
    at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
    - locked <0x8065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
{noformat}
This disk operation can take longer than expected, especially under high I/O throughput, and we should use fine-grained locking for the related operations here. The same issue was raised and fixed on the HDFS side in HDFS-7489, and we probably should apply a similar fix here.
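A minimal sketch of the fine-grained locking direction (analogous to HDFS-7489; testDirs appears in the thread dump above, while updateDirectoryStates is a hypothetical helper):
{code}
// Take the monitor only for the snapshot and the quick state update, and run the
// slow disk probes lock-free, so readers like getFailedDirs() never block on I/O.
void checkDirs() {
  List<String> dirsToProbe;
  synchronized (this) {
    dirsToProbe = new ArrayList<>(localDirs);                 // snapshot under lock
  }
  Map<String, DiskErrorInformation> results = testDirs(dirsToProbe);  // slow I/O, no lock
  synchronized (this) {
    updateDirectoryStates(results);                           // quick update under lock
  }
}
{code}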
[jira] [Resolved] (YARN-4955) Add retry for SocketTimeoutException in TimelineClient
[ https://issues.apache.org/jira/browse/YARN-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du resolved YARN-4955.
Resolution: Fixed

> Add retry for SocketTimeoutException in TimelineClient
> Key: YARN-4955
> URL: https://issues.apache.org/jira/browse/YARN-4955
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Priority: Critical
> Fix For: 2.8.0
> Attachments: YARN-4955.1.patch, YARN-4955.2.patch, YARN-4955.3.patch, YARN-4955.4-1.patch, YARN-4955.4.patch, YARN-4955.5.patch, YARN-4955.6.patch
>
> We saw this exception several times when we tried to getDelegationToken from ATS.
> java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: java.net.SocketTimeoutException: Read timed out
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:569)
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:234)
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:582)
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:479)
>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
>     at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:291)
>     at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:290)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:240)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>     at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
>     at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
>     at java.lang.Thread.run(Thread.java:745)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
> Caused by: org.apache.hadoop.security.authentication.client.AuthenticationException: java.net.SocketTimeoutException: Read timed out
>     at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:332)
>     at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:205)
>     at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:128)
>     at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:215)
>     at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:285)
>     at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.getDelegationToken(DelegationTokenAuthenticator.java:166)
>     at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.getDelegationToken(DelegationTokenAuthenticatedURL.java:371)
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:475)
>     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:467)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at
[jira] [Created] (YARN-4984) LogAggregationService shouldn't swallow exceptions in handling createAppDir(), which causes a thread leak.
Junping Du created YARN-4984:

Summary: LogAggregationService shouldn't swallow exceptions in handling createAppDir(), which causes a thread leak.
Key: YARN-4984
URL: https://issues.apache.org/jira/browse/YARN-4984
Project: Hadoop YARN
Issue Type: Bug
Components: log-aggregation
Affects Versions: 2.7.2
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical

Due to YARN-4325, many stale applications still exist in the NM state store and get recovered after an NM restart. App initialization then fails because the token is invalid, but the exception is swallowed and an aggregator thread is still created for the invalid app. The exception is:
{noformat}
2016-04-19 23:38:33,039 ERROR logaggregation.LogAggregationService (LogAggregationService.java:run(300)) - Failed to setup application log directory for application_1448060878692_11842
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 1380589 for hdfswrite) can't be found in cache
    at org.apache.hadoop.ipc.Client.call(Client.java:1427)
    at org.apache.hadoop.ipc.Client.call(Client.java:1358)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy13.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
    at sun.reflect.GeneratedMethodAccessor76.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1315)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1311)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1311)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:248)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:67)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:261)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:367)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:447)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
{noformat}
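A sketch of the fix direction (the method shape is simplified; the real initApp takes more arguments): surface the createAppDir() failure instead of swallowing it, so no aggregator thread is spawned for an invalid app.
{code}
private void initApp(ApplicationId appId, String user, Credentials credentials) {
  try {
    createAppDir(user, appId, credentials);
    initAppAggregator(appId, user, credentials);
  } catch (Exception e) {
    // Propagate as a log-handling-failed event instead of swallowing the error,
    // so the aggregator thread is never created for the doomed application.
    LOG.error("Failed to setup application log directory for " + appId, e);
    dispatcher.getEventHandler().handle(new ApplicationEvent(appId,
        ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED));
  }
}
{code}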
[jira] [Created] (YARN-4932) (Umbrella) YARN test failures on Windows
Junping Du created YARN-4932:

Summary: (Umbrella) YARN test failures on Windows
Key: YARN-4932
URL: https://issues.apache.org/jira/browse/YARN-4932
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: Junping Du

We found several test failures related to Windows. Here is an umbrella JIRA to track them.
[jira] [Created] (YARN-4893) Fix some intermittent test failures in TestRMAdminService2
Junping Du created YARN-4893:

Summary: Fix some intermittent test failures in TestRMAdminService2
Key: YARN-4893
URL: https://issues.apache.org/jira/browse/YARN-4893
Project: Hadoop YARN
Issue Type: Bug
Reporter: Junping Du

As discussed in YARN-998, we need to add rm.drainEvents() after rm.registerNode(), or some of the tests can fail intermittently. Also, we could consider adding rm.drainEvents() inside rm.registerNode(), which would be more convenient. The pattern is sketched below.
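A small sketch of the proposed test pattern (MockRM/MockNM are the existing test helpers; the node spec is illustrative):
{code}
MockRM rm = new MockRM(conf);
rm.start();
MockNM nm1 = rm.registerNode("h1:1234", 8192);
rm.drainEvents();   // ensure the registration event is fully processed
// ... the rest of the test can now assume the scheduler has seen the node
{code}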
[jira] [Created] (YARN-4863) AHS Security login should be in serviceInit() instead of serviceStart()
Junping Du created YARN-4863:

Summary: AHS Security login should be in serviceInit() instead of serviceStart()
Key: YARN-4863
URL: https://issues.apache.org/jira/browse/YARN-4863
Project: Hadoop YARN
Issue Type: Bug
Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du

Like other daemons, doSecureLogin() should be called in serviceInit() rather than serviceStart(); otherwise, some FS operations can run into problems while the composite services are starting.
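A minimal sketch of the move (the enclosing service class is elided and the helper name simplified; the real AHS does more init work):
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  doSecureLogin(conf);      // moved up from serviceStart(): log in before any
  super.serviceInit(conf);  // composite child touches the FileSystem
}

@Override
protected void serviceStart() throws Exception {
  // no doSecureLogin() here any more
  super.serviceStart();
}
{code}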
[jira] [Created] (YARN-4832) NM-side resource value should get updated if a change is applied on the RM side
Junping Du created YARN-4832:

Summary: NM-side resource value should get updated if a change is applied on the RM side
Key: YARN-4832
URL: https://issues.apache.org/jira/browse/YARN-4832
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager, resourcemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical

Currently, if we use the CLI to update node resources (for a single node or multiple nodes) on the RM side, the NM will not receive any notification. This doesn't affect resource scheduling, but it makes the resource usage metrics reported by the NM a bit weird. We should sync the new resource values between the RM and the NM.
[jira] [Created] (YARN-4791) Per-user blacklisting of nodes for user-specific container launch failures
Junping Du created YARN-4791:

Summary: Per-user blacklisting of nodes for user-specific container launch failures
Key: YARN-4791
URL: https://issues.apache.org/jira/browse/YARN-4791
Project: Hadoop YARN
Issue Type: Bug
Components: applications
Reporter: Junping Du
Assignee: Junping Du

Some container launch failures are user-specific. For example, when LinuxContainerExecutor is enabled but a node doesn't have the user, container launch fails with the following information:
{noformat}
2016-02-14 15:37:03,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1434045496283_0036_02 State change from LAUNCHED to FAILED
2016-02-14 15:37:03,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1434045496283_0036 failed 2 times due to AM Container for appattempt_1434045496283_0036_02 exited with exitCode: -1000 due to: Application application_1434045496283_0036 initialization failed (exitCode=255) with output: User jdu not found
{noformat}
Obviously, this node is not suitable for launching containers for this user's other applications. We need a per-user blacklist tracking mechanism rather than the current per-application one.
[jira] [Created] (YARN-4790) Per-user blacklisting of nodes for user-specific container launch failures
Junping Du created YARN-4790:

Summary: Per-user blacklisting of nodes for user-specific container launch failures
Key: YARN-4790
URL: https://issues.apache.org/jira/browse/YARN-4790
Project: Hadoop YARN
Issue Type: Bug
Components: applications
Reporter: Junping Du
Assignee: Junping Du

Some container launch failures are user-specific. For example, when LinuxContainerExecutor is enabled but a node doesn't have the user, container launch fails with the following information:
{noformat}
2016-02-14 15:37:03,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1434045496283_0036_02 State change from LAUNCHED to FAILED
2016-02-14 15:37:03,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1434045496283_0036 failed 2 times due to AM Container for appattempt_1434045496283_0036_02 exited with exitCode: -1000 due to: Application application_1434045496283_0036 initialization failed (exitCode=255) with output: User jdu not found
{noformat}
Obviously, this node is not suitable for launching containers for this user's other applications. We need a per-user blacklist tracking mechanism rather than the current per-application one.
[jira] [Created] (YARN-4638) Node whitelist support for AM launching
Junping Du created YARN-4638:

Summary: Node whitelist support for AM launching
Key: YARN-4638
URL: https://issues.apache.org/jira/browse/YARN-4638
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du
[jira] [Created] (YARN-4636) Make blacklist tracking policy pluggable for more extensions.
Junping Du created YARN-4636:

Summary: Make blacklist tracking policy pluggable for more extensions.
Key: YARN-4636
URL: https://issues.apache.org/jira/browse/YARN-4636
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Junping Du
[jira] [Created] (YARN-4635) Add global blacklist tracking for AM container failure.
Junping Du created YARN-4635:

Summary: Add global blacklist tracking for AM container failure.
Key: YARN-4635
URL: https://issues.apache.org/jira/browse/YARN-4635
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
[jira] [Created] (YARN-4600) More general services provided to application/container by YARN
Junping Du created YARN-4600:

Summary: More general services provided to application/container by YARN
Key: YARN-4600
URL: https://issues.apache.org/jira/browse/YARN-4600
Project: Hadoop YARN
Issue Type: New Feature
Components: applications, resourcemanager
Reporter: Junping Du
Priority: Critical

More general services, such as HA and message/notification, should be provided by YARN to containers to better support a wide variety of applications.
[jira] [Created] (YARN-4601) HA as a general YARN service for containers highlighted by the application
Junping Du created YARN-4601:

Summary: HA as a general YARN service for containers highlighted by the application
Key: YARN-4601
URL: https://issues.apache.org/jira/browse/YARN-4601
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical

For LRS (long-running services) on YARN, having YARN itself remove the single point of failure posed by a critical container may not be necessary; some applications would like to build their own HA architecture. However, it would be ideal to provide some fundamental HA support in YARN, such as launching containers marked active/standby, monitoring and triggering failover, and providing an endpoint for sharing information between active and standby containers.
[jira] [Created] (YARN-4602) Message/notification service between containers
Junping Du created YARN-4602:

Summary: Message/notification service between containers
Key: YARN-4602
URL: https://issues.apache.org/jira/browse/YARN-4602
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du

Currently, most communication among YARN daemons, services and applications goes through RPC. In almost all cases, logic running inside containers acts as an RPC client rather than a server, because containers are launched in flight. The only special case is the AM container: because it is launched before any other container, it can act as an RPC server and tell newly arriving containers the server address through application logic (as the MR AM does). The side effects are:
1. When the AM container fails, the new AM attempt is launched with a new address/port, so the previous RPC connections are broken.
2. Application requirements vary; there can be other dependencies between containers (not involving the AM), so the failover of one container can affect other containers' running logic.
It would be better to have a message/notification mechanism between containers to handle the above cases.
[jira] [Created] (YARN-4576) Extend blacklist mechanism to protect AM failed multiple times on failure nodes
Junping Du created YARN-4576: Summary: Extend blacklist mechanism to protect AM failed multiple times on failure nodes Key: YARN-4576 URL: https://issues.apache.org/jira/browse/YARN-4576 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Junping Du Assignee: Junping Du Priority: Critical The current YARN blacklist mechanism lets the AM track bad nodes: if the AM fails several times to launch containers on a specific node, it blacklists that node in future resource requests. This works fine for normal containers. However, from our observation of cluster behavior, if a problematic node fails to launch an AM, the RM may pick the same problematic node for the next AM attempt again and again, causing application failure when other functional nodes are busy. The customized health checker script normally cannot be sensitive enough to mark a node unhealthy after one or two failed container launches; on the RM side, though, we can blacklist such nodes for AM launching for a certain time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4552) NM ResourceLocalizationService should check and initialize local filecache dir (and log dir) even if NM recover is enabled.
Junping Du created YARN-4552: Summary: NM ResourceLocalizationService should check and initialize local filecache dir (and log dir) even if NM recover is enabled. Key: YARN-4552 URL: https://issues.apache.org/jira/browse/YARN-4552 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Critical In some cases, users clean up the localized file cache for debugging/troubleshooting purposes while the NM is down. However, after the NM is brought back (with recovery enabled), job submission can fail with an exception like the one below: {noformat} Diagnostics: java.io.FileNotFoundException: File /disk/12/yarn/local/filecache does not exist. {noformat} This is because we only create the filecache dir when recovery is not enabled while ResourceLocalizationService is initialized/started. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
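A minimal sketch of the kind of startup check described above: verify the filecache dir and re-create it if it was wiped during downtime, regardless of whether recovery is enabled (the class and paths are illustrative, not the actual ResourceLocalizationService code):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalDirInitializer {
  // Illustrative: ensure the filecache dir exists at service start,
  // whether or not NM recovery is enabled.
  static void ensureFilecacheDir(String localRoot) throws IOException {
    Path filecache = Paths.get(localRoot, "filecache");
    if (!Files.isDirectory(filecache)) {
      // Re-create a dir an admin may have wiped during NM downtime,
      // instead of failing later with FileNotFoundException.
      Files.createDirectories(filecache);
    }
  }

  public static void main(String[] args) throws IOException {
    ensureFilecacheDir("/tmp/yarn-local-demo");
  }
}
{code}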
[jira] [Created] (YARN-4542) Cleanup AHS code and configuration
Junping Du created YARN-4542: Summary: Cleanup AHS code and configuration Key: YARN-4542 URL: https://issues.apache.org/jira/browse/YARN-4542 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du ATS (many versions so far) is designed to replace AHS. We should consider cleaning up AHS-related configuration and code at some point. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4527) Possible thread leak if TimelineClient.start() get called multiple times.
Junping Du created YARN-4527: Summary: Possible thread leak if TimelineClient.start() get called multiple times. Key: YARN-4527 URL: https://issues.apache.org/jira/browse/YARN-4527 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Junping Du Since YARN-4234, TimelineClient's start and stop create a different TimelineWriter according to the configuration. serviceStart() creates a new TimelineWriter instance every time, which then spawns several timer threads. If start() is called on one TimelineClient multiple times for some reason (an application bug, or intentionally in some cases), the spawned timer threads leak. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
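One common fix for this pattern is a compare-and-set guard so only the first start() spawns threads; a minimal sketch (the class and thread body are illustrative, not the TimelineClient internals):
{code}
import java.util.concurrent.atomic.AtomicBoolean;

public class IdempotentStartDemo {
  private final AtomicBoolean started = new AtomicBoolean(false);

  // Only the first call wins the CAS and spawns the timer thread;
  // repeated start() calls become no-ops, so no threads can leak.
  public void start() {
    if (!started.compareAndSet(false, true)) {
      return;
    }
    Thread timer = new Thread(() -> { /* periodic flush work */ });
    timer.setDaemon(true);
    timer.start();
  }

  public static void main(String[] args) {
    IdempotentStartDemo client = new IdempotentStartDemo();
    client.start();
    client.start(); // second call does not spawn another thread
  }
}
{code}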
[jira] [Created] (YARN-4501) Document new put APIs in TimelineClient for ATS 1.5
Junping Du created YARN-4501: Summary: Document new put APIs in TimelineClient for ATS 1.5 Key: YARN-4501 URL: https://issues.apache.org/jira/browse/YARN-4501 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Junping Du Assignee: Xuan Gong In YARN-4234, we are adding new put APIs to TimelineClient; we should document them properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4466) ResourceManager should tolerate unexpected exceptions to happen in non-critical subsystem/services like SystemMetricsPublisher
Junping Du created YARN-4466: Summary: ResourceManager should tolerate unexpected exceptions to happen in non-critical subsystem/services like SystemMetricsPublisher Key: YARN-4466 URL: https://issues.apache.org/jira/browse/YARN-4466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Junping Du From my comment in YARN-4452 (https://issues.apache.org/jira/browse/YARN-4452?focusedCommentId=15059805=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15059805), we should make the RM more robust by ignoring (but logging) unexpected exceptions in its non-critical subsystems/services. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
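A minimal sketch of the proposed behavior, with an illustrative wrapper (not the actual RM code): run the non-critical work, log any unexpected exception, and keep going:
{code}
public class NonCriticalGuardDemo {
  // Illustrative: execute a non-critical task (e.g. publishing a metrics
  // event) and log unexpected exceptions instead of letting them
  // propagate and take down the RM.
  static void runNonCritical(String serviceName, Runnable task) {
    try {
      task.run();
    } catch (RuntimeException e) {
      System.err.println("WARN: ignoring failure in non-critical service "
          + serviceName + ": " + e);
    }
  }

  public static void main(String[] args) {
    runNonCritical("SystemMetricsPublisher",
        () -> { throw new IllegalStateException("boom"); });
    System.out.println("RM keeps running");
  }
}
{code}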
[jira] [Created] (YARN-4429) RetryPolicies (other than FailoverOnNetworkExceptionRetry) should put on retry failed reason or the log from RMProxy's retry could be very misleading.
Junping Du created YARN-4429: Summary: RetryPolicies (other than FailoverOnNetworkExceptionRetry) should put on retry failed reason or the log from RMProxy's retry could be very misleading. Key: YARN-4429 URL: https://issues.apache.org/jira/browse/YARN-4429 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.6.0 Reporter: Junping Du Assignee: Junping Du In debugging a NM retry connection to RM (non-HA), the NM log during RM down time is very misleading: {noformat} 2015-12-07 11:37:14,098 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:15,099 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:16,101 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:17,103 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:18,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:19,107 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:20,109 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:21,112 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:22,113 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:23,115 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:54,120 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:55,121 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:56,123 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. 
Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:57,125 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:58,126 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:37:59,128 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 11:38:00,130 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) {noformat} This only logs the client-side retries on connection failure; it includes nothing about RetryInvocationHandler, where the real retry policy works. From the code below in RetryInvocationHandler.java, even when the retries end we emit no warning that records how much time or how many attempts were spent in the retry logic, which makes this harder to debug.
{code}
if (failAction != null) {
{code}
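A sketch of the kind of warning the description asks for, written as a generic retry loop (illustrative only; the real change would go into RetryInvocationHandler): record attempts and elapsed time, and log them when retries are exhausted:
{code}
public class RetryLoggingDemo {
  // Illustrative: when retries run out, log how many attempts were made
  // and how long they took, so the failure reason is visible in the log.
  static void callWithRetries(Runnable call, int maxRetries, long sleepMs)
      throws InterruptedException {
    long start = System.currentTimeMillis();
    for (int attempt = 1; ; attempt++) {
      try {
        call.run();
        return;
      } catch (RuntimeException e) {
        if (attempt >= maxRetries) {
          long elapsed = System.currentTimeMillis() - start;
          System.err.println("WARN: giving up after " + attempt
              + " attempts over " + elapsed + " ms: " + e);
          throw e;
        }
        Thread.sleep(sleepMs);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    callWithRetries(
        () -> { throw new RuntimeException("connection refused"); }, 3, 100);
  }
}
{code}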
[jira] [Created] (YARN-4431) Not necessary to do unRegisterNM() if NM get stop due to failed to connect to RM
Junping Du created YARN-4431: Summary: Not necessary to do unRegisterNM() if NM get stop due to failed to connect to RM Key: YARN-4431 URL: https://issues.apache.org/jira/browse/YARN-4431 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Junping Du {noformat} 2015-12-07 12:16:57,873 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 12:16:58,874 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-12-07 12:16:58,876 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unregistration of the Node 10.200.10.53:25454 failed. java.net.ConnectException: Call From jduMBP.local/10.200.10.53 to 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:408) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) at org.apache.hadoop.ipc.Client.call(Client.java:1452) at org.apache.hadoop.ipc.Client.call(Client.java:1385) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy74.unRegisterNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.unRegisterNodeManager(ResourceTrackerPBClientImpl.java:98) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:255) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at com.sun.proxy.$Proxy75.unRegisterNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:267) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStop(NodeStatusUpdaterImpl.java:245) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:377) {noformat} If the RM is down for some reason, the NM's NodeStatusUpdaterImpl retries the connection with the configured retry policy. After retrying the maximum number of times (15 minutes by default), it sends NodeManagerEventType.SHUTDOWN to shut the NM down. But the NM shutdown calls NodeStatusUpdaterImpl.serviceStop(), which calls unRegisterNM() to unregister the NM from the RM and starts retrying all over again (another 15 minutes).
This second round of retries is completely unnecessary; we should skip unRegisterNM() when the NM shuts down because of connection issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
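A minimal sketch of the proposed guard (field and method names are illustrative, not the actual NodeStatusUpdaterImpl members): remember that shutdown was triggered by connection failure and skip the unregister call in that case:
{code}
public class ShutdownGuardDemo {
  // Set when connection retries to the RM were exhausted.
  private volatile boolean rmUnreachable;

  void onConnectionRetriesExhausted() {
    rmUnreachable = true;
    stop();
  }

  void stop() {
    // Unregistering is only worth attempting if the RM was reachable;
    // otherwise it would just trigger another full round of retries.
    if (!rmUnreachable) {
      unregisterFromRM();
    }
    System.out.println("stopping remaining services");
  }

  void unregisterFromRM() {
    System.out.println("unRegisterNM()");
  }

  public static void main(String[] args) {
    new ShutdownGuardDemo().onConnectionRetriesExhausted();
  }
}
{code}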
[jira] [Created] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
Junping Du created YARN-4403: Summary: (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period Key: YARN-4403 URL: https://issues.apache.org/jira/browse/YARN-4403 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Junping Du Priority: Critical Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to calculate the expiry period, which can be broken by settimeofday. We should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
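For illustration, durations computed from a monotonic clock survive wall-clock resets; System.nanoTime() behaves like Hadoop's Time.monotonicNow() in this respect:
{code}
public class MonotonicExpiryDemo {
  public static void main(String[] args) throws InterruptedException {
    // System.currentTimeMillis() can jump if the wall clock is reset
    // (e.g. via settimeofday), which can falsely expire a monitored
    // AM/NM/container or keep it alive forever. A monotonic clock only
    // moves forward, so durations computed from it are safe.
    long start = System.nanoTime();
    Thread.sleep(100);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
    System.out.println("elapsed ~" + elapsedMs + " ms, immune to clock resets");
  }
}
{code}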
[jira] [Created] (YARN-4389) "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster
Junping Du created YARN-4389: Summary: "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster Key: YARN-4389 URL: https://issues.apache.org/jira/browse/YARN-4389 Project: Hadoop YARN Issue Type: Bug Components: applications Reporter: Junping Du Priority: Critical "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be application specific rather than cluster-level settings; otherwise we shouldn't maintain amBlacklistingEnabled and blacklistDisableThreshold per RMApp. We should allow each AM to override this config, e.g. via the submission context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4388) Cleanup "mapreduce.job.hdfs-servers" from yarn-default.xml
Junping Du created YARN-4388: Summary: Cleanup "mapreduce.job.hdfs-servers" from yarn-default.xml Key: YARN-4388 URL: https://issues.apache.org/jira/browse/YARN-4388 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Priority: Minor It is obvious that "mapreduce.job.hdfs-servers" doesn't belong in the YARN configuration, so we should move it to mapred-default.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4351) Tests in h.y.c.TestGetGroups get failed on trunk
Junping Du created YARN-4351: Summary: Tests in h.y.c.TestGetGroups get failed on trunk Key: YARN-4351 URL: https://issues.apache.org/jira/browse/YARN-4351 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du From the test report https://builds.apache.org/job/PreCommit-YARN-Build/9661/testReport/, we can see several test failures in TestGetGroups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4352) Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient
Junping Du created YARN-4352: Summary: Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient Key: YARN-4352 URL: https://issues.apache.org/jira/browse/YARN-4352 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du From https://builds.apache.org/job/PreCommit-YARN-Build/9661/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client-jdk1.7.0_79.txt, we can see that the tests in TestYarnClient, TestAMRMClient and TestNMClient time out, which can be reproduced locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-1949) Add admin ACL check to AdminService#updateNodeResource()
[ https://issues.apache.org/jira/browse/YARN-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-1949. -- Resolution: Duplicate The core change for the ACL update is already included in YARN-1506, so I am closing this JIRA as a duplicate. [~kj-ki], if you would like to continue your patch keeping only the test part, please reopen this JIRA and update your patch to sync with trunk. Thanks! > Add admin ACL check to AdminService#updateNodeResource() > > > Key: YARN-1949 > URL: https://issues.apache.org/jira/browse/YARN-1949 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Kenji Kikushima >Assignee: Kenji Kikushima > Attachments: YARN-1949.patch > > > At present, updateNodeResource() doesn't check ACL. We should call > checkAcls() before setResourceOption(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4346) Test committer.commitJob() behavior during committing when MR AM get failed.
Junping Du created YARN-4346: Summary: Test committer.commitJob() behavior during committing when MR AM get failed. Key: YARN-4346 URL: https://issues.apache.org/jira/browse/YARN-4346 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du In MAPREDUCE-5485, we are adding an additional API (isCommitJobRepeatable) to allow job commit to tolerate AM failure in some cases (like FileOutputCommitter with the v2 algorithm). Although we have unit tests covering most of the flows, we may want a complete end-to-end test to verify the whole workflow. The scenario includes: 1. For FileOutputCommitter (or some subclass), emulate an MR AM failure or restart while commitJob() is in progress. 2. Check the different behavior for v1 and v2 (supporting isCommitJobRepeatable() or not). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4288) NodeManager restart should keep retrying to register to RM while connection exception happens during RM restart
Junping Du created YARN-4288: Summary: NodeManager restart should keep retrying to register to RM while connection exception happens during RM restart Key: YARN-4288 URL: https://issues.apache.org/jira/browse/YARN-4288 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical When the NM is restarted, NodeStatusUpdaterImpl tries to register to the RM over RPC. If the RM is being restarted at the same time, that call can fail with an exception like the following: {noformat} 2015-08-17 14:35:59,434 ERROR nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:rebootNodeStatusUpdaterAndRegisterWithRM(222)) - Unexpected error rebooting NodeStatusUpdater java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1473) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215) at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.FilterInputStream.read(FilterInputStream.java:133) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967) 2015-08-17 14:35:59,436 FATAL nodemanager.NodeManager
(NodeManager.java:run(307)) - Error while rebooting NodeStatusUpdater. org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025; at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:223) at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304) Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "ebdp-ch2-172.27.62.28"; destination host is: "172.27.62.57":8025; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1473) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at {noformat}
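A sketch of the proposed behavior as a generic retry loop (the Registrar interface is illustrative; the real change would live in NodeStatusUpdaterImpl): treat connection exceptions during registration as retriable while the RM restarts:
{code}
import java.io.IOException;

public class RegisterRetryDemo {
  interface Registrar {
    void register() throws IOException;
  }

  // Illustrative: keep retrying registration while the RM restarts,
  // instead of treating one connection exception as fatal.
  static void registerWithRetries(Registrar r, int maxAttempts, long sleepMs)
      throws IOException, InterruptedException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        r.register();
        return;
      } catch (IOException e) {
        last = e;
        System.err.println("register attempt " + attempt + " failed: " + e);
        Thread.sleep(sleepMs);
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    registerWithRetries(() -> System.out.println("registered"), 5, 1000);
  }
}
{code}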
[jira] [Created] (YARN-4274) NodeStatusUpdaterImpl should register to RM again after a non-fatal exception happen before
Junping Du created YARN-4274: Summary: NodeStatusUpdaterImpl should register to RM again after a non-fatal exception happen before Key: YARN-4274 URL: https://issues.apache.org/jira/browse/YARN-4274 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Assignee: Junping Du From YARN-3896, a non-fatal exception like a response ID mismatch between NM and RM (due to a race condition) will cause the NM to stop working. I think we should make it more robust and tolerate a few registration failures before giving up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4160) Dynamic NM Resources Configuration file should be simplified.
Junping Du created YARN-4160: Summary: Dynamic NM Resources Configuration file should be simplified. Key: YARN-4160 URL: https://issues.apache.org/jira/browse/YARN-4160 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Junping Du In YARN-313, we provide a CLI to refresh NMs' resources dynamically. The format of dynamic-resources.xml is something like the following: {noformat}
<property>
  <name>yarn.resource.dynamic.node_id_1.vcores</name>
  <value>16</value>
</property>
<property>
  <name>yarn.resource.dynamic.node_id_1.memory</name>
  <value>1024</value>
</property>
{noformat} Review comments on YARN-313 found this too redundant; we should have a better, more concise format. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
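Purely as an illustration of what "more concise" could mean (a hypothetical format, not the one YARN-313 or this JIRA actually adopted), per-node resources could be collapsed into one entry:
{code}
<property>
  <name>yarn.resource.dynamic.nodes</name>
  <!-- hypothetical compact form: node_id=memory-mb:vcores, comma separated -->
  <value>node_id_1=1024:16,node_id_2=2048:32</value>
</property>
{code}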
[jira] [Created] (YARN-4031) Add JvmPauseMonitor to ApplicationHistoryServer
Junping Du created YARN-4031: Summary: Add JvmPauseMonitor to ApplicationHistoryServer Key: YARN-4031 URL: https://issues.apache.org/jira/browse/YARN-4031 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Junping Du Assignee: Junping Du We should add the JvmPauseMonitor to ApplicationHistoryServer and WebAppProxyServer, like what we did in YARN-4019 for ResourceManager and NodeManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3959) Store application related configurations in Timeline Service v2
Junping Du created YARN-3959: Summary: Store application related configurations in Timeline Service v2 Key: YARN-3959 URL: https://issues.apache.org/jira/browse/YARN-3959 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Junping Du Assignee: Junping Du We already have a configuration field in the HBase schema for the application entity. We need to make sure the AM writes it out when it gets launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
Junping Du created YARN-3815: Summary: [Aggregation] Application/Flow/User/Queue Level Aggregations Key: YARN-3815 URL: https://issues.apache.org/jira/browse/YARN-3815 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Junping Du Priority: Critical Per previous discussions in the design documents for YARN-2928, the basic scenario is that a stats query can happen at:
- Application level, expected return: an application with aggregated stats
- Flow level, expected return: aggregated stats for a flow_run, flow_version and flow
- User level, expected return: aggregated stats for applications submitted by the user
- Queue level, expected return: aggregated stats for applications within the queue
Application states are the basic building block for all other levels of aggregation. We can provide flow/user/queue level aggregated statistics based on application states (a dedicated table for application states is needed, which was missing from previous design documents like the HBase/Phoenix schema design). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3817) Flow and User level aggregation on Application States table
Junping Du created YARN-3817: Summary: Flow and User level aggregation on Application States table Key: YARN-3817 URL: https://issues.apache.org/jira/browse/YARN-3817 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Junping Du Assignee: Junping Du We need flow/user level aggregation to present flow/user related states to end users. Flow level aggregation involves three levels:
- The first is the flow_run level, which represents one execution of a flow and shows exactly the aggregated data for that run of the flow.
- The second is the flow_version level, which represents summary info for a version of a flow.
- The third is the flow level, which represents summary info for a specific flow.
User level aggregation represents summary info for a specific user; it should include accumulated and statistical summaries (at two levels: application and flow), such as the number of flows and applications, resource consumption, mean resource usage per app or flow, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3695) EOFException shouldn't be retry forever in RMProxy
Junping Du created YARN-3695: Summary: EOFException shouldn't be retry forever in RMProxy Key: YARN-3695 URL: https://issues.apache.org/jira/browse/YARN-3695 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du YARN-3646 fixed the retry-forever policy so that it only applies to a limited set of exceptions rather than all exceptions. Here, we may want to review those exceptions; at the very least, EOFException shouldn't be retried forever. {code}
exceptionToPolicyMap.put(EOFException.class, retryPolicy);
exceptionToPolicyMap.put(ConnectException.class, retryPolicy);
exceptionToPolicyMap.put(NoRouteToHostException.class, retryPolicy);
exceptionToPolicyMap.put(UnknownHostException.class, retryPolicy);
exceptionToPolicyMap.put(ConnectTimeoutException.class, retryPolicy);
exceptionToPolicyMap.put(RetriableException.class, retryPolicy);
exceptionToPolicyMap.put(SocketException.class, retryPolicy);
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
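A sketch of the suggested direction, assuming fail-fast is the desired behavior for EOFException (RetryPolicies.TRY_ONCE_THEN_FAIL is an existing Hadoop policy; the map mirrors the snippet above):
{code}
import java.io.EOFException;
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RetryPolicyMapSketch {
  // Illustrative: keep the long retry policy for genuine connectivity
  // problems, but let EOFException fail fast instead of retrying forever.
  static Map<Class<? extends Exception>, RetryPolicy> buildPolicyMap(
      RetryPolicy retryPolicy) {
    Map<Class<? extends Exception>, RetryPolicy> map = new HashMap<>();
    map.put(ConnectException.class, retryPolicy);
    map.put(EOFException.class, RetryPolicies.TRY_ONCE_THEN_FAIL);
    return map;
  }
}
{code}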
[jira] [Created] (YARN-3641) stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
Junping Du created YARN-3641: Summary: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Junping Du Priority: Critical If the NM's services are not stopped properly, we cannot start the NM with work-preserving NM restart enabled. The exception is the following: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 / {noformat} The related code in NodeManager.java is below: {code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  super.serviceStop();
  stopRecoveryStore();
  DefaultMetricsSystem.shutdown();
}
{code} We stop all NM registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. If any of those services throws an exception while stopping, stopRecoveryStore() is skipped, which means the leveldb store is never closed, so the next NM start fails with the exception above. We should put stopRecoveryStore() in a finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
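The fix proposed in the last sentence, sketched against the quoted method (untested, for illustration):
{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  try {
    super.serviceStop();
  } finally {
    // Always close the leveldb store, even if a sub-service failed to
    // stop, so the next NM start can acquire the LOCK file.
    stopRecoveryStore();
    DefaultMetricsSystem.shutdown();
  }
}
{code}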
[jira] [Resolved] (YARN-3599) Fix the javadoc of DelegationTokenSecretManager in hadoop-yarn
[ https://issues.apache.org/jira/browse/YARN-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3599. -- Resolution: Duplicate Fix the javadoc of DelegationTokenSecretManager in hadoop-yarn -- Key: YARN-3599 URL: https://issues.apache.org/jira/browse/YARN-3599 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gabor Liptak Priority: Trivial Attachments: YARN-3599.1.patch, YARN-3599.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3596) Fix the javadoc of DelegationTokenSecretManager in hadoop-common
[ https://issues.apache.org/jira/browse/YARN-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3596. -- Resolution: Duplicate Fix the javadoc of DelegationTokenSecretManager in hadoop-common Key: YARN-3596 URL: https://issues.apache.org/jira/browse/YARN-3596 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gabor Liptak Priority: Trivial Attachments: YARN-3596.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3597) Fix the javadoc of DelegationTokenSecretManager in hadoop-hdfs
[ https://issues.apache.org/jira/browse/YARN-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3597. -- Resolution: Duplicate Fix the javadoc of DelegationTokenSecretManager in hadoop-hdfs -- Key: YARN-3597 URL: https://issues.apache.org/jira/browse/YARN-3597 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gabor Liptak Priority: Trivial Attachments: YARN-3597.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3598) Fix the javadoc of DelegationTokenSecretManager in hadoop-mapreduce
[ https://issues.apache.org/jira/browse/YARN-3598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3598. -- Resolution: Duplicate Fix the javadoc of DelegationTokenSecretManager in hadoop-mapreduce --- Key: YARN-3598 URL: https://issues.apache.org/jira/browse/YARN-3598 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gabor Liptak Priority: Trivial Attachments: YARN-3598.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3592) Fix typos in RMNodeLabelsManager
Junping Du created YARN-3592: Summary: Fix typos in RMNodeLabelsManager Key: YARN-3592 URL: https://issues.apache.org/jira/browse/YARN-3592 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Junping Du acccessibleNodeLabels should be accessibleNodeLabels in many places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3586) RM only get back addresses of Collectors that NM needs to know.
Junping Du created YARN-3586: Summary: RM only get back addresses of Collectors that NM needs to know. Key: YARN-3586 URL: https://issues.apache.org/jira/browse/YARN-3586 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, timelineserver Reporter: Junping Du Assignee: Junping Du After YARN-3445, the RM caches runningApps for each NM, so the RM's heartbeat response to an NM should only include collectors' addresses for the applications running on that specific NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2470) A high value for yarn.nodemanager.delete.debug-delay-sec causes Nodemanager to crash. Slider needs this value to be high. Setting a very high value throws an exception an
[ https://issues.apache.org/jira/browse/YARN-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2470. -- Resolution: Won't Fix A high value for yarn.nodemanager.delete.debug-delay-sec causes Nodemanager to crash. Slider needs this value to be high. Setting a very high value throws an exception and nodemanager does not start -- Key: YARN-2470 URL: https://issues.apache.org/jira/browse/YARN-2470 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.1 Reporter: Shivaji Dutta Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2483) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails due to incorrect AppAttempt state
[ https://issues.apache.org/jira/browse/YARN-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2483. -- Resolution: Duplicate Target Version/s: (was: 2.6.0) Resolve this JIRA as duplicated. TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails due to incorrect AppAttempt state Key: YARN-2483 URL: https://issues.apache.org/jira/browse/YARN-2483 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/665/console : {code} testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart) Time elapsed: 49.686 sec FAILURE! java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:ALLOCATED but was:SCHEDULED at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:84) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:417) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:582) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:589) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForNewAMToLaunchAndRegister(MockRM.java:182) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:402) {code} TestApplicationMasterLauncher#testallocateBeforeAMRegistration fails with similar cause. These tests failed in build #664 as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity
[ https://issues.apache.org/jira/browse/YARN-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2445. -- Resolution: Won't Fix Agree with [~billie.rinaldi]'s comments above, this is expected behavior. ATS does not reflect changes to uploaded TimelineEntity --- Key: YARN-2445 URL: https://issues.apache.org/jira/browse/YARN-2445 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Marcelo Vanzin Priority: Minor Attachments: ats2.java If you make a change to the TimelineEntity and send it to the ATS, that change is not reflected in the stored data. For example, in the attached code, an existing primary filter is removed and a new one is added. When you retrieve the entity from the ATS, it only contains the old value: {noformat} {entities:[{events:[],entitytype:test,entity:testid-ad5380c0-090e-4982-8da8-21676fe4e9f4,starttime:1408746026958,relatedentities:{},primaryfilters:{oldprop:[val]},otherinfo:{}}]} {noformat} Perhaps this is what the design wanted, but from an API user standpoint, it's really confusing, since to upload events I have to upload the entity itself, and the changes are not reflected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-670) Add an Exception to indicate 'Maintenance' for NMs and add this to the JavaDoc for appropriate protocols
[ https://issues.apache.org/jira/browse/YARN-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-670. - Resolution: Won't Fix Add an Exception to indicate 'Maintenance' for NMs and add this to the JavaDoc for appropriate protocols Key: YARN-670 URL: https://issues.apache.org/jira/browse/YARN-670 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-964) Give a parameter that can set AM retry interval
[ https://issues.apache.org/jira/browse/YARN-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-964. - Resolution: Won't Fix Agree with [~vinodkv]. Resolving this as Won't Fix. Give a parameter that can set AM retry interval Key: YARN-964 URL: https://issues.apache.org/jira/browse/YARN-964 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.3.0 Reporter: qus-jiawei Our AM retry number is 4. As one NodeManager's disk was full, the AM's container couldn't be allocated on that NodeManager, but the RM tried this AM on the same NM every 3 seconds. I think there should be a parameter to set the AM retry interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2365) TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry fails on branch-2
[ https://issues.apache.org/jira/browse/YARN-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2365. -- Resolution: Cannot Reproduce TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry fails on branch-2 -- Key: YARN-2365 URL: https://issues.apache.org/jira/browse/YARN-2365 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Mit Desai TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails on branch-2 with the following error: {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 46.471 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart) Time elapsed: 46.354 sec FAILURE! java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:ALLOCATED but was:SCHEDULED at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:414) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:569) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:576) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:389) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3567) Document exit codes and their meanings used by LinuxContainerExecutor.
Junping Du created YARN-3567: Summary: Document exit codes and their meanings used by LinuxContainerExecutor. Key: YARN-3567 URL: https://issues.apache.org/jira/browse/YARN-3567 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Similar to YARN-2334, we should document the exit codes and their meanings for LinuxContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2334) Document exit codes and their meanings used by linux task controller.
[ https://issues.apache.org/jira/browse/YARN-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2334. -- Resolution: Not A Problem Document exit codes and their meanings used by linux task controller. - Key: YARN-2334 URL: https://issues.apache.org/jira/browse/YARN-2334 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Sreekanth Ramakrishnan Attachments: HADOOP-5912.1.patch, MAPREDUCE-1318.1.patch, MAPREDUCE-1318.2.patch, MAPREDUCE-1318.patch Currently, the linux task controller binary uses a set of exit codes which are not documented. These should be documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2364) TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy
[ https://issues.apache.org/jira/browse/YARN-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2364. -- Resolution: Duplicate Duplicated with YARN-1468 TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy Key: YARN-2364 URL: https://issues.apache.org/jira/browse/YARN-2364 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Mit Desai TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy. It fails intermittently on branch-2 with the following errors. Fails with any of these {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 26.836 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 26.687 sec FAILURE! java.lang.AssertionError: expected:4 but was:3 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:557) {noformat} or {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 51.326 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 51.055 sec FAILURE! java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:ALLOCATED but was:SCHEDULED at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:414) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.launchAM(TestRMRestart.java:949) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:519) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2390. -- Resolution: Won't Fix Investigating whether generic history service needs to support queue-acls - Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Sunil G According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to track the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2401) Rethinking of the HTTP method of TimelineWebServices#postEntities
[ https://issues.apache.org/jira/browse/YARN-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2401. -- Resolution: Won't Fix Rethinking of the HTTP method of TimelineWebServices#postEntities - Key: YARN-2401 URL: https://issues.apache.org/jira/browse/YARN-2401 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Now TimelineWebServices#postEntities is using POST. However, semantically, postEntities creates an entity or appends more data to it, so POST may not be the most appropriate method for this API. AFAIK, PUT is used to update the entire resource and is supposed to be idempotent, so I'm not sure it's a good idea to change the method to PUT: once the entity is created, the following updates actually append more data to the existing one. The best fit would be PATCH, but that requires additional implementation on the web services side; hence some people online suggest using POST for partial non-idempotent updates as well. We need to think more about it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2398. -- Resolution: Cannot Reproduce TestResourceTrackerOnHA crashes --- Key: YARN-2398 URL: https://issues.apache.org/jira/browse/YARN-2398 Project: Hadoop YARN Issue Type: Test Reporter: Jason Lowe TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-880) Configuring map/reduce memory equal to nodemanager's memory, hangs the job execution
[ https://issues.apache.org/jira/browse/YARN-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-880. - Resolution: Not A Problem Assignee: (was: Omkar Vinit Joshi) Configuring map/reduce memory equal to nodemanager's memory, hangs the job execution Key: YARN-880 URL: https://issues.apache.org/jira/browse/YARN-880 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.1-alpha Reporter: Nishan Shetty Priority: Critical Scenario: the cluster is installed with 2 NodeManagers. Configuration: NM memory (yarn.nodemanager.resource.memory-mb): 8 GB; map and reduce memory: 8 GB; AppMaster memory: 2 GB. If a map task is reserved on the same NodeManager where the AppMaster of the same job is running, job execution hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-943) RM starts 2 attempts of failed app even though am-max-retries is set to 1
[ https://issues.apache.org/jira/browse/YARN-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-943. - Resolution: Not A Problem Assignee: (was: Zhijie Shen) RM starts 2 attempts of failed app even though am-max-retries is set to 1 - Key: YARN-943 URL: https://issues.apache.org/jira/browse/YARN-943 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Bikas Saha Attachments: nm.log, rm.log, yarn-site.xml <name>yarn.resourcemanager.am.max-retries</name> is set to 1 but the RM still retries the AM 2 times. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3431. -- Resolution: Fixed Fix Version/s: YARN-2928 Hadoop Flags: Reviewed I have committed this to YARN-2928. Thanks [~zjshen] for contributing the patch! Also, thanks for the reviews, [~sjlee0] and [~gtCarrera9]! Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: YARN-2928 Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch, YARN-3431.4.patch, YARN-3431.5.patch, YARN-3431.6.patch, YARN-3431.7.patch We have TimelineEntity and some other entities as subclasses that inherit from it. However, we only have a single endpoint, which consumes TimelineEntity rather than the subclasses, and this endpoint checks that the incoming request body contains exactly a TimelineEntity object. The json data serialized from a subclass object is not treated as a TimelineEntity object, and won't be deserialized into the corresponding subclass object, which causes deserialization failures, as discussed in YARN-3334: https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
Junping Du created YARN-3505: Summary: Node's Log Aggregation Report with SUCCEED should not cached in RMApps Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Per discussions in YARN-1402, we shouldn't cache every node's log aggregation report in RMApps forever, especially those that finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3488) AM get timeline service info from RM rather than Application specific configuration.
Junping Du created YARN-3488: Summary: AM get timeline service info from RM rather than Application specific configuration. Key: YARN-3488 URL: https://issues.apache.org/jira/browse/YARN-3488 Project: Hadoop YARN Issue Type: Sub-task Components: applications Reporter: Junping Du Assignee: Junping Du Since the v1 timeline service, we have had MR configuration to enable/disable putting history events to the timeline service. For the ongoing v2 timeline service effort, we currently have different methods/structures between v1 and v2 for consuming TimelineClient, so an application has to be aware of which version of the timeline service is in use. There are basically two options here. The first, as currently done in DistributedShell and MR, is to give the application a specific configuration that says whether ATS is enabled and which version it is, like MRJobConfig.MAPREDUCE_JOB_EMIT_TIMELINE_DATA, etc. The other option is to let the application figure out timeline-related info from YARN/RM; this can be done through registerApplicationMaster() in ApplicationMasterProtocol, with a return value indicating service off, v1_on, or v2_on. We prefer the latter option because the application owner doesn't have to be aware of RM/YARN infrastructure details. Please note that we should stay compatible (consistent behavior with the same settings) with released configurations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3449. -- Resolution: Invalid Assignee: (was: Junping Du) Recover appTokenKeepAliveMap upon nodemanager restart - Key: YARN-3449 URL: https://issues.apache.org/jira/browse/YARN-3449 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0, 2.7.0 Reporter: Junping Du appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep an application alive after it has finished while the NM still needs the app token to do log aggregation (when security and log aggregation are enabled). Applications are only inserted into this map when getApplicationsToCleanup() is received from the RM heartbeat response, and the RM only sends this info once, in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). Work-preserving NM restart should put appTokenKeepAliveMap into the NMStateStore and recover it after restart. Without this, the RM could terminate the application earlier, so log aggregation could fail when security is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-329) yarn CHANGES.txt link missing from docs Reference
[ https://issues.apache.org/jira/browse/YARN-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-329. - Resolution: Fixed Fix Version/s: 2.6.0 This has been fixed since the 2.6.0 release; marking it as resolved. yarn CHANGES.txt link missing from docs Reference - Key: YARN-329 URL: https://issues.apache.org/jira/browse/YARN-329 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.3-alpha, 0.23.5 Reporter: Thomas Graves Priority: Minor Fix For: 2.6.0 Looking at the hadoop 0.23 docs: http://hadoop.apache.org/docs/r0.23.5/ There is no link to the yarn CHANGES.txt in the Reference menu on the left side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-374) Job History Server doesn't show jobs killed by ClientRMProtocol.forceKillApplication
[ https://issues.apache.org/jira/browse/YARN-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-374. - Resolution: Not a Problem Job History Server doesn't show jobs killed by ClientRMProtocol.forceKillApplication -- Key: YARN-374 URL: https://issues.apache.org/jira/browse/YARN-374 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Affects Versions: 2.0.1-alpha Reporter: Nemon Lou After I kill an app by typing bin/yarn rmadmin app -kill APP_ID, no job info is kept on the JHS web page. However, when I kill a job by typing bin/mapred job -kill JOB_ID, I can see the killed job on the JHS. Some Hive users are confused that their jobs were killed but nothing is left on the JHS, and the killed app's info on the RM web page is not enough. (They kill jobs via ClientRMProtocol.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2969) Allocate resources on different nodes for tasks
[ https://issues.apache.org/jira/browse/YARN-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2969. -- Resolution: Duplicate Allocate resources on different nodes for tasks - Key: YARN-2969 URL: https://issues.apache.org/jira/browse/YARN-2969 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Yang Hao With the help of Slider, YARN is becoming a common resource-managing OS, and some applications would like to place containers (or components, in Slider terms) on different nodes, so a configuration for allocating resources on different nodes would be helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-463) Show explicitly excluded nodes on the UI
[ https://issues.apache.org/jira/browse/YARN-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-463. - Resolution: Implemented We already show decommissioned nodes on the UI page, so resolving this JIRA. Show explicitly excluded nodes on the UI Key: YARN-463 URL: https://issues.apache.org/jira/browse/YARN-463 Project: Hadoop YARN Issue Type: Improvement Reporter: Vinod Kumar Vavilapalli Labels: usability Nodes can be explicitly excluded via the config yarn.resourcemanager.nodes.exclude-path. We should have a way of displaying this list via the web and command-line UIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-551) The shell_command option of DistributedShell should support compound commands
[ https://issues.apache.org/jira/browse/YARN-551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-551. - Resolution: Not a Problem The shell_command option of DistributedShell should support compound commands - Key: YARN-551 URL: https://issues.apache.org/jira/browse/YARN-551 Project: Hadoop YARN Issue Type: Improvement Reporter: rainy Yu The shell_command option of DistributedShell must be a single command such as 'ls'; it cannot be a compound command such as 'ps -ef' that includes whitespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-375) FIFO scheduler may crash due to buggy app
[ https://issues.apache.org/jira/browse/YARN-375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-375. - Resolution: Not a Problem FIFO scheduler may crash due to buggy app -- Key: YARN-375 URL: https://issues.apache.org/jira/browse/YARN-375 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.0-alpha Reporter: Eli Collins Assignee: Arun C Murthy Priority: Critical The following code should check for a zero capability rather than crash:
{code}
int availableContainers =
    node.getAvailableResource().getMemory() / capability.getMemory();
// TODO: A buggy application with a zero-memory capability would crash the scheduler.
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
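One possible guard, sketched under the assumption that the divisor comes from the requested capability (the class and method names are illustrative, not the FifoScheduler code):
{code}
// Illustrative guard, not the actual scheduler implementation: reject a
// non-positive memory request before dividing, rather than letting a buggy
// application crash the scheduler with an ArithmeticException.
public class CapacityGuard {
    static int availableContainers(int nodeAvailableMemory, int requestedMemory) {
        if (requestedMemory <= 0) {
            return 0; // treat a zero/negative capability as "no containers fit"
        }
        return nodeAvailableMemory / requestedMemory;
    }

    public static void main(String[] args) {
        System.out.println(availableContainers(8192, 0)); // prints 0 instead of crashing
    }
}
{code}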
[jira] [Resolved] (YARN-800) Clicking on an AM link for a running app leads to an HTTP 500
[ https://issues.apache.org/jira/browse/YARN-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-800. - Resolution: Duplicate Clicking on an AM link for a running app leads to an HTTP 500 Key: YARN-800 URL: https://issues.apache.org/jira/browse/YARN-800 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Arpit Gupta Priority: Minor Clicking the AM link tries to open a page with a URL like http://hostname:8088/proxy/application_1370886527995_0645/ and this leads to an HTTP 500. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-520) webservices API ws/v1/cluster/nodes doesn't return LOST nodes
[ https://issues.apache.org/jira/browse/YARN-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-520. - Resolution: Duplicate webservices API ws/v1/cluster/nodes doesn't return LOST nodes - Key: YARN-520 URL: https://issues.apache.org/jira/browse/YARN-520 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.6 Reporter: Nathan Roberts webservices API ws/v1/cluster/nodes doesn't return LOST nodes -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart
Junping Du created YARN-3449: Summary: Recover appTokenKeepAliveMap upon nodemanager restart Key: YARN-3449 URL: https://issues.apache.org/jira/browse/YARN-3449 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0, 2.7.0 Reporter: Junping Du Assignee: Junping Du appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep an application alive after it finishes while the NM still needs the app token to do log aggregation (when security and log aggregation are enabled). Applications are only inserted into this map when getApplicationsToCleanup() arrives in an RM heartbeat response, and the RM only sends this info once, in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). NM restart work preserving should put appTokenKeepAliveMap into the NMStateStore so it gets recovered after a restart. Without this, the RM could terminate the application early, and log aggregation could fail when security is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
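A hedged sketch of the proposal (the state-store interface below is a hypothetical stand-in; NMStateStoreService has no such API): persist each keep-alive entry as it arrives, and reload the map during recovery.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: models persisting appTokenKeepAliveMap across an NM restart.
public class KeepAliveRecoverySketch {
    // Hypothetical persistence interface; not a real NMStateStore API.
    interface KeepAliveStore {
        void storeKeepAlive(String appId, long expiryMillis);
        Map<String, Long> loadKeepAlive();
    }

    // appId -> expiry time for apps kept alive so log aggregation can finish
    private final Map<String, Long> appTokenKeepAliveMap = new ConcurrentHashMap<>();

    // Called when getApplicationsToCleanup() arrives; persist immediately,
    // because the RM sends this information only once.
    void onApplicationsToCleanup(String appId, long expiryMillis, KeepAliveStore store) {
        appTokenKeepAliveMap.put(appId, expiryMillis);
        store.storeKeepAlive(appId, expiryMillis);
    }

    // Called during NM restart recovery to repopulate the in-memory map.
    void recover(KeepAliveStore store) {
        appTokenKeepAliveMap.putAll(store.loadKeepAlive());
    }
}
{code}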
[jira] [Created] (YARN-3445) NM should notify RM of running apps in the NM-RM heartbeat
Junping Du created YARN-3445: Summary: NM should notify RM of running apps in the NM-RM heartbeat Key: YARN-3445 URL: https://issues.apache.org/jira/browse/YARN-3445 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Junping Du Assignee: Junping Du Per discussion in YARN-3334, we need to filter unnecessary collector info out of the RM heartbeat response. Our proposal is to add an additional field for running apps to the NM heartbeat request, so the RM sends back only the collectors for apps running on that node. This is also needed by YARN-914 (graceful decommission): if an NM in the decommissioning stage has no running apps, it can be decommissioned immediately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
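A minimal sketch of the RM-side filtering this enables, with illustrative types (the real change would extend the NM heartbeat request, not introduce new classes): given the running app IDs the NM reported, return only their collectors.
{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the RM answers a heartbeat with collector addresses
// for just the apps the NM reported as running, not every known collector.
public class HeartbeatCollectorFilter {
    static Map<String, String> collectorsFor(List<String> runningAppIds,
                                             Map<String, String> allCollectors) {
        Map<String, String> result = new HashMap<>();
        for (String appId : runningAppIds) {
            String address = allCollectors.get(appId);
            if (address != null) {
                result.put(appId, address); // only collectors for local running apps
            }
        }
        return result;
    }
}
{code}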
[jira] [Resolved] (YARN-3374) Collector's web server should randomly bind an available port
[ https://issues.apache.org/jira/browse/YARN-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3374. -- Resolution: Fixed Fix Version/s: YARN-2928 Hadoop Flags: Reviewed Committed to branch YARN-2928. Thanks [~zjshen] for the patch! Collector's web server should randomly bind an available port - Key: YARN-3374 URL: https://issues.apache.org/jira/browse/YARN-3374 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: YARN-2928 Attachments: YARN-3347.1.patch The port is based on configuration now. That approach won't work if we move to the app-level aggregator container solution: an NM may start multiple such aggregators, which cannot all bind to the same configured port. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
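For reference, the standard JVM idiom behind "randomly bind an available port" is to bind to port 0 and let the OS choose an ephemeral port, so several collectors on one NM never collide on a single configured port. A minimal sketch, not the committed patch:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class EphemeralPortExample {
    public static void main(String[] args) throws IOException {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(0)); // port 0 => OS-assigned port
            System.out.println("bound to port " + socket.getLocalPort());
        }
    }
}
{code}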
[jira] [Created] (YARN-3408) TestDistributedShell fails due to RM start failure.
Junping Du created YARN-3408: Summary: TestDistributedShell fails due to RM start failure. Key: YARN-3408 URL: https://issues.apache.org/jira/browse/YARN-3408 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Junping Du Assignee: Junping Du The exception from the log:
{code}
2015-03-27 14:43:17,190 WARN [RM-0] mortbay.log (Slf4jLog.java:warn(89)) - Failed startup of context org.mortbay.jetty.webapp.WebAppContext@2d2d0132{/,file:/Users/jdu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/classes/webapps/cluster}
javax.servlet.ServletException: java.lang.RuntimeException: Could not read signature secret file: /Users/jdu/hadoop-http-auth-signature-secret
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.initializeSecretProvider(AuthenticationFilter.java:266)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.init(AuthenticationFilter.java:225)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.init(DelegationTokenAuthenticationFilter.java:161)
at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.init(RMAuthenticationFilter.java:53)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:773)
at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:274)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:989)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1089)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.MiniYARNCluster$2.run(MiniYARNCluster.java:312)
Caused by: java.lang.RuntimeException: Could not read signature secret file: /Users/jdu/hadoop-http-auth-signature-secret
at org.apache.hadoop.security.authentication.util.FileSignerSecretProvider.init(FileSignerSecretProvider.java:59)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.initializeSecretProvider(AuthenticationFilter.java:264)
... 23 more
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
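One hedged workaround for a test environment, assuming the failure is just a missing or unreadable secret file: create a readable file and point the standard Hadoop HTTP auth property at it before starting the cluster (the temp-file path below is illustrative, not the project's fix).
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.hadoop.conf.Configuration;

public class SecretFileWorkaround {
    public static void main(String[] args) throws IOException {
        // Write a throwaway secret the AuthenticationFilter can actually read.
        Path secret = Files.createTempFile("http-auth-signature-secret", "");
        Files.write(secret, "test-secret".getBytes(StandardCharsets.UTF_8));

        Configuration conf = new Configuration();
        conf.set("hadoop.http.authentication.signature.secret.file",
                 secret.toAbsolutePath().toString());
        // ... pass conf to the MiniYARNCluster / RM startup under test.
    }
}
{code}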
[jira] [Created] (YARN-3402) Security support for new timeline service.
Junping Du created YARN-3402: Summary: Security support for new timeline service. Key: YARN-3402 URL: https://issues.apache.org/jira/browse/YARN-3402 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Junping Du Assignee: Junping Du We should support YARN security for the new TimelineService. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context
[ https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3040. -- Resolution: Fixed Fix Version/s: YARN-2928 Hadoop Flags: Reviewed [Data Model] Make putEntities operation be aware of the app's context - Key: YARN-3040 URL: https://issues.apache.org/jira/browse/YARN-3040 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Fix For: YARN-2928 Attachments: YARN-3040.1.patch, YARN-3040.2.patch, YARN-3040.3.patch, YARN-3040.4.patch, YARN-3040.5.patch, YARN-3040.6.patch Per design in YARN-2928, implement client-side API for handling *flows*. Frameworks should be able to define and pass in all attributes of flows and flow runs to YARN, and they should be passed into ATS writers. YARN tags were discussed as a way to handle this piece of information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
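A hedged illustration of the YARN-tags idea mentioned above: a framework attaches flow attributes as application tags at submission time so the ATS writers can pick them up. The "key:value" tag format below is made up for illustration; the actual convention is whatever this JIRA settles on.
{code}
import java.util.Arrays;
import java.util.HashSet;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

public class FlowTagsSketch {
    static void attachFlowTags(ApplicationSubmissionContext appContext) {
        // Hypothetical tag format for flow name, version, and run id.
        appContext.setApplicationTags(new HashSet<>(Arrays.asList(
                "flow_name:daily-etl",
                "flow_version:1",
                "flow_run_id:1427150400")));
    }
}
{code}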
[jira] [Resolved] (YARN-3333) rename TimelineAggregator etc. to TimelineCollector
[ https://issues.apache.org/jira/browse/YARN-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-3333. -- Resolution: Fixed Hadoop Flags: Reviewed rename TimelineAggregator etc. to TimelineCollector --- Key: YARN-3333 URL: https://issues.apache.org/jira/browse/YARN-3333 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3333-unit-tests-fixes.patch, YARN-3333.001.patch, YARN-3333.002.patch Per discussions on YARN-2928, let's rename TimelineAggregator, etc. to TimelineCollector, etc. There are also several minor issues on the current branch, which can be fixed as part of this: - fixing some imports - missing license in TestTimelineServerClientIntegration.java - whitespaces - missing direct dependency -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3367) Replace starting a separate thread for posting entities with an event loop in TimelineClient
Junping Du created YARN-3367: Summary: Replace starting a separate thread for posting entities with an event loop in TimelineClient Key: YARN-3367 URL: https://issues.apache.org/jira/browse/YARN-3367 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Since YARN-3039, we added a loop in TimelineClient that waits for collectorServiceAddress to be ready before posting any entity. In consumers of TimelineClient (like the AM), we start a new thread for each call to avoid a potential deadlock in the main thread. This approach has at least three major defects: 1. The consumer needs extra code to wrap each putEntities() call in a thread. 2. It consumes many thread resources unnecessarily. 3. Events can arrive out of order because each posting thread leaves the waiting loop at a random time. We should have something like an event loop on the TimelineClient side: putEntities() only puts the entities into a queue, and a separate thread delivers the queued entities to the collector via REST calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
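A minimal sketch of the event-loop idea, with Entity and deliver() as illustrative stand-ins for the TimelineClient types: putEntities() only enqueues, and a single dispatcher thread drains the queue, which also keeps delivery order stable.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: not the TimelineClient API.
public class TimelinePutLoopSketch {
    static final class Entity {
        final String id;
        Entity(String id) { this.id = id; }
    }

    private final BlockingQueue<Entity> queue = new LinkedBlockingQueue<>();

    public TimelinePutLoopSketch() {
        Thread dispatcher = new Thread(() -> {
            try {
                while (true) {
                    deliver(queue.take()); // single consumer preserves event order
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down quietly
            }
        }, "timeline-entity-dispatcher");
        dispatcher.setDaemon(true);
        dispatcher.start();
    }

    // Callers need no wrapper thread; this never blocks on the collector.
    public void putEntities(Entity entity) {
        queue.add(entity);
    }

    private void deliver(Entity entity) {
        // The REST call to the collector would go here.
    }
}
{code}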