[jira] [Created] (YARN-7646) MR job (based on old version tarball) get failed due to incompatible resource request

2017-12-12 Thread Junping Du (JIRA)
Junping Du created YARN-7646:


 Summary: MR job (based on old version tarball) get failed due to 
incompatible resource request
 Key: YARN-7646
 URL: https://issues.apache.org/jira/browse/YARN-7646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Junping Du
Priority: Blocker


With a quick workaround for HDFS-12920 (setting a value without a time unit in 
hdfs-site.xml), the job still fails with the following error:
{noformat}
2017-12-12 16:39:13,105 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request, requested memory < 0, or requested memory > max configured, 
requestedMemory=-1, maxMemory=8192
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:275)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:240)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:256)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:246)
at 
org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:217)
at 
org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:388)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy81.allocate(Unknown Source)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:206)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:783)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:280)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:279)
at java.lang.Thread.run(Thread.java:745)
{noformat}
It looks like an incompatible change in the communication between the old MR client 
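For reference, a minimal sketch (hypothetical names, not the actual SchedulerUtils code) of the 
kind of bound check that produces the error above; presumably the old client's request arrives 
with the memory field unset/-1, which fails the "requested memory < 0" validation:
{code}
public class ResourceRequestBoundCheck {

  // Sketch of the RM-side validation: reject negative or over-limit memory asks.
  static void validateMemory(long requestedMemory, long maxMemory) {
    if (requestedMemory < 0 || requestedMemory > maxMemory) {
      throw new IllegalArgumentException("Invalid resource request, requested memory < 0,"
          + " or requested memory > max configured, requestedMemory=" + requestedMemory
          + ", maxMemory=" + maxMemory);
    }
  }

  public static void main(String[] args) {
    validateMemory(-1, 8192);   // reproduces the failure mode seen in the trace
  }
}
{code}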

[jira] [Created] (YARN-7230) Document DockerContainerRuntime for branch-2.8 with proper scope and claim as an experimental feature

2017-09-20 Thread Junping Du (JIRA)
Junping Du created YARN-7230:


 Summary: Document DockerContainerRuntime for branch-2.8 with 
proper scope and claim as an experimental feature
 Key: YARN-7230
 URL: https://issues.apache.org/jira/browse/YARN-7230
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.8.1
Reporter: Junping Du
Priority: Blocker


YARN-5258 documents the new Docker container runtime feature, which has already been 
checked in to trunk/branch-2. We need a similar document for branch-2.8. However, 
given that we are missing several patches there, we need to define a narrowed scope 
for these features/improvements that matches the patches actually landed in 2.8. Also, 
like YARN-6622, the feature should be documented as experimental.






[jira] [Created] (YARN-7138) Fix incompatible API change for YarnScheduler involved by YARN-5521

2017-08-30 Thread Junping Du (JIRA)
Junping Du created YARN-7138:


 Summary: Fix incompatible API change for YarnScheduler involved by 
YARN-5521
 Key: YARN-7138
 URL: https://issues.apache.org/jira/browse/YARN-7138
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Junping Du
Priority: Blocker


The JACC report for 2.8.2 against 2.7.4 indicates that we have incompatible changes 
in YarnScheduler:
{noformat}
hadoop-yarn-server-resourcemanager-2.7.4.jar, YarnScheduler.class
package org.apache.hadoop.yarn.server.resourcemanager.scheduler
YarnScheduler.allocate ( ApplicationAttemptId p1, List p2, 
List p3, List p4, List p5 ) [abstract]  :  
Allocation 
{noformat}
The root cause is YARN-5221. We should either change it back or work around this by 
adding back the original API (marked as deprecated if it is no longer used), as sketched below.
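A minimal sketch of the "add back the original API" option, using hypothetical types rather than 
the real YarnScheduler signature: the old overload is restored, marked deprecated, and simply 
delegates to the new method.
{code}
import java.util.Collections;
import java.util.List;

// Sketch only: hypothetical scheduler interface showing how a removed overload can
// be restored for compatibility while steering new code to the new signature.
interface SchedulerCompatSketch {

  // New signature (extra parameter added by the incompatible change).
  String allocate(String attemptId, List<String> ask, List<String> updateRequests,
      List<String> release, List<String> blacklistAdditions, List<String> blacklistRemovals);

  // Old signature added back; existing callers keep compiling and running.
  @Deprecated
  default String allocate(String attemptId, List<String> ask, List<String> release,
      List<String> blacklistAdditions, List<String> blacklistRemovals) {
    return allocate(attemptId, ask, Collections.emptyList(), release,
        blacklistAdditions, blacklistRemovals);
  }
}
{code}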






[jira] [Created] (YARN-7124) CLONE - Log aggregation deletes/renames while file is open

2017-08-29 Thread Junping Du (JIRA)
Junping Du created YARN-7124:


 Summary: CLONE - Log aggregation deletes/renames while file is open
 Key: YARN-7124
 URL: https://issues.apache.org/jira/browse/YARN-7124
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.8.2
Reporter: Daryn Sharp
Assignee: Jason Lowe
Priority: Critical


YARN-6288 changes the log aggregation writer to be an AutoCloseable.  
Unfortunately, the try-with-resources block for the writer will either rename or 
delete the log while it is still open.

Assuming the NM's behavior is otherwise correct, deleting open files only results in 
ominous WARNs in the nodemanager log and increases the rate of logging in the 
NN when the implicit try-with-resources close fails.  These red herrings 
complicate debugging efforts.
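A minimal sketch of the pattern being described, with hypothetical local-file code standing in for 
the HDFS-backed writer: the rename/delete must happen after the try-with-resources block has 
closed the writer, not inside it.
{code}
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AggregatedLogWriteSketch {

  // Problematic shape: the file is renamed while the writer still has it open, so
  // the implicit close() at the end of the block runs against a path that has
  // already been moved (on HDFS this shows up as WARNs and failed close calls).
  static void renameInsideTry(Path tmp, Path dst) throws IOException {
    try (Writer w = Files.newBufferedWriter(tmp)) {
      w.write("aggregated logs");
      Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING); // too early
    }
  }

  // Safer shape: let try-with-resources close the writer first, then rename.
  static void renameAfterClose(Path tmp, Path dst) throws IOException {
    try (Writer w = Files.newBufferedWriter(tmp)) {
      w.write("aggregated logs");
    }
    Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    renameAfterClose(Paths.get("app.tmp"), Paths.get("app.log"));
  }
}
{code}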






[jira] [Created] (YARN-7027) Log aggregation finish time should get logged for trouble shooting.

2017-08-16 Thread Junping Du (JIRA)
Junping Du created YARN-7027:


 Summary: Log aggregation finish time should get logged for trouble 
shooting.
 Key: YARN-7027
 URL: https://issues.apache.org/jira/browse/YARN-7027
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation
Reporter: Junping Du
Assignee: Junping Du


Currently, the RM tracks application log aggregation status in RMApp, and the status change 
is triggered by NM heartbeats carrying log aggregation reports. Each time a node's log 
aggregation status changes from an in-progress state 
(NOT_START, RUNNING, RUNNING_WITH_FAILURE) to a final state (SUCCEEDED, FAILED, 
TIMEOUT), it triggers an aggregation of the overall log aggregation status: 
updateLogAggregationStatus(). This whole process is barely logged, so we cannot 
trace log aggregation problems (delays in log aggregation, etc.) from the RM (or 
NM) logs. We should add more logging here.
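A minimal sketch (hypothetical class and enum, not the actual RMApp code) of the extra logging 
being proposed: when a node's report moves its status to a final state, log the finish time so 
delays become traceable from the RM log alone.
{code}
import java.util.EnumSet;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogAggregationStatusLogger {

  private static final Logger LOG =
      LoggerFactory.getLogger(LogAggregationStatusLogger.class);

  enum Status { NOT_START, RUNNING, RUNNING_WITH_FAILURE, SUCCEEDED, FAILED, TIMEOUT }

  private static final EnumSet<Status> FINAL_STATES =
      EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.TIMEOUT);

  // Hypothetical hook invoked when a node's log aggregation report is applied.
  void onNodeReport(String appId, String nodeId, Status newStatus) {
    if (FINAL_STATES.contains(newStatus)) {
      LOG.info("Log aggregation for {} on node {} reached final state {} at {}",
          appId, nodeId, newStatus, System.currentTimeMillis());
    }
  }
}
{code}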






[jira] [Resolved] (YARN-1038) LocalizationProtocolPBClientImpl RPC failing

2017-08-07 Thread Junping Du (JIRA)

 [ 
https://issues-test.apache.org/jira/browse/YARN-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-1038.
--
Resolution: Cannot Reproduce

I don't think the trunk branch has this problem now; resolving as Cannot Reproduce.

> LocalizationProtocolPBClientImpl RPC failing
> 
>
> Key: YARN-1038
> URL: https://issues-test.apache.org/jira/browse/YARN-1038
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha1
>Reporter: Alejandro Abdelnur
>Priority: Blocker
>
> Trying to run an MR job in trunk is failing with:
> {code}
> 2013-08-06 22:24:21,498 WARN org.apache.hadoop.ipc.Client: interrupted 
> waiting to send rpc request to server
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1279)
>   at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1019)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1372)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1352)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy25.heartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:250)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:977)
> {code}






[jira] [Resolved] (YARN-6891) Can kill other user's applications via RM UI

2017-07-27 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-6891.
--
Resolution: Duplicate

> Can kill other user's applications via RM UI
> 
>
> Key: YARN-6891
> URL: https://issues.apache.org/jira/browse/YARN-6891
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Junping Du
>Priority: Critical
>
> In a secured cluster with an unsecured UI that has the following config
> {code}
> "hadoop.http.authentication.simple.anonymous.allowed" => "true"
> "hadoop.http.authentication.type" => kerberos
> {code}
> The UI can be accessed without any security setting.
> Also, any user can kill other users' applications via the UI.






[jira] [Created] (YARN-6890) If UI is not secured, we allow user to kill other users' job even yarn cluster is secured.

2017-07-27 Thread Junping Du (JIRA)
Junping Du created YARN-6890:


 Summary: If UI is not secured, we allow user to kill other users' 
job even yarn cluster is secured.
 Key: YARN-6890
 URL: https://issues.apache.org/jira/browse/YARN-6890
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Sumana Sathish
Assignee: Junping Du
Priority: Critical


Configuring SPNEGO for web browsers can be a headache, so many production 
clusters choose to configure unsecured UI access even for a secured cluster. 
In this setup, users (logged in as some arbitrary identity) can view other users' jobs, 
which is expected. However, the kill button (added in YARN-3249 and enabled 
by default) shouldn't work in this situation.
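A minimal sketch (standard Servlet API plus hypothetical helpers) of the guard the kill action 
needs: on an unauthenticated UI, getRemoteUser() is null, and the kill should be refused unless 
the caller is the application owner or an admin.
{code}
import javax.servlet.http.HttpServletRequest;

public class KillActionGuard {

  // Hypothetical admin check; a real implementation would consult the admin ACL.
  private boolean isAdmin(String user) {
    return "yarn".equals(user);
  }

  // Returns true only if the kill request should proceed.
  boolean canKill(HttpServletRequest req, String appOwner) {
    String remoteUser = req.getRemoteUser();   // null on an unauthenticated UI
    if (remoteUser == null) {
      return false;   // still fine to render the app page, but no kill allowed
    }
    return remoteUser.equals(appOwner) || isAdmin(remoteUser);
  }
}
{code}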






[jira] [Resolved] (YARN-5007) Remove deprecated constructors of MiniYARNCluster and MiniMRYarnCluster

2017-04-27 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-5007.
--
Resolution: Later
  Assignee: (was: Andras Bokor)

> Remove deprecated constructors of MiniYARNCluster and MiniMRYarnCluster
> ---
>
> Key: YARN-5007
> URL: https://issues.apache.org/jira/browse/YARN-5007
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Andras Bokor
>  Labels: oct16-easy
> Attachments: YARN-5007.01.patch, YARN-5007.02.patch, 
> YARN-5007.03.patch
>
>
> MiniYarnCluster has a deprecated constructor which is called by the other 
> constructors and it causes javac warnings during the build.






[jira] [Created] (YARN-6534) ResourceManager failed due to TimelineClient try to init SSLFactory even https is not enabled

2017-04-26 Thread Junping Du (JIRA)
Junping Du created YARN-6534:


 Summary: ResourceManager failed due to TimelineClient try to init 
SSLFactory even https is not enabled
 Key: YARN-6534
 URL: https://issues.apache.org/jira/browse/YARN-6534
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha3
Reporter: Junping Du
Priority: Blocker


In a non-secured cluster, the RM fails consistently because TimelineServiceV1Publisher 
tries to init TimelineClient with an SSLFactory without any check on whether https is 
actually used (see the sketch after the stack trace below).

{noformat}
2017-04-26 21:09:10,683 FATAL resourcemanager.ResourceManager 
(ResourceManager.java:main(1457)) - Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: java.io.FileNotFoundException: 
/etc/security/clientKeys/all.jks (No such file or directory)
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceInit(TimelineClientImpl.java:131)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.AbstractSystemMetricsPublisher.serviceInit(AbstractSystemMetricsPublisher.java:59)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.serviceInit(TimelineServiceV1Publisher.java:67)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:344)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1453)
Caused by: java.io.FileNotFoundException: /etc/security/clientKeys/all.jks (No 
such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at 
org.apache.hadoop.security.ssl.ReloadingX509TrustManager.loadTrustManager(ReloadingX509TrustManager.java:168)
at 
org.apache.hadoop.security.ssl.ReloadingX509TrustManager.(ReloadingX509TrustManager.java:86)
at 
org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory.init(FileBasedKeyStoresFactory.java:219)
at org.apache.hadoop.security.ssl.SSLFactory.init(SSLFactory.java:179)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector.getSSLFactory(TimelineConnector.java:176)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector.serviceInit(TimelineConnector.java:106)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 11 more
{noformat}
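A minimal sketch of the missing guard (hypothetical helper, not the actual TimelineConnector 
code): only build an SSLFactory when https is configured, so the truststore files referenced by 
ssl-client.xml are never touched on a plain-http cluster. YarnConfiguration.useHttps and the 
SSLFactory constructor are used here on the assumption that their current signatures match.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.ssl.SSLFactory;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineSslGuard {

  // Sketch: returns null (no SSL setup at all) unless the YARN web apps use https.
  static SSLFactory maybeCreateSslFactory(Configuration conf) {
    if (YarnConfiguration.useHttps(conf)) {
      return new SSLFactory(SSLFactory.Mode.CLIENT, conf);
    }
    return null;
  }
}
{code}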
CC [~rohithsharma] and [~gtCarrera9]






[jira] [Created] (YARN-6336) Jenkins report YARN new UI build failure

2017-03-14 Thread Junping Du (JIRA)
Junping Du created YARN-6336:


 Summary: Jenkins report YARN new UI build failure 
 Key: YARN-6336
 URL: https://issues.apache.org/jira/browse/YARN-6336
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Priority: Blocker


In the Jenkins report of YARN-6313 
(https://builds.apache.org/job/PreCommit-YARN-Build/15260/artifact/patchprocess/patch-compile-hadoop-yarn-project_hadoop-yarn.txt),
 we found the following build failure caused by the new YARN UI:
{noformat}
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/target/src/main/webapp/node_modules/ember-cli-htmlbars/node_modules/broccoli-persistent-filter/node_modules/async-disk-cache/node_modules/username/index.js:2
const os = require('os');
^
Use of const in strict mode.
SyntaxError: Use of const in strict mode.
at Module._compile (module.js:439:25)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.require (module.js:364:17)
at require (module.js:380:17)
at Object. 
(/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/target/src/main/webapp/node_modules/ember-cli-htmlbars/node_modules/broccoli-persistent-filter/node_modules/async-disk-cache/index.js:24:16)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
DEPRECATION: Node v0.10.25 is no longer supported by Ember CLI. Please update 
to a more recent version of Node
undefined
version: 1.13.15
Could not find watchman, falling back to NodeWatcher for file system events.
Visit http://www.ember-cli.com/user-guide/#watchman for more info.
Building[INFO] 

{noformat}







[jira] [Created] (YARN-6294) ATS client should better handle Socket closed case

2017-03-06 Thread Junping Du (JIRA)
Junping Du created YARN-6294:


 Summary: ATS client should better handle Socket closed case
 Key: YARN-6294
 URL: https://issues.apache.org/jira/browse/YARN-6294
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineclient
Reporter: Sumana Sathish
Assignee: Li Lu


Exception stack:
{noformat}
17/02/06 07:11:30 INFO distributedshell.ApplicationMaster: Container completed 
successfully., containerId=container_1486362713048_0037_01_02
17/02/06 07:11:30 ERROR distributedshell.ApplicationMaster: Error in 
RMCallbackHandler: 
com.sun.jersey.api.client.ClientHandlerException: java.net.SocketException: 
Socket closed
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:236)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:185)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:248)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at 
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPostingObject(TimelineWriter.java:154)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:115)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:112)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPosting(TimelineWriter.java:112)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineWriter.putEntities(TimelineWriter.java:92)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:346)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishContainerEndEvent(ApplicationMaster.java:1145)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.access$400(ApplicationMaster.java:169)
at 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster$RMCallbackHandler.onContainersCompleted(ApplicationMaster.java:779)
at 
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:296)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:204)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at 
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240)
at 
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
... 20 more
Exception in thread "AMRM Callback Handler Thread" 
{noformat}






[jira] [Resolved] (YARN-6079) simple spelling errors in yarn test code

2017-01-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-6079.
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0-alpha2
   2.9.0

> simple spelling errors in yarn test code
> 
>
> Key: YARN-6079
> URL: https://issues.apache.org/jira/browse/YARN-6079
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Grant Sohn
>Assignee: vijay
>Priority: Trivial
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-6079.001.patch
>
>
> charactor -> character
> hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/TestCommonNodeLabelsManager.java:
> Assert.assertTrue("invalid label charactor should not add to repo", 
> caught);
> expteced -> expected
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java:
>   Assert.fail("Exception is not expteced.");
> Exepected -> Expected
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java:
> "Exepected AbsoluteUsedCapacity > 0.95, got: "
> expteced -> expected
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java:
>   Assert.fail("Exception is not expteced.");
> macthing -> matching
> hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:
> assertEquals("Expected no macthing requests.", matches.size(), 0);
> propogated -> propagated
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java:
> Assert.assertTrue("Node script time out message not propogated",
> protential -> potential
> hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/BasePBImplRecordsTest.java:
> LOG.info(String.format("Exclude protential property: %s\n", 
> gsp.propertyName));
> recevied -> received
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java:
> throw new Exception("Unexpected resource recevied.");
> shouldnt -> shouldn't
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServiceAppsNodelabel.java:
>   fail("resourceInfo object shouldnt be available for finished apps");
> Transistion -> Transition
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java:
>   Assert.fail("Transistion to Active should have failed for 
> refreshAll()");
> Unhelathy -> Unhealthy
> hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java:
> Assert.assertEquals("Unhelathy Nodes", initialUnHealthy,






[jira] [Created] (YARN-6071) Fix incompatible API change on AM-RM protocol due to YARN-3866 (trunk only)

2017-01-07 Thread Junping Du (JIRA)
Junping Du created YARN-6071:


 Summary: Fix incompatible API change on AM-RM protocol due to 
YARN-3866 (trunk only)
 Key: YARN-6071
 URL: https://issues.apache.org/jira/browse/YARN-6071
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Wangda Tan
Priority: Blocker


In YARN-3866, we have an addendum patch that fixes the incompatible API change on branch-2 
and branch-2.8. For trunk, we need a similar fix.






[jira] [Created] (YARN-6068) Log aggregation get failed when NM restart even with recovery

2017-01-06 Thread Junping Du (JIRA)
Junping Du created YARN-6068:


 Summary: Log aggregation get failed when NM restart even with 
recovery
 Key: YARN-6068
 URL: https://issues.apache.org/jira/browse/YARN-6068
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


The exception log is as follows:
{noformat}
2017-01-05 19:16:36,352 INFO  logaggregation.AppLogAggregatorImpl 
(AppLogAggregatorImpl.java:abortLogAggregation(527)) - Aborting log aggregation 
for application_1483640789847_0001
2017-01-05 19:16:36,352 WARN  logaggregation.AppLogAggregatorImpl 
(AppLogAggregatorImpl.java:run(399)) - Aggregation did not complete for 
application application_1483640789847_0001
2017-01-05 19:16:36,353 WARN  application.ApplicationImpl 
(ApplicationImpl.java:handle(461)) - Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
APPLICATION_LOG_HANDLING_FAILED at RUNNING
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:459)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:64)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1084)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1076)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
at java.lang.Thread.run(Thread.java:745)
2017-01-05 19:16:36,355 INFO  application.ApplicationImpl 
(ApplicationImpl.java:handle(464)) - Application application_1483640789847_0001 
transitioned from RUNNING to null
{noformat}






[jira] [Created] (YARN-5718) TimelineClient (and other places in YARN) shouldn't over-write HDFS client retry settings which could cause unexpected behavior

2016-10-10 Thread Junping Du (JIRA)
Junping Du created YARN-5718:


 Summary: TimelineClient (and other places in YARN) shouldn't 
over-write HDFS client retry settings which could cause unexpected behavior
 Key: YARN-5718
 URL: https://issues.apache.org/jira/browse/YARN-5718
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineclient, resourcemanager
Reporter: Junping Du
Assignee: Junping Du


In one HA cluster, after the NN failed over, we noticed that jobs were failing because 
TimelineClient could not retry the connection to the proper NN. This is because we 
overwrite the HDFS client settings, hard-coding the retry policy to enabled, which 
conflicts with the NN failover case: the HDFS client should fail fast so it can retry 
against the other NN.
We shouldn't assume any retry policy for the HDFS client anywhere in YARN. 
This should stay consistent with the HDFS settings, which use different retry policies 
in different deployment cases. Thus, we should clean up these hard-coded settings 
in YARN, including: FileSystemTimelineWriter, FileSystemRMStateStore and 
FileSystemNodeLabelsStore.
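A minimal sketch of the anti-pattern and the proposed cleanup; the config key below is the HDFS 
client retry switch as commonly used, but treat the exact key and usage as illustrative rather than 
a quote of the affected classes:
{code}
import org.apache.hadoop.conf.Configuration;

public class HdfsClientRetrySettingsSketch {

  // Anti-pattern: force-enabling the HDFS client retry policy from YARN code.
  // In an HA deployment this keeps the client retrying against the old NN
  // instead of failing fast and failing over to the new active NN.
  static Configuration hardCodedRetry(Configuration conf) {
    conf.setBoolean("dfs.client.retry.policy.enabled", true);   // overrides site config
    return conf;
  }

  // Proposed cleanup: do not touch the retry settings at all, so whatever the
  // HDFS deployment configured (HA or not) applies to FileSystemTimelineWriter,
  // FileSystemRMStateStore and FileSystemNodeLabelsStore alike.
  static Configuration respectSiteConfig(Configuration conf) {
    return conf;
  }
}
{code}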






[jira] [Created] (YARN-5536) Multiple format support (JSON, etc.) for exclude node file in NM graceful decommission with timeout

2016-08-18 Thread Junping Du (JIRA)
Junping Du created YARN-5536:


 Summary: Multiple format support (JSON, etc.) for exclude node 
file in NM graceful decommission with timeout
 Key: YARN-5536
 URL: https://issues.apache.org/jira/browse/YARN-5536
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Priority: Blocker


Per the discussion in YARN-4676, we agree that formats other than XML should be 
supported for decommissioning nodes with timeout values.






[jira] [Created] (YARN-5475) Test failed for TestAggregatedLogFormat on trunk

2016-08-05 Thread Junping Du (JIRA)
Junping Du created YARN-5475:


 Summary: Test failed for TestAggregatedLogFormat on trunk
 Key: YARN-5475
 URL: https://issues.apache.org/jira/browse/YARN-5475
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du


Tests run: 3, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 1.114 sec <<< 
FAILURE! - in org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat
testReadAcontainerLogs1(org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat)
  Time elapsed: 0.012 sec  <<< ERROR!
java.io.IOException: Unable to create directory : 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/TestAggregatedLogFormat/testReadAcontainerLogs1/srcFiles/application_1_0001/container_1_0001_01_01/subDir
at 
org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat.getOutputStreamWriter(TestAggregatedLogFormat.java:403)
at 
org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat.writeSrcFile(TestAggregatedLogFormat.java:382)
at 
org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat.testReadAcontainerLog(TestAggregatedLogFormat.java:211)
at 
org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat.testReadAcontainerLogs1(TestAggregatedLogFormat.java:185)






[jira] [Created] (YARN-5416) TestRMRestart#testRMRestartWaitForPreviousAMToFinish failed intermittently due to not wait SchedulerApplicationAttempt to be stopped

2016-07-21 Thread Junping Du (JIRA)
Junping Du created YARN-5416:


 Summary: TestRMRestart#testRMRestartWaitForPreviousAMToFinish 
failed intermittently due to not wait SchedulerApplicationAttempt to be stopped
 Key: YARN-5416
 URL: https://issues.apache.org/jira/browse/YARN-5416
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du
Priority: Minor


The test failure stack is:
Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
Tests run: 54, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 385.338 sec 
<<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
testRMRestartWaitForPreviousAMToFinish[0](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
  Time elapsed: 43.134 sec  <<< FAILURE!
java.lang.AssertionError: AppAttempt state is not correct (timedout) 
expected: but was:
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:86)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:594)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.launchAM(TestRMRestart.java:1008)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:530)

This is due to the same issue that was partially fixed in YARN-4968.






[jira] [Created] (YARN-5311) Document graceful decommission CLI and usage

2016-07-05 Thread Junping Du (JIRA)
Junping Du created YARN-5311:


 Summary: Document graceful decommission CLI and usage
 Key: YARN-5311
 URL: https://issues.apache.org/jira/browse/YARN-5311
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Junping Du
Assignee: Junping Du









[jira] [Resolved] (YARN-5217) Close FileInputStream in NMWebServices#getLogs in branch-2.8

2016-06-09 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-5217.
--
Resolution: Duplicate

> Close FileInputStream in NMWebServices#getLogs in branch-2.8
> 
>
> Key: YARN-5217
> URL: https://issues.apache.org/jira/browse/YARN-5217
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Critical
> Attachments: YARN-5217.branch-2.8.patch
>
>
> In https://issues.apache.org/jira/browse/YARN-5199, we close LogReader in in 
> AHSWebServices#getStreamingOutput and FileInputStream in 
> NMWebServices#getLogs. We should do the same thing in branch-2.8.






[jira] [Created] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater

2016-06-08 Thread Junping Du (JIRA)
Junping Du created YARN-5214:


 Summary: Pending on synchronized method 
DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
 Key: YARN-5214
 URL: https://issues.apache.org/jira/browse/YARN-5214
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped; after a 
while the node was marked LOST by the RM. From the log, the NM daemon was still running, 
but jstack shows the NM's NodeStatusUpdater thread is blocked:
1. The Node Status Updater thread is blocked on 0x8065eae8:
{noformat}
"Node Status Updater" #191 prio=5 os_prio=0 tid=0x7f0354194000 nid=0x26fa 
waiting for monitor entry [0x7f035945a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
- waiting to lock <0x8065eae8> (a 
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
at 
org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
at java.lang.Thread.run(Thread.java:745)
{noformat}

2. The actual holder of this lock is DiskHealthMonitor:
{noformat}
"DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x7f0397393000 
nid=0x26bd runnable [0x7f035e511000]
   java.lang.Thread.State: RUNNABLE
at java.io.UnixFileSystem.createDirectory(Native Method)
at java.io.File.mkdir(File.java:1316)
at 
org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
at 
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
at 
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
at 
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
- locked <0x8065eae8> (a 
org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
at 
org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
at 
org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
at 
org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
{noformat}

This disk operation can take longer than expected, especially under high IO throughput, 
and we should use finer-grained locking for the related operations here.
The same issue was raised and fixed on the HDFS side in HDFS-7489, and we should 
probably apply a similar fix here.
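A minimal sketch (hypothetical class, not the actual DirectoryCollection code) of the finer-grained 
approach: run the slow disk probes without holding any lock, then publish the result as a snapshot, 
so readers such as the heartbeat path never wait on disk IO.
{code}
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DirHealthSketch {

  private final List<String> dirs;
  private volatile List<String> failedDirs = Collections.emptyList();

  DirHealthSketch(List<String> dirs) {
    this.dirs = dirs;
  }

  // Monitor thread: the expensive disk probes run with no lock held.
  void checkDirs() {
    List<String> newlyFailed = new ArrayList<>();
    for (String dir : dirs) {
      if (!probe(dir)) {                       // slow IO, outside any lock
        newlyFailed.add(dir);
      }
    }
    // Publish an immutable snapshot; the volatile write is the whole critical section.
    failedDirs = Collections.unmodifiableList(newlyFailed);
  }

  // Heartbeat path: returns the last snapshot without blocking on disk IO.
  List<String> getFailedDirs() {
    return failedDirs;
  }

  private boolean probe(String dir) {
    File f = new File(dir);
    return f.isDirectory() && f.canWrite();
  }
}
{code}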






[jira] [Resolved] (YARN-4955) Add retry for SocketTimeoutException in TimelineClient

2016-04-28 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-4955.
--
Resolution: Fixed

> Add retry for SocketTimeoutException in TimelineClient
> --
>
> Key: YARN-4955
> URL: https://issues.apache.org/jira/browse/YARN-4955
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4955.1.patch, YARN-4955.2.patch, YARN-4955.3.patch, 
> YARN-4955.4-1.patch, YARN-4955.4.patch, YARN-4955.5.patch, YARN-4955.6.patch
>
>
> We saw this exception several times when we tried to getDelegationToken from 
> ATS.
> java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> java.net.SocketTimeoutException: Read timed out
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:569)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:234)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:582)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:479)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
>   at 
> org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:291)
>   at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:290)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:240)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>   at 
> org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
>   at 
> org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
>   at java.lang.Thread.run(Thread.java:745)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
> Caused by: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> java.net.SocketTimeoutException: Read timed out
>   at 
> org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:332)
>   at 
> org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:205)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:128)
>   at 
> org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:215)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:285)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.getDelegationToken(DelegationTokenAuthenticator.java:166)
>   at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.getDelegationToken(DelegationTokenAuthenticatedURL.java:371)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:475)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:467)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> 

[jira] [Created] (YARN-4984) LogAggregationService shouldn't swallow exception in handling createAppDir() which cause thread leak.

2016-04-21 Thread Junping Du (JIRA)
Junping Du created YARN-4984:


 Summary: LogAggregationService shouldn't swallow exception in 
handling createAppDir() which cause thread leak.
 Key: YARN-4984
 URL: https://issues.apache.org/jira/browse/YARN-4984
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.2
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


Due to YARN-4325, many stale applications still exist in the NM state store and 
get recovered after an NM restart. The app initialization fails because the token is 
invalid, but the exception is swallowed and an aggregator thread is still created for the 
invalid app (see the sketch after the stack trace below).

Exception is:
{noformat}
158 2016-04-19 23:38:33,039 ERROR logaggregation.LogAggregationService 
(LogAggregationService.java:run(300)) - Failed to setup application log 
directory for application_1448060878692_11842
159 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (HDFS_DELEGATION_TOKEN token 1380589 for hdfswrite) can't be fo
und in cache
160 at org.apache.hadoop.ipc.Client.call(Client.java:1427)
161 at org.apache.hadoop.ipc.Client.call(Client.java:1358)
162 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
163 at com.sun.proxy.$Proxy13.getFileInfo(Unknown Source)
164 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
165 at sun.reflect.GeneratedMethodAccessor76.invoke(Unknown Source)
166 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
167 at java.lang.reflect.Method.invoke(Method.java:606)
168 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
169 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
170 at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
171 at 
org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
172 at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1315)
173 at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1311)
174 at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
175 at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1311)
176 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:248)
177 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:67)
178 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
179 at java.security.AccessController.doPrivileged(Native Method)
180 at javax.security.auth.Subject.doAs(Subject.java:415)
181 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
182 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:261)
183 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:367)
184 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
185 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:447)
186 at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)

{noformat}
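A minimal sketch (hypothetical names, not the actual LogAggregationService code) of the shape of 
the fix: let the directory-setup failure propagate so that no aggregator thread is created for an 
application that can never aggregate.
{code}
import java.io.IOException;

public class AppLogDirSetupSketch {

  // Problematic shape: the failure is logged and swallowed, and the caller still
  // creates an aggregator thread for the invalid app, leaking one thread per stale app.
  void initAppSwallowing(String appId) {
    try {
      createAppDir(appId);
    } catch (IOException e) {
      System.err.println("Failed to setup log dir for " + appId + ": " + e);
    }
    startAggregatorThread(appId);
  }

  // Preferred shape: surface the failure, so the aggregator is never started.
  void initAppFailingFast(String appId) throws IOException {
    createAppDir(appId);            // InvalidToken/IOException propagates to the caller
    startAggregatorThread(appId);
  }

  private void createAppDir(String appId) throws IOException {
    // stands in for the remote-FS calls shown in the stack trace above
  }

  private void startAggregatorThread(String appId) {
    // stands in for spawning the per-application aggregator
  }
}
{code}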





[jira] [Created] (YARN-4932) (Umbrella) YARN test failures on Windows

2016-04-08 Thread Junping Du (JIRA)
Junping Du created YARN-4932:


 Summary: (Umbrella) YARN test failures on Windows
 Key: YARN-4932
 URL: https://issues.apache.org/jira/browse/YARN-4932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Junping Du


We found several test failures related to Windows. This is an umbrella JIRA to 
track them.





[jira] [Created] (YARN-4893) Fix some intermittent test failures in TestRMAdminService2

2016-03-29 Thread Junping Du (JIRA)
Junping Du created YARN-4893:


 Summary: Fix some intermittent test failures in TestRMAdminService2
 Key: YARN-4893
 URL: https://issues.apache.org/jira/browse/YARN-4893
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du


As discussed in YARN-998, we need to add rm.drainEvents() after 
rm.registerNode(), or some of the tests can fail intermittently. Also, we could 
consider adding rm.drainEvents() within rm.registerNode(), which would be more 
convenient.





[jira] [Created] (YARN-4863) AHS Security login should be in serviceInit() instead of serviceStart()

2016-03-24 Thread Junping Du (JIRA)
Junping Du created YARN-4863:


 Summary: AHS Security login should be in serviceInit() instead of 
serviceStart()
 Key: YARN-4863
 URL: https://issues.apache.org/jira/browse/YARN-4863
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du


Like in other daemons, doSecureLogin() should be called in serviceInit() rather than 
serviceStart(); otherwise, some FS operations can run into problems while the composite 
services are being started.
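A minimal sketch of the suggested ordering for a CompositeService-based daemon (the 
keytab/principal config keys are placeholders): log in during serviceInit(), before any child 
service that touches the FS gets initialized or started.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.service.CompositeService;

public class HistoryServerLoginSketch extends CompositeService {

  public HistoryServerLoginSketch() {
    super(HistoryServerLoginSketch.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Secure login first, so FS operations performed while composite child
    // services initialize and start already run as the service principal.
    SecurityUtil.login(conf, "sketch.keytab.file", "sketch.kerberos.principal");
    super.serviceInit(conf);
  }
}
{code}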





[jira] [Created] (YARN-4832) NM side resource value should get updated if change applied in RM side

2016-03-19 Thread Junping Du (JIRA)
Junping Du created YARN-4832:


 Summary: NM side resource value should get updated if change 
applied in RM side
 Key: YARN-4832
 URL: https://issues.apache.org/jira/browse/YARN-4832
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


Currently, if we run the CLI to update node resources (for a single node or multiple nodes) 
on the RM side, the NM does not receive any notification. This doesn't affect resource 
scheduling, but it makes the resource usage metrics reported by the NM look a bit odd. 
We should sync the new resource values between the RM and the NM.





[jira] [Created] (YARN-4791) Per user blacklist node for user specific error for container launch failure.

2016-03-11 Thread Junping Du (JIRA)
Junping Du created YARN-4791:


 Summary: Per user blacklist node for user specific error for 
container launch failure.
 Key: YARN-4791
 URL: https://issues.apache.org/jira/browse/YARN-4791
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Reporter: Junping Du
Assignee: Junping Du


There are some user-specific errors that cause container launch failures. For example, 
when LinuxContainerExecutor is enabled but a node does not have the corresponding user, 
the container launch fails with the following information:
{noformat}
2016-02-14 15:37:03,111 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1434045496283_0036_02 State change from LAUNCHED to FAILED 
2016-02-14 15:37:03,111 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
application_1434045496283_0036 failed 2 times due to AM Container for 
appattempt_1434045496283_0036_02 exited with exitCode: -1000 due to: 
Application application_1434045496283_0036 initialization failed (exitCode=255) 
with output: User jdu not found 
{noformat}
Obviously, this node is not suitable for launching containers for this user's 
other applications. We need a per-user blacklist tracking mechanism rather than 
the current per-application one.
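A minimal sketch (plain data structure, not an actual RM component) of what per-user tracking 
could look like: user-specific launch failures are charged to the (user, node) pair, and the node 
is skipped for that user, regardless of the application, once a threshold is crossed.
{code}
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerUserBlacklistSketch {

  private static final int FAILURE_THRESHOLD = 3;

  // user -> (node -> count of user-specific launch failures on that node)
  private final Map<String, Map<String, Integer>> failures = new ConcurrentHashMap<>();

  // Record a user-specific container launch failure (e.g. "User jdu not found").
  public void recordFailure(String user, String node) {
    failures.computeIfAbsent(user, u -> new ConcurrentHashMap<>())
        .merge(node, 1, Integer::sum);
  }

  // The scheduler would consult this when placing containers for the user,
  // independent of which of the user's applications is being scheduled.
  public boolean isBlacklisted(String user, String node) {
    return failures.getOrDefault(user, Collections.emptyMap())
        .getOrDefault(node, 0) >= FAILURE_THRESHOLD;
  }
}
{code}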





[jira] [Created] (YARN-4790) Per user blacklist node for user specific error for container launch failure.

2016-03-11 Thread Junping Du (JIRA)
Junping Du created YARN-4790:


 Summary: Per user blacklist node for user specific error for 
container launch failure.
 Key: YARN-4790
 URL: https://issues.apache.org/jira/browse/YARN-4790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Reporter: Junping Du
Assignee: Junping Du


There are some user-specific errors that cause container launch failures. For example, 
when LinuxContainerExecutor is enabled but a node does not have the corresponding user, 
the container launch fails with the following information:
{noformat}
2016-02-14 15:37:03,111 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1434045496283_0036_02 State change from LAUNCHED to FAILED 
2016-02-14 15:37:03,111 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
application_1434045496283_0036 failed 2 times due to AM Container for 
appattempt_1434045496283_0036_02 exited with exitCode: -1000 due to: 
Application application_1434045496283_0036 initialization failed (exitCode=255) 
with output: User jdu not found 
{noformat}
Obviously, this node is not suitable for launching containers for this user's 
other applications. We need a per-user blacklist tracking mechanism rather than 
the current per-application one.





[jira] [Created] (YARN-4638) Node whitelist support for AM launching

2016-01-25 Thread Junping Du (JIRA)
Junping Du created YARN-4638:


 Summary: Node whitelist support for AM launching 
 Key: YARN-4638
 URL: https://issues.apache.org/jira/browse/YARN-4638
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du








[jira] [Created] (YARN-4636) Make blacklist tracking policy pluggable for more extensions.

2016-01-25 Thread Junping Du (JIRA)
Junping Du created YARN-4636:


 Summary: Make blacklist tracking policy pluggable for more 
extensions.
 Key: YARN-4636
 URL: https://issues.apache.org/jira/browse/YARN-4636
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du








[jira] [Created] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-01-25 Thread Junping Du (JIRA)
Junping Du created YARN-4635:


 Summary: Add global blacklist tracking for AM container failure.
 Key: YARN-4635
 URL: https://issues.apache.org/jira/browse/YARN-4635
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4600) More general services provided to application/container by YARN

2016-01-18 Thread Junping Du (JIRA)
Junping Du created YARN-4600:


 Summary: More general services provided to application/container 
by YARN
 Key: YARN-4600
 URL: https://issues.apache.org/jira/browse/YARN-4600
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: applications, resourcemanager
Reporter: Junping Du
Priority: Critical


More general services, like HA and message/notification, should be provided by 
YARN to containers to better support a wide variety of applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4601) HA as a general YARN service to highlighted container by application.

2016-01-18 Thread Junping Du (JIRA)
Junping Du created YARN-4601:


 Summary: HA as a general YARN service to highlighted container by 
application.
 Key: YARN-4601
 URL: https://issues.apache.org/jira/browse/YARN-4601
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: applications
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


For LRS (long running services) on YARN, getting rid of the single point of 
failure for a critical container may not be necessary; some applications would 
like to build their own HA architecture. However, it would be ideal to provide 
some fundamental support for HA services in YARN, like: launching containers 
marked as active/standby, monitoring and triggering failover, providing an 
endpoint for sharing information between active and standby containers, etc.
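Purely as an illustration of the kind of primitives described above (this is a 
hypothetical interface, not an existing YARN API):
{code}
/** Hypothetical sketch only -- not an existing YARN interface. */
public interface ContainerHaSupport {

  enum HaRole { ACTIVE, STANDBY }

  /** Launch-time hint telling YARN which containers form an HA pair. */
  void markContainer(String containerId, HaRole role);

  /** Callback fired when the active container fails, so the standby
   *  can be promoted by the application. */
  void onFailover(String failedContainerId, String promotedContainerId);

  /** Shared endpoint the active/standby pair can use to exchange state. */
  String getSharedStateEndpoint(String applicationId);
}
{code}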



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4602) Message/notification service between containers

2016-01-18 Thread Junping Du (JIRA)
Junping Du created YARN-4602:


 Summary: Message/notification service between containers
 Key: YARN-4602
 URL: https://issues.apache.org/jira/browse/YARN-4602
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du


Currently, most communication among YARN daemons, services and applications goes 
through RPC. In almost all cases, the logic running inside a container is an RPC 
client rather than a server, because it gets launched in flight. The only special 
case is the AM container: because it gets launched earlier than any other 
container, it can act as an RPC server and tell newly launched containers the 
server address through application logic (like the MR AM does).
The side effects are: 
1. When the AM container fails, the new AM attempt gets launched with a new 
address/port, so previous RPC connections are broken.
2. Applications' requirements vary, and there can be other dependencies between 
containers (not only on the AM), so one container failing over can affect other 
containers' running logic.
It would be better to have some message/notification mechanism between containers 
to handle the above cases.
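Purely as an illustration of the kind of mechanism this would be (a hypothetical 
interface, not an existing YARN API):
{code}
import java.util.function.Consumer;

/** Hypothetical sketch only -- not an existing YARN API. */
public interface ContainerMessagingService {

  /** A container registers under a stable logical name, so peers do not
   *  need to track its physical address across restarts or failover. */
  void register(String logicalName, String containerId);

  /** Deliver a small payload to whichever container currently backs the
   *  logical name (e.g. the current AM attempt). */
  void send(String targetLogicalName, byte[] payload);

  /** Subscribe to lifecycle events of a peer (restarted, failed over, ...). */
  void subscribe(String targetLogicalName, Consumer<String> onEvent);
}
{code}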




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4576) Extend blacklist mechanism to protect AM failed multiple times on failure nodes

2016-01-11 Thread Junping Du (JIRA)
Junping Du created YARN-4576:


 Summary: Extend blacklist mechanism to protect AM failed multiple 
times on failure nodes
 Key: YARN-4576
 URL: https://issues.apache.org/jira/browse/YARN-4576
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


The current YARN blacklist mechanism tracks bad nodes per AM: if an AM's attempts 
to launch containers on a specific node fail several times, the AM blacklists 
that node in future resource requests. This mechanism works fine for normal 
containers. However, from our observation of cluster behavior: if a problematic 
node fails to launch the AM, the RM can pick the same problematic node to launch 
the next AM attempts again and again, causing application failure when other 
functional nodes are busy. In the normal case, the customized health checker 
script cannot be so sensitive as to mark a node unhealthy when one or two 
container launches fail. However, on the RM side, we can blacklist these nodes 
for AM launches for a certain time.
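A minimal sketch of the RM-side idea (illustrative only; threshold and expiry 
values are made up, and the real scheduler integration is omitted):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative AM-launch blacklist with a time-bounded window. */
public class AmLaunchBlacklist {
  private static final int FAILURE_THRESHOLD = 2;            // made-up value
  private static final long BLACKLIST_MS = 60 * 60 * 1000L;  // made-up value

  private static class Entry { int failures; long blacklistedAt; }

  private final Map<String, Entry> nodes = new ConcurrentHashMap<>();

  public synchronized void recordAmLaunchFailure(String nodeId, long now) {
    Entry e = nodes.computeIfAbsent(nodeId, n -> new Entry());
    if (++e.failures >= FAILURE_THRESHOLD && e.blacklistedAt == 0) {
      e.blacklistedAt = now;
    }
  }

  /** Skip this node for AM placement while the blacklist window is open. */
  public synchronized boolean isBlacklistedForAm(String nodeId, long now) {
    Entry e = nodes.get(nodeId);
    if (e == null || e.blacklistedAt == 0) {
      return false;
    }
    if (now - e.blacklistedAt > BLACKLIST_MS) {
      nodes.remove(nodeId); // window expired, give the node another chance
      return false;
    }
    return true;
  }
}
{code}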



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4552) NM ResourceLocalizationService should check and initialize local filecache dir (and log dir) even if NM recover is enabled.

2016-01-06 Thread Junping Du (JIRA)
Junping Du created YARN-4552:


 Summary: NM ResourceLocalizationService should check and 
initialize local filecache dir (and log dir) even if NM recover is enabled.
 Key: YARN-4552
 URL: https://issues.apache.org/jira/browse/YARN-4552
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


In some cases, users clean up the localized file cache for 
debugging/troubleshooting purposes while the NM is down. However, after bringing 
the NM back up (with recovery enabled), job submission can fail with an exception 
like the one below:
{noformat}
Diagnostics: java.io.FileNotFoundException: File /disk/12/yarn/local/filecache 
does not exist.
{noformat}
This is because we only create the filecache dir when recovery is not enabled, 
during ResourceLocalizationService initialization/start.
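A minimal sketch of the kind of check being suggested (simplified; the real 
ResourceLocalizationService handles multiple local dirs, permissions and the 
deletion service):
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

/** Simplified sketch: ensure the local filecache dir exists on startup,
 *  regardless of whether NM recovery is enabled. */
public class LocalDirInitializer {
  public static void ensureFileCacheDir(FileSystem localFs, Path localDir)
      throws IOException {
    Path filecache = new Path(localDir, "filecache");
    if (!localFs.exists(filecache)) {
      // Recreate the dir if an admin wiped it while the NM was down.
      localFs.mkdirs(filecache, new FsPermission((short) 0755));
    }
  }
}
{code}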



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4542) Cleanup AHS code and configuration

2016-01-04 Thread Junping Du (JIRA)
Junping Du created YARN-4542:


 Summary: Cleanup AHS code and configuration
 Key: YARN-4542
 URL: https://issues.apache.org/jira/browse/YARN-4542
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du


ATS (which has gone through several versions so far) is designed to replace AHS. 
We should consider cleaning up AHS-related configuration and code later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4527) Possible thread leak if TimelineClient.start() get called multiple times.

2015-12-30 Thread Junping Du (JIRA)
Junping Du created YARN-4527:


 Summary: Possible thread leak if TimelineClient.start() get called 
multiple times.
 Key: YARN-4527
 URL: https://issues.apache.org/jira/browse/YARN-4527
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.8.0
Reporter: Junping Du
Assignee: Junping Du


Since YARN-4234, TimelineClient's start and stop create a TimelineWriter whose 
type depends on the configuration. serviceStart() creates a new TimelineWriter 
instance every time, which spawns several timer threads. If start() gets called 
multiple times on one TimelineClient for some reason (an application bug, or 
intentionally in some cases), the spawned timer threads leak.
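A minimal sketch of the kind of guard that avoids the leak (illustrative only, 
not the actual TimelineClient code):
{code}
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative guard: make repeated start() calls a no-op so a second
 *  call cannot create another writer and leak its timer threads. */
public class IdempotentStartGuard {
  private final AtomicBoolean started = new AtomicBoolean(false);

  public void start(Runnable createWriterAndTimers) {
    if (started.compareAndSet(false, true)) {
      createWriterAndTimers.run(); // only the first caller gets here
    }
    // subsequent calls return without creating another writer
  }
}
{code}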



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4501) Document new put APIs in TimelineClient for ATS 1.5

2015-12-23 Thread Junping Du (JIRA)
Junping Du created YARN-4501:


 Summary: Document new put APIs in TimelineClient for ATS 1.5
 Key: YARN-4501
 URL: https://issues.apache.org/jira/browse/YARN-4501
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Junping Du
Assignee: Xuan Gong


In YARN-4234, we are adding new put APIs to TimelineClient; we should document 
them properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4466) ResourceManager should tolerate unexpected exceptions to happen in non-critical subsystem/services like SystemMetricsPublisher

2015-12-16 Thread Junping Du (JIRA)
Junping Du created YARN-4466:


 Summary: ResourceManager should tolerate unexpected exceptions to 
happen in non-critical subsystem/services like SystemMetricsPublisher
 Key: YARN-4466
 URL: https://issues.apache.org/jira/browse/YARN-4466
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Junping Du


From my comment in YARN-4452 
(https://issues.apache.org/jira/browse/YARN-4452?focusedCommentId=15059805&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15059805), 
we should make the RM more robust by ignoring (but logging) unexpected exceptions 
in its non-critical subsystems/services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4429) RetryPolicies (other than FailoverOnNetworkExceptionRetry) should put on retry failed reason or the log from RMProxy's retry could be very misleading.

2015-12-07 Thread Junping Du (JIRA)
Junping Du created YARN-4429:


 Summary: RetryPolicies (other than 
FailoverOnNetworkExceptionRetry) should put on retry failed reason or the log 
from RMProxy's retry could be very misleading.
 Key: YARN-4429
 URL: https://issues.apache.org/jira/browse/YARN-4429
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0, 2.6.0
Reporter: Junping Du
Assignee: Junping Du


While debugging an NM retrying its connection to the RM (non-HA), the NM log 
during RM downtime is very misleading:
{noformat}
2015-12-07 11:37:14,098 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:15,099 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:16,101 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:17,103 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:18,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:19,107 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:20,109 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:21,112 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:22,113 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:23,115 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:54,120 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:55,121 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:56,123 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:57,125 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:58,126 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:59,128 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:38:00,130 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
{noformat}
It actually only logs the client-side retries on network connection failures, but 
does not include any info from RetryInvocationHandler, where the real retry 
policy works. From the code below in RetryInvocationHandler.java, even when the 
retries end, we don't emit warn messages stating how much time or how many 
attempts were spent in the retry logic, which makes it harder to debug.

{code}
if (failAction != null) {

[jira] [Created] (YARN-4431) Not necessary to do unRegisterNM() if NM get stop due to failed to connect to RM

2015-12-07 Thread Junping Du (JIRA)
Junping Du created YARN-4431:


 Summary: Not necessary to do unRegisterNM() if NM get stop due to 
failed to connect to RM
 Key: YARN-4431
 URL: https://issues.apache.org/jira/browse/YARN-4431
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du


{noformat}
2015-12-07 12:16:57,873 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 12:16:58,874 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 12:16:58,876 WARN 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unregistration 
of the Node 10.200.10.53:25454 failed.
java.net.ConnectException: Call From jduMBP.local/10.200.10.53 to 0.0.0.0:8031 
failed on connection exception: java.net.ConnectException: Connection refused; 
For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.call(Client.java:1452)
at org.apache.hadoop.ipc.Client.call(Client.java:1385)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy74.unRegisterNodeManager(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.unRegisterNodeManager(ResourceTrackerPBClientImpl.java:98)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:255)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy75.unRegisterNodeManager(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:267)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStop(NodeStatusUpdaterImpl.java:245)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at 
org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at 
org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
at 
org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:377)
{noformat}
If the RM goes down for some reason, the NM's NodeStatusUpdaterImpl will retry 
the connection with the proper retry policy. After retrying the maximum number of 
times (15 minutes by default), it will send NodeManagerEventType.SHUTDOWN to shut 
down the NM. But NM shutdown calls NodeStatusUpdaterImpl.serviceStop(), which 
calls unRegisterNM() to unregister the NM from the RM and retries again (another 
15 minutes). This is completely unnecessary, and we should skip unRegisterNM() 
when the NM is shut down because of connection issues.
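A rough sketch of the proposed behavior (illustrative only, not the actual 
NodeStatusUpdaterImpl code):
{code}
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: remember why the NM is shutting down and skip unregistration
 *  when the RM was unreachable anyway. */
public class StatusUpdaterStopSketch {
  private final AtomicBoolean failedToConnect = new AtomicBoolean(false);

  public void onConnectionRetriesExhausted() {
    failedToConnect.set(true); // set before sending SHUTDOWN to the NM
  }

  public void serviceStop(Runnable unRegisterNM, Runnable stopUpdater) {
    if (!failedToConnect.get()) {
      unRegisterNM.run(); // normal shutdown: tell the RM we are leaving
    }
    // on connection failure, don't spend another retry cycle on unregister
    stopUpdater.run();
  }
}
{code}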



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-01 Thread Junping Du (JIRA)
Junping Du created YARN-4403:


 Summary: (AM/NM/Container)LivelinessMonitor should use monotonic 
time when calculating period
 Key: YARN-4403
 URL: https://issues.apache.org/jira/browse/YARN-4403
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
calculate the expiry period, which can be broken by settimeofday. We should use 
Time.monotonicNow() instead.
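A small sketch of the difference (illustrative; Hadoop's Time.monotonicNow() is 
built on System.nanoTime(), and the interval value here is made up):
{code}
/** Wall-clock time can jump with settimeofday; a monotonic clock cannot,
 *  so expiry calculations stay correct. */
public class ExpiryCheckSketch {
  private static final long EXPIRE_INTERVAL_MS = 10 * 60 * 1000L; // made-up value

  // Fragile: breaks if the system clock is adjusted between ping and check.
  public static boolean isExpiredWallClock(long lastPingWallClockMs) {
    return System.currentTimeMillis() - lastPingWallClockMs > EXPIRE_INTERVAL_MS;
  }

  // Robust: unaffected by clock adjustments.
  public static boolean isExpiredMonotonic(long lastPingMonotonicMs) {
    return monotonicNowMs() - lastPingMonotonicMs > EXPIRE_INTERVAL_MS;
  }

  private static long monotonicNowMs() {
    return System.nanoTime() / 1_000_000L;
  }
}
{code}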



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4389) "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster

2015-11-24 Thread Junping Du (JIRA)
Junping Du created YARN-4389:


 Summary: "yarn.am.blacklisting.enabled" and 
"yarn.am.blacklisting.disable-failure-threshold" should be app specific rather 
than a setting for whole YARN cluster
 Key: YARN-4389
 URL: https://issues.apache.org/jira/browse/YARN-4389
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Reporter: Junping Du
Priority: Critical


"yarn.am.blacklisting.enabled" and 
"yarn.am.blacklisting.disable-failure-threshold" should be application specific 
rather than a setting in cluster level, or we should't maintain 
amBlacklistingEnabled and blacklistDisableThreshold in per rmApp level. We 
should allow each am to override this config, i.e. via submissionContext.
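As a rough illustration of the intent (the class and method names below are 
hypothetical, not the actual ApplicationSubmissionContext API):
{code}
/** Hypothetical per-application override carried in the submission context. */
public class AmBlacklistingRequestSketch {
  private final boolean enabled;
  private final double disableFailureThreshold;

  public AmBlacklistingRequestSketch(boolean enabled,
      double disableFailureThreshold) {
    this.enabled = enabled;
    this.disableFailureThreshold = disableFailureThreshold;
  }

  public boolean isEnabled() { return enabled; }

  public double getDisableFailureThreshold() { return disableFailureThreshold; }

  // The RM would read this from the app's submission context (if present)
  // and fall back to the cluster-wide yarn.am.blacklisting.* defaults.
}
{code}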



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4388) Cleanup "mapreduce.job.hdfs-servers" from yarn-default.xml

2015-11-24 Thread Junping Du (JIRA)
Junping Du created YARN-4388:


 Summary: Cleanup "mapreduce.job.hdfs-servers" from yarn-default.xml
 Key: YARN-4388
 URL: https://issues.apache.org/jira/browse/YARN-4388
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Priority: Minor


It is obvious that "mapreduce.job.hdfs-servers" doesn't belong in the YARN 
configuration, so we should move it to mapred-default.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4351) Tests in h.y.c.TestGetGroups get failed on trunk

2015-11-12 Thread Junping Du (JIRA)
Junping Du created YARN-4351:


 Summary: Tests in h.y.c.TestGetGroups get failed on trunk
 Key: YARN-4351
 URL: https://issues.apache.org/jira/browse/YARN-4351
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du


From the test report 
https://builds.apache.org/job/PreCommit-YARN-Build/9661/testReport/, we can see 
that there are several test failures in TestGetGroups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4352) Timeout for tests in TestYarnClient, TestAMRMClient and TestNMClient

2015-11-12 Thread Junping Du (JIRA)
Junping Du created YARN-4352:


 Summary: Timeout for tests in TestYarnClient, TestAMRMClient and 
TestNMClient
 Key: YARN-4352
 URL: https://issues.apache.org/jira/browse/YARN-4352
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du


From 
https://builds.apache.org/job/PreCommit-YARN-Build/9661/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client-jdk1.7.0_79.txt, 
we can see that the tests in TestYarnClient, TestAMRMClient and TestNMClient time 
out, which can be reproduced locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-1949) Add admin ACL check to AdminService#updateNodeResource()

2015-11-12 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-1949.
--
Resolution: Duplicate

The core change for the ACL update is already included in YARN-1506, so I am 
closing this JIRA as a duplicate. [~kj-ki], if you would like to continue your 
patch keeping only the test part, please reopen this JIRA and update your patch 
to sync with trunk. Thanks!

> Add admin ACL check to AdminService#updateNodeResource()
> 
>
> Key: YARN-1949
> URL: https://issues.apache.org/jira/browse/YARN-1949
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Kenji Kikushima
>Assignee: Kenji Kikushima
> Attachments: YARN-1949.patch
>
>
> At present, updateNodeResource() doesn't check ACL. We should call 
> checkAcls() before setResourceOption().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4346) Test committer.commitJob() behavior during committing when MR AM get failed.

2015-11-11 Thread Junping Du (JIRA)
Junping Du created YARN-4346:


 Summary: Test committer.commitJob() behavior during committing 
when MR AM get failed.
 Key: YARN-4346
 URL: https://issues.apache.org/jira/browse/YARN-4346
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du


In MAPREDUCE-5485, we are adding an additional API (isCommitJobRepeatable) to 
allow job commit to tolerate AM failure in some cases (like FileOutputCommitter 
with the v2 algorithm). Although we have unit tests to cover most of the flows, 
we may want a complete end-to-end test to verify the whole workflow.
The scenario includes:
1. For FileOutputCommitter (or some subclass), emulate an MR AM failure or 
restart while commitJob() is in progress.
2. Check the different behavior for v1 and v2 (supporting isCommitJobRepeatable() 
or not).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4288) NodeManager restart should keep retrying to register to RM while connection exception happens during RM restart

2015-10-22 Thread Junping Du (JIRA)
Junping Du created YARN-4288:


 Summary: NodeManager restart should keep retrying to register to 
RM while connection exception happens during RM restart
 Key: YARN-4288
 URL: https://issues.apache.org/jira/browse/YARN-4288
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


When the NM gets restarted, NodeStatusUpdaterImpl will try to register to the RM 
over RPC, which can throw exceptions like the following when the RM is being 
restarted at the same time:
{noformat}
2015-08-17 14:35:59,434 ERROR nodemanager.NodeStatusUpdaterImpl 
(NodeStatusUpdaterImpl.java:rebootNodeStatusUpdaterAndRegisterWithRM(222)) - 
Unexpected error rebooting NodeStatusUpdater
java.io.IOException: Failed on local exception: java.io.IOException: Connection 
reset by peer; Host Details : local host is: "172.27.62.28"; destination host 
is: "172.27.62.57":8025;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1473)
at org.apache.hadoop.ipc.Client.call(Client.java:1400)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
2015-08-17 14:35:59,436 FATAL nodemanager.NodeManager 
(NodeManager.java:run(307)) - Error while rebooting NodeStatusUpdater.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
Failed on local exception: java.io.IOException: Connection reset by peer; Host 
Details : local host is: "172.27.62.28"; destination host is: 
"172.27.62.57":8025;
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:223)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: 
Connection reset by peer; Host Details : local host is: 
"ebdp-ch2-172.27.62.28"; destination host is: "172.27.62.57":8025;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1473)
at org.apache.hadoop.ipc.Client.call(Client.java:1400)
at 

[jira] [Created] (YARN-4274) NodeStatusUpdaterImpl should register to RM again after a non-fatal exception happen before

2015-10-16 Thread Junping Du (JIRA)
Junping Du created YARN-4274:


 Summary: NodeStatusUpdaterImpl should register to RM again after a 
non-fatal exception happen before
 Key: YARN-4274
 URL: https://issues.apache.org/jira/browse/YARN-4274
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
Assignee: Junping Du


From YARN-3896, a non-fatal exception like a response ID mismatch between NM and 
RM (due to a race condition) will cause the NM to stop working. I think we should 
make it more robust to tolerate a few failures when registering to the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4160) Dynamic NM Resources Configuration file should be simplified.

2015-09-15 Thread Junping Du (JIRA)
Junping Du created YARN-4160:


 Summary: Dynamic NM Resources Configuration file should be 
simplified.
 Key: YARN-4160
 URL: https://issues.apache.org/jira/browse/YARN-4160
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du


In YARN-313, we provide a CLI to refresh NMs' resources dynamically. The format 
of dynamic-resources.xml is something like the following:
{noformat}
<configuration>
  <property>
    <name>yarn.resource.dynamic.node_id_1.vcores</name>
    <value>16</value>
  </property>
  <property>
    <name>yarn.resource.dynamic.node_id_1.memory</name>
    <value>1024</value>
  </property>
</configuration>
{noformat}
Per the review comments on YARN-313, this looks too redundant. We should have a 
better, more concise format.
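One possible more concise shape, purely as an illustration (the property name and 
node-list syntax here are made up for the example, not a format that has been 
agreed on):
{noformat}
<configuration>
  <property>
    <name>yarn.resource.dynamic.nodes</name>
    <value>node_id_1:vcores=16,memory-mb=1024;node_id_2:vcores=8,memory-mb=512</value>
  </property>
</configuration>
{noformat}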



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4031) Add JvmPauseMonitor to ApplicationHistoryServer

2015-08-07 Thread Junping Du (JIRA)
Junping Du created YARN-4031:


 Summary: Add JvmPauseMonitor to ApplicationHistoryServer
 Key: YARN-4031
 URL: https://issues.apache.org/jira/browse/YARN-4031
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du


We should add the JvmPauseMonitor to ApplicationHistoryServer and 
WebAppProxyServer, like what we did in YARN-4019 for ResourceManager and 
NodeManager.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3959) Store application related configurations in Timeline Service v2

2015-07-22 Thread Junping Du (JIRA)
Junping Du created YARN-3959:


 Summary: Store application related configurations in Timeline 
Service v2
 Key: YARN-3959
 URL: https://issues.apache.org/jira/browse/YARN-3959
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du


We already have a configuration field in the HBase schema for the application 
entity. We need to make sure the AM writes it out when it gets launched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

2015-06-17 Thread Junping Du (JIRA)
Junping Du created YARN-3815:


 Summary: [Aggregation] Application/Flow/User/Queue Level 
Aggregations
 Key: YARN-3815
 URL: https://issues.apache.org/jira/browse/YARN-3815
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


Per previous discussions in some design documents for YARN-2928, the basic 
scenario is that the query for stats can happen at:
- Application level, expected return: an application with aggregated stats
- Flow level, expected return: aggregated stats for a flow_run, flow_version and 
flow
- User level, expected return: aggregated stats for applications submitted by the 
user
- Queue level, expected return: aggregated stats for applications within the queue

Application state is the basic building block for all other aggregation levels. 
We can provide Flow/User/Queue level aggregated statistics based on application 
states (a dedicated table for application states is needed, which is missing from 
previous design documents like the HBase/Phoenix schema design).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3817) Flow and User level aggregation on Application States table

2015-06-17 Thread Junping Du (JIRA)
Junping Du created YARN-3817:


 Summary: Flow and User level aggregation on Application States 
table
 Key: YARN-3817
 URL: https://issues.apache.org/jira/browse/YARN-3817
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du


We need flow/user level aggregation to present flow/user related stats to end 
users.
Flow level aggregation involves three levels of aggregation:
- The first level is the flow_run level, which represents one execution of a flow 
and shows exactly the aggregated data for that run of the flow.
- The 2nd level is the flow_version level, which represents summary info for a 
version of the flow.
- The 3rd level is the flow level, which represents summary info for a specific 
flow.

User level aggregation represents summary info for a specific user. It should 
include summary info of accumulated and statistical measures (at two levels: 
application and flow), like: number of flows, applications, resource consumption, 
resource means per app or flow, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3695) EOFException shouldn't be retry forever in RMProxy

2015-05-21 Thread Junping Du (JIRA)
Junping Du created YARN-3695:


 Summary: EOFException shouldn't be retry forever in RMProxy
 Key: YARN-3695
 URL: https://issues.apache.org/jira/browse/YARN-3695
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du


YARN-3646 fixed the retry-forever policy so that it only applies to a limited 
set of exceptions rather than all exceptions. Here, we may want to review those 
exceptions; at the least, EOFException shouldn't be retried forever.

{code}
exceptionToPolicyMap.put(EOFException.class, retryPolicy);
exceptionToPolicyMap.put(ConnectException.class, retryPolicy);
exceptionToPolicyMap.put(NoRouteToHostException.class, retryPolicy);
exceptionToPolicyMap.put(UnknownHostException.class, retryPolicy);
exceptionToPolicyMap.put(ConnectTimeoutException.class, retryPolicy);
exceptionToPolicyMap.put(RetriableException.class, retryPolicy);
exceptionToPolicyMap.put(SocketException.class, retryPolicy);
{code}
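As a rough sketch of the suggested direction (illustrative only, reusing the map 
from the snippet above; RetryPolicies is org.apache.hadoop.io.retry.RetryPolicies):
{code}
// Illustrative: give EOFException a fail-fast policy instead of the shared
// retry-forever policy; the connection-level exceptions keep retrying.
exceptionToPolicyMap.put(EOFException.class, RetryPolicies.TRY_ONCE_THEN_FAIL);
{code}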



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3641) stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)
Junping Du created YARN-3641:


 Summary: stopRecoveryStore() shouldn't be skipped when exceptions 
happen in stopping NM's sub-services.
 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


If the NM's services do not get stopped properly, we cannot restart the NM with 
work-preserving NM restart enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
/var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
temporarily unavailable
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
Resource temporarily unavailable
at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
(LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NodeManager at 
c6403.ambari.apache.org/192.168.64.103
/
{noformat}

The related code is as below in NodeManager.java:
{code}
  @Override
  protected void serviceStop() throws Exception {
if (isStopping.getAndSet(true)) {
  return;
}
super.serviceStop();
stopRecoveryStore();
DefaultMetricsSystem.shutdown();
  }
{code}
We can see that we stop all NM registered services (NodeStatusUpdater, 
LogAggregationService, ResourceLocalizationService, etc.) first. Any of these 
services stopping with an exception could cause stopRecoveryStore() to be 
skipped, which means the leveldb store is not closed. So the next time the NM 
starts, it will fail with the exception above.
We should put stopRecoveryStore() in a finally block.
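A minimal sketch of the suggested change to the snippet above:
{code}
  @Override
  protected void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    try {
      super.serviceStop();
    } finally {
      // Runs even if a sub-service throws, so the leveldb store is always
      // closed and the next NM start can re-acquire the LOCK file.
      stopRecoveryStore();
      DefaultMetricsSystem.shutdown();
    }
  }
{code}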



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3599) Fix the javadoc of DelegationTokenSecretManager in hadoop-yarn

2015-05-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-3599.
--
Resolution: Duplicate

 Fix the javadoc of DelegationTokenSecretManager in hadoop-yarn
 --

 Key: YARN-3599
 URL: https://issues.apache.org/jira/browse/YARN-3599
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Gabor Liptak
Priority: Trivial
 Attachments: YARN-3599.1.patch, YARN-3599.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3596) Fix the javadoc of DelegationTokenSecretManager in hadoop-common

2015-05-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-3596.
--
Resolution: Duplicate

 Fix the javadoc of DelegationTokenSecretManager in hadoop-common
 

 Key: YARN-3596
 URL: https://issues.apache.org/jira/browse/YARN-3596
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Gabor Liptak
Priority: Trivial
 Attachments: YARN-3596.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3597) Fix the javadoc of DelegationTokenSecretManager in hadoop-hdfs

2015-05-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-3597.
--
Resolution: Duplicate

 Fix the javadoc of DelegationTokenSecretManager in hadoop-hdfs
 --

 Key: YARN-3597
 URL: https://issues.apache.org/jira/browse/YARN-3597
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Gabor Liptak
Priority: Trivial
 Attachments: YARN-3597.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3598) Fix the javadoc of DelegationTokenSecretManager in hadoop-mapreduce

2015-05-11 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-3598.
--
Resolution: Duplicate

 Fix the javadoc of DelegationTokenSecretManager in hadoop-mapreduce
 ---

 Key: YARN-3598
 URL: https://issues.apache.org/jira/browse/YARN-3598
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Gabor Liptak
Priority: Trivial
 Attachments: YARN-3598.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3592) Fix typos in RMNodeLabelsManager

2015-05-07 Thread Junping Du (JIRA)
Junping Du created YARN-3592:


 Summary: Fix typos in RMNodeLabelsManager
 Key: YARN-3592
 URL: https://issues.apache.org/jira/browse/YARN-3592
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Junping Du


acccessibleNodeLabels should be accessibleNodeLabels in many places.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3586) RM only get back addresses of Collectors that NM needs to know.

2015-05-06 Thread Junping Du (JIRA)
Junping Du created YARN-3586:


 Summary: RM only get back addresses of Collectors that NM needs to 
know.
 Key: YARN-3586
 URL: https://issues.apache.org/jira/browse/YARN-3586
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager, timelineserver
Reporter: Junping Du
Assignee: Junping Du


After YARN-3445, the RM caches runningApps for each NM. So the RM's heartbeat 
response to an NM should only include collectors' addresses for the applications 
running on that specific NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2470) A high value for yarn.nodemanager.delete.debug-delay-sec causes Nodemanager to crash. Slider needs this value to be high. Setting a very high value throws an exception an

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2470.
--
Resolution: Won't Fix

 A high value for yarn.nodemanager.delete.debug-delay-sec causes Nodemanager 
 to crash. Slider needs this value to be high. Setting a very high value 
 throws an exception and nodemanager does not start
 --

 Key: YARN-2470
 URL: https://issues.apache.org/jira/browse/YARN-2470
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.1
Reporter: Shivaji Dutta
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2483) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails due to incorrect AppAttempt state

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2483.
--
  Resolution: Duplicate
Target Version/s:   (was: 2.6.0)

Resolving this JIRA as a duplicate.

 TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails due to 
 incorrect AppAttempt state
 

 Key: YARN-2483
 URL: https://issues.apache.org/jira/browse/YARN-2483
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu

 From https://builds.apache.org/job/Hadoop-Yarn-trunk/665/console :
 {code}
 testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart)
   Time elapsed: 49.686 sec   FAILURE!
 java.lang.AssertionError: AppAttempt state is not correct (timedout) 
 expected:<ALLOCATED> but was:<SCHEDULED>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:84)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:417)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:582)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:589)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForNewAMToLaunchAndRegister(MockRM.java:182)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:402)
 {code}
 TestApplicationMasterLauncher#testallocateBeforeAMRegistration fails with 
 similar cause.
 These tests failed in build #664 as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2445.
--
Resolution: Won't Fix

Agree with [~billie.rinaldi]'s comments above; this is expected behavior.

 ATS does not reflect changes to uploaded TimelineEntity
 ---

 Key: YARN-2445
 URL: https://issues.apache.org/jira/browse/YARN-2445
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Marcelo Vanzin
Priority: Minor
 Attachments: ats2.java


 If you make a change to the TimelineEntity and send it to the ATS, that 
 change is not reflected in the stored data.
 For example, in the attached code, an existing primary filter is removed and 
 a new one is added. When you retrieve the entity from the ATS, it only 
 contains the old value:
 {noformat}
 {entities:[{events:[],entitytype:test,entity:testid-ad5380c0-090e-4982-8da8-21676fe4e9f4,starttime:1408746026958,relatedentities:{},primaryfilters:{oldprop:[val]},otherinfo:{}}]}
 {noformat}
 Perhaps this is what the design wanted, but from an API user standpoint, it's 
 really confusing, since to upload events I have to upload the entity itself, 
 and the changes are not reflected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-670) Add an Exception to indicate 'Maintenance' for NMs and add this to the JavaDoc for appropriate protocols

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-670.
-
Resolution: Won't Fix

 Add an Exception to indicate 'Maintenance' for NMs and add this to the 
 JavaDoc for appropriate protocols
 

 Key: YARN-670
 URL: https://issues.apache.org/jira/browse/YARN-670
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Siddharth Seth





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-964) Give a parameter that can set AM retry interval

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-964.
-
Resolution: Won't Fix

Agree with [~vinodkv]. Resolving this as Won't Fix.

 Give a parameter that can set  AM retry interval
 

 Key: YARN-964
 URL: https://issues.apache.org/jira/browse/YARN-964
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: qus-jiawei

 Our AM retry number is 4.
 As one NodeManager's disk is full, the AM's container couldn't be allocated on 
 that NodeManager. But the RM tries this AM on the same NM every 3 seconds.
 I think there should be a parameter to set the AM retry interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2365) TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry fails on branch-2

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2365.
--
Resolution: Cannot Reproduce

 TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry fails on branch-2
 --

 Key: YARN-2365
 URL: https://issues.apache.org/jira/browse/YARN-2365
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Mit Desai

 TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails on branch-2 
 with the following errror
 {noformat}
 Running 
 org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
 Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 46.471 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
 testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart)
   Time elapsed: 46.354 sec   FAILURE!
 java.lang.AssertionError: AppAttempt state is not correct (timedout) 
 expected:<ALLOCATED> but was:<SCHEDULED>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:414)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:569)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:576)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:389)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3567) Document exit codes and their meanings used by LinuxContainerExecutor.

2015-05-01 Thread Junping Du (JIRA)
Junping Du created YARN-3567:


 Summary: Document exit codes and their meanings used by 
LinuxContainerExecutor.
 Key: YARN-3567
 URL: https://issues.apache.org/jira/browse/YARN-3567
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Junping Du


Similar to YARN-2334, we should document exit codes and means for 
LinuxContainerExecutor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2334) Document exit codes and their meanings used by linux task controller.

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2334.
--
Resolution: Not A Problem

 Document exit codes and their meanings used by linux task controller.
 -

 Key: YARN-2334
 URL: https://issues.apache.org/jira/browse/YARN-2334
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Sreekanth Ramakrishnan
 Attachments: HADOOP-5912.1.patch, MAPREDUCE-1318.1.patch, 
 MAPREDUCE-1318.2.patch, MAPREDUCE-1318.patch


 Currently, linux task controller binary uses a set of exit code, which is not 
 documented. These should be documented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2364) TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2364.
--
Resolution: Duplicate

Duplicated with YARN-1468

 TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy
 

 Key: YARN-2364
 URL: https://issues.apache.org/jira/browse/YARN-2364
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Mit Desai

 TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy. It fails 
 intermittently on branch-2 with the following errors.
 Fails with any of these
 {noformat}
 Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
 Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 26.836 sec 
  FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
 testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
   Time elapsed: 26.687 sec   FAILURE!
 java.lang.AssertionError: expected:<4> but was:<3>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:557)
 {noformat}
 or
 {noformat}
 Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
 Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 51.326 sec 
  FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
 testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
   Time elapsed: 51.055 sec   FAILURE!
 java.lang.AssertionError: AppAttempt state is not correct (timedout) 
 expected:<ALLOCATED> but was:<SCHEDULED>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:414)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.launchAM(TestRMRestart.java:949)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:519)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2390) Investigating whether generic history service needs to support queue-acls

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2390.
--
Resolution: Won't Fix

 Investigating whether generic history service needs to support queue-acls
 -

 Key: YARN-2390
 URL: https://issues.apache.org/jira/browse/YARN-2390
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Sunil G

 According to YARN-1250, it's arguable whether queue-acls should be applied to 
 the generic history service as well, because the queue admin may not need 
 access to a completed application that has been removed from the queue. 
 Creating this ticket to track that discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2401) Rethinking of the HTTP method of TimelineWebServices#postEntities

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2401.
--
Resolution: Won't Fix

 Rethinking of the HTTP method of TimelineWebServices#postEntities
 -

 Key: YARN-2401
 URL: https://issues.apache.org/jira/browse/YARN-2401
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Now TimelineWebServices#postEntities uses POST. However, semantically, 
 postEntities creates an entity or appends more data to it, so POST may not be 
 the most appropriate method for this API.
 AFAIK, PUT is used to update the entire resource and is supposed to be 
 idempotent. Therefore, I'm not sure it's a good idea to change the method to 
 PUT, because once the entity is created, the following updates actually append 
 more data to the existing one. The best fit would be PATCH; however, it 
 requires additional implementation on the web services side. Hence, some have 
 suggested using POST for partial, non-idempotent updates as well. We need to 
 think more about it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2398) TestResourceTrackerOnHA crashes

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2398.
--
Resolution: Cannot Reproduce

 TestResourceTrackerOnHA crashes
 ---

 Key: YARN-2398
 URL: https://issues.apache.org/jira/browse/YARN-2398
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Jason Lowe

 TestResourceTrackerOnHA is currently crashing and failing trunk builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-880) Configuring map/reduce memory equal to nodemanager's memory, hangs the job execution

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-880.
-
Resolution: Not A Problem
  Assignee: (was: Omkar Vinit Joshi)

 Configuring map/reduce memory equal to nodemanager's memory, hangs the job 
 execution
 

 Key: YARN-880
 URL: https://issues.apache.org/jira/browse/YARN-880
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.1-alpha
Reporter: Nishan Shetty
Priority: Critical

 Scenario:
 =
 The cluster is installed with 2 NodeManagers.
 Configuration:
 NM memory (yarn.nodemanager.resource.memory-mb): 8 gb
 map and reduce memory: 8 gb
 AppMaster memory: 2 gb
 If a map task is reserved on the same NodeManager where the AppMaster of the 
 same job is running, then the job execution hangs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-943) RM starts 2 attempts of failed app even though am-max-retries is set to 1

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-943.
-
Resolution: Not A Problem
  Assignee: (was: Zhijie Shen)

 RM starts 2 attempts of failed app even though am-max-retries is set to 1
 -

 Key: YARN-943
 URL: https://issues.apache.org/jira/browse/YARN-943
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Bikas Saha
 Attachments: nm.log, rm.log, yarn-site.xml


 yarn.resourcemanager.am.max-retries is set to 1, but the RM still 
 retries the AM 2 times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.

2015-04-27 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-3431.
--
   Resolution: Fixed
Fix Version/s: YARN-2928
 Hadoop Flags: Reviewed

I have committed this to YARN-2928. Thanks [~zjshen] for contributing the patch! 
Also, thanks for the review, [~sjlee0] and [~gtCarrera9]!

 Sub resources of timeline entity needs to be passed to a separate endpoint.
 ---

 Key: YARN-3431
 URL: https://issues.apache.org/jira/browse/YARN-3431
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Fix For: YARN-2928

 Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch, 
 YARN-3431.4.patch, YARN-3431.5.patch, YARN-3431.6.patch, YARN-3431.7.patch


 We have TimelineEntity and several other entity classes that inherit from 
 it. However, we only have a single endpoint, which consumes TimelineEntity 
 rather than its sub-classes, and this endpoint checks that the incoming 
 request body contains exactly a TimelineEntity object. The JSON data 
 serialized from a sub-class object does not seem to be treated as a 
 TimelineEntity object and won't be deserialized into the corresponding 
 sub-class object, which causes the deserialization failure discussed in 
 YARN-3334: 
 https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059
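
A self-contained illustration, using Jackson with made-up Base/Sub classes rather than the real TimelineEntity hierarchy, of why JSON produced from a sub-class does not round-trip through an endpoint that only knows the base type.
{code}
// Illustration only: Base/Sub stand in for the base entity and a sub-class.
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SubtypeJsonDemo {
  public static class Base { public String id; }
  public static class Sub extends Base { public String extra; }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper()
        .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);

    Sub sub = new Sub();
    sub.id = "entity-1";
    sub.extra = "sub-class only field";

    String json = mapper.writeValueAsString(sub);   // contains both fields
    Base base = mapper.readValue(json, Base.class); // read back as the base type

    // The result is a plain Base: "extra" is silently dropped here; with the
    // default FAIL_ON_UNKNOWN_PROPERTIES setting the read would instead fail,
    // which matches the deserialization error described above.
    System.out.println(base.getClass().getSimpleName()); // prints "Base"
  }
}
{code}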



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not be cached in RMApps

2015-04-17 Thread Junping Du (JIRA)
Junping Du created YARN-3505:


 Summary: Node's Log Aggregation Report with SUCCEED should not be 
cached in RMApps
 Key: YARN-3505
 URL: https://issues.apache.org/jira/browse/YARN-3505
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.8.0
Reporter: Junping Du
Assignee: Xuan Gong
Priority: Critical


Per discussions in YARN-1402, we shouldn't cache every node's log aggregation 
report in RMApps forever, especially those that finished with SUCCEED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3488) AM gets timeline service info from RM rather than application-specific configuration.

2015-04-14 Thread Junping Du (JIRA)
Junping Du created YARN-3488:


 Summary: AM gets timeline service info from RM rather than 
application-specific configuration.
 Key: YARN-3488
 URL: https://issues.apache.org/jira/browse/YARN-3488
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: applications
Reporter: Junping Du
Assignee: Junping Du


Since the v1 timeline service, we have had an MR configuration to 
enable/disable putting history events to the timeline service. For the ongoing 
v2 timeline service effort, we currently have different methods/structures 
between v1 and v2 for consuming TimelineClient, so the application has to be 
aware of which timeline service version is in use.
There are basically two options here:
The first option is the current approach in DistributedShell or MR: let the 
application carry its own configuration that says whether ATS is enabled and 
which version it is, e.g. MRJobConfig.MAPREDUCE_JOB_EMIT_TIMELINE_DATA.
The other option is to let the application figure out the timeline-related 
info from YARN/RM; this can be done through registerApplicationMaster() in 
ApplicationMasterProtocol, with a return value indicating service off, v1_on, 
or v2_on. A sketch of this idea follows below.
We prefer the latter option because the application owner doesn't have to be 
aware of RM/YARN infrastructure details. Please note that we should stay 
compatible (consistent behavior with the same settings) with released 
configurations.
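
A minimal sketch of the second option; the TimelineServiceState enum and AmRegistrationInfo interface are invented for illustration and are not part of the released ApplicationMasterProtocol.
{code}
// Hypothetical sketch of option two: the AM learns the timeline service state
// from the RM at registration time instead of from job-specific configuration.
enum TimelineServiceState { OFF, V1_ON, V2_ON }

// Stand-in for what registerApplicationMaster() might return if the RM
// advertised the timeline service version (hypothetical, not the real API).
interface AmRegistrationInfo {
  TimelineServiceState getTimelineServiceState();
}

class TimelineAwareAM {
  void onRegistered(AmRegistrationInfo info) {
    switch (info.getTimelineServiceState()) {
      case V2_ON:
        // create a v2-style timeline client and start emitting entities
        break;
      case V1_ON:
        // fall back to the v1 client and endpoints
        break;
      case OFF:
      default:
        // skip timeline publishing entirely
        break;
    }
  }
}
{code}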



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart

2015-04-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-3449.
--
Resolution: Invalid
  Assignee: (was: Junping Du)

 Recover appTokenKeepAliveMap upon nodemanager restart
 -

 Key: YARN-3449
 URL: https://issues.apache.org/jira/browse/YARN-3449
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.6.0, 2.7.0
Reporter: Junping Du

 appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep an application 
 alive after the application is finished but the NM still needs the app token 
 to do log aggregation (when security and log aggregation are enabled).
 Applications are only inserted into this map when getApplicationsToCleanup() 
 is received in the RM heartbeat response, and the RM only sends this info 
 once, in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). Work-preserving 
 NM restart should therefore persist appTokenKeepAliveMap into the 
 NMStateStore and recover it after restart. Without doing this, the RM could 
 terminate the application earlier, so log aggregation could fail if security 
 is enabled.
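
A minimal sketch of the recovery idea: persist each keep-alive entry as it is added and reload the map on restart. The KeepAliveStore interface and method names are invented for illustration; they are not the actual NMStateStoreService API.
{code}
// Hypothetical sketch of persisting and recovering the keep-alive map.
import java.util.HashMap;
import java.util.Map;

interface KeepAliveStore {
  void storeKeepAliveApp(String appId, long expireTimeMillis);
  void removeKeepAliveApp(String appId);
  Map<String, Long> loadKeepAliveApps();
}

class KeepAliveTracker {
  private final KeepAliveStore store;
  private final Map<String, Long> appTokenKeepAliveMap = new HashMap<>();

  KeepAliveTracker(KeepAliveStore store) {
    this.store = store;
    // On (re)start, rebuild the in-memory map from the persisted state so the
    // NM keeps reporting these apps as alive to the RM after a restart.
    appTokenKeepAliveMap.putAll(store.loadKeepAliveApps());
  }

  void addKeepAliveApp(String appId, long expireTimeMillis) {
    appTokenKeepAliveMap.put(appId, expireTimeMillis);
    store.storeKeepAliveApp(appId, expireTimeMillis); // persist as we go
  }

  void removeKeepAliveApp(String appId) {
    appTokenKeepAliveMap.remove(appId);
    store.removeKeepAliveApp(appId);
  }
}
{code}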



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-329) yarn CHANGES.txt link missing from docs Reference

2015-04-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-329.
-
   Resolution: Fixed
Fix Version/s: 2.6.0

This got fixed as of the 2.6.0 release. Marking it as resolved.

 yarn CHANGES.txt link missing from docs Reference
 -

 Key: YARN-329
 URL: https://issues.apache.org/jira/browse/YARN-329
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Thomas Graves
Priority: Minor
 Fix For: 2.6.0


 Looking at the hadoop 0.23 docs: http://hadoop.apache.org/docs/r0.23.5/
 There is no link to the yarn CHANGES.txt in the Reference menu on the left 
 side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-374) Job History Server doesn't show jobs which killed by ClientRMProtocol.forceKillApplication

2015-04-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-374.
-
Resolution: Not a Problem

 Job History Server doesn't show jobs which killed by 
 ClientRMProtocol.forceKillApplication
 --

 Key: YARN-374
 URL: https://issues.apache.org/jira/browse/YARN-374
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, resourcemanager
Affects Versions: 2.0.1-alpha
Reporter: Nemon Lou

 After I kill an app by typing bin/yarn rmadmin app -kill APP_ID,
 no job info is kept on the JHS web page.
 However, when I kill a job by typing bin/mapred job -kill JOB_ID,
 I can see a killed job left on the JHS.
 Some Hive users are confused that their jobs have been killed but nothing is 
 left on the JHS, and the killed app's info on the RM web page is not enough. 
 (They kill jobs via ClientRMProtocol.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2969) allocate resource on different nodes for task

2015-04-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-2969.
--
Resolution: Duplicate

 allocate resource on different nodes for task
 -

 Key: YARN-2969
 URL: https://issues.apache.org/jira/browse/YARN-2969
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Yang Hao

 With the help of Slider, YARN becomes a common resource-managing OS, and 
 some applications would like to place containers (or components, in Slider 
 terms) on different nodes, so a configuration for allocating resources on 
 different nodes would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-463) Show explicitly excluded nodes on the UI

2015-04-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-463.
-
Resolution: Implemented

We already show decommissioned nodes on the UI page, so resolving this JIRA.

 Show explicitly excluded nodes on the UI
 

 Key: YARN-463
 URL: https://issues.apache.org/jira/browse/YARN-463
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Vinod Kumar Vavilapalli
  Labels: usability

 Nodes can be explicitly excluded via the config 
 yarn.resourcemanager.nodes.exclude-path. We should have a way of displaying 
 this list via web and command line UIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-551) The option shell_command of DistributedShell had better support compound command

2015-04-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-551.
-
Resolution: Not a Problem

 The option shell_command of DistributedShell had better support  compound 
 command
 -

 Key: YARN-551
 URL: https://issues.apache.org/jira/browse/YARN-551
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: rainy Yu

 The shell_command option of DistributedShell currently must be a single 
 command such as 'ls'; it cannot be a compound command such as 'ps -ef' that 
 includes whitespace.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-375) FIFO scheduler may crash due to buggy app

2015-04-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-375.
-
Resolution: Not a Problem

 FIFO scheduler may crash due to buggy app
 --

 Key: YARN-375
 URL: https://issues.apache.org/jira/browse/YARN-375
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Arun C Murthy
Priority: Critical

 The following code should check for a 0 return value rather than crash!
 {code}
 int availableContainers = 
   node.getAvailableResource().getMemory() / capability.getMemory();
   // TODO: A buggy application with this zero would crash the scheduler.
 {code}
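
A minimal sketch of the kind of zero check being asked for, written as a standalone helper over the two memory values from the snippet above; this is an illustration, not the actual fix that shipped.
{code}
// Guarding the division: return 0 instead of dividing by zero when a buggy
// application asks for a zero (or negative) amount of memory.
final class ContainerMath {
  private ContainerMath() {}

  /** How many containers of the requested size fit in the available memory,
   *  or 0 if the requested size is zero or negative. */
  static int availableContainers(int availableMemoryMb, int requestedMemoryMb) {
    if (requestedMemoryMb <= 0) {
      return 0; // a request for 0 MB no longer crashes the scheduler
    }
    return availableMemoryMb / requestedMemoryMb;
  }
}
{code}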



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-800) Clicking on an AM link for a running app leads to a HTTP 500

2015-04-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-800.
-
Resolution: Duplicate

 Clicking on an AM link for a running app leads to a HTTP 500
 

 Key: YARN-800
 URL: https://issues.apache.org/jira/browse/YARN-800
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Arpit Gupta
Priority: Minor

 Clicking the AM link tries to open up a page with a URL like
 http://hostname:8088/proxy/application_1370886527995_0645/
 and this leads to an HTTP 500



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-520) webservices API ws/v1/cluster/nodes doesn't return LOST nodes

2015-04-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-520.
-
Resolution: Duplicate

 webservices API ws/v1/cluster/nodes doesn't return LOST nodes
 -

 Key: YARN-520
 URL: https://issues.apache.org/jira/browse/YARN-520
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.6
Reporter: Nathan Roberts

 webservices API ws/v1/cluster/nodes doesn't return LOST nodes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart

2015-04-03 Thread Junping Du (JIRA)
Junping Du created YARN-3449:


 Summary: Recover appTokenKeepAliveMap upon nodemanager restart
 Key: YARN-3449
 URL: https://issues.apache.org/jira/browse/YARN-3449
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.6.0, 2.7.0
Reporter: Junping Du
Assignee: Junping Du


appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep an application 
alive after the application is finished but the NM still needs the app token 
to do log aggregation (when security and log aggregation are enabled).
Applications are only inserted into this map when getApplicationsToCleanup() 
is received in the RM heartbeat response, and the RM only sends this info 
once, in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). Work-preserving 
NM restart should persist appTokenKeepAliveMap into the NMStateStore and 
recover it after restart. Without doing this, the RM could terminate the 
application earlier, so log aggregation could fail if security is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3445) NM notify RM on running Apps in NM-RM heartbeat

2015-04-02 Thread Junping Du (JIRA)
Junping Du created YARN-3445:


 Summary: NM notify RM on running Apps in NM-RM heartbeat
 Key: YARN-3445
 URL: https://issues.apache.org/jira/browse/YARN-3445
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.0
Reporter: Junping Du
Assignee: Junping Du


Per discussion in YARN-3334, we need to filter out unnecessary collector info 
from the RM heartbeat response. Our proposal is to add an additional field for 
running apps to the NM heartbeat request, so the RM only sends back collectors 
for the apps running locally on that node; a sketch of the idea follows below. 
This is also needed for YARN-914 (graceful decommission): if a NodeManager in 
the decommissioning stage has no running apps, it can be decommissioned 
immediately.
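
A hypothetical sketch of the filtering on the RM side: only return collector addresses for the apps the NM reported as running. The types and method names here are invented for illustration and are not the actual protocol records.
{code}
// Illustration of per-node collector filtering; String app IDs stand in for
// the real ApplicationId type.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class CollectorInfoFilter {
  /**
   * @param runningApps     app IDs the NM listed in its heartbeat request
   * @param knownCollectors appId -> collector address known to the RM
   * @return only the collectors relevant to this NM's running apps
   */
  static Map<String, String> collectorsFor(Set<String> runningApps,
                                           Map<String, String> knownCollectors) {
    Map<String, String> result = new HashMap<>();
    for (String appId : runningApps) {
      String collector = knownCollectors.get(appId);
      if (collector != null) {
        result.put(appId, collector);
      }
    }
    return result;
  }
}
{code}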



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3374) Collector's web server should randomly bind an available port

2015-04-02 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-3374.
--
   Resolution: Fixed
Fix Version/s: YARN-2928
 Hadoop Flags: Reviewed

Committed to branch YARN-2928. Thanks [~zjshen] for the patch!

 Collector's web server should randomly bind an available port
 -

 Key: YARN-3374
 URL: https://issues.apache.org/jira/browse/YARN-3374
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Fix For: YARN-2928

 Attachments: YARN-3347.1.patch


 It's based on the configuration now. That approach won't work if we move to 
 the app-level aggregator container solution: one NM may start multiple such 
 aggregators, which cannot all bind to the same configured port.
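
A minimal standard-library illustration of ephemeral-port binding, which is the usual way to let the OS pick a free port; this is not the actual collector web-server code.
{code}
// Asking for port 0 lets the OS pick any free port, so several servers can
// start on one host without colliding on a single configured port.
import java.io.IOException;
import java.net.ServerSocket;

public class EphemeralPortDemo {
  public static void main(String[] args) throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) { // 0 = any free port
      int boundPort = socket.getLocalPort();          // the port actually chosen
      System.out.println("Bound to port " + boundPort);
    }
  }
}
{code}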



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3408) TestDistributedShell get failed due to RM start failure.

2015-03-27 Thread Junping Du (JIRA)
Junping Du created YARN-3408:


 Summary: TestDistributedShell get failed due to RM start failure.
 Key: YARN-3408
 URL: https://issues.apache.org/jira/browse/YARN-3408
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Junping Du
Assignee: Junping Du


The exception from the log:
{code}
2015-03-27 14:43:17,190 WARN  [RM-0] mortbay.log (Slf4jLog.java:warn(89)) - 
Failed startup of context 
org.mortbay.jetty.webapp.WebAppContext@2d2d0132{/,file:/Users/jdu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/classes/webapps/cluster}
javax.servlet.ServletException: java.lang.RuntimeException: Could not read 
signature secret file: /Users/jdu/hadoop-http-auth-signature-secret
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.initializeSecretProvider(AuthenticationFilter.java:266)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.init(AuthenticationFilter.java:225)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.init(DelegationTokenAuthenticationFilter.java:161)
at 
org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.init(RMAuthenticationFilter.java:53)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:773)
at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:274)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:989)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1089)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.MiniYARNCluster$2.run(MiniYARNCluster.java:312)
Caused by: java.lang.RuntimeException: Could not read signature secret file: 
/Users/jdu/hadoop-http-auth-signature-secret
at 
org.apache.hadoop.security.authentication.util.FileSignerSecretProvider.init(FileSignerSecretProvider.java:59)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.initializeSecretProvider(AuthenticationFilter.java:264)
... 23 more
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3402) Security support for new timeline service.

2015-03-26 Thread Junping Du (JIRA)
Junping Du created YARN-3402:


 Summary: Security support for new timeline service.
 Key: YARN-3402
 URL: https://issues.apache.org/jira/browse/YARN-3402
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Junping Du
Assignee: Junping Du


We should support YARN security for new TimelineService.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3040) [Data Model] Make putEntities operation be aware of the app's context

2015-03-26 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-3040.
--
   Resolution: Fixed
Fix Version/s: YARN-2928
 Hadoop Flags: Reviewed

 [Data Model] Make putEntities operation be aware of the app's context
 -

 Key: YARN-3040
 URL: https://issues.apache.org/jira/browse/YARN-3040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Zhijie Shen
 Fix For: YARN-2928

 Attachments: YARN-3040.1.patch, YARN-3040.2.patch, YARN-3040.3.patch, 
 YARN-3040.4.patch, YARN-3040.5.patch, YARN-3040.6.patch


 Per design in YARN-2928, implement client-side API for handling *flows*. 
 Frameworks should be able to define and pass in all attributes of flows and 
 flow runs to YARN, and they should be passed into ATS writers.
 YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3333) rename TimelineAggregator etc. to TimelineCollector

2015-03-19 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du resolved YARN-.
--
  Resolution: Fixed
Hadoop Flags: Reviewed

 rename TimelineAggregator etc. to TimelineCollector
 ---

 Key: YARN-
 URL: https://issues.apache.org/jira/browse/YARN-
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Attachments: YARN--unit-tests-fixes.patch, YARN-.001.patch, 
 YARN-.002.patch


 Per discussions on YARN-2928, let's rename TimelineAggregator, etc. to 
 TimelineCollector, etc.
 There are also several minor issues on the current branch, which can be fixed 
 as part of this:
 - fixing some imports
 - missing license in TestTimelineServerClientIntegration.java
 - whitespaces
 - missing direct dependency



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3367) Replace starting a separate thread for post entity with event loop in TimelineClient

2015-03-18 Thread Junping Du (JIRA)
Junping Du created YARN-3367:


 Summary: Replace starting a separate thread for post entity with 
event loop in TimelineClient
 Key: YARN-3367
 URL: https://issues.apache.org/jira/browse/YARN-3367
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Junping Du
Assignee: Junping Du


Since YARN-3039, we added a loop in TimelineClient to wait for 
collectorServiceAddress to be ready before posting any entity. In consumers of 
TimelineClient (like the AM), we are starting a new thread for each call to 
avoid a potential deadlock in the main thread. This approach has at least 3 
major defects:
1. The consumer needs additional code to wrap each putEntities() call in a 
thread.
2. It costs many thread resources, which is unnecessary.
3. The sequence of events could get out of order, because each posting thread 
leaves the waiting loop at a random time.
We should have something like an event loop on the TimelineClient side: 
putEntities() only puts the entities into a queue, and a separate thread 
delivers the queued entities to the collector via REST calls. A sketch of this 
pattern follows below.
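
A minimal sketch of the queue-plus-dispatcher pattern described above, using only the JDK; the Entity type and deliverToCollector() are placeholders, not the real TimelineClient API.
{code}
// Event-loop style publisher: callers enqueue entities and return immediately,
// while a single dispatcher thread drains the queue in order.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class EntityDispatcher {
  static class Entity { final String payload; Entity(String p) { payload = p; } }

  private final BlockingQueue<Entity> queue = new LinkedBlockingQueue<>();
  private final Thread dispatcher = new Thread(this::drainLoop, "entity-dispatcher");
  private volatile boolean running = true;

  void start() { dispatcher.start(); }

  // Non-blocking from the caller's point of view: just enqueue and return.
  void putEntity(Entity entity) { queue.add(entity); }

  private void drainLoop() {
    try {
      while (running || !queue.isEmpty()) {
        Entity next = queue.take();   // preserves submission order
        deliverToCollector(next);     // e.g. a REST call in the real client
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt(); // shut down quietly on interrupt
    }
  }

  private void deliverToCollector(Entity entity) {
    // placeholder for the actual HTTP/REST delivery
    System.out.println("delivered: " + entity.payload);
  }

  void stop() {
    running = false;
    dispatcher.interrupt();
  }
}
{code}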



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


  1   2   >