[jira] [Created] (YARN-7672) hadoop-sls can not simulate huge scale of YARN

2017-12-18 Thread zhangshilong (JIRA)
zhangshilong created YARN-7672:
--

 Summary: hadoop-sls can not simulate huge scale of YARN
 Key: YARN-7672
 URL: https://issues.apache.org/jira/browse/YARN-7672
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhangshilong
Assignee: zhangshilong


Our YARN cluster has scaled to nearly 10,000 nodes, and we need to run a 
scheduler stress test.
We start 2000+ threads to simulate the NMs and AMs, which drives the CPU load 
above 100.
I think this load will distort the performance evaluation of the scheduler, 
so I would like to separate the scheduler from the simulator.
I start a real RM; SLS then registers nodes with the RM and submits apps to 
it using RM RPC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7214) duplicated container completed To AM

2017-09-19 Thread zhangshilong (JIRA)
zhangshilong created YARN-7214:
--

 Summary: duplicated container completed To AM
 Key: YARN-7214
 URL: https://issues.apache.org/jira/browse/YARN-7214
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha3, 2.7.1
 Environment: hadoop 2.7.1  rm recovery and nm recovery enabled
Reporter: zhangshilong


env: hadoop 2.7.1 with RM recovery and NM recovery enabled
case:
 A Spark app (app1) is running at least one container (named c1) on NM1.
 1. NM1 crashes, and the RM marks NM1 as expired after 10 minutes.
 2. The RM removes all containers on NM1 (RMNodeImpl), and app1 receives a 
c1 completed message. But the RM cannot send c1 (as a container to be removed) 
to NM1 because NM1 is lost.
 3. NM1 restarts and registers with the RM (c1 is in the register request), 
but the RM finds that NM1 is lost and does not handle the containers from NM1.
 4. NM1 does not report c1 in its heartbeats (c1 is not in the heartbeat 
request), so c1 is never removed from the context of NM1.
 5. The RM restarts and NM1 re-registers with it. This time c1 is handled and 
recovered, and the RM sends a c1 completed message to the AM of app1. As a 
result, app1 receives the c1 completion twice. Each time the Spark AM receives 
a container completed message from the RM, it allocates one new container.
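
The duplicate in step 5 can be illustrated with a small sketch. This is not 
YARN or Spark code and not a proposed patch: CompletionDeduplicator and 
markCompleted are hypothetical names, and container IDs are modelled as plain 
strings. It only shows how a receiver could ignore a completion it has already 
seen, so that a repeated c1 does not trigger a second replacement request.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical AM-side guard; not part of YARN or Spark. */
final class CompletionDeduplicator {
    // IDs of containers whose completion has already been processed.
    private final Set<String> completed = ConcurrentHashMap.newKeySet();

    /** Returns true only the first time a container ID is reported completed. */
    boolean markCompleted(String containerId) {
        return completed.add(containerId);
    }

    public static void main(String[] args) {
        CompletionDeduplicator dedup = new CompletionDeduplicator();
        int replacementsRequested = 0;
        // c1 is reported twice, as in the scenario above.
        String[] reports = {"c1", "c2", "c1"};
        for (String id : reports) {
            if (dedup.markCompleted(id)) {
                replacementsRequested++; // request a replacement only once per container
            }
        }
        System.out.println("replacements requested: " + replacementsRequested);
    }
}
```

The set grows with every completed container, so a real implementation would 
need some way to bound or expire it; that is left out here.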







[jira] [Created] (YARN-6045) apps/queues that have no pending containers will still affect the efficiency of scheduling

2016-12-29 Thread zhangshilong (JIRA)
zhangshilong created YARN-6045:
--

 Summary: apps/queues that have no pending containers will still 
affect the efficiency of scheduling
 Key: YARN-6045
 URL: https://issues.apache.org/jira/browse/YARN-6045
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.7.1
 Environment: jdk 1.7
kernel:2.6.32-431.20.3.el6
Reporter: zhangshilong
Assignee: zhangshilong


Sorting queues/apps consumes a significant amount of time during a single 
container allocation.
Each time a container is assigned, all queues/apps are sorted by hierarchy.
In practice, many queues/apps have no pending containers and do not need to 
participate in the sort.
If apps/queues that need no resources are excluded from sorting, scheduling 
performance will improve considerably.
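
A minimal sketch of the idea, under stated assumptions: AppInfo and 
candidatesForAssignment are hypothetical names rather than FairScheduler 
classes, and ordering by used memory stands in for the real fair-share 
comparison. The point is only that entries without pending demand are dropped 
before the sort runs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration; not FairScheduler code. */
final class PendingOnlySort {
    /** Simplified stand-in for an app or queue known to the scheduler. */
    static final class AppInfo {
        final String id;
        final int pendingContainers;
        final long usedMemoryMb;

        AppInfo(String id, int pendingContainers, long usedMemoryMb) {
            this.id = id;
            this.pendingContainers = pendingContainers;
            this.usedMemoryMb = usedMemoryMb;
        }
    }

    /** Keeps only entries with pending demand, then sorts that smaller list. */
    static List<AppInfo> candidatesForAssignment(List<AppInfo> apps) {
        List<AppInfo> candidates = new ArrayList<>();
        for (AppInfo app : apps) {
            if (app.pendingContainers > 0) {
                candidates.add(app);
            }
        }
        // Assumption: lowest usage first, as a placeholder for the real policy.
        candidates.sort(Comparator.comparingLong((AppInfo a) -> a.usedMemoryMb));
        return candidates;
    }

    public static void main(String[] args) {
        List<AppInfo> apps = new ArrayList<>();
        apps.add(new AppInfo("app1", 0, 4096L));
        apps.add(new AppInfo("app2", 3, 2048L));
        apps.add(new AppInfo("app3", 1, 1024L));
        for (AppInfo app : candidatesForAssignment(apps)) {
            System.out.println(app.id);
        }
    }
}
```

With n entries of which only k have pending containers, the sort cost drops 
from roughly n log n to k log k comparisons, plus one linear pass for the filter.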




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5969) FairShareComparator getResourceUsage pool performance

2016-12-05 Thread zhangshilong (JIRA)
zhangshilong created YARN-5969:
--

 Summary: FairShareComparator getResourceUsage pool performance
 Key: YARN-5969
 URL: https://issues.apache.org/jira/browse/YARN-5969
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.7.1
Reporter: zhangshilong


In FairShareComparator.java, the performance of the getResourceUsage() 
function is very poor. It is executed more than 100,000,000 times per second.
In our scenario, it takes 20 seconds out of every minute.
A simple solution is to reduce the number of calls to the function.
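
One way to reduce the call count, sketched under stated assumptions: read the 
usage once per element before sorting and compare the stored values, instead 
of calling getResourceUsage() inside every comparison. The Schedulable and 
Snapshot classes below are hypothetical stand-ins, usage is modelled as a 
single long, and this is not the actual FairShareComparator logic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration; not FairScheduler code. */
final class CachedUsageSort {
    /** Stand-in for a schedulable whose usage is costly to compute. */
    static final class Schedulable {
        static final AtomicLong CALLS = new AtomicLong(); // counts usage lookups
        final String name;
        final long[] containerSizes;

        Schedulable(String name, long... containerSizes) {
            this.name = name;
            this.containerSizes = containerSizes;
        }

        long getResourceUsage() {
            CALLS.incrementAndGet();
            long sum = 0;
            for (long size : containerSizes) {
                sum += size;
            }
            return sum;
        }
    }

    /** Pairs a schedulable with the usage value read once before sorting. */
    static final class Snapshot {
        final Schedulable schedulable;
        final long usage;

        Snapshot(Schedulable schedulable, long usage) {
            this.schedulable = schedulable;
            this.usage = usage;
        }
    }

    /** Sorts by usage with exactly one getResourceUsage() call per element. */
    static List<Schedulable> sortByUsage(List<Schedulable> input) {
        List<Snapshot> snapshots = new ArrayList<>(input.size());
        for (Schedulable s : input) {
            snapshots.add(new Snapshot(s, s.getResourceUsage()));
        }
        snapshots.sort(Comparator.comparingLong((Snapshot snap) -> snap.usage));
        List<Schedulable> sorted = new ArrayList<>(snapshots.size());
        for (Snapshot snap : snapshots) {
            sorted.add(snap.schedulable);
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<Schedulable> input = new ArrayList<>();
        input.add(new Schedulable("b", 4L, 4L));
        input.add(new Schedulable("a", 1L));
        input.add(new Schedulable("c", 2L, 3L));
        for (Schedulable s : sortByUsage(input)) {
            System.out.println(s.name);
        }
        System.out.println("usage lookups: " + Schedulable.CALLS.get());
    }
}
```

The trade-off is that the sort works on a snapshot: usage that changes while 
the sort is running is not seen until the next pass.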






[jira] [Created] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in securt clusters

2015-11-03 Thread zhangshilong (JIRA)
zhangshilong created YARN-4327:
--

 Summary: RM can not renew  TIMELINE_DELEGATION_TOKEN in securt 
clusters
 Key: YARN-4327
 URL: https://issues.apache.org/jira/browse/YARN-4327
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.1
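 Environment: hadoop 2.7.1; hdfs, yarn, mrhistoryserver, and ATS all use 
kerberos security.
conf like this:
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
  <description>Is service-level authorization enabled?</description>
</property>
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
  <description>Possible values are simple (no authentication), and kerberos
  </description>
</property>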


Reporter: zhangshilong


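In hadoop 2.7.1, the ATS conf looks like this:
<property>
  <name>yarn.timeline-service.http-authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
  <value>HTTP/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.principal</name>
  <value>xxx/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.best-effort</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>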

I would like everyone to be able to access ATS over HTTP, as with the RM and 
HDFS.
A client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the 
AM context, but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which causes 
the application to fail.
RM logs:
2015-11-03 11:58:38,191 WARN org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: Unable to add the application to the delegation token renewer.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, masterKeyId=2)
    at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
    at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
    at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
    at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: HTTP status [500], message [Null user]
    at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
    at org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
    at org.apache.hadoop.security.token.Token.renew(Token.java:377)
    at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:543)
    at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:540)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)