[jira] [Resolved] (YARN-10566) Elapsed time should be measured monotonicNow

2021-06-04 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein resolved YARN-10566.
--
Release Note: see discussions in HADOOP-15901
  Resolution: Won't Fix

> Elapsed time should be measured monotonicNow
> 
>
> Key: YARN-10566
> URL: https://issues.apache.org/jira/browse/YARN-10566
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I noticed that there is widespread incorrect use of 
> {{System.currentTimeMillis()}} to measure elapsed time throughout the YARN 
> code. Wall-clock time can jump when the system clock is adjusted.
> For example:
> {code:java}
> // Some comments here
> long start = System.currentTimeMillis();
> while (System.currentTimeMillis() - start < timeout) {
>   // Do something
> }
> {code}
> Elapsed time should be measured using `monotonicNow()`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10733) TimelineService Hbase tests are failing with timeout error on branch-2.10

2021-04-12 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10733:


 Summary: TimelineService Hbase tests are failing with timeout 
error on branch-2.10
 Key: YARN-10733
 URL: https://issues.apache.org/jira/browse/YARN-10733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test, timelineserver, yarn
Reporter: Ahmed Hussein
 Attachments: 2021-04-12T12-40-21_403-jvmRun1.dump, 
2021-04-12T12-40-58_857.dumpstream, 
org.apache.hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction-output.txt.zip


{code:bash}
03:54:41 [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on 
project hadoop-yarn-server-timelineservice-hbase-tests: There was a timeout or 
other error in the fork -> [Help 1]
03:54:41 [ERROR] 
03:54:41 [ERROR] To see the full stack trace of the errors, re-run Maven with 
the -e switch.
03:54:41 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
03:54:41 [ERROR] 
03:54:41 [ERROR] For more information about the errors and possible solutions, 
please read the following articles:
03:54:41 [ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
03:54:41 [ERROR] 
03:54:41 [ERROR] After correcting the problems, you can resume the build with 
the command
03:54:41 [ERROR]   mvn  -rf 
:hadoop-yarn-server-timelineservice-hbase-tests
{code}

The tests fail because the unit test {{TestHBaseStorageFlowRunCompaction}} 
gets stuck.
Upon checking the surefire reports, I found several {{ClassNotFoundException}} 
and {{NoClassDefFoundError}} entries.

{code:bash}
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/CanUnbuffer
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.hadoop.hbase.regionserver.StoreFileInfo.<clinit>(StoreFileInfo.java:66)
at 
org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
at 
org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
at 
org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
at 
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
... 33 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanUnbuffer
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 51 more
{code}

and 

{code:bash}
Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.hadoop.hbase.regionserver.StoreFileInfo
at 
org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
at 
org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
at 
org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
at 
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
... 10 more
{code}









[jira] [Resolved] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-02-03 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein resolved YARN-10352.
--
Resolution: Fixed

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Fix For: 3.4.0
>
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, 
> YARN-10352-010.patch, YARN-10352.009.patch
>
>
> When Node Recovery is enabled, stopping an NM does not unregister it from the 
> RM. The RM's active nodes therefore still include the stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement keeps assigning containers to those nodes. It 
> needs to exclude the nodes that have not heartbeated within the configured 
> heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery 
> (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.
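
The exclusion described in the quoted issue can be sketched as a simple staleness check. This is only an illustration under stated assumptions: the class {{HeartbeatFilter}}, its method names, and the "two missed intervals" threshold are hypothetical and are not taken from the attached patches or from the real CapacityScheduler code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of YARN.
final class HeartbeatFilter {

  // True when the node has been silent for more than allowedMissed intervals.
  static boolean shouldSkip(long nowMs, long lastHeartbeatMs,
                            long intervalMs, int allowedMissed) {
    return nowMs - lastHeartbeatMs > intervalMs * allowedMissed;
  }

  // Returns the node names that are still considered safe to schedule on.
  static List<String> schedulable(Map<String, Long> lastHeartbeatByNode,
                                  long nowMs, long intervalMs, int allowedMissed) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Long> e : lastHeartbeatByNode.entrySet()) {
      if (!shouldSkip(nowMs, e.getValue(), intervalMs, allowedMissed)) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> lastHeartbeat = new LinkedHashMap<>();
    lastHeartbeat.put("worker0", 0L);      // stopped NM, silent for 60 s
    lastHeartbeat.put("worker1", 59_500L); // heartbeated 500 ms ago
    // 1000 ms interval as in the description; threshold of 2 is an assumption.
    System.out.println(schedulable(lastHeartbeat, 60_000L, 1_000L, 2)); // prints [worker1]
  }
}
```

With the repro above, worker0 would be filtered out long before the 10-minute liveness expiry, so containers would only be placed on worker1.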






[jira] [Resolved] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-02-03 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein resolved YARN-10585.
--
Resolution: Fixed

> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make transition easier we need to create tooling to support the migration 
> effort. The first step is to create a class which can migrate from legacy to 
> the new JSON format.






[jira] [Created] (YARN-10568) TestTimelineClient#testTimelineClientCleanup fails on trunk

2021-01-11 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10568:


 Summary: TestTimelineClient#testTimelineClientCleanup fails on 
trunk
 Key: YARN-10568
 URL: https://issues.apache.org/jira/browse/YARN-10568
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineclient
Reporter: Ahmed Hussein


{{TestTimelineClient.testTimelineClientCleanup}} throws an NPE on trunk:

{code:bash}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.client.api.impl.TestTimelineClient.testTimelineClientCleanup(TestTimelineClient.java:483)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
{code}







[jira] [Created] (YARN-10566) Elapsed time should be measured monotonicNow

2021-01-11 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10566:


 Summary: Elapsed time should be measured monotonicNow
 Key: YARN-10566
 URL: https://issues.apache.org/jira/browse/YARN-10566
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


I noticed that there is widespread incorrect use of 
{{System.currentTimeMillis()}} to measure elapsed time throughout the YARN 
code. Wall-clock time can jump when the system clock is adjusted.

For example:

{code:java}
// Some comments here
long start = System.currentTimeMillis();
while (System.currentTimeMillis() - start < timeout) {
  // Do something
}
{code}

Elapsed time should be measured using `monotonicNow()`.
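
As a rough illustration of the suggestion above (not the patch attached to this issue), the timeout loop can be driven by a monotonic clock. The sketch assumes {{System.nanoTime()}} as the monotonic source, which is what Hadoop's {{Time.monotonicNow()}} is understood to be built on. The class {{MonotonicWait}} and its method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Hadoop.
final class MonotonicWait {

  // Milliseconds from a monotonic source; unaffected by wall-clock adjustments.
  static long monotonicNowMs() {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
  }

  // Polls the condition until it holds or timeoutMs has elapsed.
  // Returns true if the condition was observed to hold.
  static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs) {
    final long start = monotonicNowMs();
    while (monotonicNowMs() - start < timeoutMs) {
      if (condition.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // keep the interrupt visible to callers
        return false;
      }
    }
    return condition.getAsBoolean();
  }

  public static void main(String[] args) {
    System.out.println(waitFor(() -> true, 100, 10)); // prints true
  }
}
```

The value returned by a monotonic clock is only meaningful as a difference between two readings; it should not be logged or compared as a timestamp.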






[jira] [Created] (YARN-10556) Web-app server does not work for V2 timeline

2020-12-30 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10556:


 Summary: Web-app server does not work for V2 timeline
 Key: YARN-10556
 URL: https://issues.apache.org/jira/browse/YARN-10556
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Ahmed Hussein


{{TestDistributedShell}} for timeline version 2.0 shows the exception below in 
the log files.
YARN-3087 previously added a fix for the same issue. We need to investigate 
whether this is a testing issue or whether the error has resurfaced. 


{code:bash}
org.apache.hadoop.yarn.webapp.WebAppException: 
/v2/timeline/clusters/yarn_cluster/apps/application_1609346161655_0001: 
controller for v2 not found
at org.apache.hadoop.yarn.webapp.Router.resolveDefault(Router.java:247)
at org.apache.hadoop.yarn.webapp.Router.resolve(Router.java:155)
at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:152)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at 
com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287)
at 
com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277)
at 
com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182)
at 
com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
at 
com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
at 
org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at 
org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
at 
org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
at 
org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at 
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
at 
org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at 
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1702)
at 
org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at 

[jira] [Created] (YARN-10553) Refactor TestDistributedShell

2020-12-28 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10553:


 Summary: Refactor TestDistributedShell
 Key: YARN-10553
 URL: https://issues.apache.org/jira/browse/YARN-10553
 Project: Hadoop YARN
  Issue Type: Bug
  Components: distributed-shell, test
Reporter: Ahmed Hussein


TestDistributedShell has grown very large over time. It has 29 tests.
This runs the risk of exceeding the 30-minute limit for a single unit test 
class.

* The implementation has lots of code redundancy.
* Setup and teardown are inefficient. A large percentage of the execution time 
is spent starting the cluster and stopping the services.







[jira] [Created] (YARN-10536) Client in distributedShell swallows interrupt exceptions

2020-12-16 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10536:


 Summary: Client in distributedShell swallows interrupt exceptions
 Key: YARN-10536
 URL: https://issues.apache.org/jira/browse/YARN-10536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, distributed-shell
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


In {{applications.distributedshell.Client}}, the method {{monitorApplication}} 
loops waiting for the following conditions:

* Application fails: reaches {{YarnApplicationState.KILLED}}, or 
{{YarnApplicationState.FAILED}}
* Application succeeds: {{FinalApplicationStatus.SUCCEEDED}} or 
{{YarnApplicationState.FINISHED}}
* the time spent waiting is longer than {{clientTimeout}} (if it exists in the 
parameters).

When the Client thread is interrupted, it ignores the exception:

{code:java}
  // Check app status every 1 second.
  try {
    Thread.sleep(1000);
  } catch (InterruptedException e) {
    LOG.debug("Thread sleep in monitoring loop interrupted");
  }
{code}
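
One common way to avoid swallowing the interrupt is to restore the thread's interrupt status and leave the loop. The sketch below illustrates that idea only. {{MonitorLoop}} and its method are hypothetical stand-ins and do not reproduce the real {{monitorApplication}} code or the eventual fix.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for a monitoring loop; not the distributed shell Client.
final class MonitorLoop {

  // Polls until done reports true. Returns false as soon as the thread is
  // interrupted, leaving the interrupt flag set for the caller to observe.
  static boolean monitor(BooleanSupplier done, long pollMs) {
    while (!done.getAsBoolean()) {
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the status instead of dropping it
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Thread.currentThread().interrupt();                // simulate an interrupt
    System.out.println(monitor(() -> false, 1000));    // prints false
    System.out.println(Thread.interrupted());          // prints true
  }
}
```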







[jira] [Created] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-09 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10485:


 Summary: TimelineConnector swallows InterruptedException
 Key: YARN-10485
 URL: https://issues.apache.org/jira/browse/YARN-10485
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


Some tests time out or take excessively long to shut down because the 
{{TimelineConnector}} catches InterruptedException and goes into a retry loop 
instead of aborting.

[~daryn] reported that this makes debugging more difficult and suggests that the 
exception be thrown.
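
A minimal sketch of a retry loop that aborts on interruption, assuming that only {{IOException}} failures are worth retrying. {{AbortableRetry}} is a hypothetical helper and does not mirror the actual {{TimelineConnector}} retry code.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.Callable;

// Hypothetical helper; it does not reproduce the real TimelineConnector code.
final class AbortableRetry {

  // Runs op, retrying up to maxRetries times on IOException.
  // An interrupt ends the loop at once instead of triggering another retry.
  static <T> T call(Callable<T> op, int maxRetries, long retryIntervalMs)
      throws IOException {
    IOException last = new IOException("no attempt was made");
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return op.call();
      } catch (InterruptedException e) {
        throw abort(e);
      } catch (IOException e) {
        last = e; // retriable failure: remember it and try again
      } catch (Exception e) {
        throw new IOException(e); // anything else is not retried
      }
      if (attempt < maxRetries) {
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException e) {
          throw abort(e);
        }
      }
    }
    throw last;
  }

  // Restores the interrupt flag and converts the interrupt into an IOException.
  private static InterruptedIOException abort(InterruptedException cause) {
    Thread.currentThread().interrupt();
    InterruptedIOException ex = new InterruptedIOException("interrupted while retrying");
    ex.initCause(cause);
    return ex;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(call(() -> "ok", 3, 10)); // prints ok
  }
}
```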






[jira] [Resolved] (YARN-10483) YARN hangs and freezes, jobs cannot be submitted, and only switching the active RM or restarting recovers it

2020-11-05 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein resolved YARN-10483.
--
Release Note: Please create Jiras that make it easy for other developers 
to search and understand. 
  Resolution: Information Provided

> YARN hangs and freezes, jobs cannot be submitted, and only switching the active RM or restarting recovers it
> --
>
> Key: YARN-10483
> URL: https://issues.apache.org/jira/browse/YARN-10483
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager, 
> RM
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
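> YARN freezes intermittently and new jobs cannot be submitted. After examining 
> the jstack logs, we found threads in the capacity scheduler waiting 
> indefinitely for a lock, while the RM's CPU, memory, network, and disk are all 
> normal. The problem can basically be attributed to the locking inside the 
> capacity scheduler. The RM jstack logs for both the normal state and the hung 
> state have been uploaded. I hope someone can resolve this. The bug is quite 
> serious and directly makes production unusable. If nobody answers, I will come 
> back and ask again later.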






[jira] [Created] (YARN-10468) TestNodeStatusUpdater does not handle early failure in threads

2020-10-20 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10468:


 Summary: TestNodeStatusUpdater does not handle early failure in 
threads
 Key: YARN-10468
 URL: https://issues.apache.org/jira/browse/YARN-10468
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Ahmed Hussein


While investigating HADOOP-17314, I found the following problems:

* TestNodeStatusUpdater#testNMRegistration() will continue running {{while 
(heartBeatID <= 3 && waitCount++ != 200) {}} even though the NM thread could 
already be dead. The test should detect that the NM has died and terminate 
sooner to release resources for other tests.
* TestNodeStatusUpdater#testNMRMConnectionConf() has the same problem as 
described above. 
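
The early-exit behavior suggested above could look roughly like the following polling loop, which checks whether the worker thread is still alive on every iteration. {{WorkerWait}}, its {{Outcome}} enum, and the parameters are hypothetical and are not taken from {{TestNodeStatusUpdater}}.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not taken from TestNodeStatusUpdater.
final class WorkerWait {

  enum Outcome { CONDITION_MET, WORKER_DIED, TIMED_OUT, INTERRUPTED }

  // Polls the condition at most maxPolls times. The worker must already have
  // been started, because isAlive() is also false for a thread never started.
  static Outcome awaitCondition(Thread worker, BooleanSupplier condition,
                                int maxPolls, long pollMs) {
    for (int i = 0; i < maxPolls; i++) {
      if (condition.getAsBoolean()) {
        return Outcome.CONDITION_MET;
      }
      if (!worker.isAlive()) {
        // Re-check once: the worker may have met the condition just before exiting.
        return condition.getAsBoolean() ? Outcome.CONDITION_MET : Outcome.WORKER_DIED;
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return Outcome.INTERRUPTED;
      }
    }
    return condition.getAsBoolean() ? Outcome.CONDITION_MET : Outcome.TIMED_OUT;
  }

  public static void main(String[] args) {
    Thread worker = new Thread(() -> { /* exits immediately, like a failed NM */ });
    worker.start();
    // Returns well before the 200 x 100 ms budget is used up.
    System.out.println(awaitCondition(worker, () -> false, 200, 100)); // prints WORKER_DIED
  }
}
```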






[jira] [Created] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent

2020-10-07 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10455:


 Summary: TestNMProxy.testNMProxyRPCRetry is not consistent
 Key: YARN-10455
 URL: https://issues.apache.org/jira/browse/YARN-10455
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


The fix in YARN-8844 may fail depending on the configuration of the machine 
running the test.
In some cases the address gets resolved and the unit test throws a connection 
timeout exception instead. In that scenario JUnit times out, and the main reason 
behind the failure is swallowed by the shutdown of the clients.
To make sure that the JUnit behavior is consistent, a suggested fix is to set 
the host address to {{127.0.0.1:1}}. This eliminates the possibility of 
collisions on non-privileged ports.






[jira] [Created] (YARN-10337) TestRMHATimelineCollectors fails on hadoop trunk

2020-07-02 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10337:


 Summary: TestRMHATimelineCollectors fails on hadoop trunk
 Key: YARN-10337
 URL: https://issues.apache.org/jira/browse/YARN-10337
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test, yarn
Reporter: Ahmed Hussein


{{TestRMHATimelineCollectors}} has been failing on trunk. I see it frequently 
in the qbt reports and the yetus reports.


{code:bash}
[INFO] Running 
org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 5.95 s 
<<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors
[ERROR] 
testRebuildCollectorDataOnFailover(org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors)
  Time elapsed: 5.615 s  <<< ERROR!
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover(TestRMHATimelineCollectors.java:105)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:80)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR]   TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover:105 
NullPointer
[INFO]
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
[INFO]
[ERROR] There are test failures.

{code}







[jira] [Created] (YARN-10334) TestDistributedShell leaks resources on timeout/failure

2020-06-30 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10334:


 Summary: TestDistributedShell leaks resources on timeout/failure
 Key: YARN-10334
 URL: https://issues.apache.org/jira/browse/YARN-10334
 Project: Hadoop YARN
  Issue Type: Bug
  Components: distributed-shell, test, yarn
Reporter: Ahmed Hussein


{{TestDistributedShell}} times out on trunk. I found that the application and 
its containers stay running in the background long after the unit test has 
failed.
This causes other test cases to fail and produces several false-positive 
failures because:
* Ports stay busy, so other test cases fail to launch.
* Unit tests fail because of memory restrictions.

Although the unit test is already broken on trunk, we do not want its failures 
to spread to other unit tests.
{{TestDistributedShell}} needs to be revisited to make sure that all 
{{YarnClients}} and {{YarnApplications}} are closed properly at the end of 
each unit test (including on exceptions and timeouts).
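
One possible shape for such cleanup is a small registry that closes everything it was given, in reverse order, even when the test body throws. This is a sketch under assumptions: {{TestResources}} is a hypothetical class, and the registered resources stand in for whatever clients and applications a test starts.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical cleanup registry; not part of TestDistributedShell.
final class TestResources implements AutoCloseable {

  private final Deque<AutoCloseable> resources = new ArrayDeque<>();

  // Remembers a resource so that close() releases it later.
  <T extends AutoCloseable> T register(T resource) {
    resources.push(resource);
    return resource;
  }

  // Closes all resources in reverse registration order. A failure while
  // closing one resource does not prevent the remaining ones from closing.
  @Override
  public void close() {
    while (!resources.isEmpty()) {
      try {
        resources.pop().close();
      } catch (Exception e) {
        System.err.println("cleanup failed: " + e);
      }
    }
  }

  public static void main(String[] args) {
    List<String> closed = new ArrayList<>();
    AutoCloseable client = () -> closed.add("client");
    AutoCloseable app = () -> closed.add("app");
    try (TestResources tr = new TestResources()) {
      tr.register(client);
      tr.register(app);
      throw new IllegalStateException("simulated test failure");
    } catch (IllegalStateException expected) {
      System.out.println(closed); // prints [app, client]
    }
  }
}
```

In a JUnit test the same registry could be closed from the teardown method. A test that is killed by a timeout may still need a separate step to stop already launched containers.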

Steps to reproduce:



{code:bash}
mvn test -Dtest=TestDistributedShell#testDSShellWithOpportunisticContainers

## this will timeout as
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 90.234 
s <<< FAILURE! - in 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
[ERROR] 
testDSShellWithOpportunisticContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
  Time elapsed: 90.018 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 9 
milliseconds
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:1117)
at 
org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:1089)
at 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers(TestDistributedShell.java:1438)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)

[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Errors: 
[ERROR]   TestDistributedShell.testDSShellWithOpportunisticContainers:1438 » 
TestTimedOut
[INFO] 
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
{code}


Using the {{ps}} command, you can see that the YARN processes are still running 
in the background:

{code:bash}
/bin/bash -c $JRE_HOME/bin/java -Xmx512m 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster 
--container_type OPPORTUNISTIC --container_memory 128 --container_vcores 1 
--num_containers 2 --priority 0 --appname DistributedShell --homedir 
file:/Users/ahussein 
1>$WORK_DIR8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/TestDistributedShell/TestDistributedShell-logDir-nm-0_0/application_1593554710896_0001/container_1593554710896_0001_01_01/AppMaster.stdout
 
2>$WORK_DIR8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/TestDistributedShell/TestDistributedShell-logDir-nm-0_0/application_1593554710896_0001/container_1593554710896_0001_01_01/AppMaster.stderr


$JRE_HOME/bin/java -Xmx512m 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster 
--container_type OPPORTUNISTIC --container_memory 128 --container_vcores 1 
--num_containers 2 --priority 0 --appname DistributedShell --homedir 
file:/Users/ahussein
{code}








[jira] [Resolved] (YARN-10220) RM HA times out intermittently

2020-05-12 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein resolved YARN-10220.
--
Resolution: Cannot Reproduce

I am closing this for now because I cannot reproduce the failures reported in 
YARN-2710.

> RM HA times out intermittently
> --
>
> Key: YARN-10220
> URL: https://issues.apache.org/jira/browse/YARN-10220
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 3.3.0, 3.2.1, 3.1.3
>Reporter: Ahmed Hussein
>Assignee: Bilwa S T
>Priority: Major
>
> TestResourceTrackerOnHA, among other tests, times out intermittently:
> * TestApplicationClientProtocolOnHA
> * TestApplicationMasterServiceProtocolForTimelineV2
> * TestApplicationMasterServiceProtocolOnHA
> {code:bash}
> [INFO] --- maven-surefire-plugin:3.0.0-M1:test (default-test) @ 
> hadoop-yarn-client ---
> [INFO]
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] Running org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 19.612 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
> [ERROR] 
> testResourceTrackerOnHA(org.apache.hadoop.yarn.client.TestResourceTrackerOnHA)
>   Time elapsed: 19.473 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 15000 
> milliseconds
>   at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
>   at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
>   at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:699)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:812)
>   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1452)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy93.registerNodeManager(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:73)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy94.registerNodeManager(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.TestResourceTrackerOnHA.testResourceTrackerOnHA(TestResourceTrackerOnHA.java:64)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> 

[jira] [Created] (YARN-10256) Refactor TestContainerSchedulerQueuing.testContainerUpdateExecTypeGuaranteedToOpportunistic

2020-04-30 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10256:


 Summary: Refactor 
TestContainerSchedulerQueuing.testContainerUpdateExecTypeGuaranteedToOpportunistic
 Key: YARN-10256
 URL: https://issues.apache.org/jira/browse/YARN-10256
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


In 3.x, 
{{TestContainerSchedulerQueuing.testContainerUpdateExecTypeGuaranteedToOpportunistic}}
 has redundant assertions. {{GenericTestUtils.waitFor()}} returns only when the 
predicate is met; otherwise, it throws a timeout exception and the UT fails, so 
the assertions that follow it add nothing.
The redundant loop makes the unit test harder to understand and may increase 
the possibility of failure in case the container terminates.
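
To illustrate why a check placed after the wait is redundant, here is a minimal, self-contained sketch. The {{waitFor}} helper below is a hypothetical stand-in written for this example: it only mimics the behaviour described above (return once the predicate holds, throw {{TimeoutException}} otherwise) and is not the Hadoop implementation. The names {{WaitForSketch}}, {{waitSucceeded}} and {{containerUpdated}} are likewise assumptions.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class WaitForSketch {
  // Hypothetical helper (not the Hadoop one): polls until the predicate is
  // true and throws TimeoutException if it never becomes true in time.
  static void waitFor(BooleanSupplier check, long checkEveryMillis,
      long waitForMillis) throws TimeoutException, InterruptedException {
    long start = System.nanoTime();
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - start >= waitForMillis * 1_000_000L) {
        throw new TimeoutException(
            "condition not met in " + waitForMillis + " ms");
      }
      Thread.sleep(checkEveryMillis);
    }
  }

  // Returns true if waitFor returned normally, false if it timed out.
  static boolean waitSucceeded(BooleanSupplier check, long waitForMillis) {
    try {
      waitFor(check, 5, waitForMillis);
      return true;
    } catch (TimeoutException e) {
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    AtomicBoolean containerUpdated = new AtomicBoolean(false);
    new Thread(() -> containerUpdated.set(true)).start();
    boolean ok = waitSucceeded(containerUpdated::get, 2000);
    // If ok is true, the predicate was already observed to hold inside
    // waitFor; asserting the same condition again here would add nothing.
    System.out.println("waitFor returned normally: " + ok);
    System.out.println("never-true predicate: "
        + waitSucceeded(() -> false, 50));
  }
}
```

In this sketch, a later re-check can only differ from the result of the wait if the state has moved on in the meantime, which is the extra failure risk mentioned above for a container that terminates.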






[jira] [Created] (YARN-10255) revisit fix to intermittent TestContainerSchedulerQueuing.testContainerUpdateExecTypeGuaranteedToOpportunistic

2020-04-30 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10255:


 Summary: revisit fix to intermittent 
TestContainerSchedulerQueuing.testContainerUpdateExecTypeGuaranteedToOpportunistic
 Key: YARN-10255
 URL: https://issues.apache.org/jira/browse/YARN-10255
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


This Jira is to fix the intermittent failure in branch-2.10. Also, the fix in 
YARN-7372 contains some redundant assertions that could be removed.

UT failure in branch-2.10:

 {noformat}
testContainerUpdateExecTypeGuaranteedToOpportunistic:
 

[jira] [Created] (YARN-10220) RM HA times out intermittently

2020-04-01 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10220:


 Summary: RM HA times out intermittently
 Key: YARN-10220
 URL: https://issues.apache.org/jira/browse/YARN-10220
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ahmed Hussein


TestResourceTrackerOnHA, among other tests, times out intermittently:
* TestApplicationClientProtocolOnHA
* TestApplicationMasterServiceProtocolForTimelineV2
* TestApplicationMasterServiceProtocolOnHA
{code:bash}
[INFO] --- maven-surefire-plugin:3.0.0-M1:test (default-test) @ 
hadoop-yarn-client ---
[INFO]
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 19.612 
s <<< FAILURE! - in org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
[ERROR] 
testResourceTrackerOnHA(org.apache.hadoop.yarn.client.TestResourceTrackerOnHA)  
Time elapsed: 19.473 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 15000 
milliseconds
at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:699)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:812)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
at org.apache.hadoop.ipc.Client.call(Client.java:1452)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy93.registerNodeManager(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:73)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy94.registerNodeManager(Unknown Source)
at 
org.apache.hadoop.yarn.client.TestResourceTrackerOnHA.testResourceTrackerOnHA(TestResourceTrackerOnHA.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:80)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)

[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR]   TestResourceTrackerOnHA.testResourceTrackerOnHA:64 » 

[jira] [Resolved] (YARN-9452) Fix TestDistributedShell and TestTimelineAuthFilterForV2 failures

2020-02-28 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein resolved YARN-9452.
-
Resolution: Fixed

> Fix TestDistributedShell and TestTimelineAuthFilterForV2 failures
> -
>
> Key: YARN-9452
> URL: https://issues.apache.org/jira/browse/YARN-9452
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2, distributed-shell, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9452-001.patch, YARN-9452-002.patch, 
> YARN-9452-003.patch, YARN-9452-004.patch
>
>
> *TestDistributedShell#testDSShellWithoutDomainV2CustomizedFlow*
> {code}
> [ERROR] 
> testDSShellWithoutDomainV2CustomizedFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
>   Time elapsed: 72.14 s  <<< FAILURE!
> java.lang.AssertionError: Entity ID prefix should be same across each publish 
> of same entity expected:<9223372036854775806> but was:<9223370482298585580>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.verifyEntityForTimelineV2(TestDistributedShell.java:695)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:588)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:459)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow(TestDistributedShell.java:330)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> *TestTimelineAuthFilterForV2#testPutTimelineEntities*
> {code}
> [ERROR] 
> testPutTimelineEntities[3](org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2)
>   Time elapsed: 1.047 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertNotNull(Assert.java:712)
>   at org.junit.Assert.assertNotNull(Assert.java:722)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.verifyEntity(TestTimelineAuthFilterForV2.java:282)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.testPutTimelineEntities(TestTimelineAuthFilterForV2.java:421)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 

[jira] [Created] (YARN-10176) TestTimelineAuthFilterForV2 fails intermittently

2020-02-28 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10176:


 Summary: TestTimelineAuthFilterForV2 fails intermittently
 Key: YARN-10176
 URL: https://issues.apache.org/jira/browse/YARN-10176
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineservice
Reporter: Ahmed Hussein
Assignee: Prabhu Joseph


TestTimelineAuthFilterForV2 fails intermittently on trunk and branch-2.10.
To reproduce the failure, execute TestTimelineAuthFilterForV2 inside a loop.

{code:bash}
[INFO] Running 
org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2
[ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 18.148 
s <<< FAILURE! - in 
org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2
[ERROR] 
testPutTimelineEntities[1](org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2)
  Time elapsed: 6.852 s  <<< FAILURE!
java.lang.AssertionError: Entities should have been published successfully.
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.testPutTimelineEntities(TestTimelineAuthFilterForV2.java:416)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   TestTimelineAuthFilterForV2.testPutTimelineEntities:416 Entities 
should have been published successfully.
[INFO]
[ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0
{code}






[jira] [Created] (YARN-10140) TestTimelineAuthFilterForV2 fails due to login failures in branch-2.10

2020-02-16 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-10140:


 Summary: TestTimelineAuthFilterForV2 fails due to login failures 
in branch-2.10
 Key: YARN-10140
 URL: https://issues.apache.org/jira/browse/YARN-10140
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineservice
Affects Versions: 2.10.0
Reporter: Ahmed Hussein


For branch-2.10, it seems that {{TestTimelineAuthFilterForV2}} has been broken 
for some time.
Using {{git bisect}}, I found that the first bad commit is 
"{{a3470c65d8b4e205c8a16d0c0b8dad10d0134bb8}}" in HADOOP-15959.
On trunk, the test passes without failures.

The stack trace is:

{code:bash}
[ERROR] Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 20.231 
s <<< FAILURE! - in 
org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2
[ERROR] 
testPutTimelineEntities[0](org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2)
  Time elapsed: 0.877 s  <<< ERROR!
org.apache.hadoop.security.KerberosAuthException: Login failure for user: 
HTTP/localh...@example.com from keytab 
/home/ahussein/workspace/repos/amahadoop-yhadoop-3090/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/target/bf6554f1-1df4-4ce0-8516-e6c337067870
 javax.security.auth.login.LoginException: java.lang.IllegalArgumentException: 
Illegal principal name HTTP/localh...@example.com: 
org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No 
rules applied to HTTP/localh...@example.com
at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1104)
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:312)
at 
org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.initialize(TestTimelineAuthFilterForV2.java:209)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runners.Suite.runChild(Suite.java:127)
at org.junit.runners.Suite.runChild(Suite.java:26)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:379)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:340)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:413)
Caused by: 

[jira] [Created] (YARN-9857) TestDelegationTokenRenewer throws NPE but tests pass

2019-09-25 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-9857:
---

 Summary: TestDelegationTokenRenewer throws NPE but tests pass
 Key: YARN-9857
 URL: https://issues.apache.org/jira/browse/YARN-9857
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


{{TestDelegationTokenRenewer}} throws some NPEs:


{code:bash}
2019-09-25 12:51:23,446 WARN  [pool-19-thread-2] 
security.DelegationTokenRenewer 
(DelegationTokenRenewer.java:handleDTRenewerAppSubmitEvent(945)) - Unable to 
add the application to the delegation token renewer.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:942)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:918)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-09-25 12:51:23,446 DEBUG [main] util.MBeans (MBeans.java:unregister(138)) 
- Unregistering Hadoop:service=ResourceManager,name=CapacitySchedulerMetrics
Exception in thread "pool-19-thread-2" java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:951)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:918)
2019-09-25 12:51:23,447 DEBUG [main] util.MBeans (MBeans.java:unregister(138)) 
- Unregistering Hadoop:service=ResourceManager,name=MetricsSystem,sub=Stats
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-09-25 12:51:23,447 INFO  [main] impl.MetricsSystemImpl 
(MetricsSystemImpl.java:stop(216)) - ResourceManager metrics system stopped.
{code}

The RMContext dispatcher is not set for the RMMock, which results in an NPE 
when the event handler of the dispatcher is accessed inside 
{{DelegationTokenRenewer}}.
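
The log above suggests why the tests still pass: the NPE is raised on a pool thread, not on the thread that runs the test. Below is a minimal sketch of that effect, assuming a hypothetical {{Context}} class whose {{dispatcher}} field was never set. All class, field and method names here are invented for illustration and are not the actual ResourceManager classes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class PoolThreadNpeSketch {
  // Hypothetical stand-ins; not the real RM classes.
  static class Dispatcher {
    Runnable getEventHandler() {
      return () -> { };
    }
  }

  static class Context {
    Dispatcher dispatcher; // never set, as with an incompletely mocked context
  }

  // Runs a task that dereferences the unset dispatcher on a pool thread and
  // returns whatever that thread threw. The calling thread never sees the
  // exception unless it explicitly collects it, as done here.
  static Throwable thrownOnPoolThread(Context ctx) {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    CountDownLatch done = new CountDownLatch(1);
    ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r);
      t.setUncaughtExceptionHandler((thread, e) -> {
        seen.set(e);
        done.countDown();
      });
      return t;
    });
    pool.execute(() -> ctx.dispatcher.getEventHandler().run()); // NPE here
    pool.shutdown();
    try {
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return seen.get();
  }

  public static void main(String[] args) {
    Throwable t = thrownOnPoolThread(new Context());
    System.out.println("caller finished normally; pool thread threw: " + t);
  }
}
```

A test that wants such failures to surface would have to collect them from the worker thread in some way; the handler used here is only one option.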






[jira] [Created] (YARN-9815) ReservationACLsTestBase fails with NPE

2019-09-06 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-9815:
---

 Summary: ReservationACLsTestBase fails with NPE
 Key: YARN-9815
 URL: https://issues.apache.org/jira/browse/YARN-9815
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Ahmed Hussein


ReservationACLsTestBase throws an NPE when run with the FairScheduler. Old 
revisions dating back to 2016 also throw the NPE.

In the test case, QueueC does not have reserveACLs, so ReservationsACLsManager 
throws an NPE when it tries to access the ACL on line 82.
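
A minimal sketch of the failure mode described above, assuming the per-queue reservation ACLs are kept in a map keyed by queue name and that a queue without configured ACLs has no entry. The class, the methods and the map layout are hypothetical and are not the actual {{ReservationsACLsManager}} code. The null check is one possible guard, and treating a missing entry as "no access" is an assumption made only for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ReservationAclSketch {
  // Hypothetical layout: queue name -> users allowed to submit reservations.
  static final Map<String, Set<String>> RESERVE_ACLS = new HashMap<>();

  static {
    RESERVE_ACLS.put("queueA", Collections.singleton("alice"));
    // "queueC" intentionally has no entry, like a queue without reserveACLs.
  }

  // Unguarded lookup: get() returns null for queueC, so contains() throws
  // a NullPointerException.
  static boolean checkAccessUnguarded(String user, String queue) {
    return RESERVE_ACLS.get(queue).contains(user);
  }

  // Guarded lookup: a missing entry is treated as "no access" (assumption).
  static boolean checkAccessGuarded(String user, String queue) {
    Set<String> acl = RESERVE_ACLS.get(queue);
    return acl != null && acl.contains(user);
  }

  public static void main(String[] args) {
    System.out.println("guarded queueC: "
        + checkAccessGuarded("alice", "queueC"));
    try {
      checkAccessUnguarded("alice", "queueC");
    } catch (NullPointerException e) {
      System.out.println("unguarded queueC threw NPE");
    }
  }
}
```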

I have not yet found the first revision that caused this test case to fail. I 
stopped at bbfaf3c2712c9ba82b0f8423bdeb314bf505a692, which was working fine.

I am running OS X with Java 1.8.0_201.

 
{code:java}
[ERROR] 
testApplicationACLs[1](org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase)
  Time elapsed: 1.897 s  <<< ERROR!
java.lang.NullPointerException: java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.security.ReservationsACLsManager.checkAccess(ReservationsACLsManager.java:83)
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.checkReservationACLs(ClientRMService.java:1527)
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1290)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitReservation(ApplicationClientProtocolPBServiceImpl.java:511)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:645)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitReservation(ApplicationClientProtocolPBClientImpl.java:511)
at org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.submitReservation(ReservationACLsTestBase.java:447)
at org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.verifySubmitReservationSuccess(ReservationACLsTestBase.java:247)
at org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.testApplicationACLs(ReservationACLsTestBase.java:125)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 

[jira] [Created] (YARN-9805) Fine-grained SchedulerNode synchronization

2019-08-30 Thread Ahmed Hussein (Jira)
Ahmed Hussein created YARN-9805:
---

 Summary: Fine-grained SchedulerNode synchronization
 Key: YARN-9805
 URL: https://issues.apache.org/jira/browse/YARN-9805
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


YARN's SchedulerNode and RMNode use synchronized methods for both reading and 
updating resources, so concurrent readers block one another.

Instead, use reentrant read-write locks to provide fine-grained locking and to 
avoid blocking concurrent reads.
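
The locking pattern proposed above can be sketched roughly as follows. This is 
only an illustration, not the actual SchedulerNode code: the class 
NodeResourceTracker and its fields are hypothetical names, and only 
java.util.concurrent.locks.ReentrantReadWriteLock is a real JDK class.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a node's resource bookkeeping (not YARN code).
class NodeResourceTracker {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final long totalMb;
  private long allocatedMb;

  NodeResourceTracker(long totalMb) {
    this.totalMb = totalMb;
  }

  // Readers share the read lock, so concurrent reads do not block each other.
  long getUnallocatedMb() {
    lock.readLock().lock();
    try {
      return totalMb - allocatedMb;
    } finally {
      lock.readLock().unlock();
    }
  }

  // Writers take the exclusive write lock while mutating state.
  boolean allocate(long mb) {
    lock.writeLock().lock();
    try {
      if (mb < 0 || allocatedMb + mb > totalMb) {
        return false;
      }
      allocatedMb += mb;
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

Whether this pays off depends on the read/write ratio: a read-write lock only 
helps when reads clearly outnumber updates.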






[jira] [Created] (YARN-9597) Memory efficiency in speculator

2019-06-03 Thread Ahmed Hussein (JIRA)
Ahmed Hussein created YARN-9597:
---

 Summary: Memory efficiency in speculator 
 Key: YARN-9597
 URL: https://issues.apache.org/jira/browse/YARN-9597
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ahmed Hussein


The data structures in the speculator and runtime-estimator are bloating. Data 
elements such as taskID, TA-ID, task stats, tasks speculated, and tasks 
finished are added to the concurrent maps but never removed.

For long-running jobs, this causes a couple of issues:
 # memory leak: the speculator's memory usage increases over time. 
 # performance: keeping large structures in the heap hurts performance 
because of poor locality and cache misses.

*Suggested Fixes:*

- When a TA transitions to {{MoveContainerToSucceededFinishingTransition}}, the 
TA notifies the speculator. The latter handles the event by cleaning up its 
internal structures accordingly.
- When a task transitions to failed/killed, the speculator is notified so that 
it can clean up its internal data structures.
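
A minimal sketch of the cleanup idea in the suggested fixes, assuming the 
speculator keeps per-attempt state in concurrent maps. The class 
SpeculatorStateSketch, its maps, and its method names are hypothetical and do 
not mirror the real speculator or its event types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-attempt bookkeeping (not the actual speculator code).
class SpeculatorStateSketch {
  private final Map<String, Long> attemptStartTimes = new ConcurrentHashMap<>();
  private final Map<String, Double> attemptProgress = new ConcurrentHashMap<>();

  // Called on status updates: records state for a running attempt.
  void onAttemptStatus(String attemptId, long startTime, double progress) {
    attemptStartTimes.putIfAbsent(attemptId, startTime);
    attemptProgress.put(attemptId, progress);
  }

  // Called when an attempt succeeds, fails, or is killed: drops every entry
  // for it so the maps do not grow for the lifetime of the job.
  void onAttemptTerminated(String attemptId) {
    attemptStartTimes.remove(attemptId);
    attemptProgress.remove(attemptId);
  }

  int trackedAttempts() {
    return attemptStartTimes.size();
  }
}
```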

 






[jira] [Created] (YARN-9563) Resource report REST API could return NaN or Inf

2019-05-17 Thread Ahmed Hussein (JIRA)
Ahmed Hussein created YARN-9563:
---

 Summary: Resource report REST API could return NaN or Inf
 Key: YARN-9563
 URL: https://issues.apache.org/jira/browse/YARN-9563
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ahmed Hussein


The Resource Manager's Cluster Applications and Cluster Application REST APIs 
sometimes return invalid JSON, because NaN and Inf are not valid JSON numbers. 
This was addressed in YARN-6082.

However, that fix corrects the calculation at only one site and does not 
guarantee that the problem is avoided elsewhere. Instead, the protobuf layer 
can safely check for NaN/Inf and replace them with 0.0f.
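
As a rough illustration of the check described above, a guard along these 
lines could be applied wherever a float is set on the report. The class and 
method names (FloatSanitizer, finiteOrZero) are hypothetical; only 
Float.isNaN and Float.isInfinite are real JDK methods.

```java
// Hypothetical helper (not existing YARN code).
final class FloatSanitizer {
  private FloatSanitizer() {
  }

  // Returns 0.0f for NaN or +/-Infinity, otherwise the value unchanged.
  static float finiteOrZero(float value) {
    return (Float.isNaN(value) || Float.isInfinite(value)) ? 0.0f : value;
  }
}
```

For example, a ratio computed as 0.0f / 0.0f is NaN in Java and would be 
mapped to 0.0f before serialization.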






[jira] [Created] (YARN-9285) RM UI progress column is of wrong type

2019-02-06 Thread Ahmed Hussein (JIRA)
Ahmed Hussein created YARN-9285:
---

 Summary: RM UI progress column is of wrong type
 Key: YARN-9285
 URL: https://issues.apache.org/jira/browse/YARN-9285
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein


The column type assigned to the progress column in the application report is 
incorrect.

The rank of the progress column should be 16 and 18. In WebPageUtils.java the 
"atargets" needs to be incremented by 1. 


