[jira] [Created] (YARN-11056) Incorrect capitalization of NVIDIA in the docs
Gera Shegalov created YARN-11056:
------------------------------------

Summary: Incorrect capitalization of NVIDIA in the docs
Key: YARN-11056
URL: https://issues.apache.org/jira/browse/YARN-11056
Project: Hadoop YARN
Issue Type: Bug
Reporter: Gera Shegalov

According to [https://www.nvidia.com/en-us/about-nvidia/legal-info/], the spelling should be the all-caps NVIDIA.

Examples of differing capitalization:
https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/UsingGpus.md
[jira] [Created] (YARN-11055) cgroups-operations.c some fprintf format strings lack "\n"
Gera Shegalov created YARN-11055:
------------------------------------

Summary: cgroups-operations.c some fprintf format strings lack "\n"
Key: YARN-11055
URL: https://issues.apache.org/jira/browse/YARN-11055
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 3.3.1, 3.3.0, 3.2.0, 3.1.0, 3.0.0
Reporter: Gera Shegalov

In cgroups-operations.c some {{fprintf}} format strings are missing a newline character at the end, leading to hard-to-parse error message output. Example:
https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
[jira] [Created] (YARN-7847) Provide permalinks for container logs
Gera Shegalov created YARN-7847:
-------------------------------

Summary: Provide permalinks for container logs
Key: YARN-7847
URL: https://issues.apache.org/jira/browse/YARN-7847
Project: Hadoop YARN
Issue Type: New Feature
Components: amrmproxy
Reporter: Gera Shegalov

YARN doesn't offer a service similar to the AM proxy URL for container logs, even if log aggregation is enabled. The current mechanism of having the NM redirect to yarn.log.server.url fails once the node is down. Workarounds like rewriting URIs on the fly, as in the MR JobHistory server, are possible, but they are not a good long-term solution for onboarding new apps.
[jira] [Created] (YARN-7747) YARN UI is broken in the minicluster
Gera Shegalov created YARN-7747:
-------------------------------

Summary: YARN UI is broken in the minicluster
Key: YARN-7747
URL: https://issues.apache.org/jira/browse/YARN-7747
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov

YARN web apps use non-injected instances of GuiceFilter, i.e. instances created by Jetty rather than by Guice itself. This triggers the [call path|https://github.com/google/guice/blob/master/extensions/servlet/src/com/google/inject/servlet/GuiceFilter.java#L251] where the static field {{pipeline}} is used instead of the instance field {{injectedPipeline}}. However, besides the GuiceFilter instances created by Jetty, each Guice module creates one as well, and on the injection call path the static field is updated by each such instance. Thus, if there are multiple modules, as happens to be the case in the minicluster, the one loaded last ends up defining the filter pipeline for all Jetty instances. In the minicluster case this is the nodemanager UI.
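A minimal, self-contained sketch of the mechanism described above, assuming hypothetical RmServlet/NmServlet stand-ins for the RM and NM web apps; this is not the actual YARN webapp wiring, only an illustration of how the last-created injector wins when the filter is constructed by the servlet container:
{code:java}
import com.google.inject.Guice;
import com.google.inject.Singleton;
import com.google.inject.servlet.GuiceFilter;
import com.google.inject.servlet.ServletModule;
import javax.servlet.http.HttpServlet;

public class StaticPipelineDemo {
  // Hypothetical stand-ins for the RM and NM web apps.
  @Singleton static class RmServlet extends HttpServlet {}
  @Singleton static class NmServlet extends HttpServlet {}

  public static void main(String[] args) {
    // Each injector installs its own ServletModule; each installation also
    // rewrites GuiceFilter's static pipeline.
    Guice.createInjector(new ServletModule() {
      @Override protected void configureServlets() { serve("/cluster/*").with(RmServlet.class); }
    });
    Guice.createInjector(new ServletModule() {
      @Override protected void configureServlets() { serve("/node/*").with(NmServlet.class); }
    });

    // A servlet container like Jetty constructs the filter itself instead of
    // asking an injector for it, so this instance consults only the static
    // pipeline -- which now belongs to the injector created last (the NM one).
    GuiceFilter createdByJetty = new GuiceFilter();
  }
}
{code}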
[jira] [Created] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
Gera Shegalov created YARN-7592:
-------------------------------

Summary: yarn.federation.failover.enabled missing in yarn-default.xml
Key: YARN-7592
URL: https://issues.apache.org/jira/browse/YARN-7592
Project: Hadoop YARN
Issue Type: Bug
Components: federation
Affects Versions: 3.0.0-beta1
Reporter: Gera Shegalov

yarn.federation.failover.enabled should be documented in yarn-default.xml. I am also not sure why it should be true by default and force the HA retry policy in {{RMProxy#createRMProxy}}.
[jira] [Created] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env
Gera Shegalov created YARN-4789:
-------------------------------

Summary: Provide helpful exception for non-PATH-like conflict with admin.user.env
Key: YARN-4789
URL: https://issues.apache.org/jira/browse/YARN-4789
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.2
Reporter: Gera Shegalov
Assignee: Gera Shegalov

Environment variables specified in mapreduce.admin.user.env are supposed to be paths (class, shell, library), and they can be merged with the user-provided values. However, it's also possible that cluster admins specify some non-PATH-like variable such as JAVA_HOME. In this case, if the user provides the same variable, we'll get a concatenation that does not make sense and is difficult to debug.
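A minimal sketch of the confusing result, assuming the merge simply joins the admin and user values with the path separator; the variable values are illustrative only:
{code:java}
import java.io.File;

public class AdminUserEnvMergeDemo {
  public static void main(String[] args) {
    // PATH-like variables merge sensibly:
    String adminPath = "/opt/hadoop/lib/native";
    String userPath = "/home/alice/lib";
    System.out.println("LD_LIBRARY_PATH=" + adminPath + File.pathSeparator + userPath);
    // -> LD_LIBRARY_PATH=/opt/hadoop/lib/native:/home/alice/lib

    // A non-PATH-like variable merged the same way is nonsense:
    String adminJavaHome = "/usr/lib/jvm/java-8";
    String userJavaHome = "/usr/lib/jvm/java-11";
    System.out.println("JAVA_HOME=" + adminJavaHome + File.pathSeparator + userJavaHome);
    // -> JAVA_HOME=/usr/lib/jvm/java-8:/usr/lib/jvm/java-11 -- not a valid JAVA_HOME
  }
}
{code}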
[jira] [Resolved] (YARN-683) Class MiniYARNCluster not found when starting the minicluster
[ https://issues.apache.org/jira/browse/YARN-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov resolved YARN-683.
--------------------------------
Resolution: Duplicate

Closing as a dup because HADOOP-9891 now documents this workaround.

> Class MiniYARNCluster not found when starting the minicluster
> --------------------------------------------------------------
> Key: YARN-683
> URL: https://issues.apache.org/jira/browse/YARN-683
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0, 2.0.4-alpha
> Environment: MacOSX 10.8.3 - Java 1.6.0_45
> Reporter: Rémy SAISSY
>
> Starting the minicluster with the following command line:
> bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.4-alpha-tests.jar minicluster -format
> fails for MiniYARNCluster with the following error:
> 13/05/14 17:06:58 INFO hdfs.MiniDFSCluster: Cluster is active
> 13/05/14 17:06:58 INFO mapreduce.MiniHadoopClusterManager: Started MiniDFSCluster -- namenode on port 55205
> java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/server/MiniYARNCluster
> 	at org.apache.hadoop.mapreduce.MiniHadoopClusterManager.start(MiniHadoopClusterManager.java:170)
> 	at org.apache.hadoop.mapreduce.MiniHadoopClusterManager.run(MiniHadoopClusterManager.java:129)
> 	at org.apache.hadoop.mapreduce.MiniHadoopClusterManager.main(MiniHadoopClusterManager.java:314)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> 	at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:115)
> 	at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:123)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.server.MiniYARNCluster
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> 	... 16 more
[jira] [Created] (YARN-3568) TestAMRMTokens should use some random port
Gera Shegalov created YARN-3568:
-------------------------------

Summary: TestAMRMTokens should use some random port
Key: YARN-3568
URL: https://issues.apache.org/jira/browse/YARN-3568
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Gera Shegalov

Since the default port is used for yarn.resourcemanager.scheduler.address, if we already run a pseudo-distributed cluster on the same development machine, the test fails like this:
{code}
testMasterKeyRollOver[0](org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens) Time elapsed: 1.511 sec ERROR!
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8030] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:444)
	at sun.nio.ch.Net.bind(Net.java:436)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
	at org.apache.hadoop.ipc.Server.bind(Server.java:413)
	at org.apache.hadoop.ipc.Server$Listener.init(Server.java:590)
	at org.apache.hadoop.ipc.Server.init(Server.java:2340)
	at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:945)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.init(ProtobufRpcEngine.java:534)
	at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:509)
	at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:787)
	at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
	at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
	at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
	at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.serviceStart(ApplicationMasterService.java:140)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:586)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:996)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1037)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1033)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1033)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1073)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens.testMasterKeyRollOver(TestAMRMTokens.java:235)
{code}
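A minimal sketch of the kind of fix suggested by the summary, pointing the scheduler address at an ephemeral port in the test configuration; the approach is illustrative, not the actual patch:
{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RandomSchedulerPortDemo {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Port 0 lets the OS pick a free ephemeral port, so the test no longer
    // collides with a pseudo-distributed cluster listening on the default 8030.
    conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS, "0.0.0.0:0");
    System.out.println("scheduler address for the test: "
        + conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS));
  }
}
{code}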
[jira] [Created] (YARN-2893) AMLauncher: sporadic job failures due to EOFException in readTokenStorageStream
Gera Shegalov created YARN-2893:
-------------------------------

Summary: AMLauncher: sporadic job failures due to EOFException in readTokenStorageStream
Key: YARN-2893
URL: https://issues.apache.org/jira/browse/YARN-2893
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov

MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context.
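A minimal sketch of how a truncated token blob in the launch context surfaces as an EOFException; the byte manipulation is purely illustrative of the corruption, not of its actual cause:
{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.util.Arrays;

import org.apache.hadoop.security.Credentials;

public class CorruptTokensDemo {
  public static void main(String[] args) throws Exception {
    // Serialize an (empty) credentials blob the way the client does for the
    // AM launch context.
    Credentials creds = new Credentials();
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    creds.writeTokenStorageToStream(new DataOutputStream(bytes));

    // Simulate corruption by truncating the serialized bytes.
    byte[] truncated = Arrays.copyOf(bytes.toByteArray(), bytes.size() / 2);

    // This is roughly the read path used when setting up the container;
    // on a corrupt blob it fails with java.io.EOFException.
    new Credentials().readTokenStorageStream(
        new DataInputStream(new ByteArrayInputStream(truncated)));
  }
}
{code}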
[jira] [Created] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
Gera Shegalov created YARN-2377:
-------------------------------

Summary: Localization exception stack traces are not passed as diagnostic info
Key: YARN-2377
URL: https://issues.apache.org/jira/browse/YARN-2377
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov

In the Localizer log one can only see this kind of message:
{code}
14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0
{code}
And then only the {{java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics.
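A minimal sketch of the kind of improvement implied here, passing the full stringified stack trace rather than just the exception message as the diagnostics; the helper and its name are illustrative, not the actual NM code:
{code:java}
import org.apache.hadoop.util.StringUtils;

public class LocalizationDiagnosticsDemo {
  // Hypothetical helper: build the diagnostics string for a failed localization.
  static String buildDiagnostics(Throwable cause) {
    // Today effectively only cause.getMessage() reaches the diagnostics;
    // including the full stack trace makes the failure debuggable.
    return StringUtils.stringifyException(cause);
  }

  public static void main(String[] args) {
    Throwable t = new java.net.UnknownHostException("ha-nn-uri-0");
    System.out.println(buildDiagnostics(t));
  }
}
{code}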
[jira] [Created] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
Gera Shegalov created YARN-1996:
-------------------------------

Summary: Provide alternative policies for UNHEALTHY nodes.
Key: YARN-1996
URL: https://issues.apache.org/jira/browse/YARN-1996
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager, scheduler
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov

Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster, because the current node is declared unusable and all its containers are killed and rescheduled on different nodes.

To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain), whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of the NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY, for example with:
{code}
if [ -e $1 ] ; then
  echo ERROR Node decommissioning via health script hack
fi
{code}
In the current version of the patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies are possible in future work. Currently, the health state of a node is determined as a binary value based on the disk checker and the health script ERROR outputs. However, we could as well interpret health script output similar to Java logging levels (one of which is ERROR), such as WARN and FATAL. Each level can then be treated differently, e.g.:
- FATAL: unusable like today
- ERROR: drain
- WARN: halve the node capacity

complemented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. A sketch of such a level-based policy follows below.
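A minimal sketch of how the level-based interpretation and the equivalence rules might look; the enum values, thresholds, and method names are illustrative, not part of the actual patch:
{code:java}
public class HealthLevelPolicyDemo {
  enum NodeAction { USE_NORMALLY, HALVE_CAPACITY, DRAIN_CONTAINERS, MARK_UNUSABLE }

  // Illustrative equivalence rules from the description:
  // 3 WARN == ERROR, 2 ERROR == FATAL.
  static NodeAction decide(int warnCount, int errorCount, int fatalCount) {
    errorCount += warnCount / 3;          // promote accumulated WARNs
    fatalCount += errorCount / 2;         // promote accumulated ERRORs
    if (fatalCount > 0) {
      return NodeAction.MARK_UNUSABLE;    // behave like today's UNHEALTHY
    } else if (errorCount > 0) {
      return NodeAction.DRAIN_CONTAINERS; // let running containers finish
    } else if (warnCount > 0) {
      return NodeAction.HALVE_CAPACITY;   // reduce schedulable resources
    }
    return NodeAction.USE_NORMALLY;
  }

  public static void main(String[] args) {
    System.out.println(decide(3, 0, 0)); // 3 WARNs promote to an ERROR -> DRAIN_CONTAINERS
    System.out.println(decide(0, 2, 0)); // 2 ERRORs promote to a FATAL -> MARK_UNUSABLE
  }
}
{code}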
[jira] [Created] (YARN-1700) AHS records non-launched containers
Gera Shegalov created YARN-1700:
-------------------------------

Summary: AHS records non-launched containers
Key: YARN-1700
URL: https://issues.apache.org/jira/browse/YARN-1700
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov

When testing AHS with an MR sleep job, AHS sometimes threw an NPE out of AppAttemptBlock.render because logUrl in the container report was null. I realized that this is because AHS may record containers that never launch.
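A minimal sketch of the defensive rendering this implies, assuming a hypothetical container report holder; the real fix may instead avoid recording never-launched containers in the first place:
{code:java}
public class NullLogUrlRenderDemo {
  // Hypothetical stand-in for the container report consumed by AppAttemptBlock#render.
  static class ContainerReport {
    String containerId = "container_1391000000000_0001_01_000002";
    String logUrl = null; // never-launched container -> no log URL recorded
  }

  static String renderLogCell(ContainerReport report) {
    // Guarding against the null logUrl avoids the NPE while the container
    // still shows up on the attempt page.
    return report.logUrl == null ? "N/A" : report.logUrl;
  }

  public static void main(String[] args) {
    System.out.println(renderLogCell(new ContainerReport()));
  }
}
{code}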
[jira] [Created] (YARN-1701) More intuitive defaults for AHS
Gera Shegalov created YARN-1701:
-------------------------------

Summary: More intuitive defaults for AHS
Key: YARN-1701
URL: https://issues.apache.org/jira/browse/YARN-1701
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Gera Shegalov
Assignee: Gera Shegalov

When I enable AHS via yarn.ahs.enabled, the app history is still not visible in the AHS web UI. This is due to NullApplicationHistoryStore being used as yarn.resourcemanager.history-writer.class. It would be good to have just one key to enable basic functionality.

yarn.ahs.fs-history-store.uri uses ${hadoop.log.dir}, which is a local file system location. However, FileSystemApplicationHistoryStore uses DFS by default.
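A minimal sketch of the extra configuration the description implies is needed today to get history actually written and served; the keys are taken from the description, while the fully-qualified store class name and URI are given for illustration only:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class AhsEnableDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Enabling AHS alone is not enough...
    conf.setBoolean("yarn.ahs.enabled", true);
    // ...because the default history writer is NullApplicationHistoryStore,
    // so a real store class must be configured explicitly (illustrative FQN):
    conf.set("yarn.resourcemanager.history-writer.class",
        "org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore");
    // And the ${hadoop.log.dir}-based default points at the local file system,
    // while FileSystemApplicationHistoryStore expects DFS by default:
    conf.set("yarn.ahs.fs-history-store.uri", "hdfs:///yarn/ahs");
    System.out.println(conf.get("yarn.resourcemanager.history-writer.class"));
  }
}
{code}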
[jira] [Created] (YARN-1599) webUI rm.webapp.AppBlock should redirect to a history App page if and when available
Gera Shegalov created YARN-1599:
-------------------------------

Summary: webUI rm.webapp.AppBlock should redirect to a history App page if and when available
Key: YARN-1599
URL: https://issues.apache.org/jira/browse/YARN-1599
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.2.0, 2.0.5-alpha
Reporter: Gera Shegalov
Assignee: Gera Shegalov

When log aggregation is enabled and the application finishes, our users think that the AppMaster logs were lost, because the links to the AM attempt logs are not updated and result in HTTP 404. Only the tracking URL is updated. To provide a smoother user experience, we propose to simply redirect to the new tracking URL when a page with invalid log links is accessed.
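A minimal sketch of the proposed behavior, assuming a hypothetical hook in the app page rendering that knows the application's final state and its history tracking URL:
{code:java}
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

public class FinishedAppRedirectDemo {
  // Hypothetical check: the attempt page is being rendered for an app that
  // has already finished, so its NM-hosted log links would 404.
  static void maybeRedirect(boolean appFinished, String historyTrackingUrl,
      HttpServletResponse response) throws IOException {
    if (appFinished && historyTrackingUrl != null) {
      // Send the browser to the history page where the aggregated logs live.
      response.sendRedirect(historyTrackingUrl);
    }
  }
}
{code}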
[jira] [Created] (YARN-1551) Allow user-specified reason for killApplication
Gera Shegalov created YARN-1551:
-------------------------------

Summary: Allow user-specified reason for killApplication
Key: YARN-1551
URL: https://issues.apache.org/jira/browse/YARN-1551
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Gera Shegalov

This completes MAPREDUCE-5648.
[jira] [Created] (YARN-1542) Add unit test for public resource on viewfs
Gera Shegalov created YARN-1542:
-------------------------------

Summary: Add unit test for public resource on viewfs
Key: YARN-1542
URL: https://issues.apache.org/jira/browse/YARN-1542
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Reporter: Gera Shegalov
[jira] [Created] (YARN-1529) Add Localization overhead metrics to NM
Gera Shegalov created YARN-1529:
-------------------------------

Summary: Add Localization overhead metrics to NM
Key: YARN-1529
URL: https://issues.apache.org/jira/browse/YARN-1529
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov

Users are often unaware of the localization cost that their jobs incur. To measure the effectiveness of localization caches it is necessary to expose the overhead in the form of metrics.

We propose the addition of the following metrics to NodeManagerMetrics. When a container is about to launch, its set of LocalResources has to be fetched from a central location, typically on HDFS, which results in a number of download requests for the files missing in the caches.

- LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache misses.
- LocalizedFilesCached: total localization requests that were served from local caches. Cache hits.
- LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
- LocalizedBytesCached: total bytes satisfied from local caches.
- Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that were served out of cache: ratio = 100 * caches / (caches + misses), as in the sketch below.
- LocalizationDownloadNanos: total elapsed time in nanoseconds for a container to go from ResourceRequestTransition to LocalizedTransition
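A minimal sketch of computing the cached-ratio metric from the hit/miss counters; the method name mirrors the proposed metrics but is otherwise illustrative:
{code:java}
public class LocalizationCacheRatioDemo {
  static int cachedRatioPercent(long cached, long missed) {
    long total = cached + missed;
    // ratio = 100 * caches / (caches + misses); guard against division by
    // zero before any localization has happened.
    return total == 0 ? 0 : (int) (100 * cached / total);
  }

  public static void main(String[] args) {
    // e.g. 30 requests served from cache, 10 downloaded from DFS -> 75%
    System.out.println(cachedRatioPercent(30, 10));
  }
}
{code}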
[jira] [Created] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
Gera Shegalov created YARN-1515:
-------------------------------

Summary: Ability to dump the container threads and stop the containers in a single RPC
Key: YARN-1515
URL: https://issues.apache.org/jira/browse/YARN-1515
Project: Hadoop YARN
Issue Type: New Feature
Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov

This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts.
[jira] [Created] (YARN-1401) With zero sleep-delay-before-sigkill.ms, no signal is ever sent
Gera Shegalov created YARN-1401:
-------------------------------

Summary: With zero sleep-delay-before-sigkill.ms, no signal is ever sent
Key: YARN-1401
URL: https://issues.apache.org/jira/browse/YARN-1401
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.2.0
Reporter: Gera Shegalov

If you set yarn.nodemanager.sleep-delay-before-sigkill.ms=0 in yarn-site.xml, then an unresponsive child JVM is never killed. In MRv1, the TT used to SIGKILL immediately in this case.
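A minimal sketch of the failure mode being described, assuming the delayed-kill logic hides the entire signal path behind a "delay > 0" check; the method and structure are illustrative, not the actual NM code:
{code:java}
public class ZeroSigkillDelayDemo {
  // Illustrative version of the suspected guard: everything, including the
  // initial SIGTERM, sits behind a strictly-positive delay check.
  static void stopContainer(long sigkillDelayMs) throws InterruptedException {
    if (sigkillDelayMs > 0) {
      System.out.println("SIGTERM sent");
      Thread.sleep(sigkillDelayMs);
      System.out.println("SIGKILL sent");
    }
    // With sigkillDelayMs == 0 we fall through without sending any signal,
    // so an unresponsive child JVM is never killed; MRv1's TT would have
    // sent SIGKILL immediately in this case.
  }

  public static void main(String[] args) throws InterruptedException {
    stopContainer(0L);   // prints nothing: no signal at all
    stopContainer(250L); // prints SIGTERM sent, then SIGKILL sent
  }
}
{code}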