[jira] [Resolved] (YARN-2981) DockerContainerExecutor must support a Cluster-wide default Docker image

2016-10-27 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2981.
---
Resolution: Won't Fix

Agreed. I am closing this as "WON'T FIX" for now given YARN-5388. Please reopen 
this if you disagree. Thanks.

> DockerContainerExecutor must support a Cluster-wide default Docker image
> 
>
> Key: YARN-2981
> URL: https://issues.apache.org/jira/browse/YARN-2981
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Abin Shahab
>Assignee: Abin Shahab
>  Labels: oct16-easy
> Attachments: YARN-2981.patch, YARN-2981.patch, YARN-2981.patch, 
> YARN-2981.patch
>
>
> This allows the YARN administrator to set a cluster-wide default Docker image 
> that will be used when there is no per-job override of the Docker image. With 
> this feature, it would be convenient for newer applications like Slider to 
> launch inside a cluster-default Docker container.
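
For illustration, such a default could be a single NodeManager property in yarn-site.xml. The property name and placement below are hypothetical (this change was never committed), loosely mirroring the per-job image-name setting from the DockerContainerExecutor docs:

{code:xml}
<!-- Hypothetical sketch only: a cluster-wide default Docker image that the
     DockerContainerExecutor would fall back to when a job does not set its
     own image. Property name and value are illustrative. -->
<property>
  <name>yarn.nodemanager.docker-container-executor.image-name</name>
  <value>library/centos:7</value>
</property>
{code}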



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Resolved] (YARN-3755) Log the command of launching containers

2016-10-27 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3755.
---
Resolution: Duplicate

Actually, I think this is a dup of YARN-4309, which now dumps a lot more useful 
information in addition to the command line, environment, etc.

Closing this as duplicate for now, please reopen if you disagree.

> Log the command of launching containers
> ---
>
> Key: YARN-3755
> URL: https://issues.apache.org/jira/browse/YARN-3755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>  Labels: oct16-easy
> Attachments: YARN-3755-1.patch, YARN-3755-2.patch
>
>
> In the ResourceManager log, YARN logs the command for launching the AM, which 
> is very useful. But there is no such log in the NodeManager log for launching 
> containers. This makes it difficult to diagnose cases where containers fail to 
> launch due to some issue in the commands. Although a user can look at the 
> commands in the container launch script file, that file is an internal detail 
> of YARN that users usually don't know about. From the user's perspective, they 
> only know the commands they specified when building the YARN application. 
> {code}
> 2015-06-01 16:06:42,245 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
> to launch container container_1433145984561_0001_01_01 : 
> $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
> -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA 
> -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
> 1>/stdout 2>/stderr
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Resolved] (YARN-5759) Capability to register for a notification/callback on the expiry of timeouts

2016-11-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-5759.
---
Resolution: Duplicate

Agree with [~rohithsharma]. Closing this as a dup of YARN-2261.

> Capability to register for a notification/callback on the expiry of timeouts
> 
>
> Key: YARN-5759
> URL: https://issues.apache.org/jira/browse/YARN-5759
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Gour Saha
>
> There is a need for the YARN native services REST-API service to take certain 
> actions once an application's timeout expires. For example, an immediate 
> requirement is to destroy a Slider application once its lifetime timeout 
> expires and YARN has stopped the application. Destroying a Slider application 
> means cleaning up the Slider HDFS state store and the ZK paths for that 
> application. 
> Potentially, there will be more advanced requirements from the REST-API 
> service and other services in the future, which will make this feature very handy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Resolved] (YARN-5068) Expose scheduler queue to application master

2017-03-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-5068.
---
Resolution: Duplicate

Closing this accurately as a dup of YARN-1623.

> Expose scheduler queue to application master
> 
>
> Key: YARN-5068
> URL: https://issues.apache.org/jira/browse/YARN-5068
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: MAPREDUCE-6692.patch, YARN-5068.1.patch, 
> YARN-5068.2.patch, YARN-5068-branch-2.1.patch
>
>
> The AM needs to know the queue name in which it was launched.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)




[jira] [Resolved] (YARN-6543) yarn application's privilege is determined by yarn process creator instead of yarn application user.

2017-07-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-6543.
---
Resolution: Not A Bug

Agree with [~rohithsharma], this is working as designed. Closing this.

> yarn application's privilege is determined by yarn process creator instead of 
> yarn application user.
> 
>
> Key: YARN-6543
> URL: https://issues.apache.org/jira/browse/YARN-6543
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: wuchang
>
> My application is a PySpark application which is impersonated by user 
> 'wuchang'.
> My application information is:
> {code}
> Application Report : 
> Application-Id : application_1493004858240_0007
> Application-Name : livy-session-6
> Application-Type : SPARK
> User : wuchang
> Queue : root.wuchang
> Start-Time : 1493708942748
> Finish-Time : 0
> Progress : 10%
> State : RUNNING
> Final-State : UNDEFINED
> Tracking-URL : http://10.120.241.82:34462
> RPC Port : 0
> AM Host : 10.120.241.82
> Aggregate Resource Allocation : 4369480 MB-seconds, 2131 vcore-seconds
> Diagnostics :
> {code}
> And the process is :
> {code}
> appuser  25454 25872  0 15:09 ?00:00:00 bash 
> /data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/default_container_executor.sh
> appuser  25456 25454  0 15:09 ?00:00:00 /bin/bash -c 
> /home/jdk/bin/java -server -Xmx1024m 
> -Djava.io.tmpdir=/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/tmp
>  '-Dspark.ui.port=0' '-Dspark.driver.port=40969' 
> -Dspark.yarn.app.container.log.dir=/home/log/hadoop/logs/userlogs/application_1493004858240_0007/container_1493004858240_0007_01_04
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@10.120.241.82:40969 --executor-id 2 --hostname 
> 10.120.241.18 --cores 1 --app-id application_1493004858240_0007 
> --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/__app__.jar
>  --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/livy-api-0.3.0-SNAPSHOT.jar
>  --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/livy-rsc-0.3.0-SNAPSHOT.jar
>  --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/netty-all-4.0.29.Final.jar
>  --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/commons-codec-1.9.jar
>  --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/livy-core_2.11-0.3.0-SNAPSHOT.jar
>  --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/livy-repl_2.11-0.3.0-SNAPSHOT.jar
>  1> 
> /home/log/hadoop/logs/userlogs/application_1493004858240_0007/container_1493004858240_0007_01_04/stdout
>  2> 
> /home/log/hadoop/logs/userlogs/application_1493004858240_0007/container_1493004858240_0007_01_04/stderr
> appuser  25468 25456  2 15:09 ?00:00:09 /home/jdk/bin/java -server 
> -Xmx1024m 
> -Djava.io.tmpdir=/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/tmp
>  -Dspark.ui.port=0 -Dspark.driver.port=40969 
> -Dspark.yarn.app.container.log.dir=/home/log/hadoop/logs/userlogs/application_1493004858240_0007/container_1493004858240_0007_01_04
>  -XX:OnOutOfMemoryError=kill %p 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@10.120.241.82:40969 --executor-id 2 --hostname 
> 10.120.241.18 --cores 1 --app-id application_1493004858240_0007 
> --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/__app__.jar
>  --user-class-path 
> file:/data/data/hadoop/tmp/nm-local-dir/usercache/wuchang/appcache/application_1493004858240_0007/container_1493004858240_0007_01_04/livy-api-0.3.0-SNAPSHOT.jar
>  --user-c
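
For context on the "working as designed" resolution: in a non-secure setup using the DefaultContainerExecutor, container processes run as the OS user of the NodeManager daemon (here "appuser"), while YARN only records "wuchang" as the application user. Launching container processes as the submitting user requires the LinuxContainerExecutor. A minimal, illustrative yarn-site.xml sketch (values are examples only; a full secure-mode setup needs more than this):

{code:xml}
<!-- Illustrative sketch: with the LinuxContainerExecutor, container processes
     can be launched as the submitting user rather than the NM daemon user. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- In non-secure mode, setting this to false runs containers as the
       application submitter instead of a single local user. -->
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>
{code}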

[jira] [Created] (YARN-6930) Admins should be able to explicitly enable specific LinuxContainerRuntime in the NodeManager

2017-08-02 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-6930:
-

 Summary: Admins should be able to explicitly enable specific 
LinuxContainerRuntime in the NodeManager
 Key: YARN-6930
 URL: https://issues.apache.org/jira/browse/YARN-6930
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Today, in the Java land, all LinuxContainerRuntimes are always enabled when 
using the LinuxContainerExecutor, and a user can simply invoke whichever one 
they want - default, docker, java-sandbox.

We should have a way for admins to explicitly enable only the specific runtimes 
they choose for the cluster. And by default, everything other than the default 
runtime should be disabled.
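
A possible shape for such a setting, sketched as a yarn-site.xml property (the name and values are illustrative of the idea, not a committed interface):

{code:xml}
<!-- Illustrative only: an admin-controlled whitelist of Linux container
     runtimes; anything not listed here would be rejected by the NodeManager. -->
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
{code}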



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Resolved] (YARN-7064) Use cgroup to get container resource utilization

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-7064.
---
Resolution: Fixed

Resolving as Fixed instead of Done per our conventions.

> Use cgroup to get container resource utilization
> 
>
> Key: YARN-7064
> URL: https://issues.apache.org/jira/browse/YARN-7064
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-7064.000.patch, YARN-7064.001.patch, 
> YARN-7064.002.patch, YARN-7064.003.patch, YARN-7064.004.patch, 
> YARN-7064.005.patch, YARN-7064.007.patch, YARN-7064.008.patch, 
> YARN-7064.009.patch, YARN-7064.010.patch, YARN-7064.011.patch, 
> YARN-7064.012.patch, YARN-7064.013.patch, YARN-7064.014.patch
>
>
> This is an addendum to YARN-6668. What happens is that that jira always wants 
> to rebase patches against YARN-1011 instead of trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (YARN-7873) Revert YARN-6078

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-7873.
---
   Resolution: Fixed
Fix Version/s: 3.0.1
   2.9.1
   2.10.0
   3.1.0

> Revert YARN-6078
> 
>
> Key: YARN-7873
> URL: https://issues.apache.org/jira/browse/YARN-7873
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Blocker
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1
>
>
> I think we should revert YARN-6078, since it is not working as intended. The 
> NM does not have permission to destroy the process of the ContainerLocalizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (YARN-8109) Resource Manager WebApps fails to start due to ConcurrentModificationException

2018-04-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-8109.
---
Resolution: Duplicate

Already fixed by HADOOP-13556. Closing as a dup. Please reopen if that isn't 
the case.

> Resource Manager WebApps  fails to start due to 
> ConcurrentModificationException
> ---
>
> Key: YARN-8109
> URL: https://issues.apache.org/jira/browse/YARN-8109
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Major
>
> {code}
> 2018-03-22 04:57:39,289 INFO  resourcemanager.ResourceTrackerService 
> (ResourceTrackerService.java:nodeHeartbeat(497)) - Node not found resyncing 
> ctr-e138-1518143905142-129550-01-36.hwx.site:25454
> 2018-03-22 04:57:39,294 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in 
> state STARTED; cause: java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
> at java.util.Hashtable$Enumerator.next(Hashtable.java:1378)
> at 
> org.apache.hadoop.conf.Configuration.iterator(Configuration.java:2564)
> at 
> org.apache.hadoop.conf.Configuration.getPropsWithPrefix(Configuration.java:2583)
> at 
> org.apache.hadoop.yarn.webapp.WebApps$Builder.getConfigParameters(WebApps.java:386)
> at 
> org.apache.hadoop.yarn.webapp.WebApps$Builder.build(WebApps.java:334)
> at 
> org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:395)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1049)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1152)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1293)
> 2018-03-22 04:57:39,296 INFO  ipc.Server (Server.java:stop(2752)) - Stopping 
> server on 8050
> 2018-03-22 04:57:39,300 INFO  ipc.Server (Server.java:run(932)) - Stopping 
> IPC Server listener on 8050
> 2018-03-22 04:57:39,301 INFO  ipc.Server (Server.java:run(1069)) - Stopping 
> IPC Server Responder
> {code} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (YARN-5983) [Umbrella] Support for FPGA as a Resource in YARN

2018-04-06 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-5983.
---
Resolution: Fixed

> [Umbrella] Support for FPGA as a Resource in YARN
> -
>
> Key: YARN-5983
> URL: https://issues.apache.org/jira/browse/YARN-5983
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-5983-Support-FPGA-resource-on-NM-side_v1.pdf, 
> YARN-5983-implementation-notes.pdf, YARN-5983_end-to-end_test_report.pdf
>
>
> As various big data workloads run on YARN, CPU alone will eventually no longer 
> scale and heterogeneous systems will become more important. ML/DL has been a 
> rising star in recent years, and applications in these areas have to 
> utilize GPUs or FPGAs to boost performance. Hardware vendors such as 
> Intel are also investing in such hardware. It is likely that FPGAs will become 
> as popular in data centers as CPUs in the near future.
> So it would be great for YARN, as a resource-management and scheduling system, 
> to evolve to support this. This JIRA proposes making FPGA a first-class citizen. 
> The changes roughly include:
> 1. FPGA resource detection and heartbeat
> 2. Scheduler changes (YARN-3926 involved)
> 3. FPGA-related preparation and isolation before launching containers
> We know that YARN-3926 is trying to extend the current resource model, but we 
> can still have some FPGA-related discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2018-04-06 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-6223.
---
Resolution: Fixed

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> A variety of workloads are moving to YARN, including machine learning / 
> deep learning, which can be sped up by leveraging GPU computation power. 
> Workloads should be able to request GPUs from YARN as simply as CPU and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: admins can either configure GPU resources and 
> architectures on each node, or, more advanced, the NodeManager can automatically 
> discover GPU resources and architectures and report them to the ResourceManager. 
> 2) GPU scheduling: the YARN scheduler should account for GPU as a resource type 
> just like CPU and memory.
> 3) GPU isolation/monitoring: once a task is launched with GPU resources, the 
> NodeManager should properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed using cgroups for isolation, which cannot solve the 
> problems listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping, loading the nvidia_uvm module, mismatched 
> CUDA/driver versions, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (YARN-5881) Enable configuration of queue capacity in terms of absolute resources

2018-04-06 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-5881.
---
Resolution: Fixed

> Enable configuration of queue capacity in terms of absolute resources
> -
>
> Key: YARN-5881
> URL: https://issues.apache.org/jira/browse/YARN-5881
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Sean Po
>Assignee: Sunil G
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: 
> YARN-5881.Support.Absolute.Min.Max.Resource.In.Capacity.Scheduler.design-doc.v1.pdf,
>  YARN-5881.v0.patch, YARN-5881.v1.patch
>
>
> Currently, the YARN RM supports configuring queue capacity as a proportion of 
> cluster capacity. In the context of YARN being used as a public cloud service, 
> it makes more sense if queues can be configured in absolute terms. This will 
> allow administrators to set usage limits more concretely and simplify customer 
> expectations for cluster allocation.
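
A sketch of what absolute-resource configuration could look like in capacity-scheduler.xml; the bracketed syntax is illustrative of the proposal, not a committed format:

{code:xml}
<!-- Illustrative: queue capacity expressed in absolute resources
     rather than as a percentage of the parent queue. -->
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>[memory=10240,vcores=12]</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
  <value>[memory=20480,vcores=24]</value>
</property>
{code}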



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2018-04-08 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-1489.
---
Resolution: Fixed
  Assignee: (was: Vinod Kumar Vavilapalli)

Resolved this very old feature as fixed. Keeping it unassigned given multiple 
contributors. No fix-version given the tasks (perhaps?) spanned across releases.

> [Umbrella] Work-preserving ApplicationMaster restart
> 
>
> Key: YARN-1489
> URL: https://issues.apache.org/jira/browse/YARN-1489
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Priority: Major
> Attachments: Work preserving AM restart.pdf
>
>
> Today if AMs go down,
>  - RM kills all the containers of that ApplicationAttempt
>  - New ApplicationAttempt doesn't know where the previous containers are 
> running
>  - Old running containers don't know where the new AM is running.
> We need to fix this to enable work-preserving AM restart. The latter two 
> potentially can be done at the app level, but it is good to have a common 
> solution for all apps wherever possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (YARN-7212) [Atsv2] TimelineSchemaCreator fails to create flowrun table causes RegionServer down!

2018-04-19 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-7212.
---
Resolution: Duplicate

> [Atsv2] TimelineSchemaCreator fails to create flowrun table causes 
> RegionServer down!
> -
>
> Key: YARN-7212
> URL: https://issues.apache.org/jira/browse/YARN-7212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Rohith Sharma K S
>Priority: Major
>
> *HBase-2.0* officially supports *hadoop-alpha* compilations, so I was trying 
> to build and test with HBase-2.0. But table schema creation fails and causes 
> the RegionServer to shut down with the following error
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.hadoop.hbase.Tag.asList([BII)Ljava/util/List;
> at 
> org.apache.hadoop.yarn.server.timelineservice.storage.flow.FlowScanner.getCurrentAggOp(FlowScanner.java:250)
> at 
> org.apache.hadoop.yarn.server.timelineservice.storage.flow.FlowScanner.nextInternal(FlowScanner.java:226)
> at 
> org.apache.hadoop.yarn.server.timelineservice.storage.flow.FlowScanner.next(FlowScanner.java:145)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:132)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
> at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:973)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2252)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2672)
> {noformat}
> Since the HBase-2.0 community is ready to release Hadoop-3.x compatible 
> versions, ATSv2 also needs to support HBase-2.0. For this, we need to take up 
> the task of testing and validating HBase-2.0 issues! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (YARN-8181) Docker container run_time

2018-04-19 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-8181.
---
Resolution: Invalid

[~sajavadi], please see http://hadoop.apache.org/mailing_lists.html. You can 
send emails to u...@hadoop.apache.org. You can subscribe to the list for other 
related discussions.

Resolving this for now.

> Docker container run_time
> -
>
> Key: YARN-8181
> URL: https://issues.apache.org/jira/browse/YARN-8181
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Seyyed Ahmad Javadi
>Priority: Major
>
> Hi All,
> I want to use the Docker container runtime but could not solve the problem I 
> am facing. I am following the guide below and the NM log is as follows. I 
> cannot see any Docker containers being created. It works when I use the 
> default LCE. Please also find how I submit a job at the end as well.
> Do you have any guide on how I can make the Docker runtime work?
> Could you please let me know how to use the LCE binary to make sure my Docker 
> setup is correct?
> I confirmed that "docker run" works fine. I really like this developing 
> feature and would like to contribute to it. Many thanks in advance.
> [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/DockerContainers.html]
> {code:java}
> NM LOG:
> ...
> 2018-04-19 11:49:24,568 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1524151293356_0005_01 (auth:SIMPLE)
> 2018-04-19 11:49:24,580 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Start request for container_1524151293356_0005_01_01 by user ubuntu
> 2018-04-19 11:49:24,584 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Creating a new application reference for app application_1524151293356_0005
> 2018-04-19 11:49:24,584 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=ubuntu    
> IP=130.245.127.176    OPERATION=Start Container Request    
> TARGET=ContainerManageImpl    RESULT=SUCCESS    
> APPID=application_1524151293356_0005    
> CONTAINERID=container_1524151293356_0005_01_01
> 2018-04-19 11:49:24,585 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1524151293356_0005 transitioned from NEW to INITING
> 2018-04-19 11:49:24,585 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Adding container_1524151293356_0005_01_01 to application 
> application_1524151293356_0005
> 2018-04-19 11:49:24,585 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1524151293356_0005 transitioned from INITING to 
> RUNNING
> 2018-04-19 11:49:24,588 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1524151293356_0005_01_01 transitioned from NEW to 
> LOCALIZING
> 2018-04-19 11:49:24,588 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_INIT for appId application_1524151293356_0005
> 2018-04-19 11:49:24,589 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1524151293356_0005_01_01
> 2018-04-19 11:49:24,616 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Writing credentials to the nmPrivate file 
> /tmp/hadoop-ubuntu/nm-local-dir/nmPrivate/container_1524151293356_0005_01_01.tokens
> 2018-04-19 11:49:28,090 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1524151293356_0005_01_01 transitioned from 
> LOCALIZING to SCHEDULED
> 2018-04-19 11:49:28,090 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
>  Starting container [container_1524151293356_0005_01_01]
> 2018-04-19 11:49:28,212 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1524151293356_0005_01_01 transitioned from SCHEDULED 
> to RUNNING
> 2018-04-19 11:49:28,212 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Starting resource-monitoring for container_1524151293356_0005_01_01
> 2018-04-19 11:49:29,401 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_1524151293356_0005_01_01 succeeded
> 2018-04-19 11:49:29,401 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1524151293356_0005_01_01 transitioned from RUNNING 
> to EX
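
For reference, the linked DockerContainers guide revolves around settings along these lines (an illustrative yarn-site.xml sketch, not a complete setup; container-executor.cfg must also enable the docker module and point at the docker binary):

{code:xml}
<!-- Illustrative sketch of enabling the Docker runtime under the
     LinuxContainerExecutor; see the DockerContainers guide for the full
     list of required and optional properties. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
{code}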

[jira] [Closed] (YARN-1705) Reset cluster-metrics on transition to standby

2015-11-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-1705.
-

> Reset cluster-metrics on transition to standby
> --
>
> Key: YARN-1705
> URL: https://issues.apache.org/jira/browse/YARN-1705
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
> Fix For: 2.4.0
>
> Attachments: YARN-1705.1.patch, YARN-1705.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2055.
---
Resolution: Not A Problem

Between YARN-2074, MAPREDUCE-5956 and YARN-2022, this is not much of an issue 
anymore.

For the remaining toggling of AM containers between nearly-full queues, not much 
can really be done.

I am closing this as not-a-problem anymore, please reopen as necessary.

> Preemption: Jobs are failing due to AMs are getting launched and killed 
> multiple times
> --
>
> Key: YARN-2055
> URL: https://issues.apache.org/jira/browse/YARN-2055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
>
> If queue A does not have enough capacity to run the AM, then the AM will 
> borrow capacity from queue B. In that case the AM will be killed when queue B 
> reclaims its capacity, and then the AM will be launched and killed again; as a 
> result the job will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3575) Job using 2.5 jars fails on a 2.6 cluster whose RM has been restarted

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3575.
---
Resolution: Won't Fix

Agreed. I requested that this be filed largely for documentation concerns.

There is one simple way sites can avoid this _incompatibility_: Do not use 
RM-recovery on a 2.6+ cluster while you still have applications running 
versions < 2.6. Once all apps are upgraded to 2.6+, you can enable RM-recovery 
and perform work-preserving RM restarts and rolling upgrades.
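
For example (an illustrative yarn-site.xml sketch: keep RM recovery disabled until every application is on 2.6+ jars, then flip it to true):

{code:xml}
<!-- Leave RM recovery off while pre-2.6 clients/apps are still around;
     enable it only once all applications use 2.6+ jars. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>false</value>
</property>
{code}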

Closing this for now as won't fix. Please reopen if you disagree. Thanks.

> Job using 2.5 jars fails on a 2.6 cluster whose RM has been restarted
> -
>
> Key: YARN-3575
> URL: https://issues.apache.org/jira/browse/YARN-3575
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>
> Trying to launch a job that uses the 2.5 jars fails on a 2.6 cluster whose RM 
> has been restarted (i.e., epoch != 0) because the epoch number starts 
> appearing in the container IDs and the 2.5 jars no longer know how to parse 
> the container IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-1336) [Umbrella] Work-preserving nodemanager restart

2015-11-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-1336.
---
Resolution: Fixed

Though there are a couple of improvements left to take care of, this feature is 
largely completed across 2.6 / 2.7 versions and used in production at many 
sites as part of rolling-upgrades.

Closing this as fixed, thanks [~jlowe]!

> [Umbrella] Work-preserving nodemanager restart
> --
>
> Key: YARN-1336
> URL: https://issues.apache.org/jira/browse/YARN-1336
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: NMRestartDesignOverview.pdf, YARN-1336-rollup-v2.patch, 
> YARN-1336-rollup.patch
>
>
> This serves as an umbrella ticket for tasks related to work-preserving 
> nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4372) Cannot enable system-metrics-publisher inside MiniYARNCluster

2015-11-18 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4372:
-

 Summary: Cannot enable system-metrics-publisher inside 
MiniYARNCluster
 Key: YARN-4372
 URL: https://issues.apache.org/jira/browse/YARN-4372
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


[~Naganarasimha] found this at YARN-2859, see [this 
comment|https://issues.apache.org/jira/browse/YARN-2859?focusedCommentId=15005746&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15005746].

Because of the way daemons are started inside MiniYARNCluster, the RM is not set 
up correctly to send information to the TimelineService.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4692) [Umbrella] Simplified and first-class support for services in YARN

2016-02-12 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4692:
-

 Summary: [Umbrella] Simplified and first-class support for 
services in YARN
 Key: YARN-4692
 URL: https://issues.apache.org/jira/browse/YARN-4692
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


YARN-896 focused on getting the ball rolling on the support for services (long 
running applications) on YARN.

I’d like to propose the next stage of this effort: _Simplified and first-class 
support for services in YARN_.

The chief rationale for filing a separate new JIRA is threefold:
 - Do a fresh survey of all the things that are already implemented in the 
project
 - Weave a comprehensive story around what we further need and attempt to rally 
the community around a concrete end-goal, and
 - Additionally focus on functionality that YARN-896 and friends left for 
higher layers to take care of and see how much of that is better integrated 
into the YARN platform itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-1978) TestLogAggregationService#testLocalFileDeletionAfterUpload fails sometimes

2016-02-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-1978.
---
Resolution: Duplicate

Closing this as a dup of YARN-4168 which is much closer to completion.

> TestLogAggregationService#testLocalFileDeletionAfterUpload fails sometimes
> --
>
> Key: YARN-1978
> URL: https://issues.apache.org/jira/browse/YARN-1978
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-1978.txt
>
>
> This happens in a Windows VM, though the issue isn't related to Windows.
> {code}
> ---
> Test set: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
> ---
> Tests run: 11, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.859 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
> testLocalFileDeletionAfterUpload(org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService)
>   Time elapsed: 0.906 sec  <<< FAILURE!
> junit.framework.AssertionFailedError: check 
> Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\TestLogAggregationService-localLogDir\application_1234_0001\container_1234_0001_01_01\stdout
> at junit.framework.Assert.fail(Assert.java:50)
> at junit.framework.Assert.assertTrue(Assert.java:20)
> at junit.framework.Assert.assertFalse(Assert.java:34)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.testLocalFileDeletionAfterUpload(TestLogAggregationService.java:201)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4724) [Umbrella] Recognizing services: Special handling of preemption, container reservations etc.

2016-02-23 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4724:
-

 Summary: [Umbrella] Recognizing services: Special handling of 
preemption, container reservations etc.
 Key: YARN-4724
 URL: https://issues.apache.org/jira/browse/YARN-4724
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli


See overview doc at YARN-4692, copying the sub-section to track all related 
efforts.

{quote}
Though it is desirable to not special-case services anywhere in YARN, there 
are a few key areas where such special recognition is not only unavoidable but 
very necessary. For example, preemption and reservation of long-running 
containers have different implications from regular ones.

Preemption of resources in YARN today works by killing containers. 
Obviously, preempting long-running containers is different and costlier for the 
apps. For many long-lived applications, preemption by killing will likely not 
even be an option that they can tolerate.

[Task] Preemption also means that the scheduler should avoid allocating 
long-running containers on borrowed resources.

On the other hand, today’s scheduler creates reservations when it cannot fit 
a container on a machine with free resources. When making such reservations for 
containers of long-running services, the scheduler shouldn’t queue the 
reservation behind other services running on a node - otherwise the reservation 
may get stuck unfulfilled forever.

[Task] Preemption and reservation logic thus need to understand whether an 
application has long-running containers and make decisions accordingly.

[Task] There is an existing JIRA, YARN-1039 (Add parameter for YARN resource 
requests to indicate "long lived"), which was filed to address some of this 
special recognition of service containers. The options were between a boolean 
flag and a long representing the lifecycle, though in practice I think we will 
need both.
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4725) [Umbrella] Auto-­restart of containers

2016-02-23 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4725:
-

 Summary: [Umbrella] Auto-­restart of containers
 Key: YARN-4725
 URL: https://issues.apache.org/jira/browse/YARN-4725
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli


See overview doc at YARN-4692, copying the sub-section to track all related 
efforts.

Today, when a container (process-tree) dies, the NodeManager assumes that the 
container’s allocation has also expired, and reports accordingly to the 
ResourceManager, which then releases the allocation. For service containers, 
this is undesirable in many cases. Long-running containers may exit for various 
reasons, crash and need to restart, but forcing them to go through the complete 
scheduling cycle, resource localization etc. is both unnecessary and expensive. 
(Task) For services it will be good to have NodeManagers automatically 
restart containers. This looks a lot like inittab / daemontools at the system 
level.

We will need to enable app-specific policies (very similar to the handling of 
AM restarts at the YARN level) for restarting containers automatically, but limit 
such restarts if a container dies too often in a short interval of time.

YARN-3998 is an existing ticket that looks at some, if not all, of this 
functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4726) [Umbrella] Allocation reuse for application upgrades

2016-02-23 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4726:
-

 Summary: [Umbrella] Allocation reuse for application upgrades
 Key: YARN-4726
 URL: https://issues.apache.org/jira/browse/YARN-4726
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Vinod Kumar Vavilapalli


See overview doc at YARN-4692, copying the sub-section to track all related 
efforts.

Once auto-restart of containers is taken care of (YARN-4725), we need to 
address what I believe is the second most important reason for service 
containers to restart: upgrades. Once a service is running on YARN, the way the 
container allocation lifecycle works, any time the container exits, YARN will 
reclaim the resources. During an upgrade, with a multitude of other applications 
running in the system, giving up and getting back resources allocated to the 
service is hard to manage. Things like NodeLabels in YARN help this cause 
but are not straightforward to use to address the app-specific use-cases.

We need a first-class way of letting applications reuse the same 
resource allocation for multiple launches of the processes inside the 
container. This is done by decoupling the allocation lifecycle and the process 
lifecycle.

The JIRA YARN-1040 initiated this conversation. We need two things here: 
 - (1) (Task) the ApplicationMaster should be able to use the same 
container allocation and issue multiple startContainer requests to the 
NodeManager.
 - (2) (Task) To support the upgrade of the ApplicationMaster itself, clients 
should be able to inform YARN to restart the AM within the same allocation but 
with new bits.

The JIRAs YARN-3417 and YARN-4470 talk about the second task above ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-8319) More YARN pages need to honor yarn.resourcemanager.display.per-user-apps

2018-05-17 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-8319:
-

 Summary: More YARN pages need to honor 
yarn.resourcemanager.display.per-user-apps
 Key: YARN-8319
 URL: https://issues.apache.org/jira/browse/YARN-8319
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Reporter: Vinod Kumar Vavilapalli


When this config is on:
 - The per-queue page on UI2 should filter the app list by user
 -- TODO: Verify the same with the UI1 per-queue page
 - ATSv2 with UI2 should filter the list of all users' flows and flow activities
 - Per-node pages
 -- Listings of apps and containers on a per-node basis should filter apps and 
containers by user.

To this end, because this is no longer just for the ResourceManager, we should 
also deprecate {{yarn.resourcemanager.display.per-user-apps}} in favor of 
{{yarn.webapp.filter-app-list-by-user}}.
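
A sketch of the proposed replacement property in yarn-site.xml (the name follows the proposal above; exact semantics and defaults are to be settled in this JIRA):

{code:xml}
<!-- Proposed: a webapp-wide switch replacing the RM-only property, so that
     queue, ATSv2 and per-node pages can all filter listings by user. -->
<property>
  <name>yarn.webapp.filter-app-list-by-user</name>
  <value>true</value>
</property>
{code}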



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Created] (YARN-8338) TimelineService V1.5 doesn't come up after HADOOP-15406

2018-05-22 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-8338:
-

 Summary: TimelineService V1.5 doesn't come up after HADOOP-15406
 Key: YARN-8338
 URL: https://issues.apache.org/jira/browse/YARN-8338
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


TimelineService V1.5 fails with the following:

{code}
java.lang.NoClassDefFoundError: org/objenesis/Objenesis
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174)
{code}
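
One likely remedy (an assumption, not necessarily the committed fix) is to put the missing Objenesis classes back on the timeline server's classpath, e.g. by declaring the dependency explicitly in the relevant module's pom.xml:

{code:xml}
<!-- Assumed fix sketch: make org.objenesis:objenesis an explicit dependency so
     RollingLevelDBTimelineStore can load org/objenesis/Objenesis at runtime.
     The version shown is only an example. -->
<dependency>
  <groupId>org.objenesis</groupId>
  <artifactId>objenesis</artifactId>
  <version>2.6</version>
</dependency>
{code}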



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Created] (YARN-9548) [Umbrella] Make YARN work well in elastic cloud environments

2019-05-13 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-9548:
-

 Summary: [Umbrella] Make YARN work well in elastic cloud 
environments
 Key: YARN-9548
 URL: https://issues.apache.org/jira/browse/YARN-9548
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Vinod Kumar Vavilapalli


YARN works well in static environments, and there isn't anything fundamentally 
broken in YARN that stops us from making it work well in dynamic environments 
like the cloud (public or private) as well.

There are a few areas where we need to invest, though:
 # Autoscaling
 -- cluster level: add/remove nodes intelligently based on metrics and/or admin 
plugins
 -- node level: scale nodes up/down vertically?
 # Smarter scheduling
-- to pack containers as opposed to spreading them around to account for 
nodes going away
-- to account for speculative nodes like spot instances
 # Handling nodes going away better
-- by decommissioning sanely
-- dealing with auxiliary services data
 # And any installation helpers in this dynamic world - scripts, operators etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Created] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms

2016-03-03 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4757:
-

 Summary: [Umbrella] Simplified discovery of services via DNS 
mechanisms
 Key: YARN-4757
 URL: https://issues.apache.org/jira/browse/YARN-4757
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Vinod Kumar Vavilapalli


[See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track all 
related efforts.]

In addition to completing the present story of the service registry (YARN-913), 
we also need to simplify access to the registry entries. The existing read 
mechanisms of the YARN Service Registry are currently limited to a 
registry-specific (Java) API and a REST interface. In practice, this makes it 
very difficult to wire up existing clients and services. For example, dynamic 
configuration of the dependent endpoints of a service is not easy to implement 
using the present registry-read mechanisms *without* code changes to existing 
services.

A good solution to this is to expose the registry information through a more 
generic and widely used discovery mechanism: DNS. Service discovery via DNS 
uses the well-known DNS interfaces to browse the network for services. 
YARN-913 in fact talked about such a DNS-based mechanism but left it as a 
future task. (Task) Having the registry information exposed via DNS simplifies 
the life of services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4758) Enable discovery of AMs by containers

2016-03-03 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4758:
-

 Summary: Enable discovery of AMs by containers
 Key: YARN-4758
 URL: https://issues.apache.org/jira/browse/YARN-4758
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli


{color:red}
This is already discussed on the umbrella JIRA YARN-1489.

Copying some of my condensed summary from the design doc (section 3.2.10.3) of 
YARN-4692.
{color}

Even after the existing work on work-preserving AM restart (Section 3.1.2 / 
YARN-1489), we still haven’t solved the problem of old running containers not 
knowing where the new AM starts running after the previous AM crashes. This is 
an especially important problem to solve for long-running services, where 
we’d like to avoid killing service containers when AMs fail over. So far, we 
have left this as a task for the apps, but solving it in YARN is much more 
desirable. (Task) This looks very much like the service registry (YARN-913), 
but for app containers to discover their own AMs.

Combining this requirement (of any container being able to find its AM across 
failovers) with those of services (to be able to find through DNS where a 
service container is running - YARN-4757) will push our registry scalability 
needs much higher than those of just service endpoints. This calls for a 
more distributed solution for registry readers - something that is discussed in 
the comments section of YARN-1489 and MAPREDUCE-6608.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4762) NMs failing on DelegatingLinuxContainerRuntime init with LCE on

2016-03-03 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4762:
-

 Summary: NMs failing on DelegatingLinuxContainerRuntime init with 
LCE on
 Key: YARN-4762
 URL: https://issues.apache.org/jira/browse/YARN-4762
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli


Seeing this exception and the NMs crash.
{code}
2016-03-03 16:47:57,807 DEBUG org.apache.hadoop.service.AbstractService: 
Service 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService 
is started
2016-03-03 16:47:58,027 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: 
checkLinuxExecutorSetup: 
[/hadoop/hadoop-yarn-nodemanager/bin/container-executor, --checksetup]
2016-03-03 16:47:58,043 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
 Mount point Based on mtab file: /proc/mounts. Controller mount point not 
writable for: cpu
2016-03-03 16:47:58,043 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
 Unable to get cgroups handle.
2016-03-03 16:47:58,044 DEBUG org.apache.hadoop.service.AbstractService: 
noteFailure org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to 
initialize container executor
2016-03-03 16:47:58,044 INFO org.apache.hadoop.service.AbstractService: Service 
NodeManager failed in state INITED; cause: 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
container executor
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
container executor
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
Caused by: java.io.IOException: Failed to initialize linux container runtime(s)!
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
... 3 more
2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: 
Service: NodeManager entered state STOPPED
2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.CompositeService: 
NodeManager: stopping services, size=0
2016-03-03 16:47:58,047 DEBUG org.apache.hadoop.service.AbstractService: 
Service: 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService 
entered state STOPPED
2016-03-03 16:47:58,047 FATAL 
org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting 
NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize 
container executor
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:240)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:539)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:587)
Caused by: java.io.IOException: Failed to initialize linux container runtime(s)!
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:207)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:238)
... 3 more
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4770) Auto-restart of containers should work across NM restarts.

2016-03-07 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4770:
-

 Summary: Auto-restart of containers should work across NM restarts.
 Key: YARN-4770
 URL: https://issues.apache.org/jira/browse/YARN-4770
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


See my comment 
[here|https://issues.apache.org/jira/browse/YARN-3998?focusedCommentId=15133367&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133367]
 on YARN-3998. Need to take care of two things:
 - The relaunch feature needs to work across NM restarts, so we should save the 
retry-context and policy per container into the state-store and reload them so 
relaunching can continue after an NM restart.
 - We should also handle restarting of any containers that may have crashed 
during the NM reboot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4791) Per user blacklist node for user specific error for container launch failure.

2016-03-11 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-4791.
---
Resolution: Duplicate

Closing as duplicate.

> Per user blacklist node for user specific error for container launch failure.
> -
>
> Key: YARN-4791
> URL: https://issues.apache.org/jira/browse/YARN-4791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Reporter: Junping Du
>Assignee: Junping Du
>
> There are some user-specific errors for container launch failures. For 
> example, when the LinuxContainerExecutor is enabled but a node does not have 
> the user, the container launch fails with the following information:
> {noformat}
> 2016-02-14 15:37:03,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1434045496283_0036_02 State change from LAUNCHED to FAILED 
> 2016-02-14 15:37:03,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1434045496283_0036 failed 2 times due to AM Container for 
> appattempt_1434045496283_0036_02 exited with exitCode: -1000 due to: 
> Application application_1434045496283_0036 initialization failed 
> (exitCode=255) with output: User jdu not found 
> {noformat}
> Obviously, such a node is not suitable for launching containers for this user's 
> other applications. We need a per-user blacklist tracking mechanism rather than 
> a per-application one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4793) [Umbrella] Simplified API layer for services and beyond

2016-03-11 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4793:
-

 Summary: [Umbrella] Simplified API layer for services and beyond
 Key: YARN-4793
 URL: https://issues.apache.org/jira/browse/YARN-4793
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


[See overview doc at YARN-4692, modifying and copy-pasting some of the relevant 
pieces and sub-section 3.3.2 to track the specific sub-item.]

Bringing a new service onto YARN today is not a simple experience. The APIs of 
existing frameworks are either too low-level (native YARN), require writing 
new code (for frameworks with programmatic APIs), or require writing a complex 
spec (for declarative frameworks).

In addition to building critical building blocks inside YARN (as part of other 
efforts at YARN-4692), we should also look at simplifying the user-facing story 
for building services. The experience of projects like Slider running real-life 
services like HBase, Storm, Accumulo, Solr etc. gives us some very good 
lessons on what simplified APIs for building services should look like.

To this end, we should look at a new simple-services API layer backed by REST 
interfaces. The REST layer can act as a single point of entry for the creation 
and lifecycle management of YARN services. Services here can range from simple 
single-component apps to the most complex, multi-component applications with 
special orchestration needs.
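
Purely to illustrate the kind of single-entry-point experience being proposed, 
here is a hypothetical REST call; the endpoint, host, port and payload below 
are made up for illustration and are not an existing API:

{code}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SimpleServiceSubmitExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical spec and endpoint - one POST to create and start a service.
    String spec = "{ \"name\": \"hbase-on-yarn\", "
        + "\"components\": [ { \"name\": \"regionserver\", \"instances\": 3 } ] }";
    URL url = new URL("http://rm-host:8088/services/v1/applications");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(spec.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP status: " + conn.getResponseCode());
  }
}
{code}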

We should also look at making this a unified REST-based entry point for other 
important features like resource-profile management (YARN-3926), 
package-definition lifecycle-management and service-discovery (YARN-913 / 
YARN-4757). We also need to flesh out its relation to YARN's present, much 
lower-level REST APIs (YARN-1695) for application submission and management.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4837) User facing aspects of 'AM blacklisting' feature need fixing

2016-03-19 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4837:
-

 Summary: User facing aspects of 'AM blacklisting' feature need 
fixing
 Key: YARN-4837
 URL: https://issues.apache.org/jira/browse/YARN-4837
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


I was reviewing the user-facing aspects that we are releasing as part of 2.8.0.

Looking at the 'AM blacklisting' feature, I see several things that need to be 
fixed before we release it in 2.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4902) [Umbrella] Generalized and unified scheduling-strategies in YARN

2016-03-30 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-4902:
-

 Summary: [Umbrella] Generalized and unified scheduling-strategies 
in YARN
 Key: YARN-4902
 URL: https://issues.apache.org/jira/browse/YARN-4902
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan


Apache Hadoop YARN's ResourceRequest mechanism is the core of YARN's 
scheduling API for applications. It is a powerful API for applications 
(specifically ApplicationMasters) to indicate to YARN what size of containers 
are needed, where in the cluster they should run, and so on.
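
For reference, a minimal sketch of today's ResourceRequest API as an AM would 
use it; the sizes, priority and container count are illustrative:

{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class ResourceRequestExample {
  public static ResourceRequest anywhereRequest() {
    // Ask for 10 containers of 2 GB / 1 vcore at priority 1, anywhere in the
    // cluster (ResourceRequest.ANY, i.e. "*").
    Resource capability = Resource.newInstance(2048, 1);
    return ResourceRequest.newInstance(
        Priority.newInstance(1), ResourceRequest.ANY, capability, 10);
  }
}
{code}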

However, a host of new feature requirements is making the API increasingly 
complex and difficult for users to understand, and very complicated to 
implement within the code-base.

This JIRA aims to generalize and unify all such scheduling-strategies in YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4919) Yarn logs should support a option to output logs as compressed archive

2016-04-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-4919.
---
Resolution: Duplicate

This is a dup of YARN-1134, closing it as such.

> Yarn logs should support a option to output logs as compressed archive
> --
>
> Key: YARN-4919
> URL: https://issues.apache.org/jira/browse/YARN-4919
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3052) [Data Serving] Provide a very simple POC html ATS UI

2016-04-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3052.
---
Resolution: Duplicate

Looks like this is a dup of YARN-4097, please reopen if you disagree. Tx.

> [Data Serving] Provide a very simple POC html ATS UI
> 
>
> Key: YARN-3052
> URL: https://issues.apache.org/jira/browse/YARN-3052
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>
> As part of accomplishing a minimum viable product, we want to be able to show 
> some UI in html (however crude it is). This subtask calls for creating a 
> barebones UI to do that.
> This should be replaced later with a better-designed and implemented proper 
> UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-5079) [Umbrella] Native YARN framework layer for services and beyond

2016-05-12 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-5079:
-

 Summary: [Umbrella] Native YARN framework layer for services and 
beyond
 Key: YARN-5079
 URL: https://issues.apache.org/jira/browse/YARN-5079
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


(See overview doc at YARN-4692, modifying and copy-pasting some of the relevant 
pieces and sub-section 3.3.1 to track the specific sub-item.)

(This is a companion to YARN-4793 in our effort to simplify the entire story, 
but focusing on APIs)

So far, YARN by design has restricted itself to having a very low-level API 
that can support any type of application. Frameworks like Apache Hadoop 
MapReduce, Apache Tez, Apache Spark, Apache REEF, Apache Twill, Apache Helix 
and others ended up exposing higher-level APIs that end-users can directly 
leverage to build their applications on top of YARN. On the services side, 
Apache Slider has done something similar.

With our current attention on making services first-class and simplified, it's 
time to take a fresh look at how we can make Apache Hadoop YARN support 
services well out of the box. Beyond the functionality that I outlined in the 
previous sections of the doc on how NodeManagers can be enhanced to help 
services, the biggest missing piece is the framework itself. There is a lot of 
very important functionality that a services framework can own, together with 
YARN, in executing services end-to-end.

In this JIRA I propose we look at having a native Apache Hadoop framework for 
running services natively on YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5075) Fix findbugs warning in hadoop-yarn-common module

2016-05-16 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-5075.
---
Resolution: Duplicate

Going to reopen YARN-4412 - that's better than filing new tickets.

> Fix findbugs warning in hadoop-yarn-common module
> -
>
> Key: YARN-5075
> URL: https://issues.apache.org/jira/browse/YARN-5075
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Akira AJISAKA
> Attachments: findbugs.html
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-4919) Yarn logs should support a option to output logs as compressed archive

2016-05-20 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-4919.
---
Resolution: Won't Fix

> Yarn logs should support a option to output logs as compressed archive
> --
>
> Key: YARN-4919
> URL: https://issues.apache.org/jira/browse/YARN-4919
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5122) "yarn logs" for running containers should print an explicit footer saying that the log may be incomplete

2016-05-20 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-5122:
-

 Summary: "yarn logs" for running containers should print an 
explicit footer saying that the log may be incomplete
 Key: YARN-5122
 URL: https://issues.apache.org/jira/browse/YARN-5122
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


We can have a footer of the sort {quote}[This log file belongs to a running 
container and so may not be complete..]{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5312) Parameter 'size' in the webservices "/containerlogs/$containerid/$filename" and in AHSWebServices is semantically confusing

2016-07-05 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-5312:
-

 Summary: Parameter 'size' in the webservices 
"/containerlogs/$containerid/$filename" and in AHSWebServices is semantically 
confusing
 Key: YARN-5312
 URL: https://issues.apache.org/jira/browse/YARN-5312
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli


This got added in YARN-5088 and I found this while reviewing YARN-5224.
bq. Also, the parameter 'size' in the API 
"/containerlogs/$containerid/$filename" and similarly in AHSWebServices is 
confusing with semantics. I think we are better off with an offset and size.

An offset (in bytes, positive to indicate from the start and negative to 
indicate from the end) together with a size (in bytes) indicating how much to 
read from the offset is a better combination - for comparison, this is how most 
file-system APIs look. A sketch of the proposed semantics is below.
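
For illustration only - the class and method below are hypothetical, not 
existing code - this is how the proposed (offset, size) pair could be resolved 
into a byte range:

{code}
/** Hypothetical helper resolving (offset, size) into an absolute byte range. */
public final class LogRange {
  /**
   * A non-negative offset counts from the start of the file, a negative
   * offset counts back from the end, and size (assumed >= 0) caps how many
   * bytes to read. Returns {start, end} with start inclusive, end exclusive.
   */
  public static long[] resolve(long offset, long size, long fileLength) {
    long start = offset >= 0
        ? Math.min(offset, fileLength)
        : Math.max(fileLength + offset, 0L);
    // Avoid overflow when size is very large.
    long end = (size >= fileLength - start) ? fileLength : start + size;
    return new long[] { start, end };
  }
}
{code}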

I can also imagine number of lines as a better unit than bytes for offset and 
size - perhaps yet another ticket.

/cc [~vvasudev].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-3561) Non-AM Containers continue to run even after AM is stopped

2016-07-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3561.
---
Resolution: Duplicate

Duplicate of HADOOP-12317.

> Non-AM Containers continue to run even after AM is stopped
> --
>
> Key: YARN-3561
> URL: https://issues.apache.org/jira/browse/YARN-3561
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.0
> Environment: debian 7
>Reporter: Gour Saha
>Priority: Critical
> Attachments: app0001.zip, application_1431771946377_0001.zip
>
>
> Non-AM containers continue to run even after application is stopped. This 
> occurred while deploying Storm 0.9.3 using Slider (0.60.0 and 0.70.1) in a 
> Hadoop 2.6 deployment. 
> Following are the NM logs from 2 different nodes:
> *host-07* - where Slider AM was running
> *host-03* - where Storm NIMBUS container was running.
> *Note:* The logs are partial, starting with the time when the relevant Slider 
> AM and NIMBUS containers were allocated, till the time when the Slider AM was 
> stopped. Also, the large number of "Memory usage" log lines were removed 
> keeping only a few starts and ends of every segment.
> *NM log from host-07 where Slider AM container was running:*
> {noformat}
> 2015-04-29 00:39:24,614 INFO  monitor.ContainersMonitorImpl 
> (ContainersMonitorImpl.java:run(356)) - Stopping resource-monitoring for 
> container_1428575950531_0020_02_01
> 2015-04-29 00:41:10,310 INFO  ipc.Server (Server.java:saslProcess(1306)) - 
> Auth successful for appattempt_1428575950531_0021_01 (auth:SIMPLE)
> 2015-04-29 00:41:10,322 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:startContainerInternal(803)) - Start request for 
> container_1428575950531_0021_01_01 by user yarn
> 2015-04-29 00:41:10,322 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:startContainerInternal(843)) - Creating a new 
> application reference for app application_1428575950531_0021
> 2015-04-29 00:41:10,323 INFO  application.Application 
> (ApplicationImpl.java:handle(464)) - Application 
> application_1428575950531_0021 transitioned from NEW to INITING
> 2015-04-29 00:41:10,325 INFO  nodemanager.NMAuditLogger 
> (NMAuditLogger.java:logSuccess(89)) - USER=yarn   IP=10.84.105.162
> OPERATION=Start Container Request   TARGET=ContainerManageImpl  
> RESULT=SUCCESS  APPID=application_1428575950531_0021
> CONTAINERID=container_1428575950531_0021_01_01
> 2015-04-29 00:41:10,328 WARN  logaggregation.LogAggregationService 
> (LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root 
> Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: 
> [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple 
> users.
> 2015-04-29 00:41:10,328 WARN  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:(182)) - rollingMonitorInterval is set as 
> -1. The log rolling mornitoring interval is disabled. The logs will be 
> aggregated after this application is finished.
> 2015-04-29 00:41:10,351 INFO  application.Application 
> (ApplicationImpl.java:transition(304)) - Adding 
> container_1428575950531_0021_01_01 to application 
> application_1428575950531_0021
> 2015-04-29 00:41:10,352 INFO  application.Application 
> (ApplicationImpl.java:handle(464)) - Application 
> application_1428575950531_0021 transitioned from INITING to RUNNING
> 2015-04-29 00:41:10,356 INFO  container.Container 
> (ContainerImpl.java:handle(999)) - Container 
> container_1428575950531_0021_01_01 transitioned from NEW to LOCALIZING
> 2015-04-29 00:41:10,357 INFO  containermanager.AuxServices 
> (AuxServices.java:handle(196)) - Got event CONTAINER_INIT for appId 
> application_1428575950531_0021
> 2015-04-29 00:41:10,357 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(203)) - Resource 
> hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/htrace-core-3.0.4.jar
>  transitioned from INIT to DOWNLOADING
> 2015-04-29 00:41:10,357 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(203)) - Resource 
> hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/jettison-1.1.jar
>  transitioned from INIT to DOWNLOADING
> 2015-04-29 00:41:10,358 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(203)) - Resource 
> hdfs://zsexp/user/yarn/.slider/cluster/storm1/tmp/application_1428575950531_0021/am/lib/api-util-1.0.0-M20.jar
>  transitioned from INIT to DOWNLOADING
> 2015-04-29 00:41:10,358 INFO  localizer.LocalizedResource 
> (LocalizedResource.java:handle(203)) - Resource 
> hdfs://zsexp/user/yarn/.slider/cluster/sto

[jira] [Created] (YARN-5363) For AM containers, or for containers of running-apps, "yarn logs" incorrectly only (tries to) shows syslog file-type by default

2016-07-12 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-5363:
-

 Summary: For AM containers, or for containers of running-apps, 
"yarn logs" incorrectly only (tries to) shows syslog file-type by default
 Key: YARN-5363
 URL: https://issues.apache.org/jira/browse/YARN-5363
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


For example, for a running application, the following happens:
{code}
# yarn logs -applicationId application_1467838922593_0001
16/07/06 22:07:05 INFO impl.TimelineClientImpl: Timeline service address: 
http://:8188/ws/v1/timeline/
16/07/06 22:07:06 INFO client.RMProxy: Connecting to ResourceManager at 
/:8050
16/07/06 22:07:07 INFO impl.TimelineClientImpl: Timeline service address: 
http://l:8188/ws/v1/timeline/
16/07/06 22:07:07 INFO client.RMProxy: Connecting to ResourceManager at 
/:8050
Can not find any log file matching the pattern: [syslog] for the container: 
container_e03_1467838922593_0001_01_01 within the application: 
application_1467838922593_0001
Can not find any log file matching the pattern: [syslog] for the container: 
container_e03_1467838922593_0001_01_02 within the application: 
application_1467838922593_0001
Can not find any log file matching the pattern: [syslog] for the container: 
container_e03_1467838922593_0001_01_03 within the application: 
application_1467838922593_0001
Can not find any log file matching the pattern: [syslog] for the container: 
container_e03_1467838922593_0001_01_04 within the application: 
application_1467838922593_0001
Can not find any log file matching the pattern: [syslog] for the container: 
container_e03_1467838922593_0001_01_05 within the application: 
application_1467838922593_0001
Can not find any log file matching the pattern: [syslog] for the container: 
container_e03_1467838922593_0001_01_06 within the application: 
application_1467838922593_0001
Can not find any log file matching the pattern: [syslog] for the container: 
container_e03_1467838922593_0001_01_07 within the application: 
application_1467838922593_0001
Can not find the logs for the application: application_1467838922593_0001 with 
the appOwner: 
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5393) [Umbrella] Optimize YARN tests runtime

2016-07-17 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-5393:
-

 Summary: [Umbrella] Optimize YARN tests runtime 
 Key: YARN-5393
 URL: https://issues.apache.org/jira/browse/YARN-5393
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Vinod Kumar Vavilapalli


When I originally merged MAPREDUCE-279 into Hadoop, *all* of the YARN tests used 
to take 10 minutes with pretty good coverage.

Now TestRMRestart alone takes that much time - we haven't been that great at 
writing pointed, short tests.

Time for an initiative to optimize YARN tests. And if, even after that, the 
suite takes too long, we go the MAPREDUCE-670 route.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-3504) TestRMRestart fails occasionally in trunk

2016-07-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3504.
---
Resolution: Duplicate

Looks like that to me too, closing as a dup of YARN-2871.

> TestRMRestart fails occasionally in trunk
> -
>
> Key: YARN-3504
> URL: https://issues.apache.org/jira/browse/YARN-3504
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Priority: Minor
>
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
> Stacktrace
> org.mockito.exceptions.verification.TooLittleActualInvocations: 
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1836) Add retry cache support in ResourceManager

2016-08-24 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-1836.
---
Resolution: Invalid

Old JIRA.

As [~xgong] [mentioned on 
YARN-1521|https://issues.apache.org/jira/browse/YARN-1521?focusedCommentId=13948528&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13948528],
 YARN doesn't need yet another RetryCache - StateStore + recovered apps already 
serve that purpose.

Closing this as invalid for now. We can reopen it in future as needed. Revert 
back if you disagree. Tx.

> Add retry cache support in ResourceManager
> --
>
> Key: YARN-1836
> URL: https://issues.apache.org/jira/browse/YARN-1836
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>
> HDFS-4942 supports RetryCache on NN. This JIRA tracks RetryCache on 
> ResourceManager. If the RPCs are non-idempotent, we should use RetryCache to 
> avoid returning incorrect failures to client.
> YARN-1521 is a related JIRA. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1763) Handle RM failovers during the submitApplication call.

2016-08-24 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-1763.
---
Resolution: Duplicate

> Handle RM failovers during the submitApplication call.
> --
>
> Key: YARN-1763
> URL: https://issues.apache.org/jira/browse/YARN-1763
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-3669) Attempt-failures validity interval should have a global admin-configurable lower limit

2015-05-17 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3669:
-

 Summary: Attempt-failures validity interval should have a global 
admin-configurable lower limit
 Key: YARN-3669
 URL: https://issues.apache.org/jira/browse/YARN-3669
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Found this while reviewing YARN-3480.

bq. When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a 
small value, the number of retried attempts might be very large, so we need to 
delete some of the attempts stored in the RMStateStore.
I think we need to have a lower limit on the failure-validity interval to avoid 
situations like this.

Having this will avoid pardoning too many failures in too short a duration.
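
A sketch of the intended clamping behaviour; the configuration key below is 
purely illustrative and is not an existing YARN property:

{code}
import org.apache.hadoop.conf.Configuration;

/** Sketch: never let an app pick a validity interval below the admin limit. */
public final class ValidityIntervalLimiter {
  // Illustrative key only - not an existing YARN configuration property.
  private static final String MIN_VALIDITY_INTERVAL_MS =
      "yarn.resourcemanager.am.attempt-failures-validity-interval.min-ms";

  /** Clamp the app-requested validity interval to the admin lower limit. */
  public static long effectiveInterval(Configuration conf, long appRequestedMs) {
    long adminLowerLimitMs = conf.getLong(MIN_VALIDITY_INTERVAL_MS, 0L);
    return Math.max(appRequestedMs, adminLowerLimitMs);
  }
}
{code}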



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3682) Decouple PID-file management from ContainerExecutor

2015-05-19 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3682:
-

 Summary: Decouple PID-file management from ContainerExecutor
 Key: YARN-3682
 URL: https://issues.apache.org/jira/browse/YARN-3682
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


The PID-file management currently present in ContainerExecutor really doesn't 
belong there. I know the original history of why we added it there - at that 
point in time it was about the only reasonable place to put it.

Given the evolution of executors for Windows etc, the ContainerExecutor is 
getting more complicated than is necessary.

We should pull the PID-file management into its own entity.
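
A rough sketch of what such an entity could look like; the interface and method 
names are hypothetical and would be settled in the actual patch:

{code}
import java.io.IOException;
import java.nio.file.Path;

/** Hypothetical abstraction owning PID-file lifecycle, outside ContainerExecutor. */
public interface PidFileManager {
  /** Remember where the PID file for a container lives. */
  void registerPidFile(String containerId, Path pidFile);

  /** Read the process id for a container, waiting for the file if needed. */
  String getProcessId(String containerId) throws IOException;

  /** Stop tracking and delete the PID file once the container exits. */
  void deletePidFile(String containerId) throws IOException;
}
{code}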



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3683) Create an abstraction in ContainerExecutor for Container-script generation

2015-05-19 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3683:
-

 Summary: Create an abstraction in ContainerExecutor for 
Container-script generation
 Key: YARN-3683
 URL: https://issues.apache.org/jira/browse/YARN-3683
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Before YARN-1964, container-script generation was fundamentally driven by the 
ContainerLaunch object. After YARN-1964, this got pulled into ContainerExecutor 
via the {{writeLaunchEnv()}} method, which looks like an API but isn't one.

In addition, DefaultContainerExecutor itself has a plugin, 
{{LocalWrapperScriptBuilder}}, which does much the same thing, but only for 
Linux/Windows.

We need to have a common API to override the script generation for 
Linux/Windows/Docker etc.
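
Roughly, the common override point could look like the following; the names 
here are illustrative, not an existing interface:

{code}
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import java.util.Map;

/** Hypothetical common API for platform-specific launch-script generation. */
public interface ContainerScriptWriter {
  /** Write the environment setup and the launch command for one container. */
  void writeLaunchScript(OutputStream out,
                         Map<String, String> environment,
                         List<String> command) throws IOException;
}
{code}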



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3685) NodeManager unnecessarily knows about classpath-jars due to Windows limitations

2015-05-19 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3685:
-

 Summary: NodeManager unnecessarily knows about classpath-jars due 
to Windows limitations
 Key: YARN-3685
 URL: https://issues.apache.org/jira/browse/YARN-3685
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Found this while looking at cleaning up ContainerExecutor via YARN-3648; making 
it a sub-task.

YARN *should not* know about classpaths. Our original design was modeled around 
this. But when we added Windows support, we ended up breaking this abstraction 
via YARN-316 due to classpath issues. We should clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3704) Container Launch fails with exitcode 127 with DefaultContainerExecutor

2015-05-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3704.
---
Resolution: Invalid

It is almost certainly an app issue (exit code 127 from the shell usually means 
the launched command was not found), so I am closing this as invalid; please 
reopen if you disagree. If you want help debugging this, post it on the user 
lists.

I left YARN-3703 open to see if we can improve our error messages.

> Container Launch fails with exitcode 127 with DefaultContainerExecutor
> --
>
> Key: YARN-3704
> URL: https://issues.apache.org/jira/browse/YARN-3704
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0, 2.7.0
>Reporter: Devaraj K
>Priority: Minor
>
> Please find the below NM log when the issue occurs.
> {code:xml}
> 2015-05-22 08:08:53,165 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1432208816246_0930_01_37 is : 127
> 2015-05-22 08:08:53,166 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
> from container-launch with container ID: 
> container_1432208816246_0930_01_37 and exit code: 127
> ExitCodeException exitCode=127:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> at org.apache.hadoop.util.Shell.run(Shell.java:456)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from 
> container-launch.
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: 
> container_1432208816246_0930_01_37
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 127
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: 
> ExitCodeException exitCode=127:
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> org.apache.hadoop.util.Shell.run(Shell.java:456)
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> 2015-05-22 08:08:53,179 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 2015-05-22 08:08:53,180 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 2015-05-22 08:08:53,180 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 2015-05-22 08:08:53,180 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 2015-05-22 08:08:53,180 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 2015-05-22 08:08:53,180 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
> java.lang.Thread.run(Thread.java:745)
> 2015-05-22 08:08:53,180 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container exited with a non-zero exit code 127
> 2015-05-22 08:08:53,180 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.C

[jira] [Resolved] (YARN-3715) Oozie jobs are failed with IllegalArgumentException: Does not contain a valid host:port authority: maprfs:/// (configuration property 'yarn.resourcemanager.address') on s

2015-05-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3715.
---

I doubt this is a real bug - it is more likely a configuration/environment issue.

In any case, please first try to resolve user issues in the user mailing lists 
(http://hadoop.apache.org/mailing_lists.html).

The JIRA is a place to address existing bugs/new features in the project. 
Closing this for now. Thanks.

> Oozie jobs are failed with IllegalArgumentException: Does not contain a valid 
> host:port authority: maprfs:/// (configuration property 
> 'yarn.resourcemanager.address') on secure cluster with RM HA
> --
>
> Key: YARN-3715
> URL: https://issues.apache.org/jira/browse/YARN-3715
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Sergey Svinarchuk
>
> 2015-05-21 16:06:55,887  WARN ActionStartXCommand:544 -
> SERVER[centos6.localdomain] USER[mapr] GROUP[-] TOKEN[] APP[Hive]
> JOB[001-150521123655733-oozie-mapr-W]
> ACTION[001-150521123655733-oozie-mapr-W@Hive] Error starting action 
> [Hive].
> ErrorType [ERROR], ErrorCode [IllegalArgumentException], Message
> [IllegalArgumentException: Does not contain a valid host:port authority:
> maprfs:/// (configuration property 'yarn.resourcemanager.address')]
> org.apache.oozie.action.ActionExecutorException: IllegalArgumentException: 
> Does
> not contain a valid host:port authority: maprfs:/// (configuration property
> 'yarn.resourcemanager.address')
> at
> org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:401)
> at
> org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:979)
> at
> org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1134)
> at
> org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:228)
> at
> org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
> at org.apache.oozie.command.XCommand.call(XCommand.java:281)
> at
> org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:323)
> at
> org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:252)
> at
> org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: Does not contain a valid
> host:port authority: maprfs:/// (configuration property
> 'yarn.resourcemanager.address')
> at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:211)
> at
> org.apache.hadoop.conf.Configuration.getSocketAddr(Configuration.java:1788)
> at org.apache.hadoop.mapred.Master.getMasterAddress(Master.java:58)
> at org.apache.hadoop.mapred.Master.getMasterPrincipal(Master.java:67)
> at
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:114)
> at
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
> at
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
> at
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)
> at
> org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:460)
> at
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
> at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
> at
> org.apache.oozie.action.had

[jira] [Created] (YARN-3720) Need comprehensive documentation for configuring CPU/memory resources on NodeManager

2015-05-26 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3720:
-

 Summary: Need comprehensive documentation for configuring 
CPU/memory resources on NodeManager
 Key: YARN-3720
 URL: https://issues.apache.org/jira/browse/YARN-3720
 Project: Hadoop YARN
  Issue Type: Task
  Components: documentation, nodemanager
Reporter: Vinod Kumar Vavilapalli


Things are getting more and more complex after the likes of YARN-160. We need a 
document explaining how to configure CPU/memory values on a NodeManager.
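
As a starting point, the two most commonly tuned NodeManager resource 
properties, shown programmatically for brevity; the values are only examples 
and these settings normally live in yarn-site.xml:

{code}
import org.apache.hadoop.conf.Configuration;

public class NodeManagerResourceConfig {
  public static Configuration exampleConfig() {
    Configuration conf = new Configuration();
    // Total physical memory (MB) the NM can hand out to containers.
    conf.setInt("yarn.nodemanager.resource.memory-mb", 8192);
    // Total virtual cores the NM can hand out to containers.
    conf.setInt("yarn.nodemanager.resource.cpu-vcores", 8);
    return conf;
  }
}
{code}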



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3744) ResourceManager should avoid allocating AM to same node repeatedly in case of AM launch failures

2015-05-29 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3744.
---
Resolution: Duplicate

I think this is a dup of YARN-2005. Please reopen if you disagree.

> ResourceManager should avoid allocating AM to same node repeatedly in case of 
> AM launch failures
> 
>
> Key: YARN-3744
> URL: https://issues.apache.org/jira/browse/YARN-3744
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jaideep Dhok
>
> We have seen that if the AM launch fails on some node due to a configuration or 
> bad disk issue [YARN-3591], quite often it gets reallocated to the same node, 
> causing job failures if the AM attempt limit is reached.
> It would be preferable if the scheduler could try to allocate the AM on 
> different nodes for subsequent attempts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3492) AM fails to come up because RM and NM can't connect to each other

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3492.
---
Resolution: Cannot Reproduce

Closing this based on previous comments. Please reopen this in case you run 
into it again.

> AM fails to come up because RM and NM can't connect to each other
> -
>
> Key: YARN-3492
> URL: https://issues.apache.org/jira/browse/YARN-3492
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
> Environment: pseudo-distributed cluster on a mac
>Reporter: Karthik Kambatla
>Priority: Blocker
> Attachments: mapred-site.xml, 
> yarn-kasha-nodemanager-kasha-mbp.local.log, 
> yarn-kasha-resourcemanager-kasha-mbp.local.log, yarn-site.xml
>
>
> Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The 
> container gets allocated, but doesn't get launched. The NM can't talk to the 
> RM. Logs to follow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3828) Add a flag in container to indicate whether it's an AM container or not

2015-06-18 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3828.
---
Resolution: Duplicate

We usually close the newly filed JIRAs as dups.

I am closing this as a dup of YARN-3116 and assigning that to 
[~giovanni.fumarola].

> Add a flag in container to indicate whether it's an AM container or not 
> 
>
> Key: YARN-3828
> URL: https://issues.apache.org/jira/browse/YARN-3828
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Giovanni Matteo Fumarola
>
> YARN-2022 adds a flag in RMContainer to indicate whether the container is an 
> AM or not to skip pre-emption of AM containers. This JIRA proposes 
> propagating this information to NMs as it's required by the AMRMProxy 
> (YARN-2884/YARN-3666) to identify if the container is an AM for:
> 1) Security - for authorizing only AMs to communicate with RMs. For e.g.: 
> this is useful to prevent DDos attacks by all containers of a malicious app.
> 2) Federation - to allow for transparently spanning an app across multiple 
> sub-clusters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4154) Tez Build with hadoop 2.6.1 fails due to MiniYarnCluster change

2015-09-16 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-4154.
---
Resolution: Duplicate

Tx [~ajisakaa], not sure how I missed it earlier, I was careful about these 
reverts.

Anyways, the new commit applies cleanly.

Ran compilation and TestJobHistoryEventHandler, TestMRTimelineEventHandling, 
TestDistributedShell, TestMiniYarnCluster before the push.

Closing this correctly as a duplicate.

> Tez Build with hadoop 2.6.1 fails due to MiniYarnCluster change
> ---
>
> Key: YARN-4154
> URL: https://issues.apache.org/jira/browse/YARN-4154
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Jeff Zhang
>Priority: Blocker
>
> {code}
> [ERROR] 
> /mnt/nfs0/jzhang/tez-autobuild/tez/tez-plugins/tez-yarn-timeline-history/src/test/java/org/apache/tez/tests/MiniTezClusterWithTimeline.java:[92,5]
>  no suitable constructor found for 
> MiniYARNCluster(java.lang.String,int,int,int,int,boolean)
> constructor 
> org.apache.hadoop.yarn.server.MiniYARNCluster.MiniYARNCluster(java.lang.String,int,int,int,int)
>  is not applicable
>   (actual and formal argument lists differ in length)
> constructor 
> org.apache.hadoop.yarn.server.MiniYARNCluster.MiniYARNCluster(java.lang.String,int,int,int)
>  is not applicable
>   (actual and formal argument lists differ in length)
> {code}
> MR might have the same issue.
> \cc [~vinodkv]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-3700) ATS Web Performance issue at load time when large number of jobs

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-3700.
-

> ATS Web Performance issue at load time when large number of jobs
> 
>
> Key: YARN-3700
> URL: https://issues.apache.org/jira/browse/YARN-3700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager, webapp, yarn
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.8.0, 2.7.2
>
> Attachments: YARN-3700-branch-2.6.1.txt, YARN-3700-branch-2.7.2.txt, 
> YARN-3700.1.patch, YARN-3700.2.1.patch, YARN-3700.2.2.patch, 
> YARN-3700.2.patch, YARN-3700.3.patch, YARN-3700.4.patch
>
>
> Currently, we load all the apps when we try to load the YARN 
> timelineservice web page. If we have a large number of jobs, it will be very 
> slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-3999) RM hangs on draining events

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-3999.
-

> RM hangs on draining events
> ---
>
> Key: YARN-3999
> URL: https://issues.apache.org/jira/browse/YARN-3999
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: YARN-3999-branch-2.6.1.txt, YARN-3999-branch-2.7.patch, 
> YARN-3999.1.patch, YARN-3999.2.patch, YARN-3999.2.patch, YARN-3999.3.patch, 
> YARN-3999.4.patch, YARN-3999.5.patch, YARN-3999.patch, YARN-3999.patch
>
>
> If external systems like ATS or ZK become very slow, draining all the 
> events takes a lot of time. If this time becomes larger than 10 minutes, all 
> applications will expire. Fixes include:
> 1. add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move ATS service out from RM active service so that RM doesn't need to 
> wait for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients 
> get fast notification that RM is stopping/transitioning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-2890) MiniYarnCluster should turn on timeline service if configured to do so

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-2890.
-

> MiniYarnCluster should turn on timeline service if configured to do so
> --
>
> Key: YARN-2890
> URL: https://issues.apache.org/jira/browse/YARN-2890
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.8.0, 2.7.2
>
> Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, 
> YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, 
> YARN-2890.patch, YARN-2890.patch
>
>
> Currently the MiniMRYarnCluster does not consider the configuration value for 
> enabling timeline service before starting. The MiniYarnCluster should only 
> start the timeline service if it is configured to do so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-3740) Fixed the typo with the configuration name: APPLICATION_HISTORY_PREFIX_MAX_APPS

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-3740.
-

> Fixed the typo with the configuration name: 
> APPLICATION_HISTORY_PREFIX_MAX_APPS
> ---
>
> Key: YARN-3740
> URL: https://issues.apache.org/jira/browse/YARN-3740
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager, webapp, yarn
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.8.0, 2.7.2
>
> Attachments: YARN-3740.1.patch
>
>
> YARN-3700 introduces a new configuration named 
> APPLICATION_HISTORY_PREFIX_MAX_APPS, which needs to be changed to 
> APPLICATION_HISTORY_MAX_APPS. 
> This is not an incompatible change since YARN-3700 is only in 2.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-3248) Display count of nodes blacklisted by apps in the web UI

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-3248.
-

> Display count of nodes blacklisted by apps in the web UI
> 
>
> Key: YARN-3248
> URL: https://issues.apache.org/jira/browse/YARN-3248
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.8.0, 2.7.2
>
> Attachments: All applications.png, App page.png, Screenshot.jpg, 
> YARN-3248-branch-2.6.1.txt, YARN-3248-branch-2.7.2.txt, 
> apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, 
> apache-yarn-3248.3.patch, apache-yarn-3248.4.patch
>
>
> It would be really useful when debugging app performance and failure issues 
> to get a count of the nodes blacklisted by individual apps displayed in the 
> web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-4047) ClientRMService getApplications has high scheduler lock contention

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-4047.
-

> ClientRMService getApplications has high scheduler lock contention
> --
>
> Key: YARN-4047
> URL: https://issues.apache.org/jira/browse/YARN-4047
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: YARN-4047.001.patch
>
>
> The getApplications call can be particularly expensive because the code can 
> call checkAccess on every application being tracked by the RM.  checkAccess 
> will often call scheduler.checkAccess which will grab the big scheduler lock. 
>  This can cause a lot of contention with the scheduler thread, which is busy 
> trying to process node heartbeats, app allocation requests, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-3990.
-

> AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is 
> connected/disconnected
> 
>
> Key: YARN-3990
> URL: https://issues.apache.org/jira/browse/YARN-3990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Bibin A Chundatt
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: 0001-YARN-3990.patch, 0002-YARN-3990.patch, 
> 0003-YARN-3990.patch
>
>
> Whenever a node is added or removed, NodesListManager sends an 
> RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for 
> finished/killed/failed applications it is not required to send these events. An 
> additional check for whether the app is finished/killed/failed would minimize 
> the unnecessary events.
> {code}
>   public void handle(NodesListManagerEvent event) {
> RMNode eventNode = event.getNode();
> switch (event.getType()) {
> case NODE_UNUSABLE:
>   LOG.debug(eventNode + " reported unusable");
>   unusableRMNodesConcurrentSet.add(eventNode);
>   for(RMApp app: rmContext.getRMApps().values()) {
> this.rmContext
> .getDispatcher()
> .getEventHandler()
> .handle(
> new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
> RMAppNodeUpdateType.NODE_UNUSABLE));
>   }
>   break;
> case NODE_USABLE:
>   if (unusableRMNodesConcurrentSet.contains(eventNode)) {
> LOG.debug(eventNode + " reported usable");
> unusableRMNodesConcurrentSet.remove(eventNode);
>   }
>   for (RMApp app : rmContext.getRMApps().values()) {
> this.rmContext
> .getDispatcher()
> .getEventHandler()
> .handle(
> new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
> RMAppNodeUpdateType.NODE_USABLE));
>   }
>   break;
> default:
>   LOG.error("Ignoring invalid eventtype " + event.getType());
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-3978) Configurably turn off the saving of container info in Generic AHS

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-3978.
-

> Configurably turn off the saving of container info in Generic AHS
> -
>
> Key: YARN-3978
> URL: https://issues.apache.org/jira/browse/YARN-3978
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver, yarn
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>  Labels: 2.6.1-candidate
> Fix For: 3.0.0, 2.6.1, 2.8.0, 2.7.2
>
> Attachments: YARN-3978.001.patch, YARN-3978.002.patch, 
> YARN-3978.003.patch, YARN-3978.004.patch
>
>
> Depending on how each application's metadata is stored, one week's worth of 
> data stored in the Generic Application History Server's database can grow to 
> be almost a terabyte of local disk space. In order to alleviate this, I 
> suggest that there is a need for a configuration option to turn off saving of 
> non-AM container metadata in the GAHS data store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI after YARN-3272 / YARN-3262

2015-09-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-3301.
-

> Fix the format issue of the new RM web UI and AHS web UI after YARN-3272 / 
> YARN-3262
> 
>
> Key: YARN-3301
> URL: https://issues.apache.org/jira/browse/YARN-3301
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Fix For: 2.7.1
>
> Attachments: Screen Shot 2015-04-21 at 5.09.25 PM.png, Screen Shot 
> 2015-04-21 at 5.38.39 PM.png, YARN-3301.1.patch, YARN-3301.2.patch, 
> YARN-3301.3.patch, YARN-3301.4.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2986) Support hierarchical and unified scheduler configuration

2014-12-23 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-2986:
-

 Summary: Support hierarchical and unified scheduler configuration
 Key: YARN-2986
 URL: https://issues.apache.org/jira/browse/YARN-2986
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Today's scheduler configuration is fragmented and non-intuitive, and needs to 
be improved. Details in comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2291) Timeline and RM web services should use same authentication code

2015-02-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2291.
---
Resolution: Duplicate

Closing with the right resolution..

> Timeline and RM web services should use same authentication code
> 
>
> Key: YARN-2291
> URL: https://issues.apache.org/jira/browse/YARN-2291
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
>
> The TimelineServer and the RM web services have very similar requirements and 
> implementation for authentication via delegation tokens apart from the fact 
> that the RM web services requires delegation tokens to be passed as a header. 
> They should use the same code base instead of different implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2292) RM web services should use hadoop-common for authentication using delegation tokens

2015-02-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2292.
---
Resolution: Duplicate

Closing with the right status.

> RM web services should use hadoop-common for authentication using delegation 
> tokens
> ---
>
> Key: YARN-2292
> URL: https://issues.apache.org/jira/browse/YARN-2292
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
>
> HADOOP-10771 refactors the WebHDFS authentication code to hadoop-common. 
> YARN-2290 will add support for passing delegation tokens via headers. Once 
> support is added RM web services should use the authentication code from 
> hadoop-common



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-671) Add an interface on the RM to move NMs into a maintenance state

2015-02-09 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-671.
--
Resolution: Duplicate

I'd risk a guess that [~sseth] originally opened this for use-cases already 
covered by YARN-914, so I am closing this as a duplicate.

Others, please reopen it if there are other use-cases for this.

> Add an interface on the RM to move NMs into a maintenance state
> ---
>
> Key: YARN-671
> URL: https://issues.apache.org/jira/browse/YARN-671
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.0.4-alpha
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3228) Deadlock altering user resource queue

2015-02-19 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3228.
---
Resolution: Incomplete

Not sure how/why this is related to Hadoop. In any case, please first try to 
resolve user issues in the user mailing lists 
(http://hadoop.apache.org/mailing_lists.html).

The JIRA is a place to address existing bugs/new features in the project. 
Closing this for now. Thanks.

> Deadlock altering user resource queue
> -
>
> Key: YARN-3228
> URL: https://issues.apache.org/jira/browse/YARN-3228
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager, scheduler
>Affects Versions: 2.0.1-alpha
> Environment: hadoop yarn, postgresql 
>Reporter: Christian Hott
>Priority: Blocker
>  Labels: newbie
>   Original Estimate: 203h
>  Remaining Estimate: 203h
>
> Let me introduce my problem:
> All of this began after we created some resource queues on PostgreSQL; we 
> created them, assigned them to the users, and all was fine...
> until we ran a process (a large iterative query) and I did an ALTER ROLE over 
> the user and the resource queue that the user was using. Then I could not log 
> in with the user and got a message saying "deadlock detection, locking 
> against self".
> Do you have any idea what causes this, or whether there is a comprehensible 
> log I can search for more information?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3283) Scheduler MinimumAllocation is not refreshable

2015-03-02 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3283:
-

 Summary: Scheduler MinimumAllocation is not refreshable
 Key: YARN-3283
 URL: https://issues.apache.org/jira/browse/YARN-3283
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Found this while reviewing YARN-3265. Scheduler's maximumAllocation is already 
refreshable and per queue. Maybe we should do the same for the 
minimum-allocation too?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3306) [Umbrella] Proposing per-queue Policy driven scheduling in YARN

2015-03-09 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3306:
-

 Summary: [Umbrella] Proposing per-queue Policy driven scheduling 
in YARN
 Key: YARN-3306
 URL: https://issues.apache.org/jira/browse/YARN-3306
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Scheduling layout in Apache Hadoop YARN today is very coarse grained. This 
proposal aims at converting today’s rigid scheduling in YARN to a per-queue 
policy-driven architecture.

We propose the creation of a common policy framework and the implementation of a 
common set of policies that administrators can pick and choose per queue:
 - Make scheduling policies configurable per queue
 - Initially, we limit ourselves to a new type of scheduling policy that 
determines the ordering of applications within the leaf queue
 - In the near future, we will also pursue parent-queue level policies and 
potential algorithm reuse through a separate type of policies that control 
resource limits per queue, user, application etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3332) [Umbrella] Unified Resource Statistics Collection per node

2015-03-10 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3332:
-

 Summary: [Umbrella] Unified Resource Statistics Collection per node
 Key: YARN-3332
 URL: https://issues.apache.org/jira/browse/YARN-3332
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Today in YARN, NodeManager collects statistics like per container resource 
usage and the overall physical resources available on the machine. Currently the 
NodeManager uses this internally for only limited purposes: automatically 
determining the capacity of resources on the node and enforcing memory usage to 
what is reserved per container.

This proposal is to extend the existing architecture and collect statistics for 
usage beyond the existing use-cases.

Proposal attached in comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3402) Security support for new timeline service.

2015-03-27 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3402.
---
Resolution: Duplicate

Dup of YARN-3053.

> Security support for new timeline service.
> --
>
> Key: YARN-3402
> URL: https://issues.apache.org/jira/browse/YARN-3402
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
>
> We should support YARN security for new TimelineService.
> Basically, there should be security token exchange between AM, NMs and 
> app-collectors to prevent anyone who knows the service address of 
> app-collector can post faked/unwanted information. Also, there should be 
> tokens exchange between app-collector/RMTimelineCollector and backend storage 
> (HBase, Phoenix, etc.) that enabling security.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3441) Introduce the notion of policies for a parent queue

2015-04-02 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3441:
-

 Summary: Introduce the notion of policies for a parent queue
 Key: YARN-3441
 URL: https://issues.apache.org/jira/browse/YARN-3441
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Similar to the policy being added in YARN-3318 for leaf queues, we need to 
extend this notion to parent queues too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3442) Consider abstracting out user, app limits etc into some sort of a LimitPolicy

2015-04-02 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-3442:
-

 Summary: Consider abstracting out user, app limits etc into some 
sort of a LimitPolicy
 Key: YARN-3442
 URL: https://issues.apache.org/jira/browse/YARN-3442
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Similar to the policies being added in YARN-3318 and YARN-3441 for leaf and 
parent queues, we should consider extracting an abstraction for limits too.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (YARN-2168) SCM/Client/NM/Admin protocols

2015-04-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli closed YARN-2168.
-

> SCM/Client/NM/Admin protocols
> -
>
> Key: YARN-2168
> URL: https://issues.apache.org/jira/browse/YARN-2168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Fix For: 2.7.0
>
> Attachments: YARN-2168-trunk-v1.patch, YARN-2168-trunk-v2.patch
>
>
> This jira is meant to be used to review the main shared cache APIs. They are 
> as follows:
> * ClientSCMProtocol - The protocol between the yarn client and the cache 
> manager. This protocol controls how resources in the cache are claimed and 
> released.
> ** UseSharedCacheResourceRequest
> ** UseSharedCacheResourceResponse
> ** ReleaseSharedCacheResourceRequest
> ** ReleaseSharedCacheResourceResponse
> * SCMAdminProtocol - This is an administrative protocol for the cache 
> manager. It allows administrators to manually trigger cleaner runs.
> ** RunSharedCacheCleanerTaskRequest
> ** RunSharedCacheCleanerTaskResponse
> * NMCacheUploaderSCMProtocol - The protocol between the NodeManager and the 
> cache manager. This allows the NodeManager to coordinate with the cache 
> manager when uploading new resources to the shared cache.
> ** NotifySCMRequest
> ** NotifySCMResponse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2168) SCM/Client/NM/Admin protocols

2015-04-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2168.
---
   Resolution: Duplicate
Fix Version/s: 2.7.0

Resolving this instead as a duplicate.

> SCM/Client/NM/Admin protocols
> -
>
> Key: YARN-2168
> URL: https://issues.apache.org/jira/browse/YARN-2168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Fix For: 2.7.0
>
> Attachments: YARN-2168-trunk-v1.patch, YARN-2168-trunk-v2.patch
>
>
> This jira is meant to be used to review the main shared cache APIs. They are 
> as follows:
> * ClientSCMProtocol - The protocol between the yarn client and the cache 
> manager. This protocol controls how resources in the cache are claimed and 
> released.
> ** UseSharedCacheResourceRequest
> ** UseSharedCacheResourceResponse
> ** ReleaseSharedCacheResourceRequest
> ** ReleaseSharedCacheResourceResponse
> * SCMAdminProtocol - This is an administrative protocol for the cache 
> manager. It allows administrators to manually trigger cleaner runs.
> ** RunSharedCacheCleanerTaskRequest
> ** RunSharedCacheCleanerTaskResponse
> * NMCacheUploaderSCMProtocol - The protocol between the NodeManager and the 
> cache manager. This allows the NodeManager to coordinate with the cache 
> manager when uploading new resources to the shared cache.
> ** NotifySCMRequest
> ** NotifySCMResponse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2654) Revisit all shared cache config parameters to ensure quality names

2015-04-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2654.
---
Resolution: Won't Fix

Closing as 'Won't Fix'

> Revisit all shared cache config parameters to ensure quality names
> --
>
> Key: YARN-2654
> URL: https://issues.apache.org/jira/browse/YARN-2654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
>Priority: Blocker
> Attachments: shared_cache_config_parameters.txt
>
>
> Revisit all the shared cache config parameters in YarnConfiguration and 
> yarn-default.xml to ensure quality names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3037) [Storage implementation] Create HBase cluster backing storage implementation for ATS writes

2015-04-27 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3037.
---
Resolution: Duplicate

This seems like a dup of YARN-3411. Closing, please reopen if you disagree.

> [Storage implementation] Create HBase cluster backing storage implementation 
> for ATS writes
> ---
>
> Key: YARN-3037
> URL: https://issues.apache.org/jira/browse/YARN-3037
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Vrushali C
>
> Per design in YARN-2928, create a backing storage implementation for ATS 
> writes based on a full HBase cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3036) [Storage implementation] Create standalone HBase backing storage implementation for ATS writes

2015-04-27 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3036.
---
Resolution: Duplicate

This seems like a dup of YARN-3411. Closing, please reopen if you disagree.

> [Storage implementation] Create standalone HBase backing storage 
> implementation for ATS writes
> --
>
> Key: YARN-3036
> URL: https://issues.apache.org/jira/browse/YARN-3036
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Zhijie Shen
>
> Per design in YARN-2928, create a (default) standalone HBase backing storage 
> implementation for ATS writes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-1876) Document the REST APIs of timeline and generic history services

2015-04-29 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-1876.
---
Resolution: Duplicate

Duplicate is the right resolution.

> Document the REST APIs of timeline and generic history services
> ---
>
> Key: YARN-1876
> URL: https://issues.apache.org/jira/browse/YARN-1876
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>  Labels: documentaion
> Attachments: YARN-1876.1.patch, YARN-1876.2.patch, YARN-1876.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3481) Report NM aggregated container resource utilization in heartbeat

2015-04-30 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3481.
---
Resolution: Duplicate

> Report NM aggregated container resource utilization in heartbeat
> 
>
> Key: YARN-3481
> URL: https://issues.apache.org/jira/browse/YARN-3481
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Inigo Goiri
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> To allow the RM to make better scheduling decisions, it should be aware of the 
> actual utilization of the containers. The NM would aggregate the 
> ContainerMetrics and report it in every heartbeat.
> Related to YARN-1012 but aggregated to reduce the heartbeat overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-556) [Umbrella] RM Restart phase 2 - Work preserving restart

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-556.
--
Resolution: Fixed
  Assignee: (was: Bikas Saha)

Makes sense. Resolved as fixed. Keeping it unassigned given multiple 
contributors. No fix-version given the tasks spanned across releases.

> [Umbrella] RM Restart phase 2 - Work preserving restart
> ---
>
> Key: YARN-556
> URL: https://issues.apache.org/jira/browse/YARN-556
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Reporter: Bikas Saha
> Attachments: Work Preserving RM Restart.pdf, 
> WorkPreservingRestartPrototype.001.patch, YARN-1372.prelim.patch
>
>
> YARN-128 covered storing the state needed for the RM to recover critical 
> information. This umbrella jira will track changes needed to recover the 
> running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-59) "Text File Busy" errors launching MR tasks

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-59.
-
Resolution: Duplicate

From what I know, this is fixed via YARN-1271 + YARN-1295. Resolving as dup. 
Please reopen if you disagree.

> "Text File Busy" errors launching MR tasks
> --
>
> Key: YARN-59
> URL: https://issues.apache.org/jira/browse/YARN-59
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.0.2-alpha, 0.23.5
>Reporter: Todd Lipcon
>Assignee: Andy Isaacson
>
> Some very small percentage of tasks fail with a "Text file busy" error.
> The following was the original diagnosis:
> {quote}
> Our use of PrintWriter in TaskController.writeCommand is unsafe, since that 
> class swallows all IO exceptions. We're not currently checking for errors, 
> which I'm seeing result in occasional task failures with the message "Text 
> file busy" - assumedly because the close() call is failing silently for some 
> reason.
> {quote}
> .. but turned out to be another issue as well (see below)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-47) [Umbrella] Security issues in YARN

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-47.
-
Resolution: Fixed

> [Umbrella] Security issues in YARN
> --
>
> Key: YARN-47
> URL: https://issues.apache.org/jira/browse/YARN-47
> Project: Hadoop YARN
>  Issue Type: Bug
>        Reporter: Vinod Kumar Vavilapalli
>
> JIRA tracking YARN related security issues.
> Moving over YARN only stuff from MAPREDUCE-3101.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-100) container-executor should deal with stdout, stderr better

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-100.
--
Resolution: Later

LOGFILE and ERRORFILE were always this way, and it has worked out for long 
enough.

I don't see any requests to change them to point to other files, so I am going 
to close this as Later for now. Please revert back if you disagree.

> container-executor should deal with stdout, stderr better
> -
>
> Key: YARN-100
> URL: https://issues.apache.org/jira/browse/YARN-100
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.0.1-alpha
>Reporter: Colin Patrick McCabe
>Priority: Minor
>
> container-executor.c contains the following code:
> {code}
>   fclose(stdin);
>   fflush(LOGFILE);
>   if (LOGFILE != stdout) {
> fclose(stdout);
>   }
>   if (ERRORFILE != stderr) {
> fclose(stderr);
>   }
>   if (chdir(primary_app_dir) != 0) {
> fprintf(LOGFILE, "Failed to chdir to app dir - %s\n", strerror(errno));
> return -1;
>   }
>   execvp(args[0], args);
> {code}
> Whenever you open a new file descriptor, its number is the lowest available 
> number.  So if {{stdout}} (fd number 1) has been closed, and you do 
> open("/my/important/file"), you'll get assigned file descriptor 1.  This 
> means that any printf statements in the program will now be printing to 
> /my/important/file.  Oops!
> The correct way to get rid of stdin, stdout, or stderr is not to close them, 
> but to make them point to /dev/null.  {{dup2}} can be used for this purpose.
> It looks like LOGFILE and ERRORFILE are always set to stdout and stderr at 
> the moment.  However, this is a latent bug that should be fixed in case these 
> are ever made configurable (which seems to have been the intent).
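
For illustration, here is a minimal standalone sketch (not the actual 
container-executor.c code) of the approach the description suggests: point the 
standard streams at /dev/null with dup2 instead of closing them, so a later 
open() call cannot silently land on file descriptors 0-2.

{code}
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Redirect stdin/stdout/stderr to /dev/null instead of closing them, so the
 * lowest-available-fd rule cannot hand fds 0-2 to an unrelated open(). */
static int redirect_std_fds_to_devnull(void) {
  int devnull = open("/dev/null", O_RDWR);
  if (devnull < 0) {
    perror("open /dev/null");
    return -1;
  }
  /* dup2 atomically makes fds 0, 1 and 2 refer to /dev/null. */
  if (dup2(devnull, STDIN_FILENO) < 0 ||
      dup2(devnull, STDOUT_FILENO) < 0 ||
      dup2(devnull, STDERR_FILENO) < 0) {
    perror("dup2");
    close(devnull);
    return -1;
  }
  if (devnull > STDERR_FILENO) {
    close(devnull);  /* fds 0-2 keep /dev/null open */
  }
  return 0;
}

int main(void) {
  if (redirect_std_fds_to_devnull() != 0) {
    return 1;
  }
  /* Any subsequent open() now gets an fd >= 3, so stray printf output can no
   * longer end up inside an application file. */
  printf("this goes to /dev/null\n");
  return 0;
}
{code}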



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-148) CapacityScheduler shouldn't explicitly need YarnConfiguration

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-148.
--
Resolution: Invalid

I don't see this issue anymore, seems like it got resolved along the way.

Resolving this old ticket.

> CapacityScheduler shouldn't explicitly need YarnConfiguration
> -
>
> Key: YARN-148
> URL: https://issues.apache.org/jira/browse/YARN-148
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> This was done in MAPREDUCE-3773. None of our service APIs warrant 
> YarnConfiguration. We affect the proper loading of yarn-site.xml by 
> explicitly creating YarnConfiguration in all the main classes - 
> ResourceManager, NodeManager etc.
> Due to this extra dependency, tests are failing, see 
> https://builds.apache.org/job/PreCommit-YARN-Build/74//testReport/org.apache.hadoop.yarn.client/TestYarnClient/testClientStop/.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-149) [Umbrella] ResourceManager (RM) Fail-over

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-149.
--
Resolution: Fixed

Resolving this umbrella JIRA. RM failover has largely been complete/stable in 
YARN since this ticket was opened. And as new requirements/bugs come in, we can 
open new tickets.
- Will leave the open sub-tasks as they are.
- No fix-version as this was done across releases.

> [Umbrella] ResourceManager (RM) Fail-over
> -
>
> Key: YARN-149
> URL: https://issues.apache.org/jira/browse/YARN-149
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Reporter: Harsh J
>  Labels: patch
> Attachments: YARN ResourceManager Automatic 
> Failover-rev-07-21-13.pdf, YARN ResourceManager Automatic 
> Failover-rev-08-04-13.pdf, rm-ha-phase1-approach-draft1.pdf, 
> rm-ha-phase1-draft2.pdf
>
>
> This jira tracks work needed to be done to support one RM instance failing 
> over to another RM instance so that we can have RM HA. Work includes leader 
> election, transfer of control to leader and client re-direction to new leader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-156) WebAppProxyServlet does not support http methods other than GET

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-156.
--
Resolution: Duplicate

Seems like there is more movement on this issue at YARN-2031. Given this, I am 
closing this as dup even though this was the earlier ticket to be created. 
Please revert back if you disagree.

> WebAppProxyServlet does not support http methods other than GET
> ---
>
> Key: YARN-156
> URL: https://issues.apache.org/jira/browse/YARN-156
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Thomas Weise
>
> Should support all methods so that applications can use it for full web 
> service access to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-492) Too many open files error to launch a container

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-492.
--
Resolution: Won't Fix

Agree with analysis from [~hitesh] and [~revans2]. Resolving this as won't fix.

Maybe YARN-220 can help though, if we come around to having a fix there.

> Too many open files error to launch a container
> ---
>
> Key: YARN-492
> URL: https://issues.apache.org/jira/browse/YARN-492
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
> Environment: RedHat Linux
>Reporter: Krishna Kishore Bonagiri
>
> I am running a date command with YARN's distributed shell example in a loop 
> of 1000 times in this way:
> yarn jar 
> /home/kbonagir/yarn/hadoop-2.0.0-alpha/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-2.0.0-alpha.jar
>  org.apache.hadoop.yarn.applications.distributedshell.Client --jar 
> /home/kbonagir/yarn/hadoop-2.0.0-alpha/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-2.0.0-alpha.jar
>  --shell_command date --num_containers 2
> Around the 730th run or so, I get an error in the NodeManager's log saying 
> that it failed to launch the container because there are "Too many open 
> files". When I observe through the lsof command, I find that one descriptor of 
> the following kind is left behind for each run of the Application Master, and 
> the count keeps growing as I run the loop:
> node1:44871->node1:50010
> Thanks,
> Kishore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-501) Application Master getting killed randomly reporting excess usage of memory

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-501.
--
Resolution: Not A Problem

I haven't gotten a response on my last comment in a while. In any case, it is 
unlikely YARN can do much in this situation. Closing this again as not-a-problem.

> Application Master getting killed randomly reporting excess usage of memory
> ---
>
> Key: YARN-501
> URL: https://issues.apache.org/jira/browse/YARN-501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell, nodemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Krishna Kishore Bonagiri
>Assignee: Omkar Vinit Joshi
>
> I am running a date command using the Distributed Shell example in a loop of 
> 500 times. It ran successfully all the times except one time where it gave 
> the following error.
> 2013-03-22 04:33:25,280 INFO  [main] distributedshell.Client 
> (Client.java:monitorApplication(605)) - Got application report from ASM for, 
> appId=222, clientToken=null, appDiagnostics=Application 
> application_1363938200742_0222 failed 1 times due to AM Container for 
> appattempt_1363938200742_0222_01 exited with  exitCode: 143 due to: 
> Container [pid=21141,containerID=container_1363938200742_0222_01_01] is 
> running beyond virtual memory limits. Current usage: 47.3 Mb of 128 Mb 
> physical memory used; 611.6 Mb of 268.8 Mb virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1363938200742_0222_01_01 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 21147 21141 21141 21141 (java) 244 12 532643840 11802 
> /home_/dsadm/yarn/jdk//bin/java -Xmx128m 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster 
> --container_memory 10 --num_containers 2 --priority 0 --shell_command date
> |- 21141 8433 21141 21141 (bash) 0 0 108642304 298 /bin/bash -c 
> /home_/dsadm/yarn/jdk//bin/java -Xmx128m 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster 
> --container_memory 10 --num_containers 2 --priority 0 --shell_command date 
> 1>/tmp/logs/application_1363938200742_0222/container_1363938200742_0222_01_01/AppMaster.stdout
>  
> 2>/tmp/logs/application_1363938200742_0222/container_1363938200742_0222_01_01/AppMaster.stderr
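
For context, the 268.8 Mb virtual-memory limit in the diagnostic above is simply 
the container's 128 Mb physical-memory allocation scaled by the NodeManager's 
virtual-to-physical ratio, assuming the default yarn.nodemanager.vmem-pmem-ratio 
of 2.1:

\[
128\ \text{MB} \times 2.1 = 268.8\ \text{MB}
\]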



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-526) [Umbrella] Improve test coverage in YARN

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-526.
--
Resolution: Fixed
  Assignee: Andrey Klochkov

Assigning the umbrella also to [~aklochkov] and closing it as Fixed, since all 
sub-tasks currently present are done.

> [Umbrella] Improve test coverage in YARN
> 
>
> Key: YARN-526
> URL: https://issues.apache.org/jira/browse/YARN-526
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Andrey Klochkov
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-543) [Umbrella] NodeManager localization related issues

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-543.
--
Resolution: Fixed

Resolving this very old umbrella JIRA. Most of the originally identified issues 
are resolved. And as new bugs come in, we can open new tickets.
- Will leave the open sub-tasks as they are.
- No fix-version as this was done across releases.

> [Umbrella] NodeManager localization related issues
> --
>
> Key: YARN-543
> URL: https://issues.apache.org/jira/browse/YARN-543
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Reporter: Vinod Kumar Vavilapalli
>
> Seeing a bunch of localization related issues being worked on, this is the 
> tracking ticket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-128) [Umbrella] RM Restart Phase 1: State storage and non-work-preserving recovery

2015-05-02 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-128.
--
Resolution: Fixed

Resolving this umbrella JIRA. RM recovery has largely been complete/stable in 
YARN since this ticket was opened, culminating in its use for rolling upgrades 
(YARN-666).
- As new issues come in, we can open new tickets.
- Will leave the open sub-tasks as they are.
- No fix-version as this was done across releases.

> [Umbrella] RM Restart Phase 1: State storage and non-work-preserving recovery
> -
>
> Key: YARN-128
> URL: https://issues.apache.org/jira/browse/YARN-128
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Arun C Murthy
> Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt, 
> RMRestartPhase1.pdf, YARN-128.full-code-4.patch, YARN-128.full-code.3.patch, 
> YARN-128.full-code.5.patch, YARN-128.new-code-added-4.patch, 
> YARN-128.new-code-added.3.patch, YARN-128.old-code-removed.3.patch, 
> YARN-128.old-code-removed.4.patch, YARN-128.patch, 
> restart-12-11-zkstore.patch, restart-fs-store-11-17.patch, 
> restart-zk-store-11-17.patch
>
>
> This umbrella jira tracks the work needed to preserve critical state 
> information and reload them upon RM restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-255) Support secure AM launch for unmanaged AM's

2015-05-06 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-255.
--
Resolution: Duplicate

The only tokens the AM needs from the RM at launch are AMRMTokens. 
That is fixed via YARN-937. Closing this as a dup; please reopen if you disagree.

> Support secure AM launch for unmanaged AM's
> ---
>
> Key: YARN-255
> URL: https://issues.apache.org/jira/browse/YARN-255
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications/unmanaged-AM-launcher
>Affects Versions: 3.0.0
>Reporter: Bikas Saha
>
> Currently unmanaged AM launch does not get security tokens because tokens are 
> passed by the RM to the AM via the NM during AM container launch. For 
> unmanaged AM's the RM can send tokens in the SubmitApplicationResponse to the 
> secure client. The client can then pass these onto the AM in a manner similar 
> to the NM. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

