Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86

2018-11-08 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/

[Nov 7, 2018 2:17:35 AM] (aajisaka) YARN-8233. NPE in 
CapacityScheduler#tryCommit when handling
[Nov 7, 2018 2:48:07 AM] (wwei) YARN-8976. Remove redundant modifiers in 
interface ApplicationConstants.
[Nov 7, 2018 5:54:08 AM] (yqlin) HDDS-809. Refactor SCMChillModeManager.
[Nov 7, 2018 8:45:16 AM] (wwei) HADOOP-15907. Add missing maven modules in 
BUILDING.txt. Contributed
[Nov 7, 2018 9:26:07 AM] (tasanuma) YARN-8866. Fix a parsing error for 
crossdomain.xml.
[Nov 7, 2018 2:20:49 PM] (jlowe) MAPREDUCE-7148. Fast fail jobs when exceeds 
dfs quota limitation.
[Nov 7, 2018 2:42:22 PM] (wwei) YARN-8977. Remove unnecessary type casting when 
calling




-1 overall


The following subsystems voted -1:
asflicense findbugs hadolint pathlen unit


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

Failed junit tests :

   hadoop.util.TestReadWriteDiskValidator
   hadoop.util.TestBasicDiskValidator
   hadoop.util.TestDiskCheckerWithDiskIo
   hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
   hadoop.yarn.client.api.impl.TestAMRMProxy
   hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage

   cc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/diff-compile-cc-root.txt
  [4.0K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/diff-compile-javac-root.txt
  [324K]

   checkstyle:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/diff-checkstyle-root.txt
  [17M]

   hadolint:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/diff-patch-hadolint.txt
  [4.0K]

   pathlen:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/pathlen.txt
  [12K]

   pylint:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/diff-patch-pylint.txt
  [40K]

   shellcheck:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/diff-patch-shellcheck.txt
  [68K]

   shelldocs:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/diff-patch-shelldocs.txt
  [12K]

   whitespace:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/whitespace-eol.txt
  [9.3M]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/whitespace-tabs.txt
  [1.1M]

   findbugs:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-hdds_client.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-hdds_container-service.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-hdds_framework.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-hdds_server-scm.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-hdds_tools.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-ozone_client.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-ozone_common.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-ozone_objectstore-service.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-ozone_ozone-manager.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-ozone_ozonefs.txt
  [12K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-ozone_s3gateway.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/branch-findbugs-hadoop-ozone_tools.txt
  [8.0K]

   javadoc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/diff-javadoc-javadoc-root.txt
  [752K]

   unit:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
  [196K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/951/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.tx

Re: Run Distributed TensorFlow on YARN

2018-11-08 Thread Robert Grandl
Thanks a lot for your reply.

Sunil,
I was trying to follow the steps from:
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/RunningDistributedCifar10TFJobs.md

to run TensorFlow standalone using Submarine. I have installed Hadoop
3.3.0-SNAPSHOT.
However, when I run:

yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
   job run --name tf-job-001 --verbose --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
   --input_path hdfs://default/dataset/cifar-10-data \
   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0
   --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
   --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=1 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
   --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3

I get the following error:

2018-11-07 21:48:55,831 INFO  [main] client.AHSProxy (AHSProxy.java:createAHSProxy(42)) - Connecting to Application History server at /128.105.144.236:10200
Exception in thread "main" java.lang.IllegalArgumentException: Unacceptable no of cpus specified, either zero or negative for component master (or at the global level)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateServiceResource(ServiceApiUtil.java:457)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateComponent(ServiceApiUtil.java:306)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:237)
        at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:496)
        at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
        at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
        at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)

It seems I need to configure the corresponding resources for a master
component somewhere, but I have a hard time understanding where and what to
configure. I also looked at the design document you pointed at:
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

and it has a --master_resources flag, but this is not available in 3.3.0.
Could you please advise how to proceed?
Thank you,
- Robert
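One detail worth ruling out first (assuming the paste reflects the command as typed): the two --env lines do not end with a backslash, so a shell would stop parsing the command there and the worker resource flags would never reach the client, which could plausibly produce a zero-resource validation error. A version with explicit continuations on every line:

```shell
yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
  job run --name tf-job-001 --verbose \
  --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
  --input_path hdfs://default/dataset/cifar-10-data \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
  --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
  --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=1 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
  --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
```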

On Tuesday, November 6, 2018, 10:40:20 PM PST, Jonathan Hung wrote:
 Hi Robert, I also encourage you to check out https://github.com/linkedin/TonY 
(TensorFlow on YARN) which is a platform built for this purpose.

Jonathan

From: Sunil G 
Sent: Tuesday, November 6, 2018 10:05:14 PM
To: Robert Grandl
Cc: yarn-dev@hadoop.apache.org; yarn-dev-h...@hadoop.apache.org; General
Subject: Re: Run Distributed TensorFlow on YARN

Hi Robert

The Submarine project helps run distributed TensorFlow on top of YARN with
ease. YARN-8220 was an early attempt to do the same with custom scripts,
but Submarine avoids all such scripts: you can simply run TensorFlow like a
distributed shell command line using the Submarine jar. Please refer to the
doc below for a deep dive.
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

Submarine will be released as part of the Hadoop 3.2.0 release, which will
officially be out very soon (in the coming weeks). You are free to use
Hadoop trunk to run the same in the meantime.

For now, you can refer to the Submarine docs in the Hadoop repo (trunk)
under
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
or (
https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown
)

Thanks
Sunil


On Wed, Nov 7, 2018 at 10:34 AM Robert Grandl 
wrote:

>  Hi all,
> I am wondering if there is any stable support to run distributed
> TensorFlow atop YARN at the moment.
> I found this blog post from Hortonworks. It seems this is possible
> starting with YARN 3.1.0.
> https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
>
>
> Also I found some more recent JIRAs:
> https://issu

[jira] [Created] (YARN-8989) Move DockerCommandPlugin volume related APIs' invocation from DockerLinuxContainerRuntime#prepareContainer to #launchContainer

2018-11-08 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8989:
--

 Summary: Move DockerCommandPlugin volume related APIs' invocation 
from DockerLinuxContainerRuntime#prepareContainer to #launchContainer
 Key: YARN-8989
 URL: https://issues.apache.org/jira/browse/YARN-8989
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


This seems required before we implement isolation in the pluggable device
framework for both default containers and Docker containers with
LinuxContainerExecutor.

We need to find a place for the plugin's "onDevicesAllocated" hook in the
current operation flow when running a container with LCE.
{code:java}
ContainerLaunch#call() ->
    1. ContainerLaunch#prepareContainer() ->
           LCE#prepareContainer ->
                DelegatingLinuxContainerRuntime#prepareContainer ->
                     DockerLinuxContainerRuntime#prepareContainer ->
                           DockerCommandPlugin#getCreateDockerVolumeCommand ->
                                 onDevicesAllocated(null, docker); create volume?

    2. ContainerLaunch#launchContainer ->
           LCE#launchContainer() ->
                resourceHandlerChain#preStart() ->
                     DeviceResourceHandlerImpl#preStart() ->
                           onDevicesAllocated(alloc, docker);
                           allocate device and do isolation for the default
                           container with cgroups
{code}
 

What I want to do here is move the DockerCommandPlugin API invocations from
DockerLinuxContainerRuntime#prepareContainer to #launchContainer. This won't
introduce any incompatibility, and it benefits the pluggable device
framework's interaction with the device plugin.

The "DeviceRuntimeSpec onDevicesAllocated(Set allocation, yarnRuntime)"
method implemented by the device plugin lets the plugin do some preparation
and return a spec describing how to run the container with the allocated
devices. We designed a VolumeClaim field in the DeviceRuntimeSpec object for
the plugin to declare which volumes it needs created.

In the current code flow, calling "onDevicesAllocated" from the
DockerCommandPlugin's methods seems awkward and can only pass a null
allocation. This complicates the vendor device plugin implementation, which
has to handle a null value.

Once we move the DockerCommandPlugin API invocation, it will look like this:
{code:java}
ContainerLaunch#call() ->
     ContainerLaunch#launchContainer ->
           LCE#launchContainer() ->
                resourceHandlerChain#preStart() ->
                     DeviceResourceHandlerImpl#preStart() ->
                           onDevicesAllocated(alloc, docker);
                           allocate device and do isolation for the default
                           container with cgroups

                DelegatingLinuxContainerRuntime#launchContainer ->
                     DockerLinuxContainerRuntime#launchContainer ->
                           DockerCommandPlugin#getCreateDockerVolumeCommand ->
                                 get allocation; onDevicesAllocated(alloc, docker);
                                 create volume
{code}
After these changes, the flow is smoother, and the plugin implementation of
"onDevicesAllocated" becomes simpler.
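The effect of the reordering can be sketched with stand-in types (the class, method names, and strings below are illustrative, not the real YARN interfaces): under the old flow the plugin hook runs before any allocation exists and must special-case null, while after the move it always receives a concrete allocation.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical stand-ins for the plugin-facing types discussed above.
public class FlowSketch {
    // Stand-in for the vendor plugin's onDevicesAllocated(...) hook.
    static String onDevicesAllocated(Set<String> allocation) {
        if (allocation == null) {
            // Old flow: prepareContainer runs before any allocation exists,
            // so every plugin must special-case null.
            return "no-allocation: plugin must guess volumes";
        }
        return "volumes-for:" + allocation;
    }

    // Old flow: volume creation happens in prepareContainer, before
    // DeviceResourceHandlerImpl#preStart has produced an allocation.
    static String oldFlow() {
        return onDevicesAllocated(null);
    }

    // Proposed flow: volume creation moves into launchContainer, after
    // preStart, so a concrete allocation is always available.
    static String newFlow() {
        Set<String> alloc = Collections.singleton("gpu0");
        return onDevicesAllocated(alloc);
    }

    public static void main(String[] args) {
        System.out.println(oldFlow());
        System.out.println(newFlow());
    }
}
```

Compiled alone, `oldFlow()` shows the null-handling branch every vendor plugin would otherwise need.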



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8988) Reduce the verbose log on RM heartbeat path when distributed node-attributes is enabled

2018-11-08 Thread Weiwei Yang (JIRA)
Weiwei Yang created YARN-8988:
-

 Summary: Reduce the verbose log on RM heartbeat path when 
distributed node-attributes is enabled
 Key: YARN-8988
 URL: https://issues.apache.org/jira/browse/YARN-8988
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Weiwei Yang


With distributed node-attributes enabled, the RM log is flooded with
messages like the following:

{noformat}
2018-11-08 08:20:48,901 INFO org.apache.hadoop.yarn.server.resourcemanager.nodelabels.NodeAttributesManagerImpl: Updated NodeAttribute event to RM:[[nm.yarn.io/osType(STRING)=redhat, nm.yarn.io/osVersion(STRING)=2.6]]
{noformat}
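The usual remedy is to demote the per-heartbeat message to debug level so it is filtered at the default INFO level. A minimal self-contained sketch (java.util.logging stands in for Hadoop's logging API here, and the method names are hypothetical):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class HeartbeatLogSketch {
    private static final Logger LOG =
        Logger.getLogger(HeartbeatLogSketch.class.getName());

    // FINE plays the role of "debug"; false under the default INFO level.
    static boolean wouldLog() {
        return LOG.isLoggable(Level.FINE);
    }

    // Called on every NM heartbeat; the string concatenation is also
    // skipped when debug logging is off.
    static void onAttributesUpdated(String attributes) {
        if (wouldLog()) {
            LOG.fine("Updated NodeAttribute event to RM: " + attributes);
        }
    }

    public static void main(String[] args) {
        onAttributesUpdated("[nm.yarn.io/osType(STRING)=redhat]");
        System.out.println(wouldLog()); // false under the default INFO level
    }
}
```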



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: Zeppelin add hadoop submarine(machine learning) interpreter

2018-11-08 Thread liuxun
Hi community,

I updated the design documentation:
https://docs.google.com/document/d/16YN8Kjmxt1Ym3clx5pDnGNXGajUT36hzQxjaik1cP4A/edit#

It adds the interactive development of machine learning algorithms and the
overall design for submitting Note packages to YARN for job execution.

I have completed the development of the pre-research part of the code and
filed a JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-3856

Comments are welcome, thank you!

> On Oct 30, 2018, at 10:11 PM, liuxun wrote:
> 
> Hi community,
> 
> 
> I updated the design document and submitted it to the Hadoop community.
> https://docs.google.com/document/d/16YN8Kjmxt1Ym3clx5pDnGNXGajUT36hzQxjaik1cP4A/edit#
> 
> 
> If you have any questions, you can ask them in the email or directly in the
> document. Thank you!
> 
> I developed the submarine interpreter module in Zeppelin. Below is how the
> system runs.
> 
> 1. Zeppelin submarine interpreter properties
> 
> 1.1 System administrators can configure multiple interpreters, and different
> interpreters can have different resource configurations.
> 1.2 Different users can use different interpreters for resource allocation
> and management.
> 
> 2. Zeppelin submarine TensorFlow interpreter
> 
> 2.1 Users use their own interpreter: first, they write TensorFlow Python
> code in a %submarine.tensorflow paragraph.
> 2.2 After the user writes the TensorFlow Python code and clicks the [RUN]
> button, Zeppelin uploads the Python code to the specified HDFS directory.
> Submarine loads it into the Docker container at run time.
> 
> 3. Zeppelin submarine interpreter
> 
> 3.1 The user sets the call parameter values for the TensorFlow Python code,
> then enters the job run command.
> 3.2 The Zeppelin submarine interpreter first checks that all parameters for
> the submarine run are set. After the check passes, the job is submitted to
> YARN via submarine.jar.
> 3.3 The progress and log information of the submarine job are displayed in
> Zeppelin's note.
> 3.4 You can enter submarine's other commands in the submarine interpreter
> to view all the jobs in submarine, etc.
> 
> 
>> On Oct 30, 2018, at 6:40 PM, liuxun <neliu...@163.com> wrote:
>>> On Oct 21, 2018, at 6:26 AM, Felix Cheung wrote:
>>> Very cool!
>>> 
>>> 
>>> 
>>> From: Jeff Zhang <zjf...@gmail.com>
>>> Sent: Friday, October 19, 2018 7:14 AM
>>> To: d...@zeppelin.apache.org 
>>> Subject: Re: Zeppelin add hadoop submarine(machine learning framework) 
>>> interpreter
>>> 
>>> Thanks xun. This would be a great addon for zeppelin to support deep
>>> learning. I will check

[jira] [Created] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache

2018-11-08 Thread Hidayat Teonadi (JIRA)
Hidayat Teonadi created YARN-8991:
-

 Summary: nodemanager not cleaning blockmgr directories inside 
appcache 
 Key: YARN-8991
 URL: https://issues.apache.org/jira/browse/YARN-8991
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Hidayat Teonadi
 Attachments: yarn-nm-log.txt

Hi, I'm running Spark on YARN and have enabled the Spark Shuffle Service.
I'm noticing that during the lifetime of my Spark streaming application, the
NM appcache folder builds up with blockmgr directories (filled with
shuffle_*.data).

Looking at the NM logs, it seems the blockmgr directories are not part of
the application's cleanup process. Eventually the disk fills up and the app
crashes. I have both
{{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and
{{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think
it's a configuration issue.

What is stumping me is that the executor ID listed by Spark during the
external shuffle block registration doesn't match the executor ID listed in
YARN's NM log. Maybe this executor-ID mismatch explains why the cleanup is
not done? I'm assuming the blockmgr directories are supposed to be cleaned
up?

 
{noformat}
2018-11-05 15:01:21,349 INFO org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=application_1541045942679_0193, execId=1299} with ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
{noformat}
 

This seems similar to https://issues.apache.org/jira/browse/YARN-7070,
although I'm not sure whether the behavior I'm seeing is related to how
Spark is used.

https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files
has a stop-gap solution of cleaning up via cron.
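For reference, such a cron stop gap is typically a periodic find-and-delete over the NM local dirs. The path and retention below are assumptions for a typical layout (match them to your yarn.nodemanager.local-dirs), and fixing the cleanup itself remains preferable:

```shell
# Hypothetical crontab entry: hourly, remove blockmgr-* directories not
# modified for more than a day. Run as the NM/app user; test with -print
# in place of "-exec rm -rf {} +" first.
0 * * * * find /mnt1/yarn/nm/usercache/*/appcache/application_*/blockmgr-* \
    -maxdepth 0 -type d -mtime +1 -exec rm -rf {} +
```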

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8986) publish all exposed ports to random ports when using bridge network

2018-11-08 Thread Charo Zhang (JIRA)
Charo Zhang created YARN-8986:
-

 Summary: publish all exposed ports to random ports when using 
bridge network
 Key: YARN-8986
 URL: https://issues.apache.org/jira/browse/YARN-8986
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.1.1
Reporter: Charo Zhang
 Fix For: 3.1.2
 Attachments: 20181108155450.png

It would be better to publish all exposed ports to random host ports when
using the bridge network for Docker containers.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8985) FSParentQueue: debug log missing when assigning container

2018-11-08 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8985:
---

 Summary: FSParentQueue: debug log missing when assigning container
 Key: YARN-8985
 URL: https://issues.apache.org/jira/browse/YARN-8985
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.3.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Tracking assignments in the queue hierarchy is not possible at DEBUG level
because FSParentQueue does not log when a node is offered to the queue.
This means that if a parent queue has no leaf queues, it is impossible to
track the offering, leaving a hole in the tracking.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: Run Distributed TensorFlow on YARN

2018-11-08 Thread Wangda Tan
Forgot to add Xun in my last email.


[jira] [Created] (YARN-8990) FS: race condition in app submit and queue cleanup

2018-11-08 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8990:
---

 Summary: FS: race condition in app submit and queue cleanup
 Key: YARN-8990
 URL: https://issues.apache.org/jira/browse/YARN-8990
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.2.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


With the introduction of dynamic queue deletion in YARN-8191, a race
condition was introduced that can cause a queue to be removed while an
application submit is in progress.

The issue occurs in {{FairScheduler.addApplication()}} when an application
is submitted to a dynamic queue which is empty or does not exist yet. If the
{{AllocationFileLoaderService}} kicks off an update during the processing of
the application submit, the queue cleanup will run first. The application
submit first creates the queue and gets a reference back to it. Other checks
are performed, and as the last action before generating an AppAttempt, the
queue is updated to record the submitted application ID.

The window between the queue creation and the queue update recording the
submit is long enough for the queue to be removed. The application is then
lost and will never get any resources assigned.
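The check-then-act window can be sketched with a toy model (all names and data structures below are illustrative stand-ins, not the actual FairScheduler code): if cleanup runs between creating the queue and recording the application, the submit is silently lost, whereas folding create-and-record into one atomic step closes the window.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model: queue name -> number of recorded applications.
public class QueueRaceSketch {
    private final Map<String, Integer> queues = new ConcurrentHashMap<>();

    // Step 1 of the racy submit: create the (empty) dynamic queue.
    void createQueue(String q) { queues.putIfAbsent(q, 0); }

    // Step 2, after other checks: record the app; a no-op if the queue
    // was deleted in between.
    void recordApp(String q) { queues.computeIfPresent(q, (k, n) -> n + 1); }

    // Models the AllocationFileLoaderService-driven cleanup of empty queues.
    void cleanupEmpty() { queues.entrySet().removeIf(e -> e.getValue() == 0); }

    boolean hasApp(String q) { return queues.getOrDefault(q, 0) > 0; }

    // One possible fix: create the queue and record the app atomically, so
    // cleanup never observes an empty queue mid-submit.
    void submitAtomic(String q) { queues.merge(q, 1, Integer::sum); }

    public static void main(String[] args) {
        QueueRaceSketch racy = new QueueRaceSketch();
        racy.createQueue("root.dyn");
        racy.cleanupEmpty();             // cleanup fires inside the window
        racy.recordApp("root.dyn");      // silently dropped
        System.out.println(racy.hasApp("root.dyn")); // false: app is lost

        QueueRaceSketch atomic = new QueueRaceSketch();
        atomic.submitAtomic("root.dyn");
        atomic.cleanupEmpty();
        System.out.println(atomic.hasApp("root.dyn")); // true
    }
}
```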



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8987) Usability improvements for adding centralised node-attributes from CLI

2018-11-08 Thread Weiwei Yang (JIRA)
Weiwei Yang created YARN-8987:
-

 Summary: Usability improvements for adding centralised 
node-attributes from CLI
 Key: YARN-8987
 URL: https://issues.apache.org/jira/browse/YARN-8987
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Weiwei Yang


I set up a single-node cluster, then tried to add node-attributes with the
CLI.

First I tried:

{code}
./bin/yarn nodeattributes -add localhost:hostname(STRING)=localhost
{code}

This command returns exit code 0; however, the node-attribute was not added.

Then I tried replacing "localhost" with the host ID, and it worked.

We need to ensure the command fails with a proper error message when the add
does not succeed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: Run Distributed TensorFlow on YARN

2018-11-08 Thread Wangda Tan
Hi Robert,

Submarine in 3.2.0 only supports the Docker container runtime; in future
releases (maybe 3.2.1) we plan to add support for non-Docker containers.

In order to try Submarine, you need to properly configure Docker-on-YARN
first.

You can check
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
for the installation guide on how to properly set up Docker containers on
multiple nodes. Submarine embeds an interactive shell to help you set this
up, so it should be straightforward. I've added Xun Liu, who is the original
author of the interactive installation shell.

Once you get Docker on YARN properly set up, you can follow
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/QuickStart.md
to run your first application.

Also, you can check the Submarine slides to better understand how it works.
See: https://www.dropbox.com/s/wuv19b3rt9k2kq6/submarine-v0.pptx?dl=0

If you have any questions, please don't hesitate to let us know.

Thanks,
Wangda



On Thu, Nov 8, 2018 at 10:12 AM Robert Grandl 
wrote:

>  Thanks a lot for your reply.
> Sunil,
> I was trying to follow the steps from:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/RunningDistributedCifar10TFJobs.md
>
> to run standalone TensorFlow using Submarine. I have installed hadoop
> 3.3.0-SNAPSHOT.
> However, when I run:
>
> yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
>   job run --name tf-job-001 --verbose \
>   --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
>   --input_path hdfs://default/dataset/cifar-10-data \
>   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
>   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
>   --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
>   --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=1 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
>   --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
>
> I get the following error:
>
> 2018-11-07 21:48:55,831 INFO  [main] client.AHSProxy (AHSProxy.java:createAHSProxy(42)) - Connecting to Application History server at /128.105.144.236:10200
> Exception in thread "main" java.lang.IllegalArgumentException: Unacceptable no of cpus specified, either zero or negative for component master (or at the global level)
>     at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateServiceResource(ServiceApiUtil.java:457)
>     at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateComponent(ServiceApiUtil.java:306)
>     at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:237)
>     at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:496)
>     at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
>     at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
>     at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
>
> It seems that I have not configured the corresponding resources for the
> master component somewhere. However, I have a hard time understanding where
> and what to configure. I also looked at the design document you pointed at:
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
>
> and it has a --master_resources flag. However, this flag is not available
> in 3.3.0.
> Could you please advise how to proceed with this?
> Thank you,
> - Robert
>
> On Tuesday, November 6, 2018, 10:40:20 PM PST, Jonathan Hung <
> jyhung2...@gmail.com> wrote:
>
>  Hi Robert, I also encourage you to check out
> https://github.com/linkedin/TonY (TensorFlow on YARN) which is a platform
> built for this purpose.
>
> Jonathan
> 
> From: Sunil G 
> Sent: Tuesday, November 6, 2018 10:05:14 PM
> To: Robert Grandl
> Cc: yarn-dev@hadoop.apache.org; yarn-dev-h...@hadoop.apache.org; General
> Subject: Re: Run Distributed TensorFlow on YARN
>
> Hi Robert
>
> The {Submarine} project helps to run distributed TensorFlow on top of YARN
> with ease. YARN-8220 

[jira] [Created] (YARN-8992) Fair scheduler can delete a dynamic queue while an application attempt is being added to the queue

2018-11-08 Thread Haibo Chen (JIRA)
Haibo Chen created YARN-8992:


 Summary: Fair scheduler can delete a dynamic queue while an 
application attempt is being added to the queue
 Key: YARN-8992
 URL: https://issues.apache.org/jira/browse/YARN-8992
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.1.1
Reporter: Haibo Chen


QueueManager can observe a leaf queue as empty while FSLeafQueue.addApp() is 
called in the middle of this check:
{code:java}
return queue.getNumRunnableApps() == 0 &&
  leafQueue.getNumNonRunnableApps() == 0 &&
  leafQueue.getNumAssignedApps() == 0;{code}
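A minimal Java sketch of the fix direction (all names here are hypothetical, not the actual FairScheduler classes): the emptiness check and queue deletion must hold the same lock as addApp(), so an attempt cannot be added "in the middle" of the check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real FairScheduler code: a dynamic leaf
// queue whose deletion is atomic with respect to app additions.
class DynamicQueue {
    private final List<String> apps = new ArrayList<>();
    private boolean deleted = false;

    // Returns false if the queue was already deleted; the caller must then
    // re-resolve the queue instead of adding to a dead object.
    synchronized boolean addApp(String appId) {
        if (deleted) {
            return false;
        }
        apps.add(appId);
        return true;
    }

    // Deletes the queue only if it is still empty; because this holds the
    // same lock as addApp(), no attempt can slip in between check and delete.
    synchronized boolean removeIfEmpty() {
        if (apps.isEmpty()) {
            deleted = true;
            return true;
        }
        return false;
    }
}

public class QueueRaceSketch {
    public static void main(String[] args) {
        DynamicQueue q = new DynamicQueue();
        System.out.println(q.addApp("app_001"));     // queue still live
        System.out.println(q.removeIfEmpty());       // refused: not empty

        DynamicQueue empty = new DynamicQueue();
        System.out.println(empty.removeIfEmpty());   // deleted while empty
        System.out.println(empty.addApp("app_002")); // refused: deleted
    }
}
```

Once deleted, addApp() refuses the attempt and the caller can re-resolve the queue, which avoids losing the application attempt.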






[jira] [Created] (YARN-8993) [Submarine] Add support to run deep learning workload in non-Docker containers

2018-11-08 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8993:


 Summary: [Submarine] Add support to run deep learning workload in 
non-Docker containers
 Key: YARN-8993
 URL: https://issues.apache.org/jira/browse/YARN-8993
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


While Submarine now supports Docker containers well, there is a need to run TF 
without Docker containers. This JIRA targets supporting deep-learning workload 
orchestration in non-Docker containers.






[jira] [Created] (YARN-8994) Fix for race condition in move app and queue cleanup in FS

2018-11-08 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created YARN-8994:
---

 Summary: Fix for race condition in move app and queue cleanup in FS
 Key: YARN-8994
 URL: https://issues.apache.org/jira/browse/YARN-8994
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.2.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Similar to YARN-8990, and also introduced by YARN-8191, there is a race 
condition while moving an application. The pre-move check looks for the queue 
and progresses when it finds it. The real move then retrieves the queue again 
and performs further checks before updating the app and queues.

The move uses the retrieved queue object, but the queue could have become empty 
while the checks are performed. If the cleanup runs at that same time, the app 
will be moved to a deleted queue and lost.
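As a sketch of one possible fix (hypothetical names, not the real scheduler API): the move can re-validate, under the same lock the cleanup takes, that the target queue is still registered before committing the move.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch with hypothetical names (not the actual FairScheduler
// classes): moveApp() re-checks at commit time, under the same lock the
// cleanup uses, that the target queue still exists, so cleanup cannot
// delete the queue between the pre-move check and the move itself.
public class MoveRaceSketch {
    private final Map<String, String> appToQueue = new ConcurrentHashMap<>();
    private final Map<String, Boolean> queues = new ConcurrentHashMap<>();

    synchronized void addQueue(String name) {
        queues.put(name, Boolean.TRUE);
    }

    synchronized void finishApp(String appId) {
        appToQueue.remove(appId);
    }

    // Re-validate at commit time instead of trusting an earlier lookup.
    synchronized boolean moveApp(String appId, String targetQueue) {
        if (!queues.containsKey(targetQueue)) {
            return false; // queue was cleaned up: fail the move, keep the app
        }
        appToQueue.put(appId, targetQueue);
        return true;
    }

    // Cleanup deletes a queue only while no app is assigned to it.
    synchronized boolean removeQueueIfUnused(String name) {
        if (!appToQueue.containsValue(name)) {
            queues.remove(name);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MoveRaceSketch s = new MoveRaceSketch();
        s.addQueue("root.dynamic");
        System.out.println(s.moveApp("app_1", "root.dynamic"));    // committed
        System.out.println(s.removeQueueIfUnused("root.dynamic")); // refused
        s.finishApp("app_1");
        System.out.println(s.removeQueueIfUnused("root.dynamic")); // deleted
        System.out.println(s.moveApp("app_2", "root.dynamic"));    // refused
    }
}
```

Because both paths serialize on the same monitor, a move either lands in a live queue or fails cleanly; the app is never assigned to a deleted queue.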






Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86

2018-11-08 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/

[Nov 8, 2018 4:23:00 AM] (wwei) YARN-8880. Add configurations for pluggable 
plugin framework.
[Nov 8, 2018 9:47:18 AM] (wwei) YARN-8988. Reduce the verbose log on RM 
heartbeat path when distributed
[Nov 8, 2018 1:03:38 PM] (nanda) HDDS-737. Introduce Incremental Container 
Report. Contributed by Nanda
[Nov 8, 2018 3:41:43 PM] (yqlin) HDDS-802. Container State Manager should get 
open pipelines for
[Nov 8, 2018 5:21:40 PM] (stevel) HADOOP-15846. ABFS: fix mask related bugs in 
setAcl, modifyAclEntries
[Nov 8, 2018 6:01:19 PM] (xiao) HDFS-14039. ec -listPolicies doesn't show 
correct state for the default
[Nov 8, 2018 6:35:45 PM] (shashikant) HDDS-806. Update Ratis to latest snapshot 
version in ozone. Contributed
[Nov 8, 2018 10:52:24 PM] (gifuma) HADOOP-15903. Allow HttpServer2 to discover 
resources in /static when
[Nov 9, 2018 12:02:48 AM] (haibochen) YARN-8990. Fix fair scheduler race 
condition in app submit and queue




-1 overall


The following subsystems voted -1:
findbugs hadolint pathlen shadedclient unit


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace


   cc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/diff-compile-cc-root.txt
  [4.0K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/diff-compile-javac-root.txt
  [324K]

   checkstyle:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/diff-checkstyle-root.txt
  [17M]

   hadolint:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/diff-patch-hadolint.txt
  [4.0K]

   pathlen:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/pathlen.txt
  [12K]

   pylint:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/diff-patch-pylint.txt
  [40K]

   shellcheck:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/diff-patch-shellcheck.txt
  [68K]

   shelldocs:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/diff-patch-shelldocs.txt
  [12K]

   whitespace:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/whitespace-eol.txt
  [9.3M]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/whitespace-tabs.txt
  [1.1M]

   findbugs:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-hdds_client.txt
  [24K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-hdds_container-service.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-hdds_framework.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-hdds_server-scm.txt
  [12K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-hdds_tools.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-ozone_client.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-ozone_common.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-ozone_objectstore-service.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-ozone_ozone-manager.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-ozone_ozonefs.txt
  [16K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-ozone_s3gateway.txt
  [44K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/branch-findbugs-hadoop-ozone_tools.txt
  [8.0K]

   javadoc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/diff-javadoc-javadoc-root.txt
  [752K]

   unit:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/patch-unit-hadoop-common-project_hadoop-minikdc.txt
  [12K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/patch-unit-hadoop-common-project_hadoop-auth.txt
  [40K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
  [720K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/952/artifact/out/patch-unit-hadoop-common-

[jira] [Created] (YARN-8995) Log the event type of the too big event queue size, and add the information to the metrics.

2018-11-08 Thread zhuqi (JIRA)
zhuqi created YARN-8995:
---

 Summary: Log the event type of the too big event queue size, and 
add the information to the metrics. 
 Key: YARN-8995
 URL: https://issues.apache.org/jira/browse/YARN-8995
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: metrics, nodemanager, resourcemanager
Affects Versions: 3.1.0
Reporter: zhuqi
Assignee: zhuqi


In our growing cluster, there are unexpected situations in which some event 
queues block and hurt the performance of the cluster, such as the bug in 
https://issues.apache.org/jira/browse/YARN-5262 . I think it is necessary to 
log the event type when the event queue size grows too big, add that 
information to the metrics, and make the queue-size threshold a configurable 
parameter.
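A rough sketch of what such logging could look like (hypothetical class, not the actual AsyncDispatcher code): count queued events per type and, when the queue size crosses a configurable threshold, report which event types dominate the backlog.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: tracks per-type counts of queued events and logs a
// breakdown when the total queue size exceeds a configurable threshold.
public class EventQueueMonitor {
    private final Map<String, LongAdder> countsByType = new ConcurrentHashMap<>();
    private final int logThreshold;
    private int queueSize = 0;

    public EventQueueMonitor(int logThreshold) {
        this.logThreshold = logThreshold;
    }

    public synchronized void onEnqueue(String eventType) {
        queueSize++;
        countsByType.computeIfAbsent(eventType, t -> new LongAdder()).increment();
        if (queueSize > logThreshold) {
            // In a real dispatcher this would go to the log and to metrics.
            System.out.println("Event queue size " + queueSize
                + " exceeds threshold " + logThreshold
                + "; backlog by type: " + summary());
        }
    }

    public synchronized void onDequeue(String eventType) {
        queueSize--;
        countsByType.get(eventType).decrement();
    }

    public long countFor(String eventType) {
        LongAdder a = countsByType.get(eventType);
        return a == null ? 0 : a.sum();
    }

    private String summary() {
        StringBuilder sb = new StringBuilder();
        countsByType.forEach(
            (t, c) -> sb.append(t).append('=').append(c.sum()).append(' '));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        EventQueueMonitor m = new EventQueueMonitor(2);
        m.onEnqueue("NODE_UPDATE");
        m.onEnqueue("NODE_UPDATE");
        m.onEnqueue("APP_ATTEMPT_ADDED"); // crosses threshold, logs breakdown
        System.out.println(m.countFor("NODE_UPDATE"));
    }
}
```

Exposing countFor() per type is what would feed the proposed metric, alongside the threshold-triggered log line.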






[jira] [Created] (YARN-8996) [Submarine] Simplify the logic in YarnServiceJobSubmitter#needHdfs

2018-11-08 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8996:
--

 Summary: [Submarine] Simplify the logic in 
YarnServiceJobSubmitter#needHdfs
 Key: YARN-8996
 URL: https://issues.apache.org/jira/browse/YARN-8996
 Project: Hadoop YARN
  Issue Type: Improvement
 Environment: In YarnServiceJobSubmitter#needHdfs, the code below can be 
simplified to one line.
{code:java}
if (content != null && content.contains("hdfs://")) {
  return true;
}
return false;{code}
{code:java}
return content != null && content.contains("hdfs://");{code}
Reporter: Zhankun Tang
Assignee: Zhankun Tang









[jira] [Created] (YARN-8997) [Submarine] Simplify the logic in YarnServiceJobSubmitter#needHdfs

2018-11-08 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8997:
--

 Summary: [Submarine] Simplify the logic in 
YarnServiceJobSubmitter#needHdfs
 Key: YARN-8997
 URL: https://issues.apache.org/jira/browse/YARN-8997
 Project: Hadoop YARN
  Issue Type: Improvement
 Environment: In YarnServiceJobSubmitter#needHdfs, the code below can be 
simplified to one line.
{code:java}
if (content != null && content.contains("hdfs://")) {
  return true;
}
return false;{code}
{code:java}
return content != null && content.contains("hdfs://");{code}
Reporter: Zhankun Tang
Assignee: Zhankun Tang









[jira] [Created] (YARN-8998) Simplify the condition check in CliUtils#argsForHelp

2018-11-08 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8998:
--

 Summary: Simplify the condition check in CliUtils#argsForHelp
 Key: YARN-8998
 URL: https://issues.apache.org/jira/browse/YARN-8998
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Zhankun Tang
Assignee: Zhankun Tang
 Attachments: YARN-8998-trunk-001.patch


{code:java}
if (args[0].equals("-h") || args[0].equals("--help")) {
  return true;
}
{code}


Can be simplified to:

{code:java}
 return args[0].equals("-h") || args[0].equals("--help");
{code}







[jira] [Created] (YARN-8999) [Submarine] Remove redundant local variables

2018-11-08 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8999:
--

 Summary: [Submarine] Remove redundant local variables
 Key: YARN-8999
 URL: https://issues.apache.org/jira/browse/YARN-8999
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Zhankun Tang
Assignee: Zhankun Tang


Several methods have redundant local variables that can be removed.
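For illustration only (a hypothetical method, not taken from the Submarine code), the typical pattern is a local variable that merely forwards a value and can be inlined:

```java
// Hypothetical before/after example of removing a redundant local.
public class RedundantLocalSketch {
    // Before: the local "result" adds nothing over the expression itself.
    static int sumBefore(int a, int b) {
        int result = a + b;
        return result;
    }

    // After: return the expression directly.
    static int sumAfter(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        // Both variants behave identically.
        System.out.println(sumBefore(2, 3) == sumAfter(2, 3));
    }
}
```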


