[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-11-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018122#comment-15018122
 ] 

Jason Lowe commented on YARN-4225:
--

Patch looks good overall, with just one minor issue regarding forward 
compatibility.  If a newer client talks to an older ResourceManager, it 
will report a preemption status when none was provided.  This could lead to 
some confusion when preemption actually is disabled but the client defaults 
the missing field to false and reports preemption as enabled.  Not sure that is 
a must-fix for this scenario.  If it is, we'd have to expose something like the 
{{hasPreemptionDisabled}} protobuf method so the client could avoid reporting 
the value if the server didn't provide it.
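
For illustration, a minimal sketch of the client-side guard being described, 
assuming a generated {{hasPreemptionDisabled()}} accessor on the queue info 
proto (the proto and accessor names are placeholders, not the actual patch):
{code}
// Only report a preemption status when the server actually set the optional
// field, instead of defaulting a missing value to "enabled".
QueueInfoProto proto = response.getQueueInfo();   // placeholder accessor
if (proto.hasPreemptionDisabled()) {
  System.out.println("Preemption : "
      + (proto.getPreemptionDisabled() ? "disabled" : "enabled"));
}
// else: omit the line entirely rather than reporting a default value
{code}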

> Add preemption status to yarn queue -status for capacity scheduler
> --
>
> Key: YARN-4225
> URL: https://issues.apache.org/jira/browse/YARN-4225
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-4225.001.patch, YARN-4225.002.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM scheduler UI rounds user limit factor

2015-11-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018131#comment-15018131
 ] 

Jason Lowe commented on YARN-4374:
--

On second thought, I think the first approach is more appropriate.  Float 
formatting already truncates to 6 digits of precision after the decimal 
point (and discards unnecessary zeroes).  Since this is a case where we are 
not applying any kind of calculation to the value parsed from the config, I 
think we can just rely on the standard float formatting to convey the value we 
parsed.  If we were calculating the user limit then we could definitely run 
into precision annoyances like 0.250001 instead of 0.25, but since we're not 
calculating we shouldn't hit them.

So +1 for the original patch.  Will commit later today if there are no 
objections.
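
For reference, a quick standalone illustration of the difference (not the 
actual patch code):
{code}
public class FloatFormatDemo {
  public static void main(String[] args) {
    float userLimitFactor = 0.25f;
    // Rounding to one decimal place is what produces the confusing 0.3:
    System.out.println(String.format("%.1f", userLimitFactor)); // 0.3
    // Standard float formatting just echoes the parsed value:
    System.out.println(String.valueOf(userLimitFactor));        // 0.25
  }
}
{code}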

> RM scheduler UI rounds user limit factor
> 
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout

2015-11-20 Thread Tsuyoshi Ozawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi Ozawa updated YARN-4348:
-
Priority: Blocker  (was: Major)

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of 
> zkSessionTimeout
> 
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.2
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, 
> YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause the following situation:
> 1. syncInternal times out,
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout

2015-11-20 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018159#comment-15018159
 ] 

Tsuyoshi Ozawa commented on YARN-4348:
--

Found that this is caused by the lock ordering.

1. (In the RM main thread) ZKRMStateStore is locked in startInternal -> waiting 
on lock.await()
2. (In ZK's event thread) got a SyncConnected event from ZK -> calling 
ForwardingWatcher#process -> processWatchEvent is called, but ZKRMStateStore has 
been locked since step 1
3. (In the RM main thread) timeout and IOException -> ZKRMStateStore is unlocked 
-> the callback of sync, processEvent, is finally fired.

I will attach a patch to address this problem.
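
A minimal standalone sketch of this interaction (illustrative only, not the 
actual ZKRMStateStore code): the callback can only run after the waiter has 
already timed out, because both need the same lock.
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LockOrderingSketch {
  public static void main(String[] args) throws Exception {
    final Object storeLock = new Object();           // stands in for the ZKRMStateStore lock
    final CountDownLatch synced = new CountDownLatch(1);

    // 2. "ZK event thread": must take the same lock to deliver the sync callback.
    Thread zkEventThread = new Thread(new Runnable() {
      @Override
      public void run() {
        synchronized (storeLock) {   // blocked while the main thread holds the lock
          synced.countDown();        // the callback fires only after the lock is released
        }
      }
    });

    // 1. "RM main thread": holds the lock and waits for the callback.
    synchronized (storeLock) {
      zkEventThread.start();
      // 3. Times out, because the callback cannot run until storeLock is released.
      boolean completed = synced.await(5, TimeUnit.SECONDS);
      System.out.println("sync completed before timeout: " + completed);  // false
    }
    zkEventThread.join();            // the callback now runs, but too late
  }
}
{code}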

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of 
> zkSessionTimeout
> 
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.2
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, 
> YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause the following situation:
> 1. syncInternal times out,
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4377) Container process not killed

2015-11-20 Thread Jaromir Vanek (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaromir Vanek updated YARN-4377:

Description: 
It seems my processes in containers are not killed when the whole job is 
killed. Containers hang in the {{KILLING}} state forever.

The root of this problem is that signals sent to the container process are sent 
with the wrong user permissions.

From the nodemanager log:
{quote}
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Got pid 20748 for container container_1443786884805_2298_01_03
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Sending signal to pid 20748 as user _submitter_ for container 
container_1443786884805_2298_01_03
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Sending 
signal 15 to pid 20748 as user _submitter_
2015-11-20 15:38:22,069 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Sent signal SIGTERM to pid 20748 as user _submitter_ for container 
container_1443786884805_2298_01_03, result=failed
2015-11-20 15:38:22,319 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Sending 
signal 9 to pid 20748 as user _submitter_
{quote}

{{SIGTERM}} and the following {{SIGKILL}} signal are sent with the *submitter* 
user's permissions, but the container process is running under the *yarn* user 
by default (when using {{DefaultContainerExecutor}}, which is the case here). 
The result is that the signals are ignored and the container runs forever.

Am I doing something wrong, or is this a bug?

  was:
It seems my processes in containers are not killed when the whole job is 
killed. Containers will hang in {{KILLING}} state until forever.

The root of this problem is that signals sent to the container process are sent 
with wrong user permissions. 

From the nodemanager log:
{quote}
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Got pid 20748 for container container_1443786884805_2298_01_03
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Sending signal to pid 20748 as user _submitter_ for container 
container_1443786884805_2298_01_03
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Sending 
signal 15 to pid 20748 as user _submitter_
2015-11-20 15:38:22,069 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Sent signal SIGTERM to pid 20748 as user _submitter_ for container 
container_1443786884805_2298_01_03, result=failed
2015-11-20 15:38:22,319 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Sending 
signal 9 to pid 20748 as user _submitter_
{quote}

{{SIGTERM}} and following {{SIGKILL}} signals are sent with the *submitter* 
user permissions, but the container process is running under *yarn* user by 
default (when using {{DefaultContainerExecutor}} which is true in my case}}. 
The result is that signals are ignored and container will run forever.

Am I doing something wrong or is it a bug?


> Container process not killed
> 
>
> Key: YARN-4377
> URL: https://issues.apache.org/jira/browse/YARN-4377
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
> Environment: Debian linux
>Reporter: Jaromir Vanek
>Priority: Critical
>
> It seems my processes in containers are not killed when the whole job is 
> killed. Containers will hang in {{KILLING}} state until forever.
> The root of this problem is that signals sent to the container process are 
> sent with wrong user permissions. 
> From the nodemanager log:
> {quote}
> 2015-11-20 15:38:22,063 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Got pid 20748 for container container_1443786884805_2298_01_03
> 2015-11-20 15:38:22,063 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Sending signal to pid 20748 as user _submitter_ for container 
> container_1443786884805_2298_01_03
> 2015-11-20 15:38:22,063 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Sending 
> signal 15 to pid 20748 as user _submitter_
> 2015-11-20 15:38:22,069 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Sent signal SIGTERM to pid 20748 as user _submitter_ for container 
> container_1443786884805_2298_01_03, result=failed
> 2015-11-20 15:38:22,319 DEBUG 
> 

[jira] [Created] (YARN-4377) Container process not killed

2015-11-20 Thread Jaromir Vanek (JIRA)
Jaromir Vanek created YARN-4377:
---

 Summary: Container process not killed
 Key: YARN-4377
 URL: https://issues.apache.org/jira/browse/YARN-4377
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
 Environment: Debian linux
Reporter: Jaromir Vanek
Priority: Critical


It seems my processes in containers are not killed when the whole job is 
killed. Containers will hang in {{KILLING}} state until forever.

The root of this problem is that signals sent to the container process are sent 
with wrong user permissions. 

From the nodemanager log:
{quote}
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Got pid 20748 for container container_1443786884805_2298_01_03
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Sending signal to pid 20748 as user _submitter_ for container 
container_1443786884805_2298_01_03
2015-11-20 15:38:22,063 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Sending 
signal 15 to pid 20748 as user _submitter_
2015-11-20 15:38:22,069 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Sent signal SIGTERM to pid 20748 as user _submitter_ for container 
container_1443786884805_2298_01_03, result=failed
2015-11-20 15:38:22,319 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Sending 
signal 9 to pid 20748 as user _submitter_
{quote}

{{SIGTERM}} and following {{SIGKILL}} signals are sent with the *submitter* 
user permissions, but the container process is running under *yarn* user by 
default (when using {{DefaultContainerExecutor}} which is true in my case}}. 
The result is that signals are ignored and container will run forever.

Am I doing something wrong or is it a bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4372) Cannot enable system-metrics-publisher inside MiniYARNCluster

2015-11-20 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018418#comment-15018418
 ] 

Naganarasimha G R commented on YARN-4372:
-

Hi [~vinodkv],
bq. Can you try my uploaded patch here together with YARN-4350 and see if you 
still find issues?
I took the latest code from trunk, since YARN-2859 is committed, and then just 
added the modification below in 
{{TestDistributedShell}}:
{code}
@@ -80,6 +80,7 @@ protected void setupInternal(int numNodeManager) throws Exception {
 conf.setInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB, 128);
 conf.set("yarn.log.dir", "target");
 conf.setBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, true);
+conf.setBoolean(YarnConfiguration.RM_SYSTEM_METRICS_PUBLISHER_ENABLED, true);
 conf.set(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class.getName());
 conf.setBoolean(YarnConfiguration.NODE_LABELS_ENABLED, true);
 conf.set("mapreduce.jobhistory.address",
{code}
Even with the patch, {{TestDistributedShell.testDSShellWithoutDomain}} still 
shows the problem (the test case passes, but the console logs contain errors 
about an unreachable timeline server for each SMP event). I cross-verified that 
{{resURI}} is getting set properly in {{TimelineClientImpl}}. This is the same 
behavior I had faced when I last analyzed this, and I had made similar 
modifications in MiniYarnCluster.
A few console logs:
{code}
2015-11-20 23:17:15,867 ERROR [AsyncDispatcher event handler] 
impl.TimelineClientImpl (TimelineClientImpl.java:doPosting(336)) - Failed to 
get the response from the timeline server.
2015-11-20 23:17:15,868 ERROR [AsyncDispatcher event handler] 
metrics.SystemMetricsPublisher (SystemMetricsPublisher.java:putEntity(485)) - 
Error when publishing entity 
[YARN_CONTAINER,container_1448041615020_0001_01_02]
{code}
bq. The separate thread is not an issue as RM will only start after AHS 
successfully starts.
Yes, you are right; my mistake. I had just observed that it was started as part 
of a thread, but we do further check whether the service has started and throw 
an exception if necessary.

bq. Even if that store is specified, TimelineDataManager will continue to be 
instantiated in the server, so your assumption is wrong that levelDB is not 
created. Agree?
Oops, I had wrongly mentioned *levelDBtimelinestore*; I meant 
{{ApplicationHistoryManagerOnTimelineStore}}. 
Since we are starting the timeline store anyway, it is better to use 
ApplicationHistoryManagerOnTimelineStore than the almost-deprecated 
{{yarn.timeline-service.generic-application-history.store-class}} interface 
(not a must-fix, but better to have).

> Cannot enable system-metrics-publisher inside MiniYARNCluster
> -
>
> Key: YARN-4372
> URL: https://issues.apache.org/jira/browse/YARN-4372
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-4372-20151119.1.txt
>
>
> [~Naganarasimha] found this at YARN-2859, see [this 
> comment|https://issues.apache.org/jira/browse/YARN-2859?focusedCommentId=15005746=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15005746].
> The way daemons are started inside MiniYARNCluster, RM is not setup correctly 
> to send information to TimelineService.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3623) We should have a config to indicate the Timeline Service version

2015-11-20 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018486#comment-15018486
 ] 

Naganarasimha G R commented on YARN-3623:
-

This would be good, [~vinodkv],
but I was under the impression that a timeline service version config is not 
required for versions before ATS 1.5, that it would be introduced in YARN-4234 
(as part of ATS 1.5), and that the required modifications, wherever applicable 
to ATSv2, would be handled in this JIRA (earlier an ATS sub-JIRA). Anyway, we 
can create another sub-JIRA under YARN-2928 to handle usage of this newly 
introduced version.

> We should have a config to indicate the Timeline Service version
> 
>
> Key: YARN-3623
> URL: https://issues.apache.org/jira/browse/YARN-3623
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Naganarasimha G R
> Attachments: YARN-3623-2015-11-19.1.patch
>
>
> So far RM, MR AM, DA AM added/changed new config to enable the feature to 
> write the timeline data to v2 server. It's good to have a YARN 
> timeline-service.version config like timeline-service.enable to indicate the 
> version of the running timeline service with the given YARN cluster. It's 
> beneficial for users to more smoothly move from v1 to v2, as they don't need 
> to change the existing config, but switch this config from v1 to v2. And each 
> framework doesn't need to have their own v1/v2 config.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4365) FileSystemNodeLabelStore should check for root dir existence on startup

2015-11-20 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-4365:
--
Attachment: YARN-4365-1.patch

Checks whether the path exists before calling mkdir() for the root path. A unit 
test is included; a small sketch of the idea is below.
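
A small sketch of the approach, assuming the check is done against the 
configured label-store root (the method name {{ensureRootDirExists}} is 
illustrative, not the actual patch):
{code}
// Uses org.apache.hadoop.fs.{FileSystem, Path}; only create the root dir when
// it is missing, so an RM restart with an existing store does not need a write
// to HDFS (which would fail while the NameNode is in safe mode).
private void ensureRootDirExists(FileSystem fs, Path rootDirPath)
    throws IOException {
  if (!fs.exists(rootDirPath)) {
    fs.mkdirs(rootDirPath);
  }
}
{code}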

> FileSystemNodeLabelStore should check for root dir existence on startup
> ---
>
> Key: YARN-4365
> URL: https://issues.apache.org/jira/browse/YARN-4365
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: Jason Lowe
>Assignee: Kuhu Shukla
> Attachments: YARN-4365-1.patch
>
>
> If the namenode is in safe mode for some reason then FileSystemNodeLabelStore 
> will prevent the RM from starting since it unconditionally tries to create 
> the root directory for the label store.  If the root directory already exists 
> and no labels are changing then we shouldn't prevent the RM from starting 
> even if the namenode is in safe mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4053) Change the way metric values are stored in HBase Storage

2015-11-20 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018447#comment-15018447
 ] 

Varun Saxena commented on YARN-4053:


Thanks [~sjlee0] for the review and commit. And [~jrottinghuis] and 
[~vrushalic] for the reviews.

> Change the way metric values are stored in HBase Storage
> 
>
> Key: YARN-4053
> URL: https://issues.apache.org/jira/browse/YARN-4053
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Fix For: YARN-2928
>
> Attachments: YARN-4053-YARN-2928.01.patch, 
> YARN-4053-YARN-2928.02.patch, YARN-4053-feature-YARN-2928.03.patch, 
> YARN-4053-feature-YARN-2928.04.patch, YARN-4053-feature-YARN-2928.05.patch, 
> YARN-4053-feature-YARN-2928.06.patch, YARN-4053-feature-YARN-2928.07.patch, 
> YARN-4053-feature-YARN-2928.08.patch
>
>
> Currently HBase implementation uses GenericObjectMapper to convert and store 
> values in backend HBase storage. This converts everything into a string 
> representation(ASCII/UTF-8 encoded byte array).
> While this is fine in most cases, it does not quite serve our use case for 
> metrics. 
> So we need to decide how are we going to encode and decode metric values and 
> store them in HBase.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4183) Enabling generic application history forces every job to get a timeline service delegation token

2015-11-20 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018478#comment-15018478
 ] 

Naganarasimha G R commented on YARN-4183:
-

Thanks for the proposal, [~sjlee0].
My only additional question: should the timeline server run only when 
{{yarn.timeline-service.enabled}} is enabled? My opinion would be yes, so that 
the configuration is stronger. Thoughts?

bq. there needs to be a strong client-side config for each framework (MR or 
tez) that controls whether it wants to use the timeline service;
How about naming the config 
{{yarn.timeline-service.client.require-delegation-token}} or 
{{yarn.timeline-service.client.delegation-token.enabled}}?

bq. I hope the last point should address the original issue of this JIRA
To further clarify, and as mentioned earlier, there is one more config, 
{{yarn.timeline-service.client.best-effort}}, which prevents clients from 
failing when the delegation token cannot be retrieved.
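
Purely as an illustration of how these flags could combine on the client side 
(the {{require-delegation-token}} key is only a proposal in this discussion, 
not an existing property):
{code}
// org.apache.hadoop.conf.Configuration
Configuration conf = new Configuration();
conf.setBoolean("yarn.timeline-service.enabled", true);
// Proposed key from this comment -- not yet in YarnConfiguration:
conf.setBoolean("yarn.timeline-service.client.require-delegation-token", true);
// Existing best-effort flag: do not fail the app when the token cannot be fetched.
conf.setBoolean("yarn.timeline-service.client.best-effort", true);
{code}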

> Enabling generic application history forces every job to get a timeline 
> service delegation token
> 
>
> Key: YARN-4183
> URL: https://issues.apache.org/jira/browse/YARN-4183
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-4183.1.patch
>
>
> When enabling just the Generic History Server and not the timeline server, 
> the system metrics publisher will not publish the events to the timeline 
> store as it checks if the timeline server and system metrics publisher are 
> enabled before creating a timeline client.
> To make it work, if the timeline service flag is turned on, it will force 
> every yarn application to get a delegation token.
> Instead of checking if timeline service is enabled, we should be checking if 
> application history server is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-11-20 Thread Brook Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018487#comment-15018487
 ] 

Brook Zhou commented on YARN-3223:
--

Makes sense.

One thing I'm not sure about: RMNodeImpl does not directly know the amount of 
used resource needed to trigger an RMNodeResourceUpdateEvent. I can use 
rmNode.context.getScheduler().(rmNode.getNodeID()).getUsedResource(), but 
I'm not sure whether adding that dependency on the scheduler is okay.
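
For what it's worth, a sketch of that access path as I understand the scheduler 
API (illustrative only; whether the dependency belongs here is exactly the open 
question):
{code}
// Ask the scheduler for the node's report instead of tracking used resource in
// RMNodeImpl itself; null-check in case the node is not (yet) known to the scheduler.
ResourceScheduler scheduler = rmNode.context.getScheduler();
SchedulerNodeReport report = scheduler.getNodeReport(rmNode.getNodeID());
Resource used =
    (report != null) ? report.getUsedResource() : Resource.newInstance(0, 0);
{code}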

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch
>
>
> During NM graceful decommission, we should handle resource update properly, 
> include: make RMNode keep track of old resource for possible rollback, keep 
> available resource to 0 and used resource get updated when
> container finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3454) RLESparseResourceAllocation does not handle removal of partial intervals (+ introducing support for efficient "merge" operations)

2015-11-20 Thread Carlo Curino (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlo Curino updated YARN-3454:
---
Attachment: YARN-3454.5.patch

> RLESparseResourceAllocation does not handle removal of partial intervals (+ 
> introducing support for efficient "merge" operations) 
> --
>
> Key: YARN-3454
> URL: https://issues.apache.org/jira/browse/YARN-3454
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0, 2.7.1, 2.6.2
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-3454.1.patch, YARN-3454.2.patch, YARN-3454.3.patch, 
> YARN-3454.4.patch, YARN-3454.5.patch, YARN-3454.patch
>
>
> The RLESparseResourceAllocation.removeInterval(...) method handles exact-match 
> interval removals well, but does not correctly handle partial overlaps. 
> In the context of this fix, we also introduced static methods to "merge" two 
> RLESparseResourceAllocation, while applying an operator in the process 
> (add/subtract/min/max/subtractTestPositive).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4365) FileSystemNodeLabelStore should check for root dir existence on startup

2015-11-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018550#comment-15018550
 ] 

Hadoop QA commented on YARN-4365:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 31s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in 
trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 17s {color} 
| {color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 24s {color} 
| {color:red} hadoop-yarn-common in the patch failed with JDK v1.7.0_85. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
26s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 26m 29s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | hadoop.yarn.webapp.TestWebApp |
| JDK v1.7.0_85 Failed junit tests | hadoop.yarn.webapp.TestWebApp |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:date2015-11-20 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12773553/YARN-4365-1.patch |
| JIRA Issue | YARN-4365 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux ff28fcdb5368 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 

[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-20 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.2.patch

Fixed the broken tests TestLeveldbRMStateStore and TestYarnConfigurationFields; 
the other broken tests are not related.
Also added additional tests for the state store.

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.2.patch, YARN-4334.patch, 
> YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, 
> YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3623) We should have a config to indicate the Timeline Service version

2015-11-20 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018657#comment-15018657
 ] 

Naganarasimha G R commented on YARN-3623:
-

Hi [~vinodkv], 
bq. There should be an explicit yarn.timeline-service.version which tells 
YarnClient to get tokens or not - yes for non-present version (default), v1, v2 
but no for v1.5.
IIUC this contradicts [~sjlee0]'s proposal discussed in YARN-4183 and the 
approach I had discussed in YARN-4234 with [~gtCarrera].
I can envisage it as follows (including the opinions of [~sjlee0]); a small 
sketch of the version-based selection follows the list:
* The timeline service is configured with {{"yarn.timeline-service.enabled"}} 
set to true, along with {{"TIMELINE_SERVICE_VERSION"}}, based on which the 
appropriate timeline handler is selected.
* RM/NM (SMP) are configured with {{"yarn.timeline-service.enabled"}} and 
{{yarn.resourcemanager.system-metrics-publisher.enabled}} set to true, along 
with {{"TIMELINE_SERVICE_VERSION"}}, based on which the appropriate client is 
selected.
* Client apps that want to communicate with the timeline server enable 
{{"yarn.timeline-service.enabled"}} and set a new config like 
{{yarn.timeline-service.client.require-delegation-token}} to true in order to 
publish ATS events.
* If security, {{"yarn.timeline-service.enabled"}} and 
{{"yarn.timeline-service.client.require-delegation-token"}} are all enabled, 
then the delegation token is automatically fetched in the YARN client, as 
before.
* When the timeline client is initialized, it contacts the server for the 
version and related configs through REST, and once it receives them it 
initializes itself. If the user tries to use methods that are not appropriate 
for the server's timeline version, the timeline client throws an exception.

This will avoid mismatched configurations between client and server, and also 
lets the client take its configuration flexibly from the server, so it can 
avoid fetching delegation tokens from ATS based on the server config. 
A note: the above proposal does not consider multiple simultaneous versions, 
which are still under discussion in YARN-4368.
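
Just to make the selection concrete, a rough sketch (the 
{{yarn.timeline-service.version}} key here is the proposed config, not an 
existing property):
{code}
Configuration conf = new Configuration();
boolean tsEnabled = conf.getBoolean("yarn.timeline-service.enabled", false);
// Proposed version key; 1.0f assumed as the default for older setups.
float tsVersion = conf.getFloat("yarn.timeline-service.version", 1.0f);

if (tsEnabled) {
  if (tsVersion >= 2.0f) {
    // select the ATS v2 client/handler (delegation token per v2 rules)
  } else if (tsVersion >= 1.5f) {
    // select the ATS v1.5 path (no delegation token needed, per the quote above)
  } else {
    // select the ATS v1 client/handler (fetch delegation token as today)
  }
}
{code}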

> We should have a config to indicate the Timeline Service version
> 
>
> Key: YARN-3623
> URL: https://issues.apache.org/jira/browse/YARN-3623
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Naganarasimha G R
> Attachments: YARN-3623-2015-11-19.1.patch
>
>
> So far RM, MR AM, DA AM added/changed new config to enable the feature to 
> write the timeline data to v2 server. It's good to have a YARN 
> timeline-service.version config like timeline-service.enable to indicate the 
> version of the running timeline service with the given YARN cluster. It's 
> beneficial for users to more smoothly move from v1 to v2, as they don't need 
> to change the existing config, but switch this config from v1 to v2. And each 
> framework doesn't need to have their own v1/v2 config.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)

2015-11-20 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018590#comment-15018590
 ] 

Naganarasimha G R commented on YARN-3784:
-

Thanks for considering this, [~sunilg].
Maybe if I knew the intended usage of the preemption timeout I could understand 
better.
IIUC, if it is meant to help the AM take a better decision, then consider a 
scenario where the AM's heartbeat interval is 3 seconds and the preemption 
timeout is 5 seconds, and a container is targeted for preemption just after an 
AM heartbeat. At the next heartbeat, 3 seconds later, the AM will be told the 
container will time out in 5 seconds instead of the 2 seconds actually 
remaining, which I feel is of little help and might even lead to wrong 
decisions. Hence I suggested a timestamp instead of just the timeout value 
(worked through below). Thoughts?
Correct me if I am wrong!
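
To make the numbers explicit, a small worked example of the scenario above 
(values taken from this comment):
{code}
long markedAtMs          = 0L;     // RM marks the container for preemption at t0
long preemptionTimeoutMs = 5_000;  // RM will act at t0 + 5s
long heardAtMs           = 3_000;  // AM only learns of it at the next heartbeat, t0 + 3s

// If the AM is handed the relative timeout, it assumes a deadline of
// heardAtMs + preemptionTimeoutMs = 8s, but the real deadline is 5s.
long realDeadlineMs = markedAtMs + preemptionTimeoutMs;   // 5_000
long actuallyLeftMs = realDeadlineMs - heardAtMs;         // 2_000, not 5_000

// Sending an absolute timestamp (realDeadlineMs) instead lets the AM compute
// the remaining time correctly regardless of heartbeat skew.
{code}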

> Indicate preemption timout along with the list of containers to AM 
> (preemption message)
> ---
>
> Key: YARN-3784
> URL: https://issues.apache.org/jira/browse/YARN-3784
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch, 
> 0003-YARN-3784.patch, 0004-YARN-3784.patch
>
>
> Currently during preemption, AM is notified with a list of containers which 
> are marked for preemption. Introducing a timeout duration also along with 
> this container list so that AM can know how much time it will get to do a 
> graceful shutdown to its containers (assuming one of preemption policy is 
> loaded in AM).
> This will help in decommissioning NM scenarios, where NM will be 
> decommissioned after a timeout (also killing containers on it). This timeout 
> will be helpful to indicate AM that those containers can be killed by RM 
> forcefully after the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018637#comment-15018637
 ] 

Hadoop QA commented on YARN-4334:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 5m 31s 
{color} | {color:red} Docker failed to build yetus/hadoop:date2015-11-20. 
{color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12773579/YARN-4334.2.patch |
| JIRA Issue | YARN-4334 |
| Powered by | Apache Yetus   http://yetus.apache.org |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9749/console |


This message was automatically generated.



> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.2.patch, YARN-4334.patch, 
> YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, 
> YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4368) Support Multiple versions of the timeline service at the same time

2015-11-20 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018509#comment-15018509
 ] 

Naganarasimha G R commented on YARN-4368:
-

Thanks for sharing your views, [~sjlee0] & [~vrushalic].
bq. For example, what would we allow in terms of client behavior? Assuming both 
v.1 and v.2 are enabled on the cluster, should clients still have ability to 
pick which version it would write to (e.g. v.1 only, v.2 only, v.1 + v.2)?
Any particular reason for not allowing the combinations {{v.1 + v.1.5}} 
or {{v.1.5 + v.2}}? 

bq. I would lean towards not having it as a client side setting since adhoc 
jobs may choose to not emit stats to ATS v2 and that will disrupt the 
accountability of users on the cluster.
IIUC, a job would only indicate from the client side that it will not use the 
timeline service to emit its own events to ATS; the RM and NM publishers would 
not be affected by that client configuration, so YARN-level ATS events would 
still be collected anyway.

Given the complexity and the advantages gained, my initial opinion is similar 
to [~vrushalic]'s: have only one of the versions enabled in the cluster at any 
given point in time.

> Support Multiple versions of the timeline service at the same time
> --
>
> Key: YARN-4368
> URL: https://issues.apache.org/jira/browse/YARN-4368
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Naganarasimha G R
>
> During a rolling upgrade it will be helpful to have the older version of the 
> timeline server also running, so that existing apps can submit to the older 
> version of ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM scheduler UI rounds user limit factor

2015-11-20 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018709#comment-15018709
 ] 

Chang Li commented on YARN-4374:


Thanks, [~jlowe], for the further review! I am good with going with the first patch.

> RM scheduler UI rounds user limit factor
> 
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-20 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4334:
---
Attachment: YARN-4334.3.patch

> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.2.patch, YARN-4334.3.patch, YARN-4334.patch, 
> YARN-4334.wip.2.patch, YARN-4334.wip.3.patch, YARN-4334.wip.4.patch, 
> YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-11-20 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3840:
---
Attachment: YARN-3840.reopened.01.patch

> Resource Manager web ui issue when sorting application by id (with 
> application having id > 9999)
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Mohammad Shahid Khan
> Fix For: 2.8.0, 2.7.3
>
> Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
> YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, YARN-3840-6.patch, 
> YARN-3840.reopened.01.patch, yarn-3840-7.patch
>
>
> On the WEBUI, the global main view page 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4376) Memory Timeline Store return incorrect results on fromId paging

2015-11-20 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018856#comment-15018856
 ] 

Jonathan Eagles commented on YARN-4376:
---

According to the 
[PriorityQueue|http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 javadoc:

{code}
The Iterator provided in method iterator() is not guaranteed to traverse the 
elements of the priority queue in any particular order. If you need ordered 
traversal, consider using Arrays.sort(pq.toArray()).
{code}
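
A quick standalone demonstration of that caveat (not the store code itself):
{code}
import java.util.Arrays;
import java.util.PriorityQueue;

public class PriorityQueueIterationDemo {
  public static void main(String[] args) {
    PriorityQueue<Integer> pq = new PriorityQueue<>(Arrays.asList(5, 1, 4, 2, 3));
    // Iteration (and toString) reflect heap layout, not priority order:
    System.out.println(pq);                      // e.g. [1, 2, 4, 5, 3]
    // Ordered traversal, as the javadoc suggests:
    Object[] sorted = pq.toArray();
    Arrays.sort(sorted);
    System.out.println(Arrays.toString(sorted)); // [1, 2, 3, 4, 5]
  }
}
{code}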

> Memory Timeline Store return incorrect results on fromId paging
> ---
>
> Key: YARN-4376
> URL: https://issues.apache.org/jira/browse/YARN-4376
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>
> As pointed out correctly by [~jlowe]. 
> https://issues.apache.org/jira/browse/TEZ-2628?focusedCommentId=14715831=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14715831
> The MemoryTimelineStore cannot page correctly when using fromId. This is due 
> to switching between data structures that apparently have different natural 
> sorting. In addition, the approach of creating a new data structure from 
> scratch every time is costly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018848#comment-15018848
 ] 

Jason Lowe commented on YARN-4132:
--

Thanks for updating the patch!  I think we're close.

* The property docs should mention the fallback used when the property is 
unspecified; otherwise it's confusing to have no specified value in yarn-default.
* The new createRetryPolicy method can simply be private and therefore doesn't 
need the decorators indicating it is.
* Nit: the new createRetryPolicy overload should take Configuration as its first 
parameter for consistency with the existing method (see the sketch below).
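
On the last point, an illustrative signature only (parameter names are my 
assumption, not the actual patch):
{code}
// Keep Configuration first, matching the existing createRetryPolicy(Configuration);
// uses org.apache.hadoop.io.retry.{RetryPolicy, RetryPolicies}. The conf stays in
// the signature for consistency; the real method would read NM settings from it.
private static RetryPolicy createRetryPolicy(Configuration conf,
    long maxWaitTimeMs, long retryIntervalMs) {
  return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
      maxWaitTimeMs, retryIntervalMs, TimeUnit.MILLISECONDS);
}
{code}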


> Nodemanagers should try harder to connect to the RM
> ---
>
> Key: YARN-4132
> URL: https://issues.apache.org/jira/browse/YARN-4132
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, 
> YARN-4132.5.patch, YARN-4132.6.2.patch, YARN-4132.6.patch, YARN-4132.patch
>
>
> Being part of the cluster, nodemanagers should try very hard (and possibly 
> never give up) to connect to a resourcemanager. Minimally we should have a 
> separate config to set how aggressively a nodemanager will connect to the RM 
> separate from what clients will do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3454) RLESparseResourceAllocation does not handle removal of partial intervals (+ introducing support for efficient "merge" operations)

2015-11-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018859#comment-15018859
 ] 

Hadoop QA commented on YARN-3454:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
53s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
39s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 37s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 54s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 57s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
28s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 141m 51s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| JDK v1.7.0_85 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:date2015-11-20 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12773572/YARN-3454.5.patch |
| JIRA Issue | YARN-3454 |
| Optional Tests | 

[jira] [Commented] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019081#comment-15019081
 ] 

Hadoop QA commented on YARN-4334:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
36s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 46s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
26s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 34s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
41s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 16s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in 
trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 40s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 47s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 47s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 47s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 25s 
{color} | {color:red} Patch generated 8 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 609, now 614). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 35s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
41s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 4 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 21s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 introduced 2 new FindBugs issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 23s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 45s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 1m 50s {color} 
| {color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 50s {color} 
| {color:red} 

[jira] [Commented] (YARN-4374) RM capacity scheduler UI rounds user limit factor

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019090#comment-15019090
 ] 

Hudson commented on YARN-4374:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #689 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/689/])
YARN-4374. RM capacity scheduler UI rounds user limit factor. (jlowe: rev 
060cdcbe5db56383e74e282ac61eedff8248c11b)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/CHANGES.txt


> RM capacity scheduler UI rounds user limit factor
> -
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.3
>
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-11-20 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3769:
-
Summary: Consider user limit when calculating total pending resource for 
preemption policy in Capacity Scheduler  (was: Preemption occurring 
unnecessarily because preemption doesn't consider user limit)

> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769-branch-2.7.005.patch, YARN-3769-branch-2.7.006.patch, 
> YARN-3769-branch-2.7.007.patch, YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, 
> YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-11-20 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019110#comment-15019110
 ] 

Wangda Tan commented on YARN-3769:
--

Thanks [~eepayne], committing..

> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769-branch-2.7.005.patch, YARN-3769-branch-2.7.006.patch, 
> YARN-3769-branch-2.7.007.patch, YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, 
> YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019120#comment-15019120
 ] 

Hadoop QA commented on YARN-4132:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
46s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 0s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 50s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 25s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
11s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 45s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in 
trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 29s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 9s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 59s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 59s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 48s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 48s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 35s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 217, now 216). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 21s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
5s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 
33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 51s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 35s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 36s {color} 
| {color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 34s 
{color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 6s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 32s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 31s {color} 
| {color:red} hadoop-yarn-common in the patch failed 

[jira] [Commented] (YARN-3878) AsyncDispatcher can hang while stopping if it is configured for draining events on stop

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019117#comment-15019117
 ] 

Sangjin Lee commented on YARN-3878:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> AsyncDispatcher can hang while stopping if it is configured for draining 
> events on stop
> ---
>
> Key: YARN-3878
> URL: https://issues.apache.org/jira/browse/YARN-3878
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-3878.01.patch, YARN-3878.02.patch, 
> YARN-3878.03.patch, YARN-3878.04.patch, YARN-3878.05.patch, 
> YARN-3878.06.patch, YARN-3878.07.patch, YARN-3878.08.patch, 
> YARN-3878.09.patch, YARN-3878.09_reprorace.pat_h
>
>
> The sequence of events is as follows:
> # The RM is stopped while putting an RMStateStore event onto RMStateStore's 
> AsyncDispatcher, which causes an InterruptedException to be thrown.
> # As the RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. In 
> {{serviceStop}}, we check whether all events have been drained and wait for the 
> event queue to drain (the RM state store dispatcher is configured to drain its 
> queue on stop). 
> # This condition never becomes true, and the AsyncDispatcher keeps waiting 
> for the dispatcher event queue to drain until the JVM exits.
> *Initial exception while posting RM State store event to queue*
> {noformat}
> 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService 
> (AbstractService.java:enterState(452)) - Service: Dispatcher entered state 
> STOPPED
> 2015-06-27 20:08:35,923 WARN  [AsyncDispatcher event handler] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
> {noformat}
> *JStack of AsyncDispatcher hanging on stop*
> {noformat}
> "AsyncDispatcher event handler" prio=10 tid=0x7fb980222800 nid=0x4b1e 
> waiting on condition [0x7fb9654e9000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000700b79250> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
> at java.lang.Thread.run(Thread.java:744)
> "main" prio=10 tid=0x7fb98000a800 

[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019119#comment-15019119
 ] 

Sangjin Lee commented on YARN-3857:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> Memory leak in ResourceManager with SIMPLE mode
> ---
>
> Key: YARN-3857
> URL: https://issues.apache.org/jira/browse/YARN-3857
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: mujunchao
>Assignee: mujunchao
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.2
>
> Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, 
> YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch
>
>
>  We register the ClientTokenMasterKey to avoid the client holding an invalid 
> ClientToken after the RM restarts. In SIMPLE mode we register the pair, but we 
> never remove it from the HashMap, because unregister only runs in security 
> mode, so a memory leak results. 

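A minimal sketch of the cleanup the description above implies, using made-up class, method, and field names rather than the real RM secret-manager code; the only point is that unregister must remove the entry in SIMPLE mode just as in secure mode:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: registration is unconditional, so unregistration must be too,
// otherwise the map keeps growing forever in SIMPLE (non-secure) mode.
final class ClientTokenKeyRegistry {
  private final Map<String, byte[]> masterKeys = new ConcurrentHashMap<>();

  void registerApplication(String appAttemptId, byte[] masterKey) {
    masterKeys.put(appAttemptId, masterKey);   // happens in every auth mode
  }

  void unregisterApplication(String appAttemptId) {
    // Buggy pattern per the description: removal guarded by a security-enabled check.
    // Leak-free version: always remove, mirroring the unconditional register above.
    masterKeys.remove(appAttemptId);
  }
}
{code}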


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019144#comment-15019144
 ] 

Hudson commented on YARN-3769:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #8834 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8834/])
YARN-3769. Consider user limit when calculating total pending resource (wangda: 
rev 2346fa3141bf28f25a90b6a426a1d3a3982e464f)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicyForNodePartitions.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java


> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769-branch-2.7.005.patch, YARN-3769-branch-2.7.006.patch, 
> YARN-3769-branch-2.7.007.patch, YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, 
> YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019152#comment-15019152
 ] 

Hudson commented on YARN-3769:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1427 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1427/])
YARN-3769. Consider user limit when calculating total pending resource (wangda: 
rev 2346fa3141bf28f25a90b6a426a1d3a3982e464f)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicyForNodePartitions.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java


> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769-branch-2.7.005.patch, YARN-3769-branch-2.7.006.patch, 
> YARN-3769-branch-2.7.007.patch, YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, 
> YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM capacity scheduler UI rounds user limit factor

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019151#comment-15019151
 ] 

Hudson commented on YARN-4374:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1427 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1427/])
YARN-4374. RM capacity scheduler UI rounds user limit factor. (jlowe: rev 
060cdcbe5db56383e74e282ac61eedff8248c11b)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java


> RM capacity scheduler UI rounds user limit factor
> -
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.3
>
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM capacity scheduler UI rounds user limit factor

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019163#comment-15019163
 ] 

Hudson commented on YARN-4374:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2630 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2630/])
YARN-4374. RM capacity scheduler UI rounds user limit factor. (jlowe: rev 
060cdcbe5db56383e74e282ac61eedff8248c11b)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java


> RM capacity scheduler UI rounds user limit factor
> -
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.3
>
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019201#comment-15019201
 ] 

Hudson commented on YARN-3769:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #702 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/702/])
YARN-3769. Consider user limit when calculating total pending resource (wangda: 
rev 2346fa3141bf28f25a90b6a426a1d3a3982e464f)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicyForNodePartitions.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java


> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769-branch-2.7.005.patch, YARN-3769-branch-2.7.006.patch, 
> YARN-3769-branch-2.7.007.patch, YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, 
> YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-11-20 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3849:
-
Fix Version/s: 2.7.3

> Too much of preemption activity causing continuos killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant 
> Resource policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is some demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
> that all containers other than the AM get killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free 
> space. But there is updated demand from the app in QueueA, which lost 
> its containers earlier, and preemption now kicks in against QueueB.
> Steps 3 and 4 keep happening in a loop, so none of the 
> apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-11-20 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019210#comment-15019210
 ] 

Wangda Tan commented on YARN-3849:
--

Committed to branch-2.7 as well, thanks [~sunilg]!

> Too much of preemption activity causing continuos killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant 
> Resource policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is some demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
> that all containers other than the AM get killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free 
> space. But there is updated demand from the app in QueueA, which lost 
> its containers earlier, and preemption now kicks in against QueueB.
> Steps 3 and 4 keep happening in a loop, so none of the 
> apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-11-20 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3849:
-
Target Version/s:   (was: 2.7.3)

> Too much of preemption activity causing continuos killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant 
> Resource policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is some demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
> that all containers other than the AM get killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free 
> space. But there is updated demand from the app in QueueA, which lost 
> its containers earlier, and preemption now kicks in against QueueB.
> Steps 3 and 4 keep happening in a loop, so none of the 
> apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM capacity scheduler UI rounds user limit factor

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019058#comment-15019058
 ] 

Hudson commented on YARN-4374:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8833 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8833/])
YARN-4374. RM capacity scheduler UI rounds user limit factor. (jlowe: rev 
060cdcbe5db56383e74e282ac61eedff8248c11b)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/CHANGES.txt


> RM capacity scheduler UI rounds user limit factor
> -
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.3
>
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-20 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4132:
---
Attachment: YARN-4132.7.patch

Thanks [~jlowe] for the further review! I have updated the .7 patch accordingly.

> Nodemanagers should try harder to connect to the RM
> ---
>
> Key: YARN-4132
> URL: https://issues.apache.org/jira/browse/YARN-4132
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, 
> YARN-4132.5.patch, YARN-4132.6.2.patch, YARN-4132.6.patch, YARN-4132.7.patch, 
> YARN-4132.patch
>
>
> Being part of the cluster, nodemanagers should try very hard (and possibly 
> never give up) to connect to a resourcemanager. At a minimum, we should have a 
> separate config to set how aggressively a nodemanager will connect to the RM, 
> separate from what clients will do.

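A hedged sketch of what a separate, NM-specific retry budget could look like; the two property names below are hypothetical placeholders, not existing YarnConfiguration keys, and the defaults are arbitrary:

{code}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Sketch only: build a retry policy for the NM->RM connection from NM-specific
// settings, independent of whatever clients use.
final class NMConnectRetryPolicy {
  static RetryPolicy forResourceManager(Configuration conf) {
    long maxWaitMs = conf.getLong(
        "yarn.nodemanager.resourcemanager.connect.max-wait.ms",        // hypothetical key
        15 * 60 * 1000L);
    long retryIntervalMs = conf.getLong(
        "yarn.nodemanager.resourcemanager.connect.retry-interval.ms",  // hypothetical key
        30 * 1000L);
    // Keep retrying with a fixed sleep until the NM-specific wait budget is spent.
    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitMs, retryIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}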


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4376) Memory Timeline Store return incorrect results on fromId paging

2015-11-20 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated YARN-4376:
--
Attachment: YARN-4376.2.patch

Posting a patch that addresses the issues above. It's not easy to reproduce 
this error since the iteration order is implementation dependent. I went with 
adding a TreeSet as a secondary view into the entities (a rough sketch of that 
layout is below). As for the extra memory required, the documentation states 
40 bytes * capacity, so for 1,000,000 entities the additional memory is about 
8MB. Insert time for 1,000,000 entries increases from 3.5 seconds to 11 
seconds. I am looking into whether this is significant once amortized, and will 
test with a huge entity set to validate the performance before and after.

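To make that layout concrete, here is a minimal, hypothetical sketch (not the actual MemoryTimelineStore code): a HashMap for id lookups plus a TreeSet kept as a secondary, consistently ordered view, so fromId paging always walks entities in the same order. The Entity class, its fields, and the comparator are illustrative assumptions.

{code}
import java.util.*;

public class PagedEntityStore {
  // Minimal stand-in for a timeline entity: an id plus its start time.
  static final class Entity {
    final String id;
    final long startTime;
    Entity(String id, long startTime) { this.id = id; this.startTime = startTime; }
  }

  // Newest first, id as tie-breaker, so the ordering is total and stable.
  private static final Comparator<Entity> ORDER =
      Comparator.comparingLong((Entity e) -> e.startTime).reversed()
                .thenComparing(e -> e.id);

  private final Map<String, Entity> byId = new HashMap<>();     // fast lookups by id
  private final TreeSet<Entity> ordered = new TreeSet<>(ORDER); // secondary ordered view

  public void put(Entity e) {
    Entity old = byId.put(e.id, e);
    if (old != null) {
      ordered.remove(old);   // keep the secondary view consistent on update
    }
    ordered.add(e);
  }

  // Return up to 'limit' entities starting at fromId (inclusive). Because the
  // TreeSet iterates in the same order on every call, pages never skip or repeat.
  public List<Entity> getEntities(String fromId, int limit) {
    Entity from = (fromId == null) ? null : byId.get(fromId);
    Iterator<Entity> it = (from == null)
        ? ordered.iterator()
        : ordered.tailSet(from, true).iterator();
    List<Entity> page = new ArrayList<>();
    while (it.hasNext() && page.size() < limit) {
      page.add(it.next());
    }
    return page;
  }
}
{code}
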
> Memory Timeline Store return incorrect results on fromId paging
> ---
>
> Key: YARN-4376
> URL: https://issues.apache.org/jira/browse/YARN-4376
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: YARN-4376.2.patch
>
>
> As pointed out correctly by [~jlowe]. 
> https://issues.apache.org/jira/browse/TEZ-2628?focusedCommentId=14715831=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14715831
> The MemoryTimelineStore cannot page correctly when using fromId. This is due to 
> switching between data structures that apparently have different natural 
> orderings. In addition, the approach of creating a new data structure from 
> scratch every time is costly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4374) RM capacity scheduler UI rounds user limit factor

2015-11-20 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-4374:
-
Component/s: capacityscheduler
Summary: RM capacity scheduler UI rounds user limit factor  (was: RM 
scheduler UI rounds user limit factor)

> RM capacity scheduler UI rounds user limit factor
> -
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.3
>
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019097#comment-15019097
 ] 

Sangjin Lee commented on YARN-4209:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-4209.000.patch, YARN-4209.001.patch, 
> YARN-4209.002.patch, YARN-4209.branch-2.7.patch
>
>
> The RMStateStore FENCED state doesn’t work because {{updateFencedState}} is 
> called by {{stateMachine.doTransition}}. The reason is that the
> {{stateMachine.doTransition}} call made from {{updateFencedState}} is nested 
> inside the {{stateMachine.doTransition}} call made from a public 
> API (removeRMDelegationToken...) or from {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED, the external state transition changes the state 
> back to ACTIVE. The end result is that the RMStateStore is still in the ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for the FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.

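A toy model of the nesting described above (not the real org.apache.hadoop.yarn.state.StateMachineFactory): the outer doTransition installs its own post-state after the inner call has already moved the store to FENCED, so FENCED is immediately overwritten.

{code}
import java.util.function.Function;

public class NestedTransitionDemo {
  enum State { ACTIVE, FENCED }

  static class ToyStateMachine {
    private State current = State.ACTIVE;

    // Runs the transition (which may recursively call doTransition) and then
    // installs the post-state that the *outer* transition computed.
    State doTransition(Function<ToyStateMachine, State> transition) {
      State post = transition.apply(this);
      current = post;                              // clobbers whatever a nested call set
      return current;
    }
  }

  public static void main(String[] args) {
    ToyStateMachine sm = new ToyStateMachine();
    sm.doTransition(outer -> {
      // Inner transition, e.g. triggered via notifyStoreOperationFailed:
      outer.doTransition(inner -> State.FENCED);   // state becomes FENCED here...
      return State.ACTIVE;                         // ...but the outer post-state wins
    });
    System.out.println(sm.current);                // prints ACTIVE, not FENCED
  }
}
{code}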


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019099#comment-15019099
 ] 

Sangjin Lee commented on YARN-4180:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> AMLauncher does not retry on failures when talking to NM 
> -
>
> Key: YARN-4180
> URL: https://issues.apache.org/jira/browse/YARN-4180
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-4180-branch-2.7.2.txt, YARN-4180.001.patch, 
> YARN-4180.002.patch, YARN-4180.002.patch, YARN-4180.002.patch
>
>
> We see issues where the RM tries to launch a container while an NM is restarting 
> and we get exceptions like NMNotReadyException. While YARN-3842 added retries 
> for other clients of the NM (mainly AMs), they are not used by the AMLauncher in 
> the RM, so these intermittent errors cause job failures. This can manifest during 
> a rolling restart of NMs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3925) ContainerLogsUtils#getContainerLogFile fails to read container log files from full disks.

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019112#comment-15019112
 ] 

Sangjin Lee commented on YARN-3925:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> ContainerLogsUtils#getContainerLogFile fails to read container log files from 
> full disks.
> -
>
> Key: YARN-3925
> URL: https://issues.apache.org/jira/browse/YARN-3925
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-3925.000.patch, YARN-3925.001.patch
>
>
> ContainerLogsUtils#getContainerLogFile fails to read files from full disks.
> {{getContainerLogFile}} depends on 
> {{LocalDirsHandlerService#getLogPathToRead}} to get the log file, but 
> {{LocalDirsHandlerService#getLogPathToRead}} calls 
> {{logDirsAllocator.getLocalPathToRead}} and {{logDirsAllocator}} uses 
> configuration {{YarnConfiguration.NM_LOG_DIRS}}, which will be updated to not 
> include full disks in {{LocalDirsHandlerService#checkDirs}}:
> {code}
> Configuration conf = getConfig();
> List<String> localDirs = getLocalDirs();
> conf.setStrings(YarnConfiguration.NM_LOCAL_DIRS,
> localDirs.toArray(new String[localDirs.size()]));
> List<String> logDirs = getLogDirs();
> conf.setStrings(YarnConfiguration.NM_LOG_DIRS,
>   logDirs.toArray(new String[logDirs.size()]));
> {code}
> ContainerLogsUtils#getContainerLogFile is used by NMWebServices#getLogs and 
> ContainerLogsPage.ContainersLogsBlock#render to read the log.

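A hedged sketch of the read-path idea the description points at; the helper and its argument names are illustrative, not the actual ContainerLogsUtils/LocalDirsHandlerService API. The idea is that reads may safely consult every configured log dir, full disks included, since fullness only matters for writes:

{code}
import java.io.File;
import java.util.List;

// Sketch only: locate a container log file across all configured log dirs,
// including dirs that were dropped from NM_LOG_DIRS because the disk is full.
final class ContainerLogLocator {
  static File findLogFile(List<String> allConfiguredLogDirs, String relativeLogPath) {
    for (String dir : allConfiguredLogDirs) {
      File candidate = new File(dir, relativeLogPath);
      if (candidate.exists()) {
        return candidate;   // readable even though the dir is excluded for writes
      }
    }
    return null;            // caller decides how to report "not found"
  }
}
{code}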


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM capacity scheduler UI rounds user limit factor

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019114#comment-15019114
 ] 

Hudson commented on YARN-4374:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #701 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/701/])
YARN-4374. RM capacity scheduler UI rounds user limit factor. (jlowe: rev 
060cdcbe5db56383e74e282ac61eedff8248c11b)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/CHANGES.txt


> RM capacity scheduler UI rounds user limit factor
> -
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.3
>
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds user limit factor, such as from  0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019127#comment-15019127
 ] 

Sangjin Lee commented on YARN-3535:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> Scheduler must re-request container resources when RMContainer transitions 
> from ALLOCATED to KILLED
> ---
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, 
> YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of an NM, the AM's attempt to start a container on that 
> NM failed, and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019125#comment-15019125
 ] 

Sangjin Lee commented on YARN-3697:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> FairScheduler: ContinuousSchedulingThread can fail to shutdown
> --
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-3697.000.patch, YARN-3697.001.patch
>
>
> FairScheduler: the ContinuousSchedulingThread sometimes can't be shut down 
> after stop. 
> The reason is that the InterruptedException is caught and swallowed in 
> continuousSchedulingAttempt:
> {code}
>   try {
> if (node != null && Resources.fitsIn(minimumAllocation,
> node.getAvailableResource())) {
>   attemptScheduling(node);
> }
>   } catch (Throwable ex) {
> LOG.error("Error while attempting scheduling for node " + node +
> ": " + ex.toString(), ex);
>   }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
> fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
> Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 
> available= used=: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
> 

[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019190#comment-15019190
 ] 

Hudson commented on YARN-3769:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #690 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/690/])
YARN-3769. Consider user limit when calculating total pending resource (wangda: 
rev 2346fa3141bf28f25a90b6a426a1d3a3982e464f)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicyForNodePartitions.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java


> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769-branch-2.7.005.patch, YARN-3769-branch-2.7.006.patch, 
> YARN-3769-branch-2.7.007.patch, YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, 
> YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019224#comment-15019224
 ] 

Hudson commented on YARN-3849:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8835 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8835/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Too much of preemption activity causing continuos killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant 
> Resource policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is some demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
> that all containers other than the AM get killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free 
> space. But there is updated demand from the app in QueueA, which lost 
> its containers earlier, and preemption now kicks in against QueueB.
> Steps 3 and 4 keep happening in a loop, so none of the 
> apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019225#comment-15019225
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8835 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8835/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0, 2.6.3, 2.7.3
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, where it fails to connect to 
> the timeline server and the retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)

2015-11-20 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019226#comment-15019226
 ] 

Sunil G commented on YARN-3784:
---

As mentioned earlier, the preemption timeout for a container can vary (PCPP or 
decommission).

Assume container1 is marked for preemption at time frame X, container2 at X+1, 
and container3 at X+2. If the AM heartbeat interval is 3 seconds, the next 
heartbeat arrives at X+3. So each of the 3 containers waiting to be preempted 
in the scheduler queue has already accumulated some elapsed time: container1 
has 3 seconds, container2 has 2 seconds, and container3 has 1 second. If this 
elapsed time is subtracted from each container's proposed timeout, a more 
realistic and correct timeout reaches the AM. 
This helps the AM take a better decision on whether to do a graceful checkpoint 
preemption or an immediate local copy of some output file, etc. I'll raise 
these AM-side improvements as needed after this ticket. This is also a good 
metric in the AM, and the user can get this information as well. If any 
container has already crossed its timeout while waiting for the AM heartbeat, 
as you mentioned, I'll mark it as -1 or 0 to indicate to the AM that the RM has 
most likely already acted on it, so the AM need not do any graceful preemption 
for it. This is also an AM-side improvement that is already planned; I'll 
handle it in the MR-side ticket. A rough sketch of the remaining-timeout 
calculation follows.

Does this make sense? 
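
For illustration, a minimal sketch of the subtraction described above; the 
class name, the map name, and the 15-second default are hypothetical and not 
taken from the actual patch:

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: remember when each container was marked for preemption
// and report the remaining timeout at AM-heartbeat time.
public class EffectivePreemptionTimeout {
  private final Map<String, Long> markTimes = new HashMap<>(); // containerId -> mark time (ms)
  private final long preemptionTimeoutMs = 15000L;             // assume RM hard-kills after 15s

  public void markForPreemption(String containerId, long nowMs) {
    markTimes.putIfAbsent(containerId, nowMs);
  }

  // Configured timeout minus the time already spent waiting for the heartbeat.
  // A result of 0 means the container already crossed its timeout, so the AM
  // should not attempt any graceful preemption for it.
  public long effectiveTimeout(String containerId, long nowMs) {
    Long markedAt = markTimes.get(containerId);
    if (markedAt == null) {
      return preemptionTimeoutMs;
    }
    return Math.max(0L, preemptionTimeoutMs - (nowMs - markedAt));
  }
}
{code}

With a 3-second heartbeat interval, container1, container2, and container3 from 
the example above would be reported with roughly 12, 13, and 14 seconds 
remaining, respectively.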

> Indicate preemption timeout along with the list of containers to AM 
> (preemption message)
> ---
>
> Key: YARN-3784
> URL: https://issues.apache.org/jira/browse/YARN-3784
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch, 
> 0003-YARN-3784.patch, 0004-YARN-3784.patch
>
>
> Currently during preemption, AM is notified with a list of containers which 
> are marked for preemption. Introducing a timeout duration also along with 
> this container list so that AM can know how much time it will get to do a 
> graceful shutdown to its containers (assuming one of preemption policy is 
> loaded in AM).
> This will help in decommissioning NM scenarios, where NM will be 
> decommissioned after a timeout (also killing containers on it). This timeout 
> will be helpful to indicate AM that those containers can be killed by RM 
> forcefully after the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019235#comment-15019235
 ] 

Hudson commented on YARN-3849:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1428 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1428/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Too much of preemption activity causing continuous killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant Resource 
> policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is new demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
> observed that all containers other than the AM are killed in QueueA.
> 4. The app in QueueB then tries to take over the cluster with the freed space, 
> but there is updated demand from the app in QueueA, which lost its containers 
> earlier, and preemption is now kicked off in QueueB.
> Steps 3 and 4 keep repeating in a loop, so none of the apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019236#comment-15019236
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1428 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1428/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0, 2.6.3, 2.7.3
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the connection retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019262#comment-15019262
 ] 

Hudson commented on YARN-3769:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2631 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2631/])
YARN-3769. Consider user limit when calculating total pending resource (wangda: 
rev 2346fa3141bf28f25a90b6a426a1d3a3982e464f)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicyForNodePartitions.java


> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.7.3
>
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769-branch-2.7.005.patch, YARN-3769-branch-2.7.006.patch, 
> YARN-3769-branch-2.7.007.patch, YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, 
> YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019264#comment-15019264
 ] 

Hudson commented on YARN-3849:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #691 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/691/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Too much of preemption activity causing continuous killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant Resource 
> policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is new demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
> observed that all containers other than the AM are killed in QueueA.
> 4. The app in QueueB then tries to take over the cluster with the freed space, 
> but there is updated demand from the app in QueueA, which lost its containers 
> earlier, and preemption is now kicked off in QueueB.
> Steps 3 and 4 keep repeating in a loop, so none of the apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019265#comment-15019265
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #691 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/691/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0, 2.6.3, 2.7.3
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the connection retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019269#comment-15019269
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #703 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/703/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0, 2.6.3, 2.7.3
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the connection retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019268#comment-15019268
 ] 

Hudson commented on YARN-3849:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #703 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/703/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Too much of preemption activity causing continuous killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant Resource 
> policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is new demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
> observed that all containers other than the AM are killed in QueueA.
> 4. The app in QueueB then tries to take over the cluster with the freed space, 
> but there is updated demand from the app in QueueA, which lost its containers 
> earlier, and preemption is now kicked off in QueueB.
> Steps 3 and 4 keep repeating in a loop, so none of the apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020182#comment-15020182
 ] 

Hudson commented on YARN-3849:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2632 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2632/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Too much of preemption activity causing continuous killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant Resource 
> policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is new demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
> observed that all containers other than the AM are killed in QueueA.
> 4. The app in QueueB then tries to take over the cluster with the freed space, 
> but there is updated demand from the app in QueueA, which lost its containers 
> earlier, and preemption is now kicked off in QueueB.
> Steps 3 and 4 keep repeating in a loop, so none of the apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020183#comment-15020183
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2632 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2632/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0, 2.6.3, 2.7.3
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the connection retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-11-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019292#comment-15019292
 ] 

Hadoop QA commented on YARN-3840:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 39s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 33s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
0s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 50s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
21s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 17s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in 
trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 56s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 20s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
36s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 43s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 43s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 29s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 29s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
58s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 50s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
21s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 53s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 21s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 1m 50s {color} 
| {color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 53s 
{color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the 
patch passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s 
{color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 25s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 22s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 8s {color} | 
{color:red} hadoop-mapreduce-client-app in the patch failed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 6s {color} | 
{color:red} hadoop-yarn-common in the patch failed with JDK 

[jira] [Commented] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)

2015-11-20 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020191#comment-15020191
 ] 

Naganarasimha G R commented on YARN-3784:
-

Thanks [~sunilg] for the offline update on this issue. I was missing the part 
that in FiCaSchedulerApp you subtract the elapsed time and can therefore send 
the effective timeout to the AM during the heartbeat. However, I see the 
following issues with the current approach compared to computing an absolute 
timestamp (current time + preemption timeout) when the PreemptionContainer is 
created and sharing that with the AM:
* There can be a small delta between the reported timeout value and the time 
at which the container actually times out.
* It adds some extra loops while building the heartbeat response (not a big 
performance impact, but nevertheless avoidable).
* It requires the additional {{containersWithFirstNotifyTime}} map in 
FiCaSchedulerApp.

That said, the current approach may be simpler for users, since a remaining 
duration is easier to understand than a timestamp. Thoughts from others?
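
For comparison, a minimal sketch of the timestamp alternative described above 
(hypothetical names, not from the patch); the deadline is computed once when 
the container is marked, so no per-heartbeat recalculation is needed:

{code}
// Hypothetical sketch: compute an absolute deadline once, at the time the
// PreemptionContainer is created, and send that epoch timestamp to the AM.
public class PreemptionDeadline {
  public static long deadlineFor(long preemptionTimeoutMs) {
    return System.currentTimeMillis() + preemptionTimeoutMs;
  }

  public static void main(String[] args) {
    // e.g. a 15-second preemption timeout recorded at creation time
    System.out.println(deadlineFor(15000L));
  }
}
{code}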

Also, a few issues with the patch:
* Possible leak in {{containersWithFirstNotifyTime}}, as remove is never 
called?
* Can there be a case where {{containersWithFirstNotifyTime}} is not filled in 
for a preempted container? If not, I feel the additional {{if 
(containersWithFirstNotifyTime.containsKey(c))}} check in the for loop is not 
required.


> Indicate preemption timeout along with the list of containers to AM 
> (preemption message)
> ---
>
> Key: YARN-3784
> URL: https://issues.apache.org/jira/browse/YARN-3784
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch, 
> 0003-YARN-3784.patch, 0004-YARN-3784.patch
>
>
> Currently during preemption, AM is notified with a list of containers which 
> are marked for preemption. Introducing a timeout duration also along with 
> this container list so that AM can know how much time it will get to do a 
> graceful shutdown to its containers (assuming one of preemption policy is 
> loaded in AM).
> This will help in decommissioning NM scenarios, where NM will be 
> decommissioned after a timeout (also killing containers on it). This timeout 
> will be helpful to indicate AM that those containers can be killed by RM 
> forcefully after the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM capacity scheduler UI rounds user limit factor

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020238#comment-15020238
 ] 

Hudson commented on YARN-4374:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #623 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/623/])
YARN-4374. RM capacity scheduler UI rounds user limit factor. (jlowe: rev 
060cdcbe5db56383e74e282ac61eedff8248c11b)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/CHANGES.txt


> RM capacity scheduler UI rounds user limit factor
> -
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.3
>
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds the user limit factor, e.g. from 0.25 up to 0.3
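
For context, a generic illustration of the rounding; this is not the actual 
CapacitySchedulerPage code, just standard Java float formatting:

{code}
public class UserLimitFactorFormatting {
  public static void main(String[] args) {
    float userLimitFactor = 0.25f;
    // One-decimal formatting rounds half up, so 0.25 is displayed as 0.3.
    System.out.println(String.format("%.1f", userLimitFactor)); // 0.3
    // Default float-to-string conversion keeps the configured value.
    System.out.println(String.valueOf(userLimitFactor));        // 0.25
  }
}
{code}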



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)

2015-11-20 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020210#comment-15020210
 ] 

Sunil G commented on YARN-3784:
---

Thanks [~Naganarasimha Garla] for the comments.

As discussed offline, this patch proposes sending a timeout along with each 
to-be-preempted container to the AM. As mentioned above, this timeout is 
derived by calculating:
{noformat}
(Timeout associated with each container) - (Total time this container has 
already waited in the queue for the AM heartbeat)
{noformat}

For example, if the proposed timeout before a container gets a hard kill from 
the RM is 15 seconds and the AM heartbeat interval is 3 seconds, then in the 
worst case a 12-second timeout will be proposed to the AM.

A few points based on the comments above:
- "12 seconds to timeout" is a meaningful value for the AM; it directly tells 
the AM when the specified container will be preempted.
- A delta of a few milliseconds may occur due to the time taken for the 
heartbeat response to travel from the RM to the AM. As I see it, this is not a 
significant amount of time, but I would like to get an opinion from the 
community on this part.

If this delta is not significant, then giving a {{timeout}} to the AM is more 
meaningful and easily understandable. Otherwise we could send a value like 
{{current epoch time + timeout}}, which the AM would need to translate back by 
looking at the current time on its end and taking the difference (a long value 
would be passed in the heartbeat). The information passed in this form is less 
direct than "going to time out in X milliseconds".
As mentioned by Naga, the latter option avoids some extra logic on the RM end 
(calculating the timeout difference), but at the cost of added complexity and 
less clarity in the information passed. [~leftnoteasy] and [~djp], could you 
also please share your views on this?

[~Naganarasimha Garla], I will address the other comments based on the 
feedback on these first points. Thank you!
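
To illustrate the two options side by side, a minimal sketch (hypothetical 
names, not from the patch) of what the AM would do with each value it could 
receive in the heartbeat:

{code}
public class PreemptionTimeoutOptions {
  // Option 1: the RM sends a remaining duration; the AM uses it directly.
  public static long remainingFromDuration(long timeoutMs) {
    return timeoutMs;                                      // e.g. 12000 means "12s left"
  }

  // Option 2: the RM sends an absolute deadline (epoch millis); the AM must
  // translate it back into a remaining duration using its own clock.
  public static long remainingFromDeadline(long deadlineEpochMs) {
    return Math.max(0L, deadlineEpochMs - System.currentTimeMillis());
  }

  public static void main(String[] args) {
    long deadline = System.currentTimeMillis() + 12000L;
    System.out.println(remainingFromDuration(12000L));     // 12000
    System.out.println(remainingFromDeadline(deadline));   // ~12000
  }
}
{code}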

> Indicate preemption timeout along with the list of containers to AM 
> (preemption message)
> ---
>
> Key: YARN-3784
> URL: https://issues.apache.org/jira/browse/YARN-3784
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch, 
> 0003-YARN-3784.patch, 0004-YARN-3784.patch
>
>
> Currently during preemption, AM is notified with a list of containers which 
> are marked for preemption. Introducing a timeout duration also along with 
> this container list so that AM can know how much time it will get to do a 
> graceful shutdown to its containers (assuming one of preemption policy is 
> loaded in AM).
> This will help in decommissioning NM scenarios, where NM will be 
> decommissioned after a timeout (also killing containers on it). This timeout 
> will be helpful to indicate AM that those containers can be killed by RM 
> forcefully after the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4378) FairScheduler could not support DRF policy in leaf queue when parent queue uses fair policy

2015-11-20 Thread lachisis (JIRA)
lachisis created YARN-4378:
--

 Summary: FairScheduler could not support DRF policy in leaf queue 
when parent queue uses fair policy
 Key: YARN-4378
 URL: https://issues.apache.org/jira/browse/YARN-4378
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.5.2
Reporter: lachisis


If I configure fair-scheduler.xml as follows, an application submitted to the 
queue root.queueA.queueA1 stays in the ACCEPTED state and the resource requests 
of its tasks are never satisfied, because the queue root.queueA.queueA1 ends up 
with zero CPU.

<?xml version="1.0"?>
<allocations>
  <queue name="queueA">
    <schedulingPolicy>fair</schedulingPolicy>
    <queue name="queueA1">
      <schedulingPolicy>drf</schedulingPolicy>
    </queue>
  </queue>
</allocations>


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020308#comment-15020308
 ] 

Hudson commented on YARN-4326:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2562 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2562/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0, 2.6.3, 2.7.3
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the connection retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020309#comment-15020309
 ] 

Hudson commented on YARN-3769:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2562 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2562/])
YARN-3769. Consider user limit when calculating total pending resource (wangda: 
rev 2346fa3141bf28f25a90b6a426a1d3a3982e464f)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicyForNodePartitions.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java


> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.7.3
>
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769-branch-2.7.005.patch, YARN-3769-branch-2.7.006.patch, 
> YARN-3769-branch-2.7.007.patch, YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch, YARN-3769.003.patch, YARN-3769.004.patch, 
> YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020307#comment-15020307
 ] 

Hudson commented on YARN-3849:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2562 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2562/])
move YARN-4326/YARN-3849 from 2.8.0 to 2.7.3 (wangda: rev 
a30eccb38c83b20af5d0705f9834165b74468314)
* hadoop-yarn-project/CHANGES.txt


> Too much of preemption activity causing continuous killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0, 2.7.3
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849-branch2-7.patch, 0004-YARN-3849.patch
>
>
> Two queues are used, each given a capacity of 0.5, and the Dominant Resource 
> policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is new demand, and preemption is 
> invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
> observed that all containers other than the AM are killed in QueueA.
> 4. The app in QueueB then tries to take over the cluster with the freed space, 
> but there is updated demand from the app in QueueA, which lost its containers 
> earlier, and preemption is now kicked off in QueueB.
> Steps 3 and 4 keep repeating in a loop, so none of the apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4374) RM capacity scheduler UI rounds user limit factor

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020231#comment-15020231
 ] 

Hudson commented on YARN-4374:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2561 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2561/])
YARN-4374. RM capacity scheduler UI rounds user limit factor. (jlowe: rev 
060cdcbe5db56383e74e282ac61eedff8248c11b)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/CHANGES.txt


> RM capacity scheduler UI rounds user limit factor
> -
>
> Key: YARN-4374
> URL: https://issues.apache.org/jira/browse/YARN-4374
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.3
>
> Attachments: Screenshot1.png, YARN-4374.2.patch, YARN-4374.patch
>
>
> RM scheduler UI rounds the user limit factor, e.g. from 0.25 up to 0.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)