[jira] [Commented] (YARN-1493) Schedulers don't recognize apps separately from app-attempts

2013-12-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856070#comment-13856070
 ] 

Hadoop QA commented on YARN-1493:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12620290/YARN-1493.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 19 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 9 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2719//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/2719//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2719//console

This message is automatically generated.

> Schedulers don't recognize apps separately from app-attempts
> 
>
> Key: YARN-1493
> URL: https://issues.apache.org/jira/browse/YARN-1493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-1493.1.patch, YARN-1493.2.patch, YARN-1493.3.patch, 
> YARN-1493.4.patch, YARN-1493.5.patch, YARN-1493.6.patch
>
>
> Today, the scheduler is tied to the attempt only.
> We need to separate the app-level handling logic in the scheduler. We can add new 
> app-level events to the scheduler and separate the app-level logic out. This 
> is good for work-preserving AM restart and RM restart, and is also needed for 
> differentiating app-level metrics from attempt-level metrics.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1493) Schedulers don't recognize apps separately from app-attempts

2013-12-23 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1493:
--

Attachment: YARN-1493.6.patch

> Schedulers don't recognize apps separately from app-attempts
> 
>
> Key: YARN-1493
> URL: https://issues.apache.org/jira/browse/YARN-1493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-1493.1.patch, YARN-1493.2.patch, YARN-1493.3.patch, 
> YARN-1493.4.patch, YARN-1493.5.patch, YARN-1493.6.patch
>
>
> Today, the scheduler is tied to the attempt only.
> We need to separate the app-level handling logic in the scheduler. We can add new 
> app-level events to the scheduler and separate the app-level logic out. This 
> is good for work-preserving AM restart and RM restart, and is also needed for 
> differentiating app-level metrics from attempt-level metrics.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1529) Add Localization overhead metrics to NM

2013-12-23 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-1529:


Issue Type: Improvement  (was: Bug)

> Add Localization overhead metrics to NM
> ---
>
> Key: YARN-1529
> URL: https://issues.apache.org/jira/browse/YARN-1529
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1529.v01.patch
>
>
> Users are often unaware of localization cost that their jobs incur. To 
> measure effectiveness of localization caches it is necessary to expose the 
> overhead in the form of metrics.
> We propose addition of the following metrics to NodeManagerMetrics.
> When a container is about to launch, its set of LocalResources has to be 
> fetched from a central location, typically on HDFS, which results in a number 
> of download requests for the files missing in caches.
> LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache 
> misses.
> LocalizedFilesCached: total localization requests that were served from local 
> caches. Cache hits.
> LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
> LocalizedBytesCached: total bytes satisfied from local caches.
> Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that 
> were served out of cache: ratio = 100 * caches / (caches + misses)
> LocalizationDownloadNanos: total elapsed time in nanoseconds for a container 
> to go from ResourceRequestTransition to LocalizedTransition



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1529) Add Localization overhead metrics to NM

2013-12-23 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856042#comment-13856042
 ] 

Gera Shegalov commented on YARN-1529:
-

Hi [~hitesh] thanks for chiming in!

> Does the cache ratio account for the local resource visibility i.e. public 
> cache misses are more important than cache misses for application visibility?

The current patch does not differentiate between cache visibilities. I am open 
to suggestions on whether a finer breakdown of cache misses would be helpful. The 
goal of this and a follow-up MAPREDUCE JIRA is to raise awareness, at the aggregate 
level, that shipping computation to data is not free.

>  I assume the "LocalizationDownloadNanos" is an average per container? How 
> does an average help when there are numerous application types with diff no. 
> of resources and each container facing a different cache hit ratio? Is this 
> something which needs to be augmented into the container status and not a 
> general NM metric? 

LocalizationDownloadNanos is the total container launch delay due to 
localization, summed across containers. An average can be obtained as 
{code}LocalizationDownloadNanos / ContainersLaunched{code}.
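
To make the arithmetic concrete, here is a minimal Java sketch using the sample 
values from the JMX output attached elsewhere in this thread (1803959000 ns 
total, 2 containers launched); the variable names simply mirror the metric names 
and are not taken from the patch:

{code}
// Snapshot of the two relevant NodeManagerMetrics values (sample numbers
// taken from the JMX output attached elsewhere in this thread).
long localizationDownloadNanos = 1803959000L; // total across all containers
long containersLaunched = 2;

// Average localization delay per launched container, as described above.
double avgMillisPerContainer =
    (double) localizationDownloadNanos / containersLaunched / 1e6;
System.out.printf("avg localization delay: %.0f ms per container%n",
    avgMillisPerContainer); // ~902 ms for this sample
{code}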

> For that matter, what is the better option - tracking localization metrics on 
> the NM level or tracking them on a per container/per app level?

I am preparing a patch that exposes this information as MR counters for MRv2. Is 
there a better way to achieve this in an application-agnostic manner such that 
it is visible in the web UI?

> Shouldn't there be a metric that tracks the actual size of the local resource 
> cache on disk?
This is a very good idea in my opinion.

> What about different resource types - file/archive/pattern?
Currently all resource types are lumped together. We can have a discussion on 
whether it's helpful to expose a finer breakdown at the NM level or the 
app level.




> Add Localization overhead metrics to NM
> ---
>
> Key: YARN-1529
> URL: https://issues.apache.org/jira/browse/YARN-1529
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1529.v01.patch
>
>
> Users are often unaware of localization cost that their jobs incur. To 
> measure effectiveness of localization caches it is necessary to expose the 
> overhead in the form of metrics.
> We propose addition of the following metrics to NodeManagerMetrics.
> When a container is about to launch, its set of LocalResources has to be 
> fetched from a central location, typically on HDFS, which results in a number 
> of download requests for the files missing in caches.
> LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache 
> misses.
> LocalizedFilesCached: total localization requests that were served from local 
> caches. Cache hits.
> LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
> LocalizedBytesCached: total bytes satisfied from local caches.
> Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that 
> were served out of cache: ratio = 100 * caches / (caches + misses)
> LocalizationDownloadNanos: total elapsed time in nanoseconds for a container 
> to go from ResourceRequestTransition to LocalizedTransition



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-321) Generic application history service

2013-12-23 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-321:
-

Assignee: (was: Vinod Kumar Vavilapalli)

I'm not the only one working on it, so marking it Unassigned.

> Generic application history service
> ---
>
> Key: YARN-321
> URL: https://issues.apache.org/jira/browse/YARN-321
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Luke Lu
> Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
> Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java
>
>
> The mapreduce job history server currently needs to be deployed as a trusted 
> server in sync with the mapreduce runtime. Every new application would need a 
> similar application history server. Having to deploy O(T*V) (where T is the 
> number of application types and V is the number of application versions) trusted 
> servers is clearly not scalable.
> Job history storage handling itself is pretty generic: move the logs and 
> history data into a particular directory for later serving. Job history data 
> is already stored as json (or binary avro). I propose that we create only one 
> trusted application history server, which can also have a generic UI (display json 
> as a tree of strings). A specific application/version can deploy 
> untrusted webapps (a la AMs) to query the application history server and 
> interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-321) Generic application history service

2013-12-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856021#comment-13856021
 ] 

Vinod Kumar Vavilapalli commented on YARN-321:
--

Tx Zhijie for answering most of Sandy's questions, you are spot on. I'll update 
the design doc to clarify things where it isn't clear.


bq. What is the jira for app specific history data?
I just filed YARN-1530, will post more information soon.


bq. Could you describe the security requirements a bit further. It's not clear 
to everyone how everything works currently. To be clear, what exactly needs to 
be done to make apps write and read history data.
The data covered in this JIRA is generic and only the RM gets to write it. The 
consumers of this data are *both* the cluster admins, for historical analyses, 
and individual apps that choose not to use the features that come out of 
YARN-1530. As such, we cannot let apps write history data.


bq. How is the shared bus different from writing to a file. I would think one 
would cover the other.
Yes, writing to a file is one example of a shared bus. I'll fix it if the doc 
is confusing w.r.t. this.

> Generic application history service
> ---
>
> Key: YARN-321
> URL: https://issues.apache.org/jira/browse/YARN-321
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Luke Lu
>Assignee: Vinod Kumar Vavilapalli
> Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
> Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java
>
>
> The mapreduce job history server currently needs to be deployed as a trusted 
> server in sync with the mapreduce runtime. Every new application would need a 
> similar application history server. Having to deploy O(T*V) (where T is the 
> number of application types and V is the number of application versions) trusted 
> servers is clearly not scalable.
> Job history storage handling itself is pretty generic: move the logs and 
> history data into a particular directory for later serving. Job history data 
> is already stored as json (or binary avro). I propose that we create only one 
> trusted application history server, which can also have a generic UI (display json 
> as a tree of strings). A specific application/version can deploy 
> untrusted webapps (a la AMs) to query the application history server and 
> interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1529) Add Localization overhead metrics to NM

2013-12-23 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856023#comment-13856023
 ] 

Hitesh Shah commented on YARN-1529:
---

[~jira.shegalov] Could you add more details on how users should interpret these 
new metrics? Does the cache ratio account for the local resource visibility 
i.e. public cache misses are more important than cache misses for application 
visibility? I assume the "LocalizationDownloadNanos" is an average per 
container? How does an average help when there are numerous application types 
with different numbers of resources and each container facing a different cache hit 
ratio? Is this something which needs to be augmented into the container status 
and not a general NM metric? For that matter, what is the better option - 
tracking localization metrics on the NM level or tracking them on a per 
container/per app level?

Further thoughts:
 - Shouldn't there be a metric that tracks the actual size of the local 
resource cache on disk?
 - How are public/private/application caches being considered?
 - What about different resource types - file/archive/pattern? 





> Add Localization overhead metrics to NM
> ---
>
> Key: YARN-1529
> URL: https://issues.apache.org/jira/browse/YARN-1529
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1529.v01.patch
>
>
> Users are often unaware of localization cost that their jobs incur. To 
> measure effectiveness of localization caches it is necessary to expose the 
> overhead in the form of metrics.
> We propose addition of the following metrics to NodeManagerMetrics.
> When a container is about to launch, its set of LocalResources has to be 
> fetched from a central location, typically on HDFS, which results in a number 
> of download requests for the files missing in caches.
> LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache 
> misses.
> LocalizedFilesCached: total localization requests that were served from local 
> caches. Cache hits.
> LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
> LocalizedBytesCached: total bytes satisfied from local caches.
> Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that 
> were served out of cache: ratio = 100 * caches / (caches + misses)
> LocalizationDownloadNanos: total elapsed time in nanoseconds for a container 
> to go from ResourceRequestTransition to LocalizedTransition



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2013-12-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856016#comment-13856016
 ] 

Vinod Kumar Vavilapalli commented on YARN-1530:
---

Working on a design doc that explains requirements and the solution space. I 
hope to push it out soon..

> [Umbrella] Store, manage and serve per-framework application-timeline data
> --
>
> Key: YARN-1530
> URL: https://issues.apache.org/jira/browse/YARN-1530
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework 
> data all by itself, as YARN doesn't have a common solution. This JIRA attempts 
> to solve the storage, management and serving of per-framework data from 
> various applications, both running and finished. The aim is to change YARN to 
> collect and store data in a generic manner with plugin points for frameworks 
> to do their own thing w.r.t interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2013-12-23 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-1530:
-

 Summary: [Umbrella] Store, manage and serve per-framework 
application-timeline data
 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli


This is a sibling JIRA for YARN-321.

Today, each application/framework has to store and serve per-framework data 
all by itself, as YARN doesn't have a common solution. This JIRA attempts to 
solve the storage, management and serving of per-framework data from various 
applications, both running and finished. The aim is to change YARN to collect 
and store data in a generic manner with plugin points for frameworks to do 
their own thing w.r.t interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1529) Add Localization overhead metrics to NM

2013-12-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856014#comment-13856014
 ] 

Hadoop QA commented on YARN-1529:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12620280/YARN-1529.v01.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2718//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2718//console

This message is automatically generated.

> Add Localization overhead metrics to NM
> ---
>
> Key: YARN-1529
> URL: https://issues.apache.org/jira/browse/YARN-1529
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1529.v01.patch
>
>
> Users are often unaware of localization cost that their jobs incur. To 
> measure effectiveness of localization caches it is necessary to expose the 
> overhead in the form of metrics.
> We propose addition of the following metrics to NodeManagerMetrics.
> When a container is about to launch, its set of LocalResources has to be 
> fetched from a central location, typically on HDFS, which results in a number 
> of download requests for the files missing in caches.
> LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache 
> misses.
> LocalizedFilesCached: total localization requests that were served from local 
> caches. Cache hits.
> LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
> LocalizedBytesCached: total bytes satisfied from local caches.
> Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that 
> were served out of cache: ratio = 100 * caches / (caches + misses)
> LocalizationDownloadNanos: total elapsed time in nanoseconds for a container 
> to go from ResourceRequestTransition to LocalizedTransition



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1521) Mark appropriate protocol methods with the idempotent annotation

2013-12-23 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1521:
--

Summary: Mark appropriate protocol methods with the idempotent annotation  
(was: Mark appropriate methods of ApplicationClientProtocol, 
ResourceManagerAdminist, ApplicationMasterProtocol and ResourceTracker with the 
idempotent annotation)

> Mark appropriate protocol methods with the idempotent annotation
> 
>
> Key: YARN-1521
> URL: https://issues.apache.org/jira/browse/YARN-1521
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>
> After YARN-1028, we add the automatically failover into RMProxy. This JIRA is 
> to identify whether we need to add idempotent annotation and which methods 
> can be marked as idempotent.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1529) Add Localization overhead metrics to NM

2013-12-23 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated YARN-1529:


Attachment: YARN-1529.v01.patch

{noformat}
$ curl -s 
http://somehost:8042/jmx?qry="Hadoop:service=NodeManager,name=NodeManagerMetrics";
 | python -mjson.tool 
{
"beans": [
{
"AllocatedContainers": 0,
"AllocatedGB": 0,
"AvailableGB": 8,
"ContainersCompleted": 1,
"ContainersFailed": 0,
"ContainersIniting": 0,
"ContainersKilled": 1,
"ContainersLaunched": 2,
"ContainersRunning": 0,
"LocalizationDownloadNanos": 1803959000,
"LocalizedBytesCached": 1529454,
"LocalizedBytesCachedRatio": 49,
"LocalizedBytesMissed": 1529546,
"LocalizedFilesCached": 2,
"LocalizedFilesCachedRatio": 33,
"LocalizedFilesMissed": 4,
"modelerType": "NodeManagerMetrics",
"name": "Hadoop:service=NodeManager,name=NodeManagerMetrics",
"tag.Context": "yarn",
"tag.Hostname": "somehost"
}
]
}
{noformat}
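
As a quick sanity check (not part of the patch), the two ratio values above line 
up with the integer formula from the description, ratio = 100 * caches / (caches + misses):

{code}
long bytesCached = 1529454, bytesMissed = 1529546;
long filesCached = 2, filesMissed = 4;

// Integer percentage of requests/bytes served from the local caches.
long bytesRatio = 100 * bytesCached / (bytesCached + bytesMissed); // 49
long filesRatio = 100 * filesCached / (filesCached + filesMissed); // 33

System.out.println("LocalizedBytesCachedRatio = " + bytesRatio);
System.out.println("LocalizedFilesCachedRatio = " + filesRatio);
{code}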

> Add Localization overhead metrics to NM
> ---
>
> Key: YARN-1529
> URL: https://issues.apache.org/jira/browse/YARN-1529
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: YARN-1529.v01.patch
>
>
> Users are often unaware of localization cost that their jobs incur. To 
> measure effectiveness of localization caches it is necessary to expose the 
> overhead in the form of metrics.
> We propose addition of the following metrics to NodeManagerMetrics.
> When a container is about to launch, its set of LocalResources has to be 
> fetched from a central location, typically on HDFS, which results in a number 
> of download requests for the files missing in caches.
> LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache 
> misses.
> LocalizedFilesCached: total localization requests that were served from local 
> caches. Cache hits.
> LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
> LocalizedBytesCached: total bytes satisfied from local caches.
> Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that 
> were served out of cache: ratio = 100 * caches / (caches + misses)
> LocalizationDownloadNanos: total elapsed time in nanoseconds for a container 
> to go from ResourceRequestTransition to LocalizedTransition



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (YARN-1529) Add Localization overhead metrics to NM

2013-12-23 Thread Gera Shegalov (JIRA)
Gera Shegalov created YARN-1529:
---

 Summary: Add Localization overhead metrics to NM
 Key: YARN-1529
 URL: https://issues.apache.org/jira/browse/YARN-1529
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov


Users are often unaware of localization cost that their jobs incur. To measure 
effectiveness of localization caches it is necessary to expose the overhead in 
the form of metrics.

We propose addition of the following metrics to NodeManagerMetrics.

When a container is about to launch, its set of LocalResources has to be 
fetched from a central location, typically on HDFS, which results in a number of 
download requests for the files missing in caches.

LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache misses.

LocalizedFilesCached: total localization requests that were served from local 
caches. Cache hits.

LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.

LocalizedBytesCached: total bytes satisfied from local caches.

Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that 
were served out of cache: ratio = 100 * caches / (caches + misses)

LocalizationDownloadNanos: total elapsed time in nanoseconds for a container to 
go from ResourceRequestTransition to LocalizedTransition
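
To illustrate the intended semantics, below is a small, self-contained Java 
sketch of the proposed counters and the ratio computation. It is only an 
illustration of the bookkeeping described above: the actual patch presumably 
wires these into NodeManagerMetrics via Hadoop's metrics2 library rather than 
plain AtomicLongs, and the method names here are made up for the example.

{code}
import java.util.concurrent.atomic.AtomicLong;

// Standalone sketch of the proposed localization counters (illustrative only).
public class LocalizationMetricsSketch {
  private final AtomicLong filesMissed = new AtomicLong();   // cache misses
  private final AtomicLong filesCached = new AtomicLong();   // cache hits
  private final AtomicLong bytesMissed = new AtomicLong();
  private final AtomicLong bytesCached = new AtomicLong();
  private final AtomicLong downloadNanos = new AtomicLong(); // total, not average

  // Called when a requested resource had to be downloaded from DFS.
  public void cacheMiss(long bytes, long elapsedNanos) {
    filesMissed.incrementAndGet();
    bytesMissed.addAndGet(bytes);
    downloadNanos.addAndGet(elapsedNanos);
  }

  // Called when a requested resource was served from a local cache.
  public void cacheHit(long bytes) {
    filesCached.incrementAndGet();
    bytesCached.addAndGet(bytes);
  }

  // Localized(Files|Bytes)CachedRatio: 100 * caches / (caches + misses).
  public long filesCachedRatio() {
    long hits = filesCached.get(), misses = filesMissed.get();
    return hits + misses == 0 ? 0 : 100 * hits / (hits + misses);
  }

  public long bytesCachedRatio() {
    long hits = bytesCached.get(), misses = bytesMissed.get();
    return hits + misses == 0 ? 0 : 100 * hits / (hits + misses);
  }
}
{code}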





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1527) yarn rmadmin command prints wrong usage info:

2013-12-23 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1527:
--

Labels: newbie  (was: )

> yarn rmadmin command prints wrong usage info:
> -
>
> Key: YARN-1527
> URL: https://issues.apache.org/jira/browse/YARN-1527
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>  Labels: newbie
>
> The usage should say "yarn rmadmin" instead of "java RMAdmin", and the 
> -refreshQueues option should be on the second line.
> {code} Usage: java RMAdmin   -refreshQueues 
>-refreshNodes 
>-refreshSuperUserGroupsConfiguration 
>-refreshUserToGroupsMappings 
>-refreshAdminAcls 
>-refreshServiceAcl 
>-getGroups [username]
>-help [cmd]
>-transitionToActive 
>-transitionToStandby 
>-failover [--forcefence] [--forceactive]  
>-getServiceState 
>-checkHealth 
> {code}
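
For clarity, the expected output per the description above would presumably 
start like this (illustrative only; the option list itself is unchanged):

{code}
Usage: yarn rmadmin
   -refreshQueues 
   -refreshNodes 
   ...
{code}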



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1488) Allow containers to delegate resources to another container

2013-12-23 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855958#comment-13855958
 ] 

Arun C Murthy commented on YARN-1488:
-

I was thinking along the lines of a protocol such as:

Resource Request -> Resource granted -> AM then does (Launch Container or 
Delegate Container) via NM which re-works cgroups if required.

We have explicitly avoided any container management directly between the RM & NM in 
YARN, i.e. launching or resizing containers. Of course, the RM can still revoke 
containers during preemption. Having this done via the AM is better for scale and 
simplicity (all resource acquisition is limited to the AM-NM protocol).


> Allow containers to delegate resources to another container
> ---
>
> Key: YARN-1488
> URL: https://issues.apache.org/jira/browse/YARN-1488
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Arun C Murthy
>
> We should allow containers to delegate resources to another container. This 
> would allow external frameworks to share not just YARN's resource-management 
> capabilities but also its workload-management capabilities.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM

2013-12-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855948#comment-13855948
 ] 

Hadoop QA commented on YARN-1029:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12620257/yarn-1029-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 4 new 
Findbugs (version 1.3.9) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests:

  org.apache.hadoop.yarn.server.TestContainerManagerSecurity
  org.apache.hadoop.yarn.server.TestRMNMSecretKeys

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2717//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/2717//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/2717//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2717//console

This message is automatically generated.

> Allow embedding leader election into the RM
> ---
>
> Key: YARN-1029
> URL: https://issues.apache.org/jira/browse/YARN-1029
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: embedded-zkfc-approach.patch, yarn-1029-0.patch, 
> yarn-1029-0.patch, yarn-1029-1.patch, yarn-1029-2.patch, 
> yarn-1029-approach.patch
>
>
> It should be possible to embed common ActiveStandyElector into the RM such 
> that ZooKeeper based leader election and notification is in-built. In 
> conjunction with a ZK state store, this configuration will be a simple 
> deployment option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1029) Allow embedding leader election into the RM

2013-12-23 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1029:
---

Attachment: yarn-1029-2.patch

Patch that incorporates suggestions from [~bikassaha].

> Allow embedding leader election into the RM
> ---
>
> Key: YARN-1029
> URL: https://issues.apache.org/jira/browse/YARN-1029
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: embedded-zkfc-approach.patch, yarn-1029-0.patch, 
> yarn-1029-0.patch, yarn-1029-1.patch, yarn-1029-2.patch, 
> yarn-1029-approach.patch
>
>
> It should be possible to embed common ActiveStandyElector into the RM such 
> that ZooKeeper based leader election and notification is in-built. In 
> conjunction with a ZK state store, this configuration will be a simple 
> deployment option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (YARN-1528) ZKRMStateStore and EmbeddedElector should allow setting ZK auth information

2013-12-23 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-1528:
--

 Summary: ZKRMStateStore and EmbeddedElector should allow setting 
ZK auth information
 Key: YARN-1528
 URL: https://issues.apache.org/jira/browse/YARN-1528
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Priority: Minor


The ZK store and the embedded elector allow setting ZK ACLs but not auth information.
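
For context, "auth information" here refers to the client-side ZooKeeper 
addAuthInfo() call (e.g. digest auth) that lets a client authenticate before 
digest-based ACLs are enforced. A minimal sketch of the plain ZooKeeper API 
involved, with made-up connection details and credentials:

{code}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class ZkAuthSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical quorum address; not a real deployment.
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 10000, event -> { });

    // This is the piece that is currently not configurable: registering auth
    // info (digest scheme shown) so that digest-based ACLs on the RM state
    // store znodes can actually be satisfied by the RM's ZK client.
    zk.addAuthInfo("digest", "rmuser:rmpass".getBytes(StandardCharsets.UTF_8));

    zk.close();
  }
}
{code}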



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM

2013-12-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855800#comment-13855800
 ] 

Karthik Kambatla commented on YARN-1029:


bq. There is a separate jira open to add a cluster-id
Here, we use the cluster-id to make sure the RM to which the bread-crumb 
corresponds is in the same cluster. In HDFS, they directly check for the 
other NN's id, which restricts us to a single standby. For RM HA, there is no 
reason to limit ourselves to two RMs, even though that is probably going to be 
the default deployment. The actual token-related logic can be handled in the other 
JIRA. 

bq. this is probably not enough. we need to notify the rm.
Just to be sure, are you suggesting we add a new event and a handler in the RM 
for that event? 

I have addressed the other comments, and am looking at the test failure from the 
previous patch. I will incorporate any other comments and post a patch at the 
earliest.

> Allow embedding leader election into the RM
> ---
>
> Key: YARN-1029
> URL: https://issues.apache.org/jira/browse/YARN-1029
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Karthik Kambatla
> Attachments: embedded-zkfc-approach.patch, yarn-1029-0.patch, 
> yarn-1029-0.patch, yarn-1029-1.patch, yarn-1029-approach.patch
>
>
> It should be possible to embed common ActiveStandyElector into the RM such 
> that ZooKeeper based leader election and notification is in-built. In 
> conjunction with a ZK state store, this configuration will be a simple 
> deployment option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-321) Generic application history service

2013-12-23 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855707#comment-13855707
 ] 

Robert Joseph Evans commented on YARN-321:
--

The way it currently works is based on group permissions on a directory 
(this is from memory from a while ago, so I could be off on a few things).  In 
HDFS, when you create a file, the group of the file is the group of the directory 
the file is a part of, similar to the setgid bit on a directory in Linux.  When 
an MR job completes it will copy its history log file, along with a few other 
files, to a drop-box-like location called "intermediate done" and atomically 
rename it from a temp name to the final name.  The directory is world-writable, 
but only readable by a special group that the history server is a part of and 
general users are not.  The history server then wakes up periodically and will 
scan that directory for new files; when it sees new files it will move them to 
a final location that is owned by the headless history server user.  If a query 
comes in for a job that the history server is not aware of, it will also scan 
the intermediate done directory before failing.

Reading history data is done through RPC to the history server, or through the 
web interface, including RESTful APIs.  There is no supported way for an app to 
read history data directly through the file system.  I hope this helps.
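
A rough sketch of the drop-box handoff described above, using the plain Hadoop 
FileSystem API; the paths and file names are hypothetical, not the actual 
history-server configuration values:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HistoryDropBoxSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical intermediate-done location (world-writable, readable only
    // by the history server's group, as described above).
    Path tmp  = new Path("/history/intermediate-done/someuser/job_0001.jhist.tmp");
    Path done = new Path("/history/intermediate-done/someuser/job_0001.jhist");

    // The job first writes its history file under a temporary name, then
    // makes it visible to the history server with an atomic rename.
    fs.rename(tmp, done);
  }
}
{code}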

> Generic application history service
> ---
>
> Key: YARN-321
> URL: https://issues.apache.org/jira/browse/YARN-321
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Luke Lu
>Assignee: Vinod Kumar Vavilapalli
> Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
> Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java
>
>
> The mapreduce job history server currently needs to be deployed as a trusted 
> server in sync with the mapreduce runtime. Every new application would need a 
> similar application history server. Having to deploy O(T*V) (where T is the 
> number of application types and V is the number of application versions) trusted 
> servers is clearly not scalable.
> Job history storage handling itself is pretty generic: move the logs and 
> history data into a particular directory for later serving. Job history data 
> is already stored as json (or binary avro). I propose that we create only one 
> trusted application history server, which can also have a generic UI (display json 
> as a tree of strings). A specific application/version can deploy 
> untrusted webapps (a la AMs) to query the application history server and 
> interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1506) Replace set resource change on RMNode/SchedulerNode directly with event notification.

2013-12-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855513#comment-13855513
 ] 

Junping Du commented on YARN-1506:
--

Hi [~bikassaha], thanks for the comments. You are right that if we don't go through 
a resource update to RMNode first, the overcommitment timeout may not be useful to 
RMNode. However, I would prefer this timeout to be a per-operation behavior 
rather than a cluster configuration, so that the user has the flexibility to tell 
YARN how urgently they want to balloon out resources. Thoughts?

> Replace set resource change on RMNode/SchedulerNode directly with event 
> notification.
> -
>
> Key: YARN-1506
> URL: https://issues.apache.org/jira/browse/YARN-1506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, scheduler
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Blocker
>
> According to Vinod's comments on YARN-312 
> (https://issues.apache.org/jira/browse/YARN-312?focusedCommentId=13846087&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13846087),
>  we should replace RMNode.setResourceOption() with some resource change event.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)