[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-01 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566998#comment-14566998
 ] 

zhihai xu commented on YARN-3017:
-

Hi [~mufeed.usman], thanks for working on this issue.
The name for appAttemptIdAndEpochFormat will become confusing after the fix.
Could you rename appAttemptIdAndEpochFormat to epochFormat?
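For readers following along, the difference behind this rename comes down to two number-format widths. The snippet below is a standalone illustration only, not the actual ContainerId/ApplicationAttemptId code; the 6- and 2-digit minimum widths are assumptions matching the IDs quoted in the description.

{code}
import java.text.NumberFormat;

// Standalone illustration only; not the real ContainerId/ApplicationAttemptId code.
public class IdFormatSketch {
  public static void main(String[] args) {
    // Width assumed for the attempt number in an ApplicationAttemptId string (6 digits).
    NumberFormat attemptIdFormat = NumberFormat.getInstance();
    attemptIdFormat.setGroupingUsed(false);
    attemptIdFormat.setMinimumIntegerDigits(6);

    // Width the field discussed above (to be renamed epochFormat) applies to the
    // same attempt number inside a container ID string (assumed 2 digits).
    NumberFormat epochFormat = NumberFormat.getInstance();
    epochFormat.setGroupingUsed(false);
    epochFormat.setMinimumIntegerDigits(2);

    int attemptNumber = 2;
    System.out.println("appattempt_..._" + attemptIdFormat.format(attemptNumber)); // ..._000002
    System.out.println("container_..._" + epochFormat.format(attemptNumber) + "_..."); // ..._02_...
  }
}
{code}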

> ContainerID in ResourceManager Log Has Slightly Different Format From 
> AppAttemptID
> --
>
> Key: YARN-3017
> URL: https://issues.apache.org/jira/browse/YARN-3017
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: MUFEED USMAN
>Priority: Minor
>  Labels: PatchAvailable
> Attachments: YARN-3017.patch, YARN-3017_1.patch
>
>
> Not sure if this should be filed as a bug or not.
> In the ResourceManager log in the events surrounding the creation of a new
> application attempt,
> ...
> ...
> 2014-11-14 17:45:37,258 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
> masterappattempt_1412150883650_0001_02
> ...
> ...
> The application attempt has the ID format "_1412150883650_0001_000002".
> Whereas the associated ContainerID goes by "_1412150883650_0001_02_".
> ...
> ...
> 2014-11-14 17:45:37,260 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up
> container Container: [ContainerId: container_1412150883650_0001_02_01,
> NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource:  vCores:1,
> disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service:
> 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02
> ...
> ...
> Curious to know if this is kept like that for a reason. If not, then when using
> filtering tools to, say, grep the events surrounding a specific attempt by its
> numeric ID part, information may slip out during troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2194:
---
Summary: Cgroups cease to work in RHEL7  (was: Add Cgroup support for 
RedHat 7)

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch
>
>
> In previous versions of RedHat, we could build custom cgroup hierarchies with 
> the cgconfig command from the libcgroup package. As of RedHat 7, the libcgroup 
> package is deprecated and its use is not recommended, since it can easily 
> create conflicts with the default cgroup hierarchy. systemd is provided and 
> recommended for cgroup management. We need to add support for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2194:
---
  Component/s: nodemanager
  Description: 
In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
controller name leads to container launch failure. 

RHEL7 deprecates libcgroup and recommends the use of systemd. However, systemd 
has certain shortcomings as identified in this JIRA (see comments). 

This JIRA only fixes the failure, and doesn't try to use systemd.

  was:   In previous versions of RedHat, we can build custom cgroup hierarchies 
with use of the cgconfig command from the libcgroup package. From RedHat 7, 
package libcgroup is deprecated and it is not recommended to use it since it 
can easily create conflicts with the default cgroup hierarchy. The systemd is 
provided and recommended for cgroup management. We need to add support for this.

 Priority: Critical  (was: Major)
 Target Version/s: 2.8.0
Affects Version/s: 2.7.0
   Issue Type: Bug  (was: Improvement)

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567025#comment-14567025
 ] 

Karthik Kambatla commented on YARN-2194:


Verified the patch works. Can we add more comments to clarify why the patch 
replaces cpu,cpuacct with cpu? Maybe something along the lines of: "In RHEL7, 
the CPU controller is named 'cpu,cpuacct'. The comma in the controller name 
leads to container launch failure. Symlinks 'cpu' and 'cpuacct' point to 
'cpu,cpuacct'. Using 'cpu' solves the issue."
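To make the suggested comment concrete, here is a minimal sketch of the workaround under review. It is not the NodeManager's actual cgroups handler code, and resolveCpuControllerDir is a hypothetical helper name.

{code}
import java.io.File;

// Minimal sketch of the workaround; not the actual NodeManager cgroups code.
public class CgroupPathSketch {
  // Hypothetical helper: pick a comma-free path for the CPU controller.
  static String resolveCpuControllerDir(String cgroupMountRoot) {
    File combined = new File(cgroupMountRoot, "cpu,cpuacct"); // RHEL7 combined controller
    File cpuSymlink = new File(cgroupMountRoot, "cpu");       // symlink to the combined dir
    if (combined.exists() && cpuSymlink.exists()) {
      // Prefer the symlink: the comma in "cpu,cpuacct" is what breaks container launch.
      return cpuSymlink.getAbsolutePath();
    }
    return (combined.exists() ? combined : cpuSymlink).getAbsolutePath();
  }

  public static void main(String[] args) {
    System.out.println(resolveCpuControllerDir("/sys/fs/cgroup"));
  }
}
{code}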

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567024#comment-14567024
 ] 

Hadoop QA commented on YARN-2716:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  9s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 4 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 42s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 46s | The applied patch generated  3 
new checkstyle issues (total was 42, now 8). |
| {color:green}+1{color} | whitespace |   0m  5s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   1m 30s | The patch appears to introduce 2 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 23s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 44s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-resourcemanager |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736498/yarn-2716-1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 5cc3fce |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8147/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8147/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8147/console |


This message was automatically generated.

> Refactor ZKRMStateStore retry code with Apache Curator
> --
>
> Key: YARN-2716
> URL: https://issues.apache.org/jira/browse/YARN-2716
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Karthik Kambatla
> Attachments: yarn-2716-1.patch, yarn-2716-prelim.patch, 
> yarn-2716-prelim.patch, yarn-2716-super-prelim.patch
>
>
> Per suggestion by [~kasha] in YARN-2131,  it's nice to use curator to 
> simplify the retry logic in ZKRMStateStore.
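As a rough illustration of what the description asks for (not the patch itself), Curator centralizes connection handling and retries behind a RetryPolicy, so store operations no longer need hand-rolled retry loops. The connect string, path, and retry settings below are placeholders, and the sketch assumes a ZooKeeper server reachable at localhost:2181.

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Rough sketch only; not the ZKRMStateStore patch. Placeholder connect string and path.
public class CuratorRetrySketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    String path = "/rmstore/demo";
    if (client.checkExists().forPath(path) == null) {
      client.create().creatingParentsIfNeeded().forPath(path, "state".getBytes("UTF-8"));
    }
    // Transient ZooKeeper errors are retried by Curator according to the policy above.
    byte[] data = client.getData().forPath(path);
    System.out.println(new String(data, "UTF-8"));

    client.close();
  }
}
{code}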



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567029#comment-14567029
 ] 

Karthik Kambatla commented on YARN-2194:


+1 otherwise. 

[~vinodkv], [~tucu00] - is this somewhat hacky approach reasonable? 

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-01 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567038#comment-14567038
 ] 

Akira AJISAKA commented on YARN-3069:
-

Thanks [~rchiang] for updating the patch.
# Would you address the previous comment for 
{{yarn.node-labels.fs-store.retry-policy-spec}}?
# For YARN registry, the parameters are written in core-site.xml. Can we remove 
them from the patch?

My review is almost done.
@Watchers: I would appreciate it if you could review this patch. It includes 
descriptions for a lot of parameters, so it should be reviewed by many developers.

> Document missing properties in yarn-default.xml
> ---
>
> Key: YARN-3069
> URL: https://issues.apache.org/jira/browse/YARN-3069
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: BB2015-05-TBR, supportability
> Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
> YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
> YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
> YARN-3069.009.patch, YARN-3069.010.patch
>
>
> The following properties are currently not defined in yarn-default.xml.  
> These properties should either be
>   A) documented in yarn-default.xml OR
>   B)  listed as an exception (with comments, e.g. for internal use) in the 
> TestYarnConfigurationFields unit test
> Any comments for any of the properties below are welcome.
>   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
>   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
>   security.applicationhistory.protocol.acl
>   yarn.app.container.log.backups
>   yarn.app.container.log.dir
>   yarn.app.container.log.filesize
>   yarn.client.app-submission.poll-interval
>   yarn.client.application-client-protocol.poll-timeout-ms
>   yarn.is.minicluster
>   yarn.log.server.url
>   yarn.minicluster.control-resource-monitoring
>   yarn.minicluster.fixed.ports
>   yarn.minicluster.use-rpc
>   yarn.node-labels.fs-store.retry-policy-spec
>   yarn.node-labels.fs-store.root-dir
>   yarn.node-labels.manager-class
>   yarn.nodemanager.container-executor.os.sched.priority.adjustment
>   yarn.nodemanager.container-monitor.process-tree.class
>   yarn.nodemanager.disk-health-checker.enable
>   yarn.nodemanager.docker-container-executor.image-name
>   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
>   yarn.nodemanager.linux-container-executor.group
>   yarn.nodemanager.log.deletion-threads-count
>   yarn.nodemanager.user-home-dir
>   yarn.nodemanager.webapp.https.address
>   yarn.nodemanager.webapp.spnego-keytab-file
>   yarn.nodemanager.webapp.spnego-principal
>   yarn.nodemanager.windows-secure-container-executor.group
>   yarn.resourcemanager.configuration.file-system-based-store
>   yarn.resourcemanager.delegation-token-renewer.thread-count
>   yarn.resourcemanager.delegation.key.update-interval
>   yarn.resourcemanager.delegation.token.max-lifetime
>   yarn.resourcemanager.delegation.token.renew-interval
>   yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size
>   yarn.resourcemanager.metrics.runtime.buckets
>   yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs
>   yarn.resourcemanager.reservation-system.class
>   yarn.resourcemanager.reservation-system.enable
>   yarn.resourcemanager.reservation-system.plan.follower
>   yarn.resourcemanager.reservation-system.planfollower.time-step
>   yarn.resourcemanager.rm.container-allocation.expiry-interval-ms
>   yarn.resourcemanager.webapp.spnego-keytab-file
>   yarn.resourcemanager.webapp.spnego-principal
>   yarn.scheduler.include-port-in-node-name
>   yarn.timeline-service.delegation.key.update-interval
>   yarn.timeline-service.delegation.token.max-lifetime
>   yarn.timeline-service.delegation.token.renew-interval
>   yarn.timeline-service.generic-application-history.enabled
>   
> yarn.timeline-service.generic-application-history.fs-history-store.compression-type
>   yarn.timeline-service.generic-application-history.fs-history-store.uri
>   yarn.timeline-service.generic-application-history.store-class
>   yarn.timeline-service.http-cross-origin.enabled
>   yarn.tracking.url.generator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-06-01 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567041#comment-14567041
 ] 

Akira AJISAKA commented on YARN-3432:
-

Thanks [~brahmareddy] for taking this issue. The patch seems to revert 
YARN-656. I don't think that's fine, because it would break the FairScheduler. 
This issue should fix the CapacityScheduler only.

> Cluster metrics have wrong Total Memory when there is reserved memory on CS
> ---
>
> Key: YARN-3432
> URL: https://issues.apache.org/jira/browse/YARN-3432
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3432.patch
>
>
> I noticed that when reservations happen when using the Capacity Scheduler, 
> the UI and web services report the wrong total memory.
> For example.  I have a 300GB of total memory in my cluster.  I allocate 50 
> and I reserve 10.  The cluster metrics for total memory get reported as 290GB.
> This was broken by https://issues.apache.org/jira/browse/YARN-656 so perhaps 
> there is a difference between fair scheduler and capacity scheduler.
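For what it's worth, the numbers in the description work out as below. This is illustrative arithmetic only; the variable names are made up and are not the ResourceManager's actual metric fields.

{code}
// Illustrative arithmetic only; variable names are not the RM's metric fields.
public class TotalMemorySketch {
  public static void main(String[] args) {
    long allocatedGB = 50;
    long reservedGB = 10;
    long availableGB = 300 - allocatedGB - reservedGB;              // 240
    long reportedTotalGB = availableGB + allocatedGB;               // 290: reserved memory dropped
    long expectedTotalGB = availableGB + allocatedGB + reservedGB;  // 300: reserved counted back in
    System.out.println(reportedTotalGB + " reported vs " + expectedTotalGB + " expected");
  }
}
{code}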



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567054#comment-14567054
 ] 

Sunil G commented on YARN-3585:
---

Hi [~bibinchundatt] and [~rohithsharma],
This recent exception trace is different from the focus of this Jira, and the 
root cause has been given by Rohith. I feel you can separate this out into 
another ticket.

For DB close vs. container launch, we can add a check for whether the DB is 
closed while we move the container from the ACQUIRED state.
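A very rough sketch of that kind of guard follows; the class and method names here are placeholders, not the NM state store's actual API.

{code}
// Rough sketch of the suggested guard; names are placeholders, not the NM state store API.
public class StoreGuardSketch {
  private volatile boolean storeClosed = false;

  public synchronized void close() {
    // Mark the store closed before the underlying DB handle goes away.
    storeClosed = true;
  }

  public synchronized void storeContainerTransition(String containerId, String newState) {
    if (storeClosed) {
      // Shutdown is already in progress; skip the write instead of hitting a closed DB.
      return;
    }
    System.out.println("persist " + containerId + " -> " + newState);
  }

  public static void main(String[] args) {
    StoreGuardSketch store = new StoreGuardSketch();
    store.storeContainerTransition("container_01", "ACQUIRED");
    store.close();
    store.storeContainerTransition("container_01", "RUNNING"); // silently skipped after close
  }
}
{code}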

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567150#comment-14567150
 ] 

Hudson commented on YARN-3725:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #215 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/215/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* hadoop-yarn-project/CHANGES.txt


> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
> to renew the DT instead of the configured address. This breaks the procedure of 
> submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in the Java code. The REST API response is an encoded token String, so 
> it is quite inconvenient to deserialize it, set the service address, and 
> serialize it again.
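To illustrate the round trip the description calls inconvenient, here is a sketch of a hypothetical client-side workaround, not the patch itself: the caller would have to decode the encoded token, fill in the service address, and re-encode it by hand. The timeline server address used here is a placeholder.

{code}
import java.net.InetSocketAddress;

import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

// Sketch of a hypothetical client-side fix-up; not code from the patch.
public class TimelineTokenServiceSketch {
  public static void main(String[] args) throws Exception {
    String encoded = args[0];  // encoded token string taken from the REST response

    Token<TokenIdentifier> token = new Token<TokenIdentifier>();
    token.decodeFromUrlString(encoded);

    // Placeholder timeline server address, for illustration only.
    SecurityUtil.setTokenService(token, new InetSocketAddress("timeline.example.com", 8190));

    System.out.println(token.encodeToUrlString());
  }
}
{code}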



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins

2015-06-01 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567153#comment-14567153
 ] 

Brahma Reddy Battula commented on YARN-3528:


Thanks [~ste...@apache.org] and [~rkanter] for your inputs... Going to write one 
common utility with:
1) one method for port "0", where we can set the chosen port back into the same config
2) another method for the places where the above is not possible, using a similar 
approach to the one [~rkanter] mentioned (see the sketch below).
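A rough sketch of such a utility, assuming it ends up as two static helpers along these lines (both names are hypothetical):

{code}
import java.io.IOException;
import java.net.ServerSocket;

import org.apache.hadoop.conf.Configuration;

// Rough sketch of the proposed test utility; helper names are placeholders.
public class PortUtilSketch {
  /** Case 1: services that accept port 0; write the bound port back into the config. */
  public static int setFreePort(Configuration conf, String addressKey, String host)
      throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      int port = socket.getLocalPort();
      conf.set(addressKey, host + ":" + port);
      return port;
    }
  }

  /** Case 2: places where port 0 is not possible; probe for a currently free port.
   *  There is a small race between closing the probe socket and the service binding,
   *  which is usually acceptable for tests. */
  public static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    int port = setFreePort(conf, "yarn.resourcemanager.address", "0.0.0.0");
    System.out.println(conf.get("yarn.resourcemanager.address") + " (port " + port + ")");
  }
}
{code}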

> Tests with 12345 as hard-coded port break jenkins
> -
>
> Key: YARN-3528
> URL: https://issues.apache.org/jira/browse/YARN-3528
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
> Environment: ASF Jenkins
>Reporter: Steve Loughran
>Assignee: Brahma Reddy Battula
>Priority: Blocker
>  Labels: test
>
> A lot of the YARN tests have hard-coded the port 12345 for their services to 
> come up on.
> This makes it impossible to have scheduled or precommit tests to run 
> consistently on the ASF jenkins hosts. Instead the tests fail regularly and 
> appear to get ignored completely.
> A quick grep of "12345" turns up many places in the test suite where this 
> practice has developed.
> * All {{BaseContainerManagerTest}} subclasses
> * {{TestNodeManagerShutdown}}
> * {{TestContainerManager}}
> + others
> This needs to be addressed through portscanning and dynamic port allocation. 
> Please can someone do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567151#comment-14567151
 ] 

Hudson commented on YARN-2900:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #215 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/215/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* hadoop-yarn-project/CHANGES.txt


> Application (Attempt and Container) Not Found in AHS results in Internal 
> Server Error (500)
> ---
>
> Key: YARN-2900
> URL: https://issues.apache.org/jira/browse/YARN-2900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Fix For: 2.7.1
>
> Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
> YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
> YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch
>
>
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
>   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567158#comment-14567158
 ] 

Hudson commented on YARN-2900:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #945 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/945/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java


> Application (Attempt and Container) Not Found in AHS results in Internal 
> Server Error (500)
> ---
>
> Key: YARN-2900
> URL: https://issues.apache.org/jira/browse/YARN-2900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Fix For: 2.7.1
>
> Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
> YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
> YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch
>
>
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
>   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567157#comment-14567157
 ] 

Hudson commented on YARN-3725:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #945 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/945/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* hadoop-yarn-project/CHANGES.txt


> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
> to renew the DT instead of the configured address. This breaks the procedure of 
> submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in the Java code. The REST API response is an encoded token String, so 
> it is quite inconvenient to deserialize it, set the service address, and 
> serialize it again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567184#comment-14567184
 ] 

Rohith commented on YARN-3733:
--

Thanks [~devaraj.k] and [~sunilg] for review

bq. Can we check for lhs/rhs emptiness and compare these before ending up with 
infinite values? 
If we check for emptiness, this would affect specific input values such as 
clusterResource=<0,0>, lhs=<1,1>, and rhs=<2,2>. Then which one is considered 
dominant? The dominant component cannot be retrieved directly from memory or CPU.

And I listed out the possible combinations of inputs that could occur in YARN. 
These are:
||Sl.no||clusterResource||lhs||rhs||Remark||
|1|<0,0>|<0,0>|<0,0>|Valid input; handled|
|2|<0,0>||<0,0>|NaN vs Infinity: patch handles this scenario|
|3|<0,0>|<0,0>||NaN vs Infinity: patch handles this scenario|
|4|<0,0>|||Infinity vs Infinity: can this type occur in YARN?|
|5|<0,0>||<0,positive integer>|Is this valid input? Can this type occur in YARN?|
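The NaN/Infinity cases in the table can be reproduced with plain floats. The sketch below is a standalone illustration of the ratio arithmetic, not the DominantResourceCalculator code itself: dividing by a <0,0> cluster resource yields NaN for a zero numerator and Infinity otherwise, and Float.compare returns 0 for two Infinity values, so they cannot be told apart.

{code}
// Standalone illustration of the NaN / Infinity cases; not DominantResourceCalculator code.
public class DominantShareSketch {
  static float dominantShare(long memory, long vcores, long clusterMemory, long clusterVcores) {
    return Math.max((float) memory / clusterMemory, (float) vcores / clusterVcores);
  }

  public static void main(String[] args) {
    float lhs = dominantShare(1, 1, 0, 0);   // Infinity
    float rhs = dominantShare(0, 0, 0, 0);   // NaN
    System.out.println(lhs + " vs " + rhs + " -> " + Float.compare(lhs, rhs));

    // Row 4 of the table: both shares are Infinity and Float.compare(...) == 0.
    System.out.println(Float.compare(dominantShare(1, 1, 0, 0), dominantShare(2, 2, 0, 0)));
  }
}
{code}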


>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567186#comment-14567186
 ] 

Rohith commented on YARN-3733:
--

bq. 2. The newly added code is duplicated in two places, can you eliminate the 
duplicate code?
The second-time validation is not required in case of NaN; I will remove this in 
the next patch.

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567189#comment-14567189
 ] 

Rohith commented on YARN-3733:
--

bq. Verify infinity by calling isInfinite(float v). Quoting from jdk7 
Since the infinity is derived from lhs and rhs, it cannot be differentiated 
for clusterResource=<0,0>, lhs=<1,1>, and rhs=<2,2>. The method 
{{getResourceAsValue()}} returns infinity for both l and r, which cannot be compared.

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567196#comment-14567196
 ] 

Rohith commented on YARN-3585:
--

Yes, we can raise different Jira. [~bibinchundatt] Can you raise Jira, we can 
validate the issue there?

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567201#comment-14567201
 ] 

Rohith commented on YARN-3585:
--

The findbugs -1 does not show any error report, so I am not sure why the -1 was given.
The test failure is unrelated to this patch.

[~jlowe] Kindly review the patch. 

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.2.patch

Uploaded a new patch to fix the test cases.
Lots of the previous test failures with "The HA Configuration has multiple 
addresses that match local node's address." were because I forgot to set 
YarnConfiguration.RM_HA_ID before starting the NM.

The patch also contains 2 minor fixes: it moves reading the conf value of 
RM_SCHEDULER_ADDRESS from serviceStart to serviceInit in 
ApplicationMasterService, and it moves the duplicated setRpcAddressForRM in the 
tests into HAUtil.
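For context, a minimal sketch of the copy-the-configuration idea from this issue (not the actual MiniYARNCluster change): each RM gets its own Configuration copy so that setting the HA id for rm2 cannot leak into rm1.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Minimal sketch of copying the configuration per RM; not the MiniYARNCluster patch.
public class ConfCopySketch {
  public static void main(String[] args) {
    Configuration base = new YarnConfiguration();
    base.set(YarnConfiguration.RM_HA_IDS, "rm1,rm2");

    Configuration rm1Conf = new Configuration(base);  // copy, not a shared reference
    Configuration rm2Conf = new Configuration(base);
    rm1Conf.set(YarnConfiguration.RM_HA_ID, "rm1");
    rm2Conf.set(YarnConfiguration.RM_HA_ID, "rm2");

    System.out.println(rm1Conf.get(YarnConfiguration.RM_HA_ID));  // still "rm1"
  }
}
{code}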

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1042) add ability to specify affinity/anti-affinity in container requests

2015-06-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567291#comment-14567291
 ] 

Steve Loughran commented on YARN-1042:
--

I like this, though I'd also like PREFERRED to have two Rs in the middle :).

Thinking about how I'd use this in Slider, I'd probably want to keep the 
escalation logic, i.e. deciding when to accept shared-node placement, in my 
own code. That way the AM can choose to wait 1 minute or more for an 
anti-affine placement before giving up and accepting a node already in use. We 
already do that when asking for a container back on the host where an instance 
ran previously. 

> add ability to specify affinity/anti-affinity in container requests
> ---
>
> Key: YARN-1042
> URL: https://issues.apache.org/jira/browse/YARN-1042
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Arun C Murthy
> Attachments: YARN-1042-demo.patch
>
>
> container requests to the AM should be able to request anti-affinity to 
> ensure that things like Region Servers don't come up on the same failure 
> zones. 
> Similarly, you may be able to want to specify affinity to same host or rack 
> without specifying which specific host/rack. Example: bringing up a small 
> giraph cluster in a large YARN cluster would benefit from having the 
> processes in the same rack purely for bandwidth reasons.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3747) TestLocalDirsHandlerService.java: test directory logDir2 not deleted

2015-06-01 Thread David Moore (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Moore updated YARN-3747:
--
Attachment: YARN-3747.patch

Please review this patch - 
Thank you


Copyright 2015 David Moore

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

> TestLocalDirsHandlerService.java: test directory logDir2 not deleted
> 
>
> Key: YARN-3747
> URL: https://issues.apache.org/jira/browse/YARN-3747
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.7.0
>Reporter: David Moore
>Priority: Minor
> Fix For: 2.7.0
>
> Attachments: YARN-3747.patch
>
>
> During a code review of 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java
>  I noted that logDir2 is never deleted while logDir1 is deleted twice. This 
> is not in keeping with the rest of the function and appears to be a bug. 
> I will be submitting a patch shortly.
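For illustration, the kind of cleanup the description refers to might look like the sketch below; the directory names are placeholders, not the test's actual fields.

{code}
import java.io.File;

import org.apache.hadoop.fs.FileUtil;

// Illustrative cleanup sketch; directory names are placeholders, not the test's fields.
public class CleanupSketch {
  public static void main(String[] args) {
    File logDir1 = new File("target/logDir1");
    File logDir2 = new File("target/logDir2");
    logDir1.mkdirs();
    logDir2.mkdirs();

    FileUtil.fullyDelete(logDir1);
    FileUtil.fullyDelete(logDir2);  // each temporary directory is deleted exactly once
  }
}
{code}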



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467

2015-06-01 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567323#comment-14567323
 ] 

Anubhav Dhoot commented on YARN-3751:
-

There was no failure for this class in the jenkins run for YARN-3467.
The change LGTM.

> TestAHSWebServices fails after YARN-3467
> 
>
> Key: YARN-3751
> URL: https://issues.apache.org/jira/browse/YARN-3751
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Sunil G
> Attachments: 0001-YARN-3751.patch
>
>
> YARN-3467 changed AppInfo and assumed that used resource is not null. It's 
> not true as this information is not published to timeline server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3170:
---
Attachment: YARN-3170-009.patch

> YARN architecture document needs updating
> -
>
> Key: YARN-3170
> URL: https://issues.apache.org/jira/browse/YARN-3170
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
> YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
> YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch
>
>
> The marketing paragraph at the top, "NextGen MapReduce", etc are all 
> marketing rather than actual descriptions. It also needs some general 
> updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI

2015-06-01 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567327#comment-14567327
 ] 

Anubhav Dhoot commented on YARN-3467:
-

Did you mean it's very verbose?

> Expose allocatedMB, allocatedVCores, and runningContainers metrics on running 
> Applications in RM Web UI
> ---
>
> Key: YARN-3467
> URL: https://issues.apache.org/jira/browse/YARN-3467
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: webapp, yarn
>Affects Versions: 2.5.0
>Reporter: Anthony Rojas
>Assignee: Anubhav Dhoot
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: ApplicationAttemptPage.png, Screen Shot 2015-05-26 at 
> 5.46.54 PM.png, YARN-3467.001.patch, YARN-3467.002.patch, yarn-3467-1.patch
>
>
> The YARN REST API can report on the following properties:
> *allocatedMB*: The sum of memory in MB allocated to the application's running 
> containers
> *allocatedVCores*: The sum of virtual cores allocated to the application's 
> running containers
> *runningContainers*: The number of containers currently running for the 
> application
> Currently, the RM Web UI does not report on these items (at least I couldn't 
> find any entries within the Web UI).
> It would be useful for YARN Application and Resource troubleshooting to have 
> these properties and their corresponding values exposed on the RM WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-06-01 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567340#comment-14567340
 ] 

Chang Li commented on YARN-2556:


[~jeagles] could you please help review the latest patch? Thanks!

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Chang Li
>  Labels: BB2015-05-TBR
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
> YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
> YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
> YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.2.patch, 
> YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, YARN-2556.6.patch, 
> YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, YARN-2556.patch, 
> yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server 
> to give users the tools they need to deploy a timeline server with the 
> correct capacity.
> I propose we create a mapreduce job that can measure timeline server write 
> and read performance. Transactions per second, I/O for both read and write 
> would be a good start.
> This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567341#comment-14567341
 ] 

Hadoop QA commented on YARN-3170:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   2m 54s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 55s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   6m 13s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736559/YARN-3170-009.patch |
| Optional Tests | site |
| git revision | trunk / 63e3fee |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8150/console |


This message was automatically generated.

> YARN architecture document needs updating
> -
>
> Key: YARN-3170
> URL: https://issues.apache.org/jira/browse/YARN-3170
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
> YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
> YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch
>
>
> The marketing paragraph at the top, "NextGen MapReduce", etc are all 
> marketing rather than actual descriptions. It also needs some general 
> updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3748) Cleanup Findbugs volatile warnings

2015-06-01 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567343#comment-14567343
 ] 

Sean Busbey commented on YARN-3748:
---

+1 lgtm, presuming that failed test passes locally.

nit: I think the numContainers in AbstractCSQueue can be made private now maybe?

> Cleanup Findbugs volatile warnings
> --
>
> Key: YARN-3748
> URL: https://issues.apache.org/jira/browse/YARN-3748
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Gabor Liptak
>Priority: Minor
> Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3170:
---
Attachment: (was: YARN-3170-009.patch)

> YARN architecture document needs updating
> -
>
> Key: YARN-3170
> URL: https://issues.apache.org/jira/browse/YARN-3170
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
> YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
> YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170.patch
>
>
> The marketing paragraph at the top, "NextGen MapReduce", etc are all 
> marketing rather than actual descriptions. It also needs some general 
> updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3170:
---
Attachment: YARN-3170-009.patch

> YARN architecture document needs updating
> -
>
> Key: YARN-3170
> URL: https://issues.apache.org/jira/browse/YARN-3170
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
> YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
> YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch
>
>
> The marketing paragraph at the top, "NextGen MapReduce", etc are all 
> marketing rather than actual descriptions. It also needs some general 
> updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567354#comment-14567354
 ] 

Brahma Reddy Battula commented on YARN-3170:


[~aw], thanks a lot for your comments. I have updated the patch based on them.
Kindly review.

> YARN architecture document needs updating
> -
>
> Key: YARN-3170
> URL: https://issues.apache.org/jira/browse/YARN-3170
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
> YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
> YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch
>
>
> The marketing paragraph at the top, "NextGen MapReduce", etc are all 
> marketing rather than actual descriptions. It also needs some general 
> updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3747) TestLocalDirsHandlerService.java: test directory logDir2 not deleted

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567355#comment-14567355
 ] 

Hadoop QA commented on YARN-3747:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   6m 34s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 52s | There were no new javac warning 
messages. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 29s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 12s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   6m 29s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| | |  25m  7s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.nodemanager.TestDockerContainerExecutor |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736555/YARN-3747.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / 63e3fee |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8149/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8149/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8149/console |


This message was automatically generated.

> TestLocalDirsHandlerService.java: test directory logDir2 not deleted
> 
>
> Key: YARN-3747
> URL: https://issues.apache.org/jira/browse/YARN-3747
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.7.0
>Reporter: David Moore
>Priority: Minor
>  Labels: patch, test, yarn
> Attachments: YARN-3747.patch
>
>
> During a code review of 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java
>  I noted that logDir2 is never deleted while logDir1 is deleted twice. This 
> is not in keeping with the rest of the function and appears to be a bug. 
> I will be submitting a patch shortly.
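
A minimal sketch of the cleanup fix described above, assuming the test deletes its
directories with commons-io FileUtils; the helper below is illustrative, not the
attached patch:

{code}
// Hypothetical sketch, not the attached YARN-3747 patch: delete each
// test directory exactly once instead of deleting logDir1 twice and
// logDir2 never.
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class LogDirCleanupSketch {
  public static void cleanupTestDirs(File logDir1, File logDir2)
      throws IOException {
    FileUtils.deleteDirectory(logDir1);
    FileUtils.deleteDirectory(logDir2);
  }
}
{code}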



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567358#comment-14567358
 ] 

Hudson commented on YARN-2900:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2143 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2143/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java


> Application (Attempt and Container) Not Found in AHS results in Internal 
> Server Error (500)
> ---
>
> Key: YARN-2900
> URL: https://issues.apache.org/jira/browse/YARN-2900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Fix For: 2.7.1
>
> Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
> YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
> YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch
>
>
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
>   ... 59 more
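
For readers following the stack trace above, this is the general pattern for turning
such an NPE into a proper 404, assuming a JAX-RS style web service; the names are
illustrative, and this is not the committed YARN-2900 patch:

{code}
// Illustrative sketch only (not the committed YARN-2900 patch): guard the
// null report and surface a 404 instead of letting the NPE bubble up as a
// 500 Internal Server Error.
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.Response;

public class AppLookupSketch {

  /** Hypothetical history lookup used only for this sketch. */
  public interface HistoryLookup {
    Object getApplication(String appId);   // may return null
  }

  public static Object getApp(HistoryLookup history, String appId) {
    Object report = history.getApplication(appId);
    if (report == null) {
      throw new WebApplicationException(
          Response.status(Response.Status.NOT_FOUND)
              .entity("app with id: " + appId + " not found")
              .build());
    }
    return report;
  }
}
{code}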



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567357#comment-14567357
 ] 

Hudson commented on YARN-3725:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2143 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2143/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt


> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
> to renew the DT instead of the configured address. This breaks the procedure of 
> submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in Java code. The REST API response is an encoded token String, so it is 
> quite inconvenient to deserialize it, set the service address, and serialize it 
> again.
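
As an illustration of the inconvenience described above (not the YARN-3725 fix
itself), the client-side workaround would look roughly like this with the standard
Hadoop token APIs, assuming the REST response carries the delegation token in its
URL-encoded string form:

{code}
// Sketch of the client-side steps the description calls inconvenient:
// decode the token string from the REST response, set the service
// address locally, and re-encode it. Not the YARN-3725 fix itself.
import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class TimelineTokenServiceSketch {
  public static String setService(String encodedToken, String host, int port)
      throws IOException {
    Token<TokenIdentifier> token = new Token<TokenIdentifier>();
    token.decodeFromUrlString(encodedToken);           // deserialize
    SecurityUtil.setTokenService(
        token, new InetSocketAddress(host, port));     // fill in the service
    return token.encodeToUrlString();                  // serialize again
  }
}
{code}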



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567366#comment-14567366
 ] 

Hadoop QA commented on YARN-3170:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   3m  1s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 55s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   6m 20s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736570/YARN-3170-009.patch |
| Optional Tests | site |
| git revision | trunk / 63e3fee |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8151/console |


This message was automatically generated.

> YARN architecture document needs updating
> -
>
> Key: YARN-3170
> URL: https://issues.apache.org/jira/browse/YARN-3170
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
> YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
> YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch
>
>
> The marketing paragraph at the top, "NextGen MapReduce", etc are all 
> marketing rather than actual descriptions. It also needs some general 
> updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567369#comment-14567369
 ] 

Hadoop QA commented on YARN-3749:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 39s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 6 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   2m 20s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 35s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 26s | Tests passed in 
hadoop-yarn-api. |
| {color:red}-1{color} | yarn tests |  49m 27s | Tests failed in 
hadoop-yarn-client. |
| {color:green}+1{color} | yarn tests |  50m 14s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests |   1m 52s | Tests passed in 
hadoop-yarn-server-tests. |
| | | 147m 20s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | org.apache.hadoop.yarn.client.api.impl.TestAMRMClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestYarnClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestNMClient |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736544/YARN-3749.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 5cc3fce |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-tests test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8148/console |


This message was automatically generated.

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found the DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> during RM failover, even though I had initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
> is in ClientRMService where the value of yarn.resourcemanager.address.rm2 gets 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>                                            YarnConfiguration.RM_ADDRESS,
>                                            YarnConfiguration.DEFAULT_RM_ADDRESS,
>                                            server.getListenerAddress());
> {code}
> Since we use the same Configuration instance for rm1 and rm2, and we init both 
> RMs before starting either of them, yarn.resourcemanager.ha.id is changed to 
> rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting.
> So I think it is safe to make a copy of the configuration when initializing 
> each RM.
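
A minimal sketch of the proposed fix, assuming each RM in MiniYARNCluster is handed
its own copy of the incoming configuration so that updateConnectAddr() on one RM
cannot leak into the other; the helper below is illustrative, not the attached patch:

{code}
// Illustrative sketch of "make a copy of configuration when init both of
// the RMs": give each RM its own YarnConfiguration copy so that later
// mutations (yarn.resourcemanager.ha.id, updateConnectAddr()) stay local
// to that RM. Not the attached patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PerRmConfSketch {
  public static Configuration[] confsForRms(Configuration base, int numRms) {
    Configuration[] confs = new Configuration[numRms];
    for (int i = 0; i < numRms; i++) {
      // The copy constructor clones the properties of the base config.
      confs[i] = new YarnConfiguration(base);
    }
    return confs;
  }
}
{code}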



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567380#comment-14567380
 ] 

Hudson commented on YARN-2900:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #204 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/204/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* hadoop-yarn-project/CHANGES.txt


> Application (Attempt and Container) Not Found in AHS results in Internal 
> Server Error (500)
> ---
>
> Key: YARN-2900
> URL: https://issues.apache.org/jira/browse/YARN-2900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Fix For: 2.7.1
>
> Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
> YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
> YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch
>
>
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
>   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567379#comment-14567379
 ] 

Hudson commented on YARN-3725:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #204 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/204/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java


> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
> to renew the DT instead of the configured address. This breaks the procedure of 
> submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in Java code. The REST API response is an encoded token String, so it is 
> quite inconvenient to deserialize it, set the service address, and serialize it 
> again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567417#comment-14567417
 ] 

Hudson commented on YARN-2900:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #213 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/213/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* hadoop-yarn-project/CHANGES.txt


> Application (Attempt and Container) Not Found in AHS results in Internal 
> Server Error (500)
> ---
>
> Key: YARN-2900
> URL: https://issues.apache.org/jira/browse/YARN-2900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Fix For: 2.7.1
>
> Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
> YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
> YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch
>
>
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
>   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567416#comment-14567416
 ] 

Hudson commented on YARN-3725:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #213 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/213/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* hadoop-yarn-project/CHANGES.txt


> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
> to renew the DT instead of the configured address. This breaks the procedure of 
> submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in Java code. The REST API response is an encoded token String, so it is 
> quite inconvenient to deserialize it, set the service address, and serialize it 
> again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567447#comment-14567447
 ] 

Hudson commented on YARN-3725:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2161 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2161/])
YARN-3725. App submission via REST API is broken in secure mode due to Timeline 
DT service address is empty. (Zhijie Shen via wangda) (wangda: rev 
5cc3fced957a8471733e0e9490878bd68429fe24)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt


> App submission via REST API is broken in secure mode due to Timeline DT 
> service address is empty
> 
>
> Key: YARN-3725
> URL: https://issues.apache.org/jira/browse/YARN-3725
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: YARN-3725.1.patch
>
>
> YARN-2971 changes TimelineClient to use the service address from the Timeline DT 
> to renew the DT instead of the configured address. This breaks the procedure of 
> submitting a YARN app via the REST API in secure mode.
> The problem is that the service address is set by the client instead of the 
> server in Java code. The REST API response is an encoded token String, so it is 
> quite inconvenient to deserialize it, set the service address, and serialize it 
> again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2900) Application (Attempt and Container) Not Found in AHS results in Internal Server Error (500)

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567448#comment-14567448
 ] 

Hudson commented on YARN-2900:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2161 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2161/])
YARN-2900. Application (Attempt and Container) Not Found in AHS results (xgong: 
rev 9686261ecb872ad159fac3ca44f1792143c6d7db)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebServices.java
* hadoop-yarn-project/CHANGES.txt


> Application (Attempt and Container) Not Found in AHS results in Internal 
> Server Error (500)
> ---
>
> Key: YARN-2900
> URL: https://issues.apache.org/jira/browse/YARN-2900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Fix For: 2.7.1
>
> Attachments: YARN-2900-b2-2.patch, YARN-2900-b2.patch, 
> YARN-2900-branch-2.7.20150530.patch, YARN-2900.20150529.patch, 
> YARN-2900.20150530.patch, YARN-2900.20150530.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, YARN-2900.patch, 
> YARN-2900.patch, YARN-2900.patch, YARN-2900.patch
>
>
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToApplicationReport(ApplicationHistoryManagerImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:222)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices$2.run(WebServices.java:219)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApp(WebServices.java:218)
>   ... 59 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3748) Cleanup Findbugs volatile warnings

2015-06-01 Thread Gabor Liptak (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Liptak updated YARN-3748:
---
Attachment: YARN-3748.4.patch

> Cleanup Findbugs volatile warnings
> --
>
> Key: YARN-3748
> URL: https://issues.apache.org/jira/browse/YARN-3748
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Gabor Liptak
>Priority: Minor
> Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch, 
> YARN-3748.4.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3748) Cleanup Findbugs volatile warnings

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567506#comment-14567506
 ] 

Hadoop QA commented on YARN-3748:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 58s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:red}-1{color} | javac |   3m 19s | The patch appears to cause the 
build to fail. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736589/YARN-3748.4.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 63e3fee |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8152/console |


This message was automatically generated.

> Cleanup Findbugs volatile warnings
> --
>
> Key: YARN-3748
> URL: https://issues.apache.org/jira/browse/YARN-3748
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Gabor Liptak
>Priority: Minor
> Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch, 
> YARN-3748.4.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2194:
--
Attachment: YARN-2194-3.patch

Thanks, [~kasha]. Updated the patch, adding more comments.

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failures.
> RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
> systemd has certain shortcomings, as identified in this JIRA (see comments).
> This JIRA only fixes the failure and doesn't try to use systemd.
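
A minimal sketch of the parsing issue, assuming the controller path is discovered
from /proc/mounts: on RHEL7 the cgroup mount options contain "cpu,cpuacct", so the
"cpu" controller has to be matched inside a comma-separated list rather than by
comparing the whole string. Illustrative only, not the attached patch:

{code}
// Illustrative sketch (not the YARN-2194 patch): on RHEL7 the cgroup
// mount line carries "cpu,cpuacct", so the controller must be matched
// inside the comma-separated option list instead of by whole-string
// comparison of the mount name.
import java.util.Arrays;

public class CgroupMountSketch {
  /**
   * @param mountOptions e.g. "rw,nosuid,nodev,noexec,relatime,cpu,cpuacct"
   * @param controller   e.g. "cpu"
   */
  public static boolean mountsController(String mountOptions,
      String controller) {
    return Arrays.asList(mountOptions.split(",")).contains(controller);
  }

  public static void main(String[] args) {
    // A RHEL6-style mount matches, and so does the RHEL7 combined mount.
    System.out.println(mountsController("rw,relatime,cpu", "cpu"));          // true
    System.out.println(mountsController("rw,relatime,cpu,cpuacct", "cpu"));  // true
  }
}
{code}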



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3686) CapacityScheduler should trim default_node_label_expression

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567542#comment-14567542
 ] 

Sunil G commented on YARN-3686:
---

Thank You [~leftnoteasy] for reviewing and committing the same!

> CapacityScheduler should trim default_node_label_expression
> ---
>
> Key: YARN-3686
> URL: https://issues.apache.org/jira/browse/YARN-3686
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: 0001-YARN-3686.patch, 0002-YARN-3686.patch, 
> 0003-YARN-3686.patch, 0004-YARN-3686.patch
>
>
> We should trim default_node_label_expression for queue before using it.
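
A minimal sketch of the trimming described above; the helper is hypothetical and
not the CapacityScheduler code:

{code}
// Minimal sketch of "trim default_node_label_expression before using it";
// the helper is hypothetical, not the CapacityScheduler code.
public class LabelExpressionSketch {
  public static String normalize(String defaultNodeLabelExpression) {
    return defaultNodeLabelExpression == null
        ? null
        : defaultNodeLabelExpression.trim();
  }
}
{code}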



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3647) RMWebServices api's should use updated api from CommonNodeLabelsManager to get NodeLabel object

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567546#comment-14567546
 ] 

Sunil G commented on YARN-3647:
---

Thank You [~leftnoteasy] for committing the patch.

> RMWebServices api's should use updated api from CommonNodeLabelsManager to 
> get NodeLabel object
> ---
>
> Key: YARN-3647
> URL: https://issues.apache.org/jira/browse/YARN-3647
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sunil G
>Assignee: Sunil G
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-3647.patch, 0002-YARN-3647.patch
>
>
> After YARN-3579, RMWebServices apis can use the updated version of apis in 
> CommonNodeLabelsManager which gives full NodeLabel object instead of creating 
> NodeLabel object from plain label name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism

2015-06-01 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev reassigned YARN-3542:
---

Assignee: Varun Vasudev

> Re-factor support for CPU as a resource using the new ResourceHandler 
> mechanism
> ---
>
> Key: YARN-3542
> URL: https://issues.apache.org/jira/browse/YARN-3542
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Sidharta Seethana
>Assignee: Varun Vasudev
>Priority: Critical
>
> In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier 
> addition of new resource types in the nodemanager (this was used for network 
> as a resource - See YARN-2140 ). We should refactor the existing CPU 
> implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using 
> the new ResourceHandler mechanism. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column

2015-06-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567573#comment-14567573
 ] 

Junping Du commented on YARN-3699:
--

Hi [~jrottinghuis] and [~vrushalic], thanks for your comments, and sorry for 
replying late on this; I was traveling last week.
I fully agree with Joep's comments above that there is no right or wrong schema, 
just the one that best fits the priority scenarios:
- If we mostly need to query flow_run under a specific flow (or flows), then 
making the flow version a column will make that query more efficient.
- If we equally (or more often) need flow_run under specific flow version(s), 
then our decision here could be different.
To me, the tricky/interesting part is that the boundary between different flows 
and flow versions can be vague in practice: how big or small a change to a flow 
should start a new flow versus a new flow version? Why would we have several 
active flow versions instead of only one active flow version (with more flows)? 
These trade-offs at the application level also affect our trade-offs in schema 
design, which is a pretty common situation I have seen in other apps as well.
I would like to trust your prioritization here, given your experience with hRaven, 
which has already been running well in production for years. So I agree the 
Phoenix schema should be adjusted slightly to get closer to the HBase one.
Maybe we should file a new JIRA for this (Phoenix schema) change? We can either 
keep this JIRA open for discussion or resolve it as Later, so that in the future, 
if others in the community bring other solid scenarios from practice, we can 
continue the discussion here and try to make a better trade-off or innovation. 
Thoughts?
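
For readers following the discussion, here is a sketch of the two layouts being
weighed, assuming an illustrative "!"-separated key of cluster!user!flow(!version)!runId;
the field order and separator are not the real timeline-service schema:

{code}
// Illustrative only: the two row-key layouts under discussion, with an
// assumed "!"-separated key. The field order and separator are not the
// real timeline-service schema.
public class FlowRowKeySketch {
  private static final String SEP = "!";

  // Option A (version as a column): all runs of a flow share one key
  // prefix, so "all runs of this flow" is a single contiguous scan.
  public static String rowKeyWithoutVersion(String cluster, String user,
      String flow, long runId) {
    return cluster + SEP + user + SEP + flow + SEP + runId;
  }

  // Option B (version in the key): runs of the same flow but different
  // versions land in different key ranges, which helps version-scoped
  // queries but fragments per-flow scans.
  public static String rowKeyWithVersion(String cluster, String user,
      String flow, String flowVersion, long runId) {
    return cluster + SEP + user + SEP + flow + SEP + flowVersion
        + SEP + runId;
  }
}
{code}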

> Decide if  flow version should be part of row key or column
> ---
>
> Key: YARN-3699
> URL: https://issues.apache.org/jira/browse/YARN-3699
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>
> Based on discussions in YARN-3411 with [~djp], filing jira for continuing 
> discussion on putting the flow version in rowkey or column. 
> Either phoenix/hbase approach will update the jira with the conclusions..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567609#comment-14567609
 ] 

Vinod Kumar Vavilapalli commented on YARN-1462:
---

bq. Committed to trunk/branch-2/branch-2.7.
[~zjshen]/[~xgong], why are we putting this in 2.7? It looks more like an 
enhancement to me. Unless there is a strong requirement, we should revert it 
from branch-2.7.

> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.7.1
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2872) CapacityScheduler: Add disk I/O resource to DRF

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567621#comment-14567621
 ] 

Sunil G commented on YARN-2872:
---

Hi [~kasha],
I would like to take this up for the CapacityScheduler and will work on a patch 
for it. Kindly reassign if you prefer otherwise.

> CapacityScheduler: Add disk I/O resource to DRF
> ---
>
> Key: YARN-2872
> URL: https://issues.apache.org/jira/browse/YARN-2872
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Karthik Kambatla
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2872) CapacityScheduler: Add disk I/O resource to DRF

2015-06-01 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-2872:
-

Assignee: Sunil G

> CapacityScheduler: Add disk I/O resource to DRF
> ---
>
> Key: YARN-2872
> URL: https://issues.apache.org/jira/browse/YARN-2872
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Karthik Kambatla
>Assignee: Sunil G
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3492) AM fails to come up because RM and NM can't connect to each other

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3492.
---
Resolution: Cannot Reproduce

Closing this based on previous comments. Please reopen this in case you run 
into it again.

> AM fails to come up because RM and NM can't connect to each other
> -
>
> Key: YARN-3492
> URL: https://issues.apache.org/jira/browse/YARN-3492
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
> Environment: pseudo-distributed cluster on a mac
>Reporter: Karthik Kambatla
>Priority: Blocker
> Attachments: mapred-site.xml, 
> yarn-kasha-nodemanager-kasha-mbp.local.log, 
> yarn-kasha-resourcemanager-kasha-mbp.local.log, yarn-site.xml
>
>
> Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The 
> container gets allocated, but doesn't get launched. The NM can't talk to the 
> RM. Logs to follow. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567672#comment-14567672
 ] 

Hadoop QA commented on YARN-3751:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 56s | Pre-patch trunk has 3 extant 
Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 54s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 55s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 35s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m  1s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-server-common. |
| | |  38m 16s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736418/0001-YARN-3751.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 63e3fee |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8153/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8153/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8153/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8153/console |


This message was automatically generated.

> TestAHSWebServices fails after YARN-3467
> 
>
> Key: YARN-3751
> URL: https://issues.apache.org/jira/browse/YARN-3751
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Sunil G
> Attachments: 0001-YARN-3751.patch
>
>
> YARN-3467 changed AppInfo and assumed that the used resource is not null. That 
> is not always true, as this information is not published to the timeline server.
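
A minimal sketch of the kind of null guard the description calls for, assuming the
resource-usage object may be absent when the report is served from the timeline
store; the field and type names below are illustrative, not the actual AppInfo code:

{code}
// Illustrative null guard for the "used resource may be null" case; the
// names below are assumptions, not the actual AppInfo fields.
public class AppInfoSketch {
  private long allocatedMB = 0;
  private long allocatedVCores = 0;

  public AppInfoSketch(ResourceUsage usage) {
    // The timeline server does not publish resource usage, so guard it
    // instead of dereferencing it unconditionally.
    if (usage != null) {
      this.allocatedMB = usage.getMemoryMB();
      this.allocatedVCores = usage.getVirtualCores();
    }
  }

  /** Hypothetical usage holder used only for this sketch. */
  public interface ResourceUsage {
    long getMemoryMB();
    long getVirtualCores();
  }
}
{code}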



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3751) TestAHSWebServices fails after YARN-3467

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567682#comment-14567682
 ] 

Sunil G commented on YARN-3751:
---

The Findbugs warning can be skipped, as the exception handling is done in another 
method in WebServices.java; it is reported here as a false positive.

The existing test case in TestAHSWebServices covers this scenario.

> TestAHSWebServices fails after YARN-3467
> 
>
> Key: YARN-3751
> URL: https://issues.apache.org/jira/browse/YARN-3751
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Sunil G
> Attachments: 0001-YARN-3751.patch
>
>
> YARN-3467 changed AppInfo and assumed that the used resource is not null. That 
> is not always true, as this information is not published to the timeline server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException

2015-06-01 Thread Masatake Iwasaki (JIRA)
Masatake Iwasaki created YARN-3752:
--

 Summary: TestRMFailover fails due to intermittent 
UnknownHostException
 Key: YARN-3752
 URL: https://issues.apache.org/jira/browse/YARN-3752
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor


Client fails to create connection due to UnknownHostException while client 
retries to connect to next RM after failover in unit test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column

2015-06-01 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567744#comment-14567744
 ] 

Sangjin Lee commented on YARN-3699:
---

Thanks [~djp] for your comments!

Once everyone's comfortable with the decision of not making the flow version 
part of the row key, then we could resolve this JIRA by recording that decision 
(+1 or -1). Then we could open a separate JIRA for the phoenix writer to 
relocate the flow version (remove it from the PK). But a bigger question there 
is, if we're going with the native HBase schema, what is the status of the 
phoenix writer implementation?

For me, (if it wasn't obvious in previous comments), I'm +1 with the flow 
version *not* being in the row key.

> Decide if  flow version should be part of row key or column
> ---
>
> Key: YARN-3699
> URL: https://issues.apache.org/jira/browse/YARN-3699
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>
> Based on discussions in YARN-3411 with [~djp], filing jira for continuing 
> discussion on putting the flow version in rowkey or column. 
> Either phoenix/hbase approach will update the jira with the conclusions..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3699) Decide if flow version should be part of row key or column

2015-06-01 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567756#comment-14567756
 ] 

Li Lu commented on YARN-3699:
-

I'm +1 on removing the flow version section from the row key. I can make the 
change to our Phoenix writer. However, I agree with [~sjlee0] that we're not 
sure about the next step plan on the Phoenix writer. I'm OK with leaving it as 
an aggregation-only writer for now. 

> Decide if  flow version should be part of row key or column
> ---
>
> Key: YARN-3699
> URL: https://issues.apache.org/jira/browse/YARN-3699
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>
> Based on discussions in YARN-3411 with [~djp], filing jira for continuing 
> discussion on putting the flow version in rowkey or column. 
> Either phoenix/hbase approach will update the jira with the conclusions..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException

2015-06-01 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567762#comment-14567762
 ] 

Masatake Iwasaki commented on YARN-3752:


My /etc/hosts works for a pseudo-distributed cluster and for the NameNode HA unit 
tests in HDFS.

Hostnames are also successfully resolved at first in TestRMFailover.
{noformat}
java.net.ConnectException: Call From centos7/127.0.0.1 to 0.0.0.0:28031 failed 
on connection exception: java.net.ConnectException: Connection refused; For 
more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
...
java.io.EOFException: End of File Exception between local host is: 
"centos7/127.0.0.1"; destination host is: "0.0.0.0":18031; : 
java.io.EOFException; For more details see:  
http://wiki.apache.org/hadoop/EOFException
...
{noformat}

Client fails to create connection due to UnknownHostException while client 
retries to connect to next RM after failover.
{noformat}
java.io.IOException: java.util.concurrent.ExecutionException: 
java.net.UnknownHostException: Invalid host name: local host is: (unknown); 
destination host is: "centos7":28032; java.net.UnknownHostException; For more 
details see:  http://wiki.apache.org/hadoop/UnknownHost
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1487)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1371)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getApplications(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:251)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy16.getApplications(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:484)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:461)
at 
org.apache.hadoop.yarn.client.TestRMFailover.verifyClientConnection(TestRMFailover.java:119)
at 
org.apache.hadoop.yarn.client.TestRMFailover.verifyConnections(TestRMFailover.java:133)
at 
org.apache.hadoop.yarn.client.TestRMFailover.testExplicitFailover(TestRMFailover.java:168
.

Caused by: java.net.UnknownHostException: Invalid host name: local host is: 
(unknown); destination host is: "centos7":28032; java.net.UnknownHostException; 
For more details see:  http://wiki.apache.org/hadoop/UnknownHost
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:744)
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:408)
at org.apache.hadoop.ipc.Client$1.call(Client.java:1483)
at org.apache.hadoop.ipc.Client$1.call(Client.java:1480)
at 
com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4767)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
... 49 more
{noformat}

It may be a timing/environment issue, because the test seems to succeed in the QA 
runs.


> TestRMFailover fails due to intermittent UnknownHostException
> -
>
> Key: YARN-3752
> URL: https://issues.apache.org/jira/browse/YARN-3752
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
>
> Client fails to create connection due to UnknownHostException while client 
> retries to connect to next RM after failover in unit test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-1462:

Fix Version/s: (was: 2.7.1)
   2.8.0

> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567758#comment-14567758
 ] 

Xuan Gong commented on YARN-1462:
-

Okay, reverted it from branch-2.7 and changed the fix version from 2.7.1 to 2.8.

> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException

2015-06-01 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567763#comment-14567763
 ] 

Masatake Iwasaki commented on YARN-3752:


While doing some trial and error, I found that using 127.0.0.1 instead of 0.0.0.0 
for the server addresses fixed this.
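
For illustration, a minimal sketch of that kind of override in a test
configuration (which addresses actually need to be pinned depends on the test,
so treat the property list below as an assumption):
{code}
// Assumes org.apache.hadoop.conf.Configuration and
// org.apache.hadoop.yarn.conf.YarnConfiguration.
// Hypothetical test setup: bind the RM endpoints to the loopback address
// instead of the 0.0.0.0 wildcard so client-side host resolution stays predictable.
Configuration conf = new YarnConfiguration();
conf.set(YarnConfiguration.RM_ADDRESS, "127.0.0.1:18032");           // was 0.0.0.0:18032
conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS, "127.0.0.1:18030");
conf.set(YarnConfiguration.RM_ADMIN_ADDRESS, "127.0.0.1:18033");
{code}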

> TestRMFailover fails due to intermittent UnknownHostException
> -
>
> Key: YARN-3752
> URL: https://issues.apache.org/jira/browse/YARN-3752
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
>
> Client fails to create connection due to UnknownHostException while client 
> retries to connect to next RM after failover in unit test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3752) TestRMFailover fails due to intermittent UnknownHostException

2015-06-01 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567775#comment-14567775
 ] 

Masatake Iwasaki commented on YARN-3752:


I don't know the exact reason yet, but using independent Configuration instances 
for each RM in MiniYARNCluster, as YARN-3749 does, also worked.
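
For reference, a minimal sketch of what "independent Configuration instances"
means here (simplified; not the actual MiniYARNCluster code):
{code}
// Assumes org.apache.hadoop.conf.Configuration and
// org.apache.hadoop.yarn.conf.YarnConfiguration.
// Give each RM its own copy so per-RM settings (e.g. yarn.resourcemanager.ha.id,
// or the address rewritten by updateConnectAddr) cannot leak between rm1 and rm2.
Configuration rm1Conf = new YarnConfiguration(conf);
Configuration rm2Conf = new YarnConfiguration(conf);
rm1Conf.set(YarnConfiguration.RM_HA_ID, "rm1");
rm2Conf.set(YarnConfiguration.RM_HA_ID, "rm2");
{code}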

> TestRMFailover fails due to intermittent UnknownHostException
> -
>
> Key: YARN-3752
> URL: https://issues.apache.org/jira/browse/YARN-3752
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
>
> Client fails to create connection due to UnknownHostException while client 
> retries to connect to next RM after failover in unit test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567778#comment-14567778
 ] 

Hudson commented on YARN-1462:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #7940 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7940/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567851#comment-14567851
 ] 

Jason Lowe commented on YARN-3585:
--

Thanks for the patch, Rohith!

I think it would be safer/simpler to assume we shouldn't be calling exit unless 
NodeManager.main() was invoked (i.e., we're likely running in a JVM whose sole 
purpose is to be the nodemanager). In that sense, I'm wondering if we should 
flip the logic to not exit by default and then have NodeManager.main override 
that. This probably precludes the need to update existing tests.

We should be using ExitUtil instead of System.exit directly.

Nit: "setexitOnShutdownEvent" s/b "setExitOnShutdownEvent"
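
To make that concrete, a rough sketch of the flipped default (illustrative only,
not the actual patch; names are placeholders):
{code}
// Default to *not* exiting, so embedded/test usage is safe by default.
private boolean exitOnShutdownEvent = false;

void setExitOnShutdownEvent(boolean exit) {
  this.exitOnShutdownEvent = exit;
}

// In the SHUTDOWN event handling:
if (exitOnShutdownEvent) {
  // ExitUtil rather than System.exit, so tests can disable/inspect termination.
  org.apache.hadoop.util.ExitUtil.terminate(-1);
}

// NodeManager.main() would call setExitOnShutdownEvent(true) before starting the service.
{code}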


> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567969#comment-14567969
 ] 

Vinod Kumar Vavilapalli commented on YARN-2194:
---

Thinking out loud: should we do OS-specific checks for this?

Also, does the newer CGroupsHandlerImpl need to change as well? /cc [~vvasudev].

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the user of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Matthew Jacobs (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568005#comment-14568005
 ] 

Matthew Jacobs commented on YARN-2194:
--

While this may work for the default RHEL7 configuration, it will break if 
someone happens to have mounted the same controllers at 
"/sys/fs/cgroup/cpuacct,cpu", or if the user has mounted other controllers at 
the same path as well. What do you think about creating the symlink from 
"/sys/fs/cgroup/cpu" to the mounted path for cpu in all cases (unless, of 
course, it was actually mounted at /sys/fs/cgroup/cpu)?
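
A rough sketch of that symlink idea (illustrative only; the real change would
live in the NM/container-executor and needs appropriate permissions):
{code}
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

// If the cpu controller is mounted under a combined name such as
// /sys/fs/cgroup/cpu,cpuacct (discovered from /proc/mounts), also expose it at
// the conventional location so the comma never appears in paths we hand out.
Path conventional = Paths.get("/sys/fs/cgroup/cpu");
Path actualMount = Paths.get("/sys/fs/cgroup/cpu,cpuacct");
if (!Files.exists(conventional, LinkOption.NOFOLLOW_LINKS)) {
  Files.createSymbolicLink(conventional, actualMount);
}
{code}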

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the user of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-01 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568063#comment-14568063
 ] 

Sidharta Seethana commented on YARN-2194:
-

Isn't it better to use a different separator that is less likely to be in use 
(e.g. ':' or '|' instead of ',') when invoking container-executor? Granted, 
this is a (slightly) bigger change, but it seems like the right thing to do. 
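
For illustration, the kind of change being suggested on the NM side when it
builds the cgroup argument for container-executor (the separator choice and
variable names here are assumptions, not the current code):
{code}
import com.google.common.base.Joiner;

// Join controller paths with a separator that cannot occur in a RHEL7
// controller name such as "cpu,cpuacct" ('%' is just an example choice).
String cgroupArg = Joiner.on('%').join(cgroupPaths);
{code}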

> Cgroups cease to work in RHEL7
> --
>
> Key: YARN-2194
> URL: https://issues.apache.org/jira/browse/YARN-2194
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch
>
>
> In RHEL7, the CPU controller is named "cpu,cpuacct". The comma in the 
> controller name leads to container launch failure. 
> RHEL7 deprecates libcgroup and recommends the user of systemd. However, 
> systemd has certain shortcomings as identified in this JIRA (see comments). 
> This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3542:
--
Target Version/s: 2.8.0

Let's do this in the 2.8 timeline before the two implementations diverge more.

> Re-factor support for CPU as a resource using the new ResourceHandler 
> mechanism
> ---
>
> Key: YARN-3542
> URL: https://issues.apache.org/jira/browse/YARN-3542
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Sidharta Seethana
>Assignee: Varun Vasudev
>Priority: Critical
>
> In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier 
> addition of new resource types in the nodemanager (this was used for network 
> as a resource - See YARN-2140 ). We should refactor the existing CPU 
> implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using 
> the new ResourceHandler mechanism. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3542) Re-factor support for CPU as a resource using the new ResourceHandler mechanism

2015-06-01 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568083#comment-14568083
 ] 

Sidharta Seethana commented on YARN-3542:
-

+1 to this

> Re-factor support for CPU as a resource using the new ResourceHandler 
> mechanism
> ---
>
> Key: YARN-3542
> URL: https://issues.apache.org/jira/browse/YARN-3542
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Sidharta Seethana
>Assignee: Varun Vasudev
>Priority: Critical
>
> In YARN-3443 , a new ResourceHandler mechanism was added which enabled easier 
> addition of new resource types in the nodemanager (this was used for network 
> as a resource - See YARN-2140 ). We should refactor the existing CPU 
> implementation ( LinuxContainerExecutor/CgroupsLCEResourcesHandler ) using 
> the new ResourceHandler mechanism. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Sumana Sathish (JIRA)
Sumana Sathish created YARN-3753:


 Summary: RM failed to come up with "java.io.IOException: Wait for 
ZKClient creation timed out"
 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Priority: Critical


The RM failed to come up with the following error while submitting a MapReduce job.
{code:title=RM log}
2015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
(RMStateStore.java:transition(179)) - Error storing app: 
application_1432956515242_0006
java.io.IOException: Wait for ZKClient creation timed out
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
(ResourceManager.java:handle(750)) - Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
java.io.IOException: Wait for ZKClient creation timed out
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMSt

[jira] [Assigned] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He reassigned YARN-3753:
-

Assignee: Jian He

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yar

[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers

2015-06-01 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568128#comment-14568128
 ] 

Li Lu commented on YARN-3051:
-

Hi [~varun_saxena], thanks for the work! Not sure if you've already made 
progress since the latest patch, but I'm posting some of my comments and 
questions w.r.t. the reader API design in the 003 patch. I may have more 
comments in the near future, but I won't mind seeing a new patch before posting 
them. 

# I noticed there is a _readerLimit_ for read operations, which works for ATS 
v1. I'm wondering if it's fine to use -1 to indicate there's no such limit? Not 
sure if this feature is already there. 
# For the {{fromId}} parameter, we may need to be careful about the concept of 
"id". In timeline v2 we need context information to identify each entity, such 
as cluster, user, flow, and run. When querying with {{fromId}}, what kind of 
assumptions should we make about the "id" here? Are we assuming all entities 
are of the same cluster, user, and/or flow, or is the "id" a concatenation of 
all that information, or is it something else? 
# For all filter-related parameters, I'm not sure the current object model and 
storage implementation support a trivial solution. I'd certainly welcome any 
comments/suggestions on this problem. 
# Based on the previous two issues, a more general question: shall we focus on 
an evolution of the v1 API here, or start a v2 reader API design from scratch 
and then try to make it compatible with the v1 APIs? The current patch looks to 
be pursuing the evolution approach. 
# In some APIs we require clusterID and appID but not flow/run information. In 
the current writer implementations this implies a full table scan. Maybe we can 
have flow and run information as optional parameters so that we can avoid full 
table scans when the caller does have them?
# The current APIs require a pretty long list of parameters. For most use 
cases, I think we can abstract something much simpler. Do we plan to add those 
"simple APIs" in a higher layer? Passing a lot of nulls when calling the reader 
API looks suboptimal, but with only these few APIs we may need to do that 
frequently. A rough sketch of one possible shape follows below.
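
For example (class and method names are placeholders, not a concrete proposal
for the YARN-3051 API), a small context object could carry the common
cluster/user/flow/run parameters so callers don't have to pass long lists of
nulls, with a few convenience methods layered on top for the common queries:
{code}
// Illustrative only: names are placeholders, not the actual reader API.
public final class ReaderQueryContext {
  private final String clusterId;
  private final String userId;
  private final String flowName;  // optional
  private final Long flowRunId;   // optional; null means "not known"
  private final String appId;     // optional

  private ReaderQueryContext(Builder b) {
    this.clusterId = b.clusterId;
    this.userId = b.userId;
    this.flowName = b.flowName;
    this.flowRunId = b.flowRunId;
    this.appId = b.appId;
  }

  public static final class Builder {
    private String clusterId;
    private String userId;
    private String flowName;
    private Long flowRunId;
    private String appId;

    public Builder cluster(String c) { this.clusterId = c; return this; }
    public Builder user(String u)    { this.userId = u; return this; }
    public Builder flow(String f)    { this.flowName = f; return this; }
    public Builder flowRun(Long r)   { this.flowRunId = r; return this; }
    public Builder app(String a)     { this.appId = a; return this; }

    public ReaderQueryContext build() { return new ReaderQueryContext(this); }
  }
}
{code}
A full-parameter read method could keep taking such a context plus
filters/limit, while thin convenience wrappers cover the common cases.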

> [Storage abstraction] Create backing storage read interface for ATS readers
> ---
>
> Key: YARN-3051
> URL: https://issues.apache.org/jira/browse/YARN-3051
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
> Attachments: YARN-3051-YARN-2928.003.patch, 
> YARN-3051-YARN-2928.03.patch, YARN-3051.wip.02.YARN-2928.patch, 
> YARN-3051.wip.patch, YARN-3051_temp.patch
>
>
> Per design in YARN-2928, create backing storage read interface that can be 
> implemented by multiple backing storage implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568145#comment-14568145
 ] 

Jian He commented on YARN-3753:
---

This happens because this exception, {{new IOException("Wait for ZKClient 
creation timed out")}}, is not retried by the upper-level runWithRetries 
method, which causes the RM to fail. We've seen quite a few issues regarding 
the retry logic of the ZK store; YARN-2716 should be the long-term solution to 
fix all of these. In the interim, I'm writing a quick work-around patch for 
this, as this problem makes the RM unavailable. 

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMach

[jira] [Updated] (YARN-3748) Cleanup Findbugs volatile warnings

2015-06-01 Thread Gabor Liptak (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Liptak updated YARN-3748:
---
Attachment: (was: YARN-3748.4.patch)

> Cleanup Findbugs volatile warnings
> --
>
> Key: YARN-3748
> URL: https://issues.apache.org/jira/browse/YARN-3748
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Gabor Liptak
>Priority: Minor
> Attachments: YARN-3748.1.patch, YARN-3748.2.patch, YARN-3748.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568187#comment-14568187
 ] 

zhihai xu commented on YARN-3749:
-

Hi [~chenchun], thanks for filing and working on this issue.
The patch seems reasonable to me.
Some nits: 
1. It looks like setRpcAddressForRM and setConfForRM are only used by test 
code. Should we create a new HA test utility file to hold these functions?

2. Do we really need the following change in {{MiniYARNCluster#serviceInit}}?
{code}
conf.set(YarnConfiguration.RM_HA_ID, "rm0");
{code}
I ask because {{initResourceManager}} will also configure {{RM_HA_ID}}.

3. Is there any particular reason to configure {{YarnConfiguration.RM_HA_ID}} 
as {{RM2_NODE_ID}} instead of {{RM1_NODE_ID}} in ProtocolHATestBase?
{code}
conf.set(YarnConfiguration.RM_HA_ID, RM2_NODE_ID);
{code}

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568189#comment-14568189
 ] 

Sergey Shelukhin commented on YARN-1462:


This commit changes the newInstance API, breaking the Tez build. It is hard to 
make Tez compatible with both pre-2.8 and 2.8... is it possible to preserve 
both versions of the method?
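
For what it's worth, a generic sketch of the compatibility pattern being asked
for (the class and parameters below are placeholders, not the actual
ApplicationReport/newInstance signature):
{code}
import java.util.Collections;
import java.util.Set;

// Placeholder record type, illustrating "keep the old factory and delegate".
public final class ExampleReport {
  private final String id;
  private final Set<String> tags;

  private ExampleReport(String id, Set<String> tags) {
    this.id = id;
    this.tags = tags;
  }

  // Old factory signature, kept (possibly @Deprecated) so pre-2.8 callers still compile.
  @Deprecated
  public static ExampleReport newInstance(String id) {
    return newInstance(id, Collections.<String>emptySet());
  }

  // New factory signature that adds the extra parameter.
  public static ExampleReport newInstance(String id, Set<String> tags) {
    return new ExampleReport(id, tags);
  }
}
{code}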

> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568222#comment-14568222
 ] 

Karthik Kambatla commented on YARN-3753:


[~jianhe] - YARN-2716 is ready for review. I can make time for addressing any 
comments to get this in for trunk and branch-2. Given that, would it make sense 
to limit this fix to branch-2.7? 

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache

[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568284#comment-14568284
 ] 

Jian He commented on YARN-3753:
---

[~kasha], sure, makes sense. This can go into branch-2.7 only, and YARN-2716 
can go in for trunk and branch-2. 

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.a

[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568288#comment-14568288
 ] 

Jian He commented on YARN-3753:
---

After investigating more with Xuan, the problem is actually that the wait time 
in the method below is set to zkSessionTimeout (only 10 seconds), which doesn't 
actually make much sense. Here, the wait is meant to give the ZK connection 
time to be re-established:
{code}
while (zkClient == null) {
  ZKRMStateStore.this.wait(zkSessionTimeout);
  if (zkClient != null) {
    break;
  }
}
{code}

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.Stat

[jira] [Updated] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3753:
--
Attachment: YARN-3753.patch

Uploaded a patch to set the wait time based on numRetries * retry-interval; a 
rough sketch of the idea is below the repro steps. 

I reproduced this issue locally in the following way:
1. start RM
2. start ZK
3. kill ZK
4. submit a job 
  - without the patch, the RM fails with the same IOException("Wait for 
ZKClient creation timed out") 
  - with the patch, after the ZK server is restarted, the RM and the job 
continue to run successfully. 
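
For illustration, a minimal sketch of the idea (field and method names are
assumptions, not the actual ZKRMStateStore code in the patch):
{code}
import java.io.IOException;

// Wait for the full retry budget (numRetries * retryInterval) rather than a
// single session timeout before giving up on the ZK connection.
class ZkClientWaiter {
  private final int numRetries;        // e.g. from yarn.resourcemanager.zk-num-retries
  private final long retryIntervalMs;  // e.g. from yarn.resourcemanager.zk-retry-interval-ms
  private volatile Object zkClient;    // set by the connection watcher on (re)connect

  ZkClientWaiter(int numRetries, long retryIntervalMs) {
    this.numRetries = numRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  synchronized void waitForZkClient() throws InterruptedException, IOException {
    long deadline = System.currentTimeMillis() + numRetries * retryIntervalMs;
    while (zkClient == null) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        throw new IOException("Wait for ZKClient creation timed out");
      }
      wait(remaining);
    }
  }
}
{code}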

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalA

[jira] [Comment Edited] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568292#comment-14568292
 ] 

Jian He edited comment on YARN-3753 at 6/2/15 12:32 AM:


Uploaded a patch to set the wait time based on numRetries * retry-interval. 

I reproduced this issue locally in the following way:
1. start ZK
2. start RM
3. kill ZK
4. submit a job 
  - without the patch, the RM fails with the same IOException("Wait for 
ZKClient creation timed out") 
  - with the patch, after the ZK server is restarted, the RM and the job 
continue to run successfully. 


was (Author: jianhe):
Upload a patch to set the wait time based on numRetries*retry-interval. 

I reproduced this issue locally in following way.
1. start RM.
2. start ZK.
3. kill ZK.
4. submit a job 
  - without the patch, RM will fail with the same IOException("Wait for 
ZKClient creation timed out") 
 - with the patch, after re-start ZK server, RM and job can continue run 
successfully. 

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.a

[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568326#comment-14568326
 ] 

Xuan Gong commented on YARN-3753:
-

This is a short-term solution and is for branch-2.7 only. The main idea here 
is to increase the time the RM waits to re-connect to ZK.

I am OK with this patch. I will commit it later unless [~kasha] has additional 
comments.

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:handle(750)) - Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTran

[jira] [Commented] (YARN-3682) Decouple PID-file management from ContainerExecutor

2015-06-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568376#comment-14568376
 ] 

Vinod Kumar Vavilapalli commented on YARN-3682:
---

[~sidharta-s] / [~vvasudev], want to give a look? Tx.

> Decouple PID-file management from ContainerExecutor
> ---
>
> Key: YARN-3682
> URL: https://issues.apache.org/jira/browse/YARN-3682
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: YARN-3682-20150526.1.txt, YARN-3682-20150526.txt, 
> YARN-3682-20150529.1.txt
>
>
> The PID-files management currently present in ContainerExecutor really 
> doesn't belong there. I know the original history of why we added it, that 
> was about the only right place to put it in at that point of time.
> Given the evolution of executors for Windows etc, the ContainerExecutor is 
> getting more complicated than is necessary.
> We should pull the PID-file management into its own entity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with "java.io.IOException: Wait for ZKClient creation timed out"

2015-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568409#comment-14568409
 ] 

Hadoop QA commented on YARN-3753:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  8s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 37s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 47s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  50m 33s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 42s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736694/YARN-3753.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cdc13ef |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8154/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8154/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8154/console |


This message was automatically generated.

> RM failed to come up with "java.io.IOException: Wait for ZKClient creation 
> timed out"
> -
>
> Key: YARN-3753
> URL: https://issues.apache.org/jira/browse/YARN-3753
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3753.patch
>
>
> RM failed to come up with the following error while submitting an mapreduce 
> job.
> {code:title=RM log}
> 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(179)) - Error storing app: 
> application_1432956515242_0006
> java.io.IOException: Wait for ZKClient creation timed out
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>

[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-01 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568428#comment-14568428
 ] 

Tsuyoshi Ozawa commented on YARN-3170:
--

{quote}
The Scheduler has a pluggable policy plug-in
{quote}

I think Allen means the sentence is awkward since "pluggable policy plug-in" 
sounds redundant. Could you fix it?



> YARN architecture document needs updating
> -
>
> Key: YARN-3170
> URL: https://issues.apache.org/jira/browse/YARN-3170
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Allen Wittenauer
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
> YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
> YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170.patch
>
>
> The marketing paragraph at the top, "NextGen MapReduce", etc are all 
> marketing rather than actual descriptions. It also needs some general 
> updates, esp given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568429#comment-14568429
 ] 

Masatake Iwasaki commented on YARN-3749:


Thanks for working on this, [~chenchun]. I would like this fix to come in 
because it seems to affect YARN-3752, which I'm looking into.

{quote}
2. Do we really need the following change at MiniYARNCluster#serviceInit

   conf.set(YarnConfiguration.RM_HA_ID, "rm0");

Because I saw initResourceManager will also configure RM_HA_ID.
{quote}

When I tried something similar to the patch, I got the error below because 
{{HAUtil#getRMHAId}}, called from {{YarnConfiguration#updateConnectAddr}}, 
expects that at most one RM id matches the local node's address.

{noformat}
  2015-06-02 10:14:23,648 INFO  [Thread-284] service.AbstractService 
(AbstractService.java:noteFailure(272)) - Service 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
 failed in state STARTED; cause: 
org.apache.hadoop.HadoopIllegalArgumentException: The HA Configuration has 
multiple addresses that match local node's address.
  org.apache.hadoop.HadoopIllegalArgumentException: The HA Configuration has 
multiple addresses that match local node's address.
  at org.apache.hadoop.yarn.conf.HAUtil.getRMHAId(HAUtil.java:204)
  at 
org.apache.hadoop.yarn.conf.YarnConfiguration.updateConnectAddr(YarnConfiguration.java:1971)
  at 
org.apache.hadoop.conf.Configuration.updateConnectAddr(Configuration.java:2129)
  at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.serviceStart(ResourceLocalizationService.java:357)
  at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:467)
  at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:321)
  at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at 
org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper$1.run(MiniYARNCluster.java:562)
{noformat}

The check can be bypassed by setting a dummy value for 
{{yarn.resourcemanager.ha.id}} in the configuration *used by the NodeManager 
instance*. I think there should at least be a comment explaining that it is a 
dummy value for unit tests.
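
A minimal sketch of that workaround with the suggested comment, as an assumed 
test-only helper (not the actual patch):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DummyRmHaIdSketch {
  // Give the NodeManager's copy of the configuration a dummy RM HA id so that
  // HAUtil#getRMHAId does not fail when several RM addresses match the local host.
  static Configuration withDummyRmHaId(Configuration base) {
    Configuration nmConf = new Configuration(base);
    // Dummy value for unit tests only; it is never used to select a real RM.
    nmConf.set(YarnConfiguration.RM_HA_ID, "rm0");
    return nmConf;
  }
}
{code}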


> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568433#comment-14568433
 ] 

Zhijie Shen commented on YARN-1462:
---

bq. This commit changes newInstance API, breaking Tez build.

{{newInstance}} is marked as \@Private, and it's not supposed to be used outside 
Hadoop. What's the use case in Tez?

bq. is it possible to preserve both versions of the method?

It's possible, but the question is whether we should do it. Theoretically, 
compatibility is not required for a private method. If there is a strong use 
case for creating app reports outside Hadoop, we should mark this method 
\@Public and keep it compatible across releases.

> AHS API and other AHS changes to handle tags for completed MR jobs
> --
>
> Key: YARN-1462
> URL: https://issues.apache.org/jira/browse/YARN-1462
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0
>Reporter: Karthik Kambatla
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1462-branch-2.7-1.2.patch, 
> YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
> YARN-1462.3.patch
>
>
> AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.3.patch

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.4.patch

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
> YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568448#comment-14568448
 ] 

Chun Chen commented on YARN-3749:
-

Thanks for the review, [~zxu] and [~iwasakims].
Uploaded a new patch to address your comments.

bq. 1. It looks like setRpcAddressForRM and setConfForRM are only used by test 
code. Should we create a new HA test utility file to include these functions?

Moved setRpcAddressForRM and setConfForRM to HATestUtil.java

bq. 2. Do we really need the following change at MiniYARNCluster#serviceInit 
conf.set(YarnConfiguration.RM_HA_ID, "rm0");

This is indeed necessary; as [~iwasakims] commented, it is used to bypass the 
check in `HAUtil#getRMHAId` that the NodeManager instance runs into.

bq. 3. Is any particular reason to configure YarnConfiguration.RM_HA_ID as 
RM2_NODE_ID instead of RM1_NODE_ID in ProtocolHATestBase?

Not really, changed it to RM1_NODE_ID.

bq. I think there should be a comment explain that it is a dummy for unit test 
at least.

Added a comment in `MiniYARNCluster#serviceInit`

Also, the newly uploaded patch YARN-3749.4.patch only makes a copy of the 
configuration in initResourceManager when there are multiple RMs. If there is 
only one RM, many test cases in yarn-client depend on reading the random ports 
from the shared configuration after the RM starts.
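
A minimal sketch of that behavior (an assumed shape, not the actual 
YARN-3749.4.patch):

{code}
import org.apache.hadoop.conf.Configuration;

public class PerRmConfSketch {
  // Copy the configuration per RM only when there are multiple RMs; with a
  // single RM, tests rely on reading the random ports back from the shared
  // instance, so the caller's Configuration must be returned unchanged.
  static Configuration confForRM(Configuration base, int numRMs) {
    return numRMs > 1 ? new Configuration(base) : base;
  }
}
{code}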

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
> YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.5.patch

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
> YARN-3749.5.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568458#comment-14568458
 ] 

Chun Chen commented on YARN-3749:
-

Uploaded YARN-3749.5.patch to set {{YarnConfiguration.RM_HA_ID}} only in 
{{MiniYARNCluster#serviceInit}} and removed that setting from the other tests.

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
> YARN-3749.5.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Target Version/s: 2.7.1

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568462#comment-14568462
 ] 

Rohith commented on YARN-3733:
--

This fix needs to go into 2.7.1. Updated the target version to 2.7.1.

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568466#comment-14568466
 ] 

Sunil G commented on YARN-3733:
---

I feel "clusterResource=<0,0> lhs=<1,1>, and rhs<2,2>" may happen. But we 
cannot differentiate which is bigger infinity here and thats not correct. Why 
could we check for clusterResource=<0,0> prior to * getResourceAsValue()* check 
and handle from there. 
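
A tiny illustration of that point, using assumed numbers: when the cluster 
resource is <0,0>, both dominant shares become infinite and the comparison 
cannot tell lhs=<1,1> and rhs=<2,2> apart.

{code}
public class ZeroClusterCompareSketch {
  public static void main(String[] args) {
    double clusterMem = 0, clusterVcores = 0;                  // clusterResource = <0,0>
    double lhs = Math.max(1 / clusterMem, 1 / clusterVcores);  // Infinity
    double rhs = Math.max(2 / clusterMem, 2 / clusterVcores);  // Infinity
    // Double.compare returns 0: the two shares look "equal", so <1,1> and <2,2>
    // cannot be differentiated until clusterResource is non-zero.
    System.out.println(Double.compare(lhs, rhs));
  }
}
{code}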

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568492#comment-14568492
 ] 

zhihai xu commented on YARN-3749:
-

[~iwasakims], thanks for the finding and explanations!
[~chenchun], thanks for updating the patch!
After looking deeper into the code, I am more convinced this is a bug in 
{{YarnConfiguration#updateConnectAddr}}.
IMHO, we should change {{if (HAUtil.isHAEnabled(this))}} to {{if 
(HAUtil.isHAEnabled(this) && getServiceAddressConfKeys(this).contains(name))}}
to match the code in {{YarnConfiguration#getSocketAddr}}. It doesn't sound 
right to add the RM_HA_ID suffix to the NM service address 
"yarn.nodemanager.localizer.address". Also, there will be a problem if we call 
{{getSocketAddr}} after {{updateConnectAddr}} for the address property 
"yarn.nodemanager.localizer.address".
I think we can fix the HadoopIllegalArgumentException with this change.
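
A hedged sketch of that condition, reduced to a standalone helper (the key 
list and shape are assumptions, not the committed fix):

{code}
import java.util.Arrays;
import java.util.List;

public class UpdateConnectAddrSketch {
  // Only RM service address keys should receive the RM_HA_ID suffix; NM keys
  // such as yarn.nodemanager.localizer.address must fall through unchanged.
  static final List<String> RM_SERVICE_ADDRESS_KEYS = Arrays.asList(
      "yarn.resourcemanager.address",
      "yarn.resourcemanager.scheduler.address",
      "yarn.resourcemanager.admin.address");

  static String keyToUpdate(String name, boolean haEnabled, String rmHaId) {
    if (haEnabled && RM_SERVICE_ADDRESS_KEYS.contains(name)) {
      return name + "." + rmHaId;   // e.g. yarn.resourcemanager.address.rm1
    }
    return name;
  }
}
{code}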

> We should make a copy of configuration when init MiniYARNCluster with 
> multiple RMs
> --
>
> Key: YARN-3749
> URL: https://issues.apache.org/jira/browse/YARN-3749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
> Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
> YARN-3749.5.patch, YARN-3749.patch
>
>
> When I was trying to write a test case for YARN-2674, I found DS client 
> trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
> when RM failover. But I initially set 
> yarn.resourcemanager.address.rm1=0.0.0.0:18032, 
> yarn.resourcemanager.address.rm2=0.0.0.0:28032  After digging, I found it is 
> in ClientRMService where the value of yarn.resourcemanager.address.rm2 
> changed to 0.0.0.0:18032. See the following code in ClientRMService:
> {code}
> clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
>YarnConfiguration.RM_ADDRESS,
>
> YarnConfiguration.DEFAULT_RM_ADDRESS,
>server.getListenerAddress());
> {code}
> Since we use the same instance of configuration in rm1 and rm2 and init both 
> RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 
> during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during 
> starting of rm1.
> So I think it is safe to make a copy of configuration when init both of the 
> rm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-01 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3754:
--

 Summary: Race condition when the NodeManager is shutting down and 
container is launched
 Key: YARN-3754
 URL: https://issues.apache.org/jira/browse/YARN-3754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt



The container is launched and the result is reported back to ContainerImpl, 
but the NodeManager has already closed the DB connection during shutdown, 
which results in {{org.iq80.leveldb.DBException: Closed}}. 


*Attaching the exception trace*
{code}
2015-05-30 02:11:49,122 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Unable to update state store diagnostics for 
container_e310_1432817693365_3338_01_02
java.io.IOException: org.iq80.leveldb.DBException: Closed
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.iq80.leveldb.DBException: Closed
at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
... 15 more

{code}
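
A minimal sketch of the race described above (an assumed shape, not NodeManager 
code): the launcher thread writes container diagnostics to the state store 
while the shutdown path has already closed the underlying DB handle.

{code}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ClosedStoreRaceSketch {
  private final AtomicBoolean closed = new AtomicBoolean(false);

  // Launcher thread: updating diagnostics fails once the store has been
  // closed, mirroring the DBException: Closed in the trace above.
  void storeContainerDiagnostics(String containerId, String diag) throws IOException {
    if (closed.get()) {
      throw new IOException("DBException: Closed");
    }
    // db.put(diagnosticsKey(containerId), diag.getBytes()) would go here
  }

  // Shutdown thread: stopping the NodeManager closes the DB before in-flight
  // container launches have drained.
  void serviceStop() {
    closed.set(true);
  }
}
{code}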




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

