[jira] [Commented] (YARN-8320) [Umbrella] Support CPU isolation for latency-sensitive (LS) service

2018-05-28 Thread Miklos Szegedi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493112#comment-16493112
 ] 

Miklos Szegedi commented on YARN-8320:
--

Thank you, [~cheersyang], for the responses. They make sense to me in general.
{quote}how many cpuset resources are there on an NM, and how does an AM request them?
{quote}
In general, this is adapter code, passing cgroup functionality on to another 
API. As such it can do two things: one is being transparent, the other is 
making the original API easier to use. Your design tries to do the latter, 
which makes sense. Being transparent, however, would mean letting the AM choose 
cpu resources (controlling cpu and cpuacct) and cpuset resources (controlling 
cpuset) separately. I would prefer the transparent approach, since it keeps all 
functionality without restrictions and makes any future design easier to 
implement. cpuset would have as many processors as are available in 
cpuset.cpus of the container root cgroup, which is usually {{hadoop-yarn}}. 
Individual CPUs are chosen by the NM based on the number of cpuset CPUs granted 
by the RM.

However, I do not have a strong opinion about this.
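For illustration, here is a minimal sketch of the kind of adapter logic described above, where the NM picks CPUs for a container out of the parent {{hadoop-yarn}} cgroup's cpuset.cpus. The paths, the naming and the naive first-N selection are assumptions for the sketch, not the actual NM implementation.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public final class CpusetSketch {

  /** Parses a cpuset list such as "0-3,6" into individual CPU ids. */
  static List<Integer> parseCpuList(String cpus) {
    List<Integer> result = new ArrayList<>();
    for (String part : cpus.trim().split(",")) {
      String[] range = part.split("-");
      int lo = Integer.parseInt(range[0].trim());
      int hi = range.length > 1 ? Integer.parseInt(range[1].trim()) : lo;
      for (int cpu = lo; cpu <= hi; cpu++) {
        result.add(cpu);
      }
    }
    return result;
  }

  /** Creates a child cpuset cgroup for the container and assigns the first N CPUs. */
  static void assignCpus(Path yarnRoot, String containerId, int requestedCpus)
      throws IOException {
    String parentCpus = new String(
        Files.readAllBytes(yarnRoot.resolve("cpuset.cpus")), StandardCharsets.UTF_8);
    List<Integer> available = parseCpuList(parentCpus);
    if (requestedCpus > available.size()) {
      throw new IOException("Not enough CPUs in " + yarnRoot);
    }
    StringBuilder chosen = new StringBuilder();
    for (int i = 0; i < requestedCpus; i++) {
      if (i > 0) {
        chosen.append(',');
      }
      chosen.append(available.get(i));
    }
    Path containerDir = Files.createDirectories(yarnRoot.resolve(containerId));
    // cpuset.mems must also be populated before any task can join the cgroup;
    // inherit it from the parent cgroup.
    Files.write(containerDir.resolve("cpuset.mems"),
        Files.readAllBytes(yarnRoot.resolve("cpuset.mems")));
    Files.write(containerDir.resolve("cpuset.cpus"),
        chosen.toString().getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    // Example: give container_01 two CPUs out of the hadoop-yarn cpuset hierarchy.
    assignCpus(Paths.get("/sys/fs/cgroup/cpuset/hadoop-yarn"), "container_01", 2);
  }
}
{code}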

> [Umbrella] Support CPU isolation for latency-sensitive (LS) service
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> CPU-isolation-for-latency-sensitive-services-v2.pdf, YARN-8320.001.patch
>
>
> Currently the NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler 
> with no support for differentiated latency.
>  * The request latency of services running in containers can fluctuate heavily 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need more fine-grained CPU isolation.
> Here we propose a solution that uses cgroup cpuset to bind containers to 
> different processors; this is inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].






[jira] [Commented] (YARN-6677) Preempt opportunistic containers when root container cgroup goes over memory limit

2018-05-28 Thread Miklos Szegedi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493085#comment-16493085
 ] 

Miklos Szegedi commented on YARN-6677:
--

Thank you, [~haibochen], for the patch.
{code:java}
// Reverse order by start time{code}
I think this comment should mention launch time instead.
{code}
274 DefaultOOMHandler handler = new DefaultOOMHandler(context, false) {
275   @Override
276   protected CGroupsHandler getCGroupsHandler() {
277 return cGroupsHandler;
278   }
279 };
{code}
This could be solved with an overridden {{run()}} that calls the parent's 
{{run()}} after overriding the cgroups handler. The fewer functions we have, 
the faster the code can be understood.
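For clarity, a rough sketch of what I mean, reusing the names from the snippet above (the setter is hypothetical, just to show the shape, not an existing API):
{code:java}
DefaultOOMHandler handler = new DefaultOOMHandler(context, false) {
  @Override
  public void run() {
    // Swap in the test cgroups handler first, then reuse the parent's logic,
    // instead of keeping a separate getCGroupsHandler() override.
    setCGroupsHandler(cGroupsHandler); // hypothetical setter, sketch only
    super.run();
  }
};
{code}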
{code}OverAllocationOOMHandler.java{code}
I am not so convinced that it is a good idea to add another derived class here. 
Could you just update {{DefaultOOMHandler}} with your code? Oversubscription and 
opportunistic containers are first-class citizens in Hadoop; we do not need to 
add the logic as a plugin. If the logic works with all-guaranteed containers, I 
am fine with updating {{DefaultOOMHandler.run()}}.
{code}
9* 
10   * http://www.apache.org/licenses/LICENSE-2.0
11   * 
{code}
I am not sure whether  is the standard.
{code}
40 * @param testVirtual Test virtual memory or physical
{code}
This was my mistake originally, but this time the parameter deserves a more 
meaningful name.
{code}
72  candidates.sort(CONTAINER_START_TIME_COMPARATOR);
{code}
I think we could write very simple code that follows the logic if we had a 
two-level comparator that sorts by the opportunistic flag first and then by 
launch time descending.
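Something along these lines (a sketch; the accessor names are taken or assumed from the snippets in this review, not verified against the patch):
{code:java}
// Two-level kill order: opportunistic containers first, and the most recently
// launched containers first within each group.
Comparator<Container> killOrder = Comparator
    .comparing((Container c) ->
        c.getContainerTokenIdentifier().getExecutionType() != ExecutionType.OPPORTUNISTIC)
    .thenComparing(
        Comparator.comparingLong(Container::getContainerLaunchTime).reversed());
candidates.sort(killOrder);
{code}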
{code}
93  Container c3 = createContainer(currentContainerId++,true, 2);
{code}
Missing space after the comma. Also, it might make sense to use launch times 
1, 2, 3 instead of 1, 2, 2 (obviously keeping this particular value as 2).
{code}
309   @Test(expected = YarnRuntimeException.class)
310   public void testOOMUnresolvedAfterKillingAllContainers() throws 
Exception {
{code}
This is probably my fault, but this might need a good javadoc.
{code}
885   @Override
886   public long getContainerLaunchTime() {
887 return this.containerLaunchStartTime;
888   }
{code}
It would also be super useful to have a javadoc here explaining the difference 
between start and launch times.
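For example, something in this direction; the wording is only a suggestion and the exact semantics should be taken from the implementation:
{code:java}
/**
 * @return the time when the container process was launched on this node.
 * Note that this is different from the container start time: a container is
 * "started" when the NM accepts it and only "launched" once its process is
 * actually executed, so the two timestamps should not be used interchangeably.
 */
@Override
public long getContainerLaunchTime() {
  return this.containerLaunchStartTime;
}
{code}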


> Preempt opportunistic containers when root container cgroup goes over memory 
> limit
> --
>
> Key: YARN-6677
> URL: https://issues.apache.org/jira/browse/YARN-6677
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha3
>Reporter: Haibo Chen
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-6677.00.patch
>
>







[jira] [Updated] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats

2018-05-24 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-8310:
-
Fix Version/s: (was: 3.0.x)
   3.0.3

> Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
> ContainerTokenIdentifier formats
> ---
>
> Key: YARN-8310
> URL: https://issues.apache.org/jira/browse/YARN-8310
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.3
>
> Attachments: YARN-8310.001.patch, YARN-8310.002.patch, 
> YARN-8310.003.patch, YARN-8310.branch-2.001.patch, 
> YARN-8310.branch-2.002.patch, YARN-8310.branch-2.003.patch
>
>
> In some recent upgrade testing, we saw this error causing the NodeManager to 
> fail to startup afterwards:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> contained an invalid tag (zero).
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>   at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>   at 
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1860)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1824)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
>   at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
>   at 
> org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
>   at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
>   at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   ... 5 more
> {noformat}
> The NodeManager fails because it's trying to read a 
> {{ContainerTokenIdentifier}} in the "old" format before we changed them to 
> protobufs (YARN-668).  This is very similar to YARN-5594 where we ran into a 
> similar problem with the ResourceManager and RM Delegation Tokens.
> To provide a better experience, we should make the code able to read the old 
> format if it's unable to read it using the new format.  We didn't run into 
> any errors with the other two types of tokens that YARN-668 incompatibly 
> changed (NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix 
> those while we're at it.




[jira] [Updated] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats

2018-05-24 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-8310:
-
Fix Version/s: 3.0.x
   3.1.1
   2.10.0

> Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
> ContainerTokenIdentifier formats
> ---
>
> Key: YARN-8310
> URL: https://issues.apache.org/jira/browse/YARN-8310
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.x
>
> Attachments: YARN-8310.001.patch, YARN-8310.002.patch, 
> YARN-8310.003.patch, YARN-8310.branch-2.001.patch, 
> YARN-8310.branch-2.002.patch, YARN-8310.branch-2.003.patch
>
>
> In some recent upgrade testing, we saw this error causing the NodeManager to 
> fail to startup afterwards:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> contained an invalid tag (zero).
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>   at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>   at 
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1860)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1824)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
>   at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
>   at 
> org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
>   at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
>   at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   ... 5 more
> {noformat}
> The NodeManager fails because it's trying to read a 
> {{ContainerTokenIdentifier}} in the "old" format before we changed them to 
> protobufs (YARN-668).  This is very similar to YARN-5594 where we ran into a 
> similar problem with the ResourceManager and RM Delegation Tokens.
> To provide a better experience, we should make the code able to read the old 
> format if it's unable to read it using the new format.  We didn't run into 
> any errors with the other two types of tokens that YARN-668 incompatibly 
> changed (NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix 
> those while we're at it.




[jira] [Updated] (YARN-5764) NUMA awareness support for launching containers

2018-05-24 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-5764:
-
Fix Version/s: 3.2.0
   3.1.0

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
>Priority: Major
> Fix For: 3.1.0, 3.2.0
>
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance 
> Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v10.patch, 
> YARN-5764-v11.patch, YARN-5764-v2.patch, YARN-5764-v3.patch, 
> YARN-5764-v4.patch, YARN-5764-v5.patch, YARN-5764-v6.patch, 
> YARN-5764-v7.patch, YARN-5764-v8.patch, YARN-5764-v9.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non-SMP systems. YARN containers, on launch, 
> will be pinned to a specific NUMA node and all subsequent memory allocations 
> will be served by the same node, reducing remote memory accesses. The current 
> default behavior is to spread memory across all NUMA nodes.






[jira] [Commented] (YARN-8320) [Umbrella] Support CPU isolation for latency-sensitive (LS) service

2018-05-24 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489398#comment-16489398
 ] 

Miklos Szegedi commented on YARN-8320:
--

[~cheersyang] / [~yangjiandan], thank you for raising this; it would be a 
very useful feature.

Thank you [~leftnoteasy] for the comments.

1) I agree with [~leftnoteasy] about the special considerations regarding 
rounding. Because of this, it might make sense to use a separate resource type 
for this feature. See my other comments regarding this below.

2) Like [~leftnoteasy], I also think that users might not need the 
RESERVED/SHARED modes. They add complexity, which reduces the number of users 
who would use the feature. On the other hand, I admit the distinction maps 
nicely to cpuset.cpu_exclusive=0/1.

3) I definitely agree with [~leftnoteasy] on the use of resource types. It 
might be straightforward to have a cpuset resource type that AMs can request, 
sharing the cgroups accordingly (see the sketch after these comments). This 
would also make the configuration more standard. The levels might not even be 
needed in this case: if an application does not request cpuset, it is shared; 
otherwise it is exclusive. The current suggestion would work, but please 
consider using resource types.

4) The design lets the AM make a delayed exclusive request directly to the NM, 
bypassing the RM. I think it would be more robust to make the request to the RM 
in the container launch context and just forward it to the NM. That way the RM 
has the chance to decline or delay the request in the future.

5) [~yangjiandan], how can you make sure a parent cgroup does not interfere 
with a cgroup marked as {{cpuset.cpu_exclusive=1}}? What if a system service 
wakes up?

6) Let me mention that this feature negatively affects YARN-1011 and 
oversubscription. An exclusive CPU with leftover capacity cannot be used by any 
other container and remains idle. This reduces overall cluster utilization.

7) Also, latency-sensitive applications get exclusive protection but can only 
run on their assigned cpuset, which disallows bursting to other CPUs when 
needed. I do not know how to solve this, though.

8) If a cpuset is not exclusive, cgroups treat it as a limit, not a 
reservation. The feature uses it as a reservation, which in practice would mean 
that the other containers' cgroups need to be updated and shrunk every time a 
reserved container starts. Am I correct?
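As an illustration for point 3, here is a minimal sketch of what an AM-side request could look like with a custom resource type. The type name {{yarn.io/cpuset}} is made up for the sketch and would have to be registered in resource-types.xml first; this is not an existing YARN resource.
{code:java}
// Ask for a container with 4 GB, 4 vcores and 4 dedicated CPUs expressed as a
// custom countable resource type.
Resource capability = Resource.newInstance(4096, 4);
capability.setResourceValue("yarn.io/cpuset", 4);
ResourceRequest request = ResourceRequest.newInstance(
    Priority.newInstance(1), ResourceRequest.ANY, capability, 1);
// The NM could then translate the granted cpuset count into concrete CPU ids
// in the container's cpuset cgroup, exclusive or shared as discussed above.
{code}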

> [Umbrella] Support CPU isolation for latency-sensitive (LS) service
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> CPU-isolation-for-latency-sensitive-services-v2.pdf, YARN-8320.001.patch
>
>
> Currently the NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler 
> with no support for differentiated latency.
>  * The request latency of services running in containers can fluctuate heavily 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need more fine-grained CPU isolation.
> Here we propose a solution that uses cgroup cpuset to bind containers to 
> different processors; this is inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].






[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-23 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487742#comment-16487742
 ] 

Miklos Szegedi commented on YARN-4599:
--

The unit test issues are not related to the patch.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.011.patch, YARN-4599.012.patch, 
> YARN-4599.013.patch, YARN-4599.014.patch, YARN-4599.015.patch, 
> YARN-4599.016.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-22 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.016.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.011.patch, YARN-4599.012.patch, 
> YARN-4599.013.patch, YARN-4599.014.patch, YARN-4599.015.patch, 
> YARN-4599.016.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Commented] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats

2018-05-22 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486637#comment-16486637
 ] 

Miklos Szegedi commented on YARN-8310:
--

I will backport this to branch-2, branch-3.0 and branch-3.1.

> Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
> ContainerTokenIdentifier formats
> ---
>
> Key: YARN-8310
> URL: https://issues.apache.org/jira/browse/YARN-8310
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8310.001.patch, YARN-8310.002.patch, 
> YARN-8310.003.patch, YARN-8310.branch-2.001.patch, 
> YARN-8310.branch-2.002.patch, YARN-8310.branch-2.003.patch
>
>
> In some recent upgrade testing, we saw this error causing the NodeManager to 
> fail to startup afterwards:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> contained an invalid tag (zero).
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>   at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>   at 
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1860)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1824)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
>   at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
>   at 
> org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
>   at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
>   at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   ... 5 more
> {noformat}
> The NodeManager fails because it's trying to read a 
> {{ContainerTokenIdentifier}} in the "old" format before we changed them to 
> protobufs (YARN-668).  This is very similar to YARN-5594 where we ran into a 
> similar problem with the ResourceManager and RM Delegation Tokens.
> To provide a better experience, we should make the code able to read the old 
> format if it's unable to read it using the new format.  We didn't run into 
> any errors with the other two types of tokens that YARN-668 incompatibly 
> changed (NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix 
> those while we're at it.




[jira] [Commented] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats

2018-05-22 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486628#comment-16486628
 ] 

Miklos Szegedi commented on YARN-8310:
--

Committed to trunk. Thank you for the patch [~rkanter] and for the review 
[~grepas].

> Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
> ContainerTokenIdentifier formats
> ---
>
> Key: YARN-8310
> URL: https://issues.apache.org/jira/browse/YARN-8310
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8310.001.patch, YARN-8310.002.patch, 
> YARN-8310.003.patch, YARN-8310.branch-2.001.patch, 
> YARN-8310.branch-2.002.patch, YARN-8310.branch-2.003.patch
>
>
> In some recent upgrade testing, we saw this error causing the NodeManager to 
> fail to startup afterwards:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> contained an invalid tag (zero).
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>   at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>   at 
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1860)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1824)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
>   at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
>   at 
> org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
>   at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
>   at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   ... 5 more
> {noformat}
> The NodeManager fails because it's trying to read a 
> {{ContainerTokenIdentifier}} in the "old" format before we changed them to 
> protobufs (YARN-668).  This is very similar to YARN-5594 where we ran into a 
> similar problem with the ResourceManager and RM Delegation Tokens.
> To provide a better experience, we should make the code able to read the old 
> format if it's unable to read it using the new format.  We didn't run into 
> any errors with the other two types of tokens that YARN-668 incompatibly 
> changed (NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix 
> those while we're at it.




[jira] [Commented] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats

2018-05-22 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486626#comment-16486626
 ] 

Miklos Szegedi commented on YARN-8310:
--

+1 LGTM.

> Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
> ContainerTokenIdentifier formats
> ---
>
> Key: YARN-8310
> URL: https://issues.apache.org/jira/browse/YARN-8310
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8310.001.patch, YARN-8310.002.patch, 
> YARN-8310.003.patch, YARN-8310.branch-2.001.patch, 
> YARN-8310.branch-2.002.patch, YARN-8310.branch-2.003.patch
>
>
> In some recent upgrade testing, we saw this error causing the NodeManager to 
> fail to startup afterwards:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> contained an invalid tag (zero).
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>   at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>   at 
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1860)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1824)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
>   at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
>   at 
> org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
>   at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
>   at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   ... 5 more
> {noformat}
> The NodeManager fails because it's trying to read a 
> {{ContainerTokenIdentifier}} in the "old" format before we changed them to 
> protobufs (YARN-668).  This is very similar to YARN-5594 where we ran into a 
> similar problem with the ResourceManager and RM Delegation Tokens.
> To provide a better experience, we should make the code able to read the old 
> format if it's unable to read it using the new format.  We didn't run into 
> any errors with the other two types of tokens that YARN-668 incompatibly 
> changed (NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix 
> those while we're at it.






[jira] [Commented] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats

2018-05-17 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479685#comment-16479685
 ] 

Miklos Szegedi commented on YARN-8310:
--

Thank you for the patch [~rkanter].
 I have a couple of comments.
{{readFields()}} casts from {{DataInput}} to {{DataInputStream}}; it might be 
valuable to check the type first and log a warning.
Also, {{DataInput}} does not have a {{reset()}} by default, and we do not know 
who called {{mark()}} before. It might make sense to call {{mark()}} ourselves 
or, even better, do a {{DataInput.readFully()}} and apply the two parsers to the 
same byte array.
{{public void readFields(DataInput in) throws IOException}} may throw an 
exception early if the new format expects more bytes. We should fall back to 
{{readFieldsInOldFormat}} in case of an {{IOException}} and rethrow only if that 
one throws as well.
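Roughly what I have in mind, as a sketch only; the helper, the way the raw bytes are obtained and the {{proto}} field assignment are assumptions, while {{readFieldsInOldFormat()}} refers to the method in the patch:
{code:java}
private void readFieldsWithFallback(byte[] identifierBytes) throws IOException {
  try {
    // New, post-YARN-668 format: the identifier is a protobuf message.
    proto = ContainerTokenIdentifierProto.parseFrom(identifierBytes);
  } catch (InvalidProtocolBufferException e) {
    // Old format: parse again from the beginning of the same byte array
    // instead of from a partially consumed DataInput.
    readFieldsInOldFormat(
        new DataInputStream(new ByteArrayInputStream(identifierBytes)));
  }
}
{code}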
It might make sense to write some unit tests that pass a {{DataInput}} that is 
not a {{DataInputStream}}.
If we stick to {{reset()}}, we might want to call it in a catch block as well, 
so that others can still read the data if we fail.
{code}
372 int logAggregationSize = -1;
{code}
I would probably use a separate boolean here and set it to true in the success 
scenario, so that we also cover the case where -1 itself was parsed and can 
provide a meaningful error code.
{code}
123 String[] hostAddr = in.readUTF().split(":");
{code}
The case when the string does not contain a colon is not covered. Also, we do 
not check whether the port number is non-negative and fits in two bytes. It 
might fail later in YARN if we omit this check.
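For example (illustrative only):
{code:java}
String hostPort = in.readUTF();
int colon = hostPort.lastIndexOf(':');
if (colon < 0) {
  throw new IOException("Expected host:port but got " + hostPort);
}
// parseInt also surfaces a malformed port via NumberFormatException.
int port = Integer.parseInt(hostPort.substring(colon + 1));
if (port < 0 || port > 0xFFFF) {
  throw new IOException("Port out of range: " + port);
}
String host = hostPort.substring(0, colon);
{code}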


> Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
> ContainerTokenIdentifier formats
> ---
>
> Key: YARN-8310
> URL: https://issues.apache.org/jira/browse/YARN-8310
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8310.001.patch, YARN-8310.002.patch, 
> YARN-8310.branch-2.001.patch, YARN-8310.branch-2.002.patch
>
>
> In some recent upgrade testing, we saw this error causing the NodeManager to 
> fail to startup afterwards:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> contained an invalid tag (zero).
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>   at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>   at 
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1860)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.(YarnSecurityTokenProtos.java:1824)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
>   at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
>   at 
> org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
>   at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
>   at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
>   at 
> 

[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-17 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.015.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.011.patch, YARN-4599.012.patch, 
> YARN-4599.013.patch, YARN-4599.014.patch, YARN-4599.015.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-16 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.014.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.011.patch, YARN-4599.012.patch, 
> YARN-4599.013.patch, YARN-4599.014.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-16 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.013.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.011.patch, YARN-4599.012.patch, 
> YARN-4599.013.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-16 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.012.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.011.patch, YARN-4599.012.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-16 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478004#comment-16478004
 ] 

Miklos Szegedi commented on YARN-4599:
--

[~snemeth], thanks for the review.

oomHandlerTemp is necessary, since we may throw an exception after we set it. 
External code, due to JVM optimizations (a field can be published before the 
constructor has finished), may see a partially constructed object if the 
exception is thrown. It is better to set the fields only once no further 
exception can be thrown.
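A generic illustration of the pattern (the class and member names are made up, not from the patch):
{code:java}
import java.io.IOException;

public final class OomWatcher {
  private final Runnable oomHandler;
  private final Thread watcherThread;

  OomWatcher(Runnable handlerCandidate) throws IOException {
    // Keep everything that can still fail on locals...
    Runnable oomHandlerTemp = handlerCandidate;
    Thread threadTemp = new Thread(oomHandlerTemp, "oom-watcher");
    checkPreconditions(); // may still throw here
    // ...and only publish to the fields once nothing below can throw, so a
    // failed constructor never leaves a half-initialized object visible.
    this.oomHandler = oomHandlerTemp;
    this.watcherThread = threadTemp;
  }

  private void checkPreconditions() throws IOException {
    // placeholder for the validation that can fail during construction
  }
}
{code}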

3) This code has to be very fast and potentially does the same conversion for 
thousands of containers, so I will keep the plain multiplication instead of 
calling into external code that deals with strings.

 

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.011.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Comment Edited] (YARN-7715) Support NM promotion/demotion of running containers.

2018-05-16 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477692#comment-16477692
 ] 

Miklos Szegedi edited comment on YARN-7715 at 5/16/18 4:34 PM:
---

[~yangjiandan], none of these are the job of this JIRA. This JIRA is about 
setting the cgroup based on settings that have already been propagated to the 
node manager and taken care of there. I agree the flag needs to be in the state 
store; however, that has nothing to do with cgroups. Please file a separate 
JIRA for that. Also, I am not convinced that the AM has to be notified about 
cgroup errors. cgroups have to be as transparent and failsafe as possible. Any 
communication to the AM would just add unnecessary network overhead and 
probably would not solve the problem. The information that some cgroup update 
failed on some node might be interesting to the AM, but it is not actionable.


was (Author: miklos.szeg...@cloudera.com):
[~yangjiandan], none of these are the job of this jira. This Jira is about 
setting the cgroup based on the setting already propagated to node manager and 
taken care of. I agree the flag needs to be in state store, however this has 
nothing to do with cgroups. Also I am not convinced that the AM has to be 
notified about cgroup errors. cgroup has to be as transparent and failsafe as 
possible. Any communication to the AM would just add unnecessary network 
overhead and probably does not solve the problem. The information that some 
cgroup update failed on some node might be interesting to the AM but it is not 
actionable.

> Support NM promotion/demotion of running containers.
> 
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.






[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.

2018-05-16 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477692#comment-16477692
 ] 

Miklos Szegedi commented on YARN-7715:
--

[~yangjiandan], none of these are the job of this JIRA. This JIRA is about 
setting the cgroup based on settings that have already been propagated to the 
node manager and taken care of there. I agree the flag needs to be in the state 
store; however, that has nothing to do with cgroups. Also, I am not convinced 
that the AM has to be notified about cgroup errors. cgroups have to be as 
transparent and failsafe as possible. Any communication to the AM would just 
add unnecessary network overhead and probably would not solve the problem. The 
information that some cgroup update failed on some node might be interesting to 
the AM, but it is not actionable.

> Support NM promotion/demotion of running containers.
> 
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.






[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-16 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.011.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.011.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.






[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-16 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477620#comment-16477620
 ] 

Miklos Szegedi commented on YARN-4599:
--

Fixing 2 checkstyle issues.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-16 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.010.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.010.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-16 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477614#comment-16477614
 ] 

Miklos Szegedi commented on YARN-4599:
--

The HDFS test failures are unrelated.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.

2018-05-15 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476852#comment-16476852
 ] 

Miklos Szegedi commented on YARN-7715:
--

updateContainer only supports execution type updates so far. A vcore increase 
or decrease does not trigger it. I agree that it should. This JIRA was about 
promotion; that one is about resource changes. Would you like to file a JIRA?

Do you mean the state store in the second case? I think that is legitimate, 
however it is also out of the scope of this patch. [~haibochen], [~asuresh], 
what do you think? Do we need/have the opportunistic flag in the state store?

> Support NM promotion/demotion of running containers.
> 
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-15 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.009.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.009.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-15 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.008.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.008.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-4599) Set OOM control for memory cgroups

2018-05-15 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476365#comment-16476365
 ] 

Miklos Szegedi edited comment on YARN-4599 at 5/15/18 10:09 PM:


Thanks for the review [~snemeth].
{quote} [Question] In constructor: If both {{controlPhysicalMemory}} and 
{{controlVirtualMemory}} is on, you only warn log a line.
{quote}
{{this.controlPhysicalMemory = controlPhysicalMemory && 
!controlVirtualMemory;}} the virtual memory setting will override the physical 
one.
{quote}G.) In run(): I'm curious whether it can happen that in the while loop's 
statement, {{events.read}} would read more data than 8 bytes or is it perfectly 
safe to rely on that on every read, at most 8 bytes will be read?
{quote}
It cannot return more than the buffer size, which is 8 bytes.
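
For illustration, here is a minimal sketch of such a read, assuming the OOM 
events arrive on a plain InputStream as a 64-bit eventfd counter; the class 
and method names are placeholders, not the actual patch code.
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

class OomEventReadSketch {
  // An eventfd counter is a 64-bit value, so a read into an 8-byte buffer
  // can never return more than 8 bytes.
  static long readCounter(InputStream events) throws IOException {
    byte[] buffer = new byte[8];
    int read = events.read(buffer); // returns at most buffer.length == 8
    if (read != buffer.length) {
      throw new IOException("Unexpected OOM event size: " + read);
    }
    return ByteBuffer.wrap(buffer).getLong();
  }
}
{code}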

H) I prefer to keep the big pieces of logic together.

 
{quote}I.) In run(): {{throw new YarnRuntimeException("OOM was not resolved in 
time.");}}} --> I would include how much time was spent in the log message for 
better troubleshooting.
{quote}
The watchdog function handles this.
{quote}B.) In run(): Call to...
{quote}
It does not matter; it wraps into two lines anyway.
{quote}C.) In {{killContainerIfOOM()}}: 
{quote}
Not sure what you mean here. Yes, there is a conversion, since the request is 
in MB and the limit is in bytes.
{quote}A.) {{testConstructorHandler():(}}
{quote}
Good catch.


was (Author: miklos.szeg...@cloudera.com):
Thanks for the review [~snemeth].
{quote}{quote} [Question] In constructor: If both {{controlPhysicalMemory}} and 
{{controlVirtualMemory}} is on, you only warn log a line.
{quote}{quote}
{{this.controlPhysicalMemory = controlPhysicalMemory && 
!controlVirtualMemory;}} virtual will override
{quote}G.) In run(): I'm curious whether it can happen that in the while loop's 
statement, {{events.read}} would read more data than 8 bytes or is it perfectly 
safe to rely on that on every read, at most 8 bytes will be read?
{quote}
It cannot return more than the buffer size that is 8.

H) I like to have big logic together.

 
{quote}I.) In run(): {{throw new YarnRuntimeException("OOM was not resolved in 
time.");}}} --> I would include how much time was spent in the log message for 
better troubleshooting.
{quote}
The watchdog function handles this.
{quote}B.) In run(): Call to...
{quote}
It does not matter, it wraps into two lines anyways.
{quote}C.) In {{killContainerIfOOM()}}: 
{quote}
Not sure what you mean here. Yes there is a conversion, since the request is in 
MB and the limit is in bytes.
{quote}A.) {{testConstructorHandler():(}}
{quote}
Good catch.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.

2018-05-15 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476409#comment-16476409
 ] 

Miklos Szegedi commented on YARN-7715:
--

cgroups should work independently of the AM I think. In fact the AM does not 
even know, if a container is opportunistic or guaranteed at a certain time, 
does it?

> Support NM promotion/demotion of running containers.
> 
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-15 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.007.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.007.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-15 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476365#comment-16476365
 ] 

Miklos Szegedi commented on YARN-4599:
--

Thanks for the review [~snemeth].
{quote}{quote} [Question] In constructor: If both {{controlPhysicalMemory}} and 
{{controlVirtualMemory}} is on, you only warn log a line.
{quote}{quote}
{{this.controlPhysicalMemory = controlPhysicalMemory && 
!controlVirtualMemory;}} the virtual memory setting will override the physical 
one.
{quote}G.) In run(): I'm curious whether it can happen that in the while loop's 
statement, {{events.read}} would read more data than 8 bytes or is it perfectly 
safe to rely on that on every read, at most 8 bytes will be read?
{quote}
It cannot return more than the buffer size, which is 8 bytes.

H) I prefer to keep the big pieces of logic together.

 
{quote}I.) In run(): {{throw new YarnRuntimeException("OOM was not resolved in 
time.");}}} --> I would include how much time was spent in the log message for 
better troubleshooting.
{quote}
The watchdog function handles this.
{quote}B.) In run(): Call to...
{quote}
It does not matter; it wraps into two lines anyway.
{quote}C.) In {{killContainerIfOOM()}}: 
{quote}
Not sure what you mean here. Yes there is a conversion, since the request is in 
MB and the limit is in bytes.
{quote}A.) {{testConstructorHandler():(}}
{quote}
Good catch.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-14 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475059#comment-16475059
 ] 

Miklos Szegedi commented on YARN-4599:
--

{quote}bq. Is 'descriptors->event_control_fd = -1;'  necessary?
{quote}
Yes, it is a defense against chained errors; it may make it easier to debug 
when you get a core dump.
{quote}bq. 3) The comments for test_oom() does not quite make sense to me. My 
current understanding is that it adds the calling process to the given pgroup 
and simulates an OOM by keep asking OS for memory?
{quote}
You are mixing up the parent and the child. After the fork(), the parent gets 
the child's pid and the child gets 0, since the child can just call getpid(). 
The test forks a child process, gets its pid in the parent, and adds that pid 
to a cgroup. Once the child notices that it is in the cgroup, it starts eating 
memory, triggering an OOM.
{quote}bq. 4) Can you please elaborate on how cgroup simulation is done in 
oom_listener_test_main.c? The child process that is added to the cgroup only 
does sleep().
{quote}
/*
 Unit test for cgroup testing. There are two modes.
 If the unit test is run as root and we have cgroups
 we try to create a cgroup and generate an OOM.
 If we are not running as root we just sleep instead
 of eating memory and simulate the OOM by sending
 an event in a mock event fd mock_oom_event_as_user.
*/
{quote}bq. 5) Doing a param matching in CGroupsHandlerImpl.GetCGroupParam() 
does not seem a good practice to me.
{quote}
CGroupsHandlerImpl.GetCGroupParam() is a smart function that returns the file 
name given the parameter name. I do not see any good practice issue here. The 
tasks file is always without the controller name.
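
As a rough illustration (not the actual CGroupsHandlerImpl code), the 
parameter-to-file-name mapping boils down to something like this:
{code:java}
class CGroupParamFileSketch {
  // Most cgroup parameters live in files prefixed with the controller name
  // (e.g. memory.oom_control), while "tasks" is shared by all controllers
  // and carries no prefix.
  static String getParamFile(String controller, String param) {
    if ("tasks".equals(param)) {
      return "tasks";
    }
    return controller + "." + param; // e.g. "memory" + "." + "oom_control"
  }
}
{code}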
{quote}bq. 6) Let's wrap the new thread join in ContainersMonitorImpl with 
try-catch clause as we do with the monitoring thread.
{quote}
May I ask why? I thought only exceptions that will actually be thrown need to 
be caught. CGroupElasticMemoryController has a much better cleanup process than 
the monitoring thread and it does not need InterruptedException. In fact any 
interrupted exception would mean that we have likely leaked the external 
process, so I would advise against using it.
{quote}bq. 7) The configuration changes are incompatible  ... How about we 
create separate configurations for pm elastic control and vm elastic control?
{quote}
I do not necessarily agree here.

a) First of all, polling and cgroups memory control did not work together 
before the patch either: the NM exited with an exception. So there is no 
previously working functionality that stops working now, and thus no 
compatibility break. cgroups taking precedence is indeed a new feature.

b) I would like to have a clean configuration design in the long term, 
avoiding too many configuration entries and definitely avoiding confusion. If 
there is a yarn.nodemanager.pmem-check-enabled, it suggests general use, and 
it would be unintuitive not to use it. We indeed cannot change its general 
meaning anymore. I think the clean design is having 
yarn.nodemanager.resource.memory.enabled to enable cgroups, 
yarn.nodemanager.resource.memory.enforced to enforce it per container and 
yarn.nodemanager.elastic-memory-control.enabled to enforce it at the node 
level. The detailed settings like yarn.nodemanager.pmem-check-enabled and 
yarn.nodemanager.vmem-check-enabled can only intuitively apply to all of them. 
I understand the concern, but this solution would let us keep only these five 
configuration entries.
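
To make the proposed layout concrete, here is a sketch of how the five 
settings mentioned above could be read; the key names follow this comment and 
the default values are illustrative assumptions only.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

class MemoryControlConfigSketch {
  static void readSettings() {
    Configuration conf = new YarnConfiguration();
    // cgroups based memory control on/off
    boolean cgroupsMemory =
        conf.getBoolean("yarn.nodemanager.resource.memory.enabled", false);
    // per-container enforcement through cgroup limits
    boolean enforcedPerContainer =
        conf.getBoolean("yarn.nodemanager.resource.memory.enforced", true);
    // node level elastic memory control
    boolean elasticControl =
        conf.getBoolean("yarn.nodemanager.elastic-memory-control.enabled", false);
    // the existing physical and virtual memory checks apply to all of the above
    boolean pmemCheck =
        conf.getBoolean("yarn.nodemanager.pmem-check-enabled", true);
    boolean vmemCheck =
        conf.getBoolean("yarn.nodemanager.vmem-check-enabled", true);
  }
}
{code}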

11) Does it make sense to have the stopListening logic in `if (!watchdog.get) 
{}` block instead?

It is completely equivalent. It will be called a few milliseconds earlier or 
later, but there was a missing explanation there, so I added a comment.
{quote}bq. 16) In TestDefaultOOMHandler.testBothContainersOOM(), I think we 
also need to verify container 2 is killed. Similarly, in  testOneContainerOOM() 
and  testNoContainerOOM().
{quote}
Only one container should be killed. However, I refined the verify logic to 
check this even more precisely.

I addressed the rest. I will provide a patch soon.

 

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also 

[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-11 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: Elastic Memory Control in YARN.pdf

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, 
> YARN-4599.001.patch, YARN-4599.002.patch, YARN-4599.003.patch, 
> YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-11 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.006.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.002.patch, YARN-4599.003.patch, YARN-4599.004.patch, 
> YARN-4599.005.patch, YARN-4599.006.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8275) Create a JNI interface to interact with Windows

2018-05-11 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472663#comment-16472663
 ] 

Miklos Szegedi commented on YARN-8275:
--

[~giovanni.fumarola], I am curious about your opinion on the design of 
YARN-4599. In that case we considered JNI vs. a long-running native process 
communicating with YARN over a pipe. The latter seems better in terms of 
security and maintainability in case some native functions start corrupting 
the JVM heap. There is only a single process start in that case, so it does 
not affect performance. What do you think?

 

> Create a JNI interface to interact with Windows
> ---
>
> Key: YARN-8275
> URL: https://issues.apache.org/jira/browse/YARN-8275
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Major
> Attachments: WinUtils-Functions.pdf, WinUtils.CSV
>
>
> I did a quick investigation of the performance of WinUtils in YARN. On 
> average the NM calls it 4.76 times per second and 65.51 times per container.
>  
> | |Requests|Requests/sec|Requests/min|Requests/container|
> |*Sum [WinUtils]*|*135354*|*4.761*|*286.160*|*65.51*|
> |[WinUtils] Execute -help|4148|0.145|8.769|2.007|
> |[WinUtils] Execute -ls|2842|0.0999|6.008|1.37|
> |[WinUtils] Execute -systeminfo|9153|0.321|19.35|4.43|
> |[WinUtils] Execute -symlink|115096|4.048|243.33|57.37|
> |[WinUtils] Execute -task isAlive|4115|0.144|8.699|2.05|
>  Interval: 7 hours, 53 minutes and 48 seconds
> Each execution of WinUtils does around *140 IO ops*, of which 130 are DDL ops.
> This means *666.58* IO ops/second due to WinUtils.
> We should start considering to remove WinUtils from Hadoop and creating a JNI 
> interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-10 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.005.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.002.patch, YARN-4599.003.patch, YARN-4599.004.patch, 
> YARN-4599.005.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-10 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470881#comment-16470881
 ] 

Miklos Szegedi commented on YARN-4599:
--

Fixing unit tests. [~asuresh], FYI this patch affects oversubscription.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.002.patch, YARN-4599.003.patch, YARN-4599.004.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-10 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.004.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.002.patch, YARN-4599.003.patch, YARN-4599.004.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-09 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.003.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.002.patch, YARN-4599.003.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-09 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469907#comment-16469907
 ] 

Miklos Szegedi commented on YARN-4599:
--

HDFS is not related to the patch.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.002.patch, YARN-4599.003.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8250) Create another implementation of ContainerScheduler to support NM overallocation

2018-05-09 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469553#comment-16469553
 ] 

Miklos Szegedi commented on YARN-8250:
--

I think the first and the last checkstyle comments could be addressed.

> Create another implementation of ContainerScheduler to support NM 
> overallocation
> 
>
> Key: YARN-8250
> URL: https://issues.apache.org/jira/browse/YARN-8250
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-8250-YARN-1011.00.patch, 
> YARN-8250-YARN-1011.01.patch, YARN-8250-YARN-1011.02.patch
>
>
> YARN-6675 adds NM over-allocation support by modifying the existing 
> ContainerScheduler and providing a utilizationBased resource tracker.
> However, the implementation adds a lot of complexity to ContainerScheduler, 
> and future tweak of over-allocation strategy based on how much containers 
> have been launched is even more complicated.
> As such, this Jira proposes a new ContainerScheduler that always launch 
> guaranteed containers immediately and queues opportunistic containers. It 
> relies on a periodical check to launch opportunistic containers. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8250) Create another implementation of ContainerScheduler to support NM overallocation

2018-05-09 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469378#comment-16469378
 ] 

Miklos Szegedi commented on YARN-8250:
--

Thank you for the updated patch [~haibochen].
{code:java}
if (updateEvent.isIncrease()){code}
The else branch of this one should check isDecrease(); otherwise an update 
that changes nothing would trigger a decrease.
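
A minimal sketch of the suggested control flow; {{UpdateEvent}} here is a 
hypothetical stand-in for the event type used in the patch.
{code:java}
class ResourceUpdateSketch {
  interface UpdateEvent {
    boolean isIncrease();
    boolean isDecrease();
  }

  static void onUpdate(UpdateEvent updateEvent) {
    if (updateEvent.isIncrease()) {
      // handle a resource increase
    } else if (updateEvent.isDecrease()) {
      // handle a resource decrease; a bare else would also run for
      // updates that change the resources in neither direction
    }
  }
}
{code}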

> Create another implementation of ContainerScheduler to support NM 
> overallocation
> 
>
> Key: YARN-8250
> URL: https://issues.apache.org/jira/browse/YARN-8250
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-8250-YARN-1011.00.patch, 
> YARN-8250-YARN-1011.01.patch, YARN-8250-YARN-1011.02.patch
>
>
> YARN-6675 adds NM over-allocation support by modifying the existing 
> ContainerScheduler and providing a utilizationBased resource tracker.
> However, the implementation adds a lot of complexity to ContainerScheduler, 
> and future tweak of over-allocation strategy based on how much containers 
> have been launched is even more complicated.
> As such, this Jira proposes a new ContainerScheduler that always launch 
> guaranteed containers immediately and queues opportunistic containers. It 
> relies on a periodical check to launch opportunistic containers. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-05-08 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-8090:
-
Attachment: YARN-8090.002.patch

> Race conditions in FadvisedChunkedFile
> --
>
> Key: YARN-8090
> URL: https://issues.apache.org/jira/browse/YARN-8090
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Minor
> Attachments: YARN-8090.000.patch, YARN-8090.001.patch, 
> YARN-8090.002.patch
>
>
> When a file is closed multiple times by multiple threads, all but the first 
> close will generate a WARNING message.
> {code:java}
> 11:04:33.605 AM   WARNFadvisedChunkedFile 
> Failed to manage OS cache for 
> /var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
> EBADF: Bad file descriptor
>   at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
> Method)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>   at 
> org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-05-08 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468068#comment-16468068
 ] 

Miklos Szegedi commented on YARN-8090:
--

Thank you for the review [~haibochen]. I updated the patch.

> Race conditions in FadvisedChunkedFile
> --
>
> Key: YARN-8090
> URL: https://issues.apache.org/jira/browse/YARN-8090
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Minor
> Attachments: YARN-8090.000.patch, YARN-8090.001.patch, 
> YARN-8090.002.patch
>
>
> When a file is closed multiple times by multiple threads, all but the first 
> close will generate a WARNING message.
> {code:java}
> 11:04:33.605 AM   WARNFadvisedChunkedFile 
> Failed to manage OS cache for 
> /var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
> EBADF: Bad file descriptor
>   at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
> Method)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>   at 
> org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}



--
This message was 

[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-08 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.002.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.002.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-08 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468062#comment-16468062
 ] 

Miklos Szegedi commented on YARN-4599:
--

Fixing unit test.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.002.patch, YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8250) Create another implementation of ContainerScheduler to support NM overallocation

2018-05-08 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468020#comment-16468020
 ] 

Miklos Szegedi commented on YARN-8250:
--

Thanks for the patch, [~haibochen].

ContainerScheduler could be renamed DefaultContainerScheduler to leave room 
for extensions later.

You could use conf.getClass in createContainerScheduler to automatically verify 
the parent class.
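
For illustration, a sketch of the conf.getClass suggestion; the configuration 
key name and the scheduler classes here are assumptions for the sketch, not 
existing YARN names.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

class SchedulerFactorySketch {
  // Hypothetical stand-ins for the real scheduler classes.
  abstract static class ContainerSchedulerBase { }
  static class DefaultContainerSchedulerSketch extends ContainerSchedulerBase { }

  static ContainerSchedulerBase create(Configuration conf) {
    // getClass() checks that the configured class is assignable to the
    // expected parent class, so a misconfiguration fails early and clearly.
    Class<? extends ContainerSchedulerBase> clazz = conf.getClass(
        "yarn.nodemanager.container-scheduler.class", // assumed key name
        DefaultContainerSchedulerSketch.class, ContainerSchedulerBase.class);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}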

getContainersUtilization and updateContainersUtilization might need to be 
synchronized or sampled (cloned).
{code:java}
141 public ContainersMonitor getContainersMonitor() {
142 return nmContext.getContainerManager().getContainersMonitor();
143 }{code}
Usually it is considered a better practice to return nmContext and rely on the 
caller to retrieve the rest.

shedQueuedOpportunisticContainers sheds in FIFO order; it might make sense to 
shed in LIFO order instead (see the sketch below).
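
For illustration, a sketch of LIFO shedding with a Deque; the class and method 
names are placeholders, not the patch's actual ones.
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

class OpportunisticQueueSketch<C> {
  private final Deque<C> queued = new ArrayDeque<>();

  void enqueue(C container) {
    queued.addLast(container); // arrival order is preserved
  }

  C shedOne() {
    // LIFO: shed the most recently queued container first.
    // The FIFO alternative would be queued.pollFirst().
    return queued.pollLast();
  }
}
{code}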

 

 

> Create another implementation of ContainerScheduler to support NM 
> overallocation
> 
>
> Key: YARN-8250
> URL: https://issues.apache.org/jira/browse/YARN-8250
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-8250-YARN-1011.00.patch
>
>
> YARN-6675 adds NM over-allocation support by modifying the existing 
> ContainerScheduler and providing a utilizationBased resource tracker.
> However, the implementation adds a lot of complexity to ContainerScheduler, 
> and future tweak of over-allocation strategy based on how much containers 
> have been launched is even more complicated.
> As such, this Jira proposes a new ContainerScheduler that always launch 
> guaranteed containers immediately and queues opportunistic containers. It 
> relies on a periodical check to launch opportunistic containers. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-08 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-7715:
-
Attachment: YARN-7715.004.patch

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-08 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467721#comment-16467721
 ] 

Miklos Szegedi commented on YARN-4599:
--

The build issue seems to be unrelated. Restarting the build.

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8262) get_executable in container-executor should provide meaningful error codes

2018-05-08 Thread Miklos Szegedi (JIRA)
Miklos Szegedi created YARN-8262:


 Summary: get_executable in container-executor should provide 
meaningful error codes
 Key: YARN-8262
 URL: https://issues.apache.org/jira/browse/YARN-8262
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Miklos Szegedi


Currently it calls exit(-1), which makes it difficult to debug without stderr.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-05-07 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.001.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.001.patch, 
> YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-07 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-7715:
-
Attachment: YARN-7715.003.patch

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-07 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466274#comment-16466274
 ] 

Miklos Szegedi commented on YARN-7715:
--

Thank you for the review [~haibochen].

I added a unit test for TestContainerSchedulerQueuing.

I am hesitant to decide whether to update cgroups based on hash maps that are 
updated by asynchronous code; that could become a supportability nightmare. 
Instead, I added a proper check for running containers by verifying that the 
container's cgroup directory exists.
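
As a minimal sketch of that check (assuming a cgroup mount under /sys/fs/cgroup and the usual hadoop-yarn hierarchy; this is not the actual NodeManager code), the update path simply verifies that the container's cgroup directory is still present before writing new limits:
{code:java}
import java.io.File;

// Sketch only: the mount point and layout are assumptions, and the class is
// not part of the patch. It shows the "does the cgroup directory still
// exist?" guard described above.
public final class CGroupUpdateGuard {
  private static final String CPU_CGROUP_ROOT =
      "/sys/fs/cgroup/cpu/hadoop-yarn";

  public static boolean isContainerRunning(String containerId) {
    return new File(CPU_CGROUP_ROOT, containerId).isDirectory();
  }

  public static void updateCpuShares(String containerId, int shares) {
    if (!isContainerRunning(containerId)) {
      // The container already finished (or never launched), so skip the
      // update instead of trusting asynchronously updated in-memory maps.
      return;
    }
    // ... write "shares" to CPU_CGROUP_ROOT/<containerId>/cpu.shares ...
  }
}
{code}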

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-04 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464275#comment-16464275
 ] 

Miklos Szegedi commented on YARN-7715:
--

Thank you for the review [~haibochen]. I updated the patch.

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-04 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-7715:
-
Attachment: YARN-7715.002.patch

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-03 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-7715:
-
Attachment: YARN-7715.001.patch

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8244) ContainersLauncher.ContainerLaunch can throw ConcurrentModificationException

2018-05-03 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463247#comment-16463247
 ] 

Miklos Szegedi commented on YARN-8244:
--

This happens with {{TestContainerSchedulerQueuing.testStartMultipleContainers}}.
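
The trace points at {{orderEnvByDependencies}} iterating a plain {{HashMap}} while it is being modified. As a general illustration of the fix pattern (not the actual patch), iterating over a snapshot copy taken under the writer's lock avoids the {{ConcurrentModificationException}}:
{code:java}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of the general pattern only, not the ContainerLaunch code:
// copy the environment map before iterating so that a concurrent (or
// re-entrant) modification cannot fail the iteration mid-way.
public final class EnvSnapshotExample {
  public static Map<String, String> snapshot(Map<String, String> env) {
    synchronized (env) {          // writers must hold the same lock
      return new LinkedHashMap<>(env);
    }
  }

  public static void main(String[] args) {
    Map<String, String> env = new HashMap<>();
    env.put("HADOOP_HOME", "/opt/hadoop");
    env.put("CLASSPATH", "$HADOOP_HOME/share/*");
    for (Map.Entry<String, String> e : snapshot(env).entrySet()) {
      System.out.println(e.getKey() + "=" + e.getValue());
    }
  }
}
{code}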

> ContainersLauncher.ContainerLaunch can throw ConcurrentModificationException
> 
>
> Key: YARN-8244
> URL: https://issues.apache.org/jira/browse/YARN-8244
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Priority: Major
>
> {code:java}
> 2018-05-03 17:31:35,028 WARN [ContainersLauncher #1] launcher.ContainerLaunch 
> (ContainerLaunch.java:call(329)) - Failed to launch container.
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
> at java.util.HashMap$EntryIterator.next(HashMap.java:1471)
> at java.util.HashMap$EntryIterator.next(HashMap.java:1469)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch$ShellScriptBuilder.orderEnvByDependencies(ContainerLaunch.java:1311)
> at 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.writeLaunchEnv(ContainerExecutor.java:388)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:290)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:101)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8244) ContainersLauncher.ContainerLaunch can throw ConcurrentModificationException

2018-05-03 Thread Miklos Szegedi (JIRA)
Miklos Szegedi created YARN-8244:


 Summary: ContainersLauncher.ContainerLaunch can throw 
ConcurrentModificationException
 Key: YARN-8244
 URL: https://issues.apache.org/jira/browse/YARN-8244
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Miklos Szegedi


{code:java}
2018-05-03 17:31:35,028 WARN [ContainersLauncher #1] launcher.ContainerLaunch 
(ContainerLaunch.java:call(329)) - Failed to launch container.
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
at java.util.HashMap$EntryIterator.next(HashMap.java:1471)
at java.util.HashMap$EntryIterator.next(HashMap.java:1469)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch$ShellScriptBuilder.orderEnvByDependencies(ContainerLaunch.java:1311)
at 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.writeLaunchEnv(ContainerExecutor.java:388)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:290)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:101)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-02 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461797#comment-16461797
 ] 

Miklos Szegedi commented on YARN-7715:
--

[~asuresh], [~haibo.chen], I attached a patch with my proposal.

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-7715.000.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-02 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-7715:
-
Attachment: YARN-7715.000.patch

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-7715.000.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-02 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461741#comment-16461741
 ] 

Miklos Szegedi commented on YARN-7715:
--

I am working on a preliminary patch to discuss. Do you think we should reuse 
{{reacquireContainer}} for the apply logic, or create a separate apply method?

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7715) Update CPU and Memory cgroups params on container update as well.

2018-05-02 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi reassigned YARN-7715:


Assignee: Miklos Szegedi

> Update CPU and Memory cgroups params on container update as well.
> -
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-05-02 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461681#comment-16461681
 ] 

Miklos Szegedi commented on YARN-4599:
--

[~aw], [~haibo.chen], [~kasha], [~mding], [~sandflee], [~sidharta-s], 
[~vinodkv]  do you have time to review the patch or do you have any comments on 
the design approach?

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4599) Set OOM control for memory cgroups

2018-04-30 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-4599:
-
Attachment: YARN-4599.000.patch

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.000.patch, YARN-4599.sandflee.patch, 
> yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups

2018-04-30 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458806#comment-16458806
 ] 

Miklos Szegedi commented on YARN-4599:
--

I will provide the patch shortly. Here are the design points of the patch:
 * The basic idea is what was discussed above. It disables the OOM killer on 
the hadoop-yarn cgroup. This triggers a pause on all containers once their 
combined usage exceeds the node limit.
 * YARN is notified by an executable listening to the cgroups OOM Linux event. 
This should be very fast. The executable is oom-listener, not 
container-executor, because it does not need to run as root. I avoided JNI to 
be more defensive about security, and it also makes the executable easier to 
test.
 * When YARN receives the notification, it runs a pluggable OOM handler to 
resolve the situation. YARN is outside the hadoop-yarn cgroup, so it can run 
freely while all containers are frozen at this point. Different users may have 
different preferences, so the handler is pluggable.
 * The default OOM handler picks the latest container that ran over its 
request (see the sketch after this list). This ensures that it kills a 
container that has not cost much so far, while keeping guaranteed containers 
that play by the rules and use memory within their limits. It repeats the 
process until the OOM is resolved. Based on my experiments the kernel updates 
the flag almost instantaneously, so it kills only as many containers as 
necessary.
 * If the default OOM handler cannot pick a container with the logic above, it 
kills the latest container until the OOM is resolved.
 * If we are still in OOM without any containers left, an exception is thrown 
and the node is brought down. This can be the case if containers leaked 
processes, had processes running as another user that cannot be killed by the 
container user, or if someone put a process into the root hadoop-yarn cgroup.
 * The killer is not the normal container cleanup code. The standard behaviour 
is to send a SIGTERM to the container PGID and, if it does not respond in 250 
milliseconds, send a SIGKILL. However, in our case all the processes are 
frozen by cgroups, so they cannot respond to a SIGTERM. Because of this it 
uses the standard container executor code to send a SIGKILL to the PGID right 
away as the container user. The kernel OOM killer would do the same. This 
works quite fast. It walks through all the thread/process IDs in the tasks 
file, so that all active PGIDs in the container are found. The current code 
does not kill standalone processes that are not a process group leader; if 
they are not part of one of the container-local process groups they may be 
leaked. It also cannot handle processes in the container's cgroup that run as 
a different user than the container user. This should be rare.
 * The code adds a watchdog to measure the time to resolve an OOM situation. 
Resolving an OOM situation took 10-160 milliseconds in my experiments.
 * The patch contains documentation to set up and troubleshoot the feature.
 * I was able to test it manually but I have not done large-scale or long-haul 
tests yet.
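
A compact sketch of the default handler loop described above (the {{ContainerCandidate}} type and its {{kill()}} call are hypothetical stand-ins, not the classes added by the patch):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the container information the handler needs.
interface ContainerCandidate {
  long getMemoryUsedBytes();
  long getMemoryRequestedBytes();
  long getLaunchTime();
  void kill() throws IOException;   // SIGKILL to the container's process group
}

final class DefaultOomHandlerSketch {
  private static final Path OOM_CONTROL =
      Paths.get("/sys/fs/cgroup/memory/hadoop-yarn/memory.oom_control");

  // memory.oom_control contains a line such as "under_oom 1" while the
  // cgroup is paused in an OOM state.
  static boolean underOom() throws IOException {
    for (String line : Files.readAllLines(OOM_CONTROL)) {
      if (line.startsWith("under_oom")) {
        return line.trim().endsWith("1");
      }
    }
    return false;
  }

  static void resolve(List<ContainerCandidate> containers) throws IOException {
    while (underOom() && !containers.isEmpty()) {
      // Prefer the most recently launched container that exceeded its
      // request; fall back to the most recently launched container.
      ContainerCandidate victim = containers.stream()
          .filter(c -> c.getMemoryUsedBytes() > c.getMemoryRequestedBytes())
          .max(Comparator.comparingLong(ContainerCandidate::getLaunchTime))
          .orElseGet(() -> containers.stream()
              .max(Comparator.comparingLong(ContainerCandidate::getLaunchTime))
              .get());
      victim.kill();
      containers.remove(victim);
    }
  }
}
{code}
In the actual implementation the candidate list and the kill path come from the NodeManager's own bookkeeping; the sketch only shows the selection loop and the re-check of the under_oom flag after every kill.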

> Set OOM control for memory cgroups
> --
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Karthik Kambatla
>Assignee: Miklos Szegedi
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6675) Add NM support to launch opportunistic containers based on overallocation

2018-04-20 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446378#comment-16446378
 ] 

Miklos Szegedi commented on YARN-6675:
--

+1 pending Jenkins.

> Add NM support to launch opportunistic containers based on overallocation
> -
>
> Key: YARN-6675
> URL: https://issues.apache.org/jira/browse/YARN-6675
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.0.0-alpha3
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-6675-YARN-1011.00.patch, 
> YARN-6675-YARN-1011.01.patch, YARN-6675-YARN-1011.prelim0.patch, 
> YARN-6675-YARN-1011.prelim1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6675) Add NM support to launch opportunistic containers based on overallocation

2018-04-20 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446344#comment-16446344
 ] 

Miklos Szegedi commented on YARN-6675:
--

Thank you for the patch [~haibochen].

Can you address this checkstyle issue and the unit test issue?

./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java:1077:
 protected String getContainerPid(Path pidFilePath) throws Exception {:41: 
'pidFilePath' hides a field. [HiddenField]
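
For reference, the HiddenField warning just means the method parameter shadows an instance field of the same name; renaming the parameter clears it, as in the toy example below (which is not the actual {{ContainerLaunch}} code):
{code:java}
import java.nio.file.Path;

// Toy example only, not the actual ContainerLaunch code.
class HiddenFieldExample {
  private Path pidFilePath;                 // instance field

  // Checkstyle HiddenField: the parameter shadows the field above.
  String getContainerPidBad(Path pidFilePath) {
    return pidFilePath.toString();
  }

  // Fixed: a differently named parameter no longer hides the field.
  String getContainerPid(Path pidFile) {
    return pidFile.toString();
  }
}
{code}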

Also, your comments and the unit tests describe the scheduling logic nicely, 
but I would write user/administrator-facing documentation about the logic in 
another jira. This one is already too big.

> Add NM support to launch opportunistic containers based on overallocation
> -
>
> Key: YARN-6675
> URL: https://issues.apache.org/jira/browse/YARN-6675
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.0.0-alpha3
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-6675-YARN-1011.00.patch, 
> YARN-6675-YARN-1011.prelim0.patch, YARN-6675-YARN-1011.prelim1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-04-20 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445915#comment-16445915
 ] 

Miklos Szegedi commented on YARN-8090:
--

Thank you for the comments [~grepas]. I updated the patch.

> Race conditions in FadvisedChunkedFile
> --
>
> Key: YARN-8090
> URL: https://issues.apache.org/jira/browse/YARN-8090
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-8090.000.patch, YARN-8090.001.patch
>
>
> {code:java}
> 11:04:33.605 AM   WARNFadvisedChunkedFile 
> Failed to manage OS cache for 
> /var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
> EBADF: Bad file descriptor
>   at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
> Method)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>   at 
> org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-04-20 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-8090:
-
Attachment: YARN-8090.001.patch

> Race conditions in FadvisedChunkedFile
> --
>
> Key: YARN-8090
> URL: https://issues.apache.org/jira/browse/YARN-8090
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-8090.000.patch, YARN-8090.001.patch
>
>
> {code:java}
> 11:04:33.605 AM   WARNFadvisedChunkedFile 
> Failed to manage OS cache for 
> /var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
> EBADF: Bad file descriptor
>   at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
> Method)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>   at 
> org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-8037) CGroupsResourceCalculator logs excessive warnings on container relaunch

2018-04-09 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431526#comment-16431526
 ] 

Miklos Szegedi commented on YARN-8037:
--

Thank you, [~shaneku...@gmail.com]. How about hashing the stack trace of the 
exception and reporting it only if it has not been seen before?
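
Something along these lines is what I have in mind (hypothetical helper, not code from any patch): log the full trace only the first time a given stack-trace hash is seen, and a one-liner afterwards.
{code:java}
import java.util.Arrays;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.slf4j.Logger;

// Hypothetical helper to deduplicate repeated warnings: the first
// occurrence of a stack trace is logged in full, repeats as one line.
final class DeduplicatingLogger {
  private final Set<Integer> seenTraces = ConcurrentHashMap.newKeySet();

  void warnOnce(Logger log, String message, Exception e) {
    int traceHash = Arrays.hashCode(e.getStackTrace());
    if (seenTraces.add(traceHash)) {
      log.warn(message, e);                        // first time: full trace
    } else {
      log.warn("{} ({})", message, e.toString());  // repeats: single line
    }
  }
}
{code}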

> CGroupsResourceCalculator logs excessive warnings on container relaunch
> ---
>
> Key: YARN-8037
> URL: https://issues.apache.org/jira/browse/YARN-8037
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Shane Kumpf
>Priority: Major
>
> When a container is relaunched, the old process no longer exists. When using 
> the {{CGroupsResourceCalculator}} this results in the warning and exception 
> below being logged every second until the relaunch occurs, which is excessive 
> and filling up the logs.
> {code:java}
> 2018-03-16 14:30:33,438 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator:
>  Failed to parse 12844
> org.apache.hadoop.yarn.exceptions.YarnException: The process vanished in the 
> interim 12844
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.processFile(CGroupsResourceCalculator.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.readTotalProcessJiffies(CGroupsResourceCalculator.java:252)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.updateProcessTree(CGroupsResourceCalculator.java:181)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CombinedResourceCalculator.updateProcessTree(CombinedResourceCalculator.java:52)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:457)
> Caused by: java.io.FileNotFoundException: 
> /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_e01_1521209613260_0002_01_02/cpuacct.stat
>  (No such file or directory)
> at java.io.FileInputStream.open0(Native Method)
> at java.io.FileInputStream.open(FileInputStream.java:195)
> at java.io.FileInputStream.(FileInputStream.java:138)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.processFile(CGroupsResourceCalculator.java:320)
> ... 4 more
> 2018-03-16 14:30:33,438 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator:
>  Failed to parse cgroups 
> /sys/fs/cgroup/memory/hadoop-yarn/container_e01_1521209613260_0002_01_02/memory.memsw.usage_in_bytes
> org.apache.hadoop.yarn.exceptions.YarnException: The process vanished in the 
> interim 12844
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.processFile(CGroupsResourceCalculator.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.getMemorySize(CGroupsResourceCalculator.java:238)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.updateProcessTree(CGroupsResourceCalculator.java:187)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CombinedResourceCalculator.updateProcessTree(CombinedResourceCalculator.java:52)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:457)
> Caused by: java.io.FileNotFoundException: 
> /sys/fs/cgroup/memory/hadoop-yarn/container_e01_1521209613260_0002_01_02/memory.usage_in_bytes
>  (No such file or directory)
> at java.io.FileInputStream.open0(Native Method)
> at java.io.FileInputStream.open(FileInputStream.java:195)
> at java.io.FileInputStream.(FileInputStream.java:138)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.processFile(CGroupsResourceCalculator.java:320)
> ... 4 more{code}
> We should consider moving the exception to debug to reduce the noise at a 
> minimum. Alternatively, it may make sense to stop the existing 
> {{MonitoringThread}} during relaunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-04-04 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-8090:
-
Attachment: YARN-8090.000.patch

> Race conditions in FadvisedChunkedFile
> --
>
> Key: YARN-8090
> URL: https://issues.apache.org/jira/browse/YARN-8090
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-8090.000.patch
>
>
> {code:java}
> 11:04:33.605 AM   WARNFadvisedChunkedFile 
> Failed to manage OS cache for 
> /var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
> EBADF: Bad file descriptor
>   at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
> Method)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>   at 
> org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-8035) Uncaught exception in ContainersMonitorImpl during relaunch due to the process ID changing

2018-04-03 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424406#comment-16424406
 ] 

Miklos Szegedi commented on YARN-8035:
--

+1 LGTM. Thank you for the patch [~shaneku...@gmail.com]. I will commit this 
shortly.

> Uncaught exception in ContainersMonitorImpl during relaunch due to the 
> process ID changing
> --
>
> Key: YARN-8035
> URL: https://issues.apache.org/jira/browse/YARN-8035
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Shane Kumpf
>Assignee: Shane Kumpf
>Priority: Major
> Attachments: YARN-8035.001.patch, YARN-8035.002.patch
>
>
> In the case of a container relaunch event, the container ID is reused but a 
> new process is spawned. For resource monitoring, {{ContainersMonitorImpl}} 
> will obtain the new PID post relaunch and initialize the process tree 
> monitoring. As part of this initialization, a tag called {{ContainerPid}}, 
> whose value is the PID for the container, is populated for the metrics 
> associated with the container. If the prior container failed after its 
> process started, the original PID will already be populated for the 
> container, resulting in the {{MetricsException}} below.
> {code:java}
> 2018-03-16 11:59:02,563 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Uncaught exception in ContainersMonitorImpl while monitoring resource of 
> container_1521201379995_0001_01_02
> org.apache.hadoop.metrics2.MetricsException: Tag ContainerPid already exists!
> at 
> org.apache.hadoop.metrics2.lib.MetricsRegistry.checkTagName(MetricsRegistry.java:433)
> at 
> org.apache.hadoop.metrics2.lib.MetricsRegistry.tag(MetricsRegistry.java:394)
> at 
> org.apache.hadoop.metrics2.lib.MetricsRegistry.tag(MetricsRegistry.java:400)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.recordProcessId(ContainerMetrics.java:277)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.initializeProcessTrees(ContainersMonitorImpl.java:559)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:448){code}
> {{MetricsRegistry}} provides a {{tag}} method that allows for updating the 
> value of an existing tag. Updating the value ensures that the PID associated 
> with container is the currently running process, which appears to be an 
> appropriate fix. However, it's unclear how this tag might be being used by 
> other systems. I'm not finding any usage in Hadoop itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6739) Crash NM at start time if oversubscription is on but LinuxContainerExcutor or cgroup is off

2018-04-02 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423197#comment-16423197
 ] 

Miklos Szegedi commented on YARN-6739:
--

+1 Thank you for the patch [~haibochen].

> Crash NM at start time if oversubscription is on but LinuxContainerExcutor or 
> cgroup is off
> ---
>
> Key: YARN-6739
> URL: https://issues.apache.org/jira/browse/YARN-6739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha3
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-6739-YARN-1011.00.patch, 
> YARN-6739-YARN-1011.01.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8077) The vmemLimit parameter in ContainersMonitorImpl#isProcessTreeOverLimit is confusing

2018-03-29 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419921#comment-16419921
 ] 

Miklos Szegedi commented on YARN-8077:
--

The patch has been integrated into trunk.

> The vmemLimit parameter in ContainersMonitorImpl#isProcessTreeOverLimit is 
> confusing
> 
>
> Key: YARN-8077
> URL: https://issues.apache.org/jira/browse/YARN-8077
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Sen Zhao
>Assignee: Sen Zhao
>Priority: Trivial
> Fix For: 3.2.0
>
> Attachments: YARN-8077.001.patch
>
>
> The parameter should be memLimit.   It contains the meaning of vmemLimit and 
> pmemLimit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-03-29 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419898#comment-16419898
 ] 

Miklos Szegedi commented on YARN-8090:
--

I have seen a similar race condition in the shuffle handler here:

[https://stackoverflow.com/questions/27253616/hadoop-warn-ebadf-bad-file-descriptor]
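
The general defence against that kind of race, sketched below under the assumption that a double or concurrent close is the trigger (this is not the actual FadvisedChunkedFile fix), is to make close() idempotent so the fadvise hint is never issued on a descriptor another thread already closed:
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the general pattern only, not the FadvisedChunkedFile patch:
// an idempotent close() prevents posix_fadvise from running against a
// file descriptor that was already closed by another caller.
final class GuardedChunkedFile {
  private final RandomAccessFile file;
  private final AtomicBoolean closed = new AtomicBoolean(false);

  GuardedChunkedFile(RandomAccessFile file) {
    this.file = file;
  }

  void close() throws IOException {
    // Only the first caller performs the cleanup; later or concurrent
    // callers see closed == true and return immediately.
    if (closed.compareAndSet(false, true)) {
      // The fadvise(DONTNEED) hint would be issued here, while the
      // descriptor is still guaranteed to be open.
      file.close();
    }
  }
}
{code}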

 

> Race conditions in FadvisedChunkedFile
> --
>
> Key: YARN-8090
> URL: https://issues.apache.org/jira/browse/YARN-8090
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Major
>
> {code:java}
> 11:04:33.605 AM   WARNFadvisedChunkedFile 
> Failed to manage OS cache for 
> /var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
> EBADF: Bad file descriptor
>   at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
> Method)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>   at 
> org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Created] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-03-29 Thread Miklos Szegedi (JIRA)
Miklos Szegedi created YARN-8090:


 Summary: Race conditions in FadvisedChunkedFile
 Key: YARN-8090
 URL: https://issues.apache.org/jira/browse/YARN-8090
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.0
Reporter: Miklos Szegedi
Assignee: Miklos Szegedi


{code:java}
11:04:33.605 AM WARNFadvisedChunkedFile 
Failed to manage OS cache for 
/var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
Method)
at 
org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
at 
org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
at 
org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
at 
org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
at 
org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
at 
org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
at 
org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
at 
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at 
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at 
org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
at 
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
at 
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at 
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at 
org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
at 
org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
at 
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
at 
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at 
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at 
org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
at 
org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
at 
org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
at 
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
at 
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at 
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
at 
org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
at 
org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at 
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at 
org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at 
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at 
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8077) The vmemLimit parameter in ContainersMonitorImpl#isProcessTreeOverLimit is confusing

2018-03-28 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417735#comment-16417735
 ] 

Miklos Szegedi commented on YARN-8077:
--

The Jenkins failure seems to be unrelated (protoc). Let me look into this.

> The vmemLimit parameter in ContainersMonitorImpl#isProcessTreeOverLimit is 
> confusing
> 
>
> Key: YARN-8077
> URL: https://issues.apache.org/jira/browse/YARN-8077
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Sen Zhao
>Assignee: Sen Zhao
>Priority: Trivial
> Fix For: 3.2.0
>
> Attachments: YARN-8077.001.patch
>
>
> The parameter should be memLimit. It covers the meaning of both vmemLimit and 
> pmemLimit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-8077) The vmemLimit parameter in ContainersMonitorImpl#isProcessTreeOverLimit is confusing

2018-03-28 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-8077:
-
Comment: was deleted

(was: Committed to trunk.)

> The vmemLimit parameter in ContainersMonitorImpl#isProcessTreeOverLimit is 
> confusing
> 
>
> Key: YARN-8077
> URL: https://issues.apache.org/jira/browse/YARN-8077
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Sen Zhao
>Assignee: Sen Zhao
>Priority: Trivial
> Fix For: 3.2.0
>
> Attachments: YARN-8077.001.patch
>
>
> The parameter should be memLimit. It covers the meaning of both vmemLimit and 
> pmemLimit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8077) The vmemLimit parameter in ContainersMonitorImpl#isProcessTreeOverLimit is confusing

2018-03-27 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416040#comment-16416040
 ] 

Miklos Szegedi commented on YARN-8077:
--

+1 LGTM. Thank you for raising this [~Sen Zhao] and for the patch. I will 
commit this shortly.

> The vmemLimit parameter in ContainersMonitorImpl#isProcessTreeOverLimit is 
> confusing
> 
>
> Key: YARN-8077
> URL: https://issues.apache.org/jira/browse/YARN-8077
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Sen Zhao
>Assignee: Sen Zhao
>Priority: Trivial
> Attachments: YARN-8077.001.patch
>
>
> The parameter should be memLimit. It covers the meaning of both vmemLimit and 
> pmemLimit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8039) Clean up log dir configuration in TestLinuxContainerExecutorWithMocks.testStartLocalizer

2018-03-16 Thread Miklos Szegedi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miklos Szegedi updated YARN-8039:
-
Attachment: YARN-8039.000.patch

> Clean up log dir configuration in 
> TestLinuxContainerExecutorWithMocks.testStartLocalizer
> 
>
> Key: YARN-8039
> URL: https://issues.apache.org/jira/browse/YARN-8039
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Minor
> Attachments: YARN-8039.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted

2018-03-16 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402162#comment-16402162
 ] 

Miklos Szegedi commented on YARN-8031:
--

[~jayceAu], thank you for raising this. If you have CGroups already mounted, 
you should set the mount option to false as described here:

[https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]
{code:java}
Discover CGroups mounted already: This should be used on newer systems 
like RHEL7 or Ubuntu16 or if the administrator mounts CGroups before YARN 
starts. Set yarn.nodemanager.linux-container-executor.cgroups.mount to false 
and leave other settings set to their defaults. YARN will locate the mount 
points in /proc/mounts. Common locations include /sys/fs/cgroup and /cgroup. 
The default location can vary depending on the Linux distribution in use.{code}
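For reference, a minimal sketch of the equivalent programmatic setting (the property 
name is the one quoted above; in practice you would normally set it in yarn-site.xml):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class CGroupsMountConfigExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Tell the NodeManager to discover the already-mounted CGroups hierarchy
    // via /proc/mounts instead of trying to mount it itself.
    conf.setBoolean(
        "yarn.nodemanager.linux-container-executor.cgroups.mount", false);
    System.out.println(conf.get(
        "yarn.nodemanager.linux-container-executor.cgroups.mount"));
  }
}{code}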

> NodeManager will fail to start if cpu subsystem is already mounted
> --
>
> Key: YARN-8031
> URL: https://issues.apache.org/jira/browse/YARN-8031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: JayceAu
>Priority: Major
> Attachments: image-2018-03-15-14-47-30-583.png
>
>
> if *yarn.nodemanager.linux-container-executor.cgroups.mount* is set to true 
> and the cpu subsystem is not yet mounted, NodeManager will mount the cpu 
> subsystem and then create the control group whose default name is 
> *hadoop-yarn* if the mount step is successful. This procedure works well if 
> the cpu subsystem is not yet mounted. However, in some situations the cpu subsystem 
> is already mounted before NodeManager starts, and NodeManager will then fail to 
> start because it has no write permission to the *hadoop-yarn* path. For example:
>  # an OS that uses systemd, such as CentOS 7, will have the cpu subsystem mounted by 
> default on machine startup
>  # some daemon whose start order precedes NodeManager may also 
> rely on the mounted state of the cpu subsystem. In our production environment, we 
> limit the cpu usage of the monitoring and control agent, which starts on 
> reboot
> In order to solve this problem, container-executor must be able to create the 
> control group *hadoop-yarn* if mounting the controller succeeds or the 
> controller is already mounted. Besides, if the cpu subsystem is mounted in 
> combination with other subsystems, container-executor 
> should use the actual mount point of the cpu subsystem instead of the one 
> provided by NodeManager.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8039) Clean up log dir configuration in TestLinuxContainerExecutorWithMocks.testStartLocalizer

2018-03-16 Thread Miklos Szegedi (JIRA)
Miklos Szegedi created YARN-8039:


 Summary: Clean up log dir configuration in 
TestLinuxContainerExecutorWithMocks.testStartLocalizer
 Key: YARN-8039
 URL: https://issues.apache.org/jira/browse/YARN-8039
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Miklos Szegedi
Assignee: Miklos Szegedi






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8037) CGroupsResourceCalculator excessive warnings on container relaunch

2018-03-16 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402121#comment-16402121
 ] 

Miklos Szegedi commented on YARN-8037:
--

Thank you, [~shaneku...@gmail.com] for raising this. [~haibochen], one of the 
earlier patches in YARN-7064 had logic to suppress repeated reporting of issues 
from CGroupsResourceCalculator, and we removed it based on your advice, to 
support debugging. What is your opinion about this suggestion? Do you think we 
should add back some filtering here? 
https://issues.apache.org/jira/browse/YARN-7064?focusedCommentId=16323135=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16323135
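Just to illustrate the kind of filtering I mean, here is a minimal sketch (the class 
and field names are made up for this example and are not from YARN-7064 or any 
attached patch) that logs the first occurrence per process at WARN and later ones at 
DEBUG:
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class VanishedProcessWarningFilter {
  private static final Logger LOG =
      LoggerFactory.getLogger(VanishedProcessWarningFilter.class);
  // Remembers which process ids we have already warned about.
  private final Set<String> alreadyWarned = ConcurrentHashMap.newKeySet();

  public void logVanished(String pid, Exception cause) {
    if (alreadyWarned.add(pid)) {
      // First time we see this pid: keep the warning visible.
      LOG.warn("Failed to parse " + pid, cause);
    } else {
      // Subsequent occurrences go to debug to avoid flooding the log.
      LOG.debug("Failed to parse " + pid, cause);
    }
  }
}{code}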

 

> CGroupsResourceCalculator excessive warnings on container relaunch
> --
>
> Key: YARN-8037
> URL: https://issues.apache.org/jira/browse/YARN-8037
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Shane Kumpf
>Priority: Major
>
> When a container is relaunched, the old process no longer exists. When using 
> the {{CGroupsResourceCalculator}} this results in the warning and exception 
> below being logged every second until the relaunch occurs, which is excessive 
> and fills up the logs.
> {code:java}
> 2018-03-16 14:30:33,438 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator:
>  Failed to parse 12844
> org.apache.hadoop.yarn.exceptions.YarnException: The process vanished in the 
> interim 12844
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.processFile(CGroupsResourceCalculator.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.readTotalProcessJiffies(CGroupsResourceCalculator.java:252)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.updateProcessTree(CGroupsResourceCalculator.java:181)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CombinedResourceCalculator.updateProcessTree(CombinedResourceCalculator.java:52)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:457)
> Caused by: java.io.FileNotFoundException: 
> /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_e01_1521209613260_0002_01_02/cpuacct.stat
>  (No such file or directory)
> at java.io.FileInputStream.open0(Native Method)
> at java.io.FileInputStream.open(FileInputStream.java:195)
> at java.io.FileInputStream.(FileInputStream.java:138)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.processFile(CGroupsResourceCalculator.java:320)
> ... 4 more
> 2018-03-16 14:30:33,438 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator:
>  Failed to parse cgroups 
> /sys/fs/cgroup/memory/hadoop-yarn/container_e01_1521209613260_0002_01_02/memory.memsw.usage_in_bytes
> org.apache.hadoop.yarn.exceptions.YarnException: The process vanished in the 
> interim 12844
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.processFile(CGroupsResourceCalculator.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.getMemorySize(CGroupsResourceCalculator.java:238)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.updateProcessTree(CGroupsResourceCalculator.java:187)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CombinedResourceCalculator.updateProcessTree(CombinedResourceCalculator.java:52)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:457)
> Caused by: java.io.FileNotFoundException: 
> /sys/fs/cgroup/memory/hadoop-yarn/container_e01_1521209613260_0002_01_02/memory.usage_in_bytes
>  (No such file or directory)
> at java.io.FileInputStream.open0(Native Method)
> at java.io.FileInputStream.open(FileInputStream.java:195)
> at java.io.FileInputStream.(FileInputStream.java:138)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsResourceCalculator.processFile(CGroupsResourceCalculator.java:320)
> ... 4 more{code}
> We should consider moving the exception to debug to reduce the noise at a 
> minimum. Alternatively, it may make sense to stop the existing 
> {{MonitoringThread}} during relaunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers

2018-03-13 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397531#comment-16397531
 ] 

Miklos Szegedi commented on YARN-5764:
--

Committed to trunk. Thank you for the contribution [~devaraj.k], [~olasoji] for 
the report and for the reviews [~leftnoteasy], [~rajesh.balamohan], 
[~raviprak], [~sunilg] and [~rohithsharma].

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
>Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance 
> Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v10.patch, 
> YARN-5764-v11.patch, YARN-5764-v2.patch, YARN-5764-v3.patch, 
> YARN-5764-v4.patch, YARN-5764-v5.patch, YARN-5764-v6.patch, 
> YARN-5764-v7.patch, YARN-5764-v8.patch, YARN-5764-v9.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non SMP systems. Yarn containers, on launch, 
> will be pinned to a specific NUMA node and all subsequent memory allocations 
> will be served by the same node, reducing remote memory accesses. The current 
> default behavior is to spread memory across all NUMA nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers

2018-03-09 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393396#comment-16393396
 ] 

Miklos Szegedi commented on YARN-5764:
--

+1 LGTM pending Jenkins. I will commit this shortly afterwards.

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
>Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance 
> Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, 
> YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, 
> YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch, YARN-5764-v9.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non SMP systems. Yarn containers, on launch, 
> will be pinned to a specific NUMA node and all subsequent memory allocations 
> will be served by the same node, reducing remote memory accesses. The current 
> default behavior is to spread memory across all NUMA nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers

2018-03-09 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393265#comment-16393265
 ] 

Miklos Szegedi commented on YARN-5764:
--

[~devaraj.k], could you address the two checkstyle issues?

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
>Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance 
> Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, 
> YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, 
> YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non SMP systems. Yarn containers, on launch, 
> will be pinned to a specific NUMA node and all subsequent memory allocations 
> will be served by the same node, reducing remote memory accesses. The current 
> default behavior is to spread memory across all NUMA nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers

2018-03-08 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392213#comment-16392213
 ] 

Miklos Szegedi commented on YARN-5764:
--

Thank you, [~devaraj.k]. The patch looks good to me in general. I still see two 
checkstyle issues. I started a new jenkins run.

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
>Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance 
> Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, 
> YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, 
> YARN-5764-v6.patch, YARN-5764-v7.patch, YARN-5764-v8.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non SMP systems. Yarn containers, on launch, 
> will be pinned to a specific NUMA node and all subsequent memory allocations 
> will be served by the same node, reducing remote memory accesses. The current 
> default behavior is to spread memory across all NUMA nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8009) YARN limit number of simultaneously running containers in the application level

2018-03-07 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390116#comment-16390116
 ] 

Miklos Szegedi commented on YARN-8009:
--

Thank you for raising this [~sachinjose2...@gmail.com]. Normally the 
Application Master has the ability to specify the number of containers. It can 
then provide an option to the user. See the distributed shell example for 
details:

https://github.com/apache/hadoop/blob/037d7834833df2d1e60f5015b60d42550b1ddce6/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java#L459
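As a rough sketch only (this is not code from the distributed shell AM itself; the 
option handling and names are assumptions), an AM can simply cap how many container 
requests it submits:
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LimitedContainerRequests {
  public static void main(String[] args) throws Exception {
    // Hypothetical user option, e.g. passed on the AM command line.
    int maxContainers = Integer.parseInt(args[0]);

    AMRMClient<ContainerRequest> amRMClient = AMRMClient.createAMRMClient();
    amRMClient.init(new YarnConfiguration());
    amRMClient.start();
    // registerApplicationMaster()/allocate() heartbeat loop omitted for brevity.

    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    // The AM never asks for more than maxContainers containers.
    for (int i = 0; i < maxContainers; i++) {
      amRMClient.addContainerRequest(
          new ContainerRequest(capability, null, null, priority));
    }
  }
}{code}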

> YARN limit number of simultaneously running containers in the application 
> level
> ---
>
> Key: YARN-8009
> URL: https://issues.apache.org/jira/browse/YARN-8009
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Sachin Jose
>Priority: Minor
>  Labels: features
>
> It would be really useful if the user could specify the maximum number of containers 
> that can run simultaneously at the application level. Most long-running 
> YARN applications could benefit from this. At the moment, the only 
> available option to restrict resource overuse by long-running applications is at the 
> YARN ResourceManager queue level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7626) Allow regular expression matching in container-executor.cfg for devices and named docker volumes mount

2018-03-07 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390095#comment-16390095
 ] 

Miklos Szegedi commented on YARN-7626:
--

Thank you for the contribution [~Zian Chen] and for the commit [~leftnoteasy].

> Allow regular expression matching in container-executor.cfg for devices and 
> named docker volumes mount
> --
>
> Key: YARN-7626
> URL: https://issues.apache.org/jira/browse/YARN-7626
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7626.001.patch, YARN-7626.002.patch, 
> YARN-7626.003.patch, YARN-7626.004.patch, YARN-7626.005.patch, 
> YARN-7626.006.patch, YARN-7626.007.patch, YARN-7626.008.patch, 
> YARN-7626.009.patch, YARN-7626.010.patch, YARN-7626.011.patch
>
>
> Currently when we configure some of the GPU device related fields (like ) in 
> container-executor.cfg, these fields are generated based on different driver 
> versions or GPU device names. We want to enable regular expression matching 
> so that users don't need to manually set up these fields when configuring 
> container-executor.cfg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers

2018-03-05 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387021#comment-16387021
 ] 

Miklos Szegedi commented on YARN-5764:
--

Thank you, [~devaraj.k] for the updated patch.
{code:java}
3599public static final String NM_NUMA_AWARENESS_NODE_MEMORY = NM_PREFIX
3600+ "numa-awareness..memory";
3601public static final String NM_NUMA_AWARENESS_NODE_CPUS = NM_PREFIX
3602+ "numa-awareness..cpus";{code}
These two lines are no-ops; they can probably be omitted.
{code:java}
yarn.nodemanager.numa-awareness.1.memory
{code}
Optional: Is there an example of an asymmetric NUMA 
architecture? It might make sense in the future to define nodes once and 
specify a multiplier, so that we can make the configuration easier.
{code:java}
145 String[] args = new String[] {"numactl", "--hardware"};{code}
This should be {{/usr/bin/numactl}} for security reasons. In fact, should it not 
use the configured numactl path?
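Something along these lines (the property name and default below are assumptions for 
illustration, not necessarily what the patch defines):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NumactlCommandResolver {
  public static String[] hardwareCommand(Configuration conf) {
    // Resolve numactl from configuration with an absolute-path default,
    // rather than relying on a PATH lookup of a bare "numactl".
    String numactl = conf.get(
        "yarn.nodemanager.numa-awareness.numactl.cmd", "/usr/bin/numactl");
    return new String[] {numactl, "--hardware"};
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ",
        hardwareCommand(new YarnConfiguration())));
  }
}{code}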
I think {{recoverCpus}} and {{recoverMemory}} can be eliminated. You could just 
create a Resource object and use assignResources.
{code}
213 NumaResourceAllocation numaNode = allocate(containerId, resource);
{code}
This is a little bit misleading. Allocate may return multiple allocations on 
multiple nodes, not just a single numaNode.
I have a question. {{recoverNumaResource}} reallocates the resources based on 
the registered values. Where are those resources released? It looks like 
testRecoverNumaResource() does not test a container allocation, release and 
then relaunch cycle, but rather the opposite direction. What is the reason for that?

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
>Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance 
> Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, 
> YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch, 
> YARN-5764-v6.patch, YARN-5764-v7.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non SMP systems. Yarn containers, on launch, 
> will be pinned to a specific NUMA node and all subsequent memory allocations 
> will be served by the same node, reducing remote memory accesses. The current 
> default behavior is to spread memory across all NUMA nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7626) Allow regular expression matching in container-executor.cfg for devices and named docker volumes mount

2018-03-05 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386738#comment-16386738
 ] 

Miklos Szegedi commented on YARN-7626:
--

Thank you for the patch [~Zian Chen] and for the review [~leftnoteasy].

Optional: I have one style issue with the latest patch. Where you refer to 6 in 
your patch, as in the line below, you should probably use sizeof("regex:"). This 
makes the code easier to understand and more future-proof.
{code:java}
132 return is_volume_name(requested) && (execute_regex_match(pattern + 6, 
requested) == 0);{code}

> Allow regular expression matching in container-executor.cfg for devices and 
> named docker volumes mount
> --
>
> Key: YARN-7626
> URL: https://issues.apache.org/jira/browse/YARN-7626
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-7626.001.patch, YARN-7626.002.patch, 
> YARN-7626.003.patch, YARN-7626.004.patch, YARN-7626.005.patch, 
> YARN-7626.006.patch, YARN-7626.007.patch, YARN-7626.008.patch, 
> YARN-7626.009.patch, YARN-7626.010.patch
>
>
> Currently when we configure some of the GPU device related fields (like ) in 
> container-executor.cfg, these fields are generated based on different driver 
> versions or GPU device names. We want to enable regular expression matching 
> so that users don't need to manually set up these fields when configuring 
> container-executor.cfg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7626) Allow regular expression matching in container-executor.cfg for devices and named docker volumes mount

2018-02-20 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370401#comment-16370401
 ] 

Miklos Szegedi commented on YARN-7626:
--

Indeed, there is not any substantial difference other than saying 
{{regex:/dev/nvidia.*}}. I think the latter is a bit more robust in case we 
try to configure regexes for other purposes in the future. This is just an 
opinion; I will let you decide.
{quote}what if hackers input user mount like regex+ as a prefix?
{quote}
Regex+ won't be considered valid. What if they put ^.*$? I do not think there 
is a difference, at least not from this point of view. But you just raised an 
important point: the regex has to be properly validated.
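To illustrate the validation point only (the real check would live in 
container-executor's C code, for example built around regcomp(); the Java below is 
just a sketch of the idea):
{code:java}
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PermittedMountRegexValidator {
  private static final String PREFIX = "regex:";

  // Accept an entry only if it carries the regex: prefix and the remainder
  // actually compiles as a regular expression.
  public static boolean isValidRegexEntry(String entry) {
    if (entry == null || !entry.startsWith(PREFIX)) {
      return false;
    }
    try {
      Pattern.compile(entry.substring(PREFIX.length()));
      return true;
    } catch (PatternSyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isValidRegexEntry("regex:/dev/nvidia.*")); // true
    System.out.println(isValidRegexEntry("regex:[unclosed"));     // false
  }
}{code}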

> Allow regular expression matching in container-executor.cfg for devices and 
> named docker volumes mount
> --
>
> Key: YARN-7626
> URL: https://issues.apache.org/jira/browse/YARN-7626
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-7626.001.patch, YARN-7626.002.patch, 
> YARN-7626.003.patch, YARN-7626.004.patch, YARN-7626.005.patch, 
> YARN-7626.006.patch, YARN-7626.007.patch, YARN-7626.008.patch
>
>
> Currently when we configure some of the GPU device related fields (like ) in 
> container-executor.cfg, these fields are generated based on different driver 
> versions or GPU device names. We want to enable regular expression matching 
> so that users don't need to manually set up these fields when configuring 
> container-executor.cfg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7626) Allow regular expression matching in container-executor.cfg for devices and named docker volumes mount

2018-02-13 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363078#comment-16363078
 ] 

Miklos Szegedi commented on YARN-7626:
--

Thank you for the patch [~Zian Chen].
{code:java}
return !(len > 2 && str[0] == '^' && str[len-1] == '$');{code}
Optional: I think this is still misleading. is_regex should return 1 on 
success and 0 on failure.
{code:java}
// Iterate each permitted values.{code}
Optional: I think it would be better to write 'Iterate through each permitted 
value'
{code:java}
'/dev/nvidia1:/dev/nvidia1'
  if (prefix == 0) {
ret = strcmp(values[i], permitted_values[j]);
  } else {
// If permitted-Values[j] is a REGEX, use REGEX to compare
if (is_regex(permitted_values[j]) == 0) {
  ret = validate_volume_name_with_argument(values[i], 
permitted_values[j]);
} else {
  ret = strncmp(values[i], permitted_values[j], tmp_ptr - 
values[i]);
}
  }
{code}
Technically the code where prefix is not null, including the regex match, 
should check only the characters before the ':' prefix. It currently checks the 
whole values[i]; you should apply the regex only to [values[i] ... tmp_ptr].
{code:java}
/**
 * Helper function to help normalize mounts for checking if mounts are
 * permitted. The function does the following -
 * 1. Find the canonical path for mount using realpath
 * 2. If the path is a directory, add a '/' at the end (if not present)
 * 3. Return a copy of the canonicalised path(to be freed by the caller)
 * @param mount path to be canonicalised
 * @return pointer to canonicalised path, NULL on error
 */
static char* normalize_mount(const char* mount, int isUserMount) {
{code}
There is no @param documentation for isUserMount; in fact, I would name it 
isRegexAllowed to avoid confusion.
{code:java}
const char *container_executor_cfg_path = normalize_mount(get_config_path(""), 
1);{code}
I do not understand why the config path could be a regex.
{code:java}
tmp_path_buffer[0] = normalize_mount(mount_src, 1);{code}
Should not this be 0, too?

I have a few conceptual issues with the latest patch.
 # First of all, normalize_mounts walks through the permitted mounts and 
resolves symlinks, but it does not resolve them if isUserMount (isRegex) is 
1. What if the regex resolves to a symlink? I think it would probably be more 
future-proof if normalize_mounts applied the regex to the directory tree and 
then called the original normalize_mount on the resulting file names, which 
returns the real path for each (see the sketch after this list). This would 
eliminate the need for passing 
isUserMount all the way through the call structure. It would also help to avoid 
issues that appear with invalid regexes, etc.
 # Technically a regex without the ^$ pair is a valid regex. It would be more 
precise and future-proof to mark regexes with a prefix like 
{{regex:/dev/device[0-9]+}}. In this case we would not need to use just a 
subset for matching.
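Here is a sketch of what point 1 suggests, written in Java for brevity 
(container-executor would do the equivalent in C with regcomp() and realpath()):
{code:java}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PermittedMountExpander {
  // Expand a permitted-mount regex against the entries of a directory and
  // canonicalize each match, so symlinks are resolved on the real files
  // rather than on the pattern itself.
  public static List<String> expand(Path dir, String regex) throws IOException {
    Pattern pattern = Pattern.compile(regex);
    List<String> result = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        if (pattern.matcher(entry.toString()).matches()) {
          result.add(entry.toRealPath().toString()); // like realpath(3)
        }
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(expand(Paths.get("/dev"), "/dev/nvidia[0-9]+"));
  }
}{code}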

> Allow regular expression matching in container-executor.cfg for devices and 
> named docker volumes mount
> --
>
> Key: YARN-7626
> URL: https://issues.apache.org/jira/browse/YARN-7626
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-7626.001.patch, YARN-7626.002.patch, 
> YARN-7626.003.patch, YARN-7626.004.patch, YARN-7626.005.patch, 
> YARN-7626.006.patch, YARN-7626.007.patch, YARN-7626.008.patch
>
>
> Currently when we configure some of the GPU device related fields (like ) in 
> container-executor.cfg, these fields are generated based on different driver 
> versions or GPU device names. We want to enable regular expression matching 
> so that users don't need to manually set up these fields when configuring 
> container-executor.cfg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5764) NUMA awareness support for launching containers

2018-02-12 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361284#comment-16361284
 ] 

Miklos Szegedi commented on YARN-5764:
--

[~devaraj.k], thank you for the reply. Did you update the patch with the fixes? 
I do not see any new patches after last August.

> NUMA awareness support for launching containers
> ---
>
> Key: YARN-5764
> URL: https://issues.apache.org/jira/browse/YARN-5764
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, yarn
>Reporter: Olasoji
>Assignee: Devaraj K
>Priority: Major
> Attachments: NUMA Awareness for YARN Containers.pdf, NUMA Performance 
> Results.pdf, YARN-5764-v0.patch, YARN-5764-v1.patch, YARN-5764-v2.patch, 
> YARN-5764-v3.patch, YARN-5764-v4.patch, YARN-5764-v5.patch
>
>
> The purpose of this feature is to improve Hadoop performance by minimizing 
> costly remote memory accesses on non SMP systems. Yarn containers, on launch, 
> will be pinned to a specific NUMA node and all subsequent memory allocations 
> will be served by the same node, reducing remote memory accesses. The current 
> default behavior is to spread memory across all NUMA nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7626) Allow regular expression matching in container-executor.cfg for devices and named docker volumes mount

2018-02-08 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357340#comment-16357340
 ] 

Miklos Szegedi commented on YARN-7626:
--

Thank you, [~Zian Chen] for the reply and update. Your latest patch does not 
seem to apply to trunk. Could you verify and rebase?

> Allow regular expression matching in container-executor.cfg for devices and 
> named docker volumes mount
> --
>
> Key: YARN-7626
> URL: https://issues.apache.org/jira/browse/YARN-7626
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-7626.001.patch, YARN-7626.002.patch, 
> YARN-7626.003.patch, YARN-7626.004.patch, YARN-7626.005.patch, 
> YARN-7626.006.patch
>
>
> Currently when we configure some of the GPU device related fields (like ) in 
> container-executor.cfg, these fields are generated based on different driver 
> versions or GPU device names. We want to enable regular expression matching 
> so that users don't need to manually set up these fields when configuring 
> container-executor.cfg.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7859) New feature: add queue scheduling deadLine in fairScheduler.

2018-02-06 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354632#comment-16354632
 ] 

Miklos Szegedi commented on YARN-7859:
--

Never mind, I found the location for the compare. Please ignore my comment.

> New feature: add queue scheduling deadLine in fairScheduler.
> 
>
> Key: YARN-7859
> URL: https://issues.apache.org/jira/browse/YARN-7859
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: wangwj
>Assignee: wangwj
>Priority: Major
>  Labels: fairscheduler, features, patch
> Fix For: 3.0.0
>
> Attachments: YARN-7859-v1.patch, YARN-7859-v2.patch, log, 
> screenshot-1.png, screenshot-3.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
>  As everyone knows, in FairScheduler the phenomenon of queue scheduling 
> starvation often occurs when the number of cluster jobs is large: the apps in 
> one or more queues stay pending. So I have thought of a way to solve this 
> problem: add a queue scheduling deadline to FairScheduler. When a queue has not 
> been scheduled by FairScheduler within a specified time, we schedule it forcibly!
> On the basis of the above, I propose this issue...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7859) New feature: add queue scheduling deadLine in fairScheduler.

2018-02-06 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354266#comment-16354266
 ] 

Miklos Szegedi commented on YARN-7859:
--

[~wangwj], thank you for the patch. Sorry, I think there is a misunderstanding. 
You changed your code in the latest patch. I did not say that there is 
something wrong with your code; what I was saying is that the reason for the 
bug is that this code:
{code:java}
private int compareDemand(Schedulable s1, Schedulable s2) {
  int res = 0;
  Resource demand1 = s1.getDemand();
  Resource demand2 = s2.getDemand();
  if (demand1.equals(Resources.none()) && Resources.greaterThan(
  RESOURCE_CALCULATOR, null, demand2, Resources.none())) {
res = 1;
  } else if (demand2.equals(Resources.none()) && Resources.greaterThan(
  RESOURCE_CALCULATOR, null, demand1, Resources.none())) {
res = -1;
  }
  return res;
}{code}
should be written something like this:
{code:java}
private int compareDemand(Schedulable s1, Schedulable s2) {
  int res = 0;
  Resource demand1 = s1.getDemand();
  Resource demand2 = s2.getDemand();
  if (demand1.equals(Resources.none()) && Resources.greaterThan(
  RESOURCE_CALCULATOR, null, demand2, Resources.none())) {
res = 1;
  } else if (demand2.equals(Resources.none()) && Resources.greaterThan(
  RESOURCE_CALCULATOR, null, demand1, Resources.none())) {
res = -1;
  }
  return RESOURCE_CALCULATOR.compare(null, demand1, demand2);
}{code}
The demand is not actually compared right now, which is causing the starvation of your 
second queue, because regardless of demand the queues are considered equal. It is an 
existing issue that I wanted to point out, not an issue with your patch.

> New feature: add queue scheduling deadLine in fairScheduler.
> 
>
> Key: YARN-7859
> URL: https://issues.apache.org/jira/browse/YARN-7859
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: wangwj
>Assignee: wangwj
>Priority: Major
>  Labels: fairscheduler, features, patch
> Fix For: 3.0.0
>
> Attachments: YARN-7859-v1.patch, YARN-7859-v2.patch, log, 
> screenshot-1.png, screenshot-3.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
>  As everyone knows, in FairScheduler the phenomenon of queue scheduling 
> starvation often occurs when the number of cluster jobs is large: the apps in 
> one or more queues stay pending. So I have thought of a way to solve this 
> problem: add a queue scheduling deadline to FairScheduler. When a queue has not 
> been scheduled by FairScheduler within a specified time, we schedule it forcibly!
> On the basis of the above, I propose this issue...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7859) New feature: add queue scheduling deadLine in fairScheduler.

2018-02-05 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353084#comment-16353084
 ] 

Miklos Szegedi commented on YARN-7859:
--

[~wangwj], thank you for the debugging, the patch and the suggestion. While your 
patch probably works in the above-mentioned scenario, it just works around the 
problem. I think the issue is that the code below claims to compare the 
demands, but what it effectively evaluates is {{s1>0 ? "s1 is bigger": (s2>0 ? "s2 is bigger" 
: "=")}}.
{code:java}
private int compareDemand(Schedulable s1, Schedulable s2) {
  int res = 0;
  Resource demand1 = s1.getDemand();
  Resource demand2 = s2.getDemand();
  if (demand1.equals(Resources.none()) && Resources.greaterThan(
  RESOURCE_CALCULATOR, null, demand2, Resources.none())) {
res = 1;
  } else if (demand2.equals(Resources.none()) && Resources.greaterThan(
  RESOURCE_CALCULATOR, null, demand1, Resources.none())) {
res = -1;
  }
  return res;
}
{code}
This is not what it should do. This is not a comparison, and I think it is 
causing the starvation of your second queue, since the first one is picked whenever it has any 
demand. Could you try removing your workaround and fixing the {{compare}} 
function above? I think that will solve the issue with far fewer lines of code.

> New feature: add queue scheduling deadLine in fairScheduler.
> 
>
> Key: YARN-7859
> URL: https://issues.apache.org/jira/browse/YARN-7859
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: wangwj
>Assignee: wangwj
>Priority: Major
>  Labels: fairscheduler, features, patch
> Fix For: 3.0.0
>
> Attachments: YARN-7859-v1.patch, log, screenshot-1.png, 
> screenshot-3.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
>  As everyone knows, in FairScheduler the phenomenon of queue scheduling 
> starvation often occurs when the number of cluster jobs is large: the apps in 
> one or more queues stay pending. So I have thought of a way to solve this 
> problem: add a queue scheduling deadline to FairScheduler. When a queue has not 
> been scheduled by FairScheduler within a specified time, we schedule it forcibly!
> On the basis of the above, I propose this issue...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7876) Localized jars that are expanded after localization are not fully copied

2018-02-05 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353017#comment-16353017
 ] 

Miklos Szegedi commented on YARN-7876:
--

Thank you for committing this [~jlowe]!

> Localized jars that are expanded after localization are not fully copied
> 
>
> Key: YARN-7876
> URL: https://issues.apache.org/jira/browse/YARN-7876
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Blocker
> Fix For: 3.1.0
>
> Attachments: YARN-7876.000.patch, YARN-7876.001.patch
>
>
> YARN-2185 added the ability to localize jar files as a stream instead of 
> copying to local disk and then extracting. ZipInputStream does not need to read to the 
> end of the file, so let's read it out explicitly. This helps when an additional 
> TeeInputStream is attached to the input.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7857) -fstack-check compilation flag causes binary incompatibility for container-executor between RHEL 6 and RHEL 7

2018-02-05 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352824#comment-16352824
 ] 

Miklos Szegedi commented on YARN-7857:
--

[~Jim_Brennan], I agree, let's postpone removing the guard. The RH7 code checks 
much less. It seems like it checks only the pages that it needs, so probably 
that is why it is not crashing like the RH6 code. However, I am interested in why 
the kernel traps the expansion of the stack despite the fact that we are 
within the actual limit of the stack.

> -fstack-check compilation flag causes binary incompatibility for 
> container-executor between RHEL 6 and RHEL 7
> -
>
> Key: YARN-7857
> URL: https://issues.apache.org/jira/browse/YARN-7857
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-7857.001.patch
>
>
> The segmentation fault in container-executor reported in [YARN-7796]  appears 
> to be due to a binary compatibility issue with the {{-fstack-check}} flag 
> that was added in [YARN-6721]
> Based on my testing, a container-executor (without the patch from 
> [YARN-7796]) compiled on RHEL 6 with the -fstack-check flag always hits this 
> segmentation fault when run on RHEL 7.  But if you compile without this flag, 
> the container-executor runs on RHEL 7 with no problems.  I also verified this 
> with a simple program that just does the copy_file.
> I think we need to either remove this flag, or find a suitable alternative.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6739) Crash NM at start time if oversubscription is on but LinuxContainerExcutor or cgroup is off

2018-02-02 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351267#comment-16351267
 ] 

Miklos Szegedi commented on YARN-6739:
--

Thank you, [~haibochen] for the patch.
{code:java}
void serviceInit(Configuration myConf) throws Exception{code}
This already throws an exception, so there is no need to throw a runtime 
exception. YarnException should be sufficient.
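Something like the following minimal sketch (the property names checked here are 
placeholders for illustration, not the ones from the patch):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class OversubscriptionCheckService extends AbstractService {
  public OversubscriptionCheckService() {
    super(OversubscriptionCheckService.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Placeholder property names, for illustration only.
    boolean oversubscription =
        conf.getBoolean("yarn.nodemanager.oversubscription.enabled", false);
    boolean cgroupsEnabled =
        conf.getBoolean("yarn.nodemanager.cgroups.enabled", false);
    if (oversubscription && !cgroupsEnabled) {
      // serviceInit() already declares throws Exception, so a checked
      // YarnException is enough; no RuntimeException wrapper is needed.
      throw new YarnException(
          "Oversubscription requires cgroups and the LinuxContainerExecutor");
    }
    super.serviceInit(conf);
  }
}{code}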

> Crash NM at start time if oversubscription is on but LinuxContainerExcutor or 
> cgroup is off
> ---
>
> Key: YARN-6739
> URL: https://issues.apache.org/jira/browse/YARN-6739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha3
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-6739-YARN-1011.00.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


