[jira] [Created] (YARN-11302) hadoop-yarn-applications-mawo-core module publishes tar file during maven deploy

2022-09-13 Thread Steven Rand (Jira)
Steven Rand created YARN-11302:
--

 Summary: hadoop-yarn-applications-mawo-core module publishes tar 
file during maven deploy
 Key: YARN-11302
 URL: https://issues.apache.org/jira/browse/YARN-11302
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications, yarn
Affects Versions: 3.3.4
Reporter: Steven Rand


The {{hadoop-yarn-applications-mawo-core}} module currently publishes a file 
with the extension {{bin.tar.gz}} during the maven deploy step: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-mawo/hadoop-yarn-applications-mawo-core/src/assembly/bin.xml#L16.

I don't know whether the community considers this to be a bug, but I'm creating 
a ticket because the deploy step typically publishes only JAR and POM files, and 
some maven repositories that are intended to host only JARs enforce allowlists 
of file extensions. Those allowlists block tarballs from being published, which 
causes the maven deploy operation to fail with an error like this:

{code}
Caused by: org.apache.maven.wagon.TransferFailedException: Failed to transfer 
file: 
https:///artifactory//org/apache/hadoop/applications/mawo/hadoop-yarn-applications-mawo-core//hadoop-yarn-applications-mawo-core--bin.tar.gz.
 Return code is: 409, ReasonPhrase: .
{code}

Feel free to close if the community doesn't consider this a problem, but note 
that it is a regression from versions predating mawo, in which only JAR and POM 
files were published during the deploy step.






[jira] [Created] (YARN-11184) fenced active RM not failing over correctly in HA setup

2022-06-14 Thread Steven Rand (Jira)
Steven Rand created YARN-11184:
--

 Summary: fenced active RM not failing over correctly in HA setup
 Key: YARN-11184
 URL: https://issues.apache.org/jira/browse/YARN-11184
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.2.3
Reporter: Steven Rand
 Attachments: image-2022-06-14-16-38-00-336.png, 
image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png, 
image-2022-06-14-16-44-45-101.png

We've observed an issue recently on a production cluster running 3.2.3 in which 
a fenced ResourceManager remains active, but does not communicate with the ZK 
state store, and therefore cannot function correctly. This did not occur while 
running 3.2.2 on the same cluster.

In more detail, what seems to happen is: 

1. The active RM gets a {{NodeExists}} error from ZK while storing an app in 
the state store. I suspect that this is caused by a transient connection issue: 
the first node creation request succeeds, but the response never reaches the 
RM, which triggers a duplicate request that fails with this error.

!image-2022-06-14-16-38-00-336.png!

2. Because of this error, the active RM is fenced.

!image-2022-06-14-16-39-50-278.png!

3. Because it is fenced, the active RM starts to transition to standby.

!image-2022-06-14-16-41-39-742.png!

4. However, the RM never fully transitions to standby. It never logs 
{{Transitioning RM to Standby mode}} from the run method of 
{{StandByTransitionRunnable}}: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195.
 Relatedly, a jstack of the RM shows that thread as {{RUNNABLE}}, but 
evidently not making progress:

 !image-2022-06-14-16-44-45-101.png! 

So the RM doesn't work because it is fenced, but remains active, which causes 
an outage until a failover is manually initiated.
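For context on why the missing log line matters, here's a rough, hypothetical 
sketch of the shape of such a fencing-triggered transition (this is not the 
actual ResourceManager code; the class and flow below are illustrative only). 
The point is that the {{Transitioning RM to Standby mode}} message is only 
emitted once {{run()}} gets past whatever precedes it, so a thread stuck 
earlier shows up as {{RUNNABLE}} in a jstack without ever logging it:

{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only -- not the actual StandByTransitionRunnable.
class StandbyTransitionSketch implements Runnable {
  private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

  @Override
  public void run() {
    // Only the first fencing event should drive the transition.
    if (hasAlreadyRun.getAndSet(true)) {
      return;
    }
    // ... stop active services, release leadership, etc.; if any of this
    // blocks, the log line below is never reached ...
    System.out.println("Transitioning RM to Standby mode");
    // ... reinitialize as standby and rejoin leader election ...
  }
}
{code}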






[jira] [Created] (YARN-10244) backport YARN-9848 to branch-3.2

2020-04-26 Thread Steven Rand (Jira)
Steven Rand created YARN-10244:
--

 Summary: backport YARN-9848 to branch-3.2
 Key: YARN-10244
 URL: https://issues.apache.org/jira/browse/YARN-10244
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, resourcemanager
Reporter: Steven Rand
Assignee: Steven Rand


Backporting YARN-9848 to branch-3.2.






[jira] [Created] (YARN-9850) document or revert change in which DefaultContainerExecutor no longer propagates NM env to containers

2019-09-21 Thread Steven Rand (Jira)
Steven Rand created YARN-9850:
-

 Summary: document or revert change in which 
DefaultContainerExecutor no longer propagates NM env to containers
 Key: YARN-9850
 URL: https://issues.apache.org/jira/browse/YARN-9850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Steven Rand


After 
[https://github.com/apache/hadoop/commit/9d4d30243b0fc9630da51a2c17b543ef671d035c],
 containers launched by the {{DefaultContainerExecutor}} no longer inherit the 
environment of the NodeManager.

I don't object to the commit (I actually prefer the new behavior), but I do 
think that it's a notable breaking change, as people may be relying on 
variables in the NM environment for their containers to behave correctly.

As far as I can tell, we don't currently include this behavior change in the 
release notes for Hadoop 3, and it's a particularly tricky one to track down, 
since there's no JIRA ticket for it.

I think that we should at least include this change in the release notes for 
the 3.0.0 release. Arguably it's worth having the DefaultContainerExecutor set 
{{inheritParentEnv}} to true when it creates its {{ShellCommandExecutor}} since 
that preserves the old behavior and is less surprising to users, but I don't 
feel strongly either way.
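To make the behavior difference concrete, here is a minimal plain-Java 
illustration of process environment inheritance. It uses {{ProcessBuilder}} 
directly rather than the executor code itself, so treat it only as a sketch of 
the semantics:

{code}
import java.io.IOException;

// Minimal illustration of "inheriting the parent environment" vs. not.
// This stands in for container launch; it is not NodeManager code.
public class EnvInheritanceDemo {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Default behavior: the child sees the parent's environment
    // (analogous to the old inherit-NM-env behavior).
    ProcessBuilder inheriting = new ProcessBuilder("printenv");
    inheriting.inheritIO();
    inheriting.start().waitFor();

    // Cleared environment: only explicitly set variables are visible
    // (analogous to launching containers without the parent env).
    ProcessBuilder isolated = new ProcessBuilder("printenv");
    isolated.environment().clear();
    isolated.environment().put("CONTAINER_SPECIFIC_VAR", "value");
    isolated.inheritIO();
    isolated.start().waitFor();
  }
}
{code}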






[jira] [Created] (YARN-9848) revert YARN-4946

2019-09-19 Thread Steven Rand (Jira)
Steven Rand created YARN-9848:
-

 Summary: revert YARN-4946
 Key: YARN-9848
 URL: https://issues.apache.org/jira/browse/YARN-9848
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, resourcemanager
Reporter: Steven Rand


In YARN-4946, we've been discussing a revert due to the potential for keeping 
more applications in the state store than desired, and the potential to greatly 
increase RM recovery times.

 

I'm in favor of reverting the patch, but other ideas along the lines of 
YARN-9571 would work as well.






[jira] [Created] (YARN-8903) when NM becomes unhealthy due to local disk usage, have option to kill application using most space instead of releasing all containers on node

2018-10-17 Thread Steven Rand (JIRA)
Steven Rand created YARN-8903:
-

 Summary: when NM becomes unhealthy due to local disk usage, have 
option to kill application using most space instead of releasing all containers 
on node
 Key: YARN-8903
 URL: https://issues.apache.org/jira/browse/YARN-8903
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.1.1
Reporter: Steven Rand


We sometimes experience an issue in which a single application, usually a Spark 
job, causes at least one node in a YARN cluster to become unhealthy by filling 
up the local dir(s) on that node past the threshold at which the node is 
considered unhealthy.

When this happens, the impact is potentially large depending on what else is 
running on that node, as all containers on that node are lost. Sometimes not 
much else is running on the node and it's fine, but other times we lose AM 
containers from other apps and/or non-AM containers with long-running tasks.

I thought that it would be helpful to add an option (default false) whereby, if 
a node is about to become unhealthy due to full local disk(s), the NM instead 
identifies the application that's using the most local disk space on that node 
and kills that application. (This is roughly analogous to how the OOM killer in 
Linux picks one process to kill rather than letting the machine crash.)

The benefit is that only one application is impacted, and no other application 
loses any containers. This prevents one user's poorly written code that 
shuffles/spills huge amounts of data from negatively impacting other users.

The downside is that we're killing the entire application, not just the task(s) 
responsible for the local disk usage. I believe it's necessary to kill the 
whole application rather than identify the container running the relevant 
task(s), because doing so would require more knowledge of the internal state of 
the aux services responsible for shuffling than YARN has, as far as I 
understand.

If this seems reasonable, I can work on the implementation.
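For illustration only, here's a rough sketch of the selection step, assuming 
the NM could attribute local-disk usage to applications (every name below is 
hypothetical, not an existing YARN API):

{code}
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: pick the single largest local-disk consumer once the
// disk crosses the unhealthy threshold, instead of failing the whole node.
public class LargestLocalDiskConsumer {
  static Optional<String> pickAppToKill(Map<String, Long> bytesUsedPerApp,
                                        long diskBytesUsed,
                                        long unhealthyThresholdBytes) {
    if (diskBytesUsed < unhealthyThresholdBytes) {
      return Optional.empty(); // node still healthy, nothing to do
    }
    // Roughly analogous to the Linux OOM killer: sacrifice one offender.
    return bytesUsedPerApp.entrySet().stream()
        .max(Comparator.comparingLong(Map.Entry::getValue))
        .map(Map.Entry::getKey);
  }

  public static void main(String[] args) {
    System.out.println(pickAppToKill(
        Map.of("application_1_0001", 500L, "application_1_0002", 9000L),
        9500L, 8000L)); // Optional[application_1_0002]
  }
}
{code}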






[jira] [Created] (YARN-7911) Method identifyContainersToPreempt uses ResourceRequest#getRelaxLocality incorrectly

2018-02-08 Thread Steven Rand (JIRA)
Steven Rand created YARN-7911:
-

 Summary: Method identifyContainersToPreempt uses 
ResourceRequest#getRelaxLocality incorrectly
 Key: YARN-7911
 URL: https://issues.apache.org/jira/browse/YARN-7911
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, resourcemanager
Affects Versions: 3.1.0
Reporter: Steven Rand
Assignee: Steven Rand


After YARN-7655, {{identifyContainersToPreempt}} expands the search space to 
all nodes if we had previously considered only a subset of nodes to satisfy a 
{{NODE_LOCAL}} or {{RACK_LOCAL}} RR, were going to preempt AM containers as a 
result, and the RR allowed locality to be relaxed:

{code}
// Don't preempt AM containers just to satisfy local requests if relax
// locality is enabled.
if (bestContainers != null
&& bestContainers.numAMContainers > 0
&& !ResourceRequest.isAnyLocation(rr.getResourceName())
&& rr.getRelaxLocality()) {
  bestContainers = identifyContainersToPreemptForOneContainer(
  scheduler.getNodeTracker().getAllNodes(), rr);
}
{code}

This turns out to be based on a misunderstanding of what 
{{rr.getRelaxLocality}} means. I had believed that it means that locality can 
be relaxed _from_ that level. However, it actually means that locality can be 
relaxed _to_ that level: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java#L450.

For example, suppose we have {{relaxLocality}} set to {{true}} at the node 
level, but {{false}} at the rack and {{ANY}} levels. This says that we cannot 
relax locality to the rack level. However, the current behavior after YARN-7655 
is to interpret relaxLocality being true at the node level as permission to 
satisfy the request elsewhere.

What we should do instead is check whether relaxLocality is enabled for the 
corresponding RR at the next level. So if we're considering a node-level RR, we 
should find the corresponding rack-level RR and check whether relaxLocality is 
enabled for it. And similarly, if we're considering a rack-level RR, we should 
check the corresponding any-level RR.
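To be concrete, here's a self-contained sketch of that corrected check. The 
{{Request}} record and the lookup below are stand-ins invented for 
illustration, not YARN classes:

{code}
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: relaxing *past* a given level is governed by the
// relaxLocality flag on the *next* level's request.
public class RelaxLocalitySketch {
  static final String ANY = "*";

  record Request(int priority, String resourceName, boolean relaxLocality) {}

  static boolean mayRelaxBeyond(Request rr, List<Request> all, String rackOfNode) {
    // Determine the next-coarser level: node -> its rack, rack -> ANY.
    String nextLevel = rr.resourceName().equals(ANY) ? null
        : isRack(rr.resourceName()) ? ANY : rackOfNode;
    if (nextLevel == null) {
      return false; // already at ANY; nothing coarser to relax to
    }
    Optional<Request> next = all.stream()
        .filter(r -> r.priority() == rr.priority()
            && r.resourceName().equals(nextLevel))
        .findFirst();
    return next.map(Request::relaxLocality).orElse(false);
  }

  static boolean isRack(String resourceName) {
    return resourceName.startsWith("/"); // purely illustrative convention
  }

  public static void main(String[] args) {
    List<Request> all = List.of(
        new Request(1, "node1.example.com", true), // node level
        new Request(1, "/rack1", false),           // rack level
        new Request(1, ANY, false));               // ANY level
    // relaxLocality=true on the node-level RR does NOT permit widening the
    // search, because the rack-level RR forbids relaxing to the rack level.
    System.out.println(mayRelaxBeyond(all.get(0), all, "/rack1")); // false
  }
}
{code}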

It may also be better to use {{FSAppAttempt#getAllowedLocalityLevel}} instead 
of explicitly checking {{relaxLocality}}, but I'm not sure which is correct.






[jira] [Created] (YARN-7910) Fix TODO in TestFairSchedulerPreemption#testRelaxLocalityToNotPreemptAM

2018-02-08 Thread Steven Rand (JIRA)
Steven Rand created YARN-7910:
-

 Summary: Fix TODO in 
TestFairSchedulerPreemption#testRelaxLocalityToNotPreemptAM
 Key: YARN-7910
 URL: https://issues.apache.org/jira/browse/YARN-7910
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler, test
Affects Versions: 3.1.0
Reporter: Steven Rand
Assignee: Steven Rand


In YARN-7655, we left a {{TODO}} in the newly added test:

{code}
// TODO (YARN-7655) The starved app should be allocated 4 containers.
// It should be possible to modify the RRs such that this is true
// after YARN-7903.
verifyPreemption(0, 4);
{code}

This JIRA is to track resolving that after YARN-7903 is resolved.






[jira] [Created] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2017-12-13 Thread Steven Rand (JIRA)
Steven Rand created YARN-7655:
-

 Summary: avoid AM preemption caused by RRs for specific nodes or 
racks
 Key: YARN-7655
 URL: https://issues.apache.org/jira/browse/YARN-7655
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.0.0
Reporter: Steven Rand
Assignee: Steven Rand


We frequently see AM preemptions when 
{{starvedApp.getStarvedResourceRequests()}} in 
{{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
that request containers on a specific node. Since this causes us to only 
consider one node to preempt containers on, the really good work that was done 
in YARN-5830 doesn't save us from AM preemption. Even though there might be 
multiple nodes on which we could preempt enough non-AM containers to satisfy 
the app's starvation, we often wind up preempting one or more AM containers on 
the single node that we're considering.

A proposed solution is that if we're going to preempt one or more AM containers 
for an RR that specifies a node or rack, then we should instead expand the 
search space to consider all nodes. That way we take advantage of YARN-5830, 
and only preempt AMs if there's no alternative. I've attached a patch with an 
initial implementation of this. We've been running it on a few clusters, and 
have seen AM preemptions drop from double-digit occurrences on many days to 
zero.

Of course, the tradeoff is some loss of locality, since the starved app is less 
likely to be allocated resources at the most specific locality level that it 
asked for. My opinion is that this tradeoff is worth it, but I'm interested to 
hear what others think as well.






[jira] [Created] (YARN-7391) Consider square root instead of natural log for size-based weight

2017-10-25 Thread Steven Rand (JIRA)
Steven Rand created YARN-7391:
-

 Summary: Consider square root instead of natural log for 
size-based weight
 Key: YARN-7391
 URL: https://issues.apache.org/jira/browse/YARN-7391
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.0.0-beta1
Reporter: Steven Rand


Currently for size-based weight, we compute the weight of an app using this 
code from 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L377:

{code}
  if (sizeBasedWeight) {
// Set weight based on current memory demand
weight = Math.log1p(app.getDemand().getMemorySize()) / Math.log(2);
  }
{code}

Because the natural log function grows slowly, the weights of two apps with 
hugely different memory demands can be quite similar. For example, {{weight}} 
evaluates to 14.3 for an app with a demand of 20 GB, and evaluates to 19.9 for 
an app with a demand of 1000 GB. The app with the much larger demand will still 
have a higher weight, but not by a large amount relative to the sum of those 
weights.

I think it's worth considering a switch to a square root function, which will 
grow more quickly. In the above example, the app with a demand of 20 GB now has 
a weight of 143, while the app with a demand of 1000 GB now has a weight of 
1012. These weights seem more reasonable relative to each other given the 
difference in demand between the two apps.

The above example is admittedly a bit extreme, but I believe that a square root 
function would also produce reasonable results in general.

The code I have in mind would look something like:

{code}
  if (sizeBasedWeight) {
// Set weight based on current memory demand
weight = Math.sqrt(app.getDemand().getMemorySize());
  }
{code}
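As a quick sanity check of the figures above (assuming the demand is expressed 
in MB, which is what the example numbers imply), something like this reproduces 
them:

{code}
// Standalone check of the example weights; not scheduler code.
public class SizeBasedWeightCheck {
  public static void main(String[] args) {
    long small = 20L * 1024;    // 20 GB expressed in MB
    long large = 1000L * 1024;  // 1000 GB expressed in MB
    // Current formula: log1p(demand) / log(2) -> roughly 14.3 and 19.97
    System.out.printf("log:  %.2f vs %.2f%n",
        Math.log1p(small) / Math.log(2), Math.log1p(large) / Math.log(2));
    // Proposed formula: sqrt(demand) -> roughly 143 and 1012
    System.out.printf("sqrt: %.0f vs %.0f%n",
        Math.sqrt(small), Math.sqrt(large));
  }
}
{code}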

Would people be comfortable with this change?






[jira] [Created] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-10-04 Thread Steven Rand (JIRA)
Steven Rand created YARN-7290:
-

 Summary: canContainerBePreempted can return true when it shouldn't
 Key: YARN-7290
 URL: https://issues.apache.org/jira/browse/YARN-7290
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.0.0-beta1
Reporter: Steven Rand


In FSAppAttempt#canContainerBePreempted, we make sure that preempting the given 
container would not put the app below its fair share:

{code}
// Check if the app's allocation will be over its fairshare even
// after preempting this container
Resource usageAfterPreemption = Resources.clone(getResourceUsage());

// Subtract resources of containers already queued for preemption
synchronized (preemptionVariablesLock) {
  Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
}

// Subtract this container's allocation to compute usage after preemption
Resources.subtractFrom(
usageAfterPreemption, container.getAllocatedResource());
return !isUsageBelowShare(usageAfterPreemption, getFairShare());
{code}

However, this only considers one container in isolation, and fails to consider 
containers for the same app that we already added to {{preemptableContainers}} 
in FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have 
a case where we preempt multiple containers from the same app, none of which by 
itself puts the app below fair share, but which cumulatively do so.

I've attached a patch with a test to show this behavior. The flow is:

1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
all the resources (8g and 8vcores)
2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
containers, each of which is 3g and 3vcores in size. At this point both 
greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
3. For the first container requested by starvingApp, we (correctly) preempt 3 
containers from greedyApp, each of which is 1g and 1vcore.
4. For the second container requested by starvingApp, we again (this time 
incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below its 
fair share, but happens anyway because all six times that we call {{return 
!isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the value of 
{{usageAfterPreemption}} is 7g and 7vcores (confirmed using a debugger).

So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
account for containers that we're already planning on preempting in 
FSPreemptionThread#identifyContainersToPreemptOnNode. 
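For what it's worth, here's a self-contained sketch of the accounting I have in 
mind. The {{Resource}} record below is just a memory/vcore pair standing in for 
YARN's {{Resource}}; none of this is existing scheduler code:

{code}
import java.util.List;

// Hypothetical sketch: the usage-after-preemption check must also subtract
// containers already chosen for preemption in the current round, not only
// those previously queued via resourcesToBePreempted.
public class PreemptionAccountingSketch {
  record Resource(long memoryMb, int vcores) {
    Resource minus(Resource other) {
      return new Resource(memoryMb - other.memoryMb, vcores - other.vcores);
    }
    boolean below(Resource share) {
      return memoryMb < share.memoryMb || vcores < share.vcores;
    }
  }

  static boolean canPreempt(Resource usage, Resource fairShare,
                            Resource alreadyQueued,
                            List<Resource> plannedThisRound,
                            Resource candidate) {
    Resource after = usage.minus(alreadyQueued);
    for (Resource planned : plannedThisRound) {
      after = after.minus(planned); // the piece the current check misses
    }
    after = after.minus(candidate);
    return !after.below(fairShare);
  }

  public static void main(String[] args) {
    Resource usage = new Resource(8 * 1024, 8);      // greedyApp's usage
    Resource fairShare = new Resource(4 * 1024, 4);  // greedyApp's fair share
    Resource zero = new Resource(0, 0);
    Resource oneGb = new Resource(1024, 1);
    // Three 1g/1vcore containers planned so far; a 4th still leaves greedyApp
    // at its fair share, so it is allowed.
    System.out.println(canPreempt(usage, fairShare, zero,
        List.of(oneGb, oneGb, oneGb), oneGb));            // true
    // But a 5th would drop greedyApp below fair share and is now rejected.
    System.out.println(canPreempt(usage, fairShare, zero,
        List.of(oneGb, oneGb, oneGb, oneGb), oneGb));     // false
  }
}
{code}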






[jira] [Created] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-07 Thread Steven Rand (JIRA)
Steven Rand created YARN-6960:
-

 Summary: definition of active queue allows idle long-running apps 
to distort fair shares
 Key: YARN-6960
 URL: https://issues.apache.org/jira/browse/YARN-6960
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.0.0-alpha4, 2.8.1
Reporter: Steven Rand
Assignee: Steven Rand


YARN-2026 introduced the notion of only considering active queues when 
computing the fair share of each queue. The definition of an active queue is a 
queue with at least one runnable app:

{code}
  public boolean isActive() {
return getNumRunnableApps() > 0;
  }
{code}

One case that this definition of activity doesn't account for is that of 
long-running applications that scale dynamically. Such an application might 
request many containers when jobs are running, but scale down to very few 
containers, or only the AM container, when no jobs are running.

Even when such an application has scaled down to a negligible amount of demand 
and utilization, the queue that it's in is still considered to be active, which 
defeats the purpose of YARN-2026. For example, consider this scenario:

1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
which have the same weight.
2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
currently have only one container each (the AM).
3. An application in queue {{root.c}} starts, and uses the whole cluster except 
for the small amount in use by {{root.a}} and {{root.b}}.
4. An application in {{root.d}} starts, and has a high enough demand to be able 
to use half of the cluster. Because all four queues are active, the app in 
{{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the 
cluster's resources, while the app in {{root.c}} keeps about 75%.

Ideally in this example, the app in {{root.d}} would be able to preempt the app 
in {{root.c}} up to 50% of the cluster, which would be possible if the idle 
apps in {{root.a}} and {{root.b}} didn't cause those queues to be considered 
active.

One way to address this is to update the definition of an active queue to be a 
queue containing 1 or more non-AM containers. This way if all apps in a queue 
scale down to only the AM, other queues' fair shares aren't affected.
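As a rough sketch of what I mean (the {{AppInfo}} record and its fields below 
are stand-ins for illustration, not FairScheduler classes):

{code}
import java.util.List;

// Hypothetical sketch of the proposed activity check: a queue is active only
// if some runnable app in it holds at least one container besides its AM.
public class ActiveQueueSketch {
  record AppInfo(int liveContainers, boolean amRunning) {}

  static boolean isActive(List<AppInfo> runnableApps) {
    for (AppInfo app : runnableApps) {
      int nonAmContainers = app.liveContainers() - (app.amRunning() ? 1 : 0);
      if (nonAmContainers > 0) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // A long-running app scaled down to just its AM no longer keeps its
    // queue "active", so it stops distorting other queues' fair shares.
    System.out.println(isActive(List.of(new AppInfo(1, true))));  // false
    System.out.println(isActive(List.of(new AppInfo(5, true))));  // true
  }
}
{code}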

The benefit of this approach is that it's quite simple. The downside is that it 
doesn't account for apps that are idle and using almost no resources, but still 
have at least one non-AM container.

There are a couple of other options that seem plausible to me, but they're much 
more complicated, and it seems to me that this proposal makes good progress 
while adding minimal extra complexity.

Does this seem like a reasonable change? I'm certainly open to better ideas as 
well.

Thanks,
Steve






[jira] [Created] (YARN-6956) preemption may only consider resource requests for one node

2017-08-05 Thread Steven Rand (JIRA)
Steven Rand created YARN-6956:
-

 Summary: preemption may only consider resource requests for one 
node
 Key: YARN-6956
 URL: https://issues.apache.org/jira/browse/YARN-6956
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.9.0, 3.0.0-beta1
 Environment: CDH 5.11.0
Reporter: Steven Rand


I'm observing the following series of events on a CDH 5.11.0 cluster, which 
seem to be possible after https://issues.apache.org/jira/browse/YARN-6163:

1. An application is considered to be starved, so {{FSPreemptionThread}} calls 
{{identifyContainersToPreempt}}, and that calls 
{{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
{{ResourceRequest}} instances that are enough to address the app's starvation.

2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
enough to address the app's starvation, so we break out of the loop over 
{{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
 We return only this one {{ResourceRequest}} back to the 
{{identifyContainersToPreempt}} method.

3. It turns out that this particular {{ResourceRequest}} happens to have a 
value for {{getResourceName}} that identifies a specific node in the cluster. 
This causes preemption to consider only containers on that node, and not the 
rest of the cluster.

[~kasha], does that make sense? I'm happy to submit a patch if I'm 
understanding the problem correctly.






[jira] [Resolved] (YARN-6120) add retention of aggregated logs to Timeline Server

2017-03-08 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand resolved YARN-6120.
---
Resolution: Duplicate

I now have the ability to submit a patch for YARN-2985, so this duplicate JIRA 
is unnecessary. 

> add retention of aggregated logs to Timeline Server
> ---
>
> Key: YARN-6120
> URL: https://issues.apache.org/jira/browse/YARN-6120
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, timelineserver
>Affects Versions: 2.7.3
>Reporter: Steven Rand
> Attachments: YARN-6120.001.patch
>
>
> The MR History Server performs retention of aggregated logs for MapReduce 
> applications. However, there is no way of enforcing retention on aggregated 
> logs for other types of applications. This JIRA proposes to add log retention 
> to the Timeline Server.
> Also, this is arguably a duplicate of 
> https://issues.apache.org/jira/browse/YARN-2985, but I could not find a way 
> to attach a patch for that issue. If someone closes this as a duplicate, 
> could you please assign that issue to me?






[jira] [Created] (YARN-6013) ApplicationMasterProtocolPBClientImpl.allocate fails with EOFException when RPC privacy is enabled

2016-12-19 Thread Steven Rand (JIRA)
Steven Rand created YARN-6013:
-

 Summary: ApplicationMasterProtocolPBClientImpl.allocate fails with 
EOFException when RPC privacy is enabled
 Key: YARN-6013
 URL: https://issues.apache.org/jira/browse/YARN-6013
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, yarn
Affects Versions: 2.8.0
Reporter: Steven Rand
Priority: Critical


When privacy is enabled for RPC (hadoop.rpc.protection = privacy), 
{{ApplicationMasterProtocolPBClientImpl.allocate}} sometimes (but not always) 
fails with an EOFException. I've reproduced this with Spark 2.0.2 built against 
latest branch-2.8 and with a simple distcp job on latest branch-2.8.

Steps to reproduce using distcp:

1. Set hadoop.rpc.protection equal to privacy
2. Write data to HDFS. I did this with Spark as follows: 

{code}
sc.parallelize(1 to (5*1024*1024)).map(k => Seq(k, 
org.apache.commons.lang.RandomStringUtils.random(1024, 
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWxyZ0123456789")).mkString("|")).toDF().repartition(100).write.parquet("hdfs:///tmp/testData")
{code}

3. Attempt to distcp that data to another location in HDFS. For example:

{code}
hadoop distcp -Dmapreduce.framework.name=yarn hdfs:///tmp/testData 
hdfs:///tmp/testDataCopy
{code}

I observed this error in the ApplicationMaster's syslog:

{code}
2016-12-19 19:13:50,097 INFO [eventHandlingThread] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer 
setup for JobId: job_1482189777425_0004, File: 
hdfs://:8020/tmp/hadoop-yarn/staging//.staging/job_1482189777425_0004/job_1482189777425_0004_1.jhist
2016-12-19 19:13:51,004 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: 
PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2016-12-19 19:13:51,031 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for 
application_1482189777425_0004: ask=1 release= 0 newContainers=0 
finishedContainers=0 resourcelimit= knownNMs=3
2016-12-19 19:13:52,043 INFO [RMCommunicator Allocator] 
org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking 
ApplicationMasterProtocolPBClientImpl.allocate over null. Retrying after 
sleeping for 3ms.
java.io.EOFException: End of File Exception between local host is: 
"/"; destination host is: "":8030; : 
java.io.EOFException; For more details see:  
http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1486)
at org.apache.hadoop.ipc.Client.call(Client.java:1428)
at org.apache.hadoop.ipc.Client.call(Client.java:1338)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy80.allocate(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:398)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:335)
at com.sun.proxy.$Proxy81.allocate(Unknown Source)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:204)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:735)
at