[jira] [Commented] (YARN-9394) Use new API of RackResolver to get better performance
[ https://issues.apache.org/jira/browse/YARN-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808423#comment-16808423 ] Hadoop QA commented on YARN-9394:
-
| (/) *{color:green}+1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 27s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 24s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 16s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: The patch generated 1 new + 38 unchanged - 0 fixed = 39 total (was 38) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 55s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 26m 1s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 77m 20s{color} | {color:black} {color} |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9394 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12964664/YARN-9394.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 8c164d5bbd4c 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / aaaf856 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/23869/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23869/testReport/ |
| Max. process+thread count | 681 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client |
| Console output | https://builds.apache.
[jira] [Commented] (YARN-9435) Add Opportunistic Scheduler metrics in ResourceManager.
[ https://issues.apache.org/jira/browse/YARN-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808453#comment-16808453 ] Hadoop QA commented on YARN-9435:
-
| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 35s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 34s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 51s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 41s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 36s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 81m 29s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}149m 23s{color} | {color:black} {color} |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9435 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12964662/YARN-9435.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 39875177edb0 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / aaaf856 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0
[jira] [Commented] (YARN-9281) Add express upgrade button to Appcatalog UI
[ https://issues.apache.org/jira/browse/YARN-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808557#comment-16808557 ] Adam Antal commented on YARN-9281:
--
Thanks for taking care of the items, LGTM (non-binding).

> Add express upgrade button to Appcatalog UI
> ---
>
> Key: YARN-9281
> URL: https://issues.apache.org/jira/browse/YARN-9281
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Eric Yang
> Assignee: Eric Yang
> Priority: Major
> Attachments: YARN-9281.001.patch, YARN-9281.002.patch, YARN-9281.003.patch, YARN-9281.004.patch, YARN-9281.005.patch, YARN-9281.006.patch, YARN-9281.007.patch
>
> It would be nice to have the ability to upgrade applications deployed by the Application catalog from the Application catalog UI.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9430) Recovering containers does not check available resources on node
[ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9430:
Assignee: (was: Szilard Nemeth)

> Recovering containers does not check available resources on node
>
> Key: YARN-9430
> URL: https://issues.apache.org/jira/browse/YARN-9430
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Szilard Nemeth
> Priority: Critical
>
> I have a testcase that checks that if some GPU devices have gone offline and recovery happens, only the containers that fit into the node's resources are recovered. Unfortunately, this is not the case: the RM does not check available resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:*
> 1. There are 2 nodes running NodeManagers.
> 2. nvidia-smi is replaced with a fake bash script that initially reports 2 GPU devices per node. This means 4 GPU devices in the cluster altogether.
> 3. RM / NM recovery is enabled.
> 4. The test starts off a sleep job, requesting 4 containers, 1 GPU device for each (the AM does not request GPUs).
> 5. Before restart, the fake bash script is adjusted to report 1 GPU device per node (2 in the cluster) after restart.
> 6. Restart is initiated.
>
> *Expected behavior:*
> After restart, only the AM and 2 normal containers should have been started, as there are only 2 GPU devices in the cluster.
>
> *Actual behavior:*
> AM + 4 containers are allocated; these are all the containers started originally in step 4. 
> App id was: 1553977186701_0001
> *Logs*:
>
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_01 of type RECOVER
> 2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_01 to scheduler from user: systest
> 2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_01 is recovering. Skipping notifying ATTEMPT_ADDED
> 2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_01 State change from NEW to LAUNCHED on event = RECOVER
> 2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_01, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_04, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_04 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, vCores:2, yarn.io/gpu: 1> used and available after allocation
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_05, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_05 of type RECOVER
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_05 Container Transitioned from NEW to RUNNING
> 2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_05 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, vCores:3, yarn.io/gpu: 2> used and available after allocation
> 2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_03, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: > , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,280 DEBUG org.apache.h
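The bug described above boils down to a missing fit check before the scheduler deducts a recovered container's capability from the node's available resources. A minimal sketch of such a guard follows; class and method names are hypothetical illustrations, not the actual AbstractYarnScheduler or FSSchedulerNode code:

```java
import java.util.Map;

// Hypothetical sketch of the missing check in YARN-9430:
// before recovering a container, verify the node still has enough of every resource.
class RecoveryCheck {

    /** Returns true only if every demanded resource fits into what the node has left. */
    static boolean fitsInto(Map<String, Long> available, Map<String, Long> demanded) {
        for (Map.Entry<String, Long> e : demanded.entrySet()) {
            long avail = available.getOrDefault(e.getKey(), 0L);
            if (avail < e.getValue()) {
                return false; // e.g. a GPU container no longer fits after restart
            }
        }
        return true;
    }

    /** Deducts resources only when the container fits; otherwise skips recovery. */
    static boolean tryRecover(Map<String, Long> available, Map<String, Long> demanded) {
        if (!fitsInto(available, demanded)) {
            return false; // with the check in place, over-committed containers are rejected
        }
        for (Map.Entry<String, Long> e : demanded.entrySet()) {
            available.merge(e.getKey(), -e.getValue(), Long::sum);
        }
        return true;
    }
}
```

With the testcase above, the second GPU container on a node that came back with a single GPU would be rejected instead of driving `yarn.io/gpu` negative.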
[jira] [Commented] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold
[ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808563#comment-16808563 ] Szilard Nemeth commented on YARN-9421:
--
As per our further discussion with [~wilfreds]: Let's require a minimum percentage of registered nodes, e.g. 75%, possibly combined with the timeout. For the percentage criterion, we should check whether the NM whitelist file is always present. If we don't have this file or it's empty, we need to drop the percentage criterion and use only the timeout value. This should be as configurable and flexible as possible. Another corner case: what if the whitelist contains more machines than are really available (IP whitelist, etc.)? We could also add the number of nodes to wait for as a third kind of threshold, but this is optional. What we need to do with the applications: park them until we reach the threshold. We need to pay attention to setting an upper limit on the timeout value so users can't accidentally provide a very high value (e.g. 100 minutes). I would define the maximum timeout as 1 minute.

> Implement SafeMode for ResourceManager by defining a resource threshold
> ---
>
> Key: YARN-9421
> URL: https://issues.apache.org/jira/browse/YARN-9421
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Szilard Nemeth
> Priority: Major
> Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
> We have a hypothetical testcase in our test suite that tests Resource Types. The test does the following:
> 1. Sets up a resource named "gpu"
> 2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
> 3. 
It executes a sleep job with resource requests: "-Dmapreduce.reduce.resource.gpu=7" and "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes, we encounter situations where the app submission fails with:
> {code:java}
> 2019-02-25 06:09:56,795 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission failed in validating AM resource request for application application_1551103768202_0001
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[gpu], Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation={code}
> It's clearly visible that the maximum allowed allocation does not have any "gpu" resources.
>
> Looking into the logs further, I realized that sometimes the node with the "gpu" resources is registered after the app is submitted.
> In a real-world situation, and even with this very special test execution, we can't be sure in which order NMs register with the RM.
> With the advent of resource types, this issue became more likely to surface.
> If we have a cluster with some "rare" resources like GPUs only on some nodes out of a 100, we can quickly run into a situation where the NMs with GPUs register later than the normal nodes. While the critical NMs are still registering, we will most likely experience the same InvalidResourceRequestException if we submit jobs requesting GPUs.
> There is a naive solution to this:
> 1. Give the RM some time to wait for NMs to register themselves and put submitted applications on hold. This could work in some situations, but it's not the most flexible solution, as different clusters can have different requirements. 
Of course, we can make this more flexible by making the timeout value configurable.
> *A more flexible alternative would be:*
> 2. We define a threshold of Resource capability: while we haven't reached this threshold, we put submitted jobs on hold. Once we reach the threshold, we let jobs pass through.
> This is very similar to an already existing concept, the SafeMode in HDFS ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
> Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 GPUs.
> Defining a threshold like this, we can ensure most of the submitted jobs won't be lost, just "parked" until NMs are registered.
> The final solution could be the Resource threshold, or the combination of the threshold and timeout value. I'm open to any other suggestions as well.
> *Last but not least, a very easy way to reproduce the issue on a 3-node cluster:*
> 1. Configure a resource type, named 'testres'.
> 2. Node1 runs the RM, Nodes 2/3 run NMs
> 3. Node2 has 1 testres
> 4. Node3 has 0 testres
> 5. St
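The criteria discussed in this thread (a minimum percentage of registered nodes, a fallback when the NM whitelist is missing, and a bounded timeout) can be sketched as a small admission gate. This is a hypothetical illustration only; the class and parameter names are invented and are not actual ResourceManager code:

```java
// Hypothetical sketch of an RM SafeMode admission gate:
// hold submitted apps until enough nodes registered, or until a bounded timeout expires.
class SafeModeGate {
    private final int expectedNodes;       // from the NM whitelist; <= 0 means unknown/empty
    private final double minRegisteredPct; // e.g. 0.75 for "75% of nodes are registered"
    private final long timeoutMs;          // upper-bounded so apps are never parked forever

    SafeModeGate(int expectedNodes, double minRegisteredPct, long timeoutMs) {
        this.expectedNodes = expectedNodes;
        this.minRegisteredPct = minRegisteredPct;
        this.timeoutMs = timeoutMs;
    }

    /** Apps pass once enough nodes have registered, or once the timeout has expired. */
    boolean admitApps(int registeredNodes, long elapsedMs) {
        if (elapsedMs >= timeoutMs) {
            return true; // the timeout always wins, so safemode cannot block indefinitely
        }
        if (expectedNodes <= 0) {
            return false; // whitelist missing or empty: only the timeout criterion applies
        }
        return (double) registeredNodes / expectedNodes >= minRegisteredPct;
    }
}
```

For example, with an 8-node whitelist, a 75% threshold, and a 60-second timeout, apps are parked until 6 nodes have registered or 60 seconds have passed, whichever comes first.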
[jira] [Commented] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold
[ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808567#comment-16808567 ] Szilard Nemeth commented on YARN-9421:
--
[~adam.antal]: Coming back to your corner case: as [~wilfreds] said, this case can happen with any default resource like memory, vcores, etc. Do you still have concerns?
[~eyang]: Thanks for your comments! You are right about the concern that the cluster can change frequently. I haven't mentioned it, but I meant to: I want to use the safemode mechanism only on startup. If we define a low enough timeout value, jobs can't queue up, so we don't use much memory. I agree with you that the safemode concept shouldn't be default behavior, and I never intended it to be: this is definitely planned as an opt-in feature.

> Implement SafeMode for ResourceManager by defining a resource threshold
> ---
>
> Key: YARN-9421
> URL: https://issues.apache.org/jira/browse/YARN-9421
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Szilard Nemeth
> Priority: Major
> Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
> We have a hypothetical testcase in our test suite that tests Resource Types. The test does the following:
> 1. Sets up a resource named "gpu"
> 2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
> 3. It executes a sleep job with resource requests: "-Dmapreduce.reduce.resource.gpu=7" and "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes, we encounter situations where the app submission fails with:
> {code:java}
> 2019-02-25 06:09:56,795 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission failed in validating AM resource request for application application_1551103768202_0001
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. 
Requested resource type=[gpu], Requested > resource=, maximum allowed > allocation=, please note that maximum allowed > allocation is calculated by scheduler based on maximum resource of registered > NodeManagers, which might be less than configured maximum > allocation={code} > It's clearly visible that the maximum allowed allocation does not have any > "gpu" resources. > > Looking into the logs further, I realized that sometimes the node having the > "gpu" resources are registered after the app is submitted. > In a real world situation and even with this very special test exexution, we > can't be sure which order NMs are registering with RM. > With the advent of resource types, this issue was more likely surface. > If we have a cluster with some "rare" resources like GPUs only on some nodes > out of a 100, we can quickly run into a situation when the NMs with GPUs are > registering later than the normal nodes. While the critical NMs are still > registering, we will most likely experience the same > InvalidResourceRequestException if we submit jobs requesting GPUs. > There is a naive solution to this: > 1. Give some time for RM to wait for NMs to be able to register themselves > and put submitted applications on hold. This could work in some situations > but it's not the most flexible solution as different clusters can have > different requirements. Of course, we can make this more flexible by making > the timeout value configurable. > *A more flexible alternative would be:* > 2. We define a threshold of Resource capability: While we haven't reached > this threshold, we put submitted jobs on hold. Once we reached the threshold, > we enable jobs to pass through. > This is very similar to an already existing concept, the SafeMode in HDFS > ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]). > Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 > GPUs. 
> Defining a threshold like this, we can ensure most of the submitted jobs > won't be lost, just "parked" until NMs are registered. > The final solution could be the Resource threshold, or the combination of the > threshold and timeout value. I'm open for any other suggestion as well. > *Last but not least, a very easy way to reproduce the issue on a 3 node > cluster:* > 1. Configure a resource type, named 'testres'. > 2. Node1 runs RM, Node 2/3 runs NMs > 3. Node2 has 1 testres > 4. Node3 has 0 testres > 5. Stop all nodes > 6. Start RM on Node1 > 7. Start NM on Node3 (the one without the resource) > 8. Start a pi job, request 1 testres for the AM > Here's the command to start the job: > {code:java} > MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar > "./share/hadoop/mapreduce/hadoop-mapreduc
[jira] [Comment Edited] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold
[ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808567#comment-16808567 ] Szilard Nemeth edited comment on YARN-9421 at 4/3/19 10:05 AM:
---
[~adam.antal]: Coming back to your corner case: as [~wilfreds] said, this case can happen with any default resource like memory, vcores, etc. Do you still have concerns?
[~eyang]: Thanks for your comments! You are right about the concern that the cluster can change frequently. I haven't mentioned it, but I meant to: I want to use the safemode mechanism only on startup. If we define a low enough timeout value, jobs can't queue up, so we don't use much memory. I agree with you that the safemode concept shouldn't be default behavior, and I never intended it to be: this is definitely planned as an opt-in feature.
Does this answer all of your concerns / questions? I didn't really get the SLA part, sorry.

was (Author: snemeth):
[~adam.antal]: Coming back to your corner case: As [~wilfreds] said: This case can happen with any default resources like memory, vcores, etc. Do you still have concerns? [~eyang]: Thanks for your comments! You are right about the concern of cluster can change frequently. I haven't mentioned but I meant to: I want to use the safemode mechanism only on startup. If we define a low enough timeout value, jobs can't queue up so we don't use much memory. I agree with you as the safemode concept wouldn't be a default behavior and I never wanted to be like that: This is definitely planned as an opt-in feature.

> Implement SafeMode for ResourceManager by defining a resource threshold
> ---
>
> Key: YARN-9421
> URL: https://issues.apache.org/jira/browse/YARN-9421
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Szilard Nemeth
> Priority: Major
> Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
> We have a hypothetical testcase in our test suite that tests Resource Types. The test does the following:
> 1. 
Sets up a resource named "gpu" > 2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu". > 3. It executes a sleep job with resoure requests: > "-Dmapreduce.reduce.resource.gpu=7" and > "-Dyarn.app.mapreduce.am.resource.gpu=11" > Sometimes, we encounter situations when the app submission fails with: > {code:java} > 2019-02-25 06:09:56,795 WARN > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission > failed in validating AM resource request for application > application_1551103768202_0001 > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request! Cannot allocate containers as requested resource is greater > than maximum allowed allocation. Requested resource type=[gpu], Requested > resource=, maximum allowed > allocation=, please note that maximum allowed > allocation is calculated by scheduler based on maximum resource of registered > NodeManagers, which might be less than configured maximum > allocation={code} > It's clearly visible that the maximum allowed allocation does not have any > "gpu" resources. > > Looking into the logs further, I realized that sometimes the node having the > "gpu" resources are registered after the app is submitted. > In a real world situation and even with this very special test exexution, we > can't be sure which order NMs are registering with RM. > With the advent of resource types, this issue was more likely surface. > If we have a cluster with some "rare" resources like GPUs only on some nodes > out of a 100, we can quickly run into a situation when the NMs with GPUs are > registering later than the normal nodes. While the critical NMs are still > registering, we will most likely experience the same > InvalidResourceRequestException if we submit jobs requesting GPUs. > There is a naive solution to this: > 1. Give some time for RM to wait for NMs to be able to register themselves > and put submitted applications on hold. 
This could work in some situations > but it's not the most flexible solution as different clusters can have > different requirements. Of course, we can make this more flexible by making > the timeout value configurable. > *A more flexible alternative would be:* > 2. We define a threshold of Resource capability: While we haven't reached > this threshold, we put submitted jobs on hold. Once we reached the threshold, > we enable jobs to pass through. > This is very similar to an already existing concept, the SafeMode in HDFS > ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]). > Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 > GPUs. > Defining a threshold like this, we can ensure
[jira] [Commented] (YARN-9430) Recovering containers does not check available resources on node
[ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808578#comment-16808578 ] Szilard Nemeth commented on YARN-9430:
--
As per our further discussion with [~wilfreds], we need to check the following further:
1. Verify whether the test we execute works with work-preserving recovery: this is most likely the case (99%). Why does it matter? Because with work-preserving recovery of the NM, we don't kill containers when the NM is killed/stopped; we keep them running instead. That's why the containers are recovered after restart and keep running. As I simulated the GPU disappearing with the "fake nvidia-smi script", the containers can't detect that the GPU device disappeared. We need to come up with a mechanism to simulate a "GPU goes offline" event while the containers are running; one idea is to kill the GPU binary process that the container communicates with, but we definitely need to look into this in more detail. The container should crash and finish in this case.
2. We also need to check simple (non-work-preserving) recovery as well. If the containers are killed on restart and we come back with fewer GPUs, we should still see the issue on the RM side. In the non-work-preserving case, the RM should not allow the containers to start at all, as there are not enough resources for them to start. The application's AM should handle these situations.
*Nevertheless, the testcase pasted in the description should be added to the code, and the RM should not allow any resource to go below zero.* A big fat error log definitely needs to be added to the deduct function mentioned in the description.
[~adam.antal], [~shuzirra], [~bsteinbach]: Anything to add?
> Recovering containers does not check available resources on node > > > Key: YARN-9430 > URL: https://issues.apache.org/jira/browse/YARN-9430 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Priority: Critical > > I have a testcase that checks that if some GPU devices have gone offline and recovery > happens, only the containers that fit into the node's resources will be > recovered. Unfortunately, this is not the case: the RM does not check available > resources on the node during recovery. > *Detailed explanation:* > *Testcase:* > 1. There are 2 nodes running NodeManagers > 2. nvidia-smi is replaced with a fake bash script that reports 2 GPU devices > per node, initially. This means 4 GPU devices in the cluster altogether. > 3. RM / NM recovery is enabled > 4. The test starts off a sleep job, requesting 4 containers, 1 GPU device > for each (AM does not request GPUs) > 5. Before restart, the fake bash script is adjusted to report 1 GPU device > per node (2 in the cluster) after restart. > 6. Restart is initiated. > > *Expected behavior:* > After restart, only the AM and 2 normal containers should have been started, > as there are only 2 GPU devices in the cluster. > > *Actual behavior:* > AM + 4 containers are allocated; these are all the containers originally started > in step 4. > App id was: 1553977186701_0001 > *Logs*: > > {code:java} > 2019-03-30 13:22:30,299 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Processing event for appattempt_1553977186701_0001_01 of type RECOVER > 2019-03-30 13:22:30,366 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Added Application Attempt appattempt_1553977186701_0001_01 to scheduler > from user: systest > 2019-03-30 13:22:30,366 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > appattempt_1553977186701_0001_01 is recovering. 
Skipping notifying > ATTEMPT_ADDED > 2019-03-30 13:22:30,367 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1553977186701_0001_01 State change from NEW to LAUNCHED on > event = RECOVER > 2019-03-30 13:22:33,257 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: > Recovering container [container_e84_1553977186701_0001_01_01, > CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: > , Diagnostics: , ExitStatus: -1000, > NodeLabelExpression: Priority: 0] > 2019-03-30 13:22:33,275 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: > Recovering container [container_e84_1553977186701_0001_01_04, > CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: > , Diagnostics: , ExitStatus: -1000, > NodeLabelExpression: Priority: 0] > 2019-03-30 13:22:33,275 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: > Assigned container container_e84_1553977186701_0001_01_04 of capacity
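The guard and the "big fat error log" suggested above for the deduct function could look roughly like this minimal, self-contained sketch. The class and method names are invented for illustration; the real RM deduction path differs.

```java
// Minimal sketch of "never let a node's available resource go below zero,
// and log loudly if a deduction would do so". Illustrative names only.
public class DeductGuardSketch {
    static long availableGpus = 2; // GPUs the node reports after restart

    static void deduct(long amount) {
        long remaining = availableGpus - amount;
        if (remaining < 0) {
            // The "big fat error log" requested in the comment: recovery is
            // trying to allocate more than the node actually has.
            System.err.println("ERROR: deduction of " + amount
                + " would drive available GPUs to " + remaining
                + "; clamping to zero instead of going negative.");
            remaining = 0;
        }
        availableGpus = remaining;
    }

    public static void main(String[] args) {
        deduct(1); // fine: 2 -> 1
        deduct(3); // would be -2: logged and clamped to 0
        System.out.println(availableGpus);
    }
}
```

In the real fix, the guard would of course also have to reject the recovering container rather than just clamp, but the sketch shows the invariant the testcase in the description asserts: available resources must never go negative.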
[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates
[ https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808714#comment-16808714 ] Szilard Nemeth commented on YARN-9080: -- Hi [~Prabhu Joseph]! Here are my comments: 1. The depth of the nested while / if statements makes the code very hard to read and increases cyclomatic complexity (https://en.wikipedia.org/wiki/Cyclomatic_complexity) First of all, I would extract the logic into some private methods. Essentially, the pseudo-code of the algorithm is this: 1. Loop over the list of files under dirPath 2. If the file is a directory, we should do something with the dir; let's call this "dir1" 3. We loop over the files under "dir1" (bucket1Iter) 4. If the file is a directory and it matches bucket1Regex, we iterate over the files under it (bucket2Iter) 5. If the file matches bucket2Regex, then we have a valid dir 6. If we have files under this dir, we loop over those 7. If we find a directory and it's a valid applicationId, we invoke delete. Please try to come up with something more readable and easier to understand. I would start by extracting the while-loops into methods, then continue until you have reasonably sized chunks of code. 2. I was wondering what the meaning of "clusterts" is, and only realized from the tests that it is clusterTimeStamp. You should either use this latter name or use clusterTs, but I prefer clusterTimeStamp. 3. Please extract the condition of the if-statement into a method from here: {code:java} if ((fs.listStatus(bucket2Path).length != 0) || (now - bucket2Stat.getModificationTime() <= retainMillis)) { {code} Please let me know when you are ready and I will check again, thanks! 
> Bucket Directories as part of ATS done accumulates > -- > > Key: YARN-9080 > URL: https://issues.apache.org/jira/browse/YARN-9080 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, > 0003-YARN-9080.patch, YARN-9080-004.patch, YARN-9080-005.patch, > YARN-9080-006.patch > > > Have observed that older bucket directories (cluster_timestamp, bucket1 and bucket2) > accumulate under the ATS done directory. The cleanLogs part of EntityLogCleaner > removes only the app directories, not the bucket directories. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
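Point 3 of the review above — extracting the if-condition into a named method — could look like the sketch below. The method and parameter names are illustrative, not the actual patch; file-system calls are replaced with plain arguments so the sketch is self-contained.

```java
// Sketch of extracting the retention if-condition into a named method,
// as suggested in the review. Names are illustrative only.
public class RetentionCheckSketch {
    /**
     * A bucket2 directory should be retained when it still has children,
     * or when it was modified too recently (within retainMillis of now).
     */
    static boolean shouldRetain(int childCount, long now, long modTime,
                                long retainMillis) {
        return childCount != 0 || (now - modTime) <= retainMillis;
    }

    public static void main(String[] args) {
        // empty and old enough -> safe to delete
        System.out.println(shouldRetain(0, 10_000L, 1_000L, 5_000L));
        // non-empty -> must be retained
        System.out.println(shouldRetain(3, 10_000L, 1_000L, 5_000L));
    }
}
```

With the condition named, the call site in the cleaner reads as a sentence ("if shouldRetain, skip; else delete"), which is exactly the readability gain the review asks for.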
[jira] [Updated] (YARN-4901) QueueMetrics needs to be cleared before MockRM is initialized
[ https://issues.apache.org/jira/browse/YARN-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan updated YARN-4901: - Summary: QueueMetrics needs to be cleared before MockRM is initialized (was: MockRM should clear the QueueMetrics when it starts) > QueueMetrics needs to be cleared before MockRM is initialized > - > > Key: YARN-4901 > URL: https://issues.apache.org/jira/browse/YARN-4901 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Daniel Templeton >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-4901-001.patch > > > The {{ResourceManager}} rightly assumes that when it starts, it's starting > from naught. The {{MockRM}}, however, violates that assumption. For > example, in {{TestNMReconnect}}, each test method creates a new {{MockRM}} > instance. The {{QueueMetrics.queueMetrics}} field is static, which means > that when multiple {{MockRM}} instances are created, the {{QueueMetrics}} > bleed over. Having the MockRM clear the {{QueueMetrics}} when it starts > should resolve the issue. I haven't looked yet at the scope to see how hard or easy > that is to do.
[jira] [Commented] (YARN-4901) QueueMetrics needs to be cleared before MockRM is initialized
[ https://issues.apache.org/jira/browse/YARN-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808733#comment-16808733 ] Hudson commented on YARN-4901: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16334 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16334/]) YARN-4901. QueueMetrics needs to be cleared before MockRM is (sunilg: rev 002dcc4ebf79bbaa5e603565640d8289991d781f) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java > QueueMetrics needs to be cleared before MockRM is initialized > - > > Key: YARN-4901 > URL: https://issues.apache.org/jira/browse/YARN-4901 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Daniel Templeton >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-4901-001.patch > > > The {{ResourceManager}} rightly assumes that when it starts, it's starting > from naught. The {{MockRM}}, however, violates that assumption. For > example, in {{TestNMReconnect}}, each test method creates a new {{MockRM}} > instance. The {{QueueMetrics.queueMetrics}} field is static, which means > that when multiple {{MockRM}} instances are created, the {{QueueMetrics}} > bleed over. Having the MockRM clear the {{QueueMetrics}} when it starts > should resolve the issue. I haven't looked yet at the scope to see how hard or easy > that is to do.
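The pattern the JIRA describes — clearing static metrics state before each MockRM instance is initialized, so tests don't see values bled over from a previous run — can be illustrated with a self-contained toy. A plain static map stands in for Hadoop's QueueMetrics here; the names are for illustration only.

```java
// Toy illustration of resetting static state between test "RM" instances,
// so metrics from one test do not bleed into the next. A plain static map
// stands in for the static QueueMetrics.queueMetrics field.
import java.util.HashMap;
import java.util.Map;

public class StaticMetricsResetSketch {
    static final Map<String, Long> queueMetrics = new HashMap<>();

    // Analogous to what the patch does before MockRM is initialized.
    static void clearQueueMetrics() {
        queueMetrics.clear();
    }

    public static void main(String[] args) {
        queueMetrics.put("root.default", 5L); // left over from a previous "MockRM"
        clearQueueMetrics();                  // done before the next one starts
        System.out.println(queueMetrics.size());
    }
}
```

The key design point is that because the field is static, it outlives any single RM instance, so the reset has to happen explicitly at instance start rather than relying on construction.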
[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates
[ https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808747#comment-16808747 ] Peter Bacsko commented on YARN-9080: I'd like to join Szilard in suggesting making the code more readable. I can imagine something like:
{code}
RemoteIterator<FileStatus> clustertsIter = list(dirpath);
while (clustertsIter.hasNext()) {
  FileStatus clustertsStat = clustertsIter.next();
  MutableBoolean toBeRemoved = new MutableBoolean();
  MutableBoolean isValid = new MutableBoolean();
  if (clustertsStat.isDirectory()) {
    processClusterTsDir(clustertsStat, toBeRemoved, isValid);
  }
}
...

private void processClusterTsDir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) {
  Path clustertsPath = fs.getPath();
  RemoteIterator<FileStatus> bucket1Iter = list(clustertsPath);
  while (bucket1Iter.hasNext()) {
    FileStatus bucket1Stat = bucket1Iter.next();
    Path bucket1Path = bucket1Stat.getPath();
    if (bucket1Stat.isDirectory()
        && bucket1Path.getName().matches(bucket1Regex)) {
      processBucket1Dir(bucket1Stat, toBeRemoved, isValid);
    }
  }
}

private void processBucket1Dir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) {
  // walk through the directories, check the condition,
  // descend into processBucket2Dir if it's true
  ...
}

private void processBucket2Dir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) {
  // walk through the directories, check the condition,
  // descend into processAppDir if it's true
  ...
}

private void processAppDir(FileStatus fs, MutableBoolean toBeRemoved,
    MutableBoolean isValid) {
  ...
}
{code}
So basically, each time you descend the hierarchy, you enter a new method and pass along the fields that you need later; the changes are then reflected in the outermost call. 
> Bucket Directories as part of ATS done accumulates > -- > > Key: YARN-9080 > URL: https://issues.apache.org/jira/browse/YARN-9080 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, > 0003-YARN-9080.patch, YARN-9080-004.patch, YARN-9080-005.patch, > YARN-9080-006.patch > > > Have observed that older bucket directories (cluster_timestamp, bucket1 and bucket2) > accumulate under the ATS done directory. The cleanLogs part of EntityLogCleaner > removes only the app directories, not the bucket directories.
[jira] [Created] (YARN-9436) Flaky test testApplicationLifetimeMonitor
Peter Bacsko created YARN-9436: -- Summary: Flaky test testApplicationLifetimeMonitor Key: YARN-9436 URL: https://issues.apache.org/jira/browse/YARN-9436 Project: Hadoop YARN Issue Type: Bug Components: scheduler, test Reporter: Peter Bacsko Assignee: Peter Bacsko In our test environment, we occasionally encounter this failure: {noformat} 2019-04-03 12:49:32 [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.535 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor 2019-04-03 12:53:08 [ERROR] testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) Time elapsed: 34.244 s <<< FAILURE! 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before lifetime value 2019-04-03 12:53:08 at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) 2019-04-03 12:53:08 {noformat} The root cause is the condition here: {noformat} Assert.assertTrue("Application killed before lifetime value", totalTimeRun > maxLifetime); {noformat} However, there are two problems with this condition: 1. Logically it's not correct. In fact, since the app should be killed after 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up being 31. 2. 
Sometimes the application is killed fast enough and {{totalTimeRun}} is 30, but this is correct, because in {{setUpCSQueue}} we set the queue lifetime: {noformat} csConf.setMaximumLifetimePerQueue( CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); csConf.setDefaultLifetimePerQueue( CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); {noformat} A more proper condition is: {noformat} Assert.assertTrue("Application killed before lifetime value", totalTimeRun >= maxLifetime); {noformat} The assertion message in the next line is also misleading: {noformat} Assert.assertTrue( "Application killed before lifetime value " + totalTimeRun, totalTimeRun < maxLifetime + 10L); {noformat} If it is false, it means that the application is killed _after_ 40 seconds, which exceeds both the app's lifetime (40s) and that of the queue (30s). {noformat} Assert.assertTrue( "Application killed after queue/app lifetime value: " + totalTimeRun, totalTimeRun < maxLifetime + 10L); {noformat} We can be even stricter, since we expect a kill almost immediately after 30 seconds: {noformat} Assert.assertTrue( "Application killed too late: " + totalTimeRun, totalTimeRun < maxLifetime + 2L); {noformat} where we allow a 2-second tolerance.
[jira] [Updated] (YARN-9435) Add Opportunistic Scheduler metrics in ResourceManager.
[ https://issues.apache.org/jira/browse/YARN-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Modi updated YARN-9435: Attachment: YARN-9435.003.patch > Add Opportunistic Scheduler metrics in ResourceManager. > --- > > Key: YARN-9435 > URL: https://issues.apache.org/jira/browse/YARN-9435 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9435.001.patch, YARN-9435.002.patch, > YARN-9435.003.patch > > > Right now there are no metrics available for Opportunistic Scheduler at > ResourceManager. As part of this jira, we will add metrics like number of > allocated opportunistic containers, released opportunistic containers, node > level allocations, rack level allocations etc. for Opportunistic Scheduler.
[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates
[ https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808801#comment-16808801 ] Prabhu Joseph commented on YARN-9080: - Thanks [~snemeth] and [~pbacsko] for the detailed explanation. Working on it; I will update you. > Bucket Directories as part of ATS done accumulates > -- > > Key: YARN-9080 > URL: https://issues.apache.org/jira/browse/YARN-9080 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, > 0003-YARN-9080.patch, YARN-9080-004.patch, YARN-9080-005.patch, > YARN-9080-006.patch > > > Have observed that older bucket directories (cluster_timestamp, bucket1 and bucket2) > accumulate under the ATS done directory. The cleanLogs part of EntityLogCleaner > removes only the app directories, not the bucket directories.
[jira] [Commented] (YARN-9436) Flaky test testApplicationLifetimeMonitor
[ https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808803#comment-16808803 ] Prabhu Joseph commented on YARN-9436: - [~pbacsko] I think this issue will be fixed by YARN-9404. Can you validate the same. > Flaky test testApplicationLifetimeMonitor > - > > Key: YARN-9436 > URL: https://issues.apache.org/jira/browse/YARN-9436 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > In our test environment, we occasionally encounter this failure: > {noformat} > 2019-04-03 12:49:32 [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, > Time elapsed: 215.535 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] > testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) > Time elapsed: 34.244 s <<< FAILURE! > 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before > lifetime value > 2019-04-03 12:53:08 at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) > 2019-04-03 12:53:08 > {noformat} > The root cause is the condition here: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun > maxLifetime); > {noformat} > However, there are two problems with this condition: > 1. Logically it's not correct. In fact, since the app should be killed after > 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to > some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up > being 31. > 2. 
Sometimes the application is killed fast enough and {{totalTimeRun}} is > 30, but this is correct, because in {{setUpCSQueue}} we set the queue > lifetime: > {noformat} > csConf.setMaximumLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); > csConf.setDefaultLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); > {noformat} > A more proper condition is: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun >= maxLifetime); > {noformat} > The assertion message in the next line is also misleading: > {noformat} > Assert.assertTrue( > "Application killed before lifetime value " + totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > If it is false, it means that the application is killed _after_ 40 seconds, > which exceeds both the app's lifetime (40s) and that of the queue (30s). > {noformat} > Assert.assertTrue( > "Application killed after queue/app lifetime value: " + > totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > We can be even stricter, since we expect a kill almost immediately after > 30 seconds: > {noformat} > Assert.assertTrue( > "Application killed too late: " + totalTimeRun, > totalTimeRun < maxLifetime + 2L); > {noformat} > where we allow a 2-second tolerance.
[jira] [Commented] (YARN-9436) Flaky test testApplicationLifetimeMonitor
[ https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808805#comment-16808805 ] Peter Bacsko commented on YARN-9436: Whoah, thanks [~Prabhu Joseph] - yes it's exactly the same. I'm closing this. > Flaky test testApplicationLifetimeMonitor > - > > Key: YARN-9436 > URL: https://issues.apache.org/jira/browse/YARN-9436 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > In our test environment, we occasionally encounter this failure: > {noformat} > 2019-04-03 12:49:32 [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, > Time elapsed: 215.535 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] > testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) > Time elapsed: 34.244 s <<< FAILURE! > 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before > lifetime value > 2019-04-03 12:53:08 at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) > 2019-04-03 12:53:08 > {noformat} > The root cause is the condition here: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun > maxLifetime); > {noformat} > However, there are two problems with this condition: > 1. Logically it's not correct. In fact, since the app should be killed after > 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to > some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up > being 31. > 2. 
Sometimes the application is killed fast enough and {{totalTimeRun}} is > 30, but this is correct, because in {{setUpCSQueue}} we set the queue > lifetime: > {noformat} > csConf.setMaximumLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); > csConf.setDefaultLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); > {noformat} > A more proper condition is: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun >= maxLifetime); > {noformat} > The assertion message in the next line is also misleading: > {noformat} > Assert.assertTrue( > "Application killed before lifetime value " + totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > If it is false, it means that the application is killed _after_ 40 seconds, > which exceeds both the app's lifetime (40s) and that of the queue (30s). > {noformat} > Assert.assertTrue( > "Application killed after queue/app lifetime value: " + > totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > We can be even stricter, since we expect a kill almost immediately after > 30 seconds: > {noformat} > Assert.assertTrue( > "Application killed too late: " + totalTimeRun, > totalTimeRun < maxLifetime + 2L); > {noformat} > where we allow a 2-second tolerance.
[jira] [Resolved] (YARN-9436) Flaky test testApplicationLifetimeMonitor
[ https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YARN-9436. Resolution: Duplicate > Flaky test testApplicationLifetimeMonitor > - > > Key: YARN-9436 > URL: https://issues.apache.org/jira/browse/YARN-9436 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > In our test environment, we occasionally encounter this failure: > {noformat} > 2019-04-03 12:49:32 [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, > Time elapsed: 215.535 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] > testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) > Time elapsed: 34.244 s <<< FAILURE! > 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before > lifetime value > 2019-04-03 12:53:08 at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) > 2019-04-03 12:53:08 > {noformat} > The root cause is the condition here: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun > maxLifetime); > {noformat} > However, there are two problems with this condition: > 1. Logically it's not correct. In fact, since the app should be killed after > 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to > some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up > being 31. > 2. 
Sometimes the application is killed fast enough and {{totalTimeRun}} is > 30, but this is correct, because in {{setUpCSQueue}} we set the queue > lifetime: > {noformat} > csConf.setMaximumLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); > csConf.setDefaultLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); > {noformat} > A more proper condition is: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun >= maxLifetime); > {noformat} > The assertion message in the next line is also misleading: > {noformat} > Assert.assertTrue( > "Application killed before lifetime value " + totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > If it is false, it means that the application is killed _after_ 40 seconds, > which exceeds both the app's lifetime (40s) and that of the queue (30s). > {noformat} > Assert.assertTrue( > "Application killed after queue/app lifetime value: " + > totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > We can be even stricter, since we expect a kill almost immediately after > 30 seconds: > {noformat} > Assert.assertTrue( > "Application killed too late: " + totalTimeRun, > totalTimeRun < maxLifetime + 2L); > {noformat} > where we allow a 2-second tolerance.
[jira] [Commented] (YARN-9421) Implement SafeMode for ResourceManager by defining a resource threshold
[ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808945#comment-16808945 ] Eric Yang commented on YARN-9421: - [~snemeth] An SLA is a predefined time window in which a program is allowed to run. If resources go away and cause jobs to queue up without running (admins set up cron jobs to automatically restart YARN when the system is down), applications may miss their opportunity to execute by remaining in safe mode for an extended period of time. The proposal is an optional feature and disabled by default, hence my concern is addressed. Thank you > Implement SafeMode for ResourceManager by defining a resource threshold > --- > > Key: YARN-9421 > URL: https://issues.apache.org/jira/browse/YARN-9421 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Szilard Nemeth >Priority: Major > Attachments: client-log.log, nodemanager.log, resourcemanager.log > > > We have a hypothetical testcase in our test suite that tests Resource Types. > The test does the following: > 1. Sets up a resource named "gpu" > 2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu". > 3. It executes a sleep job with resource requests: > "-Dmapreduce.reduce.resource.gpu=7" and > "-Dyarn.app.mapreduce.am.resource.gpu=11" > Sometimes, we encounter situations when the app submission fails with: > {code:java} > 2019-02-25 06:09:56,795 WARN > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission > failed in validating AM resource request for application > application_1551103768202_0001 > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request! Cannot allocate containers as requested resource is greater > than maximum allowed allocation. 
Requested resource type=[gpu], Requested > resource=, maximum allowed > allocation=, please note that maximum allowed > allocation is calculated by scheduler based on maximum resource of registered > NodeManagers, which might be less than configured maximum > allocation={code} > It's clearly visible that the maximum allowed allocation does not have any > "gpu" resources. > > Looking into the logs further, I realized that sometimes the node having the > "gpu" resources is registered after the app is submitted. > In a real-world situation, and even with this very special test execution, we > can't be sure in which order NMs register with the RM. > With the advent of resource types, this issue is more likely to surface. > If we have a cluster with some "rare" resources like GPUs only on some nodes > out of a 100, we can quickly run into a situation where the NMs with GPUs > register later than the normal nodes. While the critical NMs are still > registering, we will most likely experience the same > InvalidResourceRequestException if we submit jobs requesting GPUs. > There is a naive solution to this: > 1. Give the RM some time to wait for NMs to register themselves, > and put submitted applications on hold. This could work in some situations, > but it's not the most flexible solution, as different clusters can have > different requirements. Of course, we can make this more flexible by making > the timeout value configurable. > *A more flexible alternative would be:* > 2. We define a threshold of Resource capability: while we haven't reached > this threshold, we put submitted jobs on hold. Once we have reached the threshold, > we let jobs pass through. > This is very similar to an already existing concept, SafeMode in HDFS > ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]). > Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 > GPUs. 
> Defining a threshold like this, we can ensure most of the submitted jobs > won't be lost, just "parked" until NMs are registered. > The final solution could be the Resource threshold, or the combination of the > threshold and timeout value. I'm open to any other suggestions as well. > *Last but not least, a very easy way to reproduce the issue on a 3-node > cluster:* > 1. Configure a resource type, named 'testres'. > 2. Node1 runs the RM, Nodes 2/3 run NMs > 3. Node2 has 1 testres > 4. Node3 has 0 testres > 5. Stop all nodes > 6. Start RM on Node1 > 7. Start NM on Node3 (the one without the resource) > 8. Start a pi job, requesting 1 testres for the AM > Here's the command to start the job: > {code:java} > MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar > "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" > pi -Dyarn.app.mapreduce.am.resource.testres=1 1 1000;popd{code} > > *Configurations*: > node1: yarn-site.xml of ResourceManager: > {code:java}
[jira] [Commented] (YARN-9435) Add Opportunistic Scheduler metrics in ResourceManager.
[ https://issues.apache.org/jira/browse/YARN-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808951#comment-16808951 ] Hadoop QA commented on YARN-9435: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 23s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 32s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 30s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 13s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 28s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 58s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 83m 30s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}154m 21s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9435 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12964709/YARN-9435.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux a37adc2c1817 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 002dcc4 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23870/testReport/ | | Max. process+thread count
[jira] [Assigned] (YARN-9254) Externalize Solr data storage
[ https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reassigned YARN-9254: --- Assignee: Eric Yang > Externalize Solr data storage > - > > Key: YARN-9254 > URL: https://issues.apache.org/jira/browse/YARN-9254 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9254.001.patch > > > The application catalog contains an embedded Solr. By default, Solr data is stored > in the temp space of the Docker container. For users who would like to persist Solr > data on HDFS, it would be nice to have a way to pass the solr.hdfs.home setting > to the embedded Solr to externalize Solr data storage. This also implies passing > Kerberos credential settings to the Solr JVM in order to access secure HDFS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9254) Externalize Solr data storage
[ https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-9254: Attachment: YARN-9254.001.patch
[jira] [Commented] (YARN-9254) Externalize Solr data storage
[ https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809154#comment-16809154 ] Hadoop QA commented on YARN-9254: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 19s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 0s{color} | {color:red} The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} shelldocs {color} | {color:green} 0m 19s{color} | {color:green} The patch generated 0 new + 104 unchanged - 132 fixed = 104 total (was 236) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 10s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 19s{color} | {color:green} hadoop-yarn-applications-catalog-docker in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 43m 22s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9254 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12964761/YARN-9254.001.patch | | Optional Tests | dupname asflicense mvnsite unit shellcheck shelldocs | | uname | Linux b5df5453a28a 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d797907 | | maven | version: Apache Maven 3.3.9 | | shellcheck | v0.4.6 | | shellcheck | https://builds.apache.org/job/PreCommit-YARN-Build/23871/artifact/out/diff-patch-shellcheck.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23871/testReport/ | | Max. process+thread count | 413 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-docker U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-docker | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23871/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > Externalize Solr data storage > - > > Key: YARN-9254 > URL: https://issues.apache.org/jira/browse/YARN-9254 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9254.001.patch > > > Application catalog contains embedded Solr. By default, Solr data is stored > in temp space of the docker container. For user who likes to persist Solr > data on HDFS, it would be nice to have a way to p
[jira] [Updated] (YARN-9254) Externalize Solr data storage
[ https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-9254: Attachment: YARN-9254.002.patch
[jira] [Commented] (YARN-5670) Add support for Docker image clean up
[ https://issues.apache.org/jira/browse/YARN-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809338#comment-16809338 ] Eric Yang commented on YARN-5670: - In today's YARN Docker meeting, there was consensus on having the Node Manager track LRU images by digest ID and apply a mark-and-sweep algorithm to prune images seen by the node manager. The open concern is still the corner case where locally tagged system admin images can get deleted while the same image is used by a job. Kubernetes tackles the Docker image pruning problem by assuming that the [system does not require human operators to work reliably|https://thenewstack.io/deletion-garbage-collection-kubernetes-objects/]. I think this is a safe assumption, and will wait for [~shaneku...@gmail.com] and [~ebadger] to process this information. > Add support for Docker image clean up > - > > Key: YARN-5670 > URL: https://issues.apache.org/jira/browse/YARN-5670 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: Localization Support For Docker Images_002.pdf > > > Regarding Docker image localization, we also need a way to clean up > old/stale Docker images to save storage space. We may extend the deletion service > to utilize "docker rmi" to do this. > This is related to YARN-3854 and may depend on its implementation. Please > refer to YARN-3854 for Docker image localization details.
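The mark-and-sweep idea from the meeting notes above can be illustrated with a small stand-in. This is not NodeManager code; the class and its fields are hypothetical, and a real implementation would query the container runtime for images in use.

```java
// Sketch of mark-and-sweep image pruning: images referenced by running
// containers are "marked" (spared), and the rest are swept in
// least-recently-used order, keeping the N most recently used.
import java.util.*;

class ImagePruner {
    private final Map<String, Long> lastUsed = new HashMap<>(); // digest -> last-use ts

    void recordUse(String digest, long now) { lastUsed.put(digest, now); }

    /** Returns digests to remove, oldest first, sparing in-use images. */
    List<String> sweep(Set<String> inUse, int keepNewest) {
        List<String> candidates = new ArrayList<>();
        for (String digest : lastUsed.keySet()) {
            if (!inUse.contains(digest)) {       // mark phase: skip images in use
                candidates.add(digest);
            }
        }
        candidates.sort(Comparator.comparingLong(lastUsed::get)); // LRU first
        int n = Math.max(0, candidates.size() - keepNewest);
        return candidates.subList(0, n);         // sweep phase: oldest survivors go
    }
}
```

Note this sketch sidesteps the open concern from the comment: a locally tagged admin image that is also in use by a job would be spared only while the job runs.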
[jira] [Commented] (YARN-9254) Externalize Solr data storage
[ https://issues.apache.org/jira/browse/YARN-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809341#comment-16809341 ] Hadoop QA commented on YARN-9254: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 32s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 19s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 19s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 0s{color} | {color:red} The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} shelldocs {color} | {color:green} 0m 18s{color} | {color:green} The patch generated 0 new + 104 unchanged - 132 fixed = 104 total (was 236) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 15s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 18s{color} | {color:green} hadoop-yarn-applications-catalog-docker in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 19s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 46m 50s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9254 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12964779/YARN-9254.002.patch | | Optional Tests | dupname asflicense mvnsite unit shellcheck shelldocs | | uname | Linux df6fe5d932c3 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 8ff41d6 | | maven | version: Apache Maven 3.3.9 | | shellcheck | v0.4.6 | | shellcheck | https://builds.apache.org/job/PreCommit-YARN-Build/23872/artifact/out/diff-patch-shellcheck.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23872/testReport/ | | Max. process+thread count | 447 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-docker hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: hadoop-yarn-project/hadoop-yarn | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23872/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > Externalize Solr data storage > - > > Key: YARN-9254 > URL: https://issues.apache.org/jira/browse/YARN-9254 > Project: Hado
[jira] [Assigned] (YARN-8466) Add Chaos Monkey unit test framework for feature validation in scale
[ https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesha Vora reassigned YARN-8466: Assignee: Yesha Vora > Add Chaos Monkey unit test framework for feature validation in scale > > > Key: YARN-8466 > URL: https://issues.apache.org/jira/browse/YARN-8466 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Yesha Vora >Priority: Critical > Attachments: YARN-8466.poc.001.patch > > > Currently we don't have such a framework for testing. > We need a framework to do this.
[jira] [Commented] (YARN-9335) [atsv2] Restrict the number of elements held in timeline collector when backend is unreachable for async calls
[ https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809385#comment-16809385 ] Vrushali C commented on YARN-9335: -- Thanks for the patch v3 Abhishek, lgtm. Will commit shortly > [atsv2] Restrict the number of elements held in timeline collector when > backend is unreachable for async calls > -- > > Key: YARN-9335 > URL: https://issues.apache.org/jira/browse/YARN-9335 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vrushali C >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9335.001.patch, YARN-9335.002.patch, > YARN-9335.003.patch > > > For ATSv2 , if the backend is unreachable, the number/size of data held in > timeline collector's memory increases significantly. This is not good for the > NM memory. > Filing jira to set a limit on how many/much should be retained by the > timeline collector in memory in case the backend is not reachable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
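The limit discussed in YARN-9335 amounts to a capped buffer in the collector. The following is an illustrative sketch only, not the committed patch: the class name is hypothetical, and the actual patch may use a different eviction policy.

```java
// Bound the entities held in memory while the backend is unreachable:
// once the cap is hit, drop the oldest entry instead of growing NM memory.
import java.util.ArrayDeque;
import java.util.Deque;

class BoundedEntityBuffer<T> {
    private final int capacity;
    private final Deque<T> buffer = new ArrayDeque<>();

    BoundedEntityBuffer(int capacity) { this.capacity = capacity; }

    /** Add an entity for later async write; evict the oldest at capacity. */
    void add(T entity) {
        if (buffer.size() == capacity) {
            buffer.pollFirst();      // sacrifice oldest telemetry, protect the NM
        }
        buffer.addLast(entity);
    }

    int size() { return buffer.size(); }
}
```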
[jira] [Commented] (YARN-9382) Publish container killed, paused and resumed events to ATSv2.
[ https://issues.apache.org/jira/browse/YARN-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809387#comment-16809387 ] Vrushali C commented on YARN-9382: -- thanks Abhishek, patch v2 looks good. Will commit it shortly > Publish container killed, paused and resumed events to ATSv2. > - > > Key: YARN-9382 > URL: https://issues.apache.org/jira/browse/YARN-9382 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9382.001.patch, YARN-9382.002.patch > > > There are some events missing in container lifecycle. We need to add support > for adding events for when container gets killed, paused and resumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9373) HBaseTimelineSchemaCreator has to allow user to configure pre-splits
[ https://issues.apache.org/jira/browse/YARN-9373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809390#comment-16809390 ] Vrushali C commented on YARN-9373: -- Thanks Prabhu, overall patch v2 looks good. I want to look at it in a bit more detail today. Will update the jira with comments, else will commit it. > HBaseTimelineSchemaCreator has to allow user to configure pre-splits > > > Key: YARN-9373 > URL: https://issues.apache.org/jira/browse/YARN-9373 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: Configurable_PreSplits.png, YARN-9373-001.patch, > YARN-9373-002.patch > > > Most of the TimelineService HBase tables are set with username splits based > on the lowercase alphabet (a,ad,an,b,ca). This won't help if the rowkey > starts with either a number or an uppercase letter. We need to allow users to > configure the splits based upon their data. For example, say a user has configured the > yarn.resourcemanager.cluster-id to be ATS or 123, then the splits can be > configured as A,B,C,,, or 100,200,300,,,
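The configurable pre-split idea above can be sketched as a small parser. This is hypothetical code, not the patch: it only shows turning a comma-separated property value into the `byte[][]` split-key array that HBase's `Admin.createTable` accepts.

```java
// Parse a user-configured split list ("A,B,C" or "100,200,300") into
// ordered split keys, instead of hardcoding lowercase-username splits.
import java.nio.charset.StandardCharsets;

class SplitConfigParser {
    /** e.g. "A,B,C" -> {"A","B","C"} as UTF-8 byte arrays for createTable. */
    static byte[][] parseSplits(String configured) {
        String[] parts = configured.split(",");
        byte[][] splits = new byte[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            splits[i] = parts[i].trim().getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }
}
```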
[jira] [Commented] (YARN-9303) Username splits won't help timelineservice.app_flow table
[ https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809392#comment-16809392 ] Vrushali C commented on YARN-9303: -- +1 to patch v1. I am reviewing the other patch, but this one is correct; will commit shortly > Username splits won't help timelineservice.app_flow table > - > > Key: YARN-9303 > URL: https://issues.apache.org/jira/browse/YARN-9303 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.1.2 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch > > > The timelineservice.app_flow HBase table uses pre-split logic based on username, > whereas the rowkeys start with an inverted timestamp (Long.MAX_VALUE - ts). All > data will go to the last region and the remaining regions will never receive inserts. > Need to choose the right splits or use auto-split.
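A small numeric check makes the claim above concrete. This is an illustration, not HBase code: for any realistic timestamp, the most significant byte of the big-endian inverted timestamp (Long.MAX_VALUE - ts) is 0x7F, which sorts after every lowercase-letter split point ('a'..'z' are 0x61..0x7A), so every row lands in the last region.

```java
// Demonstrates why username splits cannot help a rowkey that begins
// with an inverted timestamp.
class InvertedTimestampDemo {
    /** Most significant byte of the big-endian inverted-timestamp prefix. */
    static int prefixByte(long ts) {
        return (int) ((Long.MAX_VALUE - ts) >>> 56);
    }
}
```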
[jira] [Created] (YARN-9437) RMNodeImpls occupy too much memory and causes RM GC to take a long time
qiuliang created YARN-9437: -- Summary: RMNodeImpls occupy too much memory and causes RM GC to take a long time Key: YARN-9437 URL: https://issues.apache.org/jira/browse/YARN-9437 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.9.1 Reporter: qiuliang Attachments: 1.png, 2.png, 3.png We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of RM memory is occupied by RMNodeImpl instances. Analysis of the RM memory found that each RMNodeImpl occupies approximately 14 MB. The reason is that there are 130,000+ completed containers in each RMNodeImpl that have not been released.
[jira] [Created] (YARN-9438) launchTime not written to state store for running applications
Jonathan Hung created YARN-9438: --- Summary: launchTime not written to state store for running applications Key: YARN-9438 URL: https://issues.apache.org/jira/browse/YARN-9438 Project: Hadoop YARN Issue Type: Bug Reporter: Jonathan Hung Assignee: Jonathan Hung launchTime is only saved to the state store after an application finishes, so if a restart happens, any running applications will have launchTime set to -1 (since this is the default timestamp of the recovery event).
[jira] [Updated] (YARN-9394) Use new API of RackResolver to get better performance
[ https://issues.apache.org/jira/browse/YARN-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated YARN-9394: - Attachment: YARN-9394.003.patch > Use new API of RackResolver to get better performance > - > > Key: YARN-9394 > URL: https://issues.apache.org/jira/browse/YARN-9394 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.0, 3.2.1 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Attachments: YARN-9394.001.patch, YARN-9394.002.patch, > YARN-9394.003.patch > > > After adding a new API in RackResolver YARN-9332, some old callers should > switch to new API to get better performance. As an example, Spark > [YarnAllocator|https://github.com/apache/spark/blob/733f2c0b98208815f8408e36ab669d7c07e3767f/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L361-L363] > for Dynamic Allocation invokes > [https://github.com/apache/hadoop/blob/6fa229891e06eea62cb9634efde755f40247e816/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java#L550] > to resolve racks in a loop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9394) Use new API of RackResolver to get better performance
[ https://issues.apache.org/jira/browse/YARN-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809476#comment-16809476 ] Lantao Jin commented on YARN-9394: -- Attached [^YARN-9394.003.patch] to address the checkstyle issue.
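The performance motivation for YARN-9394 is that callers such as AMRMClientImpl resolve racks one host at a time in a loop. The stand-in below is illustrative only, not the real RackResolver API: it just shows the call-count difference between a per-host lookup and the batch-style lookup introduced by YARN-9332.

```java
// Per-host resolution pays the (expensive) topology lookup once per host;
// batch resolution pays it once per list of hosts.
import java.util.ArrayList;
import java.util.List;

class RackResolveDemo {
    static int topologyLookups = 0;      // counts expensive script/plugin calls

    static String resolve(String host) {            // old style: called in a loop
        topologyLookups++;
        return "/default-rack";
    }

    static List<String> resolve(List<String> hosts) { // batch style
        topologyLookups++;                            // single lookup for the batch
        List<String> racks = new ArrayList<>();
        for (int i = 0; i < hosts.size(); i++) {
            racks.add("/default-rack");
        }
        return racks;
    }
}
```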
[jira] [Commented] (YARN-9303) Username splits won't help timelineservice.app_flow table
[ https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809497#comment-16809497 ] Prabhu Joseph commented on YARN-9303: - Thanks [~vrushalic] for reviewing.
[jira] [Commented] (YARN-9408) @Path("/apps/{appid}/appattempts") error message misleads
[ https://issues.apache.org/jira/browse/YARN-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809525#comment-16809525 ] Vrushali C commented on YARN-9408: -- Hmm, so I am trying to understand this error. Looks like it may be thrown at this line https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase/hadoop-yarn-server-timelineservice-hbase-client/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/reader/AbstractTimelineStorageReader.java#L85 It's because the result set was empty/null. Looking at the code, it is trying to look up the flow context for this app id and does not find anything. I am wondering if catching all NotFoundExceptions is a good idea. Perhaps we can add to the exception message and enhance it rather than printing out a completely new message. > @Path("/apps/{appid}/appattempts") error message misleads > - > > Key: YARN-9408 > URL: https://issues.apache.org/jira/browse/YARN-9408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > Attachments: YARN-9408-001.patch, YARN-9408-002.patch > > > The {code} @Path("/apps/{appid}/appattempts") {code} error message is misleading. > The NotFoundException message "Unable to find the context flow name, and flow run id, and > user id" is displayed when app attempts are looked up. > {code} > [hbase@yarn-ats-3 ~]$ curl -s > "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0004/appattempts?user.name=hbase"; > | jq . 
> { > "exception": "NotFoundException", > "message": "java.lang.Exception: Unable to find the context flow name, and > flow run id, and user id for clusterId=ats, > appId=application_1553258815132_0004", > "javaClassName": "org.apache.hadoop.yarn.webapp.NotFoundException" > } > [hbase@yarn-ats-3 ~]$ curl -s > "http://yarn-ats-3:8198/ws/v2/timeline/clusters/ats/apps/application_1553258815132_0005/appattempts?user.name=hbase"; > | jq . > { > "exception": "NotFoundException", > "message": "java.lang.Exception: Unable to find the context flow name, and > flow run id, and user id for clusterId=ats, > appId=application_1553258815132_0005", > "javaClassName": "org.apache.hadoop.yarn.webapp.NotFoundException" > } > [hbase@yarn-ats-3 ~]$ curl -s > "http://yarn-ats-3:8198/ws/v2/timeline/clusters/ats1/apps/application_1553258815132_0001/containers/container_e14_1553258815132_0001_01_01?user.name=hbase"; > | jq . > { > "exception": "NotFoundException", > "message": "java.lang.Exception: Unable to find the context flow name, and > flow run id, and user id for clusterId=ats1, > appId=application_1553258815132_0001", > "javaClassName": "org.apache.hadoop.yarn.webapp.NotFoundException" > } > [hbase@yarn-ats-3 ~]$ curl -s > "http://yarn-ats-3:8198/ws/v2/timeline/clusters/ats1/apps/application_1553258815132_0001/appattempts/appattempt_1553258815132_0001_01/containers?user.name=hbase"; > | jq . > { > "exception": "NotFoundException", > "message": "java.lang.Exception: Unable to find the context flow name, and > flow run id, and user id for clusterId=ats1, > appId=application_1553258815132_0001", > "javaClassName": "org.apache.hadoop.yarn.webapp.NotFoundException" > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
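A minimal sketch of the suggestion above (augment the original NotFoundException message rather than printing a completely new one); the class and method names here are hypothetical stand-ins chosen only to illustrate the wrapping pattern, not the actual reader code:

```java
public class ErrorMessageSketch {
    static class NotFoundException extends RuntimeException {
        NotFoundException(String msg) { super(msg); }
    }

    // Hypothetical stand-in for the flow-context lookup that fails when the
    // result set comes back empty, as described for AbstractTimelineStorageReader.
    static String lookupFlowContext(String clusterId, String appId) {
        throw new NotFoundException(
            "Unable to find the context flow name, and flow run id, and user id for clusterId="
            + clusterId + ", appId=" + appId);
    }

    static String describeAppAttempts(String clusterId, String appId) {
        try {
            return lookupFlowContext(clusterId, appId);
        } catch (NotFoundException e) {
            // Enhance the original message with the failing operation instead
            // of replacing it with an unrelated (and misleading) one.
            throw new NotFoundException("app attempts query failed: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        try {
            describeAppAttempts("ats", "application_1553258815132_0004");
        } catch (NotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This way the caller still sees which query failed without losing the underlying cause.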
[jira] [Commented] (YARN-9403) GET /apps/{appid}/entities/YARN_APPLICATION accesses application table instead of entity table
[ https://issues.apache.org/jira/browse/YARN-9403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809526#comment-16809526 ] Vrushali C commented on YARN-9403: -- I am not sure I understand the issue correctly. For YARN_APPLICATION entities, they are being written to the application table, no? If so, why do we need to go to the entities table? Is there any information missing in the response that was expected? > GET /apps/{appid}/entities/YARN_APPLICATION accesses application table > instead of entity table > -- > > Key: YARN-9403 > URL: https://issues.apache.org/jira/browse/YARN-9403 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9403-001.patch, YARN-9403-002.patch, > YARN-9403-003.patch, YARN-9403-004.patch > > > {noformat}"GET /apps/{appid}/entities/YARN_APPLICATION"{noformat} accesses > the application table instead of the entity table. As per the doc, with this API, you > can query generic entities identified by cluster ID, application ID and > per-framework entity type. But it also provides all the apps when entityType > is set to YARN_APPLICATION. It should only access the Entity Table through > {{GenericEntityReader}}. > Wrong Output: With YARN_APPLICATION entityType, all applications are listed from > the application table. > {code} > [hbase@yarn-ats-3 centos]$ curl -s > "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/YARN_APPLICATION?user.name=hbase&userid=hbase&flowname=word%20count"; > | jq . 
> [ > { > "metrics": [], > "events": [], > "createdtime": 1553258922721, > "idprefix": 0, > "isrelatedto": {}, > "relatesto": {}, > "info": { > "UID": "ats!application_1553258815132_0002", > "FROM_ID": "ats!hbase!word > count!1553258922721!application_1553258815132_0002" > }, > "configs": {}, > "type": "YARN_APPLICATION", > "id": "application_1553258815132_0002" > }, > { > "metrics": [], > "events": [], > "createdtime": 1553258825918, > "idprefix": 0, > "isrelatedto": {}, > "relatesto": {}, > "info": { > "UID": "ats!application_1553258815132_0001", > "FROM_ID": "ats!hbase!word > count!1553258825918!application_1553258815132_0001" > }, > "configs": {}, > "type": "YARN_APPLICATION", > "id": "application_1553258815132_0001" > } > ] > {code} > Right Output: With correct entity type (MAPREDUCE_JOB) it accesses entity > table for given applicationId and entityType. > {code} > [hbase@yarn-ats-3 centos]$ curl -s > "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/MAPREDUCE_JOB?user.name=hbase&userid=hbase&flowname=word%20count"; > | jq . > [ > { > "metrics": [], > "events": [], > "createdtime": 1553258926667, > "idprefix": 0, > "isrelatedto": {}, > "relatesto": {}, > "info": { > "UID": > "ats!application_1553258815132_0002!MAPREDUCE_JOB!0!job_1553258815132_0002", > "FROM_ID": "ats!hbase!word > count!1553258922721!application_1553258815132_0002!MAPREDUCE_JOB!0!job_1553258815132_0002" > }, > "configs": {}, > "type": "MAPREDUCE_JOB", > "id": "job_1553258815132_0002" > } > ] > {code} > Flow Activity and Flow Run tables can also be accessed using similar way. > {code} > GET /apps/{appid}/entities/YARN_FLOW_ACTIVITY > GET /apps/{appid}/entities/YARN_FLOW_RUN > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
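The fix being discussed amounts to routing on the entity type and query scope; the reader class names below are the real ATSv2 reader names mentioned in the issue, but this dispatch function is only an illustrative sketch of the proposed behavior, not the actual TimelineEntityReaderFactory code:

```java
public class ReaderRoutingSketch {
    // Illustrative: pick which reader (and hence backing table) serves a query.
    static String readerFor(String entityType, boolean appIdScoped) {
        // Under the behavior proposed in YARN-9403, a query scoped to a
        // specific application id always goes to the entity table via the
        // generic reader, even when the caller passes an internal type such
        // as YARN_APPLICATION, YARN_FLOW_RUN, or YARN_FLOW_ACTIVITY.
        if (appIdScoped) {
            return "GenericEntityReader";
        }
        // Unscoped application queries legitimately use the application table.
        if ("YARN_APPLICATION".equals(entityType)) {
            return "ApplicationEntityReader";
        }
        return "GenericEntityReader";
    }

    public static void main(String[] args) {
        System.out.println(readerFor("YARN_APPLICATION", true));  // entity table
        System.out.println(readerFor("YARN_APPLICATION", false)); // application table
    }
}
```

The point is that the app-scoped endpoint should never fan out to the application, flow run, or flow activity tables, regardless of the entity type string supplied.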
[jira] [Commented] (YARN-9382) Publish container killed, paused and resumed events to ATSv2.
[ https://issues.apache.org/jira/browse/YARN-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809527#comment-16809527 ] Vrushali C commented on YARN-9382: -- Hi Abhishek, I am somehow not able to apply the patch (with p0 or p1). Can you check? {code} [tw-mbp13-channapattan hadoop (trunk)]$ git apply -p0 -v ~/Downloads/YARN-9382.002.patch Checking patch a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ContainerMetricsConstants.java => b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ContainerMetricsConstants.java... error: a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ContainerMetricsConstants.java: No such file or directory Checking patch a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java => b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java... error: a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java: No such file or directory Checking patch a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java => b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java... 
error: a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java: No such file or directory [tw-mbp13-channapattan hadoop (trunk)]$ {code} {code} [tw-mbp13-channapattan hadoop (trunk)]$ git apply -p1 -v ~/Downloads/YARN-9382.002.patch Checking patch hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ContainerMetricsConstants.java... Checking patch hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java... Hunk #2 succeeded at 255 (offset -7 lines). error: while searching for: case INIT_CONTAINER: publishContainerCreatedEvent(event); break; default: if (LOG.isDebugEnabled()) { LOG.debug(event.getType() error: patch failed: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java:402 error: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/NMTimelinePublisher.java: patch does not apply Checking patch hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java... 
error: while searching for: import org.apache.hadoop.yarn.server.nodemanager.Context; import org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationContainerFinishedEvent; import org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container; import org.apache.hadoop.yarn.util.ResourceCalculatorProcessTree; import org.junit.Assert; import org.junit.Test; error: patch failed: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java:45 error: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/timelineservice/TestNMTimelinePublisher.java: patch does not apply [tw-mbp13-channapattan hadoop (trunk)]$ {code} > Publish container killed, paused and resumed events to ATSv2. > - > > Key: YARN-9382 > URL: https://issues.apache.org/jira/browse/YARN-9382 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9382.001.patch, YARN-9382.002.patch > > > There are some events missing in container lifecycle. We need to add support > for adding events for when container gets killed, paused and resumed. -- This message was sent by
[jira] [Commented] (YARN-9335) [atsv2] Restrict the number of elements held in timeline collector when backend is unreachable for async calls
[ https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809528#comment-16809528 ] Vrushali C commented on YARN-9335: -- Hi Abhishek, Could you check applying this patch as well? It seems to not work for me. Do you see anything incorrect in my command below: {code} [tw-mbp13-channapattan hadoop (trunk)]$ git apply -p0 ~/Downloads/YARN-9335.003.patch error: a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java: No such file or directory error: a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml: No such file or directory error: a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java: No such file or directory error: a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestTimelineCollector.java: No such file or directory [tw-mbp13-channapattan hadoop (trunk)]$ git apply -p1 ~/Downloads/YARN-9335.003.patch error: patch failed: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java:221 error: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java: patch does not apply [tw-mbp13-channapattan hadoop (trunk)]$ {code} > [atsv2] Restrict the number of elements held in timeline collector when > backend is unreachable for async calls > -- > > Key: YARN-9335 > URL: https://issues.apache.org/jira/browse/YARN-9335 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vrushali C >Assignee: Abhishek Modi >Priority: Major > 
Attachments: YARN-9335.001.patch, YARN-9335.002.patch, > YARN-9335.003.patch > > > For ATSv2 , if the backend is unreachable, the number/size of data held in > timeline collector's memory increases significantly. This is not good for the > NM memory. > Filing jira to set a limit on how many/much should be retained by the > timeline collector in memory in case the backend is not reachable.
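The limit YARN-9335 asks for can be sketched with a size-capped queue that rejects writes once full; this is an assumption about the general approach (drop instead of growing NM memory), not the code in YARN-9335.003.patch:

```java
import java.util.concurrent.ArrayBlockingQueue;

public class BoundedAsyncBufferSketch {
    private final ArrayBlockingQueue<String> pending;

    BoundedAsyncBufferSketch(int capacity) {
        pending = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking put: when the backend is unreachable and the buffer is
    // full, the entity is rejected rather than held in memory indefinitely.
    boolean offerEntity(String entity) {
        return pending.offer(entity);
    }

    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        BoundedAsyncBufferSketch buf = new BoundedAsyncBufferSketch(2);
        System.out.println(buf.offerEntity("e1")); // accepted
        System.out.println(buf.offerEntity("e2")); // accepted
        System.out.println(buf.offerEntity("e3")); // rejected: cap reached
        System.out.println("held: " + buf.size());
    }
}
```

Whether rejected entities are dropped, spilled, or retried is a policy decision the actual patch would have to make; the sketch only shows the bounding itself.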
[jira] [Updated] (YARN-9303) Username splits won't help timelineservice.app_flow table
[ https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-9303: - Fix Version/s: 3.3.0 > Username splits won't help timelineservice.app_flow table > - > > Key: YARN-9303 > URL: https://issues.apache.org/jira/browse/YARN-9303 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.1.2 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: atsv2, atsv2-hbase > Fix For: 3.3.0 > > Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch > > > The timelineservice.app_flow HBase table uses pre-split logic based on username, > whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All > data will go to the last region and the remaining regions will never be inserted > into. Need to choose the right split or use auto-split.
[jira] [Updated] (YARN-9303) Username splits won't help timelineservice.app_flow table
[ https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-9303: - Labels: atsv2 atsv2-hbase (was: atsv2) > Username splits won't help timelineservice.app_flow table > - > > Key: YARN-9303 > URL: https://issues.apache.org/jira/browse/YARN-9303 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.1.2 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: atsv2, atsv2-hbase > Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch > > > The timelineservice.app_flow HBase table uses pre-split logic based on username, > whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All > data will go to the last region and the remaining regions will never be inserted > into. Need to choose the right split or use auto-split.
[jira] [Updated] (YARN-9303) Username splits won't help timelineservice.app_flow table
[ https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-9303: - Labels: atsv2 (was: ) > Username splits won't help timelineservice.app_flow table > - > > Key: YARN-9303 > URL: https://issues.apache.org/jira/browse/YARN-9303 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.1.2 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: atsv2 > Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch > > > The timelineservice.app_flow HBase table uses pre-split logic based on username, > whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All > data will go to the last region and the remaining regions will never be inserted > into. Need to choose the right split or use auto-split.
[jira] [Commented] (YARN-9382) Publish container killed, paused and resumed events to ATSv2.
[ https://issues.apache.org/jira/browse/YARN-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809536#comment-16809536 ] Abhishek Modi commented on YARN-9382: - Thanks Vrushali - let me check at my end. > Publish container killed, paused and resumed events to ATSv2. > - > > Key: YARN-9382 > URL: https://issues.apache.org/jira/browse/YARN-9382 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Abhishek Modi >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9382.001.patch, YARN-9382.002.patch > > > There are some events missing in container lifecycle. We need to add support > for adding events for when container gets killed, paused and resumed.
[jira] [Commented] (YARN-9335) [atsv2] Restrict the number of elements held in timeline collector when backend is unreachable for async calls
[ https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809539#comment-16809539 ] Abhishek Modi commented on YARN-9335: - Thanks [~vrushalic]. I will check at my end. Let me also run complete UTs with patch as I am afraid it can cause some other UT failures as we have made writes async. > [atsv2] Restrict the number of elements held in timeline collector when > backend is unreachable for async calls > -- > > Key: YARN-9335 > URL: https://issues.apache.org/jira/browse/YARN-9335 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vrushali C >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-9335.001.patch, YARN-9335.002.patch, > YARN-9335.003.patch > > > For ATSv2 , if the backend is unreachable, the number/size of data held in > timeline collector's memory increases significantly. This is not good for the > NM memory. > Filing jira to set a limit on how many/much should be retained by the > timeline collector in memory in case the backend is not reachable.
[jira] [Commented] (YARN-3488) AM get timeline service info from RM rather than Application specific configuration.
[ https://issues.apache.org/jira/browse/YARN-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809538#comment-16809538 ] Vrushali C commented on YARN-3488: -- Hi Abhishek, yes I will try to get to this soon. > AM get timeline service info from RM rather than Application specific > configuration. > > > Key: YARN-3488 > URL: https://issues.apache.org/jira/browse/YARN-3488 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications >Reporter: Junping Du >Assignee: Abhishek Modi >Priority: Major > Labels: YARN-5355 > Attachments: YARN-3488.001.patch, YARN-3488.002.patch, > YARN-3488.003.patch > > > Since v1 timeline service, we have MR configuration to enable/disable putting > history event to timeline service. For today's v2 timeline service ongoing > effort, currently we have different methods/structures between v1 and v2 for > consuming TimelineClient, so application have to be aware of which version > timeline service get used there. > There are basically two options here: > First option is as current way in DistributedShell or MR to let application > has specific configuration to point out that if enabling ATS and which > version could be, like: MRJobConfig.MAPREDUCE_JOB_EMIT_TIMELINE_DATA, etc. > The other option is to let application to figure out timeline related info > from YARN/RM, it can be done through registerApplicationMaster() in > ApplicationMasterProtocol with return value for service "off", "v1_on", or > "v2_on". > We prefer the latter option because application owner doesn't have to aware > RM/YARN infrastructure details. Please note that we should keep compatible > (consistent behavior with the same setting) with released configurations.
[jira] [Commented] (YARN-9303) Username splits won't help timelineservice.app_flow table
[ https://issues.apache.org/jira/browse/YARN-9303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809545#comment-16809545 ] Prabhu Joseph commented on YARN-9303: - Thanks [~vrushalic]! > Username splits won't help timelineservice.app_flow table > - > > Key: YARN-9303 > URL: https://issues.apache.org/jira/browse/YARN-9303 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.1.2 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: atsv2, atsv2-hbase > Fix For: 3.3.0 > > Attachments: Only_Last_Region_Used.png, YARN-9303-001.patch > > > The timelineservice.app_flow HBase table uses pre-split logic based on username, > whereas the row keys start with an inverted timestamp (Long.MAX_VALUE - ts). All > data will go to the last region and the remaining regions will never be inserted > into. Need to choose the right split or use auto-split.
[jira] [Created] (YARN-9439) Support asynchronized scheduling mode and multi-node lookup mechanism for app activities
Tao Yang created YARN-9439: -- Summary: Support asynchronized scheduling mode and multi-node lookup mechanism for app activities Key: YARN-9439 URL: https://issues.apache.org/jira/browse/YARN-9439 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Tao Yang Assignee: Tao Yang [Design doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]
[jira] [Updated] (YARN-9437) RMNodeImpls occupy too much memory and causes RM GC to take a long time
[ https://issues.apache.org/jira/browse/YARN-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qiuliang updated YARN-9437: --- Description: We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of RM memory is occupied by RMNodeImpl. Analysis of RM memory found that each RMNodeImpl has approximately 14M. The reason is that there are 130,000+ completed containers in each RMNodeImpl that have not been released. (was: We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of RM memory is occupied by RMNodeImpl. Analysis of RM memory found that each RMNodeImpl has approximately 14M. The reason is that there is a 13W+ completedcontainers in each RMNodeImpl that has not been released.) > RMNodeImpls occupy too much memory and causes RM GC to take a long time > --- > > Key: YARN-9437 > URL: https://issues.apache.org/jira/browse/YARN-9437 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.1 >Reporter: qiuliang >Priority: Blocker > Attachments: 1.png, 2.png, 3.png > > > We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of > RM memory is occupied by RMNodeImpl. Analysis of RM memory found that each > RMNodeImpl has approximately 14M. The reason is that there are 130,000+ > completed containers in each RMNodeImpl that have not been released.
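One common mitigation for the unbounded growth described here is to cap the per-node completed-container history and evict the oldest entries on insert; the snippet below is a generic Java sketch of that idea, not what RMNodeImpl actually does in 2.9.1:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompletedContainerCapSketch {
    // Keeps at most maxEntries completed-container records, evicting the
    // oldest record once the cap is exceeded (insertion-order eviction).
    static <K, V> Map<K, V> boundedHistory(final int maxEntries) {
        return new LinkedHashMap<K, V>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> done = boundedHistory(3);
        for (int i = 0; i < 10; i++) {
            done.put("container_" + i, "COMPLETE");
        }
        // Only the three most recently completed containers remain,
        // so memory per node stays bounded regardless of churn.
        System.out.println(done.keySet());
    }
}
```

A real fix would also have to make sure completed containers are acknowledged and pulled off the node, not merely evicted; the sketch only shows the bounding mechanism.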
[jira] [Updated] (YARN-9439) Support asynchronized scheduling mode and multi-node lookup mechanism for app activities
[ https://issues.apache.org/jira/browse/YARN-9439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9439: --- Attachment: YARN-9439.001.patch > Support asynchronized scheduling mode and multi-node lookup mechanism for app > activities > > > Key: YARN-9439 > URL: https://issues.apache.org/jira/browse/YARN-9439 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9439.001.patch > > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]
[jira] [Created] (YARN-9440) Improve diagnostics for scheduler and app activities
Tao Yang created YARN-9440: -- Summary: Improve diagnostics for scheduler and app activities Key: YARN-9440 URL: https://issues.apache.org/jira/browse/YARN-9440 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Tao Yang Assignee: Tao Yang [Design doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]
[jira] [Updated] (YARN-9440) Improve diagnostics for scheduler and app activities
[ https://issues.apache.org/jira/browse/YARN-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9440: --- Description: [Design doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.cyw6zeehzqmx] (was: [Design doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j] ) > Improve diagnostics for scheduler and app activities > > > Key: YARN-9440 > URL: https://issues.apache.org/jira/browse/YARN-9440 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.cyw6zeehzqmx]
[jira] [Updated] (YARN-9439) Support asynchronized scheduling mode and multi-node lookup mechanism for app activities
[ https://issues.apache.org/jira/browse/YARN-9439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9439: --- Description: [Design doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.m051gyiikx7c] (was: [Design doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j] ) > Support asynchronized scheduling mode and multi-node lookup mechanism for app > activities > > > Key: YARN-9439 > URL: https://issues.apache.org/jira/browse/YARN-9439 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9439.001.patch > > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.m051gyiikx7c]
[jira] [Commented] (YARN-9403) GET /apps/{appid}/entities/YARN_APPLICATION accesses application table instead of entity table
[ https://issues.apache.org/jira/browse/YARN-9403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809557#comment-16809557 ] Prabhu Joseph commented on YARN-9403: - Thanks [~vrushalic] for reviewing this. The rest api {noformat} /apps/{appid}/entities/entityType{noformat} to fetch entities of any particular entityType and an applicationId can be misused to list other apps, flowruns and flow activities. This is not a serious issue since ACLs are available, but it gives a negative impression to the user. Also, the user won't know the internal terms YARN_APPLICATION, YARN_FLOW_RUN, YARN_FLOW_ACTIVITY, and I think they would use it for custom entities. This Jira fixes the negative scenario where it treats any value the user sets as the entityType to fetch from the entities table. {code:java} curl -s "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/YARN_APPLICATION"; curl -s "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/YARN_FLOW_RUN"; curl -s "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/YARN_FLOW_ACTIVITY"; {code} > GET /apps/{appid}/entities/YARN_APPLICATION accesses application table > instead of entity table > -- > > Key: YARN-9403 > URL: https://issues.apache.org/jira/browse/YARN-9403 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9403-001.patch, YARN-9403-002.patch, > YARN-9403-003.patch, YARN-9403-004.patch > > > {noformat}"GET /apps/{appid}/entities/YARN_APPLICATION"{noformat} accesses > the application table instead of the entity table. As per the doc, with this API, you > can query generic entities identified by cluster ID, application ID and > per-framework entity type. But it also provides all the apps when entityType > is set to YARN_APPLICATION. It should only access the Entity Table through > {{GenericEntityReader}}. 
> Wrong Output: With YARN_APPLICATION entityType, all applications listed from > application tables. > {code} > [hbase@yarn-ats-3 centos]$ curl -s > "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/YARN_APPLICATION?user.name=hbase&userid=hbase&flowname=word%20count"; > | jq . > [ > { > "metrics": [], > "events": [], > "createdtime": 1553258922721, > "idprefix": 0, > "isrelatedto": {}, > "relatesto": {}, > "info": { > "UID": "ats!application_1553258815132_0002", > "FROM_ID": "ats!hbase!word > count!1553258922721!application_1553258815132_0002" > }, > "configs": {}, > "type": "YARN_APPLICATION", > "id": "application_1553258815132_0002" > }, > { > "metrics": [], > "events": [], > "createdtime": 1553258825918, > "idprefix": 0, > "isrelatedto": {}, > "relatesto": {}, > "info": { > "UID": "ats!application_1553258815132_0001", > "FROM_ID": "ats!hbase!word > count!1553258825918!application_1553258815132_0001" > }, > "configs": {}, > "type": "YARN_APPLICATION", > "id": "application_1553258815132_0001" > } > ] > {code} > Right Output: With correct entity type (MAPREDUCE_JOB) it accesses entity > table for given applicationId and entityType. > {code} > [hbase@yarn-ats-3 centos]$ curl -s > "http://yarn-ats-3:8198/ws/v2/timeline/apps/application_1553258815132_0002/entities/MAPREDUCE_JOB?user.name=hbase&userid=hbase&flowname=word%20count"; > | jq . > [ > { > "metrics": [], > "events": [], > "createdtime": 1553258926667, > "idprefix": 0, > "isrelatedto": {}, > "relatesto": {}, > "info": { > "UID": > "ats!application_1553258815132_0002!MAPREDUCE_JOB!0!job_1553258815132_0002", > "FROM_ID": "ats!hbase!word > count!1553258922721!application_1553258815132_0002!MAPREDUCE_JOB!0!job_1553258815132_0002" > }, > "configs": {}, > "type": "MAPREDUCE_JOB", > "id": "job_1553258815132_0002" > } > ] > {code} > Flow Activity and Flow Run tables can also be accessed using similar way. 
> {code} > GET /apps/{appid}/entities/YARN_FLOW_ACTIVITY > GET /apps/{appid}/entities/YARN_FLOW_RUN > {code}