[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586675#comment-14586675 ] Yuliya Feldman commented on YARN-3803: -- [~kasha] It happens only if you have a single node (at least in my testing), since the second and later AM attempts will happen on the same node. I was debating whether to make it Major or not; I can change it to Major. I will post a patch with the fix later today. Application hangs after more then one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In the sandbox (single node) environment with LinuxContainerExecutor when first Application Localization attempt fails second attempt can not proceed and subsequently application hangs until RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman resolved YARN-3803. -- Resolution: Not A Problem I apologize for this one. It is not an issue in the branches I mentioned; we just had duplicates handled incorrectly. Application hangs after more then one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single node) environment with LinuxContainerExecutor when first Application Localization attempt fails second attempt can not proceed and subsequently application hangs until RM kills it as non-responding.
[jira] [Updated] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-3803: - Priority: Major (was: Minor) Application hangs after more then one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single node) environment with LinuxContainerExecutor when first Application Localization attempt fails second attempt can not proceed and subsequently application hangs until RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587025#comment-14587025 ] Yuliya Feldman commented on YARN-3803: -- Changed to Major Application hangs after more then one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single node) environment with LinuxContainerExecutor when first Application Localization attempt fails second attempt can not proceed and subsequently application hangs until RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2497) Changes for fair scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14584953#comment-14584953 ] Yuliya Feldman commented on YARN-2497: -- Please take it over Changes for fair scheduler to support allocate resource respect labels -- Key: YARN-2497 URL: https://issues.apache.org/jira/browse/YARN-2497 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Yuliya Feldman -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
Yuliya Feldman created YARN-3803: Summary: Application hangs after more then one localization attempt fails on the same NM Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In the sandbox (single node) environment with LinuxContainerExecutor when first Application Localization attempt fails second attempt can not proceed and subsequently application hangs until RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585370#comment-14585370 ]

Yuliya Feldman commented on YARN-3803:
--

In the LocalizedResource class, the state machine defines the following transitions:

{code}
// From INIT (ref == 0, awaiting req)
.addTransition(ResourceState.INIT, ResourceState.DOWNLOADING,
    ResourceEventType.REQUEST, new FetchResourceTransition())

// From DOWNLOADING (ref > 0, may be localizing)
.addTransition(ResourceState.DOWNLOADING, ResourceState.DOWNLOADING,
    ResourceEventType.REQUEST, new DuplicateFetchResourceTransition())
{code}

So it assumes that if both the from-state and the to-state are _DOWNLOADING_ and the _ResourceEventType_ is _REQUEST_, then the resource is already being downloaded, and the transition becomes _DuplicateFetchResourceTransition_. The problem is that ref is not greater than 0 here, because resources were cleaned up during the first attempt, and we end up in a situation where nothing happens until the RM kills the app.

Application hangs after more then one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In the sandbox (single node) environment with LinuxContainerExecutor when first Application Localization attempt fails second attempt can not proceed and subsequently application hangs until RM kills it as non-responding.
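The hang described in the YARN-3803 comment above can be illustrated with a small toy model. This is only a sketch of the two quoted transitions, not the NodeManager's LocalizedResource: the class `ToyLocalizedResource`, its methods, and the cleanup behaviour (waiters dropped, state left at DOWNLOADING) are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the two quoted transitions; hypothetical, not the real LocalizedResource. */
class ToyLocalizedResource {
    enum State { INIT, DOWNLOADING }

    private State state = State.INIT;
    private final Deque<String> ref = new ArrayDeque<>(); // containers waiting on the resource
    private int fetchesStarted = 0;                       // downloads actually kicked off

    /** REQUEST event: INIT starts a fetch; DOWNLOADING only records the waiter. */
    void request(String containerId) {
        ref.add(containerId);
        if (state == State.INIT) {
            fetchesStarted++;            // stands in for FetchResourceTransition
            state = State.DOWNLOADING;
        }
        // DOWNLOADING + REQUEST: treated as a duplicate, assuming a download is in flight.
    }

    /** Assumed cleanup after a failed attempt: waiters are dropped, state is unchanged. */
    void cleanupAfterFailedAttempt() {
        ref.clear();
    }

    State state() { return state; }
    int refCount() { return ref.size(); }
    int fetchesStarted() { return fetchesStarted; }

    public static void main(String[] args) {
        ToyLocalizedResource r = new ToyLocalizedResource();
        r.request("container_attempt1");
        r.cleanupAfterFailedAttempt();   // ref drops to 0, state stays DOWNLOADING
        r.request("container_attempt2"); // handled as a duplicate, no new fetch
        System.out.println(r.state() + " ref=" + r.refCount() + " fetches=" + r.fetchesStarted());
        // prints "DOWNLOADING ref=1 fetches=1"
    }
}
```

In this model the second attempt is registered as a waiter, but no download is started on its behalf, which matches the symptom of nothing happening until the RM kills the application.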
[jira] [Updated] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-3803: - Affects Version/s: 2.7.0 2.5.1 Application hangs after more then one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In the sandbox (single node) environment with LinuxContainerExecutor when first Application Localization attempt fails second attempt can not proceed and subsequently application hangs until RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more then one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585372#comment-14585372 ] Yuliya Feldman commented on YARN-3803: -- This situation is easily reproducible by running any M/R job as a user with id 500 on a cluster with a single NM using LinuxContainerExecutor. So far, the only solution I have found is to proceed with localization in DuplicateFetchResourceTransition if ref == 0. This solution does not look very clean with respect to the state transitions, but there is otherwise no evidence that the previous container localization failed. I would appreciate comments/thoughts on this. Application hangs after more then one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In the sandbox (single node) environment with LinuxContainerExecutor when first Application Localization attempt fails second attempt can not proceed and subsequently application hangs until RM kills it as non-responding.
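The workaround proposed in the comment above (proceed with localization in the duplicate transition when ref == 0) could look roughly like the following. This is a self-contained toy, not the posted patch: `ToyResourceWithRefCheck` and its methods are hypothetical names, and the cleanup behaviour is assumed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the "fetch again if ref == 0" workaround; hypothetical names throughout. */
class ToyResourceWithRefCheck {
    enum State { INIT, DOWNLOADING }

    private State state = State.INIT;
    private final Deque<String> ref = new ArrayDeque<>(); // containers waiting on the resource
    private int fetchesStarted = 0;                       // downloads actually kicked off

    /** REQUEST event: start a fetch from INIT, or from DOWNLOADING when nobody was waiting. */
    void request(String containerId) {
        boolean noWaiters = ref.isEmpty();   // ref == 0 before this request is recorded
        ref.add(containerId);
        if (state == State.INIT || noWaiters) {
            fetchesStarted++;                // restart localization instead of waiting forever
            state = State.DOWNLOADING;
        }
        // Otherwise a download is assumed to be in flight and this is a true duplicate.
    }

    /** Assumed cleanup after a failed attempt: waiters are dropped, state is unchanged. */
    void cleanupAfterFailedAttempt() {
        ref.clear();
    }

    int refCount() { return ref.size(); }
    int fetchesStarted() { return fetchesStarted; }

    public static void main(String[] args) {
        ToyResourceWithRefCheck r = new ToyResourceWithRefCheck();
        r.request("container_attempt1");
        r.cleanupAfterFailedAttempt();
        r.request("container_attempt2");     // ref was 0, so a new fetch starts
        System.out.println("fetches=" + r.fetchesStarted());
        // prints "fetches=2"
    }
}
```

As the comment itself notes, the check is a heuristic: ref == 0 is used as indirect evidence that the earlier localization failed, since the state machine carries no explicit record of that failure.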
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276156#comment-14276156 ] Yuliya Feldman commented on YARN-2791: -- Thank you, Vinod, for your comment. As [~sdaingade] suggested in his previous comment, let's make this JIRA a subtask of [YARN-2139 | https://issues.apache.org/jira/browse/YARN-2139 ] and do a design discussion and code walkthrough, if necessary, to determine whether it fits into an existing subtask or should continue as its own. Yes, we definitely understand that you opened YARN-2139 first. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Attachments: DiskDriveAsResourceInYARN.pdf Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node.
[jira] [Reopened] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman reopened YARN-2791: -- Assignee: Yuliya Feldman I think this JIRA should be reopened, since https://issues.apache.org/jira/browse/YARN-2817, which was submitted later, covers exactly the same thing. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node.
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200849#comment-14200849 ] Yuliya Feldman commented on YARN-2791: -- [~kasha] I agree that consolidating both and working together is a way to go here. Let's initiate this. What I don't quite understand is how https://issues.apache.org/jira/browse/YARN-2817 is different from this one or from https://issues.apache.org/jira/browse/YARN-2139 Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2493) [YARN-796] API changes for users
[ https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125121#comment-14125121 ] Yuliya Feldman commented on YARN-2493: -- [~wangda] I did not quite understand the purpose of {code} ResourceRequest getAMContainerResourceRequest {code} I can see in the above comment that you are talking about the ability to specify the number of containers in a regular request, unlike in an AMRequest. So is this new method meant to consolidate Priority and Resource into AMContainerResourceRequest? [YARN-796] API changes for users Key: YARN-2493 URL: https://issues.apache.org/jira/browse/YARN-2493 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2493.patch, YARN-2493.patch This JIRA includes API changes for users of YARN-796, like changes in {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common part of YARN-796.
[jira] [Commented] (YARN-2493) [YARN-796] API changes for users
[ https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125132#comment-14125132 ] Yuliya Feldman commented on YARN-2493: -- [~wangda], In this case why do you even bother adding labelExpression to ApplicationSubmissionContext if you plan to have it as part of AMResourceRequest, unless you want to keep it in both places for some reason? [YARN-796] API changes for users Key: YARN-2493 URL: https://issues.apache.org/jira/browse/YARN-2493 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2493.patch, YARN-2493.patch This JIRA includes API changes for users of YARN-796, like changes in {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common part of YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2493) [YARN-796] API changes for users
[ https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125135#comment-14125135 ]

Yuliya Feldman commented on YARN-2493:
--

[~leftnoteasy] OK, so what is the logic going to be?

1. If a label is specified in both ApplicationSubmissionContext and AMResourceRequest, the one from the former will be set for all ResourceRequests except AMResourceRequest (and can be overwritten later on). Is that right?
2. If a label is specified in ApplicationSubmissionContext and not in AMResourceRequest, it will be set for all resource requests, including AMResourceRequest. Is that right?
3. If a label is not specified in ApplicationSubmissionContext but is specified in AMResourceRequest, then what? Same as #2?

[YARN-796] API changes for users Key: YARN-2493 URL: https://issues.apache.org/jira/browse/YARN-2493 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2493.patch, YARN-2493.patch This JIRA includes API changes for users of YARN-796, like changes in {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common part of YARN-796.
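The three cases raised in the YARN-2493 comment above can be written down as one fallback rule. The thread does not confirm the final semantics, so this is only one possible reading: a label on the request wins, otherwise the application-level label applies. The class `LabelResolutionSketch` and the method `effectiveLabel` are hypothetical and are not part of the YARN API; in particular, the outcome for case 3 is an assumption.

```java
import java.util.Optional;

/** One possible reading of the label precedence question; not the YARN API. */
class LabelResolutionSketch {

    /** A label set on the request wins; otherwise fall back to the application-level label. */
    static Optional<String> effectiveLabel(Optional<String> appLabel, Optional<String> requestLabel) {
        return requestLabel.isPresent() ? requestLabel : appLabel;
    }

    public static void main(String[] args) {
        Optional<String> none = Optional.empty();

        // Case 1: both set. The AM request keeps its own label, other requests get the app label.
        System.out.println(effectiveLabel(Optional.of("app"), Optional.of("am"))); // Optional[am]
        System.out.println(effectiveLabel(Optional.of("app"), none));              // Optional[app]

        // Case 2: only the app label is set. It applies to every request, the AM request included.
        System.out.println(effectiveLabel(Optional.of("app"), none));              // Optional[app]

        // Case 3 (assumed): only the AM request is labelled, so other requests stay unlabelled.
        System.out.println(effectiveLabel(none, Optional.of("am")));               // Optional[am]
        System.out.println(effectiveLabel(none, none));                            // Optional.empty
    }
}
```

With two independent inputs there are four set/unset combinations to define, which is the cost of accepting both parameters.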
[jira] [Commented] (YARN-2493) [YARN-796] API changes for users
[ https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125146#comment-14125146 ] Yuliya Feldman commented on YARN-2493: -- [~leftnoteasy] I suspect that #2 will be the most common, since somebody will either set a label for the whole application or set labels for non-AM container resource requests programmatically (kind of #3). I think I saw in the comments the ability to set it separately for mappers and reducers. In my personal opinion, it is better to at least restrict the input parameters to one (not two); otherwise we would need to take care of 4 combinations instead of just 1. In any case, I am fine with the changes. [YARN-796] API changes for users Key: YARN-2493 URL: https://issues.apache.org/jira/browse/YARN-2493 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2493.patch, YARN-2493.patch This JIRA includes API changes for users of YARN-796, like changes in {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common part of YARN-796.
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119945#comment-14119945 ] Yuliya Feldman commented on YARN-796: - [~wangda] It is a great idea to split; otherwise it is getting too big and hard to keep track of. If you feel like assigning some JIRAs to me, feel free, though I guess you are ready to roll. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels.
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119989#comment-14119989 ] Yuliya Feldman commented on YARN-796: - [~wangda] Yep - there are still 3 unassigned JIRAs out of 13 as of this moment. Please assign me YARN-2497. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels.
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120804#comment-14120804 ] Yuliya Feldman commented on YARN-796: - [~wangda]. OK Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14083753#comment-14083753 ] Yuliya Feldman commented on YARN-796: - I am out of the country now with very poor internet connectivity, so I won't be able to answer comprehensively. To: [~ste...@apache.org] I really appreciate your comments, and I definitely agree with the majority of them, especially about how much code it takes to add a single method to the rmadmin command. Maybe we missed something, but it is really too much. Regarding the wrapper on top of LabelManager to behave as a service: in a real-life situation the service is instantiated once per process, which is exactly what we need, as it is really a singleton. But since the unit tests create a service per test, that caused issues with service states. As for waiting 6 seconds between tests to allow the labels file to reload: that can be reduced further. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.1 First patch based on LabelBasedScheduling design document Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Labels: (was: patch) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077458#comment-14077458 ] Yuliya Feldman commented on YARN-796: - Yes, noticed - will repost in a moment Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.2 Patch to comply with svn Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.1) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.2) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.3 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.3 Rebased from trunk Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.3 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.3) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch4 Fixing failed Test, FindBugs and JavaDocs Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070039#comment-14070039 ] Yuliya Feldman commented on YARN-796: - To everybody who has been so involved in providing input over the last couple of days: I can provide support for App, Queue and Queue Label Policy expressions. I also did some performance measurements - with 1000 entries of nodes and their labels it takes about an additional 700 ms to process 1 million requests (hot cache). If we need to reevaluate on every ResourceRequest within an App, performance will go down. This should cover {quote} label-expressions support (AND) only app able to specify a label-expression when making a resource request - kind of (do per application at the moment, not per every resource request) queues to AND augment the label expression with the queue label-expression add support for OR and NOT to label-expressions {quote} As far as {quote} RM has list of valid labels. (hot reloadable) NMs have list of labels. (hot reloadable) {quote} With a file in DFS you can get hot reloadable valid (unless somebody makes a typo) labels on the RM. [~wangda] - How do you want to proceed here? Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
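The "hot cache" point above can be illustrated with a minimal Python sketch. It is not the code from the attached patch: the class name `LabelMatchCache` and its methods are hypothetical, and it assumes an AND-only expression represented as a set of required labels. The idea is that the set of matching nodes is computed once per distinct expression and thrown away only when node labels change, so repeated requests from the same application do not pay for reevaluation.

```python
class LabelMatchCache:
    """Hypothetical cache of 'which nodes satisfy this AND-only expression'."""

    def __init__(self, node_labels):
        # node name -> frozenset of labels on that node
        self._node_labels = {n: frozenset(l) for n, l in node_labels.items()}
        self._cache = {}
        self.evaluations = 0  # counts real (uncached) evaluations

    def set_node_labels(self, node, labels):
        # Labels changed (or a node joined): cached results may be stale.
        self._node_labels[node] = frozenset(labels)
        self._cache.clear()

    def matching_nodes(self, required):
        key = frozenset(required)
        if key not in self._cache:
            self.evaluations += 1
            # A node matches when its labels are a superset of the required set.
            self._cache[key] = frozenset(
                node for node, labels in self._node_labels.items() if key <= labels
            )
        return self._cache[key]


cache = LabelMatchCache({"n1": {"gpu", "linux"}, "n2": {"linux"}})
first = cache.matching_nodes({"linux"})
second = cache.matching_nodes({"linux"})  # served from the cache
```

Clearing the whole cache on any label change is the simplest invalidation strategy; a real scheduler would likely want something finer grained.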
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060258#comment-14060258 ] Yuliya Feldman commented on YARN-796: - 1) {quote} Agree, what I meant is, we need consider performance of 2 things, - Time to evaluate a label expression, IMO we need to add labels in per container level. - If it is important to get headroom or how many nodes can be used for an expression. The easier expression will be easier for us to get result mentioned previously easier. {quote} Regarding the time to evaluate a label expression - we need to get some performance stats on how many ops we can process. I will try to get those performance numbers based on different levels of expression complexity. I did not do anything to include label evaluation in the headroom calculation, so I don't have comments there. 2) bq. Do you have any ideas about what’s the API will like? It can be as simple as yarn rmadmin -loadlabels local_file_path remote_file_path - I am not sure if you mean anything else. 3) bq. I think for different schedulers, we should specify queue related parameters in different configurations. Let’s get more ideas about how to specify queue parameters from community before move ahead. I have some examples in the document for the Fair and Capacity Schedulers. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14059260#comment-14059260 ] Yuliya Feldman commented on YARN-796: - [~leftnoteasy] Thank you for your comments. BTW - your document is a great set of requirements, it was a real pleasure reading it. Please see my answers.
1. Label Expression {quote} Label expression - logical combination of labels (using and, || or, ! not) It seems to me the label expression is too complex here, the expression will be verified when we making scheduling decision to allocate every container. We need consider performance. {quote} I definitely agree that performance needs to be considered here. The application level label expression is not going to change during the application lifetime, so it can be cached. The queue level label expression is going to change only when the Queue label is changed, so it can be cached per queue. So the final expression that matches together the Queue Label, Application Label Expressions and QueueLabelPolicy does not need to be evaluated on every ResourceRequest - again, unless the AppMaster dynamically assigns different labels per request for a container. What probably needs to be evaluated is which nodes satisfy the final/effective LabelExpression, as nodes can come and go and labels on them can change. {quote} Another problem of this is, it will be make harder to calculate headroom of an application or capacity of a queue. {quote} Thank you for pointing this out. I will double check on this. {quote} And it is not so straightforward for user/admin get how many nodes can satisfy a given label expression. {quote} I am sure we can provide an admin API/REST/UI to enter an expression and get the result. {quote} IMHO, we can simply make node labels AND'ed, most scenarios will be covered. It will be easier to eval and user can better understand as well. {quote} Let me understand it better: if an application provides multiple labels they are ANDed, and so only nodes that have the same set of labels or a superset of them will be used?
2. Queue Policy {quote} There're 4 policies mentioned in your proposal. We should reduce the complexity of configuration as much as possible. At least, OR is not so meaningful to me here, do you have any usecase/example on this one? {quote} Consider this as the union of the LabelExpression from the Application and the Queue. So if you have LabelExpression as blue and QueueExpression as yellow, you can allocate containers on the nodes that have either label blue or yellow (if you have some nodes that are not marked as such, they won't be used). This is unlike the case of AND, where you can only run on nodes that are marked as both blue and yellow (a subset). {quote} I think AND should be enough to cover most usecases. {quote}
3. Labels Manager {quote} 3.1 What's process of modifying the node label configuration? Since the file is stored on DFS, does admin modify the configuration on a local file, then upload it to DFS via hadoop fs -copyFromLocal ...? If yes, it will be hard for admin to configure. {quote} Yes - so far this is the procedure. Not sure what is hard here, but we can have some API to do it. {quote} 3.2 We suggest centralized location for node labels such as file stored on DFS that all the YARN daemons What's prospect to make it available to all YARN daemons? I think make it available to RM should be enough here. {quote} Agree - today this file may be only relevant to the RM. If it is stored as a local file or by other means, there is a greater chance for it to be overwritten or lost in the upgrade process.
4. Specify labels in container level {quote} I found you plan to add a labels field in ResourceRequest, and also mentioned by Bc Wong. I think we should support container level, user doesn't have to do it, it will be only used when specify labels at app-level is not enough. {quote} Yes - if Application Level is not enough the user can specify labels on the request level, otherwise not necessarily. Though I can not say we looked very closely at the possibility of setting labels on a more granular level (to address your next comment). {quote} And if we support this, it will be not sufficient to change isBlackListed at AppSchedulingInfo only in scheduler to make fair/capacity scheduler works. We may need to modify implementations of different schedulers. {quote}
5. Label specification for hierarchy queues {quote} We can only support specify labels in leaf queues, in existing scheduler configuration, like user-limit, etc. can be only specified on leaf queue, we can make them consistent. The closest will be used. strategy will potentially cause some configuration issues as well. {quote} Sure, we can make them consistent. Our thought process was that if you have multiple leaf queues that should share the same label/policy you can specify it on the parent level, so you don't need to type more than necessary :)
6. In Considerations part 6.1 bq. If we assume that during life of the application none of those changes can take effect on the application I think we can assume application will not change
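The AND/OR discussion in this comment (blue/yellow example, superset matching, `!` for NOT) can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the evaluator from the attached patch: the operator spellings `&&`, `||`, `!`, the function names, and the policy names "AND" and "OR" are assumptions (the proposal mentions four queue policies; only these two are described in the thread).

```python
import re

# Tokens: &&, ||, !, parentheses, or a label name (assumed character set).
TOKEN = re.compile(r"\s*(&&|\|\||!|\(|\)|[A-Za-z0-9_.-]+)")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}: {expr[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def compile_expression(expr):
    """Compile an expression into a predicate over a node's label set.

    Precedence, lowest to highest: ||, &&, !.
    """
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def unary():
        token = take()
        if token == "!":
            operand = unary()
            return lambda labels: not operand(labels)
        if token == "(":
            inner = or_expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return inner
        if token is None or token in ("&&", "||", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return lambda labels: token in labels

    def and_expr():
        terms = [unary()]
        while peek() == "&&":
            take()
            terms.append(unary())
        return lambda labels: all(term(labels) for term in terms)

    def or_expr():
        terms = [and_expr()]
        while peek() == "||":
            take()
            terms.append(and_expr())
        return lambda labels: any(term(labels) for term in terms)

    predicate = or_expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return predicate


def effective_expression(app_expr, queue_expr, policy):
    """Combine application and queue expressions under a queue policy."""
    if policy == "AND":
        return f"({app_expr}) && ({queue_expr})"
    if policy == "OR":
        return f"({app_expr}) || ({queue_expr})"
    raise ValueError(f"unknown policy {policy!r}")


def matching_nodes(expr, node_labels):
    predicate = compile_expression(expr)
    return sorted(n for n, labels in node_labels.items() if predicate(labels))


NODES = {"n1": {"blue"}, "n2": {"yellow"}, "n3": {"blue", "yellow"}, "n4": set()}
either = matching_nodes(effective_expression("blue", "yellow", "OR"), NODES)
both = matching_nodes(effective_expression("blue", "yellow", "AND"), NODES)
```

With the OR policy, nodes carrying either label qualify and the unlabeled node is skipped; with AND, only the node carrying both labels qualifies. Compiling the expression once into a predicate is what makes per-application or per-queue caching cheap.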
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053885#comment-14053885 ] Yuliya Feldman commented on YARN-796: - [~bcwalrus] Thank you for your comments.
Regarding: {quote} The NM can still periodically refreshes its own labels, and update the RM via the heartbeat mechanism. The RM should also expose a node label report, which is the real-time information of all nodes and their labels. {quote} Yes - you would have a yarn command that shows all the labels in the cluster: yarn rmadmin -showlabels
Regarding: {quote} 2. Labels are per-container, not per-app. Right? The doc keeps mentioning application label, ApplicationLabelExpression, etc. Should those be container label instead? I just want to confirm that each container request can carry its own label expression. Example use case: Only the mappers need GPU, not the reducers. {quote} The proposal here is to have labels per application, not per container, though it is not that hard to specify a label per container (or rather per Request). There are pros and cons for both (per container and per app): pros for App - the only place to setLabel is ApplicationSubmissionContext; cons for App - as you said, you want one configuration for Mappers and another for Reducers; cons for container level labels - every application that wants to take advantage of the labels will have to code it in its AppMaster while creating ResourceRequests.
Regarding: {quote} The proposal uses regexes on FQDN, such as perfnode.*. {quote} The file with labels does not need to contain regexes for FQDN, since it will be based solely on what hostname is used in the isBlackListed() method. But I am surely open to suggestions to get labels from nodes, as long as it is not a high burden on the Cluster Admin who needs to specify labels per node on the node.
Regarding: {quote} Can we fail container requests with no satisfying nodes? {quote} I think it would be the same behavior as for any other Request that can not be satisfied because queues were set up incorrectly, or because there is no free resource available at the moment. How would you differentiate between those cases?
Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
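The per-application versus per-request trade-off discussed above can be sketched as a simple fallback rule. This Python fragment is purely illustrative: `AppContext`, `Request` and `effective_label` are hypothetical names, not YARN classes, and the rule "a request-level label overrides the application-level one" is an assumption about how the two levels might be reconciled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppContext:
    """Stand-in for the application-level submission context (hypothetical)."""
    name: str
    label_expression: Optional[str] = None


@dataclass
class Request:
    """Stand-in for a single resource request (hypothetical)."""
    task: str
    label_expression: Optional[str] = None


def effective_label(app: AppContext, request: Request) -> Optional[str]:
    # Assumed rule: a label set on the request wins; otherwise the
    # application-level label applies to every request of that app.
    if request.label_expression is not None:
        return request.label_expression
    return app.label_expression


app = AppContext("wordcount", label_expression="linux")
mapper = Request("map", label_expression="gpu")  # only the mappers need GPU
reducer = Request("reduce")                      # falls back to the app label
```

Under this rule an application that never sets request-level labels needs no AppMaster changes, while the mapper/reducer use case remains expressible.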
[jira] [Created] (YARN-2253) Label Based Scheduling
Yuliya Feldman created YARN-2253: Summary: Label Based Scheduling Key: YARN-2253 URL: https://issues.apache.org/jira/browse/YARN-2253 Project: Hadoop YARN Issue Type: Task Components: resourcemanager, scheduler Reporter: Yuliya Feldman Based on the descriptions of YARN-796 and SLIDER-81 it would be beneficial to provide label based scheduling where applications can be scheduled on a subset of nodes based on logical label expressions that can be specified on Queues and during application submissions -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2253) Label Based Scheduling
[ https://issues.apache.org/jira/browse/YARN-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-2253: - Attachment: LabelBasedScheduling.pdf Proposed design for Label Based Scheduling Label Based Scheduling -- Key: YARN-2253 URL: https://issues.apache.org/jira/browse/YARN-2253 Project: Hadoop YARN Issue Type: Task Components: resourcemanager, scheduler Reporter: Yuliya Feldman Attachments: LabelBasedScheduling.pdf Based on the descriptions of YARN-796 and SLIDER-81 it would be beneficial to provide label based scheduling where applications can be scheduled on a subset of nodes based on logical label expressions that can be specified on Queues and during application submissions -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2253) Label Based Scheduling
[ https://issues.apache.org/jira/browse/YARN-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14052695#comment-14052695 ] Yuliya Feldman commented on YARN-2253: -- Can somebody with authority please help me assign this JIRA to myself? Thanks. I would love to hear comments on the design. Label Based Scheduling -- Key: YARN-2253 URL: https://issues.apache.org/jira/browse/YARN-2253 Project: Hadoop YARN Issue Type: Task Components: resourcemanager, scheduler Reporter: Yuliya Feldman Attachments: LabelBasedScheduling.pdf Based on the descriptions of YARN-796 and SLIDER-81 it would be beneficial to provide label based scheduling where applications can be scheduled on a subset of nodes based on logical label expressions that can be specified on Queues and during application submissions -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: LabelBasedScheduling.pdf Since YARN-2253 was closed as a duplicate, I am adding the proposal here. It would be beneficial to provide label based scheduling where applications can be scheduled on a subset of nodes based on logical label expressions that can be specified on Queues and during application submissions Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)