[jira] [Commented] (YUNIKORN-400) Add doc about the helm chart install options
[ https://issues.apache.org/jira/browse/YUNIKORN-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199253#comment-17199253 ]

Kinga Marton commented on YUNIKORN-400:
---------------------------------------

I think that if we want to add documentation, the information from the chart README is enough, so I would add that information to the docs as well. [~wwei], is that enough, or did you have something else in mind?

[~wilfreds], I will check the maintainers in the index.yaml file.

> Add doc about the helm chart install options
> --------------------------------------------
>
>          Key: YUNIKORN-400
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-400
>      Project: Apache YuniKorn
>   Issue Type: Sub-task
>   Components: documentation
>     Reporter: Weiwei Yang
>     Assignee: Kinga Marton
>     Priority: Major
>
> Submit a PR to the master branch of
> https://github.com/apache/incubator-yunikorn-site.
> If you find anything in the README that is not clear enough and you have
> problems making changes, please let me know. Thanks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Assigned] (YUNIKORN-335) Invalid config validation and config schema checks for unsupported config properties
[ https://issues.apache.org/jira/browse/YUNIKORN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kinga Marton reassigned YUNIKORN-335:
-------------------------------------

    Assignee: Kinga Marton

> Invalid config validation and config schema checks for unsupported config properties
> ------------------------------------------------------------------------------------
>
>          Key: YUNIKORN-335
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-335
>      Project: Apache YuniKorn
>   Issue Type: Bug
>   Components: core - scheduler
>     Reporter: Ayub Pathan
>     Assignee: Kinga Marton
>     Priority: Major
>
> * Invalid configuration does not error out - it is accepted silently, but the config is not reloaded.
> 4 spaces & 2 spaces - we should error out if 2 spaces is the standard.
> {noformat}
> partitions:
>   -
>     name: default
>     placementrules:
>       - name: tag
>         value: namespace
>         create: true
>     queues:
>       - name: root
>         submitacl: '*'
>         properties:
>           application.sort.policy: stateaware
> {noformat}
> * Any unsupported configuration should not be allowed in the policy map - some kind of schema check is needed.
> {noformat}
> partitions:
>   -
>     name: default
>     placementrules:
>       - name: tag
>         value: namespace
>         create: true
>     queues:
>       - name: root
>         submitacl: '*'
>         sample: value1
>         properties:
>           application.sort.policy: stateaware
>           sample: value2
> {noformat}
>
> *Impact: the user thinks the config is loaded and consumed by YuniKorn, but it is actually not. So it is important to error out in case of any formatting issues.*
> [~wilfreds]
[jira] [Commented] (YUNIKORN-335) Invalid config validation and config schema checks for unsupported config properties
[ https://issues.apache.org/jira/browse/YUNIKORN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199345#comment-17199345 ]

Kinga Marton commented on YUNIKORN-335:
---------------------------------------

[~ayubpathan] I added a semantic check to the validation. Regarding the spaces: I tested it, and the config is accepted with a mixed number of spaces, so I don't think we should be so strict as to error out over some extra spaces as long as the file is valid YAML.

> Invalid config validation and config schema checks for unsupported config properties
>          Key: YUNIKORN-335
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-335
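The semantic check mentioned above lives in the YuniKorn core (written in Go); as a rough illustration of the idea only, here is a minimal Python sketch (hypothetical helper and key names, not the actual YuniKorn code) that walks an already-parsed queue config and reports keys that are outside an allowed schema:

```python
# Hypothetical sketch of a schema check for unknown keys in a parsed
# queue configuration; NOT the actual YuniKorn implementation.

ALLOWED_QUEUE_KEYS = {"name", "submitacl", "properties", "resources", "queues"}

def find_unknown_keys(queue, path="root"):
    """Recursively collect (path, key) pairs for keys outside the schema."""
    unknown = []
    for key, value in queue.items():
        if key not in ALLOWED_QUEUE_KEYS:
            unknown.append((path, key))
        elif key == "queues":
            for child in value:
                unknown.extend(
                    find_unknown_keys(child, f"{path}.{child.get('name', '?')}"))
    return unknown

# The 'sample' keys from this issue's example would be flagged:
config = {
    "name": "root",
    "submitacl": "*",
    "sample": "value1",  # unsupported key, should be rejected
    "queues": [{"name": "default", "typo": "x"}],
}
print(find_unknown_keys(config))  # [('root', 'sample'), ('root.default', 'typo')]
```

With a check like this, a config containing unsupported keys can be rejected up front instead of being silently accepted.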
[jira] [Commented] (YUNIKORN-335) Invalid config validation and config schema checks for unsupported config properties
[ https://issues.apache.org/jira/browse/YUNIKORN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199346#comment-17199346 ]

Kinga Marton commented on YUNIKORN-335:
---------------------------------------

Also, the YAML specification says that the amount of indentation is only a presentation detail: https://yaml.org/spec/1.2/spec.html#id2777534

> Invalid config validation and config schema checks for unsupported config properties
>          Key: YUNIKORN-335
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-335
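To illustrate the point about indentation being a presentation detail, the two documents below differ only in indentation width yet parse to exactly the same structure (a made-up minimal fragment, not taken from the issue):

```yaml
# Two-space indentation
queues:
  - name: root
    submitacl: '*'
---
# Same document, four-space indentation: any YAML parser
# produces an identical structure for both.
queues:
    - name: root
      submitacl: '*'
```

So rejecting a config purely for using a different (but consistent and valid) indentation width would reject documents that YAML itself considers equivalent.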
[jira] [Assigned] (YUNIKORN-351) Support validate conf to reject the config with 2 top level queues
[ https://issues.apache.org/jira/browse/YUNIKORN-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kinga Marton reassigned YUNIKORN-351:
-------------------------------------

    Assignee: Kinga Marton

> Support validate conf to reject the config with 2 top level queues
> ------------------------------------------------------------------
>
>          Key: YUNIKORN-351
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-351
>      Project: Apache YuniKorn
>   Issue Type: Bug
>   Components: core - scheduler
>     Reporter: Ayub Pathan
>     Assignee: Kinga Marton
>     Priority: Critical
>
> The admission controller should reject a config with two top-level queues, something like the one below.
> {noformat}
> queues.yaml:
>
> partitions:
>   -
>     name: default
>     placementrules:
>       - name: tag
>         value: namespace
>         create: true
>     queues:
>       - name: root
>         submitacl: '*'
>       - name: queue1
>         resources:
>           guaranteed:
>             memory: 300
>             cpu: 300
>           max:
>             memory: 2024
>             cpu: 2000
> {noformat}
> YK reads this config and creates queue1 as a child of the first top-level queue.
[jira] [Commented] (YUNIKORN-351) Support validate conf to reject the config with 2 top level queues
[ https://issues.apache.org/jira/browse/YUNIKORN-351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199926#comment-17199926 ]

Kinga Marton commented on YUNIKORN-351:
---------------------------------------

[~wilfreds] pointed out that this is the expected and documented behaviour: http://yunikorn.apache.org/docs/next/user_guide/queue_config

{code}
It can have a root queue defined but it is not a required element. If the
root queue is not defined the configuration parsing will insert the root
queue for consistency. The insertion of the root queue is triggered by:
* If the configuration has more than one queue defined at the top level a
  root queue is inserted.
* If there is only one queue defined at the top level and it is not called
  root a root queue is inserted.
{code}

> Support validate conf to reject the config with 2 top level queues
>          Key: YUNIKORN-351
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-351
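The documented root-queue insertion rules quoted above can be sketched roughly as follows (an illustrative Python sketch under stated assumptions; the real logic is in the YuniKorn core's Go configuration parser):

```python
# Sketch of the documented root-queue insertion rules; illustrative only,
# NOT the actual YuniKorn parser code.

def insert_root_queue(top_level_queues):
    """Wrap the configured top-level queues under a root queue when needed."""
    if len(top_level_queues) == 1 and top_level_queues[0]["name"] == "root":
        return top_level_queues[0]  # root already present, nothing to insert
    # More than one top-level queue, or a single queue not called root:
    # a root queue is inserted and the configured queues become its children.
    return {"name": "root", "queues": top_level_queues}

# Two top-level queues: both end up as children of an inserted root,
# which is why the config in this issue is not rejected.
cfg = [{"name": "default"}, {"name": "queue1"}]
result = insert_root_queue(cfg)
print(result["name"])                 # root
print([q["name"] for q in result["queues"]])  # ['default', 'queue1']
```

This is why the behaviour reported in the issue is by design rather than a bug: the parser normalizes the hierarchy instead of rejecting it.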
[jira] [Resolved] (YUNIKORN-351) Support validate conf to reject the config with 2 top level queues
[ https://issues.apache.org/jira/browse/YUNIKORN-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kinga Marton resolved YUNIKORN-351.
-----------------------------------
    Resolution: Not A Bug

> Support validate conf to reject the config with 2 top level queues
>          Key: YUNIKORN-351
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-351
[jira] [Resolved] (YUNIKORN-419) Add helm packaging to the release tool
[ https://issues.apache.org/jira/browse/YUNIKORN-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kinga Marton resolved YUNIKORN-419.
-----------------------------------
    Fix Version/s: 0.10
       Resolution: Fixed

Thank you for fixing this! The PR is merged.

> Add helm packaging to the release tool
> --------------------------------------
>
>          Key: YUNIKORN-419
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-419
>      Project: Apache YuniKorn
>   Issue Type: New Feature
>   Components: release
>     Reporter: Wilfred Spiegelenburg
>     Assignee: Wilfred Spiegelenburg
>     Priority: Minor
>       Labels: pull-request-available
>      Fix For: 0.10
>
> The current tool does not package the helm charts at the same time as we
> generate the rest of the release artefacts.
> Details around signing etc. are not documented either.
[jira] [Resolved] (YUNIKORN-315) use resources.DAOString in handler instead of trim
[ https://issues.apache.org/jira/browse/YUNIKORN-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kinga Marton resolved YUNIKORN-315.
-----------------------------------
    Fix Version/s: 0.10
       Resolution: Fixed

> use resources.DAOString in handler instead of trim
> --------------------------------------------------
>
>          Key: YUNIKORN-315
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-315
>      Project: Apache YuniKorn
>   Issue Type: Improvement
>   Components: core - common
>     Reporter: Wilfred Spiegelenburg
>     Assignee: Ankit Kumar
>     Priority: Major
>       Labels: newbie, pull-request-available
>      Fix For: 0.10
>
> In the resource object we have a special function to write the resources in a
> format that is used for the web responses.
> The handler code does not use this call and uses a local trim over a String call.
> For readability and consistency we should move all web service code that
> returns resources to use the DAOString() call.
[jira] [Commented] (YUNIKORN-315) use resources.DAOString in handler instead of trim
[ https://issues.apache.org/jira/browse/YUNIKORN-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200122#comment-17200122 ]

Kinga Marton commented on YUNIKORN-315:
---------------------------------------

Thank you [~akumar] for working on this. I merged your PR.

> use resources.DAOString in handler instead of trim
>          Key: YUNIKORN-315
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-315
[jira] [Commented] (YUNIKORN-326) Add rest API to retrieve cluster nodes resource utilization
[ https://issues.apache.org/jira/browse/YUNIKORN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203844#comment-17203844 ]

Kinga Marton commented on YUNIKORN-326:
---------------------------------------

[~Huang Ting Yao] can you please check whether we can add the new endpoint according to the new design of the API: https://docs.google.com/document/d/1KgoGqmNGR7TK3yBeqmefiFso2_lQujA_TzlsU0SsG1o/edit?usp=sharing

> Add rest API to retrieve cluster nodes resource utilization
> -----------------------------------------------------------
>
>          Key: YUNIKORN-326
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-326
>      Project: Apache YuniKorn
>   Issue Type: Sub-task
>   Components: webapp
>     Reporter: Weiwei Yang
>     Assignee: Ting Yao,Huang
>     Priority: Major
>       Labels: pull-request-available
>
> URL: ws/v1/nodes/utilization
> returns the nodes resource utilization summary, a distribution based on usage:
> {code}
> {
>   type: "CPU",
>   utilization: [ {
>       bucketID: "1",
>       bucketName: "0-10%",
>       numOfNodes: 5,
>       nodeNames: [...]
>     }, {
>       bucketID: "2",
>       bucketName: "10-20%",
>       numOfNodes: 5,
>       nodeNames: [...]
>     },
>     ...
>   ]
> },
> {
>   type: "Memory",
>   utilization: [ {
>       bucketID: "1",
>       bucketName: "0-10%",
>       numOfNodes: 5,
>       nodeNames: [...]
>     }, {
>       bucketID: "2",
>       bucketName: "10-20%",
>       numOfNodes: 5,
>       nodeNames: [...]
>     },
>     ...
>   ]
> },
> {code}
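The bucket layout in the proposed response groups nodes into 10%-wide utilization bands. As an illustrative sketch only (Python, hypothetical function name; the actual endpoint is implemented in the YuniKorn Go webapp), the grouping could look like this:

```python
# Illustrative sketch of grouping nodes into the 0-10%, 10-20%, ...
# utilization buckets shown in the proposed REST response; NOT the
# actual YuniKorn webapp code.

def bucket_nodes(node_usage):
    """node_usage: dict of node name -> utilization as a fraction in [0.0, 1.0]."""
    buckets = [
        {"bucketID": str(i + 1),
         "bucketName": f"{i * 10}-{(i + 1) * 10}%",
         "numOfNodes": 0,
         "nodeNames": []}
        for i in range(10)
    ]
    for name, usage in node_usage.items():
        idx = min(int(usage * 10), 9)  # exactly 100% falls into the last bucket
        buckets[idx]["numOfNodes"] += 1
        buckets[idx]["nodeNames"].append(name)
    return buckets

usage = {"node-1": 0.05, "node-2": 0.12, "node-3": 0.95, "node-4": 1.0}
result = bucket_nodes(usage)
print(result[0]["nodeNames"])   # ['node-1']
print(result[9]["numOfNodes"])  # 2
```

One bucket list like this would be produced per resource type ("CPU", "Memory") to match the response shape quoted above.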
[jira] [Commented] (YUNIKORN-324) Add rest API to retrieve cluster resource utilization
[ https://issues.apache.org/jira/browse/YUNIKORN-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203843#comment-17203843 ]

Kinga Marton commented on YUNIKORN-324:
---------------------------------------

[~Huang Ting Yao] can you please check whether we can add the new endpoint according to the new design of the API: https://docs.google.com/document/d/1KgoGqmNGR7TK3yBeqmefiFso2_lQujA_TzlsU0SsG1o/edit?usp=sharing

> Add rest API to retrieve cluster resource utilization
> -----------------------------------------------------
>
>          Key: YUNIKORN-324
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-324
>      Project: Apache YuniKorn
>   Issue Type: Sub-task
>   Components: webapp
>     Reporter: Weiwei Yang
>     Assignee: Ting Yao,Huang
>     Priority: Major
>       Labels: pull-request-available
>
> URL: ws/v1/clusters/utilization
> this should look something like the following (per-partition):
> {code}
> [
>   {
>     partition: default,
>     utilization: [ {
>         type: "cpu",
>         total: 100,
>         used: 50,
>         usage: 50%
>       },
>       {
>         type: "memory",
>         total: 1000,
>         used: 500,
>         usage: 50%
>       }
>     ]
>   },
>   ...
> ]
> {code}
[jira] [Commented] (YUNIKORN-324) Add rest API to retrieve cluster resource utilization
[ https://issues.apache.org/jira/browse/YUNIKORN-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204687#comment-17204687 ]

Kinga Marton commented on YUNIKORN-324:
---------------------------------------

[~maniraj...@gmail.com] for this issue and also YUNIKORN-326 there is already an open PR, but it aligns with the current design of the REST API. I am wondering what your progress is with the refactoring of the API, and whether we can do these issues first, in accordance with the new design. What do you think? cc [~wwei]

> Add rest API to retrieve cluster resource utilization
>          Key: YUNIKORN-324
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-324
[jira] [Commented] (YUNIKORN-430) WARN and ERROR level log entries should not show a stack trace
[ https://issues.apache.org/jira/browse/YUNIKORN-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204730#comment-17204730 ]

Kinga Marton commented on YUNIKORN-430:
---------------------------------------

I think it is useful to have the stack trace in the case of error messages. For WARN messages we can think about removing it, but for ERROR I would keep it.

> WARN and ERROR level log entries should not show a stack trace
> --------------------------------------------------------------
>
>          Key: YUNIKORN-430
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-430
>      Project: Apache YuniKorn
>   Issue Type: Sub-task
>   Components: core - common, shim - kubernetes
>     Reporter: Adam Antal
>     Assignee: Adam Antal
>     Priority: Minor
>
> Currently every WARN or ERROR level entry in the logs shows a stack trace of
> the log entry. I haven't seen similar behaviour in other projects, and I
> think this is too verbose.
[jira] [Commented] (YUNIKORN-335) Invalid config validation and config schema checks for unsupported config properties
[ https://issues.apache.org/jira/browse/YUNIKORN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204831#comment-17204831 ]

Kinga Marton commented on YUNIKORN-335:
---------------------------------------

[~wwei] I added upper case letter support to the queue name validation, so not only lower case letters are accepted. Regarding the multiple queue issue, we agreed that it is actually the expected, documented behaviour.

> Invalid config validation and config schema checks for unsupported config properties
>          Key: YUNIKORN-335
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-335
[jira] [Commented] (YUNIKORN-266) Delete related pods when deleting an app CRD
[ https://issues.apache.org/jira/browse/YUNIKORN-266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205389#comment-17205389 ]

Kinga Marton commented on YUNIKORN-266:
---------------------------------------

When thinking about this we should also consider the following scenario: if there is an application with all *tasks completed* and we delete the CRD, the application in the scheduler will *NOT* be deleted because of the assigned pods. I think if the pods are in a terminated state we can delete them.

> Delete related pods when deleting an app CRD
> --------------------------------------------
>
>          Key: YUNIKORN-266
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-266
>      Project: Apache YuniKorn
>   Issue Type: Sub-task
>     Reporter: Kinga Marton
>     Priority: Major
[jira] [Created] (YUNIKORN-431) Submitting pods with non-existing queue name
Kinga Marton created YUNIKORN-431:
----------------------------------

             Summary: Submitting pods with non-existing queue name
                 Key: YUNIKORN-431
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-431
             Project: Apache YuniKorn
          Issue Type: Sub-task
            Reporter: Kinga Marton

When we submit the CRD with, let's say, queueName=root.default, and then submit the pods linked to this application with queueName=root.xyz, where xyz is not an existing queue, the pod will be placed into the root.default queue defined in the CRD, which is not the right approach.
[jira] [Created] (YUNIKORN-433) Extend configwatcher expiration time when a new request comes in
Kinga Marton created YUNIKORN-433:
----------------------------------

             Summary: Extend configwatcher expiration time when a new request comes in
                 Key: YUNIKORN-433
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-433
             Project: Apache YuniKorn
          Issue Type: Bug
            Reporter: Kinga Marton
            Assignee: Kinga Marton

When two configuration reloads are triggered close to each other, it might happen that the watcher times out before the update is available, because it is already running: it was triggered during the first update/configmap creation, and when the second update triggers it again, the expiration time is not modified. Everything is about the timing: if you wait with the update until the first triggered configwatcher times out, the changes will be synced. It also works if you are quick enough with the update and the changes take effect before the expiration time.

To avoid this kind of issue with config changes we need two changes:
* increase the timeout for the configwatcher
* restart the configwatcher timer when the configwatcher is triggered and there is already one running.
[jira] [Commented] (YUNIKORN-433) Extend configwatcher expiration time when a new request comes in
[ https://issues.apache.org/jira/browse/YUNIKORN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206186#comment-17206186 ]

Kinga Marton commented on YUNIKORN-433:
---------------------------------------

The K8s documentation says the following about the delay (https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap):
{quote}As a result, the total delay from the moment when the ConfigMap is updated to the moment when new keys are projected to the pod can be as long as kubelet sync period (1 minute by default) + ttl of ConfigMaps cache (1 minute by default) in kubelet.
{quote}
According to this, I think we should set the timeout to at least 2 minutes. Even with this change I don't think we can guarantee that updated configurations will take effect in the scheduler every time.

> Extend configwatcher expiration time when a new request comes in
>          Key: YUNIKORN-433
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-433
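The two proposed changes (a longer timeout, plus pushing the deadline out when the watcher is re-triggered while already running) can be sketched like this. All names below are illustrative, not the actual YuniKorn configwatcher API; a fake clock stands in for real time:

```python
import time

# Illustrative sketch of extending a watcher's expiration when it is
# re-triggered while already running; NOT the actual YuniKorn configwatcher.

class ConfigWatcher:
    def __init__(self, timeout_seconds=120, clock=time.monotonic):
        # e.g. at least 2 minutes, per the kubelet sync + cache TTL delay above
        self.timeout = timeout_seconds
        self.clock = clock
        self.deadline = None

    def trigger(self):
        """Start the watcher, or push the deadline out if it is already running."""
        self.deadline = self.clock() + self.timeout

    def expired(self):
        return self.deadline is not None and self.clock() >= self.deadline

# With a fake clock, a second trigger before expiry extends the deadline.
now = [0.0]
w = ConfigWatcher(timeout_seconds=120, clock=lambda: now[0])
w.trigger()          # deadline at t=120
now[0] = 100.0
w.trigger()          # re-triggered while running: deadline moves to t=220
now[0] = 150.0
print(w.expired())   # False - would have expired at t=120 without the reset
```

Without the reset in trigger(), a second update arriving near the end of the first watch window would never be picked up, which is exactly the timing problem described in the issue.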
[jira] [Updated] (YUNIKORN-433) Extend configwatcher expiration time in case of a new update
[ https://issues.apache.org/jira/browse/YUNIKORN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kinga Marton updated YUNIKORN-433:
----------------------------------
    Summary: Extend configwatcher expiration time in case of a new update  (was: Extend configwatcher expiration time when a new request comes in)

> Extend configwatcher expiration time in case of a new update
>          Key: YUNIKORN-433
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-433
[jira] [Resolved] (YUNIKORN-436) serviceAccount is hardcoded in rbac.yaml
[ https://issues.apache.org/jira/browse/YUNIKORN-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kinga Marton resolved YUNIKORN-436.
-----------------------------------
    Fix Version/s: 0.10
       Resolution: Fixed

> serviceAccount is hardcoded in rbac.yaml
> ----------------------------------------
>
>          Key: YUNIKORN-436
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-436
>      Project: Apache YuniKorn
>   Issue Type: Bug
>   Components: deployment
>     Reporter: Vishwas
>     Assignee: Vishwas
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 0.10
>
> serviceAccountName is exposed in values.yaml, but a change to the value is
> not reflected when the service account is created.
> In rbac.yaml, the serviceAccountName is hardcoded to yunikorn-admin.
[jira] [Commented] (YUNIKORN-436) serviceAccount is hardcoded in rbac.yaml
[ https://issues.apache.org/jira/browse/YUNIKORN-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207936#comment-17207936 ]

Kinga Marton commented on YUNIKORN-436:
---------------------------------------

Thank you [~vbm] for fixing it. I committed your changes to master.

> serviceAccount is hardcoded in rbac.yaml
>          Key: YUNIKORN-436
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-436
[jira] [Commented] (YUNIKORN-435) Admission-Controller pod goes into pending state because of default serviceAccount
[ https://issues.apache.org/jira/browse/YUNIKORN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208023#comment-17208023 ]

Kinga Marton commented on YUNIKORN-435:
---------------------------------------

[~vbm] can you please add some context to this failure? (e.g. how you installed it, versions, anything relevant to reproducing this bug) I have installed YK multiple times and I haven't seen this issue.

> Admission-Controller pod goes into pending state because of default serviceAccount
> ----------------------------------------------------------------------------------
>
>          Key: YUNIKORN-435
>          URL: https://issues.apache.org/jira/browse/YUNIKORN-435
>      Project: Apache YuniKorn
>   Issue Type: Bug
>   Components: deployment, shim - kubernetes
>     Reporter: Vishwas
>     Assignee: Vishwas
>     Priority: Major
>       Labels: pull-request-available
>
> The admission controller pod, which is created alongside the scheduler pod, uses
> the wrong service account.
> The admission controller pod is launched with the default service account. This
> causes the admission controller pod to stay in pending state because of
> insufficient privileges.
>
> Output indicating the pod is in pending state:
> {code:java}
> NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
> deployment.apps/yunikorn-admission-controller   0/1     0            0           8m14s
> deployment.apps/yunikorn-scheduler              1/1     1            1           8m20s
>
> NAME                                                       DESIRED   CURRENT   READY   AGE
> replicaset.apps/yunikorn-admission-controller-854f64bcbf   1         0         0       8m14s
> replicaset.apps/yunikorn-scheduler-585fcfbb46              1         1         1       8m20s
> {code}
> {code:java}
> [root@vm5 vbm]# kubectl describe replicaset.apps/yunikorn-admission-controller-854f64bcbf -n yunikorn
> Name:           yunikorn-admission-controller-854f64bcbf
> Namespace:      yunikorn
> Selector:       app=yunikorn,pod-template-hash=854f64bcbf
> Labels:         app=yunikorn
>                 pod-template-hash=854f64bcbf
> Annotations:    deployment.kubernetes.io/desired-replicas: 1
>                 deployment.kubernetes.io/max-replicas: 2
>                 deployment.kubernetes.io/revision: 1
> Controlled By:  Deployment/yunikorn-admission-controller
> Events:
>   Type     Reason        Age                 From                   Message
>   ----     ------        ---                 ----                   -------
>   Warning  FailedCreate  19s (x13 over 40s)  replicaset-controller  Error creating: pods "yunikorn-admission-controller-854f64bcbf-" is forbidden: unable to validate against any pod security policy: [spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].hostPort: Invalid value: 8443: Host port 8443 is not allowed to be used. Allowed ports: []]
> {code}
[jira] [Assigned] (YUNIKORN-352) when child queue capacity greater than parent, the configmap update is rejected but not notified to end user
[ https://issues.apache.org/jira/browse/YUNIKORN-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton reassigned YUNIKORN-352: - Assignee: Kinga Marton > when child queue capacity greater than parent, the configmap update is > rejected but not notified to end user > > > Key: YUNIKORN-352 > URL: https://issues.apache.org/jira/browse/YUNIKORN-352 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Ayub Pathan >Assignee: Kinga Marton >Priority: Critical > > Create a nested static queue like below. > {noformat} > partitions: > - > name: default > placementrules: > - name: tag > value: namespace > create: true > queues: > - name: root > submitacl: '*' > queues: > - name: queue2 > resources: > guaranteed: > memory: 300 > cpu: 300 > max: > memory: 1000 > cpu: 1000 > queues: > - name: queue3 > resources: > guaranteed: > memory: 300 > cpu: 300 > max: > memory: 2000 > cpu: 2000 > {noformat} > Validate the same through rest API /queues - queues3 is not even shown in the > response. 
> {noformat} > { > "capacity": { > "capacity": "map[attachable-volumes-aws-ebs:75 > ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 > pods:87 vcore:4875]", > "usedcapacity": "0" > }, > "nodes": null, > "partitionName": "[mycluster]default", > "queues": { > "capacities": { > "absusedcapacity": "[memory:0 vcore:2]", > "capacity": "[]", > "maxcapacity": "[attachable-volumes-aws-ebs:75 > ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 > pods:87 vcore:4875]", > "usedcapacity": "[memory:1 vcore:110]" > }, > "properties": {}, > "queuename": "root", > "queues": [ > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[]", > "maxcapacity": "[]", > "usedcapacity": "[memory:1]" > }, > "properties": {}, > "queuename": "monitoring", > "queues": null, > "status": "Active" > }, > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[]", > "maxcapacity": "[]", > "usedcapacity": "[vcore:110]" > }, > "properties": {}, > "queuename": "kube-system", > "queues": null, > "status": "Active" > }, > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[cpu:300 memory:300]", > "maxcapacity": "[cpu:1000 memory:1000]", > "usedcapacity": "[]" > }, > "properties": {}, > "queuename": "queue2", > "queues": null, > "status": "Active" > } > ], > "status": "Active" > } > } > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-324) Add rest API to retrieve cluster resource utilization
[ https://issues.apache.org/jira/browse/YUNIKORN-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-324. --- Fix Version/s: 0.10 Resolution: Fixed Thank you [~Huang Ting Yao] for working on this. Merged your PR. > Add rest API to retrieve cluster resource utilization > - > > Key: YUNIKORN-324 > URL: https://issues.apache.org/jira/browse/YUNIKORN-324 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Weiwei Yang >Assignee: Ting Yao,Huang >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > URL: ws/v1/clusters/utilization > this should something like the following (per-partition): > {code} > [ > { > partition: default, > utilization: [ { > type: "cpu", > total: 100, > used: 50, > usage: 50% > }, > { > type: "memory", > total: 1000, > used: 500, > usage: 50% > } > ] > }, > ... > ] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-326) Add rest API to retrieve cluster nodes resource utilization
[ https://issues.apache.org/jira/browse/YUNIKORN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208678#comment-17208678 ] Kinga Marton commented on YUNIKORN-326: --- [~Huang Ting Yao] I merged your PR. Before resolving this Jira can you please upload a sample REST output for this one as well? > Add rest API to retrieve cluster nodes resource utilization > --- > > Key: YUNIKORN-326 > URL: https://issues.apache.org/jira/browse/YUNIKORN-326 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Weiwei Yang >Assignee: Ting Yao,Huang >Priority: Major > Labels: pull-request-available > > URL: ws/v1/nodes/utilization > returns the nodes resource utilization summary, a distribution based on usage: > {code} > { > type: "CPU", > utilization: [ { > bucketID: "1", > bucketName: "0-10%", > numOfNodes: 5, > nodeNames: [...] > }, { > bucketID: "2", > bucketName: "10-20%", > numOfNodes: 5, > nodeNames: [...] > }, > ... > ] > }, > { > type: "Memory", > utilization: [ { > bucketID: "1", > bucketName: "0-10%", > numOfNodes: 5, > nodeNames: [...] > }, { > bucketID: "2", > bucketName: "10-20%", > numOfNodes: 5, > nodeNames: [...] > }, > ... > ] > }, > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
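The bucket layout in the sample response above can be sketched in Go. `bucketID` and `bucketName` are hypothetical helper names for illustration, not the actual endpoint code.

```go
package main

import "fmt"

// bucketID maps a utilization ratio (0.0-1.0) to one of ten 10% buckets,
// mirroring the bucket layout in the proposed REST response.
func bucketID(usage float64) int {
	if usage < 0 {
		usage = 0
	}
	id := int(usage * 10)
	if id > 9 { // exactly 100% falls into the last bucket
		id = 9
	}
	return id
}

// bucketName renders the human-readable range label, e.g. "10-20%".
func bucketName(id int) string {
	return fmt.Sprintf("%d-%d%%", id*10, (id+1)*10)
}

func main() {
	// Group node utilization ratios into the response's buckets.
	for _, u := range []float64{0.05, 0.15, 1.0} {
		id := bucketID(u)
		fmt.Printf("usage %.2f -> bucket %d (%s)\n", u, id, bucketName(id))
	}
}
```

The `nodeNames` lists in the response would then simply collect every node whose ratio lands in the same bucket.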
[jira] [Resolved] (YUNIKORN-326) Add rest API to retrieve cluster nodes resource utilization
[ https://issues.apache.org/jira/browse/YUNIKORN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-326. --- Fix Version/s: 0.10 Resolution: Fixed > Add rest API to retrieve cluster nodes resource utilization > --- > > Key: YUNIKORN-326 > URL: https://issues.apache.org/jira/browse/YUNIKORN-326 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Weiwei Yang >Assignee: Ting Yao,Huang >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > URL: ws/v1/nodes/utilization > returns the nodes resource utilization summary, a distribution based on usage: > {code} > { > type: "CPU", > utilization: [ { > bucketID: "1", > bucketName: "0-10%", > numOfNodes: 5, > nodeNames: [...] > }, { > bucketID: "2", > bucketName: "10-20%", > numOfNodes: 5, > nodeNames: [...] > }, > ... > ] > }, > { > type: "Memory", > utilization: [ { > bucketID: "1", > bucketName: "0-10%", > numOfNodes: 5, > nodeNames: [...] > }, { > bucketID: "2", > bucketName: "10-20%", > numOfNodes: 5, > nodeNames: [...] > }, > ... > ] > }, > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-438) Move informers synch call at the beginning of scheduler start
Kinga Marton created YUNIKORN-438: - Summary: Move informers synch call at the beginning of scheduler start Key: YUNIKORN-438 URL: https://issues.apache.org/jira/browse/YUNIKORN-438 Project: Apache YuniKorn Issue Type: Bug Reporter: Kinga Marton Assignee: Kinga Marton During recovery we wait until the informers are synced with the API server, but this currently happens in the node-recovery phase. Applications may already have been created before this phase, and asking k8s for namespace information can fail if the informers are not yet synced. We should make sure that we wait for this sync before doing anything else. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-438) Move informers synch call to the beginning of scheduler start
[ https://issues.apache.org/jira/browse/YUNIKORN-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton updated YUNIKORN-438: -- Summary: Move informers synch call to the beginning of scheduler start (was: Move informers synch call at the beginning of scheduler start) > Move informers synch call to the beginning of scheduler start > - > > Key: YUNIKORN-438 > URL: https://issues.apache.org/jira/browse/YUNIKORN-438 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > > We do a wait until the informers get sync up with API server during the > recovery. But this currently is done in the node-recovery phase. But before > this phase the application might got created and asking for namespace > information from k8s might fail if the informers are not yet synched. > We should make sure that we wait for this sync before doing anything. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-435) Admission-Controller pod goes into pending state because of default serviceAccount
[ https://issues.apache.org/jira/browse/YUNIKORN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-435. --- Fix Version/s: 0.10 Resolution: Fixed Thank you [~vbm] for the clarification and for the fix. Thanks [~adam.antal] for the validation. I merged the PRs to the master branch. > Admission-Controller pod goes into pending state because of default > serviceAccount > -- > > Key: YUNIKORN-435 > URL: https://issues.apache.org/jira/browse/YUNIKORN-435 > Project: Apache YuniKorn > Issue Type: Bug > Components: deployment, shim - kubernetes >Reporter: Vishwas >Assignee: Vishwas >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > The admission controller pod which is created inside the scheduler pod uses > the wrong service account. > The admission controller pod is launched with default service account. This > causes the admission controller pod to be in pending state because of > insufficient privileges. > > Error message indicating pod in pending state: > {code:java} > NAMEREADY UP-TO-DATE > AVAILABLE AGE > deployment.apps/yunikorn-admission-controller 0/1 00 >8m14s > deployment.apps/yunikorn-scheduler 1/1 11 >8m20sNAME DESIRED > CURRENT READY AGE > replicaset.apps/yunikorn-admission-controller-854f64bcbf 1 0 > 0 8m14s > replicaset.apps/yunikorn-scheduler-585fcfbb46 1 1 > 1 8m20s > {code} > {code:java} > [root@vm5 vbm]# kubectl describe > replicaset.apps/yunikorn-admission-controller-854f64bcbf -n yunikorn > Name: yunikorn-admission-controller-854f64bcbf > Namespace: yunikorn > Selector: app=yunikorn,pod-template-hash=854f64bcbf > Labels: app=yunikorn > pod-template-hash=854f64bcbf > Annotations:deployment.kubernetes.io/desired-replicas: 1 > deployment.kubernetes.io/max-replicas: 2 > deployment.kubernetes.io/revision: 1 > Controlled By: Deployment/yunikorn-admission-controller > Events: > Type ReasonAge From Message > -- --- > Warning FailedCreate 19s (x13 over 40s) replicaset-controller Error > 
creating: pods "yunikorn-admission-controller-854f64bcbf-" is forbidden: > unable to validate against any pod security policy: > [spec.securityContext.hostNetwork: Invalid value: true: Host network is not > allowed to be used spec.containers[0].hostPort: Invalid value: 8443: Host > port 8443 is not allowed to be used. Allowed ports: []] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-352) when child queue capacity greater than parent, the configmap update is rejected but not notified to end user
[ https://issues.apache.org/jira/browse/YUNIKORN-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton updated YUNIKORN-352: -- Priority: Major (was: Critical) > when child queue capacity greater than parent, the configmap update is > rejected but not notified to end user > > > Key: YUNIKORN-352 > URL: https://issues.apache.org/jira/browse/YUNIKORN-352 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Ayub Pathan >Assignee: Kinga Marton >Priority: Major > > Create a nested static queue like below. > {noformat} > partitions: > - > name: default > placementrules: > - name: tag > value: namespace > create: true > queues: > - name: root > submitacl: '*' > queues: > - name: queue2 > resources: > guaranteed: > memory: 300 > cpu: 300 > max: > memory: 1000 > cpu: 1000 > queues: > - name: queue3 > resources: > guaranteed: > memory: 300 > cpu: 300 > max: > memory: 2000 > cpu: 2000 > {noformat} > Validate the same through rest API /queues - queues3 is not even shown in the > response. 
> {noformat} > { > "capacity": { > "capacity": "map[attachable-volumes-aws-ebs:75 > ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 > pods:87 vcore:4875]", > "usedcapacity": "0" > }, > "nodes": null, > "partitionName": "[mycluster]default", > "queues": { > "capacities": { > "absusedcapacity": "[memory:0 vcore:2]", > "capacity": "[]", > "maxcapacity": "[attachable-volumes-aws-ebs:75 > ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 > pods:87 vcore:4875]", > "usedcapacity": "[memory:1 vcore:110]" > }, > "properties": {}, > "queuename": "root", > "queues": [ > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[]", > "maxcapacity": "[]", > "usedcapacity": "[memory:1]" > }, > "properties": {}, > "queuename": "monitoring", > "queues": null, > "status": "Active" > }, > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[]", > "maxcapacity": "[]", > "usedcapacity": "[vcore:110]" > }, > "properties": {}, > "queuename": "kube-system", > "queues": null, > "status": "Active" > }, > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[cpu:300 memory:300]", > "maxcapacity": "[cpu:1000 memory:1000]", > "usedcapacity": "[]" > }, > "properties": {}, > "queuename": "queue2", > "queues": null, > "status": "Active" > } > ], > "status": "Active" > } > } > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-438) Move informers synch call to the beginning of scheduler start
[ https://issues.apache.org/jira/browse/YUNIKORN-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-438. --- Fix Version/s: 0.10 Resolution: Fixed > Move informers synch call to the beginning of scheduler start > - > > Key: YUNIKORN-438 > URL: https://issues.apache.org/jira/browse/YUNIKORN-438 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > We do a wait until the informers get sync up with API server during the > recovery. But this currently is done in the node-recovery phase. But before > this phase the application might got created and asking for namespace > information from k8s might fail if the informers are not yet synched. > We should make sure that we wait for this sync before doing anything. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Reopened] (YUNIKORN-352) when child queue capacity greater than parent, the configmap update is rejected but not notified to end user
[ https://issues.apache.org/jira/browse/YUNIKORN-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton reopened YUNIKORN-352: --- > when child queue capacity greater than parent, the configmap update is > rejected but not notified to end user > > > Key: YUNIKORN-352 > URL: https://issues.apache.org/jira/browse/YUNIKORN-352 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Ayub Pathan >Assignee: Kinga Marton >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > Create a nested static queue like below. > {noformat} > partitions: > - > name: default > placementrules: > - name: tag > value: namespace > create: true > queues: > - name: root > submitacl: '*' > queues: > - name: queue2 > resources: > guaranteed: > memory: 300 > cpu: 300 > max: > memory: 1000 > cpu: 1000 > queues: > - name: queue3 > resources: > guaranteed: > memory: 300 > cpu: 300 > max: > memory: 2000 > cpu: 2000 > {noformat} > Validate the same through rest API /queues - queues3 is not even shown in the > response. 
> {noformat} > { > "capacity": { > "capacity": "map[attachable-volumes-aws-ebs:75 > ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 > pods:87 vcore:4875]", > "usedcapacity": "0" > }, > "nodes": null, > "partitionName": "[mycluster]default", > "queues": { > "capacities": { > "absusedcapacity": "[memory:0 vcore:2]", > "capacity": "[]", > "maxcapacity": "[attachable-volumes-aws-ebs:75 > ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 > pods:87 vcore:4875]", > "usedcapacity": "[memory:1 vcore:110]" > }, > "properties": {}, > "queuename": "root", > "queues": [ > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[]", > "maxcapacity": "[]", > "usedcapacity": "[memory:1]" > }, > "properties": {}, > "queuename": "monitoring", > "queues": null, > "status": "Active" > }, > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[]", > "maxcapacity": "[]", > "usedcapacity": "[vcore:110]" > }, > "properties": {}, > "queuename": "kube-system", > "queues": null, > "status": "Active" > }, > { > "capacities": { > "absusedcapacity": "[]", > "capacity": "[cpu:300 memory:300]", > "maxcapacity": "[cpu:1000 memory:1000]", > "usedcapacity": "[]" > }, > "properties": {}, > "queuename": "queue2", > "queues": null, > "status": "Active" > } > ], > "status": "Active" > } > } > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-447) Remove wrong error message from admission_controller.go
Kinga Marton created YUNIKORN-447: - Summary: Remove wrong error message from admission_controller.go Key: YUNIKORN-447 URL: https://issues.apache.org/jira/browse/YUNIKORN-447 Project: Apache YuniKorn Issue Type: Bug Reporter: Kinga Marton Assignee: Kinga Marton There is an error log message left in the admission_controller.go file: [https://github.com/apache/incubator-yunikorn-k8shim/blob/master/pkg/plugin/admissioncontrollers/webhook/admission_controller.go#L211-L213] That message was used for debug purpose and it remained there by mistake. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-449) Add predicates.MatchNodeSelectorPred to reservation list
Kinga Marton created YUNIKORN-449: - Summary: Add predicates.MatchNodeSelectorPred to reservation list Key: YUNIKORN-449 URL: https://issues.apache.org/jira/browse/YUNIKORN-449 Project: Apache YuniKorn Issue Type: Improvement Reporter: Kinga Marton Assignee: Kinga Marton -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-452) Check and fix YK page in artifactory
Kinga Marton created YUNIKORN-452: - Summary: Check and fix YK page in artifactory Key: YUNIKORN-452 URL: https://issues.apache.org/jira/browse/YUNIKORN-452 Project: Apache YuniKorn Issue Type: Improvement Reporter: Kinga Marton Starting from October, Helm Hub moved to Artifact Hub and some information on the YK page is broken (logo, maintainers, etc.). Let's check what we need to change to have all the information populated again. More info: [https://helm.sh/blog/helm-hub-moving-to-artifact-hub/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-455) Make the core configurable
Kinga Marton created YUNIKORN-455: - Summary: Make the core configurable Key: YUNIKORN-455 URL: https://issues.apache.org/jira/browse/YUNIKORN-455 Project: Apache YuniKorn Issue Type: Improvement Reporter: Kinga Marton Assignee: Kinga Marton -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-455) Make the core configurable
[ https://issues.apache.org/jira/browse/YUNIKORN-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton updated YUNIKORN-455: -- Description: There are some startup options in the core side, but they are not configurable from outside. > Make the core configurable > -- > > Key: YUNIKORN-455 > URL: https://issues.apache.org/jira/browse/YUNIKORN-455 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > > There are some startup options in the core side, but they are not > configurable from outside. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-455) Make the core configurable
[ https://issues.apache.org/jira/browse/YUNIKORN-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton updated YUNIKORN-455: -- Description: There are some startup options in the core side, but they are not configurable from outside. Also make the reservation expiration configurable as well. was:There are some startup options in the core side, but they are not configurable from outside. > Make the core configurable > -- > > Key: YUNIKORN-455 > URL: https://issues.apache.org/jira/browse/YUNIKORN-455 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > > There are some startup options in the core side, but they are not > configurable from outside. > Also make the reservation expiration configurable as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-455) Make the core configurable
[ https://issues.apache.org/jira/browse/YUNIKORN-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222784#comment-17222784 ] Kinga Marton commented on YUNIKORN-455: --- [~cheersyang] I just checked the meeting minutes from today's community sync, and I have a question: {quote}A: there are 2 options, 1) through REST API, 2) through another config file. Opt 2 gives a way to persistent configs, however, it might be slower due to the configmap update delays. {quote} A config file is mentioned here. Do you mean creating a new configmap and passing the values to the core, instead of having a .properties, .json, or any other key-value file format that the core side would process? > Make the core configurable > -- > > Key: YUNIKORN-455 > URL: https://issues.apache.org/jira/browse/YUNIKORN-455 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > > There are some startup options in the core side, but they are not > configurable from outside. > Also make the reservation expiration configurable as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-456) Add ENV var to disable the reservation
Kinga Marton created YUNIKORN-456: - Summary: Add ENV var to disable the reservation Key: YUNIKORN-456 URL: https://issues.apache.org/jira/browse/YUNIKORN-456 Project: Apache YuniKorn Issue Type: Sub-task Reporter: Kinga Marton -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Assigned] (YUNIKORN-456) Add ENV var to disable the reservation
[ https://issues.apache.org/jira/browse/YUNIKORN-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton reassigned YUNIKORN-456: - Assignee: Kinga Marton > Add ENV var to disable the reservation > -- > > Key: YUNIKORN-456 > URL: https://issues.apache.org/jira/browse/YUNIKORN-456 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-457) Find a way to pass the RMID to the webservice
Kinga Marton created YUNIKORN-457: - Summary: Find a way to pass the RMID to the webservice Key: YUNIKORN-457 URL: https://issues.apache.org/jira/browse/YUNIKORN-457 Project: Apache YuniKorn Issue Type: Bug Reporter: Kinga Marton Assignee: Kinga Marton When updating the configuration through the REST API, we need an RMID to reflect the changes in the configmap as well. With the current approach this might not work properly if we have more than one RM registered, or if we don't have any RMs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-455) Make the core configurable
[ https://issues.apache.org/jira/browse/YUNIKORN-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229431#comment-17229431 ] Kinga Marton commented on YUNIKORN-455: --- I like the idea of having another configmap. I think we should force a restart in order to make the changes take effect; if there is a restart, the update delay will not be a problem. For the configs we want to change at runtime, we can pass them via ENV variables as well. > Make the core configurable > -- > > Key: YUNIKORN-455 > URL: https://issues.apache.org/jira/browse/YUNIKORN-455 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > > There are some startup options in the core side, but they are not > configurable from outside. > Also make the reservation expiration configurable as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-457) Find a way to pass the RMID to the webservice
[ https://issues.apache.org/jira/browse/YUNIKORN-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229441#comment-17229441 ] Kinga Marton commented on YUNIKORN-457: --- Today I thought a bit about how we can solve this issue, and I found a better way than the current implementation: - When calling {{updateSchedulerConfig}} from the webservice, pass an empty RMID - In {{updateSchedulerConfig}}, if the RMID is empty: the clusterContext holds a map of the partitions whose keys have the form {{[rmID]partitionName}}. We can iterate through the partitions and check whether the map contains a key with the partitionName from the changed config. If yes, we can extract the rmID from the value stored in the clusterContext's partition map. - If the partition is a new one, use a default RMID, which we can store in the ClusterContext and set when the first RM is registered. > Find a way to pass the RMID to the webservice > - > > Key: YUNIKORN-457 > URL: https://issues.apache.org/jira/browse/YUNIKORN-457 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > > When updating the configuration through the REST API, we need an RMID to > reflect the changes in the configmap as well. With the current approach this > might not work properly if we have more than one RM registered, or if > we don't have any RMs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
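The key-parsing step described in the comment can be sketched as a small Go helper. `splitPartitionKey` is a hypothetical name; the real core has its own accessors for the `[rmID]partitionName` layout.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPartitionKey extracts the rmID and partition name from a key of the
// form "[rmID]partitionName", the layout the clusterContext's partition map
// uses according to the comment above.
func splitPartitionKey(key string) (rmID, partition string, ok bool) {
	if !strings.HasPrefix(key, "[") {
		return "", "", false
	}
	end := strings.Index(key, "]")
	if end < 0 {
		return "", "", false
	}
	return key[1:end], key[end+1:], true
}

func main() {
	// The "[mycluster]default" value mirrors the partitionName seen in the
	// REST output earlier in this thread.
	rm, part, ok := splitPartitionKey("[mycluster]default")
	fmt.Println(rm, part, ok)
}
```

With this, the webservice never needs to know the RMID up front: it matches the changed config's partition name against the map keys and recovers the rmID from whichever key matches.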
[jira] [Created] (YUNIKORN-465) scheduler health check REST API
Kinga Marton created YUNIKORN-465: - Summary: scheduler health check REST API Key: YUNIKORN-465 URL: https://issues.apache.org/jira/browse/YUNIKORN-465 Project: Apache YuniKorn Issue Type: Bug Reporter: Kinga Marton Assignee: Kinga Marton We need to build a health check REST API for the scheduler. This is needed for chaos monkey tests; the validation script can call the API to verify the scheduler state periodically. We should leverage scheduler metrics to do the validation; things to validate include: # Negative resources on node/app/cluster # Consistency of the data, e.g. sum of the allocated resources of the apps = allocated resource in the partition # Critical errors logged in the metrics (things that should not happen but happened) # ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
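Checks 1 and 2 from the list above can be sketched in Go. `Resource`, `checkNoNegative`, and `checkAppSum` are simplified stand-ins for the real scheduler types, shown only to make the consistency rules concrete.

```go
package main

import "fmt"

// Resource is a simplified resource vector keyed by type (memory, vcore, ...).
type Resource map[string]int64

// checkNoNegative flags any negative quantity — check #1 in the issue.
func checkNoNegative(r Resource) bool {
	for _, v := range r {
		if v < 0 {
			return false
		}
	}
	return true
}

// checkAppSum verifies check #2: the sum of the apps' allocated resources
// must equal the partition's allocated resource, type by type.
func checkAppSum(apps []Resource, partition Resource) bool {
	sum := Resource{}
	for _, a := range apps {
		for k, v := range a {
			sum[k] += v
		}
	}
	if len(sum) != len(partition) {
		return false
	}
	for k, v := range partition {
		if sum[k] != v {
			return false
		}
	}
	return true
}

func main() {
	apps := []Resource{{"memory": 100, "vcore": 1}, {"memory": 200, "vcore": 2}}
	fmt.Println(checkAppSum(apps, Resource{"memory": 300, "vcore": 3}))
}
```

A health endpoint would run such checks over the live metrics and report per-check pass/fail, so the chaos monkey script only has to inspect one response.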
[jira] [Updated] (YUNIKORN-465) scheduler health check REST API
[ https://issues.apache.org/jira/browse/YUNIKORN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton updated YUNIKORN-465: -- Attachment: HealthCheck_output > scheduler health check REST API > --- > > Key: YUNIKORN-465 > URL: https://issues.apache.org/jira/browse/YUNIKORN-465 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > Labels: pull-request-available > Attachments: HealthCheck_output > > > We need to build a health check REST API for the scheduler > This is needed for chaos monkey tests, the validation script can call the API > to verify the scheduler state periodically > We should leverage scheduler metrics to do the validation, things to validate > like: > # Negative resources on node/app/cluster > # Consistency of the data, e.g sum of allocated resource of apps = allocated > resource in the partition > # critical errors logged in the metrics (things should not happen but > happened) > # ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-285) Lint check doesn't work on branch-0.9
[ https://issues.apache.org/jira/browse/YUNIKORN-285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-285. --- Fix Version/s: 0.10 Resolution: Fixed Thank you [~wilfreds] for fixing this. I merged the PRs to the master branch. > Lint check doesn't work on branch-0.9 > - > > Key: YUNIKORN-285 > URL: https://issues.apache.org/jira/browse/YUNIKORN-285 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler, shim - kubernetes, test - unit >Reporter: Weiwei Yang >Assignee: Wilfred Spiegelenburg >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10 > > > Looks like the lint check always fails on branch-0.9. For both repos, shim > and core. See the following jobs: > https://travis-ci.org/github/apache/incubator-yunikorn-core/builds > https://travis-ci.org/github/apache/incubator-yunikorn-k8shim/builds > such as > - > https://travis-ci.org/github/apache/incubator-yunikorn-k8shim/jobs/708324431 > - > https://travis-ci.org/github/apache/incubator-yunikorn-core/builds/707911720 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-285) Lint check doesn't work on branch-0.9
[ https://issues.apache.org/jira/browse/YUNIKORN-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246505#comment-17246505 ] Kinga Marton commented on YUNIKORN-285: --- [~wilfreds] We need to backport it to branch-0.9 as well, right? > Lint check doesn't work on branch-0.9 > - > > Key: YUNIKORN-285 > URL: https://issues.apache.org/jira/browse/YUNIKORN-285 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler, shim - kubernetes, test - unit >Reporter: Weiwei Yang >Assignee: Wilfred Spiegelenburg >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10 > > > Looks like the lint check always fails on branch-0.9. For both repos, shim > and core. See the following jobs: > https://travis-ci.org/github/apache/incubator-yunikorn-core/builds > https://travis-ci.org/github/apache/incubator-yunikorn-k8shim/builds > such as > - > https://travis-ci.org/github/apache/incubator-yunikorn-k8shim/jobs/708324431 > - > https://travis-ci.org/github/apache/incubator-yunikorn-core/builds/707911720 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-477) Include K8s 1.18 to e2e test matrix
[ https://issues.apache.org/jira/browse/YUNIKORN-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-477. --- Fix Version/s: 0.10 Resolution: Fixed [~wwei] I merged your changes and will open a Jira for checking the coverage issue, since it was complaining about a file that wasn't touched by your changes. > Include K8s 1.18 to e2e test matrix > > > Key: YUNIKORN-477 > URL: https://issues.apache.org/jira/browse/YUNIKORN-477 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-482) Code coverage complaining about untouched files
Kinga Marton created YUNIKORN-482: - Summary: Code coverage complaining about untouched files Key: YUNIKORN-482 URL: https://issues.apache.org/jira/browse/YUNIKORN-482 Project: Apache YuniKorn Issue Type: Bug Reporter: Kinga Marton The following change: [https://github.com/apache/incubator-yunikorn-k8shim/pull/212/checks?check_run_id=1500902292] is complaining about decreased project coverage in a file that wasn't touched by the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Comment Edited] (YUNIKORN-477) Include K8s 1.18 to e2e test matrix
[ https://issues.apache.org/jira/browse/YUNIKORN-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246528#comment-17246528 ] Kinga Marton edited comment on YUNIKORN-477 at 12/9/20, 1:19 PM: - [~wwei] I merged your changes and opened a Jira (YUNIKORN-482) for checking the coverage issue, since it was complaining about a file that wasn't touched by your changes. was (Author: kmarton): [~wwei] I merged your changes and will open a Jira for checking the coverage issue, since it was complaining about a file that wasn't touched by your changes. > Include K8s 1.18 to e2e test matrix > > > Key: YUNIKORN-477 > URL: https://issues.apache.org/jira/browse/YUNIKORN-477 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-480) root queue max resource gets reset on config load
[ https://issues.apache.org/jira/browse/YUNIKORN-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-480. --- Fix Version/s: 0.10 Resolution: Fixed > root queue max resource gets reset on config load > - > > Key: YUNIKORN-480 > URL: https://issues.apache.org/jira/browse/YUNIKORN-480 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Affects Versions: 0.10 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10 > > > When updating the configuration the root queue max resource is getting reset > to {{nil}}. The configuration should never set the root queue resources. On > creation of the queue this is not a problem as there cannot be any registered > node. On update we have registered nodes and should thus not clear it out. > When the max resource gets reset it stops allocation as the scheduler thinks > there are no nodes registered and nothing can be done. > Found by the e2e tests as part of the shim dependency change YUNIKORN-475 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-482) Code coverage complaining about untouched files
[ https://issues.apache.org/jira/browse/YUNIKORN-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247764#comment-17247764 ] Kinga Marton commented on YUNIKORN-482: --- [~wilfreds] I agree that we need to cover that negative case as well, but it should report this when that part of the code is added. My problem is that it shows this issue as a decrease in the coverage, when only the travis.yaml was modified. > Code coverage complaining about untouched files > --- > > Key: YUNIKORN-482 > URL: https://issues.apache.org/jira/browse/YUNIKORN-482 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Kinga Marton >Priority: Minor > Labels: coverage, pre-commit > Attachments: image-2020-12-11-16-30-46-654.png > > > The following change: > [https://github.com/apache/incubator-yunikorn-k8shim/pull/212/checks?check_run_id=1500902292] > is complaining about decreased project coverage in a file that wasn't touched > by the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Comment Edited] (YUNIKORN-482) Code coverage complaining about untouched files
[ https://issues.apache.org/jira/browse/YUNIKORN-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247764#comment-17247764 ] Kinga Marton edited comment on YUNIKORN-482 at 12/11/20, 9:05 AM: -- [~wilfreds] I agree that we need to cover that negative case as well, but it should report this when that part of the code is added. My problem is that it shows this issue as a decrease in the coverage, when only the travis.yaml was modified: [https://github.com/apache/incubator-yunikorn-k8shim/pull/212/files] was (Author: kmarton): [~wilfreds] I agree that we need to cover that negative case as well, but it should report this when that part of the code is added. My problem is that it shows this issue as a decrease in the coverage, when only the travis.yaml was modified. > Code coverage complaining about untouched files > --- > > Key: YUNIKORN-482 > URL: https://issues.apache.org/jira/browse/YUNIKORN-482 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Kinga Marton >Priority: Minor > Labels: coverage, pre-commit > Attachments: image-2020-12-11-16-30-46-654.png > > > The following change: > [https://github.com/apache/incubator-yunikorn-k8shim/pull/212/checks?check_run_id=1500902292] > is complaining about decreased project coverage in a file that wasn't touched > by the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-402) Make sure when there is no allocation in an app, the app state is "Waiting".
[ https://issues.apache.org/jira/browse/YUNIKORN-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-402. --- Fix Version/s: 0.10 Resolution: Fixed > Make sure when there is no allocation in an app, the app state is "Waiting". > > > Key: YUNIKORN-402 > URL: https://issues.apache.org/jira/browse/YUNIKORN-402 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > Fix For: 0.10 > > > If there is no allocation for an app, according to > [http://yunikorn.apache.org/docs/next/design/scheduler_object_states] its > status should be waiting instead of running, as mentioned here: > https://issues.apache.org/jira/browse/YUNIKORN-201?focusedCommentId=17186402&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17186402 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-114) Clone the shallow clone version of protobuf repo
[ https://issues.apache.org/jira/browse/YUNIKORN-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-114. --- Fix Version/s: 0.10 Resolution: Fixed > Clone the shallow clone version of protobuf repo > > > Key: YUNIKORN-114 > URL: https://issues.apache.org/jira/browse/YUNIKORN-114 > Project: Apache YuniKorn > Issue Type: Bug > Components: scheduler-interface >Reporter: Adam Antal >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > When building the scheduler interface, we pull the whole protobuf repo. While > the source files are not that big, the git history (that we actually don't > need) makes it a bit slower to clone it. > We actually want to check out the latest tag, which we could do without > cloning by getting the tag with {{git ls-remote --tags}} and > cloning/checking out just that revision with {{git clone --depth=1}}. Though > the scheduler build time is not a bottleneck, I think we can improve the > build time a bit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-484) Handle the app completion in the core side
[ https://issues.apache.org/jira/browse/YUNIKORN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247952#comment-17247952 ] Kinga Marton commented on YUNIKORN-484: --- [~wwei] most of the workflow seems ok to me, but I have a question related to the first step: {quote}when the core sees there is no pending ask and running containers, it moves the app to the "Completed" state {quote} here I think we should keep the app in the Waiting state for a while and use a timeout for moving it to the Completed state. > Handle the app completion in the core side > -- > > Key: YUNIKORN-484 > URL: https://issues.apache.org/jira/browse/YUNIKORN-484 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Weiwei Yang >Assignee: Wilfred Spiegelenburg >Priority: Major > > Currently, if there is no pending ask or running container for an app, we > transition it to the "Waiting" state. To step further, we keep the "Waiting" > state for a short period of time and then transition the state to "Completed". > Before YUNIKORN-2 is done, this is to track the core side changes to do the > state transition. When the app moves to the completed state, the core needs > to send a "UpdateResponse#UpdatedApplication" so the shim can do proper > cleanup. > After YUNIKORN-2 is done, when the app is "Completed", core needs to ask the > shim to release all the placeholder pods. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-418) Add "config" REST API
[ https://issues.apache.org/jira/browse/YUNIKORN-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247968#comment-17247968 ] Kinga Marton commented on YUNIKORN-418: --- I agree with [~wwei] that for simplicity we should keep the "v1/config". Also it would be nice to have all the config related endpoint in the same place, but I think the nicer solution would be to use POST for doing the validation instead of adding the dry_run parameter. > Add "config" REST API > - > > Key: YUNIKORN-418 > URL: https://issues.apache.org/jira/browse/YUNIKORN-418 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Assigned] (YUNIKORN-484) Handle the app completion in the core side
[ https://issues.apache.org/jira/browse/YUNIKORN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton reassigned YUNIKORN-484: - Assignee: Kinga Marton (was: Wilfred Spiegelenburg) > Handle the app completion in the core side > -- > > Key: YUNIKORN-484 > URL: https://issues.apache.org/jira/browse/YUNIKORN-484 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Weiwei Yang >Assignee: Kinga Marton >Priority: Major > > Currently, if there is no pending ask or running container for an app, we > transmit it to the "Waiting" state. To step further, we keep the "Waiting" > state for a short period of time and then transit the state to "Completed". > Before YUNIKORN-2 is done, this is to track the core side changes to do the > state transition. When the app moves to the completed state, the core needs > to send a "UpdateResponse#UpdatedApplication" so the shim can do proper > cleanup. > After YUNIKORN-2 is done, when the app is "Completed", core needs to ask the > shim to release all the placeholder pods. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-484) Handle the app completion in the core side
[ https://issues.apache.org/jira/browse/YUNIKORN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249696#comment-17249696 ] Kinga Marton commented on YUNIKORN-484: --- [~wilfreds], I started to implement the Completed state for the *non gang scheduling case* and I have a question related to the removal of the application from the queue: - the design doc states that "_Entering into the completed state will move the application out of the queue automatically._ " - right now the applications are removed from the partition and queue when the shim sends a {{RemoveApplicationRequest}} to the core If we remove the application from the partition and queue right after the transition to the Completed state, we will lose the application in the UI. Is it OK that we will not be able to track the already completed applications? > Handle the app completion in the core side > -- > > Key: YUNIKORN-484 > URL: https://issues.apache.org/jira/browse/YUNIKORN-484 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Weiwei Yang >Assignee: Kinga Marton >Priority: Major > > Currently, if there is no pending ask or running container for an app, we > transition it to the "Waiting" state. To step further, we keep the "Waiting" > state for a short period of time and then transition the state to "Completed". > Before YUNIKORN-2 is done, this is to track the core side changes to do the > state transition. When the app moves to the completed state, the core needs > to send a "UpdateResponse#UpdatedApplication" so the shim can do proper > cleanup. > After YUNIKORN-2 is done, when the app is "Completed", core needs to ask the > shim to release all the placeholder pods. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Comment Edited] (YUNIKORN-484) Handle the app completion in the core side
[ https://issues.apache.org/jira/browse/YUNIKORN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249696#comment-17249696 ] Kinga Marton edited comment on YUNIKORN-484 at 12/16/20, 5:16 PM: -- [~wwei], [~wilfreds], I started to implement the Completed state for the *non gang scheduling case* and I have a question related to the removal of the application from the queue: - the design doc states that "_Entering into the completed state will move the application out of the queue automatically._ " - right now the applications are removed from the partition and queue when the shim sends a {{RemoveApplicationRequest}} to the core If we remove the application from the partition and queue right after the transition to the Completed state, we will lose the application in the UI. Is it OK that we will not be able to track the already completed applications? was (Author: kmarton): [~wilfreds], I started to implement the Completed state for the *non gang scheduling case* and I have a question related to the removal of the application from the queue: - the design doc states that "_Entering into the completed state will move the application out of the queue automatically._ " - right now the applications are removed from the partition and queue when the shim sends a {{RemoveApplicationRequest}} to the core If we remove the application from the partition and queue right after the transition to the Completed state, we will lose the application in the UI. Is it OK that we will not be able to track the already completed applications? > Handle the app completion in the core side > -- > > Key: YUNIKORN-484 > URL: https://issues.apache.org/jira/browse/YUNIKORN-484 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Weiwei Yang >Assignee: Kinga Marton >Priority: Major > Labels: pull-request-available > > Currently, if there is no pending ask or running container for an app, we > transition it to the "Waiting" state. 
To step further, we keep the "Waiting" > state for a short period of time and then transition the state to "Completed". > Before YUNIKORN-2 is done, this is to track the core side changes to do the > state transition. When the app moves to the completed state, the core needs > to send a "UpdateResponse#UpdatedApplication" so the shim can do proper > cleanup. > After YUNIKORN-2 is done, when the app is "Completed", core needs to ask the > shim to release all the placeholder pods. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-503) Fix recovery for completed apps
Kinga Marton created YUNIKORN-503: - Summary: Fix recovery for completed apps Key: YUNIKORN-503 URL: https://issues.apache.org/jira/browse/YUNIKORN-503 Project: Apache YuniKorn Issue Type: Sub-task Reporter: Kinga Marton After implementing the completed state, we need to fix the recovery part. Right now, if there is an application in the Waiting or Completed state and we restart the scheduler, the recreated application will be in the New state; if no pods are assigned to that app, it will never transition to the Completed state. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-504) Show the completed apps in the UI
Kinga Marton created YUNIKORN-504: - Summary: Show the completed apps in the UI Key: YUNIKORN-504 URL: https://issues.apache.org/jira/browse/YUNIKORN-504 Project: Apache YuniKorn Issue Type: Sub-task Reporter: Kinga Marton After YUNIKORN-484, we will store the completed apps in a separate list from the other ones. It would be useful if we could show these apps in the UI as well, but in a separate table from the Apps table. cc [~ayubpathan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-507) Add git version pre-requisite in the build guide
[ https://issues.apache.org/jira/browse/YUNIKORN-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264720#comment-17264720 ] Kinga Marton commented on YUNIKORN-507: --- [~wwei], I don't have the exact version from which it is already working. What I know is that it is working with 2.22.0.rc2 and newer. I think we should mention 2.22. > Add git version pre-requisite in the build guide > > > Key: YUNIKORN-507 > URL: https://issues.apache.org/jira/browse/YUNIKORN-507 > Project: Apache YuniKorn > Issue Type: Task > Components: documentation >Reporter: Weiwei Yang >Priority: Minor > > Recently we found our build will fail if the git version is too old, e.g 2.4.x > We should document this on our web-site: > http://yunikorn.apache.org/docs/next/developer_guide/build. To require a > minimal version of git to be installed before trying to build YK. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-513) ApplicationState remains in Accepted after recovery
Kinga Marton created YUNIKORN-513: - Summary: ApplicationState remains in Accepted after recovery Key: YUNIKORN-513 URL: https://issues.apache.org/jira/browse/YUNIKORN-513 Project: Apache YuniKorn Issue Type: Bug Components: core - cache Affects Versions: 0.10 Reporter: Kinga Marton Assignee: Kinga Marton Steps to reproduce: * Start 2 sleep jobs * Wait for both to run and applicationState to be Running * Kill yunikorn * After 10 minutes, the REST call now shows both applicationStates as Accepted instead of Running -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-514) Intermittent issues in e2e tests after YUNIKORN-317
[ https://issues.apache.org/jira/browse/YUNIKORN-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272112#comment-17272112 ] Kinga Marton commented on YUNIKORN-514: --- Thank you [~wwei] for the fix! I merged your changes to both the master and branch-0.10 branches. Before resolving this issue, please update the dependency on the shim side. > Intermittent issues in e2e tests after YUNIKORN-317 > --- > > Key: YUNIKORN-514 > URL: https://issues.apache.org/jira/browse/YUNIKORN-514 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Labels: pull-request-available > > Post YUNIKORN-317, we've seen some intermittent issues in e2e tests, such as > https://travis-ci.com/github/apache/incubator-yunikorn-k8shim/builds/213126549. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-513) ApplicationState remains in Accepted after recovery
[ https://issues.apache.org/jira/browse/YUNIKORN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272320#comment-17272320 ] Kinga Marton commented on YUNIKORN-513: --- After the core refactoring, the app state transition happens in the following steps: * New -> Accepted: when an allocationAsk is processed * Accepted -> Starting: when the allocation is processed * Starting -> Running: when the second allocation is processed or when the Starting state times out. In case of recovery, we don't have an AllocationAsk, just already-existing Allocations, so the first transition is skipped. This means that if we have only 2 allocations, the application will not progress into the Running state. For recovery, we need to progress it manually from New to Accepted. > ApplicationState remains in Accepted after recovery > --- > > Key: YUNIKORN-513 > URL: https://issues.apache.org/jira/browse/YUNIKORN-513 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - cache >Affects Versions: 0.10 >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > Labels: pull-request-available > > Steps to reproduce: > * Start 2 sleep jobs > * Wait for both to run and applicationState to be Running > * Kill yunikorn > * After 10 minutes, the REST call now shows both applicationStates as > Accepted instead of Running -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-517) Yunikorn v0.10 logs are filled with "clean up orphan pod" message for every 5 seconds
[ https://issues.apache.org/jira/browse/YUNIKORN-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-517. --- Resolution: Duplicate This is already addressed by YUNIKORN-512 > Yunikorn v0.10 logs are filled with "clean up orphan pod" message for every 5 > seconds > - > > Key: YUNIKORN-517 > URL: https://issues.apache.org/jira/browse/YUNIKORN-517 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Ayub Pathan >Priority: Major > > {noformat} > 2021-01-24T15:04:25.940Z INFO cache/placeholder_manager.go:148 clean up > orphan pod > 2021-01-24T15:04:30.940Z INFO cache/placeholder_manager.go:148 clean up > orphan pod > 2021-01-24T15:04:35.940Z INFO cache/placeholder_manager.go:148 clean up > orphan pod > 2021-01-24T15:04:40.944Z INFO cache/placeholder_manager.go:148 clean up > orphan pod > 2021-01-24T15:04:45.947Z INFO cache/placeholder_manager.go:148 clean up > orphan pod \{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Assigned] (YUNIKORN-512) Remove some useless log messages
[ https://issues.apache.org/jira/browse/YUNIKORN-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton reassigned YUNIKORN-512: - Assignee: Weiwei Yang > Remove some useless log messages > > > Key: YUNIKORN-512 > URL: https://issues.apache.org/jira/browse/YUNIKORN-512 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler, shim - kubernetes >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Minor > Labels: pull-request-available > Attachments: k9s.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-512) Remove some useless log messages
[ https://issues.apache.org/jira/browse/YUNIKORN-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272704#comment-17272704 ] Kinga Marton commented on YUNIKORN-512: --- Thank you [~wwei] for handling this. I merged your PR both to master and branch-0.10 > Remove some useless log messages > > > Key: YUNIKORN-512 > URL: https://issues.apache.org/jira/browse/YUNIKORN-512 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler, shim - kubernetes >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Minor > Labels: pull-request-available > Attachments: k9s.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-512) Remove some useless log messages
[ https://issues.apache.org/jira/browse/YUNIKORN-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-512. --- Fix Version/s: 0.10 Resolution: Fixed > Remove some useless log messages > > > Key: YUNIKORN-512 > URL: https://issues.apache.org/jira/browse/YUNIKORN-512 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler, shim - kubernetes >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Minor > Labels: pull-request-available > Fix For: 0.10 > > Attachments: k9s.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-519) Cleanup placeholders when the app is Completed
Kinga Marton created YUNIKORN-519: - Summary: Cleanup placeholders when the app is Completed Key: YUNIKORN-519 URL: https://issues.apache.org/jira/browse/YUNIKORN-519 Project: Apache YuniKorn Issue Type: Sub-task Reporter: Kinga Marton Assignee: Kinga Marton App completion is handled by YUNIKORN-484, however for the Gang scheduling we need to do some further cleanup around the unused placeholders. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-507) Add git version pre-requisite in the build guide
[ https://issues.apache.org/jira/browse/YUNIKORN-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274461#comment-17274461 ] Kinga Marton commented on YUNIKORN-507: --- [~wilfreds] I merged your changes to the master branch. It is enough to have it on that branch, right? > Add git version pre-requisite in the build guide > > > Key: YUNIKORN-507 > URL: https://issues.apache.org/jira/browse/YUNIKORN-507 > Project: Apache YuniKorn > Issue Type: Task > Components: documentation >Reporter: Weiwei Yang >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > > Recently we found our build will fail if the git version is too old, e.g 2.4.x > We should document this on our web-site: > http://yunikorn.apache.org/docs/next/developer_guide/build. To require a > minimal version of git to be installed before trying to build YK. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-460) Handle app reservation timeout
[ https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274516#comment-17274516 ] Kinga Marton commented on YUNIKORN-460: --- Yes, [~wwei]. Right now we have a field in the application called {{execTimeout}}, and we already have it in the interface as well. I think we should use this field to set the timeout, so the workflow would be the following: - The user can define this timeout in an annotation such as: {{yunikorn.apache.org/schedulingPolicyParameters: “timeoutInSec=600”}} - The shim will process this information and, when the Application is created and sent to the core, populate this {{execTimeout}} field - In the core, when we start to schedule the application (so the queue has enough headroom for the gang members), we can start the timer. - This timeout is the time measured from the start of scheduling until the application runs, so we will reset the timer when the application progresses into the Completed state (or the Waiting state, reinitialising it if an application in the Waiting state gets new allocations). - If it times out, we can kill the application and send an UpdateResponse message to the shim with the following content: - UpdatedApplication for the state change of the application - AllocationRelease messages with the actual allocations of the application that need to be released. [~wwei], [~wilfreds] what are your thoughts? > Handle app reservation timeout > -- > > Key: YUNIKORN-460 > URL: https://issues.apache.org/jira/browse/YUNIKORN-460 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Kinga Marton >Priority: Major > > When an app is configured with a timeout, that determines the maximum time > permitted to stay in the Reserving phase. If that times out, then all the > existing placeholders should be deleted and the application will be scheduled > normally. 
This timeout is needed because otherwise an app’s partial > placeholders may occupy cluster resources and they are wasted. > See more in [this > doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-525) Update Application states doc
Kinga Marton created YUNIKORN-525: - Summary: Update Application states doc Key: YUNIKORN-525 URL: https://issues.apache.org/jira/browse/YUNIKORN-525 Project: Apache YuniKorn Issue Type: Bug Reporter: Kinga Marton Assignee: Kinga Marton Now that we have implemented and redefined the completed state of an application, we need to update the documentation as well: [http://yunikorn.apache.org/docs/next/design/scheduler_object_states/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-533) Improve admission controller logging
Kinga Marton created YUNIKORN-533: - Summary: Improve admission controller logging Key: YUNIKORN-533 URL: https://issues.apache.org/jira/browse/YUNIKORN-533 Project: Apache YuniKorn Issue Type: Improvement Reporter: Kinga Marton Right now it is not easy to debug issues related to the admission controller. The logging level is also not configurable: even though we have some log entries defined at debug level, we cannot change the level, which defaults to INFO. We need the following improvements related to the admission controller logging: * review and clean up the logging * make the log level configurable -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
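A minimal, stdlib-only sketch of the configurable log level requested above. The `LOG_LEVEL` environment variable and the level-gating wrapper are purely illustrative assumptions; how the level is actually wired in (flag, ConfigMap, zap configuration) is the open question of this ticket.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Log levels in increasing severity.
const (
	DEBUG = iota
	INFO
	WARN
)

// levelFromEnv reads the desired log level from the environment.
// LOG_LEVEL is an illustrative name, not an existing YuniKorn setting.
func levelFromEnv() int {
	switch strings.ToUpper(os.Getenv("LOG_LEVEL")) {
	case "DEBUG":
		return DEBUG
	case "WARN":
		return WARN
	default:
		return INFO // mirrors the current hard-coded default
	}
}

// leveledLogger drops entries below the configured minimum level.
type leveledLogger struct{ min int }

func (l leveledLogger) logf(level int, msg string) string {
	if level < l.min {
		return "" // suppressed: below the configured threshold
	}
	return msg
}

func main() {
	l := leveledLogger{min: levelFromEnv()}
	fmt.Println(l.logf(DEBUG, "admission webhook request received") == "")
	fmt.Println(l.logf(WARN, "failed to patch pod"))
}
```

With the default INFO level the debug entry is dropped, which is exactly why the existing debug-level entries in the admission controller are currently invisible.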
[jira] [Assigned] (YUNIKORN-533) Improve admission controller logging
[ https://issues.apache.org/jira/browse/YUNIKORN-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton reassigned YUNIKORN-533: - Assignee: Kinga Marton > Improve admission controller logging > > > Key: YUNIKORN-533 > URL: https://issues.apache.org/jira/browse/YUNIKORN-533 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > > Right now it is not so easy to debug issues related to the admission > controller, also the logging level is not configurable, even if we have some > log entries defined at debug level, I think we cannot change the level, what > by default is INFO. > We need the following improvements related the admission controller logging: > * review and cleanup the logging > * make the log level configurable -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Comment Edited] (YUNIKORN-460) Handle app reservation timeout
[ https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276393#comment-17276393 ] Kinga Marton edited comment on YUNIKORN-460 at 2/1/21, 3:04 PM: Today we had a sync with [~wilfreds] on this topic. I am summarising here what we discussed. When it comes to the timeout we have 2 cases: # The queue is full, so only part of the placeholders got allocated (for example the app asks for 100GB but the placeholders are only using 50GB) # The placeholders are all allocated, but not all of them were replaced by real pods (this can be due to a configuration issue, but also because something changed in the cluster) If it times out we will kill the placeholder pods in both cases, but we will not kill the whole application, so in the second case the already running real pods will continue to do their job. We kill only the placeholders. * We will start the timer when the first placeholder gets allocated * When it times out we just kill all the placeholders, if we have any [~wwei] related to the new state you mentioned, I don't think we can add this new state, because when the first placeholder is replaced by the new pod, the application is already transitioning into the Running state. I don't think it is a good idea to treat a simple app differently from one with a gang defined with respect to when it starts running. [~wilfreds] please correct me if I am wrong, or if I missed something. was (Author: kmarton): Today we had a sync with [~wilfreds] on this topic. 
I am summarising here what we discussed: When it comes about the timeout we have 2 cases # The queue is full, so only a part of the placeholders got allocated(for example the app ask for 100GB but the placeholders are using 50GB) # The placeholders are all allocated, but not all of them were replaced by real pods ( it can be due to configuration issue, but can be because something is changed in the cluster as well) We will kill the placeholder pods in both cases if it times out, but we will not kill the whole application, so in the second case the already running real pods will continue to do their job. We kill only the placeholders. * We will start the timer when the first placeholder is getting allocated * When it times out we just kill all the placeholders if we have any [~wwei] related to the new state you mentioned, I don't think that we can add this new state, because when the first placeholder is replaced by the new pod, the application is already transitioning into the Running state. I don't think it is a good idea to make a difference between a simple app and one with a gang defined related to when it will start running. > Handle app reservation timeout > -- > > Key: YUNIKORN-460 > URL: https://issues.apache.org/jira/browse/YUNIKORN-460 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Kinga Marton >Priority: Major > > When an app is configured with a timeout, that determines the maximum time > permitted to stay in the Reserving phase. If that times out, then all the > existing placeholders should be deleted and the application will be scheduled > normally. This timeout is needed because otherwise an app’s partial > placeholders may occupy cluster resources and they are wasted. 
> See more in [this > doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-460) Handle app reservation timeout
[ https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276393#comment-17276393 ] Kinga Marton commented on YUNIKORN-460: --- Today we had a sync with [~wilfreds] on this topic. I am summarising here what we discussed. When it comes to the timeout we have 2 cases: # The queue is full, so only part of the placeholders got allocated (for example the app asks for 100GB but the placeholders are only using 50GB) # The placeholders are all allocated, but not all of them were replaced by real pods (this can be due to a configuration issue, but also because something changed in the cluster) If it times out we will kill the placeholder pods in both cases, but we will not kill the whole application, so in the second case the already running real pods will continue to do their job. We kill only the placeholders. * We will start the timer when the first placeholder gets allocated * When it times out we just kill all the placeholders, if we have any [~wwei] related to the new state you mentioned, I don't think we can add this new state, because when the first placeholder is replaced by the new pod, the application is already transitioning into the Running state. I don't think it is a good idea to treat a simple app differently from one with a gang defined with respect to when it starts running. > Handle app reservation timeout > -- > > Key: YUNIKORN-460 > URL: https://issues.apache.org/jira/browse/YUNIKORN-460 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Kinga Marton >Priority: Major > > When an app is configured with a timeout, that determines the maximum time > permitted to stay in the Reserving phase. If that times out, then all the > existing placeholders should be deleted and the application will be scheduled > normally. 
This timeout is needed because otherwise an app’s partial > placeholders may occupy cluster resources and they are wasted. > See more in [this > doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-460) Handle app reservation timeout
[ https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277017#comment-17277017 ] Kinga Marton commented on YUNIKORN-460: --- Yesterday [~wwei] mentioned a scenario that is not covered by the previous design: * if we don't have any placeholders allocated, so all of them are pending and we only have the AllocationAsks. Covering this case as well would mean starting the timer when we try to allocate the placeholders instead of waiting for the first placeholder allocation. When it times out, we should remove not only the placeholder allocations but the AllocationAsks as well. Since right now there is no way to send back to the shim the AllocationAsks that need to be removed, the simpler solution is to fail the application, or add a new state; based on the termination state the shim will be able to handle both the asks and the allocations on its side. [~wwei] please correct me if I missed something. [~wilfreds] what do you think about this approach? I think we should cover the mentioned case as well. > Handle app reservation timeout > -- > > Key: YUNIKORN-460 > URL: https://issues.apache.org/jira/browse/YUNIKORN-460 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Kinga Marton >Priority: Major > > When an app is configured with a timeout, that determines the maximum time > permitted to stay in the Reserving phase. If that times out, then all the > existing placeholders should be deleted and the application will be scheduled > normally. This timeout is needed because otherwise an app’s partial > placeholders may occupy cluster resources and they are wasted. 
> See more in [this > doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-521) Placeholder pods are not cleaned up even when the job is deleted
[ https://issues.apache.org/jira/browse/YUNIKORN-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277211#comment-17277211 ] Kinga Marton commented on YUNIKORN-521: --- [~ayubpathan], right now in Yunikorn we don't have the Job definition. We just listen to when the pods are created and use the configured ApplicationId to match the pods to an application, so we can have different pods belonging to different jobs but still part of the same application, if we pass the ID accordingly. If we created a one-to-one mapping between the Job and the internal application, that would break this functionality. Right now in this case the application will transition into the Waiting state if no new pods are attached to the application, then into the Completed state. At this point the placeholders will be cleaned up. [~ayubpathan], [~wwei] do you think this solution is acceptable, or should we find another way to delete the placeholders when the job is deleted as well? The total time the unused placeholders will occupy the resources after the job is finished is 30 seconds (this is the waiting timeout). I think this is acceptable. > Placeholder pods are not cleaned up even when the job is deleted > > > Key: YUNIKORN-521 > URL: https://issues.apache.org/jira/browse/YUNIKORN-521 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Ayub Pathan >Assignee: Kinga Marton >Priority: Major > Attachments: job.yaml, ns.yaml > > > This one is a negative test... > * Create a namespace with quota > * Submit a job where the placeholder pods resource requests are more than > queue quota. > * Delete the job using kubectl > * Still the placeholder pods are in running state occupying the resources. > From an end user perspective, each job is an application consisting of all > related pods. If the user decides to purge the job, Yunikorn should also > recognize this action and clean up the placeholder pods. 
> From a yunikorn point of view, the application and job are 2 different > entities. The placeholder pods are not cleaned up because the application is > still alive even though the job is deleted. Does it make sense to create a > one on one mapping for job and application? Once the lifecycle of job is > complete, application should also terminate in Yunikorn world. Let me know > your thoughts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Comment Edited] (YUNIKORN-521) Placeholder pods are not cleaned up even when the job is deleted
[ https://issues.apache.org/jira/browse/YUNIKORN-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277211#comment-17277211 ] Kinga Marton edited comment on YUNIKORN-521 at 2/2/21, 3:47 PM: [~ayubpathan], right now in Yunikorn we don't have the Job definition. We just listen to when the pods are created and use the configured ApplicationId to match the pods to an application, so we can have different pods belonging to different jobs but still part of the same application, if we pass the ID accordingly. If we created a one-to-one mapping between the Job and the internal application, that would break this functionality. Right now in this case the application will transition into the Waiting state if no new pods are attached to the application, then into the Completed state. At this point the placeholders will be cleaned up. [~ayubpathan], [~wwei] do you think this solution is acceptable, or should we find another way to delete the placeholders when the job is deleted as well? The total time the unused placeholders will occupy the resources after the job is finished is 40 seconds (waiting timeout + clean interval). I think this is acceptable. was (Author: kmarton): [~ayubpathan], right now in Yunikorn we don't have the Job definition. We just listen to when the pods are created and use the configured ApplicationId to match the pods to an application, so we can have different pods belonging to different jobs, but still part of the same application, if we pass the ID accordingly. If we would create one on one mapping for the Job and internal application that would break this functionality. Right now in this case the application will transit into Waiting state if no new pods will be attached to the application, then into Completed state. At this point the placeholders will be cleaned up. 
[~ayubpathan], [~wwei] do you think this solution is acceptable, or we should find out something else to delete the placeholders when the job is deleted as well? The total time the unused placeholders will occupy the resources after the job is finished is 30 seconds (this is the waiting timeout). I think this is acceptable. > Placeholder pods are not cleaned up even when the job is deleted > > > Key: YUNIKORN-521 > URL: https://issues.apache.org/jira/browse/YUNIKORN-521 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Ayub Pathan >Assignee: Kinga Marton >Priority: Major > Attachments: job.yaml, ns.yaml > > > This one is a negative test... > * Create a namespace with quota > * Submit a job where the placeholder pods resource requests are more than > queue quota. > * Delete the job using kubectl > * Still the placeholder pods are in running state occupying the resources. > From an end user perspective, each job is an application consisting of all > related pods. If the user decides to purge the job, Yunikorn should also > recognize this action and clean up the placeholder pods. > From a yunikorn point of view, the application and job are 2 different > entities. The placeholder pods are not cleaned up because the application is > still alive even though the job is deleted. Does it make sense to create a > one on one mapping for job and application? Once the lifecycle of job is > complete, application should also terminate in Yunikorn world. Let me know > your thoughts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-510) Remove the sleep in placeholder manager stop function
[ https://issues.apache.org/jira/browse/YUNIKORN-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277925#comment-17277925 ] Kinga Marton commented on YUNIKORN-510: --- Thank you [~Huang Ting Yao] for addressing this! I merged your changes both to branch-0.10 and master > Remove the sleep in placeholder manager stop function > - > > Key: YUNIKORN-510 > URL: https://issues.apache.org/jira/browse/YUNIKORN-510 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Ting Yao,Huang >Priority: Minor > Labels: pull-request-available > > There is a 3s sleep in the stop function of placeholder manager, per Tingyao: > "When we send the struct{}{} to stopChan, the Start() might not set Running > to false immediately. Or we can move sleep to > TestPlaceholderManagerStartStop(), which is located in > placeholder_manager_test.go." > We should remove this from the stop function and move this to the UT code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-510) Remove the sleep in placeholder manager stop function
[ https://issues.apache.org/jira/browse/YUNIKORN-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton updated YUNIKORN-510: -- Fix Version/s: 0.10 > Remove the sleep in placeholder manager stop function > - > > Key: YUNIKORN-510 > URL: https://issues.apache.org/jira/browse/YUNIKORN-510 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Ting Yao,Huang >Priority: Minor > Labels: pull-request-available > Fix For: 0.10 > > > There is a 3s sleep in the stop function of placeholder manager, per Tingyao: > "When we send the struct{}{} to stopChan, the Start() might not set Running > to false immediately. Or we can move sleep to > TestPlaceholderManagerStartStop(), which is located in > placeholder_manager_test.go." > We should remove this from the stop function and move this to the UT code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-510) Remove the sleep in placeholder manager stop function
[ https://issues.apache.org/jira/browse/YUNIKORN-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-510. --- Resolution: Fixed > Remove the sleep in placeholder manager stop function > - > > Key: YUNIKORN-510 > URL: https://issues.apache.org/jira/browse/YUNIKORN-510 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Ting Yao,Huang >Priority: Minor > Labels: pull-request-available > Fix For: 0.10 > > > There is a 3s sleep in the stop function of placeholder manager, per Tingyao: > "When we send the struct{}{} to stopChan, the Start() might not set Running > to false immediately. Or we can move sleep to > TestPlaceholderManagerStartStop(), which is located in > placeholder_manager_test.go." > We should remove this from the stop function and move this to the UT code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
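One way to drop the 3s sleep discussed in YUNIKORN-510 is to have the run loop close a "done" channel once it has observed the stop signal, so Stop() waits deterministically instead of sleeping. This is a hedged sketch of the pattern only; the field and method names do not come from the actual placeholder manager.

```go
package main

import "fmt"

// manager sketches a start/stop lifecycle synchronized with channels
// instead of time.Sleep. Illustrative names, not the shim's real code.
type manager struct {
	stopChan chan struct{}
	doneChan chan struct{}
}

func newManager() *manager {
	return &manager{stopChan: make(chan struct{}), doneChan: make(chan struct{})}
}

// Start launches the run loop; closing doneChan signals the loop exited.
func (m *manager) Start() {
	go func() {
		defer close(m.doneChan) // the loop has fully stopped
		<-m.stopChan            // the real loop would select on work here too
	}()
}

// Stop blocks until the run loop has actually exited -- no sleeping.
func (m *manager) Stop() {
	close(m.stopChan)
	<-m.doneChan
}

func main() {
	m := newManager()
	m.Start()
	m.Stop()
	fmt.Println("stopped cleanly")
}
```

With this shape, neither production code nor the unit test needs a sleep: the test simply calls Stop() and it returns only after the loop is gone.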
[jira] [Commented] (YUNIKORN-537) log spam in CalculateAbsUsedCapacity
[ https://issues.apache.org/jira/browse/YUNIKORN-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277956#comment-17277956 ] Kinga Marton commented on YUNIKORN-537: --- Thank you [~wilfreds] for addressing this issue! I merged your changes to both branch-0.10 and master > log spam in CalculateAbsUsedCapacity > > > Key: YUNIKORN-537 > URL: https://issues.apache.org/jira/browse/YUNIKORN-537 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - common >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > > The log gets spammed by a call in the Resource object of the core: > {code:java} > 2021-02-02T08:55:32.797Z WARNresources/resources.go:817 Cannot > calculate absolute capacity because of missing capacity or usage > 2021-02-02T08:55:32.798Z WARNresources/resources.go:817 Cannot > calculate absolute capacity because of missing capacity or usage > 2021-02-02T08:55:34.862Z WARNresources/resources.go:817 Cannot > calculate absolute capacity because of missing capacity or usage {code} > This is linked to a REST call and should not be logged as a warning but at a > debug level at best. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-537) log spam in CalculateAbsUsedCapacity
[ https://issues.apache.org/jira/browse/YUNIKORN-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton resolved YUNIKORN-537. --- Fix Version/s: 0.10 Resolution: Fixed > log spam in CalculateAbsUsedCapacity > > > Key: YUNIKORN-537 > URL: https://issues.apache.org/jira/browse/YUNIKORN-537 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - common >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > The log gets spammed by a call in the Resource object of the core: > {code:java} > 2021-02-02T08:55:32.797Z WARNresources/resources.go:817 Cannot > calculate absolute capacity because of missing capacity or usage > 2021-02-02T08:55:32.798Z WARNresources/resources.go:817 Cannot > calculate absolute capacity because of missing capacity or usage > 2021-02-02T08:55:34.862Z WARNresources/resources.go:817 Cannot > calculate absolute capacity because of missing capacity or usage {code} > This is linked to a REST call and should not be logged as a warning but at a > debug level at best. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Assigned] (YUNIKORN-536) Add resource requests/limits for the admission-controller
[ https://issues.apache.org/jira/browse/YUNIKORN-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton reassigned YUNIKORN-536: - Assignee: Weiwei Yang > Add resource requests/limits for the admission-controller > - > > Key: YUNIKORN-536 > URL: https://issues.apache.org/jira/browse/YUNIKORN-536 > Project: Apache YuniKorn > Issue Type: Task > Components: shim - kubernetes >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Labels: pull-request-available > > We need to specify resource requests/limits for the admission-controller. > Start the pod as a best-effort QoS class could possibly cause issues when the > node is under heavy load. That can slow down the admission-controller and > subsequentially slows down the api-server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Reopened] (YUNIKORN-513) ApplicationState remains in Accepted after recovery
[ https://issues.apache.org/jira/browse/YUNIKORN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton reopened YUNIKORN-513: --- > ApplicationState remains in Accepted after recovery > --- > > Key: YUNIKORN-513 > URL: https://issues.apache.org/jira/browse/YUNIKORN-513 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - cache >Affects Versions: 0.10 >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > Steps to reproduce: > * Start 2 sleep jobs > * Wait for both to run and applicationState to be Running > * Kill yunikorn > * After 10 minutes, the rest call now shows both applicationState as > accepted instead of running -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-513) ApplicationState remains in Accepted after recovery
[ https://issues.apache.org/jira/browse/YUNIKORN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278024#comment-17278024 ] Kinga Marton commented on YUNIKORN-513: --- [~rozhang], the steps during recovery are the following: * When the application is created on the core side, it will have the New state * During the node recovery we recover the existing allocations as well. * When the first allocation is recovered, the application transitions into the Starting state through the Accepted one. So if there is only one allocation to recover, the application will stay in the Starting state until it times out and auto-progresses into the Running one * When the second allocation is recovered, the application transitions into the Running state Actually the expected behaviour is exactly the same as for normal app and task submission. So a 2-allocation application (if both pods are in running state when the recovery happens) is expected to be in the Running state. However I checked it now and it seems to be broken again. I think the locking has some issues; I suspect a deadlock, so I will reopen this issue. [~rozhang] have you encountered any issues as well? > ApplicationState remains in Accepted after recovery > --- > > Key: YUNIKORN-513 > URL: https://issues.apache.org/jira/browse/YUNIKORN-513 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - cache >Affects Versions: 0.10 >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > Steps to reproduce: > * Start 2 sleep jobs > * Wait for both to run and applicationState to be Running > * Kill yunikorn > * After 10 minutes, the rest call now shows both applicationState as > accepted instead of running -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
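The recovery transitions described in the comment above can be sketched as a tiny state table. The state names match the scheduler object states doc referenced in YUNIKORN-525; the transition helper itself is illustrative, not the core's actual state machine.

```go
package main

import "fmt"

// appState mirrors the names from the scheduler object states doc.
type appState string

const (
	New      appState = "New"
	Accepted appState = "Accepted"
	Starting appState = "Starting"
	Running  appState = "Running"
)

// onAllocationRecovered advances the application exactly as during normal
// allocation: the first recovered allocation moves New -> (Accepted) ->
// Starting; any further allocation moves Starting -> Running.
func onAllocationRecovered(s appState) appState {
	switch s {
	case New:
		return Starting // passes through Accepted on the way
	case Accepted, Starting:
		return Running
	default:
		return s
	}
}

func main() {
	s := New
	s = onAllocationRecovered(s) // first recovered allocation
	fmt.Println(s)               // Starting
	s = onAllocationRecovered(s) // second recovered allocation
	fmt.Println(s)               // Running
}
```

This also shows the single-allocation corner case from the comment: with only one allocation to recover, the app parks in Starting until the timeout auto-progresses it to Running.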
[jira] [Updated] (YUNIKORN-538) Fix node recovery
[ https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kinga Marton updated YUNIKORN-538: -- Target Version: 0.10 > Fix node recovery > - > > Key: YUNIKORN-538 > URL: https://issues.apache.org/jira/browse/YUNIKORN-538 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Kinga Marton >Priority: Major > > After the changes made in YUNIKORN-518 node recovery is broken. > Actually, the nodes are not recovered. The first commit when the issue is > observed is the following one: > [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-538) Fix node recovery
Kinga Marton created YUNIKORN-538: - Summary: Fix node recovery Key: YUNIKORN-538 URL: https://issues.apache.org/jira/browse/YUNIKORN-538 Project: Apache YuniKorn Issue Type: Bug Components: shim - kubernetes Reporter: Kinga Marton After the changes made in YUNIKORN-518, node recovery is broken: the nodes are not recovered at all. The first commit where the issue is observed is the following: [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-513) ApplicationState remains in Accepted after recovery
[ https://issues.apache.org/jira/browse/YUNIKORN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278105#comment-17278105 ] Kinga Marton commented on YUNIKORN-513: --- So far I have found an issue that is causing some trouble during recovery: YUNIKORN-538. I hope that fixing that one will solve the issue with the application states as well. > ApplicationState remains in Accepted after recovery > --- > > Key: YUNIKORN-513 > URL: https://issues.apache.org/jira/browse/YUNIKORN-513 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - cache >Affects Versions: 0.10 >Reporter: Kinga Marton >Assignee: Kinga Marton >Priority: Major > Labels: pull-request-available > Fix For: 0.10 > > > Steps to reproduce: > * Start 2 sleep jobs > * Wait for both to run and applicationState to be Running > * Kill yunikorn > * After 10 minutes, the rest call now shows both applicationState as > accepted instead of running
[jira] [Commented] (YUNIKORN-538) Fix node recovery
[ https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278183#comment-17278183 ] Kinga Marton commented on YUNIKORN-538: --- I think the order in which the services are started matters here. The issue seems to be caused by starting the apiFactory before the app managers. > Fix node recovery > - > > Key: YUNIKORN-538 > URL: https://issues.apache.org/jira/browse/YUNIKORN-538 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Kinga Marton >Priority: Major > > After the changes made in YUNIKORN-518 node recovery is broken. > Actually, the nodes are not recovered. The first commit when the issue is > observed is the following one: > [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
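The start-order problem described in the comment above can be illustrated with a minimal sketch. The assumption here is that the apiFactory replays events for already-existing cluster objects as soon as it starts, so any consumer registered afterwards misses them; the names (`ApiFactory`, `register`, `start`) are hypothetical, not YuniKorn's actual Go API:

```python
# Sketch: a source that replays existing objects to its listeners on start.
# If the listeners (the "app managers") register after start, the recovery
# events are silently lost -- which is the suspected bug.

class ApiFactory:
    """Replays the cluster's pre-existing objects to listeners when started."""
    def __init__(self):
        self.listeners = []

    def register(self, listener):
        self.listeners.append(listener)

    def start(self, existing_nodes):
        # Events for pre-existing nodes fire immediately on start.
        for node in existing_nodes:
            for listener in self.listeners:
                listener(node)

recovered = []

# Broken order: start the factory before the listener is registered.
factory = ApiFactory()
factory.start(["node-1", "node-2"])
factory.register(recovered.append)
print(recovered)  # [] -- the recovery events were lost

# Correct order: register first, then start.
factory2 = ApiFactory()
factory2.register(recovered.append)
factory2.start(["node-1", "node-2"])
print(recovered)  # ['node-1', 'node-2']
```

This mirrors the general informer pattern in Kubernetes clients: event handlers must be wired up before the informer/cache sync begins, otherwise the initial "add" events for existing objects never reach them.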
[jira] [Commented] (YUNIKORN-538) Scheduler is unable to recovery from a restart
[ https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278676#comment-17278676 ] Kinga Marton commented on YUNIKORN-538: --- Thanks [~wwei]! I tested your changes and they solve the recovery issue: both the nodes and the applications are recovered as expected. The applications also have the correct state after recovery. > Scheduler is unable to recovery from a restart > --- > > Key: YUNIKORN-538 > URL: https://issues.apache.org/jira/browse/YUNIKORN-538 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Kinga Marton >Assignee: Weiwei Yang >Priority: Blocker > Labels: pull-request-available > > After the changes made in YUNIKORN-518 node recovery is broken. > Actually, the nodes are not recovered. The first commit when the issue is > observed is the following one: > [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
[jira] [Comment Edited] (YUNIKORN-538) Scheduler is unable to recovery from a restart
[ https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278676#comment-17278676 ] Kinga Marton edited comment on YUNIKORN-538 at 2/4/21, 9:02 AM: Thanks [~wwei]! I tested your changes and they solve the recovery issue: both the nodes and the applications are recovered as expected. The applications also have the correct state after recovery, *but* the queue name is not filled. was (Author: kmarton): Thanks [~wwei]! I tested your changes and it solves the recovery issue: both the nodes and the applications are recovered as expected. Also the applications will have the correct state after recovery. > Scheduler is unable to recovery from a restart > --- > > Key: YUNIKORN-538 > URL: https://issues.apache.org/jira/browse/YUNIKORN-538 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Kinga Marton >Assignee: Weiwei Yang >Priority: Blocker > Labels: pull-request-available > > After the changes made in YUNIKORN-518 node recovery is broken. > Actually, the nodes are not recovered. The first commit when the issue is > observed is the following one: > [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
[jira] [Comment Edited] (YUNIKORN-538) Scheduler is unable to recovery from a restart
[ https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278676#comment-17278676 ] Kinga Marton edited comment on YUNIKORN-538 at 2/4/21, 9:02 AM: Thanks [~wwei]! I tested your changes and they solve the recovery issue: both the nodes and the applications are recovered as expected. The applications also have the correct state after recovery, *but* the queue name in the allocation is not filled. was (Author: kmarton): Thanks [~wwei]! I tested your changes and it solves the recovery issue: both the nodes and the applications are recovered as expected. Also the applications will have the correct state after recovery, *but* the queue name is not filled. > Scheduler is unable to recovery from a restart > --- > > Key: YUNIKORN-538 > URL: https://issues.apache.org/jira/browse/YUNIKORN-538 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Kinga Marton >Assignee: Weiwei Yang >Priority: Blocker > Labels: pull-request-available > > After the changes made in YUNIKORN-518 node recovery is broken. > Actually, the nodes are not recovered. The first commit when the issue is > observed is the following one: > [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
[jira] [Commented] (YUNIKORN-521) Placeholder pods are not cleaned up even when the job is deleted
[ https://issues.apache.org/jira/browse/YUNIKORN-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278785#comment-17278785 ] Kinga Marton commented on YUNIKORN-521: --- I tested this scenario today and I can see multiple issues here: * if the queue quota is lower than the total gang resource, the application should be rejected instead of being accepted with only a part of the placeholders scheduled. This issue will be addressed in YUNIKORN-520. * I hit a deadlock while debugging: both when removing an allocation ask and when recovering an allocation. We will need to fix this as well. cc [~wwei], [~wilfreds] > Placeholder pods are not cleaned up even when the job is deleted > > > Key: YUNIKORN-521 > URL: https://issues.apache.org/jira/browse/YUNIKORN-521 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Ayub Pathan >Assignee: Kinga Marton >Priority: Major > Attachments: job.yaml, ns.yaml > > > This one is a negative test... > * Create a namespace with quota > * Submit a job where the placeholder pods resource requests are more than > queue quota. > * Delete the job using kubectl > * Still the placeholder pods are in running state occupying the resources. > From an end user perspective, each job is an application consisting of all > related pods. If the user decides to purge the job, Yunikorn should also > recognize this action and clean up the placeholder pods. > From a yunikorn point of view, the application and job are 2 different > entities. The placeholder pods are not cleaned up because the application is > still alive even though the job is deleted. Does it make sense to create a > one on one mapping for job and application? Once the lifecycle of job is > complete, application should also terminate in Yunikorn world. Let me know > your thoughts.
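The deadlock mentioned in the comment above (hit both when removing an allocation ask and when recovering an allocation) is typically caused by two code paths taking the same pair of locks in opposite order. A minimal sketch of the safe pattern, with hypothetical lock and function names standing in for e.g. an application lock and a queue lock:

```python
import threading

# Hypothetical locks standing in for an application lock and a queue lock.
app_lock = threading.Lock()
queue_lock = threading.Lock()

def remove_allocation_ask():
    # Always take app_lock before queue_lock ...
    with app_lock:
        with queue_lock:
            pass  # mutate both structures safely here

def recover_allocation():
    # ... and use the SAME order on every other path. If this path took
    # queue_lock first, two concurrent calls could each hold one lock and
    # wait forever for the other -- the classic ABBA deadlock.
    with app_lock:
        with queue_lock:
            pass

t1 = threading.Thread(target=remove_allocation_ask)
t2 = threading.Thread(target=recover_allocation)
t1.start(); t2.start()
t1.join(); t2.join()
print("no deadlock")
```

Enforcing one global lock-acquisition order (or narrowing critical sections so only one lock is held at a time) is the standard fix; the Go race detector and mutex profiling can help locate the actual offending paths in the scheduler.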
[jira] [Comment Edited] (YUNIKORN-521) Placeholder pods are not cleaned up even when the job is deleted
[ https://issues.apache.org/jira/browse/YUNIKORN-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278785#comment-17278785 ] Kinga Marton edited comment on YUNIKORN-521 at 2/4/21, 11:50 AM: - I tested this scenario today and I can see multiple issues here: * if the queue quota is lower than the total gang resource, the application should be rejected instead of being accepted with only a part of the placeholders scheduled. This issue will be addressed in YUNIKORN-520. * I hit a deadlock while debugging: when removing an allocation ask (after manually deleting the placeholder) and also when recovering an allocation. We will need to fix this as well. cc [~wwei], [~wilfreds] was (Author: kmarton): I tested this scenario today and I can see multiple issues here: * if the queue quota is lower than the total gang resource, the application should be rejected instead of accepting and scheduling only a part of the placeholders. This issue will be addressed in YUNIKORN-520 * I faced some deadlock while debugging it: when removing an allocation ask and also when we recover an allocation. We will need to fix this issue as well. cc [~wwei], [~wilfreds] > Placeholder pods are not cleaned up even when the job is deleted > > > Key: YUNIKORN-521 > URL: https://issues.apache.org/jira/browse/YUNIKORN-521 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Ayub Pathan >Assignee: Kinga Marton >Priority: Major > Attachments: job.yaml, ns.yaml > > > This one is a negative test... > * Create a namespace with quota > * Submit a job where the placeholder pods resource requests are more than > queue quota. > * Delete the job using kubectl > * Still the placeholder pods are in running state occupying the resources. > From an end user perspective, each job is an application consisting of all > related pods. > If the user decides to purge the job, Yunikorn should also > recognize this action and clean up the placeholder pods. > From a yunikorn point of view, the application and job are 2 different > entities. The placeholder pods are not cleaned up because the application is > still alive even though the job is deleted. Does it make sense to create a > one on one mapping for job and application? Once the lifecycle of job is > complete, application should also terminate in Yunikorn world. Let me know > your thoughts.
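The up-front check suggested in the comments above (reject a gang application whose total placeholder ask exceeds the queue quota, rather than scheduling only part of the placeholders) can be sketched as follows. The data shapes and function names are illustrative assumptions, not YuniKorn's real structures:

```python
# Sketch: compute the total resource of all placeholder task groups and
# compare it against the queue quota before admitting the application.

def total_gang_resource(task_groups):
    """Sum each resource type over all placeholder task groups."""
    total = {}
    for tg in task_groups:
        for res, amount in tg["resource"].items():
            # A group of N placeholders asks for N * per-pod resource.
            total[res] = total.get(res, 0) + amount * tg["count"]
    return total

def accept_application(task_groups, queue_quota):
    """Admit only if every resource dimension fits within the quota."""
    total = total_gang_resource(task_groups)
    return all(total[res] <= queue_quota.get(res, 0) for res in total)

quota = {"memory": 4096, "vcore": 4}
gang = [{"count": 8, "resource": {"memory": 1024, "vcore": 1}}]  # asks 8192 MB
print(accept_application(gang, quota))  # False -> reject up front
```

Rejecting at submission time avoids the half-scheduled placeholder state the test exposed, where partially placed placeholders occupy quota that the real pods can never use.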