[jira] [Commented] (YUNIKORN-400) Add doc about the helm chart install options

2020-09-21 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199253#comment-17199253
 ] 

Kinga Marton commented on YUNIKORN-400:
---

I think if we want to add some documentation, the information from the chart 
readme is enough, so I would add that information to the docs as well. [~wwei], 
is that enough, or did you have something else in mind?

[~wilfreds], I will check the maintainers in the index.yaml file.
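For reference, the chart README essentially documents which values can be 
overridden at install time; a minimal, purely illustrative override file 
(the key names are placeholders, the chart's values.yaml and README are 
authoritative) passed with -f on helm install might look like:

{noformat}
# my-values.yaml -- illustrative only
image:
  repository: apache/yunikorn
  tag: scheduler-latest
  pullPolicy: Always
serviceAccount: yunikorn-admin
resources:
  requests:
    cpu: 200m
    memory: 1Gi
{noformat}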

> Add doc about the helm chart install options
> 
>
> Key: YUNIKORN-400
> URL: https://issues.apache.org/jira/browse/YUNIKORN-400
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Kinga Marton
>Priority: Major
>
> Submit PR to the master branch of 
> https://github.com/apache/incubator-yunikorn-site.
> If you find anything in the README not clear enough and you have problems to 
> make changes, please let me know. Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Assigned] (YUNIKORN-335) Invalid config validation and config schema checks for unsupported config properties

2020-09-21 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-335:
-

Assignee: Kinga Marton

> Invalid config validation and config schema checks for unsupported config 
> properties
> 
>
> Key: YUNIKORN-335
> URL: https://issues.apache.org/jira/browse/YUNIKORN-335
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
>
> * Invalid configuration does not error out - accepts silently but does not 
> reload the config.
>  4 spaces & 2 spaces - we should error out if 2 spaces is the standard.  
> partitions:
> {noformat}
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> properties:
>application.sort.policy: stateaware{noformat}
>  * Any unsupported configuration should not be allowed in the policy map - 
> some kind of schema check is needed.
> {noformat}
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> sample: value1
> properties:
>   application.sort.policy: stateaware
>   sample: value2{noformat}
>  
> *Impact: the user thinks the config is loaded and consumed by YuniKorn but it is 
> actually not. So it is important to error out in case of any formatting issues.*
> [~wilfreds]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-335) Invalid config validation and config schema checks for unsupported config properties

2020-09-21 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199345#comment-17199345
 ] 

Kinga Marton commented on YUNIKORN-335:
---

[~ayubpathan] I added a semantic check to the validation. Regarding the spaces: I 
tested it, and the config is accepted with a mixed number of spaces, so I don't 
think we should be so strict as to error out over some extra spaces as long as it 
is valid YAML.
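As a rough illustration of the idea behind such a check (a minimal sketch with 
stand-in struct fields, not YuniKorn's actual validation code): a strict YAML 
unmarshal rejects unknown keys while still accepting any consistent indentation.

{code}
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

// Stand-in config structs for illustration only; the real types live in the
// core configuration package.
type queueConfig struct {
	Name       string            `yaml:"name"`
	SubmitACL  string            `yaml:"submitacl"`
	Properties map[string]string `yaml:"properties"`
	Queues     []queueConfig     `yaml:"queues"`
}

type partitionConfig struct {
	Name   string        `yaml:"name"`
	Queues []queueConfig `yaml:"queues"`
}

type schedulerConfig struct {
	Partitions []partitionConfig `yaml:"partitions"`
}

func main() {
	data := []byte(`
partitions:
  - name: default
    queues:
        - name: root          # extra indentation: still valid YAML
          submitacl: '*'
          sample: value1      # unknown key: rejected by the strict unmarshal
`)
	var conf schedulerConfig
	// UnmarshalStrict errors out on keys that do not map to any struct field,
	// which gives the schema-style check; plain Unmarshal would drop them silently.
	if err := yaml.UnmarshalStrict(data, &conf); err != nil {
		fmt.Println("config rejected:", err)
		return
	}
	fmt.Printf("config accepted: %+v\n", conf)
}
{code}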

> Invalid config validation and config schema checks for unsupported config 
> properties
> 
>
> Key: YUNIKORN-335
> URL: https://issues.apache.org/jira/browse/YUNIKORN-335
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
>
> * Invalid configuration does not error out - accepts silently but does not 
> reload the config.
>  4 spaces & 2 spaces - we should error out if 2 spaces is the standard.  
> partitions:
> {noformat}
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> properties:
>application.sort.policy: stateaware{noformat}
>  * Any unsupported configuration should not be allowed in the policy map - 
> some kind of schema check is needed.
> {noformat}
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> sample: value1
> properties:
>   application.sort.policy: stateaware
>   sample: value2{noformat}
>  
> *Impact: the user thinks the config is loaded and consumed by YuniKorn but it is 
> actually not. So it is important to error out in case of any formatting issues.*
> [~wilfreds]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-335) Invalid config validation and config schema checks for unsupported config properties

2020-09-21 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199346#comment-17199346
 ] 

Kinga Marton commented on YUNIKORN-335:
---

Also, the YAML specification says that the amount of indentation is only a 
presentation detail: [https://yaml.org/spec/1.2/spec.html#id2777534]
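As a small illustration, both fragments below are valid YAML and decode to the 
same structure, even though the second one indents the sequence and the nested 
map more deeply:

{noformat}
queues:
  - name: root
    submitacl: '*'
    properties:
      application.sort.policy: stateaware

# --- is equivalent to ---

queues:
    - name: root
      submitacl: '*'
      properties:
          application.sort.policy: stateaware
{noformat}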

 

> Invalid config validation and config schema checks for unsupported config 
> properties
> 
>
> Key: YUNIKORN-335
> URL: https://issues.apache.org/jira/browse/YUNIKORN-335
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
>
> * Invalid configuration does not error out - accepts silently but does not 
> reload the config.
>  4 spaces & 2 spaces - we should error out if 2 spaces is the standard.  
> partitions:
> {noformat}
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> properties:
>application.sort.policy: stateaware{noformat}
>  * Any unsupported configuration should not be allowed in the policy map - 
> some kind of schema check is needed.
> {noformat}
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> sample: value1
> properties:
>   application.sort.policy: stateaware
>   sample: value2{noformat}
>  
> *Impact: the user thinks the config is loaded and consumed by YuniKorn but it is 
> actually not. So it is important to error out in case of any formatting issues.*
> [~wilfreds]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Assigned] (YUNIKORN-351) Support validate conf to reject the config with 2 top level queues

2020-09-22 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-351:
-

Assignee: Kinga Marton

> Support validate conf to reject the config with 2 top level queues
> --
>
> Key: YUNIKORN-351
> URL: https://issues.apache.org/jira/browse/YUNIKORN-351
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Critical
>
> admission controller should reject the config with 2 top level queues.. 
> something like below.
> {noformat}
> queues.yaml:
> 
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
>   - name: queue1
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 2024
> cpu: 2000
> {noformat}
> YK reads this config and creates the queue1 as child to the first top level 
> queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-351) Support validate conf to reject the config with 2 top level queues

2020-09-22 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199926#comment-17199926
 ] 

Kinga Marton commented on YUNIKORN-351:
---

[~wilfreds] pointed out that this is the expected and documented behaviour:

[http://yunikorn.apache.org/docs/next/user_guide/queue_config]
{code:java}
It can have a root queue defined but it is not a required element. If the root 
queue is not defined the configuration parsing will insert the root queue for 
consistency. The insertion of the root queue is triggered by:
* If the configuration has more than one queue defined at the top level a root 
queue is inserted.
* If there is only one queue defined at the top level and it is not called root 
a root queue is inserted.{code}
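In other words, based on the behaviour described in this issue, the config above 
with two top-level queues is effectively read as the following explicit nesting 
(an illustrative sketch, not the literal parser output):

{noformat}
partitions:
  - name: default
    queues:
      - name: root
        submitacl: '*'
        queues:
          - name: queue1
            resources:
              guaranteed:
                memory: 300
                cpu: 300
              max:
                memory: 2024
                cpu: 2000
{noformat}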

> Support validate conf to reject the config with 2 top level queues
> --
>
> Key: YUNIKORN-351
> URL: https://issues.apache.org/jira/browse/YUNIKORN-351
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Critical
>
> admission controller should reject the config with 2 top level queues.. 
> something like below.
> {noformat}
> queues.yaml:
> 
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
>   - name: queue1
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 2024
> cpu: 2000
> {noformat}
> YK reads this config and creates the queue1 as child to the first top level 
> queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-351) Support validate conf to reject the config with 2 top level queues

2020-09-22 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-351.
---
Resolution: Not A Bug

> Support validate conf to reject the config with 2 top level queues
> --
>
> Key: YUNIKORN-351
> URL: https://issues.apache.org/jira/browse/YUNIKORN-351
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Critical
>
> admission controller should reject the config with 2 top level queues.. 
> something like below.
> {noformat}
> queues.yaml:
> 
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
>   - name: queue1
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 2024
> cpu: 2000
> {noformat}
> YK reads this config and creates the queue1 as child to the first top level 
> queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-419) Add helm packaging to the release tool

2020-09-22 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-419.
---
Fix Version/s: 0.10
   Resolution: Fixed

Thank you for fixing this! The PR is merged

> Add helm packaging to the release tool
> --
>
> Key: YUNIKORN-419
> URL: https://issues.apache.org/jira/browse/YUNIKORN-419
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: release
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> The current tool does not package the helm charts at the same time as we 
> generate the rest of the release artefacts.
> Details around signing etc are not documented either.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-315) use resources.DAOString in handler instead of trim

2020-09-22 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-315.
---
Fix Version/s: 0.10
   Resolution: Fixed

> use resources.DAOString in handler instead of trim
> --
>
> Key: YUNIKORN-315
> URL: https://issues.apache.org/jira/browse/YUNIKORN-315
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Ankit Kumar
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 0.10
>
>
> In the resource object we have a special function to write the resources in a 
> format that is used for the web responses.
> The handler code does not use this call and uses a local trim over a String 
> call.
> For readability and consistency we should move all web service code that 
> returns resources to use the DAOString() call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-315) use resources.DAOString in handler instead of trim

2020-09-22 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200122#comment-17200122
 ] 

Kinga Marton commented on YUNIKORN-315:
---

Thank you [~akumar] for working on this. I merged your PR.

> use resources.DAOString in handler instead of trim
> --
>
> Key: YUNIKORN-315
> URL: https://issues.apache.org/jira/browse/YUNIKORN-315
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Ankit Kumar
>Priority: Major
>  Labels: newbie, pull-request-available
>
> In the resource object we have a special function to write the resources in a 
> format that is used for the web responses.
> The handler code does not use this call and uses a local trim over a String 
> call.
> For readability and consistency we should move all web service code that 
> returns resources to use the DAOString() call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-326) Add rest API to retrieve cluster nodes resource utilization

2020-09-29 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203844#comment-17203844
 ] 

Kinga Marton commented on YUNIKORN-326:
---

[~Huang Ting Yao] can you please check if we can add the new endpoint according 
to the new design of the API: 
[https://docs.google.com/document/d/1KgoGqmNGR7TK3yBeqmefiFso2_lQujA_TzlsU0SsG1o/edit?usp=sharing]

> Add rest API to retrieve cluster nodes resource utilization
> ---
>
> Key: YUNIKORN-326
> URL: https://issues.apache.org/jira/browse/YUNIKORN-326
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>  Labels: pull-request-available
>
> URL: ws/v1/nodes/utilization
> returns the nodes resource utilization summary, a distribution based on usage:
> {code}
> {
>   type: "CPU",
>   utilization: [ {
>   bucketID: "1",
>   bucketName: "0-10%",
>   numOfNodes: 5,
>   nodeNames: [...]
> }, {
>   bucketID: "2",
>   bucketName: "10-20%",
>   numOfNodes: 5,
>   nodeNames: [...]
> },
> ...
>   ]
> },
> {
>   type: "Memory",
>   utilization: [ {
>   bucketID: "1",
>   bucketName: "0-10%",
>   numOfNodes: 5,
>   nodeNames: [...]
> }, {
>   bucketID: "2",
>   bucketName: "10-20%",
>   numOfNodes: 5,
>   nodeNames: [...]
> },
> ...
>   ]
> },
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-324) Add rest API to retrieve cluster resource utilization

2020-09-29 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203843#comment-17203843
 ] 

Kinga Marton commented on YUNIKORN-324:
---

[~Huang Ting Yao] can you please check if we can add the new endpoint according 
to the new design of the API: 
[https://docs.google.com/document/d/1KgoGqmNGR7TK3yBeqmefiFso2_lQujA_TzlsU0SsG1o/edit?usp=sharing]

 

> Add rest API to retrieve cluster resource utilization
> -
>
> Key: YUNIKORN-324
> URL: https://issues.apache.org/jira/browse/YUNIKORN-324
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>  Labels: pull-request-available
>
> URL: ws/v1/clusters/utilization
> this should be something like the following (per-partition):
> {code}
> [
> {
>  partition: default,
>  utilization: [ {
> type: "cpu",
> total: 100,
> used: 50,
> usage: 50%
>   },
>   {
>  type: "memory",
>  total: 1000,
>  used: 500,
>  usage: 50%
>   }
>  ]
> }, 
> ...
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-324) Add rest API to retrieve cluster resource utilization

2020-09-30 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204687#comment-17204687
 ] 

Kinga Marton commented on YUNIKORN-324:
---

[~maniraj...@gmail.com] for this issue and also YUNIKORN-326 there is already 
an open PR, but it aligns with the current design of the REST API. I am 
wondering what your progress is with the refactoring of the API, and whether we 
can do these issues first in accordance with the new design. What do you think? cc 
[~wwei]

> Add rest API to retrieve cluster resource utilization
> -
>
> Key: YUNIKORN-324
> URL: https://issues.apache.org/jira/browse/YUNIKORN-324
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>  Labels: pull-request-available
>
> URL: ws/v1/clusters/utilization
> this should be something like the following (per-partition):
> {code}
> [
> {
>  partition: default,
>  utilization: [ {
> type: "cpu",
> total: 100,
> used: 50,
> usage: 50%
>   },
>   {
>  type: "memory",
>  total: 1000,
>  used: 500,
>  usage: 50%
>   }
>  ]
> }, 
> ...
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-430) WARN and ERROR level log entries should not show a stack trace

2020-09-30 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204730#comment-17204730
 ] 

Kinga Marton commented on YUNIKORN-430:
---

I think it is useful to have the stack trace for error messages. For WARN 
messages we can think about removing it, but for ERROR I would keep it.
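If it helps the discussion: YuniKorn logs through zap, and (as a minimal sketch, 
not the project's actual logger setup) the stack trace level can be limited to 
ERROR and above with a single option when building the logger:

{code}
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	cfg := zap.NewProductionConfig()
	// zap.AddStacktrace controls the minimum level that gets a stack trace
	// attached: with ErrorLevel, WARN entries stay on one line while ERROR
	// entries keep the trace.
	logger, err := cfg.Build(zap.AddStacktrace(zapcore.ErrorLevel))
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	logger.Warn("no stack trace attached to this entry")
	logger.Error("stack trace attached to this entry")
}
{code}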

> WARN and ERROR level log entries should not show a stack trace
> --
>
> Key: YUNIKORN-430
> URL: https://issues.apache.org/jira/browse/YUNIKORN-430
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - common, shim - kubernetes
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Minor
>
> Currently every WARN or ERROR level entry in the logs shows a stack trace for 
> the log entry. I haven't seen similar behaviour in other projects, and I 
> think this is too verbose.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-335) Invalid config validation and config schema checks for unsupported config properties

2020-09-30 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204831#comment-17204831
 ] 

Kinga Marton commented on YUNIKORN-335:
---

[~wwei] I added support for upper-case letters in the queue name validation, so 
queue names are no longer restricted to lower-case letters.

Regarding the multiple top-level queue issue, we agreed that this is actually the 
expected, documented behaviour.
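For context, a toy version of what such a name check can look like (the exact 
character set and length limit here are assumptions for illustration, not the 
rule that was committed):

{code}
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical queue-name rule for illustration only: letters (now including
// upper case), digits, '-' and '_', with a length limit.
var queueNameRegExp = regexp.MustCompile(`^[a-zA-Z0-9_-]{1,64}$`)

func main() {
	for _, name := range []string{"root", "Default", "bad queue!"} {
		fmt.Printf("%-10s valid=%v\n", name, queueNameRegExp.MatchString(name))
	}
}
{code}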

 

> Invalid config validation and config schema checks for unsupported config 
> properties
> 
>
> Key: YUNIKORN-335
> URL: https://issues.apache.org/jira/browse/YUNIKORN-335
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
>
> * Invalid configuration does not error out - accepts silently but does not 
> reload the config.
>  4 spaces & 2 spaces - we should error out if 2 spaces is the standard.  
> partitions:
> {noformat}
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> properties:
>application.sort.policy: stateaware{noformat}
>  * Any unsupported configuration should not be allowed in the policy map - 
> some kind of schema check is needed.
> {noformat}
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> sample: value1
> properties:
>   application.sort.policy: stateaware
>   sample: value2{noformat}
>  
> *Impact: the user thinks the config is loaded and consumed by YuniKorn but it is 
> actually not. So it is important to error out in case of any formatting issues.*
> [~wilfreds]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-266) Delete related pods when deleting an app CRD

2020-10-01 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205389#comment-17205389
 ] 

Kinga Marton commented on YUNIKORN-266:
---

When thinking about this we should consider the following scenario as well: if 
there is an application with all *tasks completed* and we delete the CRD, the 
application in the scheduler will *NOT* be deleted because of the still-assigned 
pods. I think that if the pods are in a terminated state we can delete them.
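Purely as an illustration of that idea (not the shim's code; the applicationId 
label key is an assumption), the cleanup could look roughly like listing the 
application's pods and deleting only the terminated ones:

{code}
package shim

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupFinishedPods removes the already-terminated pods of an application so
// that the application can also be removed from the scheduler (sketch only).
func cleanupFinishedPods(client kubernetes.Interface, namespace, appID string) error {
	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(),
		metav1.ListOptions{LabelSelector: "applicationId=" + appID})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// only touch pods that are already terminated
		if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
			err := client.CoreV1().Pods(namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
			if err != nil {
				return err
			}
		}
	}
	return nil
}
{code}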

> Delete related pods when deleting an app CRD
> 
>
> Key: YUNIKORN-266
> URL: https://issues.apache.org/jira/browse/YUNIKORN-266
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Kinga Marton
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-431) Submitting pods with non-existing queue name

2020-10-01 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-431:
-

 Summary: Submitting pods with non-existing queue name
 Key: YUNIKORN-431
 URL: https://issues.apache.org/jira/browse/YUNIKORN-431
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Kinga Marton


When we submit the CRD with, let's say, queueName=root.default, and then submit 
the pods linked to this application with queueName=root.xyz, where xyz is not 
an existing queue, the pod will be placed into the queue root.default defined in 
the CRD, which is not the right approach.
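To make the scenario concrete, a sketch of such a pod (the label keys and names 
are illustrative, not the authoritative spec); the application CRD referenced 
here was submitted with queueName=root.default:

{noformat}
apiVersion: v1
kind: Pod
metadata:
  name: example-app-task-1
  labels:
    applicationId: example-app   # ties the pod to the application CRD
    queue: root.xyz              # non-existing queue; today the pod still lands in root.default
spec:
  containers:
    - name: sleep
      image: busybox
      command: ["sleep", "3600"]
{noformat}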

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-433) Extend configwatcher expiration time when a new request comes in

2020-10-02 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-433:
-

 Summary: Extend configwatcher expiration time when a new request 
comes in
 Key: YUNIKORN-433
 URL: https://issues.apache.org/jira/browse/YUNIKORN-433
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Kinga Marton
Assignee: Kinga Marton


When two configuration reloads are triggered close to each other, it might 
happen that the watcher times out before the update is available, because it is 
already running (it was triggered by the first update/configmap creation). When 
the update triggers it again, the expiration time is not modified. Everything 
comes down to timing: if you wait with the update until the first triggered 
configwatcher times out, the changes will be synced. It also works if you are 
quick enough with the update and the changes take effect before the expiration 
time.

To avoid this kind of issue with config changes we need to make 2 changes:
 * increase the timeout for the configwatcher
 * restart the configwatcher timer when the configwatcher is triggered and 
there is already one running (see the sketch below).
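A rough sketch of the second point (the names and structure are illustrative, 
not the actual ConfigWatcher code): keep a single timer and push its expiration 
out whenever the watcher is triggered while one is already running.

{code}
package main

import (
	"fmt"
	"sync"
	"time"
)

// watcher is a toy stand-in for the config watcher.
type watcher struct {
	mu      sync.Mutex
	timer   *time.Timer
	timeout time.Duration
}

// trigger starts the expiration timer, or pushes it out again if the watcher
// is already running, so a second update does not race the first expiry.
func (w *watcher) trigger() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.timer == nil {
		w.timer = time.AfterFunc(w.timeout, func() {
			fmt.Println("config watcher expired")
		})
		return
	}
	// Stop + Reset restarts the countdown from now.
	w.timer.Stop()
	w.timer.Reset(w.timeout)
}

func main() {
	w := &watcher{timeout: 2 * time.Minute}
	w.trigger() // first configmap creation
	w.trigger() // second update shortly after: expiration is extended
}
{code}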



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-433) Extend configwatcher expiration time when a new request comes in

2020-10-02 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206186#comment-17206186
 ] 

Kinga Marton commented on YUNIKORN-433:
---

The K8s documentation says the following about the delay 
([https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap]):
{quote}As a result, the total delay from the moment when the ConfigMap is 
updated to the moment when new keys are projected to the pod can be as long as 
kubelet sync period (1 minute by default) + ttl of ConfigMaps cache (1 minute 
by default) in kubelet. 
{quote}
According to this, I think we should set the timeout to at least 2 minutes. 
Even with this change I think we cannot guarantee that the updated 
configuration will take effect in the scheduler every time.

> Extend configwatcher expiration time when a new request comes in
> 
>
> Key: YUNIKORN-433
> URL: https://issues.apache.org/jira/browse/YUNIKORN-433
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> When two configuration reloads are triggered close to each other, it might 
> happen that the watcher times out before the update is available, because it 
> is already running (it was triggered by the first update/configmap creation). 
> When the update triggers it again, the expiration time is not modified. 
> Everything comes down to timing: if you wait with the update until the first 
> triggered configwatcher times out, the changes will be synced. It also works 
> if you are quick enough with the update and the changes take effect before 
> the expiration time.
> To avoid this kind of issue with config changes we need to make 2 changes:
>  * increase the timeout for the configwatcher
>  * restart the configwatcher timer when the configwatcher is triggered and 
> there is already one running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-433) Extend configwatcher expiration time in case of a new update

2020-10-02 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton updated YUNIKORN-433:
--
Summary: Extend configwatcher expiration time in case of a new update  
(was: Extend configwatcher expiration time when a new request comes in)

> Extend configwatcher expiration time in case of a new update
> 
>
> Key: YUNIKORN-433
> URL: https://issues.apache.org/jira/browse/YUNIKORN-433
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> When two configuration reloads are triggered close to each other, it might 
> happen that the watcher times out before the update is available, because it 
> is already running (it was triggered by the first update/configmap creation). 
> When the update triggers it again, the expiration time is not modified. 
> Everything comes down to timing: if you wait with the update until the first 
> triggered configwatcher times out, the changes will be synced. It also works 
> if you are quick enough with the update and the changes take effect before 
> the expiration time.
> To avoid this kind of issue with config changes we need to make 2 changes:
>  * increase the timeout for the configwatcher
>  * restart the configwatcher timer when the configwatcher is triggered and 
> there is already one running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-436) serviceAccount is hardcoded in rbac.yaml

2020-10-05 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-436.
---
Fix Version/s: 0.10
   Resolution: Fixed

> serviceAccount is hardcoded in rbac.yaml 
> -
>
> Key: YUNIKORN-436
> URL: https://issues.apache.org/jira/browse/YUNIKORN-436
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: deployment
>Reporter: Vishwas
>Assignee: Vishwas
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> serviceAccountName is exposed in the values.yaml but the change in value is 
> not reflected when service account is created.
> In rbac.yaml, the serviceAccountName is hardcoded to yunikorn-admin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-436) serviceAccount is hardcoded in rbac.yaml

2020-10-05 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207936#comment-17207936
 ] 

Kinga Marton commented on YUNIKORN-436:
---

Thank you [~vbm] for fixing it. I committed your changes to master.
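For readers following the thread, the general shape of the fix is to template 
the account name from values.yaml instead of hardcoding it in rbac.yaml (a 
sketch, not the literal PR; the value key shown is an assumption):

{code}
# rbac.yaml (sketch): bind the role to the configurable service account
subjects:
  - kind: ServiceAccount
    # assumed value key; the chart's values.yaml defines the real name
    name: {{ .Values.serviceAccount }}
    namespace: {{ .Release.Namespace }}
{code}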

> serviceAccount is hardcoded in rbac.yaml 
> -
>
> Key: YUNIKORN-436
> URL: https://issues.apache.org/jira/browse/YUNIKORN-436
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: deployment
>Reporter: Vishwas
>Assignee: Vishwas
>Priority: Major
>  Labels: pull-request-available
>
> serviceAccountName is exposed in the values.yaml but the change in value is 
> not reflected when service account is created.
> In rbac.yaml, the serviceAccountName is hardcoded to yunikorn-admin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-435) Admission-Controller pod goes into pending state because of default serviceAccount

2020-10-05 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208023#comment-17208023
 ] 

Kinga Marton commented on YUNIKORN-435:
---

[~vbm] can you please add some context to this failure? (E.g. how you installed 
it, versions, anything that can be relevant for reproducing this bug.)

I installed YK multiple times and I haven't seen this issue.

> Admission-Controller pod goes into pending state because of default 
> serviceAccount
> --
>
> Key: YUNIKORN-435
> URL: https://issues.apache.org/jira/browse/YUNIKORN-435
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: deployment, shim - kubernetes
>Reporter: Vishwas
>Assignee: Vishwas
>Priority: Major
>  Labels: pull-request-available
>
> The admission controller pod which is created inside the scheduler pod uses 
> the wrong service account.
> The admission controller pod is launched with default service account. This 
> causes the admission controller pod to be in pending state because of 
> insufficient privileges.
>  
> Error message indicating pod in pending state:
> {code:java}
> NAMEREADY   UP-TO-DATE   
> AVAILABLE   AGE
> deployment.apps/yunikorn-admission-controller   0/1 00
>8m14s
> deployment.apps/yunikorn-scheduler  1/1 11
>8m20sNAME   DESIRED   
> CURRENT   READY   AGE
> replicaset.apps/yunikorn-admission-controller-854f64bcbf   1 0
>  0   8m14s
> replicaset.apps/yunikorn-scheduler-585fcfbb46  1 1
>  1   8m20s
> {code}
> {code:java}
> [root@vm5 vbm]# kubectl describe 
> replicaset.apps/yunikorn-admission-controller-854f64bcbf -n yunikorn
> Name:   yunikorn-admission-controller-854f64bcbf
> Namespace:  yunikorn
> Selector:   app=yunikorn,pod-template-hash=854f64bcbf
> Labels: app=yunikorn
> pod-template-hash=854f64bcbf
> Annotations:deployment.kubernetes.io/desired-replicas: 1
> deployment.kubernetes.io/max-replicas: 2
> deployment.kubernetes.io/revision: 1
> Controlled By:  Deployment/yunikorn-admission-controller
> Events:
>   Type ReasonAge From   Message
>    --   ---
>   Warning  FailedCreate  19s (x13 over 40s)  replicaset-controller  Error 
> creating: pods "yunikorn-admission-controller-854f64bcbf-" is forbidden: 
> unable to validate against any pod security policy: 
> [spec.securityContext.hostNetwork: Invalid value: true: Host network is not 
> allowed to be used spec.containers[0].hostPort: Invalid value: 8443: Host 
> port 8443 is not allowed to be used. Allowed ports: []]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Assigned] (YUNIKORN-352) when child queue capacity greater than parent, the configmap update is rejected but not notified to end user

2020-10-05 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-352:
-

Assignee: Kinga Marton

> when child queue capacity greater than parent, the configmap update is 
> rejected but not notified to end user
> 
>
> Key: YUNIKORN-352
> URL: https://issues.apache.org/jira/browse/YUNIKORN-352
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Critical
>
> Create a nested static queue like below.
> {noformat}
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> queues:
>   - name: queue2
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 1000
> cpu: 1000
> queues:
>   - name: queue3
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 2000
> cpu: 2000
> {noformat}
> Validate the same through rest API /queues - queues3 is not even shown in the 
> response.
> {noformat}
> {
> "capacity": {
> "capacity": "map[attachable-volumes-aws-ebs:75 
> ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 
> pods:87 vcore:4875]",
> "usedcapacity": "0"
> },
> "nodes": null,
> "partitionName": "[mycluster]default",
> "queues": {
> "capacities": {
> "absusedcapacity": "[memory:0 vcore:2]",
> "capacity": "[]",
> "maxcapacity": "[attachable-volumes-aws-ebs:75 
> ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 
> pods:87 vcore:4875]",
> "usedcapacity": "[memory:1 vcore:110]"
> },
> "properties": {},
> "queuename": "root",
> "queues": [
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[]",
> "maxcapacity": "[]",
> "usedcapacity": "[memory:1]"
> },
> "properties": {},
> "queuename": "monitoring",
> "queues": null,
> "status": "Active"
> },
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[]",
> "maxcapacity": "[]",
> "usedcapacity": "[vcore:110]"
> },
> "properties": {},
> "queuename": "kube-system",
> "queues": null,
> "status": "Active"
> },
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[cpu:300 memory:300]",
> "maxcapacity": "[cpu:1000 memory:1000]",
> "usedcapacity": "[]"
> },
> "properties": {},
> "queuename": "queue2",
> "queues": null,
> "status": "Active"
> }
> ],
> "status": "Active"
> }
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-324) Add rest API to retrieve cluster resource utilization

2020-10-05 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-324.
---
Fix Version/s: 0.10
   Resolution: Fixed

Thank you [~Huang Ting Yao] for working on this. Merged your PR.

> Add rest API to retrieve cluster resource utilization
> -
>
> Key: YUNIKORN-324
> URL: https://issues.apache.org/jira/browse/YUNIKORN-324
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> URL: ws/v1/clusters/utilization
> this should be something like the following (per-partition):
> {code}
> [
> {
>  partition: default,
>  utilization: [ {
> type: "cpu",
> total: 100,
> used: 50,
> usage: 50%
>   },
>   {
>  type: "memory",
>  total: 1000,
>  used: 500,
>  usage: 50%
>   }
>  ]
> }, 
> ...
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-326) Add rest API to retrieve cluster nodes resource utilization

2020-10-06 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208678#comment-17208678
 ] 

Kinga Marton commented on YUNIKORN-326:
---

[~Huang Ting Yao] I merged your PR. Before resolving this Jira can you please 
upload a sample REST output for this one as well?

> Add rest API to retrieve cluster nodes resource utilization
> ---
>
> Key: YUNIKORN-326
> URL: https://issues.apache.org/jira/browse/YUNIKORN-326
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>  Labels: pull-request-available
>
> URL: ws/v1/nodes/utilization
> returns the nodes resource utilization summary, a distribution based on usage:
> {code}
> {
>   type: "CPU",
>   utilization: [ {
>   bucketID: "1",
>   bucketName: "0-10%",
>   numOfNodes: 5,
>   nodeNames: [...]
> }, {
>   bucketID: "2",
>   bucketName: "10-20%",
>   numOfNodes: 5,
>   nodeNames: [...]
> },
> ...
>   ]
> },
> {
>   type: "Memory",
>   utilization: [ {
>   bucketID: "1",
>   bucketName: "0-10%",
>   numOfNodes: 5,
>   nodeNames: [...]
> }, {
>   bucketID: "2",
>   bucketName: "10-20%",
>   numOfNodes: 5,
>   nodeNames: [...]
> },
> ...
>   ]
> },
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-326) Add rest API to retrieve cluster nodes resource utilization

2020-10-06 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-326.
---
Fix Version/s: 0.10
   Resolution: Fixed

> Add rest API to retrieve cluster nodes resource utilization
> ---
>
> Key: YUNIKORN-326
> URL: https://issues.apache.org/jira/browse/YUNIKORN-326
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> URL: ws/v1/nodes/utilization
> returns the nodes resource utilization summary, a distribution based on usage:
> {code}
> {
>   type: "CPU",
>   utilization: [ {
>   bucketID: "1",
>   bucketName: "0-10%",
>   numOfNodes: 5,
>   nodeNames: [...]
> }, {
>   bucketID: "2",
>   bucketName: "10-20%",
>   numOfNodes: 5,
>   nodeNames: [...]
> },
> ...
>   ]
> },
> {
>   type: "Memory",
>   utilization: [ {
>   bucketID: "1",
>   bucketName: "0-10%",
>   numOfNodes: 5,
>   nodeNames: [...]
> }, {
>   bucketID: "2",
>   bucketName: "10-20%",
>   numOfNodes: 5,
>   nodeNames: [...]
> },
> ...
>   ]
> },
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-438) Move informers synch call at the beginning of scheduler start

2020-10-07 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-438:
-

 Summary: Move informers synch call at the beginning of scheduler 
start
 Key: YUNIKORN-438
 URL: https://issues.apache.org/jira/browse/YUNIKORN-438
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Kinga Marton
Assignee: Kinga Marton


We wait until the informers get synced up with the API server during recovery, 
but this is currently done in the node-recovery phase. Before this phase the 
application might already have been created, and asking k8s for namespace 
information might fail if the informers are not yet synced.

We should make sure that we wait for this sync before doing anything.
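A minimal sketch of the intended ordering (not YuniKorn's actual startup code; 
the informer factory setup here is generic client-go): start the informers and 
block on the cache sync before any recovery or scheduling work begins.

{code}
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	nsInformer := factory.Core().V1().Namespaces().Informer()

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	// Block until the caches are synced with the API server before doing
	// anything else (recovery, namespace lookups, scheduling).
	if !cache.WaitForCacheSync(stopCh, nsInformer.HasSynced) {
		panic("informer caches failed to sync")
	}
	fmt.Println("informers synced, safe to start scheduling")
}
{code}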



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-438) Move informers synch call to the beginning of scheduler start

2020-10-07 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton updated YUNIKORN-438:
--
Summary: Move informers synch call to the beginning of scheduler start  
(was: Move informers synch call at the beginning of scheduler start)

> Move informers synch call to the beginning of scheduler start
> -
>
> Key: YUNIKORN-438
> URL: https://issues.apache.org/jira/browse/YUNIKORN-438
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> We wait until the informers get synced up with the API server during recovery, 
> but this is currently done in the node-recovery phase. Before this phase the 
> application might already have been created, and asking k8s for namespace 
> information might fail if the informers are not yet synced. 
> We should make sure that we wait for this sync before doing anything.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-435) Admission-Controller pod goes into pending state because of default serviceAccount

2020-10-08 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-435.
---
Fix Version/s: 0.10
   Resolution: Fixed

Thank you [~vbm] for the clarification and for the fix.

Thanks [~adam.antal] for the validation.

I merged the PRs to the master branch.

> Admission-Controller pod goes into pending state because of default 
> serviceAccount
> --
>
> Key: YUNIKORN-435
> URL: https://issues.apache.org/jira/browse/YUNIKORN-435
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: deployment, shim - kubernetes
>Reporter: Vishwas
>Assignee: Vishwas
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> The admission controller pod which is created inside the scheduler pod uses 
> the wrong service account.
> The admission controller pod is launched with default service account. This 
> causes the admission controller pod to be in pending state because of 
> insufficient privileges.
>  
> Error message indicating pod in pending state:
> {code:java}
> NAMEREADY   UP-TO-DATE   
> AVAILABLE   AGE
> deployment.apps/yunikorn-admission-controller   0/1 00
>8m14s
> deployment.apps/yunikorn-scheduler  1/1 11
>8m20sNAME   DESIRED   
> CURRENT   READY   AGE
> replicaset.apps/yunikorn-admission-controller-854f64bcbf   1 0
>  0   8m14s
> replicaset.apps/yunikorn-scheduler-585fcfbb46  1 1
>  1   8m20s
> {code}
> {code:java}
> [root@vm5 vbm]# kubectl describe 
> replicaset.apps/yunikorn-admission-controller-854f64bcbf -n yunikorn
> Name:   yunikorn-admission-controller-854f64bcbf
> Namespace:  yunikorn
> Selector:   app=yunikorn,pod-template-hash=854f64bcbf
> Labels: app=yunikorn
> pod-template-hash=854f64bcbf
> Annotations:deployment.kubernetes.io/desired-replicas: 1
> deployment.kubernetes.io/max-replicas: 2
> deployment.kubernetes.io/revision: 1
> Controlled By:  Deployment/yunikorn-admission-controller
> Events:
>   Type ReasonAge From   Message
>    --   ---
>   Warning  FailedCreate  19s (x13 over 40s)  replicaset-controller  Error 
> creating: pods "yunikorn-admission-controller-854f64bcbf-" is forbidden: 
> unable to validate against any pod security policy: 
> [spec.securityContext.hostNetwork: Invalid value: true: Host network is not 
> allowed to be used spec.containers[0].hostPort: Invalid value: 8443: Host 
> port 8443 is not allowed to be used. Allowed ports: []]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-352) when child queue capacity greater than parent, the configmap update is rejected but not notified to end user

2020-10-09 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton updated YUNIKORN-352:
--
Priority: Major  (was: Critical)

> when child queue capacity greater than parent, the configmap update is 
> rejected but not notified to end user
> 
>
> Key: YUNIKORN-352
> URL: https://issues.apache.org/jira/browse/YUNIKORN-352
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
>
> Create a nested static queue like below.
> {noformat}
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> queues:
>   - name: queue2
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 1000
> cpu: 1000
> queues:
>   - name: queue3
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 2000
> cpu: 2000
> {noformat}
> Validate the same through rest API /queues - queues3 is not even shown in the 
> response.
> {noformat}
> {
> "capacity": {
> "capacity": "map[attachable-volumes-aws-ebs:75 
> ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 
> pods:87 vcore:4875]",
> "usedcapacity": "0"
> },
> "nodes": null,
> "partitionName": "[mycluster]default",
> "queues": {
> "capacities": {
> "absusedcapacity": "[memory:0 vcore:2]",
> "capacity": "[]",
> "maxcapacity": "[attachable-volumes-aws-ebs:75 
> ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 
> pods:87 vcore:4875]",
> "usedcapacity": "[memory:1 vcore:110]"
> },
> "properties": {},
> "queuename": "root",
> "queues": [
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[]",
> "maxcapacity": "[]",
> "usedcapacity": "[memory:1]"
> },
> "properties": {},
> "queuename": "monitoring",
> "queues": null,
> "status": "Active"
> },
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[]",
> "maxcapacity": "[]",
> "usedcapacity": "[vcore:110]"
> },
> "properties": {},
> "queuename": "kube-system",
> "queues": null,
> "status": "Active"
> },
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[cpu:300 memory:300]",
> "maxcapacity": "[cpu:1000 memory:1000]",
> "usedcapacity": "[]"
> },
> "properties": {},
> "queuename": "queue2",
> "queues": null,
> "status": "Active"
> }
> ],
> "status": "Active"
> }
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-438) Move informers synch call to the beginning of scheduler start

2020-10-12 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-438.
---
Fix Version/s: 0.10
   Resolution: Fixed

> Move informers synch call to the beginning of scheduler start
> -
>
> Key: YUNIKORN-438
> URL: https://issues.apache.org/jira/browse/YUNIKORN-438
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> We wait until the informers get synced up with the API server during recovery, 
> but this is currently done in the node-recovery phase. Before this phase the 
> application might already have been created, and asking k8s for namespace 
> information might fail if the informers are not yet synced. 
> We should make sure that we wait for this sync before doing anything.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Reopened] (YUNIKORN-352) when child queue capacity greater than parent, the configmap update is rejected but not notified to end user

2020-10-20 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reopened YUNIKORN-352:
---

> when child queue capacity greater than parent, the configmap update is 
> rejected but not notified to end user
> 
>
> Key: YUNIKORN-352
> URL: https://issues.apache.org/jira/browse/YUNIKORN-352
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> Create a nested static queue like below.
> {noformat}
> partitions:
>   -
> name: default
> placementrules:
>   - name: tag
> value: namespace
> create: true
> queues:
>   - name: root
> submitacl: '*'
> queues:
>   - name: queue2
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 1000
> cpu: 1000
> queues:
>   - name: queue3
> resources:
>   guaranteed:
> memory: 300
> cpu: 300
>   max:
> memory: 2000
> cpu: 2000
> {noformat}
> Validate the same through rest API /queues - queues3 is not even shown in the 
> response.
> {noformat}
> {
> "capacity": {
> "capacity": "map[attachable-volumes-aws-ebs:75 
> ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 
> pods:87 vcore:4875]",
> "usedcapacity": "0"
> },
> "nodes": null,
> "partitionName": "[mycluster]default",
> "queues": {
> "capacities": {
> "absusedcapacity": "[memory:0 vcore:2]",
> "capacity": "[]",
> "maxcapacity": "[attachable-volumes-aws-ebs:75 
> ephemeral-storage:94992122100 hugepages-1Gi:0 hugepages-2Mi:0 memory:18966 
> pods:87 vcore:4875]",
> "usedcapacity": "[memory:1 vcore:110]"
> },
> "properties": {},
> "queuename": "root",
> "queues": [
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[]",
> "maxcapacity": "[]",
> "usedcapacity": "[memory:1]"
> },
> "properties": {},
> "queuename": "monitoring",
> "queues": null,
> "status": "Active"
> },
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[]",
> "maxcapacity": "[]",
> "usedcapacity": "[vcore:110]"
> },
> "properties": {},
> "queuename": "kube-system",
> "queues": null,
> "status": "Active"
> },
> {
> "capacities": {
> "absusedcapacity": "[]",
> "capacity": "[cpu:300 memory:300]",
> "maxcapacity": "[cpu:1000 memory:1000]",
> "usedcapacity": "[]"
> },
> "properties": {},
> "queuename": "queue2",
> "queues": null,
> "status": "Active"
> }
> ],
> "status": "Active"
> }
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-447) Remove wrong error message from admission_controller.go

2020-10-20 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-447:
-

 Summary: Remove wrong error message from admission_controller.go
 Key: YUNIKORN-447
 URL: https://issues.apache.org/jira/browse/YUNIKORN-447
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Kinga Marton
Assignee: Kinga Marton


There is an error log message left in the admission_controller.go file: 
[https://github.com/apache/incubator-yunikorn-k8shim/blob/master/pkg/plugin/admissioncontrollers/webhook/admission_controller.go#L211-L213]

 

That message was used for debugging purposes and remained there by mistake.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-449) Add predicates.MatchNodeSelectorPred to reservation list

2020-10-26 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-449:
-

 Summary: Add predicates.MatchNodeSelectorPred to reservation list
 Key: YUNIKORN-449
 URL: https://issues.apache.org/jira/browse/YUNIKORN-449
 Project: Apache YuniKorn
  Issue Type: Improvement
Reporter: Kinga Marton
Assignee: Kinga Marton






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-452) Check and fix YK page in artifactory

2020-10-27 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-452:
-

 Summary: Check and fix YK page in artifactory
 Key: YUNIKORN-452
 URL: https://issues.apache.org/jira/browse/YUNIKORN-452
 Project: Apache YuniKorn
  Issue Type: Improvement
Reporter: Kinga Marton


Starting from October, Helm Hub moved to Artifact Hub and some information on 
the YK page is broken (logo, maintainers, etc.).

Let's check what we need to change to have all the information populated again.

More info:

[https://helm.sh/blog/helm-hub-moving-to-artifact-hub/]
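For reference, Artifact Hub picks the logo and maintainers up from the chart 
metadata, so that is likely where the fix goes; an illustrative (not 
authoritative) Chart.yaml fragment:

{noformat}
# Chart.yaml fragment -- values are placeholders
name: yunikorn
icon: https://example.org/yunikorn-logo.png
maintainers:
  - name: Apache YuniKorn community
{noformat}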

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-455) Make the core configurable

2020-10-28 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-455:
-

 Summary: Make the core configurable
 Key: YUNIKORN-455
 URL: https://issues.apache.org/jira/browse/YUNIKORN-455
 Project: Apache YuniKorn
  Issue Type: Improvement
Reporter: Kinga Marton
Assignee: Kinga Marton






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-455) Make the core configurable

2020-10-28 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton updated YUNIKORN-455:
--
Description: There are some startup options in the core side, but they are 
not configurable from outside. 

> Make the core configurable
> --
>
> Key: YUNIKORN-455
> URL: https://issues.apache.org/jira/browse/YUNIKORN-455
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> There are some startup options in the core side, but they are not 
> configurable from outside. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-455) Make the core configurable

2020-10-28 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton updated YUNIKORN-455:
--
Description: 
There are some startup options in the core side, but they are not configurable 
from outside. 

Also make the reservation expiration configurable as well.

  was:There are some startup options in the core side, but they are not 
configurable from outside. 


> Make the core configurable
> --
>
> Key: YUNIKORN-455
> URL: https://issues.apache.org/jira/browse/YUNIKORN-455
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> There are some startup options in the core side, but they are not 
> configurable from outside. 
> Also make the reservation expiration configurable as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-455) Make the core configurable

2020-10-29 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222784#comment-17222784
 ] 

Kinga Marton commented on YUNIKORN-455:
---

[~cheersyang] I just checked the meeting minutes from today's community sync, 
and I have a question:
{quote}A: there are 2 options, 1) through REST API, 2) through another config 
file. Opt 2 gives a way to persistent configs, however, it might be slower due 
to the configmap update delays.
{quote}
Here a config file is mentioned. Do you mean creating a new configmap and 
passing the values to the core, instead of having a .properties, .json, or any 
other key-value file format and letting the core side process it?

> Make the core configurable
> --
>
> Key: YUNIKORN-455
> URL: https://issues.apache.org/jira/browse/YUNIKORN-455
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> There are some startup options in the core side, but they are not 
> configurable from outside. 
> Also make the reservation expiration configurable as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-456) Add ENV var to disable the reservation

2020-10-30 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-456:
-

 Summary: Add ENV var to disable the reservation
 Key: YUNIKORN-456
 URL: https://issues.apache.org/jira/browse/YUNIKORN-456
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Kinga Marton






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Assigned] (YUNIKORN-456) Add ENV var to disable the reservation

2020-10-30 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-456:
-

Assignee: Kinga Marton

> Add ENV var to disable the reservation
> --
>
> Key: YUNIKORN-456
> URL: https://issues.apache.org/jira/browse/YUNIKORN-456
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-457) Find a way to pass the RMID to the webservice

2020-11-10 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-457:
-

 Summary: Find a way to pass the RMID to the webservice
 Key: YUNIKORN-457
 URL: https://issues.apache.org/jira/browse/YUNIKORN-457
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Kinga Marton
Assignee: Kinga Marton


When updating the configuration through the REST API, we need an RMId to 
reflect the changes in the configmap as well. With the current approach this 
might not work properly if we have more than one RM registered, or if we don't 
have any RMs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-455) Make the core configurable

2020-11-10 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229431#comment-17229431
 ] 

Kinga Marton commented on YUNIKORN-455:
---

I like the idea of having another configmap. I think we should force a restart 
in order to make the changes available. If there is a restart, the update delay 
will not be a problem in this case. For the configs we want to change at 
runtime, we can pass them via ENV variables as well.
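
A minimal sketch of what such an ENV override could look like on the core side 
(the RESERVATION_DELAY name and the default below are made up for 
illustration, not an agreed interface):
{noformat}
package main

import (
	"fmt"
	"os"
	"time"
)

// reservationDelay reads an optional override from the environment and falls
// back to the built-in default when the variable is unset or not parseable.
func reservationDelay() time.Duration {
	const defaultDelay = 2 * time.Second
	if raw, ok := os.LookupEnv("RESERVATION_DELAY"); ok {
		if d, err := time.ParseDuration(raw); err == nil {
			return d
		}
	}
	return defaultDelay
}

func main() {
	// e.g. RESERVATION_DELAY=5s would override the default
	fmt.Println("reservation delay:", reservationDelay())
}
{noformat}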

> Make the core configurable
> --
>
> Key: YUNIKORN-455
> URL: https://issues.apache.org/jira/browse/YUNIKORN-455
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> There are some startup options in the core side, but they are not 
> configurable from outside. 
> Also make the reservation expiration configurable as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-457) Find a way to pass the RMID to the webservice

2020-11-10 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229441#comment-17229441
 ] 

Kinga Marton commented on YUNIKORN-457:
---

Today I thought a bit about how we can solve this issue and I found a better 
way than the current implementation:
 - When calling {{updateSchedulerConfig}} from the webservice, pass an empty 
RMID.
 - In {{updateSchedulerConfig}}, if the RMID is empty: the clusterContext holds 
a map of partitions whose keys have the form {{[rmID]partitionName}}. While 
processing the partitions from the changed config, we can iterate through this 
map and check whether it contains a key with the same partitionName. If yes, we 
can extract the rmID from that key.
 - If the partition is a new one, use a default RMID, which we can store in the 
ClusterContext and set when the first RM is registered.
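
A rough sketch of that lookup (the function and the key scan below are 
illustrative only, not the actual ClusterContext API):
{noformat}
package main

import (
	"fmt"
	"strings"
)

// resolveRMID returns the RM ID for a partition name by scanning keys of the
// form "[rmID]partitionName". For a partition that is not in the map yet (a
// new one) it falls back to the default RM ID set when the first RM registered.
func resolveRMID(partitionKeys []string, defaultRMID, partitionName string) string {
	for _, key := range partitionKeys {
		end := strings.Index(key, "]")
		if strings.HasPrefix(key, "[") && end > 0 && key[end+1:] == partitionName {
			return key[1:end]
		}
	}
	return defaultRMID
}

func main() {
	keys := []string{"[mycluster]default"}
	fmt.Println(resolveRMID(keys, "mycluster", "default")) // prints: mycluster
	fmt.Println(resolveRMID(keys, "mycluster", "other"))   // falls back to the default
}
{noformat}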

> Find a way to pass the RMID to the webservice
> -
>
> Key: YUNIKORN-457
> URL: https://issues.apache.org/jira/browse/YUNIKORN-457
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> When updating the configuration through the REST API, we need an RMId to 
> reflect the changes in the configmap as well. With the actual approach this 
> might not work properly if we would have more than one RM registered, or if 
> we don't have any RM's.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-465) scheduler health check REST API

2020-11-21 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-465:
-

 Summary: scheduler health check REST API
 Key: YUNIKORN-465
 URL: https://issues.apache.org/jira/browse/YUNIKORN-465
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Kinga Marton
Assignee: Kinga Marton


We need to build a health check REST API for the scheduler.
This is needed for chaos monkey tests: the validation script can call the API 
to verify the scheduler state periodically.
We should leverage scheduler metrics to do the validation. Things to validate 
include:
 # Negative resources on node/app/cluster
 # Consistency of the data, e.g. sum of allocated resources of apps = allocated 
resource in the partition
 # Critical errors logged in the metrics (things that should not happen but happened)
 # ...
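
As a rough illustration only (the type and field names below are made up, not 
the final API), one such check and the response could be shaped like this:
{noformat}
package main

import (
	"encoding/json"
	"fmt"
)

// HealthCheckInfo is one entry in a hypothetical health check response.
type HealthCheckInfo struct {
	Name             string `json:"Name"`
	Succeeded        bool   `json:"Succeeded"`
	DiagnosisMessage string `json:"DiagnosisMessage,omitempty"`
}

// SchedulerHealth is the overall result: healthy only if every check passed.
type SchedulerHealth struct {
	Healthy bool              `json:"Healthy"`
	Checks  []HealthCheckInfo `json:"HealthChecks"`
}

// checkNonNegativeResources is one example check from the list above: no
// resource quantity on node/app/cluster may be negative.
func checkNonNegativeResources(resources map[string]int64) HealthCheckInfo {
	info := HealthCheckInfo{Name: "Negative resources", Succeeded: true}
	for name, value := range resources {
		if value < 0 {
			info.Succeeded = false
			info.DiagnosisMessage = fmt.Sprintf("negative quantity for %s: %d", name, value)
		}
	}
	return info
}

func main() {
	checks := []HealthCheckInfo{
		checkNonNegativeResources(map[string]int64{"memory": 100, "vcore": -1}),
	}
	health := SchedulerHealth{Healthy: true, Checks: checks}
	for _, c := range checks {
		health.Healthy = health.Healthy && c.Succeeded
	}
	out, _ := json.MarshalIndent(health, "", "  ")
	fmt.Println(string(out))
}
{noformat}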



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-465) scheduler health check REST API

2020-12-02 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton updated YUNIKORN-465:
--
Attachment: HealthCheck_output

> scheduler health check REST API
> ---
>
> Key: YUNIKORN-465
> URL: https://issues.apache.org/jira/browse/YUNIKORN-465
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
> Attachments: HealthCheck_output
>
>
> We need to build a health check REST API for the scheduler
> This is needed for chaos monkey tests, the validation script can call the API 
> to verify the scheduler state periodically
> We should leverage scheduler metrics to do the validation, things to validate 
> like:
>  # Negative resources on node/app/cluster
>  # Consistency of the data, e.g sum of allocated resource of apps = allocated 
> resource in the partition
>  # critical errors logged in the metrics (things should not happen but 
> happened)
>  # ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-285) Lint check doesn't work on branch-0.9

2020-12-09 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-285.
---
Fix Version/s: 0.10
   Resolution: Fixed

Thank you [~wilfreds] for fixing this. I merged the PRs to the master branch.

> Lint check doesn't work on branch-0.9
> -
>
> Key: YUNIKORN-285
> URL: https://issues.apache.org/jira/browse/YUNIKORN-285
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler, shim - kubernetes, test - unit
>Reporter: Weiwei Yang
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> Looks like the lint check always fails on branch-0.9. For both repos, shim 
> and core. See the following jobs: 
> https://travis-ci.org/github/apache/incubator-yunikorn-core/builds
> https://travis-ci.org/github/apache/incubator-yunikorn-k8shim/builds
> such as
>  - 
> https://travis-ci.org/github/apache/incubator-yunikorn-k8shim/jobs/708324431
>  - 
> https://travis-ci.org/github/apache/incubator-yunikorn-core/builds/707911720



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-285) Lint check doesn't work on branch-0.9

2020-12-09 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246505#comment-17246505
 ] 

Kinga Marton commented on YUNIKORN-285:
---

[~wilfreds] We need to backport it to branch-0.9 as well, right?

> Lint check doesn't work on branch-0.9
> -
>
> Key: YUNIKORN-285
> URL: https://issues.apache.org/jira/browse/YUNIKORN-285
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler, shim - kubernetes, test - unit
>Reporter: Weiwei Yang
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> Looks like the lint check always fails on branch-0.9. For both repos, shim 
> and core. See the following jobs: 
> https://travis-ci.org/github/apache/incubator-yunikorn-core/builds
> https://travis-ci.org/github/apache/incubator-yunikorn-k8shim/builds
> such as
>  - 
> https://travis-ci.org/github/apache/incubator-yunikorn-k8shim/jobs/708324431
>  - 
> https://travis-ci.org/github/apache/incubator-yunikorn-core/builds/707911720



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-477) Include K8s 1.18 to e2e test matrix

2020-12-09 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-477.
---
Fix Version/s: 0.10
   Resolution: Fixed

[~wwei] I merged your changes and will open a Jira for checking the coverage 
issue, since it was complaining about a file that wasn't touched by your changes.

> Include K8s 1.18 to e2e test matrix 
> 
>
> Key: YUNIKORN-477
> URL: https://issues.apache.org/jira/browse/YUNIKORN-477
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-482) Code coverage complaining about untouched files

2020-12-09 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-482:
-

 Summary: Code coverage complaining about untouched files
 Key: YUNIKORN-482
 URL: https://issues.apache.org/jira/browse/YUNIKORN-482
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Kinga Marton


The following change: 
[https://github.com/apache/incubator-yunikorn-k8shim/pull/212/checks?check_run_id=1500902292]

is complaining about decreased project coverage in a file that wasn't touched 
by the changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-477) Include K8s 1.18 to e2e test matrix

2020-12-09 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246528#comment-17246528
 ] 

Kinga Marton edited comment on YUNIKORN-477 at 12/9/20, 1:19 PM:
-

[~wwei] I merged your changes and opened a Jira (YUNIKORN-482) for checking the 
coverage issue, since it was complaining about a file that wasn't touched by 
your changes.


was (Author: kmarton):
[~wwei] I merged your changes and will open a Jira for checking the coverage 
issue, since t was complaining about a file what wasn't touched by your changes.

> Include K8s 1.18 to e2e test matrix 
> 
>
> Key: YUNIKORN-477
> URL: https://issues.apache.org/jira/browse/YUNIKORN-477
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-480) root queue max resource gets reset on config load

2020-12-10 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-480.
---
Fix Version/s: 0.10
   Resolution: Fixed

> root queue max resource gets reset on config load
> -
>
> Key: YUNIKORN-480
> URL: https://issues.apache.org/jira/browse/YUNIKORN-480
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 0.10
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> When updating the configuration the root queue max resource is getting reset 
> to {{nil}}. The configuration should never set the root queue resources. On 
> creation of the queue this is not a problem as there cannot be any registered 
> node. On update we have registered nodes and should thus not clear it out.
> When the max resource gets reset it stops allocation as the scheduler thinks 
> there are no nodes registered and nothing can be done.
> Found by the e2e tests as part of the shim dependency change YUNIKORN-475



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-482) Code coverage complaining about untouched files

2020-12-11 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247764#comment-17247764
 ] 

Kinga Marton commented on YUNIKORN-482:
---

[~wilfreds] I agree that we need to cover that negative case as well, but it 
should report this when that part of the code is added. My problem is that it 
shows this issue as a decrease in coverage, when only the travis.yaml was 
modified.

> Code coverage complaining about untouched files
> ---
>
> Key: YUNIKORN-482
> URL: https://issues.apache.org/jira/browse/YUNIKORN-482
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Kinga Marton
>Priority: Minor
>  Labels: coverage, pre-commit
> Attachments: image-2020-12-11-16-30-46-654.png
>
>
> The following change: 
> [https://github.com/apache/incubator-yunikorn-k8shim/pull/212/checks?check_run_id=1500902292]
> is complaining about decreased project coverage in a file what wasn't touched 
> by the changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-482) Code coverage complaining about untouched files

2020-12-11 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247764#comment-17247764
 ] 

Kinga Marton edited comment on YUNIKORN-482 at 12/11/20, 9:05 AM:
--

[~wilfreds] I agree that we need to cover that negative case as well, but it 
should report this when that part of the code is added. My problem is that it 
shows this issue as a decrease in coverage, when only the travis.yaml was 
modified: [https://github.com/apache/incubator-yunikorn-k8shim/pull/212/files]

 


was (Author: kmarton):
[~wilfreds]  I agree that we need to cover that negative case as well, but it 
should report this when that part of code is added. My problem is that it shows 
this issue as a decrease in the coverage, when only the travis.yaml was 
modified.

> Code coverage complaining about untouched files
> ---
>
> Key: YUNIKORN-482
> URL: https://issues.apache.org/jira/browse/YUNIKORN-482
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Kinga Marton
>Priority: Minor
>  Labels: coverage, pre-commit
> Attachments: image-2020-12-11-16-30-46-654.png
>
>
> The following change: 
> [https://github.com/apache/incubator-yunikorn-k8shim/pull/212/checks?check_run_id=1500902292]
> is complaining about decreased project coverage in a file what wasn't touched 
> by the changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-402) Make sure when there is no allocation in an app, the app state is "Waiting".

2020-12-11 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-402.
---
Fix Version/s: 0.10
   Resolution: Fixed

> Make sure when there is no allocation in an app, the app state is "Waiting".
> 
>
> Key: YUNIKORN-402
> URL: https://issues.apache.org/jira/browse/YUNIKORN-402
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
> Fix For: 0.10
>
>
> If there is no allocation for an app, according to 
> [http://yunikorn.apache.org/docs/next/design/scheduler_object_states] it's 
> status should be waiting instead of running, as mentioned here: 
> https://issues.apache.org/jira/browse/YUNIKORN-201?focusedCommentId=17186402&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17186402



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-114) Clone the shallow clone version of protobuf repo

2020-12-11 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-114.
---
Fix Version/s: 0.10
   Resolution: Fixed

> Clone the shallow clone version of protobuf repo
> 
>
> Key: YUNIKORN-114
> URL: https://issues.apache.org/jira/browse/YUNIKORN-114
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: scheduler-interface
>Reporter: Adam Antal
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> When building the scheduler interface, we pull the whole protobuf repo. While 
> the source files are not that big, the git history (that we actually don't 
> need) makes it a bit slower to clone it. 
> We actually want to check out the latest tag, that we could do without 
> cloning by getting the tag with {{git ls-remote --tags}}) and 
> cloning/checking out just that revision with {{git clone --depth=1}}. Though 
> the the scheduler build time is not a bottleneck, I think we can improve the 
> build time a bit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-484) Handle the app completion in the core side

2020-12-11 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247952#comment-17247952
 ] 

Kinga Marton commented on YUNIKORN-484:
---

[~wwei] most of the workflow seems OK to me, but I have a question related to 
the first step:
{quote}when the core sees there is no pending ask and running containers, it 
moves the app to the "Competed" state
{quote} Here I think we should keep the app in the Waiting state for a while 
and use a timeout for moving it to the Completed state.

> Handle the app completion in the core side
> --
>
> Key: YUNIKORN-484
> URL: https://issues.apache.org/jira/browse/YUNIKORN-484
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> Currently, if there is no pending ask or running container for an app, we 
> transmit it to the "Waiting" state. To step further, we keep the "Waiting" 
> state for a short period of time and then transit the state to "Completed".
> Before YUNIKORN-2 is done, this is to track the core side changes to do the 
> state transition. When the app moves to the completed state, the core needs 
> to send a "UpdateResponse#UpdatedApplication" so the shim can do proper 
> cleanup.
> After YUNIKORN-2 is done, when the app is "Completed", core needs to ask the 
> shim to release all the placeholder pods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-418) Add "config" REST API

2020-12-11 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247968#comment-17247968
 ] 

Kinga Marton commented on YUNIKORN-418:
---

I agree with [~wwei] that for simplicity we should keep the "v1/config". Also 
it would be nice to have all the config-related endpoints in the same place, 
but I think the nicer solution would be to use POST for doing the validation 
instead of adding the dry_run parameter.

> Add "config" REST API
> -
>
> Key: YUNIKORN-418
> URL: https://issues.apache.org/jira/browse/YUNIKORN-418
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Assigned] (YUNIKORN-484) Handle the app completion in the core side

2020-12-15 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-484:
-

Assignee: Kinga Marton  (was: Wilfred Spiegelenburg)

> Handle the app completion in the core side
> --
>
> Key: YUNIKORN-484
> URL: https://issues.apache.org/jira/browse/YUNIKORN-484
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Kinga Marton
>Priority: Major
>
> Currently, if there is no pending ask or running container for an app, we 
> transmit it to the "Waiting" state. To step further, we keep the "Waiting" 
> state for a short period of time and then transit the state to "Completed".
> Before YUNIKORN-2 is done, this is to track the core side changes to do the 
> state transition. When the app moves to the completed state, the core needs 
> to send a "UpdateResponse#UpdatedApplication" so the shim can do proper 
> cleanup.
> After YUNIKORN-2 is done, when the app is "Completed", core needs to ask the 
> shim to release all the placeholder pods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-484) Handle the app completion in the core side

2020-12-15 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249696#comment-17249696
 ] 

Kinga Marton commented on YUNIKORN-484:
---

[~wilfreds], I started to implement the Completed state for the *non gang 
scheduling case* and I have a question related to the removal of the 
application from the queue:
 - the design doc states that "_Entering into the completed state will move 
the application out of the queue automatically._"
 - right now the applications are removed from the partition and queue when the 
shim sends a {{RemoveApplicationRequest}} to the core

If we remove the application from the partition and queue right after the 
transition to the Completed state, we will lose the application in the UI. Is 
it OK that we will not be able to track the already completed applications?

> Handle the app completion in the core side
> --
>
> Key: YUNIKORN-484
> URL: https://issues.apache.org/jira/browse/YUNIKORN-484
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Kinga Marton
>Priority: Major
>
> Currently, if there is no pending ask or running container for an app, we 
> transmit it to the "Waiting" state. To step further, we keep the "Waiting" 
> state for a short period of time and then transit the state to "Completed".
> Before YUNIKORN-2 is done, this is to track the core side changes to do the 
> state transition. When the app moves to the completed state, the core needs 
> to send a "UpdateResponse#UpdatedApplication" so the shim can do proper 
> cleanup.
> After YUNIKORN-2 is done, when the app is "Completed", core needs to ask the 
> shim to release all the placeholder pods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-484) Handle the app completion in the core side

2020-12-16 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249696#comment-17249696
 ] 

Kinga Marton edited comment on YUNIKORN-484 at 12/16/20, 5:16 PM:
--

[~wwei], [~wilfreds], I started to implement the Completed state for the *non 
gang scheduling case* and I have a question related to the removal of the 
application from the queue:
 - the design doc states that "_Entering into the completed state will move 
the application out of the queue automatically._"
 - right now the applications are removed from the partition and queue when the 
shim sends a {{RemoveApplicationRequest}} to the core

If we remove the application from the partition and queue right after the 
transition to the Completed state, we will lose the application in the UI. Is 
it OK that we will not be able to track the already completed applications?


was (Author: kmarton):
[~wilfreds], I started to implement the Completed state for the *non gang 
scheduling case* and I have a question related the removal of the application 
from the queue:
 - in the design doc stays that "_Entering into the completed state will move 
the application out of the queue automatically._ " 
 - right now the applications are removed from the partition and queue when the 
shim sends to the core a  {{RemoveApplicationRequest}}

If we remove the application from the partition and queue right after the 
transition to Completed state, we will lose the application in the UI. Is it OK 
that we will not be able to track the already completed applications?

> Handle the app completion in the core side
> --
>
> Key: YUNIKORN-484
> URL: https://issues.apache.org/jira/browse/YUNIKORN-484
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if there is no pending ask or running container for an app, we 
> transmit it to the "Waiting" state. To step further, we keep the "Waiting" 
> state for a short period of time and then transit the state to "Completed".
> Before YUNIKORN-2 is done, this is to track the core side changes to do the 
> state transition. When the app moves to the completed state, the core needs 
> to send a "UpdateResponse#UpdatedApplication" so the shim can do proper 
> cleanup.
> After YUNIKORN-2 is done, when the app is "Completed", core needs to ask the 
> shim to release all the placeholder pods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-503) Fix recovery for completed apps

2021-01-13 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-503:
-

 Summary: Fix recovery for completed apps
 Key: YUNIKORN-503
 URL: https://issues.apache.org/jira/browse/YUNIKORN-503
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Kinga Marton


After implementing the Completed state, we need to fix the recovery part.

Right now, if there is an application in the Waiting or Completed state and we 
restart the scheduler, the recreated application will be in the New state, and 
if no pods are assigned to that app it will never transition to the Completed 
state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-504) Show the completed apps in the UI

2021-01-13 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-504:
-

 Summary: Show the completed apps in the UI
 Key: YUNIKORN-504
 URL: https://issues.apache.org/jira/browse/YUNIKORN-504
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Kinga Marton


After YUNIKORN-484, we will store the completed apps in a separate list from 
the other ones. It would be useful if we could show these apps in the UI as 
well, but in a separate table from the Apps table.

cc [~ayubpathan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-507) Add git version pre-requisite in the build guide

2021-01-14 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264720#comment-17264720
 ] 

Kinga Marton commented on YUNIKORN-507:
---

[~wwei], I don't have the exact version from which it already works. What I 
know is that it works with 2.22.0.rc2 and newer. I think we should mention 
2.22.

> Add git version pre-requisite in the build guide
> 
>
> Key: YUNIKORN-507
> URL: https://issues.apache.org/jira/browse/YUNIKORN-507
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: documentation
>Reporter: Weiwei Yang
>Priority: Minor
>
> Recently we found our build will fail if the git version is too old, e.g 2.4.x
> We should document this on our web-site: 
> http://yunikorn.apache.org/docs/next/developer_guide/build. To require a 
> minimal version of git to be installed before trying to build YK. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-513) ApplicationState remains in Accepted after recovery

2021-01-25 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-513:
-

 Summary: ApplicationState remains in Accepted after recovery
 Key: YUNIKORN-513
 URL: https://issues.apache.org/jira/browse/YUNIKORN-513
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - cache
Affects Versions: 0.10
Reporter: Kinga Marton
Assignee: Kinga Marton


Steps to reproduce:
 * Start 2 sleep jobs
 * Wait for both to run and applicationState to be Running
 * Kill yunikorn
 * After 10 minutes, the rest call now shows both applicationState as accepted 
instead of running



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-514) Intermittent issues in e2e tests after YUNIKORN-317

2021-01-26 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272112#comment-17272112
 ] 

Kinga Marton commented on YUNIKORN-514:
---

Thank you [~wwei] for the fix! I merged your changes to both the master and 
branch-0.10 branches. Before resolving this issue please update the dependency 
on the shim side.

> Intermittent issues in e2e tests after YUNIKORN-317
> ---
>
> Key: YUNIKORN-514
> URL: https://issues.apache.org/jira/browse/YUNIKORN-514
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
>
> Post YUNIKORN-317, we've seen some intermittent issues in e2e tests, such as 
> https://travis-ci.com/github/apache/incubator-yunikorn-k8shim/builds/213126549.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-513) ApplicationState remains in Accepted after recovery

2021-01-26 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272320#comment-17272320
 ] 

Kinga Marton commented on YUNIKORN-513:
---

After the core refactoring, the app state transition happens in the following 
steps:
 * New -> Accepted: when an allocationAsk is processed
 * Accepted -> Starting: when the allocation is processed
 * Starting -> Running: when the second allocation is processed or when the 
Starting state times out.

In case of recovery, we don't have an AllocationAsk, just already existing 
Allocations, so the first transition is skipped. This means that if we have 
only 2 allocations, the application will not progress into the Running state. 
For recovery we need to progress it manually from New to Accepted.
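
A simplified sketch of that manual step (not the real application state 
machine, only the transition order):
{noformat}
package main

import "fmt"

type appState string

const (
	stateNew      appState = "New"
	stateAccepted appState = "Accepted"
	stateStarting appState = "Starting"
	stateRunning  appState = "Running"
)

// recoverAllocation mimics processing one recovered allocation: there is no
// AllocationAsk during recovery, so the New -> Accepted step is done manually
// before the normal allocation-driven transition is applied.
func recoverAllocation(state appState) appState {
	if state == stateNew {
		state = stateAccepted // manual step, only needed in the recovery path
	}
	switch state {
	case stateAccepted:
		return stateStarting
	case stateStarting:
		return stateRunning
	default:
		return state
	}
}

func main() {
	state := stateNew
	for i := 0; i < 2; i++ { // recover an app that had two allocations
		state = recoverAllocation(state)
	}
	fmt.Println("state after recovery:", state) // Running
}
{noformat}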

> ApplicationState remains in Accepted after recovery
> ---
>
> Key: YUNIKORN-513
> URL: https://issues.apache.org/jira/browse/YUNIKORN-513
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - cache
>Affects Versions: 0.10
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce:
>  * Start 2 sleep jobs
>  * Wait for both to run and applicationState to be Running
>  * Kill yunikorn
>  * After 10 minutes, the rest call now shows both applicationState as 
> accepted instead of running



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-517) Yunikorn v0.10 logs are filled with "clean up orphan pod" message for every 5 seconds

2021-01-27 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-517.
---
Resolution: Duplicate

This is already addressed by YUNIKORN-512

> Yunikorn v0.10 logs are filled with "clean up orphan pod" message for every 5 
> seconds
> -
>
> Key: YUNIKORN-517
> URL: https://issues.apache.org/jira/browse/YUNIKORN-517
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Priority: Major
>
> {noformat}
> 2021-01-24T15:04:25.940Z INFO cache/placeholder_manager.go:148 clean up 
> orphan pod
> 2021-01-24T15:04:30.940Z INFO cache/placeholder_manager.go:148 clean up 
> orphan pod
> 2021-01-24T15:04:35.940Z INFO cache/placeholder_manager.go:148 clean up 
> orphan pod
> 2021-01-24T15:04:40.944Z INFO cache/placeholder_manager.go:148 clean up 
> orphan pod
> 2021-01-24T15:04:45.947Z INFO cache/placeholder_manager.go:148 clean up 
> orphan pod \{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Assigned] (YUNIKORN-512) Remove some useless log messages

2021-01-27 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-512:
-

Assignee: Weiwei Yang

> Remove some useless log messages
> 
>
> Key: YUNIKORN-512
> URL: https://issues.apache.org/jira/browse/YUNIKORN-512
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler, shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
>  Labels: pull-request-available
> Attachments: k9s.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-512) Remove some useless log messages

2021-01-27 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272704#comment-17272704
 ] 

Kinga Marton commented on YUNIKORN-512:
---

Thank you [~wwei] for handling this. I merged your PR to both master and 
branch-0.10.

> Remove some useless log messages
> 
>
> Key: YUNIKORN-512
> URL: https://issues.apache.org/jira/browse/YUNIKORN-512
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler, shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
>  Labels: pull-request-available
> Attachments: k9s.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-512) Remove some useless log messages

2021-01-27 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-512.
---
Fix Version/s: 0.10
   Resolution: Fixed

> Remove some useless log messages
> 
>
> Key: YUNIKORN-512
> URL: https://issues.apache.org/jira/browse/YUNIKORN-512
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler, shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.10
>
> Attachments: k9s.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-519) Cleanup placeholders when the app is Completed

2021-01-27 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-519:
-

 Summary: Cleanup placeholders when the app is Completed
 Key: YUNIKORN-519
 URL: https://issues.apache.org/jira/browse/YUNIKORN-519
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Kinga Marton
Assignee: Kinga Marton


App completion is handled by YUNIKORN-484; however, for gang scheduling we need 
to do some further cleanup around the unused placeholders.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-507) Add git version pre-requisite in the build guide

2021-01-29 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274461#comment-17274461
 ] 

Kinga Marton commented on YUNIKORN-507:
---

[~wilfreds] I merged your changes to the master branch. It is enough to have it 
on that branch, right?

> Add git version pre-requisite in the build guide
> 
>
> Key: YUNIKORN-507
> URL: https://issues.apache.org/jira/browse/YUNIKORN-507
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
>
> Recently we found our build will fail if the git version is too old, e.g 2.4.x
> We should document this on our web-site: 
> http://yunikorn.apache.org/docs/next/developer_guide/build. To require a 
> minimal version of git to be installed before trying to build YK. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-460) Handle app reservation timeout

2021-01-29 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274516#comment-17274516
 ] 

Kinga Marton commented on YUNIKORN-460:
---

Yes, [~wwei].

Right now we have a field in the application called {{execTimeout}}. We already 
have this in the interface as well. I think that we should use this field to 
set the timeout, so the workflow would be the following:
 - The user can define this timeout in an annotation such as: 
{{yunikorn.apache.org/schedulingPolicyParameters: “timeoutInSec=600”}}
 - The shim will process this information and, when the Application is created 
and sent to the core, populate this {{execTimeout}} field
 - In the core, when we start to schedule the application (so the queue has 
enough headroom for the gang members), we can start the timer.
 - This timeout is the time measured from the start of scheduling until the 
application runs. So we reset the timer when the application progresses into 
the Completed state (or the Waiting state, and reinitialise it if an 
application in the Waiting state gets new allocations).
 - If it times out, we can kill the application and send an UpdateResponse 
message to the shim with the following content:
 - UpdatedApplication for the state change of the application
 - AllocationRelease messages with the application's current allocations that 
need to be released

[~wwei],[~wilfreds] what are your thoughts?
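
As a small illustration of the annotation handling in the first two steps above 
(the parameter name comes from the proposal; the parsing itself is just a 
sketch, not the shim code):
{noformat}
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseExecTimeoutSeconds extracts "timeoutInSec" from an annotation value
// such as "timeoutInSec=600"; it returns 0 when the parameter is missing or
// cannot be parsed.
func parseExecTimeoutSeconds(value string) int64 {
	for _, param := range strings.Fields(value) {
		kv := strings.SplitN(param, "=", 2)
		if len(kv) == 2 && kv[0] == "timeoutInSec" {
			if v, err := strconv.ParseInt(kv[1], 10, 64); err == nil {
				return v
			}
		}
	}
	return 0
}

func main() {
	annotation := "timeoutInSec=600"
	fmt.Println(parseExecTimeoutSeconds(annotation)) // prints: 600
}
{noformat}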

> Handle app reservation timeout
> --
>
> Key: YUNIKORN-460
> URL: https://issues.apache.org/jira/browse/YUNIKORN-460
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Kinga Marton
>Priority: Major
>
> When an app is configured with a timeout, that determines the maximum time 
> permitted to stay in the Reserving phase. If that times out, then all the 
> existing placeholders should be deleted and the application will be scheduled 
> normally. This timeout is needed because otherwise an app’s partial 
> placeholders may occupy cluster resources and they are wasted.
> See more in [this 
> doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-525) Update Application states doc

2021-01-29 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-525:
-

 Summary: Update Application states doc
 Key: YUNIKORN-525
 URL: https://issues.apache.org/jira/browse/YUNIKORN-525
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Kinga Marton
Assignee: Kinga Marton


Now, that we have implemented and redefined the completed state of an 
application, we need to update the documentation as well: 
[http://yunikorn.apache.org/docs/next/design/scheduler_object_states/]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-533) Improve admission controller logging

2021-02-01 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-533:
-

 Summary: Improve admission controller logging
 Key: YUNIKORN-533
 URL: https://issues.apache.org/jira/browse/YUNIKORN-533
 Project: Apache YuniKorn
  Issue Type: Improvement
Reporter: Kinga Marton


Right now it is not easy to debug issues related to the admission controller. 
The logging level is also not configurable: even though we have some log 
entries defined at debug level, I think we cannot change the level, which is 
INFO by default.

We need the following improvements related to the admission controller logging:
 * review and cleanup the logging
 * make the log level configurable
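
A minimal sketch of one way to make the level configurable, assuming a 
zap-based logger (the LOG_LEVEL variable name here is just an example, not an 
agreed interface):
{noformat}
package main

import (
	"os"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newLogger builds a zap logger whose level comes from a LOG_LEVEL environment
// variable and falls back to INFO when it is unset or cannot be parsed.
func newLogger() (*zap.Logger, error) {
	level := zapcore.InfoLevel
	if raw, ok := os.LookupEnv("LOG_LEVEL"); ok {
		var parsed zapcore.Level
		if err := parsed.UnmarshalText([]byte(raw)); err == nil {
			level = parsed
		}
	}
	cfg := zap.NewProductionConfig()
	cfg.Level = zap.NewAtomicLevelAt(level)
	return cfg.Build()
}

func main() {
	logger, err := newLogger()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()
	logger.Debug("only visible when LOG_LEVEL=debug")
	logger.Info("admission controller logging initialised")
}
{noformat}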



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Assigned] (YUNIKORN-533) Improve admission controller logging

2021-02-01 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-533:
-

Assignee: Kinga Marton

> Improve admission controller logging
> 
>
> Key: YUNIKORN-533
> URL: https://issues.apache.org/jira/browse/YUNIKORN-533
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>
> Right now it is not so easy to debug issues related to the admission 
> controller, also the logging level is not configurable, even if we have some 
> log entries defined at debug level, I think we cannot change the level, what 
> by default is INFO.
> We need the following improvements related the admission controller logging:
>  * review and cleanup the logging
>  * make the log level configurable



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-460) Handle app reservation timeout

2021-02-01 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276393#comment-17276393
 ] 

Kinga Marton edited comment on YUNIKORN-460 at 2/1/21, 3:04 PM:


Today we had a sync with [~wilfreds] on this topic. I am summarising here what 
we discussed:

When it comes to the timeout we have 2 cases:
 # The queue is full, so only a part of the placeholders got allocated (for 
example the app asks for 100GB but the placeholders are using 50GB)
 # The placeholders are all allocated, but not all of them were replaced by 
real pods (it can be due to a configuration issue, but it can also be because 
something changed in the cluster)

We will kill the placeholder pods in both cases if it times out, but we will 
not kill the whole application, so in the second case the already running real 
pods will continue to do their job. We kill only the placeholders.
 * We will start the timer when the first placeholder gets allocated
 * When it times out we just kill all the placeholders, if we have any

[~wwei] related to the new state you mentioned, I don't think we can add this 
new state, because when the first placeholder is replaced by the new pod, the 
application is already transitioning into the Running state. I don't think it 
is a good idea to make a difference between a simple app and one with a gang 
defined as to when it starts running.

[~wilfreds] please correct me if I am wrong, or if I missed something.


was (Author: kmarton):
Today we had a sync with [~wilfreds] on this topic. I am summarising here what 
we discussed:

When it comes about the timeout we have 2 cases
 # The queue is full, so only a part of the placeholders got allocated(for 
example the app ask for 100GB but the placeholders are using 50GB)
 # The placeholders are all allocated, but not all of them were replaced by 
real pods ( it can be due to configuration issue, but can be because something 
is changed in the cluster as well) 

We will kill the placeholder pods in both cases if it times out, but we will 
not kill the whole application, so in the second case the already running real 
pods will continue to do their job. We kill only the placeholders.
 * We will start the timer when the first placeholder is getting allocated
 * When it times out we just kill all the placeholders if we have any

[~wwei] related to the new state you mentioned, I don't think that we can add 
this new state, because when the first placeholder is replaced by the new pod, 
the application is already transitioning into the Running state. I don't think 
it is a good idea to make a difference between a simple app and one with a gang 
defined related to when it will start running.

> Handle app reservation timeout
> --
>
> Key: YUNIKORN-460
> URL: https://issues.apache.org/jira/browse/YUNIKORN-460
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Kinga Marton
>Priority: Major
>
> When an app is configured with a timeout, that determines the maximum time 
> permitted to stay in the Reserving phase. If that times out, then all the 
> existing placeholders should be deleted and the application will be scheduled 
> normally. This timeout is needed because otherwise an app’s partial 
> placeholders may occupy cluster resources and they are wasted.
> See more in [this 
> doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-460) Handle app reservation timeout

2021-02-01 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276393#comment-17276393
 ] 

Kinga Marton commented on YUNIKORN-460:
---

Today we had a sync with [~wilfreds] on this topic. I am summarising here what 
we discussed:

When it comes to the timeout we have 2 cases:
 # The queue is full, so only a part of the placeholders got allocated (for 
example the app asks for 100GB but the placeholders are using 50GB)
 # The placeholders are all allocated, but not all of them were replaced by 
real pods (it can be due to a configuration issue, but it can also be because 
something changed in the cluster)

We will kill the placeholder pods in both cases if it times out, but we will 
not kill the whole application, so in the second case the already running real 
pods will continue to do their job. We kill only the placeholders.
 * We will start the timer when the first placeholder gets allocated
 * When it times out we just kill all the placeholders, if we have any

[~wwei] related to the new state you mentioned, I don't think we can add this 
new state, because when the first placeholder is replaced by the new pod, the 
application is already transitioning into the Running state. I don't think it 
is a good idea to make a difference between a simple app and one with a gang 
defined as to when it starts running.
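
A simplified sketch of the timer behaviour described above (illustrative only, 
not the real placeholder handling in the core):
{noformat}
package main

import (
	"fmt"
	"sync"
	"time"
)

// placeholderTimer arms a single timeout when the first placeholder of an app
// is allocated; on expiry only the placeholders are released, real pods keep
// running.
type placeholderTimer struct {
	mu    sync.Mutex
	timer *time.Timer
}

func (p *placeholderTimer) startOnFirstAllocation(timeout time.Duration, onTimeout func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.timer == nil { // only the first placeholder allocation starts the timer
		p.timer = time.AfterFunc(timeout, onTimeout)
	}
}

func (p *placeholderTimer) stop() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.timer != nil {
		p.timer.Stop()
	}
}

func main() {
	var pt placeholderTimer
	pt.startOnFirstAllocation(100*time.Millisecond, func() {
		fmt.Println("timeout: release remaining placeholders, keep real pods running")
	})
	time.Sleep(200 * time.Millisecond) // let the timer fire for this demo
	pt.stop()
}
{noformat}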

> Handle app reservation timeout
> --
>
> Key: YUNIKORN-460
> URL: https://issues.apache.org/jira/browse/YUNIKORN-460
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Kinga Marton
>Priority: Major
>
> When an app is configured with a timeout, that determines the maximum time 
> permitted to stay in the Reserving phase. If that times out, then all the 
> existing placeholders should be deleted and the application will be scheduled 
> normally. This timeout is needed because otherwise an app’s partial 
> placeholders may occupy cluster resources and they are wasted.
> See more in [this 
> doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-460) Handle app reservation timeout

2021-02-02 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277017#comment-17277017
 ] 

Kinga Marton commented on YUNIKORN-460:
---

Yesterday [~wwei] mentioned a scenario that is not covered by the previous 
design:
 * we don't have any placeholders allocated, so all of them are pending and we 
have just the AllocationAsks.

Covering this case as well would mean starting the timer when we try to 
allocate the placeholders, instead of waiting for the first placeholder 
allocation.

When it times out, we should remove not only the placeholder allocations but 
the AllocationAsks as well. Since there is currently no way to send back to the 
shim the AllocationAsks that need to be removed, the simpler solution is to 
fail the application, or add a new state; based on the termination state the 
shim will then be able to handle both the asks and the allocations on its side.

[~wwei] please correct me if I missed something.

[~wilfreds] what do you think about this approach? I think we should cover the 
mentioned case as well.
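
To illustrate the idea, a rough, self-contained sketch of such a timeout 
handler (hypothetical names, not the actual core code): on timeout it releases 
the placeholder allocations, drops the pending AllocationAsks, and moves the 
application to a terminal state so the shim can clean up both on its side.

{code:go}
package main

import "fmt"

// application is a simplified stand-in for the scheduler-side app object.
type application struct {
	state        string
	pendingAsks  []string // AllocationAsks not yet satisfied
	placeholders []string // placeholder allocations already made
}

// failOnReservationTimeout releases placeholders AND pending asks, then
// moves the app to a terminal state so the shim can react on its side.
func (a *application) failOnReservationTimeout() {
	for _, ph := range a.placeholders {
		fmt.Println("releasing placeholder allocation:", ph)
	}
	a.placeholders = nil
	// The asks must be removed too, otherwise they keep resources pending in the queue.
	for _, ask := range a.pendingAsks {
		fmt.Println("removing pending AllocationAsk:", ask)
	}
	a.pendingAsks = nil
	a.state = "Failed" // or a dedicated terminal state, as suggested above
}

func main() {
	app := &application{state: "Reserving", pendingAsks: []string{"ask-0", "ask-1"}}
	app.failOnReservationTimeout()
	fmt.Println("state after timeout:", app.state)
}
{code}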

> Handle app reservation timeout
> --
>
> Key: YUNIKORN-460
> URL: https://issues.apache.org/jira/browse/YUNIKORN-460
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Kinga Marton
>Priority: Major
>
> When an app is configured with a timeout, that determines the maximum time 
> permitted to stay in the Reserving phase. If that times out, then all the 
> existing placeholders should be deleted and the application will be scheduled 
> normally. This timeout is needed because otherwise an app’s partial 
> placeholders may occupy cluster resources and they are wasted.
> See more in [this 
> doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-521) Placeholder pods are not cleaned up even when the job is deleted

2021-02-02 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277211#comment-17277211
 ] 

Kinga Marton commented on YUNIKORN-521:
---

[~ayubpathan], right now in Yunikorn we don't have the Job definition. We just 
listen for pod creation and use the configured ApplicationId to match the pods 
to an application, so we can have different pods belonging to different jobs 
but still being part of the same application, if we pass the ID accordingly.

Creating a one-to-one mapping between the Job and the internal application 
would break this functionality.

Right now in this case the application will transition into the Waiting state 
if no new pods are attached to it, and then into the Completed state. At this 
point the placeholders will be cleaned up. [~ayubpathan], [~wwei] do you think 
this solution is acceptable, or should we find another way to delete the 
placeholders when the job is deleted?

The total time the unused placeholders will occupy resources after the job 
finishes is 30 seconds (this is the waiting timeout). I think this is 
acceptable.

> Placeholder pods are not cleaned up even when the job is deleted
> 
>
> Key: YUNIKORN-521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-521
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
> Attachments: job.yaml, ns.yaml
>
>
> This one is a negative test...
> * Create a namespace with quota
> * Submit a job where the placeholder pods resource requests are more than 
> queue quota.
> * Delete the job using kubectl
> * Still the placeholder pods are in running state occupying the resources.
> From an end user perspective, each job is an application consisting of all 
> related pods. If the user decides to purge the job, Yunikorn should also 
> recognize this action and clean up the placeholder pods.
> From a yunikorn point of view, the application and job are 2 different 
> entities. The placeholder pods are not cleaned up because the application is 
> still alive even though the job is deleted. Does it make sense to create a 
> one on one mapping for job and application? Once the lifecycle of job is 
> complete, application should also terminate in Yunikorn world. Let me know 
> your thoughts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-521) Placeholder pods are not cleaned up even when the job is deleted

2021-02-02 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277211#comment-17277211
 ] 

Kinga Marton edited comment on YUNIKORN-521 at 2/2/21, 3:47 PM:


[~ayubpathan], right now in Yunikorn we don't have the Job definition. We just 
listen for pod creation and use the configured ApplicationId to match the pods 
to an application, so we can have different pods belonging to different jobs 
but still being part of the same application, if we pass the ID accordingly.

Creating a one-to-one mapping between the Job and the internal application 
would break this functionality.

Right now in this case the application will transition into the Waiting state 
if no new pods are attached to it, and then into the Completed state. At this 
point the placeholders will be cleaned up. [~ayubpathan], [~wwei] do you think 
this solution is acceptable, or should we find another way to delete the 
placeholders when the job is deleted?

The total time the unused placeholders will occupy resources after the job 
finishes is 40 seconds (waiting timeout + clean interval). I think this is 
acceptable.


was (Author: kmarton):
[~ayubpathan], right now in Yunikorn we don't have the Job definition. We just 
listen for pod creation and use the configured ApplicationId to match the pods 
to an application, so we can have different pods belonging to different jobs 
but still being part of the same application, if we pass the ID accordingly.

Creating a one-to-one mapping between the Job and the internal application 
would break this functionality.

Right now in this case the application will transition into the Waiting state 
if no new pods are attached to it, and then into the Completed state. At this 
point the placeholders will be cleaned up. [~ayubpathan], [~wwei] do you think 
this solution is acceptable, or should we find another way to delete the 
placeholders when the job is deleted?

The total time the unused placeholders will occupy resources after the job 
finishes is 30 seconds (this is the waiting timeout). I think this is 
acceptable.

> Placeholder pods are not cleaned up even when the job is deleted
> 
>
> Key: YUNIKORN-521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-521
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
> Attachments: job.yaml, ns.yaml
>
>
> This one is a negative test...
> * Create a namespace with quota
> * Submit a job where the placeholder pods resource requests are more than 
> queue quota.
> * Delete the job using kubectl
> * Still the placeholder pods are in running state occupying the resources.
> From an end user perspective, each job is an application consisting of all 
> related pods. If the user decides to purge the job, Yunikorn should also 
> recognize this action and clean up the placeholder pods.
> From a yunikorn point of view, the application and job are 2 different 
> entities. The placeholder pods are not cleaned up because the application is 
> still alive even though the job is deleted. Does it make sense to create a 
> one on one mapping for job and application? Once the lifecycle of job is 
> complete, application should also terminate in Yunikorn world. Let me know 
> your thoughts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-510) Remove the sleep in placeholder manager stop function

2021-02-03 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277925#comment-17277925
 ] 

Kinga Marton commented on YUNIKORN-510:
---

Thank you [~Huang Ting Yao] for addressing this! I merged your changes to both 
branch-0.10 and master.

> Remove the sleep in placeholder manager stop function
> -
>
> Key: YUNIKORN-510
> URL: https://issues.apache.org/jira/browse/YUNIKORN-510
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Minor
>  Labels: pull-request-available
>
> There is a 3s sleep in the stop function of placeholder manager, per Tingyao:
> "When we send the struct{}{} to stopChan, the Start() might not set Running 
> to false immediately. Or we can move sleep to 
> TestPlaceholderManagerStartStop(), which is located in 
> placeholder_manager_test.go."
> We should remove this from the stop function and move this to the UT code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-510) Remove the sleep in placeholder manager stop function

2021-02-03 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton updated YUNIKORN-510:
--
Fix Version/s: 0.10

> Remove the sleep in placeholder manager stop function
> -
>
> Key: YUNIKORN-510
> URL: https://issues.apache.org/jira/browse/YUNIKORN-510
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> There is a 3s sleep in the stop function of placeholder manager, per Tingyao:
> "When we send the struct{}{} to stopChan, the Start() might not set Running 
> to false immediately. Or we can move sleep to 
> TestPlaceholderManagerStartStop(), which is located in 
> placeholder_manager_test.go."
> We should remove this from the stop function and move this to the UT code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-510) Remove the sleep in placeholder manager stop function

2021-02-03 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-510.
---
Resolution: Fixed

> Remove the sleep in placeholder manager stop function
> -
>
> Key: YUNIKORN-510
> URL: https://issues.apache.org/jira/browse/YUNIKORN-510
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Ting Yao,Huang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> There is a 3s sleep in the stop function of placeholder manager, per Tingyao:
> "When we send the struct{}{} to stopChan, the Start() might not set Running 
> to false immediately. Or we can move sleep to 
> TestPlaceholderManagerStartStop(), which is located in 
> placeholder_manager_test.go."
> We should remove this from the stop function and move this to the UT code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-537) log spam in CalculateAbsUsedCapacity

2021-02-03 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277956#comment-17277956
 ] 

Kinga Marton commented on YUNIKORN-537:
---

Thank you [~wilfreds] for addressing this issue! I merged your changes to both 
branch-0.10 and master

> log spam in CalculateAbsUsedCapacity
> 
>
> Key: YUNIKORN-537
> URL: https://issues.apache.org/jira/browse/YUNIKORN-537
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> The log gets spammed by a call in the Resource object of the core:
> {code:java}
> 2021-02-02T08:55:32.797Z  WARNresources/resources.go:817  Cannot 
> calculate absolute capacity because of missing capacity or usage
> 2021-02-02T08:55:32.798Z  WARNresources/resources.go:817  Cannot 
> calculate absolute capacity because of missing capacity or usage
> 2021-02-02T08:55:34.862Z  WARNresources/resources.go:817  Cannot 
> calculate absolute capacity because of missing capacity or usage {code}
> This is linked to a REST call and should not be logged as a warning but at a 
> debug level at best.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-537) log spam in CalculateAbsUsedCapacity

2021-02-03 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton resolved YUNIKORN-537.
---
Fix Version/s: 0.10
   Resolution: Fixed

> log spam in CalculateAbsUsedCapacity
> 
>
> Key: YUNIKORN-537
> URL: https://issues.apache.org/jira/browse/YUNIKORN-537
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> The log gets spammed by a call in the Resource object of the core:
> {code:java}
> 2021-02-02T08:55:32.797Z  WARNresources/resources.go:817  Cannot 
> calculate absolute capacity because of missing capacity or usage
> 2021-02-02T08:55:32.798Z  WARNresources/resources.go:817  Cannot 
> calculate absolute capacity because of missing capacity or usage
> 2021-02-02T08:55:34.862Z  WARNresources/resources.go:817  Cannot 
> calculate absolute capacity because of missing capacity or usage {code}
> This is linked to a REST call and should not be logged as a warning but at a 
> debug level at best.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Assigned] (YUNIKORN-536) Add resource requests/limits for the admission-controller

2021-02-03 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-536:
-

Assignee: Weiwei Yang

> Add resource requests/limits for the admission-controller
> -
>
> Key: YUNIKORN-536
> URL: https://issues.apache.org/jira/browse/YUNIKORN-536
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: shim - kubernetes
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: pull-request-available
>
> We need to specify resource requests/limits for the admission-controller. 
> Starting the pod in the best-effort QoS class could possibly cause issues 
> when the node is under heavy load. That can slow down the admission-controller 
> and subsequently the api-server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Reopened] (YUNIKORN-513) ApplicationState remains in Accepted after recovery

2021-02-03 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reopened YUNIKORN-513:
---

> ApplicationState remains in Accepted after recovery
> ---
>
> Key: YUNIKORN-513
> URL: https://issues.apache.org/jira/browse/YUNIKORN-513
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - cache
>Affects Versions: 0.10
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> Steps to reproduce:
>  * Start 2 sleep jobs
>  * Wait for both to run and applicationState to be Running
>  * Kill yunikorn
>  * After 10 minutes, the rest call now shows both applicationState as 
> accepted instead of running



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-513) ApplicationState remains in Accepted after recovery

2021-02-03 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278024#comment-17278024
 ] 

Kinga Marton commented on YUNIKORN-513:
---

[~rozhang], the steps during recovery are the following:
 * When the application is created on the core side, it starts in the New state.
 * During node recovery we recover the existing allocations as well.
 * When the first allocation is recovered, the application transitions into the 
Starting state through the Accepted one. So if there is only one allocation to 
recover, the application stays in the Starting state until it times out and 
auto-progresses into Running.
 * When the second allocation is recovered, the application transitions into 
the Running state.

Actually the expected behaviour is exactly the same as for normal app and task 
submission. So a 2-allocation application (if both pods are in running state 
when the recovery happens) is expected to be in the Running state.
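
As a rough illustration only (a simplified stand-in, not the real core state 
machine), the transitions described above look roughly like this:

{code:go}
package main

import "fmt"

// app is a simplified stand-in for the core-side application object.
type app struct {
	state       string
	allocations int
}

// recoverAllocation models the state change triggered by each recovered allocation.
func (a *app) recoverAllocation() {
	a.allocations++
	if a.allocations == 1 {
		a.state = "Starting" // passes through Accepted on the first allocation
	} else {
		a.state = "Running"
	}
}

// startingTimeout models the auto-progression from Starting to Running when
// no further allocation is recovered within the timeout.
func (a *app) startingTimeout() {
	if a.state == "Starting" {
		a.state = "Running"
	}
}

func main() {
	// Two-allocation app: both pods were running before the restart.
	a := &app{state: "New"}
	a.recoverAllocation()
	a.recoverAllocation()
	fmt.Println(a.state) // Running

	// Single-allocation app: stays in Starting until the timeout fires.
	b := &app{state: "New"}
	b.recoverAllocation()
	b.startingTimeout()
	fmt.Println(b.state) // Running
}
{code}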

However, I checked it now and it seems to be broken again. I think the locking 
now has some issues. I suspect a deadlock, so I will reopen this issue.

[~rozhang] have you encountered any issues as well?

> ApplicationState remains in Accepted after recovery
> ---
>
> Key: YUNIKORN-513
> URL: https://issues.apache.org/jira/browse/YUNIKORN-513
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - cache
>Affects Versions: 0.10
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> Steps to reproduce:
>  * Start 2 sleep jobs
>  * Wait for both to run and applicationState to be Running
>  * Kill yunikorn
>  * After 10 minutes, the rest call now shows both applicationState as 
> accepted instead of running



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-538) Fix node recovery

2021-02-03 Thread Kinga Marton (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton updated YUNIKORN-538:
--
Target Version: 0.10

> Fix node recovery
> -
>
> Key: YUNIKORN-538
> URL: https://issues.apache.org/jira/browse/YUNIKORN-538
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Kinga Marton
>Priority: Major
>
> After the changes made in YUNIKORN-518 node recovery is broken.
> Actually, the nodes are not recovered. The first commit when the issue is 
> observed is the following one: 
> [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-538) Fix node recovery

2021-02-03 Thread Kinga Marton (Jira)
Kinga Marton created YUNIKORN-538:
-

 Summary: Fix node recovery
 Key: YUNIKORN-538
 URL: https://issues.apache.org/jira/browse/YUNIKORN-538
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Reporter: Kinga Marton


After the changes made in YUNIKORN-518 node recovery is broken.

Actually, the nodes are not recovered. The first commit when the issue is 
observed is the following one: 
[https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-513) ApplicationState remains in Accepted after recovery

2021-02-03 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278105#comment-17278105
 ] 

Kinga Marton commented on YUNIKORN-513:
---

So far I found an issue that is causing some trouble during recovery: 
YUNIKORN-538. I hope that fixing that one will solve the issue with the 
application states as well.

> ApplicationState remains in Accepted after recovery
> ---
>
> Key: YUNIKORN-513
> URL: https://issues.apache.org/jira/browse/YUNIKORN-513
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - cache
>Affects Versions: 0.10
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10
>
>
> Steps to reproduce:
>  * Start 2 sleep jobs
>  * Wait for both to run and applicationState to be Running
>  * Kill yunikorn
>  * After 10 minutes, the rest call now shows both applicationState as 
> accepted instead of running



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-538) Fix node recovery

2021-02-03 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278183#comment-17278183
 ] 

Kinga Marton commented on YUNIKORN-538:
---

I think the order of starting the services matters here: the issue is probably 
caused by starting the apiFactory before the appmanagers.

 

> Fix node recovery
> -
>
> Key: YUNIKORN-538
> URL: https://issues.apache.org/jira/browse/YUNIKORN-538
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Kinga Marton
>Priority: Major
>
> After the changes made in YUNIKORN-518 node recovery is broken.
> Actually, the nodes are not recovered. The first commit when the issue is 
> observed is the following one: 
> [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-538) Scheduler is unable to recovery from a restart

2021-02-04 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278676#comment-17278676
 ] 

Kinga Marton commented on YUNIKORN-538:
---

Thanks [~wwei]! I tested your changes and they solve the recovery issue: both 
the nodes and the applications are recovered as expected. The applications also 
have the correct state after recovery.

> Scheduler is unable to recovery from a restart 
> ---
>
> Key: YUNIKORN-538
> URL: https://issues.apache.org/jira/browse/YUNIKORN-538
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Kinga Marton
>Assignee: Weiwei Yang
>Priority: Blocker
>  Labels: pull-request-available
>
> After the changes made in YUNIKORN-518 node recovery is broken.
> Actually, the nodes are not recovered. The first commit when the issue is 
> observed is the following one: 
> [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-538) Scheduler is unable to recovery from a restart

2021-02-04 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278676#comment-17278676
 ] 

Kinga Marton edited comment on YUNIKORN-538 at 2/4/21, 9:02 AM:


Thanks [~wwei]! I tested your changes and they solve the recovery issue: both 
the nodes and the applications are recovered as expected. The applications also 
have the correct state after recovery, *but* the queue name is not filled.


was (Author: kmarton):
Thanks [~wwei]! I tested your changes and they solve the recovery issue: both 
the nodes and the applications are recovered as expected. The applications also 
have the correct state after recovery.

> Scheduler is unable to recovery from a restart 
> ---
>
> Key: YUNIKORN-538
> URL: https://issues.apache.org/jira/browse/YUNIKORN-538
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Kinga Marton
>Assignee: Weiwei Yang
>Priority: Blocker
>  Labels: pull-request-available
>
> After the changes made in YUNIKORN-518 node recovery is broken.
> Actually, the nodes are not recovered. The first commit when the issue is 
> observed is the following one: 
> [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-538) Scheduler is unable to recovery from a restart

2021-02-04 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278676#comment-17278676
 ] 

Kinga Marton edited comment on YUNIKORN-538 at 2/4/21, 9:02 AM:


Thanks [~wwei]! I tested your changes and they solve the recovery issue: both 
the nodes and the applications are recovered as expected. The applications also 
have the correct state after recovery, *but* the queue name in the allocation 
is not filled.


was (Author: kmarton):
Thanks [~wwei]! I tested your changes and it solves the recovery issue: both 
the nodes and the applications are recovered as expected. Also the applications 
will have the correct state after recovery, *but* the queue name is not filled. 

> Scheduler is unable to recovery from a restart 
> ---
>
> Key: YUNIKORN-538
> URL: https://issues.apache.org/jira/browse/YUNIKORN-538
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Kinga Marton
>Assignee: Weiwei Yang
>Priority: Blocker
>  Labels: pull-request-available
>
> After the changes made in YUNIKORN-518 node recovery is broken.
> Actually, the nodes are not recovered. The first commit when the issue is 
> observed is the following one: 
> [https://github.com/apache/incubator-yunikorn-k8shim/commit/4d0b887cb13619247503544d7f4e5c1672b6f291]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Commented] (YUNIKORN-521) Placeholder pods are not cleaned up even when the job is deleted

2021-02-04 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278785#comment-17278785
 ] 

Kinga Marton commented on YUNIKORN-521:
---

I tested this scenario today and I can see multiple issues here:
 * if the queue quota is lower than the total gang resource, the application 
should be rejected instead of being accepted with only a part of the 
placeholders scheduled. This issue will be addressed in YUNIKORN-520.
 * I ran into some deadlocks while debugging it: when removing an allocation 
ask and also when recovering an allocation. We will need to fix this as well.

cc [~wwei], [~wilfreds]

> Placeholder pods are not cleaned up even when the job is deleted
> 
>
> Key: YUNIKORN-521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-521
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
> Attachments: job.yaml, ns.yaml
>
>
> This one is a negative test...
> * Create a namespace with quota
> * Submit a job where the placeholder pods resource requests are more than 
> queue quota.
> * Delete the job using kubectl
> * Still the placeholder pods are in running state occupying the resources.
> From an end user perspective, each job is an application consisting of all 
> related pods. If the user decides to purge the job, Yunikorn should also 
> recognize this action and clean up the placeholder pods.
> From a yunikorn point of view, the application and job are 2 different 
> entities. The placeholder pods are not cleaned up because the application is 
> still alive even though the job is deleted. Does it make sense to create a 
> one on one mapping for job and application? Once the lifecycle of job is 
> complete, application should also terminate in Yunikorn world. Let me know 
> your thoughts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Comment Edited] (YUNIKORN-521) Placeholder pods are not cleaned up even when the job is deleted

2021-02-04 Thread Kinga Marton (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278785#comment-17278785
 ] 

Kinga Marton edited comment on YUNIKORN-521 at 2/4/21, 11:50 AM:
-

I tested this scenario today and I can see multiple issues here:
 * if the queue quota is lower than the total gang resource, the application 
should be rejected instead of being accepted with only a part of the 
placeholders scheduled. This issue will be addressed in YUNIKORN-520.
 * I ran into some deadlocks while debugging it: when removing an allocation 
ask (I deleted the placeholder manually) and also when recovering an 
allocation. We will need to fix this as well.

cc [~wwei], [~wilfreds]


was (Author: kmarton):
I tested this scenario today and I can see multiple issues here:
 * if the queue quota is lower than the total gang resource, the application 
should be rejected instead of being accepted with only a part of the 
placeholders scheduled. This issue will be addressed in YUNIKORN-520.
 * I ran into some deadlocks while debugging it: when removing an allocation 
ask and also when recovering an allocation. We will need to fix this as well.

cc [~wwei], [~wilfreds]

> Placeholder pods are not cleaned up even when the job is deleted
> 
>
> Key: YUNIKORN-521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-521
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Ayub Pathan
>Assignee: Kinga Marton
>Priority: Major
> Attachments: job.yaml, ns.yaml
>
>
> This one is a negative test...
> * Create a namespace with quota
> * Submit a job where the placeholder pods resource requests are more than 
> queue quota.
> * Delete the job using kubectl
> * Still the placeholder pods are in running state occupying the resources.
> From an end user perspective, each job is an application consisting of all 
> related pods. If the user decides to purge the job, Yunikorn should also 
> recognize this action and clean up the placeholder pods.
> From a yunikorn point of view, the application and job are 2 different 
> entities. The placeholder pods are not cleaned up because the application is 
> still alive even though the job is deleted. Does it make sense to create a 
> one on one mapping for job and application? Once the lifecycle of job is 
> complete, application should also terminate in Yunikorn world. Let me know 
> your thoughts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org


