[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.
[ https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913777#comment-16913777 ]

Benjamin Mahler commented on MESOS-9806:
----------------------------------------

{noformat}
commit de90b2b3078e06975ab2061db821cfe7dda8
Author: Benjamin Mahler
Date: Thu Aug 22 17:41:28 2019 -0400

    Optimized Resources::shrink.

    Master:
    *HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
    Made 3500 allocations in 30.37 secs
    Made 0 allocation in 27.05 secs

    Master + this patch:
    *HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
    Made 3500 allocations in 24.15 secs
    Made 0 allocation in 20.48 secs

    Review: https://reviews.apache.org/r/71353
{noformat}
{noformat}
commit 05e5ca4b3446e34447f632463efe9a34b4bace7f
Author: Benjamin Mahler
Date: Thu Aug 22 17:42:57 2019 -0400

    Added ResourceQuantities::fromScalarResource.

    Master + previous patches:
    *HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
    Made 3500 allocations in 24.15 secs
    Made 0 allocation in 20.48 secs

    Master + previous patches + this patch:
    Master + this patch:
    *HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
    Made 3500 allocations in 23.37 secs
    Made 0 allocation in 19.72 secs

    Review: https://reviews.apache.org/r/71354
{noformat}

> Address allocator performance regression due to the removal of quota role sorter.
> ---------------------------------------------------------------------------------
>
> Key: MESOS-9806
> URL: https://issues.apache.org/jira/browse/MESOS-9806
> Project: Mesos
> Issue Type: Improvement
> Components: allocation
> Reporter: Meng Zhu
> Assignee: Meng Zhu
> Priority: Critical
> Labels: resource-management
>
> In MESOS-9802, we removed the quota role sorter, which was tech debt.
> However, this slows down the allocator. The problem is that in the first
> stage, even though a cluster might have no active roles with non-default
> quota, the allocator now has to sort and go through every role in the
> cluster. Benchmark results show that for 1k roles with 2k frameworks,
> the allocator could experience ~50% performance degradation.
>
> There are a couple of ways to address this issue. For example, we could make
> the sorter aware of quota and add a method, say `sortQuotaRoles`, that returns
> all the roles with non-default quota. Alternatively, an even better approach
> would be to deprecate the sorter concept and just have two standalone
> functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree
> structure (which does not yet exist in the allocator) and return the sorted roles.

--
This message was sent by Atlassian Jira (v8.3.2#803003)
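The "two standalone functions" idea from the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the allocator's actual code: the `Role` struct, its `share` sort key, the flat `std::vector<Role>` standing in for the role tree, and the helper `sortByShare` are all hypothetical. The point it shows is that `sortQuotaRoles()` only sorts and returns roles with non-default quota, so a cluster without such roles does no sorting work in the quota stage.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a node of the allocator's role tree.
struct Role {
  std::string name;
  double share;   // Hypothetical sort key (e.g. a fair-share value).
  bool hasQuota;  // True if the role has a non-default quota.
};

// Hypothetical helper: orders roles by ascending share, ties broken by name.
inline void sortByShare(std::vector<const Role*>& roles) {
  std::sort(roles.begin(), roles.end(), [](const Role* a, const Role* b) {
    if (a->share != b->share) {
      return a->share < b->share;
    }
    return a->name < b->name;
  });
}

// Returns every role in sorted order.
std::vector<const Role*> sortRoles(const std::vector<Role>& tree) {
  std::vector<const Role*> result;
  result.reserve(tree.size());
  for (const Role& role : tree) {
    result.push_back(&role);
  }
  sortByShare(result);
  return result;
}

// Returns only the roles with non-default quota, in sorted order.
// With no quota roles the result is empty and nothing is sorted.
std::vector<const Role*> sortQuotaRoles(const std::vector<Role>& tree) {
  std::vector<const Role*> result;
  for (const Role& role : tree) {
    if (role.hasQuota) {
      result.push_back(&role);
    }
  }
  sortByShare(result);
  return result;
}
```

The returned pointers refer to elements of the vector passed in, so that vector must outlive the result. A real role tree would presumably also maintain the set of quota roles incrementally rather than filtering on every call.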
[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.
[ https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913709#comment-16913709 ]

Benjamin Mahler commented on MESOS-9806:
----------------------------------------

{noformat}
commit b6c87d7c44346b2497ace65b1d2060ee423aa772
Author: Benjamin Mahler
Date: Wed Aug 21 20:10:42 2019 -0400

    Eliminated double lookups in the allocator.

    Review: https://reviews.apache.org/r/71345
{noformat}
{noformat}
commit 790c4e72e1460035b13bf27f2cb8999709e9767e
Author: Benjamin Mahler
Date: Wed Aug 21 20:11:31 2019 -0400

    Avoid duplicate allocatableTo call in the allocator.

    Review: https://reviews.apache.org/r/71346
{noformat}
[jira] [Assigned] (MESOS-9949) Track allocated/offered in the allocator's role tree.
[ https://issues.apache.org/jira/browse/MESOS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrei Sekretenko reassigned MESOS-9949:
---------------------------------------

Assignee: Andrei Sekretenko

> Track allocated/offered in the allocator's role tree.
> -----------------------------------------------------
>
> Key: MESOS-9949
> URL: https://issues.apache.org/jira/browse/MESOS-9949
> Project: Mesos
> Issue Type: Task
> Components: allocation, master
> Reporter: Benjamin Mahler
> Assignee: Andrei Sekretenko
> Priority: Major
> Labels: resource-management
>
> Currently the allocator's role tree only tracks the reserved resources for
> each role subtree. For metrics purposes, it would be ideal to track offered /
> allocated as well.
>
> This requires augmenting the allocator's structs and recoverResources to hold
> the two categories independently and transition from offered -> allocated as
> applicable when recovering resources. This might require a slight change to
> the recoverResources interface.
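One way to picture the bookkeeping described above is the sketch below. It is a minimal illustration under assumptions, not the allocator's real data structures: `Quantities`, `RoleUsage`, `offer`, `accept`, and the extra `wasAllocated` flag on `recover` are hypothetical names, and a plain map of scalar amounts stands in for Mesos resources. It keeps offered and allocated amounts separately, moves amounts from offered to allocated when an offer is accepted, and lets recovery say which category the amounts leave.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical scalar amounts keyed by resource name (e.g. "cpus", "mem").
using Quantities = std::map<std::string, double>;

// Hypothetical per-role record holding the two categories independently.
struct RoleUsage {
  Quantities offered;
  Quantities allocated;
};

void add(Quantities& to, const Quantities& amounts) {
  for (const auto& entry : amounts) {
    to[entry.first] += entry.second;
  }
}

// Subtracts `amounts`, dropping entries that reach zero. The caller must
// not subtract more than is tracked; this is checked with assert.
void subtract(Quantities& from, const Quantities& amounts) {
  for (const auto& entry : amounts) {
    auto it = from.find(entry.first);
    assert(it != from.end() && it->second >= entry.second);
    it->second -= entry.second;
    if (it->second == 0.0) {
      from.erase(it);
    }
  }
}

// Resources were offered to a framework in this role.
void offer(RoleUsage& usage, const Quantities& amounts) {
  add(usage.offered, amounts);
}

// The framework used part of an offer: transition offered -> allocated.
void accept(RoleUsage& usage, const Quantities& amounts) {
  subtract(usage.offered, amounts);
  add(usage.allocated, amounts);
}

// Resources returned to the allocator. The `wasAllocated` flag is the kind
// of interface change the ticket hints at; it is an assumption here.
void recover(RoleUsage& usage, const Quantities& amounts, bool wasAllocated) {
  subtract(wasAllocated ? usage.allocated : usage.offered, amounts);
}
```

For example, offering 4 cpus, accepting 3, and then recovering the unused 1 as "offered" leaves 3 cpus tracked as allocated and nothing tracked as offered.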
[jira] [Commented] (MESOS-9844) Update documentation describing `containerizer/debug` endpoint.
[ https://issues.apache.org/jira/browse/MESOS-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913441#comment-16913441 ]

Andrei Budnik commented on MESOS-9844:
--------------------------------------

http://mesos.apache.org/documentation/latest/endpoints/slave/containerizer/debug/

> Update documentation describing `containerizer/debug` endpoint.
> ---------------------------------------------------------------
>
> Key: MESOS-9844
> URL: https://issues.apache.org/jira/browse/MESOS-9844
> Project: Mesos
> Issue Type: Documentation
> Components: containerization
> Reporter: Andrei Budnik
> Assignee: Andrei Budnik
> Priority: Major
> Labels: containerization
>
[jira] [Created] (MESOS-9952) ExampleTest.DiskFullFramework is slow
Benjamin Bannier created MESOS-9952:
-----------------------------------

Summary: ExampleTest.DiskFullFramework is slow
Key: MESOS-9952
URL: https://issues.apache.org/jira/browse/MESOS-9952
Project: Mesos
Issue Type: Bug
Components: test
Reporter: Benjamin Bannier
Assignee: Benjamin Bannier

Executing {{ExampleTest.DiskFullFramework}} on my setup takes almost 18s in a non-optimized build. This is far too long for a default-enabled test.
[jira] [Created] (MESOS-9951) A likely STW problem in master'gc routine
longfei created MESOS-9951:
--------------------------

Summary: A likely STW problem in master'gc routine
Key: MESOS-9951
URL: https://issues.apache.org/jira/browse/MESOS-9951
Project: Mesos
Issue Type: Bug
Reporter: longfei
Attachments: image-2019-08-22-14-00-16-298.png

I'm using a 1.7.3 master, which recently seemed to stop for about half a minute.

{code:java}
//
I0820 20:53:56.705075 4185864 registrar.cpp:487] Applied 1 operations in 1.163968ms; attempting to update the registry
I0820 20:53:56.705541 4185861 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 353
I0820 20:53:56.705739 4185875 replica.cpp:541] Replica received write request for position 353 from __req_res__(568)@10.10.23.74:5050
I0820 20:53:56.721997 4185859 master.cpp:8753] Executor 'mt:l004115106217:1' of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24): exited with status 0
I0820 20:53:56.722085 4185859 master.cpp:11215] Removing executor 'mt:l004115106217:1' with resources [] of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24)
I0820 20:53:56.742550 4185877 replica.cpp:695] Replica received learned notice for position 353 from log-network(1)@10.10.23.74:5050
I0820 20:53:56.784256 4185881 registrar.cpp:544] Successfully updated the registry in 79.105792ms
I0820 20:53:56.784489 4185857 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 354
I0820 20:53:56.784641 4185890 replica.cpp:541] Replica received write request for position 354 from __req_res__(571)@10.10.23.74:5050
I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice for position 354 from log-network(1)@10.10.23.74:5050
I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable and 0 gone agents from the registry
I0820 20:54:34.798610 4185864 master.cpp:8510] Status update TASK_FINISHED (Status UUID: 6304aa62-2854-4d46-ad09-ffbf3347f24b) for task mt:l004115107127:1 of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 from agent bd5550a6-4089-482d-aa96-3389bae5b0de-S138 at slave(1)@10.17.44.133:5051 (10.17.44.133)
{code}

Note that no log entries were produced between 20:53:56 and 20:54:34. atop shows that one core (used by the master) was fully utilized during the STW period.

!image-2019-08-22-14-00-16-298.png!
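A pause like this can be located mechanically by scanning the timestamps at the start of each log entry. The sketch below is a rough illustration, not a Mesos tool: `secondsOfDay` and `largestGap` are hypothetical helper names, it assumes the line prefix format visible in the report (a severity letter, a four-digit date, then `HH:MM:SS.micros`), and it ignores logs that cross midnight. For the entries in the report, the gap between 20:53:56.825901 and 20:54:34.798512 is about 38 seconds.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Parses the time of day from a line prefix such as
// "I0820 20:53:56.705075 ..." and returns seconds since midnight.
// Returns a negative value if the line does not start with that prefix.
double secondsOfDay(const std::string& line) {
  char level = 0;
  int date = 0;
  int hours = 0;
  int minutes = 0;
  double seconds = 0.0;
  const int parsed = std::sscanf(
      line.c_str(), "%c%4d %d:%d:%lf",
      &level, &date, &hours, &minutes, &seconds);
  if (parsed != 5) {
    return -1.0;
  }
  return hours * 3600.0 + minutes * 60.0 + seconds;
}

// Returns the largest gap, in seconds, between consecutive timestamped
// lines. Lines without a timestamp prefix are skipped.
double largestGap(const std::vector<std::string>& lines) {
  double previous = -1.0;
  double best = 0.0;
  for (const std::string& line : lines) {
    const double current = secondsOfDay(line);
    if (current < 0.0) {
      continue;
    }
    if (previous >= 0.0 && current - previous > best) {
      best = current - previous;
    }
    previous = current;
  }
  return best;
}
```

Feeding the master log line by line into `largestGap` gives a quick way to confirm how long the process was silent before correlating the window with CPU profiles.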