[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.

2019-08-22 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913777#comment-16913777
 ] 

Benjamin Mahler commented on MESOS-9806:


{noformat}
commit de90b2b3078e06975ab2061db821cfe7dda8
Author: Benjamin Mahler 
Date:   Thu Aug 22 17:41:28 2019 -0400

Optimized Resources::shrink.

Master:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 30.37 secs
Made 0 allocation in 27.05 secs

Master + this patch:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 24.15 secs
Made 0 allocation in 20.48 secs

Review: https://reviews.apache.org/r/71353
{noformat}

{noformat}
commit 05e5ca4b3446e34447f632463efe9a34b4bace7f
Author: Benjamin Mahler 
Date:   Thu Aug 22 17:42:57 2019 -0400

Added ResourceQuantities::fromScalarResource.

Master + previous patches:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 24.15 secs
Made 0 allocation in 20.48 secs

Master + previous patches + this patch:
*HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
Made 3500 allocations in 23.37 secs
Made 0 allocation in 19.72 secs

Review: https://reviews.apache.org/r/71354
{noformat}

> Address allocator performance regression due to the removal of quota role 
> sorter.
> -
>
> Key: MESOS-9806
> URL: https://issues.apache.org/jira/browse/MESOS-9806
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: resource-management
>
> In MESOS-9802, we removed the quota role sorter, which was tech debt.
> However, this slows down the allocator. The problem is that in the first 
> stage, even though a cluster might have no active roles with non-default 
> quota, the allocator now has to sort and go through every role 
> in the cluster. Benchmark results show that for 1k roles with 2k frameworks, 
> the allocator could experience ~50% performance degradation.
> There are a couple of ways to address this issue. For example, we could make 
> the sorter aware of quota and add a method, say `sortQuotaRoles`, that returns 
> all the roles with non-default quota. Alternatively, an even better approach 
> would be to deprecate the sorter concept and just have two standalone 
> functions, e.g. sortRoles() and sortQuotaRoles(), that take in the role tree 
> structure (which does not yet exist in the allocator) and return the sorted roles.
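
For illustration only, the two standalone functions proposed in the description could look roughly like the sketch below. It is not the Mesos implementation: the `Role` struct, its `share` sort key, and the flat `std::vector` standing in for the role tree are all assumptions made for this example. The point is that `sortQuotaRoles` filters first, so a cluster with no quota'ed roles pays almost nothing in the first allocation stage.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a role tree node (not a Mesos type).
struct Role {
  std::string name;
  double share;   // Assumed sort key: lower share is allocated to first.
  bool hasQuota;  // True if the role has a non-default quota.
};

// Orders roles by share, breaking ties by name for determinism.
bool byShare(const Role& a, const Role& b) {
  if (a.share != b.share) return a.share < b.share;
  return a.name < b.name;
}

// Returns every role, sorted.
std::vector<Role> sortRoles(std::vector<Role> roles) {
  std::sort(roles.begin(), roles.end(), byShare);
  assert(std::is_sorted(roles.begin(), roles.end(), byShare));
  return roles;
}

// Returns only the roles with non-default quota, sorted the same way.
// Filtering before sorting keeps the cost proportional to the number of
// quota'ed roles rather than to the number of roles in the cluster.
std::vector<Role> sortQuotaRoles(const std::vector<Role>& roles) {
  std::vector<Role> quotaRoles;
  std::copy_if(roles.begin(), roles.end(), std::back_inserter(quotaRoles),
               [](const Role& role) { return role.hasQuota; });
  return sortRoles(std::move(quotaRoles));
}
```

With roles `a` (share 0.5, no quota), `b` (share 0.1, quota) and `c` (share 0.3, quota), `sortQuotaRoles` yields `b, c` and `sortRoles` yields `b, c, a`.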



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9806) Address allocator performance regression due to the removal of quota role sorter.

2019-08-22 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913709#comment-16913709
 ] 

Benjamin Mahler commented on MESOS-9806:


{noformat}
commit b6c87d7c44346b2497ace65b1d2060ee423aa772
Author: Benjamin Mahler 
Date:   Wed Aug 21 20:10:42 2019 -0400

Eliminated double lookups in the allocator.

Review: https://reviews.apache.org/r/71345
{noformat}

{noformat}
commit 790c4e72e1460035b13bf27f2cb8999709e9767e
Author: Benjamin Mahler 
Date:   Wed Aug 21 20:11:31 2019 -0400

Avoid duplicate allocatableTo call in the allocator.

Review: https://reviews.apache.org/r/71346
{noformat}






[jira] [Assigned] (MESOS-9949) Track allocated/offered in the allocator's role tree.

2019-08-22 Thread Andrei Sekretenko (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Sekretenko reassigned MESOS-9949:


Assignee: Andrei Sekretenko

> Track allocated/offered in the allocator's role tree.
> -
>
> Key: MESOS-9949
> URL: https://issues.apache.org/jira/browse/MESOS-9949
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, master
>Reporter: Benjamin Mahler
>Assignee: Andrei Sekretenko
>Priority: Major
>  Labels: resource-management
>
> Currently the allocator's role tree only tracks the reserved resources for 
> each role subtree. For metrics purposes, it would be ideal to track offered / 
> allocated as well.
> This requires augmenting the allocator's structs and recoverResources to hold 
> the two categories independently and transition from offered -> allocated as 
> applicable when recovering resources. This might require a slight change to 
> the recoverResources interface.
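
A minimal sketch of the bookkeeping described above, under stated assumptions: `RoleNode`, `Quantities`, and the function names below are hypothetical and do not mirror the allocator's actual structs or the `recoverResources` signature. It only shows the two categories being held independently, with the accepted portion of an offer moving from offered to allocated when the offer is recovered.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical scalar quantities keyed by resource name (e.g. "cpus").
using Quantities = std::map<std::string, double>;

// Hypothetical per-role node; a real tree would also update ancestors.
struct RoleNode {
  Quantities reserved;
  Quantities offered;
  Quantities allocated;
};

void addTo(Quantities& to, const Quantities& delta) {
  for (const auto& kv : delta) to[kv.first] += kv.second;
}

void subtractFrom(Quantities& from, const Quantities& delta) {
  for (const auto& kv : delta) {
    auto it = from.find(kv.first);
    // Tracking must never go negative; that would indicate a bookkeeping bug.
    assert(it != from.end() && it->second >= kv.second);
    it->second -= kv.second;
  }
}

// An offer was sent out for this role.
void trackOffer(RoleNode& node, const Quantities& offered) {
  addTo(node.offered, offered);
}

// An offer came back: everything leaves `offered`, and the part the
// framework actually used (`accepted`) transitions to `allocated`.
void recoverOffered(RoleNode& node,
                    const Quantities& offered,
                    const Quantities& accepted) {
  subtractFrom(node.offered, offered);
  addTo(node.allocated, accepted);
}

// Previously allocated resources were released (e.g. a task finished).
void recoverAllocated(RoleNode& node, const Quantities& released) {
  subtractFrom(node.allocated, released);
}
```

Passing the accepted portion separately is one way the "slight change to the recoverResources interface" could surface; other designs are equally plausible.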





[jira] [Commented] (MESOS-9844) Update documentation describing `containerizer/debug` endpoint.

2019-08-22 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913441#comment-16913441
 ] 

Andrei Budnik commented on MESOS-9844:
--

http://mesos.apache.org/documentation/latest/endpoints/slave/containerizer/debug/

> Update documentation describing `containerizer/debug` endpoint.
> ---
>
> Key: MESOS-9844
> URL: https://issues.apache.org/jira/browse/MESOS-9844
> Project: Mesos
>  Issue Type: Documentation
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
>






[jira] [Created] (MESOS-9952) ExampleTest.DiskFullFramework is slow

2019-08-22 Thread Benjamin Bannier (Jira)
Benjamin Bannier created MESOS-9952:
---

 Summary: ExampleTest.DiskFullFramework is slow
 Key: MESOS-9952
 URL: https://issues.apache.org/jira/browse/MESOS-9952
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Bannier
Assignee: Benjamin Bannier


Executing {{ExampleTest.DiskFullFramework}} on my setup takes almost 18s in an 
unoptimized build. This is far too long for a default-enabled test.





[jira] [Created] (MESOS-9951) A likely STW problem in master's GC routine

2019-08-22 Thread longfei (Jira)
longfei created MESOS-9951:
--

 Summary: A likely STW problem in master's GC routine
 Key: MESOS-9951
 URL: https://issues.apache.org/jira/browse/MESOS-9951
 Project: Mesos
  Issue Type: Bug
Reporter: longfei
 Attachments: image-2019-08-22-14-00-16-298.png

I'm using a 1.7.3 master, which seemed to stop for half a minute recently.
{code:java}
I0820 20:53:56.705075 4185864 registrar.cpp:487] Applied 1 operations in 1.163968ms; attempting to update the registry
I0820 20:53:56.705541 4185861 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 353
I0820 20:53:56.705739 4185875 replica.cpp:541] Replica received write request for position 353 from __req_res__(568)@10.10.23.74:5050
I0820 20:53:56.721997 4185859 master.cpp:8753] Executor 'mt:l004115106217:1' of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24): exited with status 0
I0820 20:53:56.722085 4185859 master.cpp:11215] Removing executor 'mt:l004115106217:1' with resources [] of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24)
I0820 20:53:56.742550 4185877 replica.cpp:695] Replica received learned notice for position 353 from log-network(1)@10.10.23.74:5050
I0820 20:53:56.784256 4185881 registrar.cpp:544] Successfully updated the registry in 79.105792ms
I0820 20:53:56.784489 4185857 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 354
I0820 20:53:56.784641 4185890 replica.cpp:541] Replica received write request for position 354 from __req_res__(571)@10.10.23.74:5050
I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice for position 354 from log-network(1)@10.10.23.74:5050
I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable and 0 gone agents from the registry
I0820 20:54:34.798610 4185864 master.cpp:8510] Status update TASK_FINISHED (Status UUID: 6304aa62-2854-4d46-ad09-ffbf3347f24b) for task mt:l004115107127:1 of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 from agent bd5550a6-4089-482d-aa96-3389bae5b0de-S138 at slave(1)@10.17.44.133:5051 (10.17.44.133)
{code}
Note that no log lines were produced between 20:53:56 and 20:54:34.

atop shows that one core (used by the master) was fully utilized during the STW (stop-the-world) period.

!image-2019-08-22-14-00-16-298.png!
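
To quantify the silent window from the log excerpt, here is a small sketch that parses the glog-style timestamps and computes the gap between two lines. The helper names `glogMicrosOfDay` and `gapMicros` are made up for this example, and the code assumes both lines were logged on the same day.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Parses the "HH:MM:SS.uuuuuu" timestamp of a glog-style line such as
// "I0820 20:53:56.825901 ..." and returns microseconds since midnight,
// or -1 if the line does not match that layout.
long long glogMicrosOfDay(const std::string& line) {
  int hh = 0, mm = 0, ss = 0, us = 0;
  // "%*c%*4d" skips the severity letter and the MMDD date.
  const int parsed = std::sscanf(
      line.c_str(), "%*c%*4d %2d:%2d:%2d.%6d", &hh, &mm, &ss, &us);
  if (parsed != 4) return -1;
  return ((hh * 60LL + mm) * 60LL + ss) * 1000000LL + us;
}

// Returns the time between two log lines in microseconds.
// Assumes both lines are from the same day and `after` is not earlier.
long long gapMicros(const std::string& before, const std::string& after) {
  const long long a = glogMicrosOfDay(before);
  const long long b = glogMicrosOfDay(after);
  assert(a >= 0 && b >= a);
  return b - a;
}
```

Applied to the last line before the silence (20:53:56.825901) and the first line after it (20:54:34.798512), `gapMicros` returns 37972611, i.e. roughly 38 seconds.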


