[jira] [Commented] (MESOS-9896) Consider using protobuf provided json conversion facilities rather than custom ones.

2019-11-08 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970596#comment-16970596
 ] 

Benjamin Mahler commented on MESOS-9896:


Started a thread here on their mailing list:
[https://groups.google.com/forum/#!topic/protobuf/4qmUqGE5-oQ]

> Consider using protobuf provided json conversion facilities rather than 
> custom ones.
> 
>
> Key: MESOS-9896
> URL: https://issues.apache.org/jira/browse/MESOS-9896
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> Currently, stout provides custom JSON to protobuf conversion facilities, some 
> of which use protobuf reflection.
> When upgrading protobuf to 3.7.x in MESOS-9755, we found that the v0 /state 
> response of the master slowed down, and it appears to be due to a performance 
> regression in the protobuf reflection code.
> We should file an issue with protobuf, but we should also look into using the 
> json conversion code that protobuf provides to see if that can help avoid the 
> regression. It may be the case that using the built-in facilities actually 
> provides a significant performance benefit, given they don't use reflection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10026) Improve v1 operator API read performance.

2019-11-08 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970594#comment-16970594
 ] 

Benjamin Mahler commented on MESOS-10026:
-

Some preliminary numbers from a prototype 
https://github.com/bmahler/mesos/tree/bmahler_v1_operator_api_read_performance

{noformat}
Before:
v0 '/state' response took 6.549942141secs
v1 'master::call::GetState' application/x-protobuf response took 
24.081624381secs
v1 'master::call::GetState' application/json response took 22.760332466secs
{noformat}
{noformat}
After:
v0 '/state' response took 7.57313099secs
v1 'master::call::GetState' application/x-protobuf response took 5.240223816secs
v1 'master::call::GetState' application/json response took 1.76133347258333mins
{noformat}

However, as you can see, it turns out protobuf’s built-in json conversion is 
extremely slow, at least for going from serialized protobuf to serialized json 
(I haven’t run perf to see why). This means we can’t really use the built-in 
json facilities (see MESOS-9896), and we need two code paths: one doing 
direct protobuf serialization and one doing direct json serialization via 
jsonify. I implemented that and got the following:

{noformat}
After:
v0 '/state' response took 7.743768168secs
v1 'master::call::GetState' application/x-protobuf response took 5.640594663secs
v1 'master::call::GetState' application/json response took 11.795411549secs
{noformat}

> Improve v1 operator API read performance.
> -
>
> Key: MESOS-10026
> URL: https://issues.apache.org/jira/browse/MESOS-10026
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> Currently, the v1 operator API has poor performance relative to the v0 json 
> API. The following initial numbers were provided by [~Will Mahler] from our 
> state serving benchmark:
>  
> |OPTIMIZED - Master (baseline)| | | | |
> |Test setup|1000 agents with a total of 1 running tasks and 1 completed tasks|1 agents with a total of 10 running tasks and 10 completed tasks|2 agents with a total of 20 running tasks and 20 completed tasks|4 agents with a total of 40 running tasks and 40 completed tasks|
> |v0 'state' response|0.17|1.66|8.96|12.42|
> |v1 x-protobuf|0.35|3.21|9.47|19.09|
> |v1 json|0.45|4.72|10.81|31.43|
> There is quite a lot of variance, but v1 protobuf is consistently slower than 
> v0 (sometimes significantly so) and v1 json is consistently slower than v1 
> protobuf (sometimes significantly so).
> The reason that the v1 operator API is slower is that it does the following:
> (1) Construct a temporary unversioned state response object by copying the 
> in-memory unversioned state into an overall response object. (expensive!)
> (2) Evolve it to v1: serialize, then deserialize into the overall v1 state 
> object. (expensive!)
> (3) Serialize the overall v1 state object to protobuf or json.
> (4) Destruct the temporaries (expensive! but is done after response starts 
> serving)
> On the other hand, the v0 jsonify approach does the following:
> (1) Serialize the in-memory unversioned state into json, by traversing state 
> and accumulating the overall serialized json.
> This means that v1 has substantial overhead vs v0, and we need to remove it 
> to bring v1 on par with, or better than, v0. v1 should serialize directly to 
> json (straightforward with jsonify) or protobuf (this can be done via an 
> io::CodedOutputStream).
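
The difference between the two approaches described above can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not Mesos code: `Task`, `MasterState`, `Response`, `writeTask`, `serializeViaCopy`, and `serializeDirect` are hypothetical names, and plain string appends stand in for jsonify or an io::CodedOutputStream.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's in-memory state.
struct Task {
  std::string id;
  std::string state;
};

struct MasterState {
  std::vector<Task> tasks;
};

// Hypothetical stand-in for a response object built as a temporary copy.
struct Response {
  std::vector<Task> tasks;
};

// Appends one task as a JSON object.
// Assumption: ids and states contain no characters that need escaping.
inline void writeTask(const Task& task, std::string* out) {
  assert(out != nullptr);
  out->append("{\"id\":\"").append(task.id)
      .append("\",\"state\":\"").append(task.state).append("\"}");
}

// Copy-then-serialize: builds a full temporary before writing any output.
inline std::string serializeViaCopy(const MasterState& state) {
  Response response;
  response.tasks = state.tasks;  // Deep copy of every task.

  std::string out = "{\"tasks\":[";
  for (std::size_t i = 0; i < response.tasks.size(); ++i) {
    if (i > 0) out.push_back(',');
    writeTask(response.tasks[i], &out);
  }
  out.append("]}");
  return out;  // The temporary `response` is destructed here.
}

// Direct serialization: traverses the state once and writes as it goes.
inline std::string serializeDirect(const MasterState& state) {
  std::string out = "{\"tasks\":[";
  bool first = true;
  for (const Task& task : state.tasks) {
    if (!first) out.push_back(',');
    first = false;
    writeTask(task, &out);
  }
  out.append("]}");
  return out;
}
```

Both functions produce the same bytes; the direct version simply skips the copy and the destruction of the temporary, which is where the description above locates most of the v1 overhead.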





[jira] [Commented] (MESOS-10032) Mesos agent should sever proactively master connection when failing to detect the leading master

2019-11-08 Thread Xudong Ni (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970584#comment-16970584
 ] 

Xudong Ni commented on MESOS-10032:
---

https://reviews.apache.org/r/71742/

> Mesos agent should sever proactively master connection when failing to detect 
> the leading master
> 
>
> Key: MESOS-10032
> URL: https://issues.apache.org/jira/browse/MESOS-10032
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Major
>
> We have observed that this often happens when agents lose their ZK 
> connections, reset their master to None, and begin dropping messages from 
> the master because they can't verify that the messages are valid.
> This has caused Jarvis to be unable to kill tasks (and they aren't counted as 
> unreachable because the master can still reach the agent).
> A reasonable solution is for the agent to disconnect from the master upon 
> resetting the master it tracks since it's just going to drop control messages.





[jira] [Assigned] (MESOS-10032) Mesos agent should sever proactively master connection when failing to detect the leading master

2019-11-08 Thread Xudong Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudong Ni reassigned MESOS-10032:
-

Assignee: Xudong Ni

> Mesos agent should sever proactively master connection when failing to detect 
> the leading master
> 
>
> Key: MESOS-10032
> URL: https://issues.apache.org/jira/browse/MESOS-10032
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Major
>
> We have observed that this often happens when agents lose their ZK 
> connections, reset their master to None, and begin dropping messages from 
> the master because they can't verify that the messages are valid.
> This has caused Jarvis to be unable to kill tasks (and they aren't counted as 
> unreachable because the master can still reach the agent).
> A reasonable solution is for the agent to disconnect from the master upon 
> resetting the master it tracks since it's just going to drop control messages.





[jira] [Created] (MESOS-10032) Mesos agent should sever proactively master connection when failing to detect the leading master

2019-11-08 Thread Xudong Ni (Jira)
Xudong Ni created MESOS-10032:
-

 Summary: Mesos agent should sever proactively master connection 
when failing to detect the leading master
 Key: MESOS-10032
 URL: https://issues.apache.org/jira/browse/MESOS-10032
 Project: Mesos
  Issue Type: Improvement
Reporter: Xudong Ni


We have observed that this often happens when agents lose their ZK connections, 
reset their master to None, and begin dropping messages from the master 
because they can't verify that the messages are valid.

This has caused Jarvis to be unable to kill tasks (and they aren't counted as 
unreachable because the master can still reach the agent).

A reasonable solution is for the agent to disconnect from the master upon 
resetting the master it tracks since it's just going to drop control messages.
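
The proposed behavior can be sketched in C++ as a small state machine. This is only an illustration under simplifying assumptions: the `Agent` class, its method names, and the use of an empty string for "no known master" are all hypothetical and do not mirror the actual Mesos agent code.

```cpp
#include <cassert>
#include <string>

// Hypothetical, heavily simplified agent; not the actual Mesos agent class.
class Agent {
public:
  // Called when master detection reports a leader.
  void leaderDetected(const std::string& leader) {
    assert(!leader.empty());
    master_ = leader;
    connected_ = true;  // Simplification: assume the connection succeeds.
  }

  // Called when detection fails, e.g. after the ZK connection is lost.
  void leaderLost() {
    master_.clear();
    // Proposed behavior: sever the link instead of keeping it open and
    // silently dropping every message that arrives on it.
    connected_ = false;
  }

  // Returns true if a message from `from` is accepted, false if dropped.
  bool receive(const std::string& from) const {
    return !master_.empty() && from == master_;
  }

  bool connected() const { return connected_; }

private:
  std::string master_;      // Empty means "no known leading master".
  bool connected_ = false;  // Whether a link to the master is open.
};
```

With the connection severed in `leaderLost()`, the master can observe the disconnect rather than continuing to send control messages that the agent would drop anyway.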





[jira] [Commented] (MESOS-9987) Update 'Master::Http::_reserve' to also require 'source' resources

2019-11-08 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970179#comment-16970179
 ] 

Benno Evers commented on MESOS-9987:


{noformat}
commit b368d897d83df2f261e01fa7583798d80d098052
Author: Benno Evers 
Date:   Fri Nov 8 14:06:16 2019 +0100

Updated 'Master::Http::_reserve' to pass along new 'source' field.

Updated 'Master::Http::_reserve()' to correctly set the new `source`
field in the `Offer::Operation` created from operator API input.

Review: https://reviews.apache.org/r/71695/
{noformat}

> Update 'Master::Http::_reserve' to also require 'source' resources
> --
>
> Key: MESOS-9987
> URL: https://issues.apache.org/jira/browse/MESOS-9987
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations
> Fix For: 1.10
>
>
> We need to always pass {{source}} into {{Master::Http::_reserve}}.





[jira] [Commented] (MESOS-9986) Update 'getConsumedResources' and 'getResourceConversions' for 'source' in reservations

2019-11-08 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970175#comment-16970175
 ] 

Benno Evers commented on MESOS-9986:


{noformat}
commit 1d225b4c0270f06b901f0fafd777a347aae921cd
Author: Benno Evers 
Date:   Fri Nov 8 14:19:11 2019 +0100

Updated 'getResourceConversion()' for reservation updates.

Updated the `getResourcesConversion()` function to correctly
handle the `source` field in `RESERVE` operations.

Review: https://reviews.apache.org/r/71719/
{noformat}

> Update 'getConsumedResources' and 'getResourceConversions' for 'source' in 
> reservations
> ---
>
> Key: MESOS-9986
> URL: https://issues.apache.org/jira/browse/MESOS-9986
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>






[jira] [Commented] (MESOS-9991) Update 'Master::authorizeReserveResources' for re-reservations

2019-11-08 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970167#comment-16970167
 ] 

Benno Evers commented on MESOS-9991:


{noformat}
commit 09c830d87b88d4c2f386cb9ded5931528d6cf144
Author: Benjamin Bannier 
Date:   Fri Nov 8 14:19:16 2019 +0100

Added authorization handling for reservations with `source`.

This patch adds authorization handling for `RESERVE` operations
containing `source` fields. In order to stay backwards-compatible we add
a dedicated authorization branch for such operations which under the
hood translates each removed reservation to an `UNRESERVE` operation and
every added reservation as a `RESERVE` operation where we fall back to
existing authorization code for authorization.

Review: https://reviews.apache.org/r/71729/
{noformat}

> Update 'Master::authorizeReserveResources' for re-reservations
> --
>
> Key: MESOS-9991
> URL: https://issues.apache.org/jira/browse/MESOS-9991
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: foundations
>
> We need to authorize all modifications needed to bring {{source}} to the 
> common ancestor, and from the common ancestor to {{resources}}.
>  * each removed reservation needs to be authorized as an {{unreserve}} 
> operation
>  * each added reservation needs to be authorized as a {{reserve}} operation
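
The decomposition described above can be sketched in C++ as follows. This is a rough illustration, not the patch under review: it assumes a reservation can be modeled as a stack of role names with the most refined entry last, and `Stack`, `Step`, `Action`, and `stepsToAuthorize` are hypothetical names. The order in which removed reservations are emitted (top of the stack first) is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Action { UNRESERVE, RESERVE };

// Hypothetical model: a reservation stack is a list of role names, with
// the most refined reservation last.
using Stack = std::vector<std::string>;
using Step = std::pair<Action, std::string>;

// Lists the individual operations that would need authorization to move
// from `source` to `target` by way of their common ancestor.
inline std::vector<Step> stepsToAuthorize(
    const Stack& source, const Stack& target) {
  // Length of the common ancestor, i.e. the longest common prefix.
  std::size_t common = 0;
  while (common < source.size() && common < target.size() &&
         source[common] == target[common]) {
    ++common;
  }

  std::vector<Step> steps;

  // Pop reservations off `source` (top first) down to the common ancestor.
  for (std::size_t i = source.size(); i > common; --i) {
    steps.emplace_back(Action::UNRESERVE, source[i - 1]);
  }

  // Push reservations from the common ancestor up to `target`.
  for (std::size_t i = common; i < target.size(); ++i) {
    steps.emplace_back(Action::RESERVE, target[i]);
  }

  // Sanity check: one step per reservation outside the common ancestor.
  assert(steps.size() ==
         (source.size() - common) + (target.size() - common));
  return steps;
}
```

Each resulting step could then be handed to the existing single-operation authorization logic, which is the backwards-compatible route the commit message describes.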





[jira] [Commented] (MESOS-9992) Add end-to-end test exercising re-reservation operator API

2019-11-08 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970165#comment-16970165
 ] 

Benno Evers commented on MESOS-9992:


{noformat}
commit b6bdc74c896303dc1775c68642023ee4513834b1 (HEAD -> master, origin/master)
Author: Benno Evers 
Date:   Fri Nov 8 14:19:22 2019 +0100

Added end-to-end test for operator API reservation updates.

Added a new test to verify that reservations can be updated
using the operator API.

Review: https://reviews.apache.org/r/71725/
{noformat}

> Add end-to-end test exercising re-reservation operator API
> ---
>
> Key: MESOS-9992
> URL: https://issues.apache.org/jira/browse/MESOS-9992
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations
>



