[jira] [Commented] (MESOS-9896) Consider using protobuf provided json conversion facilities rather than custom ones.
[ https://issues.apache.org/jira/browse/MESOS-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970596#comment-16970596 ]

Benjamin Mahler commented on MESOS-9896:

Started a thread here on their mailing list:

[https://groups.google.com/forum/#!topic/protobuf/4qmUqGE5-oQ]

> Consider using protobuf provided json conversion facilities rather than custom ones.
>
> Key: MESOS-9896
> URL: https://issues.apache.org/jira/browse/MESOS-9896
> Project: Mesos
> Issue Type: Task
> Components: stout
> Reporter: Benjamin Mahler
> Priority: Major
> Labels: foundations
>
> Currently, stout provides custom JSON to protobuf conversion facilities, some
> of which use protobuf reflection.
>
> When upgrading protobuf to 3.7.x in MESOS-9755, we found that the v0 /state
> response of the master slowed down, and it appears to be due to a performance
> regression in the protobuf reflection code.
>
> We should file an issue with protobuf, but we should also look into using the
> json conversion code that protobuf provides to see if that can help avoid the
> regression. It may be the case that using the built-in facilities actually
> provides a significant performance benefit, given they don't use reflection.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (MESOS-10026) Improve v1 operator API read performance.
[ https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970594#comment-16970594 ]

Benjamin Mahler commented on MESOS-10026:

Some preliminary numbers from a prototype:

https://github.com/bmahler/mesos/tree/bmahler_v1_operator_api_read_performance

{noformat}
Before:
v0 '/state' response took 6.549942141secs
v1 'master::call::GetState' application/x-protobuf response took 24.081624381secs
v1 'master::call::GetState' application/json response took 22.760332466secs
{noformat}

{noformat}
After:
v0 '/state' response took 7.57313099secs
v1 'master::call::GetState' application/x-protobuf response took 5.240223816secs
v1 'master::call::GetState' application/json response took 1.76133347258333mins
{noformat}

However, as you can see, it turns out protobuf’s built-in json conversion is extremely slow, at least for going from serialized protobuf to serialized json (I haven’t run perf to see why). This means we can’t really use the built-in json facilities (see MESOS-9896), and we have to have two code paths: one doing direct protobuf serialization and one doing direct json serialization via jsonify. I implemented that and got the following:

{noformat}
After:
v0 '/state' response took 7.743768168secs
v1 'master::call::GetState' application/x-protobuf response took 5.640594663secs
v1 'master::call::GetState' application/json response took 11.795411549secs
{noformat}

> Improve v1 operator API read performance.
>
> Key: MESOS-10026
> URL: https://issues.apache.org/jira/browse/MESOS-10026
> Project: Mesos
> Issue Type: Improvement
> Components: HTTP API
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Major
> Labels: foundations
>
> Currently, the v1 operator API has poor performance relative to the v0 json
> API. The following initial numbers were provided by [~Will Mahler] from our
> state serving benchmark:
>
> |OPTIMIZED - Master (baseline)| | | | |
> |Test setup|1000 agents with a total of 1 running tasks and 1 completed tasks|1 agents with a total of 10 running tasks and 10 completed tasks|2 agents with a total of 20 running tasks and 20 completed tasks|4 agents with a total of 40 running tasks and 40 completed tasks|
> |v0 'state' response|0.17|1.66|8.96|12.42|
> |v1 x-protobuf|0.35|3.21|9.47|19.09|
> |v1 json|0.45|4.72|10.81|31.43|
>
> There is quite a lot of variance, but v1 protobuf is consistently slower than
> v0 (sometimes significantly so) and v1 json is consistently slower than v1
> protobuf (sometimes significantly so).
>
> The reason that the v1 operator API is slower is that it does the following:
>
> (1) Construct a temporary unversioned state response object by copying the
> in-memory unversioned state into the overall response object. (expensive!)
> (2) Evolve it to v1: serialize, de-serialize into the v1 overall state object.
> (expensive!)
> (3) Serialize the overall v1 state object to protobuf or json.
> (4) Destruct the temporaries (expensive! but is done after the response
> starts serving)
>
> On the other hand, the v0 jsonify approach does the following:
>
> (1) Serialize the in-memory unversioned state into json, by traversing state
> and accumulating the overall serialized json.
>
> This means that v1 has substantial overhead vs v0, and we need to remove it
> to bring v1 on par with or better than v0. v1 should serialize directly to
> json (straightforward with jsonify) or protobuf (this can be done via an
> io::CodedOutputStream).
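The two serving paths described in the issue can be contrasted in a small sketch. This is an illustration in Python, not Mesos code: the `Task` class, `evolve`, and both `serialize_*` functions are hypothetical stand-ins, and JSON is used for the "evolve" round trip only because it is available in the standard library. It shows the shape of the work (copy, round trip, serialize versus a single traversal), not the measured costs.

```python
import copy
import io
import json
from dataclasses import dataclass, asdict


@dataclass
class Task:
    # Hypothetical stand-in for one entry of the master's in-memory state.
    task_id: str
    state: str


def evolve(unversioned: dict) -> dict:
    # Stand-in for "evolve to v1": a full serialize / de-serialize round trip.
    return json.loads(json.dumps(unversioned))


def serialize_via_temporaries(tasks: list) -> str:
    # v1-style path: (1) copy the state into a temporary response object,
    # (2) evolve it, (3) serialize the evolved object.
    response = {"tasks": [asdict(copy.deepcopy(t)) for t in tasks]}
    v1_response = evolve(response)
    return json.dumps(v1_response, separators=(",", ":"))


def serialize_directly(tasks: list) -> str:
    # v0 jsonify-style path: traverse the state once and accumulate the
    # serialized output, with no intermediate response object.
    out = io.StringIO()
    out.write('{"tasks":[')
    for i, t in enumerate(tasks):
        if i:
            out.write(",")
        out.write('{"task_id":%s,"state":%s}'
                  % (json.dumps(t.task_id), json.dumps(t.state)))
    out.write("]}")
    return out.getvalue()
```

Both functions produce the same bytes for the same input; the direct version simply never materializes the temporaries that steps (1), (2) and (4) above pay for.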
[jira] [Commented] (MESOS-10032) Mesos agent should sever proactively master connection when failing to detect the leading master
[ https://issues.apache.org/jira/browse/MESOS-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970584#comment-16970584 ]

Xudong Ni commented on MESOS-10032:

https://reviews.apache.org/r/71742/

> Mesos agent should sever proactively master connection when failing to detect
> the leading master
>
> Key: MESOS-10032
> URL: https://issues.apache.org/jira/browse/MESOS-10032
> Project: Mesos
> Issue Type: Improvement
> Reporter: Xudong Ni
> Assignee: Xudong Ni
> Priority: Major
>
> We have observed that this often happens when agents lose their ZK
> connections, reset their master to None, and begin dropping messages from
> the master because they can't verify that the messages are valid.
>
> This has caused Jarvis to be unable to kill tasks (and they aren't counted as
> unreachable because the master can still reach the agent).
>
> A reasonable solution is for the agent to disconnect from the master upon
> resetting the master it tracks since it's just going to drop control messages.
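The proposed behavior can be pictured with a toy model. This is a sketch under stated assumptions, not the agent's actual code: the `Agent` class, its `detected` and `receive` methods, and the message strings are all hypothetical, and the real change is the one under review at the link above.

```python
from typing import List, Optional


class Agent:
    """Hypothetical model of the agent's view of its master; not Mesos code."""

    def __init__(self) -> None:
        self.master: Optional[str] = None
        self.connected = False
        self.dropped: List[str] = []
        self.handled: List[str] = []

    def detected(self, master: Optional[str]) -> None:
        # Called when leader detection reports a result; None means the
        # leading master could not be detected (e.g. the ZK session was lost).
        self.master = master
        # Proposed behavior: when the tracked master is reset to None, sever
        # the connection instead of keeping it open and dropping messages.
        self.connected = master is not None

    def receive(self, sender: str, message: str) -> None:
        # Messages from anyone other than the tracked master cannot be
        # validated, so they are dropped.
        if self.master is None or sender != self.master:
            self.dropped.append(message)
        else:
            self.handled.append(message)
```

In this model, once `detected(None)` has run, `connected` is false, so the other side can observe the disconnect rather than sending control messages that will only be dropped.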
[jira] [Assigned] (MESOS-10032) Mesos agent should sever proactively master connection when failing to detect the leading master
[ https://issues.apache.org/jira/browse/MESOS-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xudong Ni reassigned MESOS-10032:

    Assignee: Xudong Ni

> Mesos agent should sever proactively master connection when failing to detect
> the leading master
>
> Key: MESOS-10032
> URL: https://issues.apache.org/jira/browse/MESOS-10032
> Project: Mesos
> Issue Type: Improvement
> Reporter: Xudong Ni
> Assignee: Xudong Ni
> Priority: Major
>
> We have observed that this often happens when agents lose their ZK
> connections, reset their master to None, and begin dropping messages from
> the master because they can't verify that the messages are valid.
>
> This has caused Jarvis to be unable to kill tasks (and they aren't counted as
> unreachable because the master can still reach the agent).
>
> A reasonable solution is for the agent to disconnect from the master upon
> resetting the master it tracks since it's just going to drop control messages.
[jira] [Created] (MESOS-10032) Mesos agent should sever proactively master connection when failing to detect the leading master
Xudong Ni created MESOS-10032:

    Summary: Mesos agent should sever proactively master connection when failing to detect the leading master
    Key: MESOS-10032
    URL: https://issues.apache.org/jira/browse/MESOS-10032
    Project: Mesos
    Issue Type: Improvement
    Reporter: Xudong Ni

We have observed that this often happens when agents lose their ZK connections, reset their master to None, and begin dropping messages from the master because they can't verify that the messages are valid.

This has caused Jarvis to be unable to kill tasks (and they aren't counted as unreachable because the master can still reach the agent).

A reasonable solution is for the agent to disconnect from the master upon resetting the master it tracks since it's just going to drop control messages.
[jira] [Commented] (MESOS-9987) Update 'Master::Http::_reserve' to also require 'source' resources
[ https://issues.apache.org/jira/browse/MESOS-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970179#comment-16970179 ]

Benno Evers commented on MESOS-9987:

{noformat}
commit b368d897d83df2f261e01fa7583798d80d098052
Author: Benno Evers
Date:   Fri Nov 8 14:06:16 2019 +0100

    Updated 'Master::Http::_reserve' to pass along new 'source' field.

    Updated 'Master::Http::_reserve()' to correctly set the new `source`
    field in the `Offer::Operation` created from operator API input.

    Review: https://reviews.apache.org/r/71695/
{noformat}

> Update 'Master::Http::_reserve' to also require 'source' resources
>
> Key: MESOS-9987
> URL: https://issues.apache.org/jira/browse/MESOS-9987
> Project: Mesos
> Issue Type: Task
> Reporter: Benjamin Bannier
> Assignee: Benno Evers
> Priority: Major
> Labels: foundations
> Fix For: 1.10
>
> We need to always pass {{source}} into {{Master::Http::_reserve}}.
[jira] [Commented] (MESOS-9986) Update 'getConsumedResources' and 'getResourceConversions' for 'source' in reservations
[ https://issues.apache.org/jira/browse/MESOS-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970175#comment-16970175 ]

Benno Evers commented on MESOS-9986:

{noformat}
commit 1d225b4c0270f06b901f0fafd777a347aae921cd
Author: Benno Evers
Date:   Fri Nov 8 14:19:11 2019 +0100

    Updated 'getResourceConversion()' for reservation updates.

    Updated the `getResourcesConversion()` function to correctly handle
    the `source` field in `RESERVE` operations.

    Review: https://reviews.apache.org/r/71719/
{noformat}

> Update 'getConsumedResources' and 'getResourceConversions' for 'source' in
> reservations
>
> Key: MESOS-9986
> URL: https://issues.apache.org/jira/browse/MESOS-9986
> Project: Mesos
> Issue Type: Task
> Reporter: Benjamin Bannier
> Assignee: Benno Evers
> Priority: Major
[jira] [Commented] (MESOS-9991) Update 'Master::authorizeReserveResources' for re-reservations
[ https://issues.apache.org/jira/browse/MESOS-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970167#comment-16970167 ]

Benno Evers commented on MESOS-9991:

{noformat}
commit 09c830d87b88d4c2f386cb9ded5931528d6cf144
Author: Benjamin Bannier
Date:   Fri Nov 8 14:19:16 2019 +0100

    Added authorization handling for reservations with `source`.

    This patch adds authorization handling for `RESERVE` operations
    containing `source` fields. In order to stay backwards-compatible we
    add a dedicated authorization branch for such operations which under
    the hood translates each removed reservation to an `UNRESERVE`
    operation and every added reservation as a `RESERVE` operation where
    we fall back to existing authorization code for authorization.

    Review: https://reviews.apache.org/r/71729/
{noformat}

> Update 'Master::authorizeReserveResources' for re-reservations
>
> Key: MESOS-9991
> URL: https://issues.apache.org/jira/browse/MESOS-9991
> Project: Mesos
> Issue Type: Task
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Priority: Major
> Labels: foundations
>
> We need to authorize all modifications to bring {{source}} to the common
> ancestor, and from the common ancestor to {{resources}}.
> * each removed reservation needs to be authorized as an {{unreserve}}
> operation
> * each added reservation needs to be authorized as a {{reserve}} operation
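The translation described above (walk from `source` down to the common ancestor, then up to the target) can be sketched as follows. This is a minimal illustration, not the Mesos implementation: it assumes a reservation state can be modelled as an ordered stack of hypothetical `(role, principal)` pairs, and `authorization_steps` is an invented name.

```python
from typing import List, Tuple

Reservation = Tuple[str, str]  # hypothetical (role, principal) pair


def authorization_steps(
    source: List[Reservation],
    target: List[Reservation],
) -> List[Tuple[str, Reservation]]:
    # The common ancestor is the longest shared prefix of both stacks.
    common = 0
    while (common < len(source) and common < len(target)
           and source[common] == target[common]):
        common += 1

    steps = []
    # Every reservation removed on the way down to the common ancestor is
    # authorized as an UNRESERVE, starting from the top of the source stack.
    for reservation in reversed(source[common:]):
        steps.append(("UNRESERVE", reservation))
    # Every reservation added on the way up to the target is authorized
    # as a RESERVE.
    for reservation in target[common:]:
        steps.append(("RESERVE", reservation))
    return steps
```

For example, going from `[("eng", "alice"), ("eng/dev", "alice")]` to `[("eng", "alice"), ("eng/prod", "bob")]` yields one `UNRESERVE` for `("eng/dev", "alice")` followed by one `RESERVE` for `("eng/prod", "bob")`; each step could then be handed to the existing per-operation authorization path.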
[jira] [Commented] (MESOS-9992) Add end-to-end test exercising re-reservation operator API
[ https://issues.apache.org/jira/browse/MESOS-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970165#comment-16970165 ]

Benno Evers commented on MESOS-9992:

{noformat}
commit b6bdc74c896303dc1775c68642023ee4513834b1 (HEAD -> master, origin/master)
Author: Benno Evers
Date:   Fri Nov 8 14:19:22 2019 +0100

    Added end-to-end test for operator API reservation updates.

    Added a new test to verify that reservations can be updated
    using the operator API.

    Review: https://reviews.apache.org/r/71725/
{noformat}

> Add end-to-end test exercising re-reservation operator API
>
> Key: MESOS-9992
> URL: https://issues.apache.org/jira/browse/MESOS-9992
> Project: Mesos
> Issue Type: Task
> Reporter: Benjamin Bannier
> Assignee: Benno Evers
> Priority: Major
> Labels: foundations