Re: [VOTE] Release Apache Mesos 1.6.1 (rc2)

2018-07-18 Thread Gastón Kleiman
+1 (binding)

Tested on our internal CI: all green!
Also tested on CentOS 7, where the following tests failed:

[  FAILED  ] DockerContainerizerTest.ROOT_DOCKER_Launch_Executor
[  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
[  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
[  FAILED  ] NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_NvidiaDockerImage
[  FAILED  ] bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0, where GetParam() = true

They are all known to be flaky.

On Wed, Jul 11, 2018 at 6:15 PM Greg Mann  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>
>
> 1.6.1 includes the following:
>
> 
> *Announce major features here*
> *Announce major bug fixes here*
>
> The CHANGELOG for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc2
>
> 
>
> The candidate for Mesos 1.6.1 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz
>
> The tag to be voted on is 1.6.1-rc2:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc2
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1230
>
> Please vote on releasing this package as Apache Mesos 1.6.1!
>
> The vote is open until Mon Jul 16 18:15:00 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.6.1
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Greg
>


Re: [Performance WG] Meeting Notes - July 18

2018-07-18 Thread Vinod Kone
Awesome. Thanks for the write-up, Ben!

On Wed, Jul 18, 2018 at 2:55 PM Benjamin Mahler  wrote:

> For folks that missed it, here are my own notes. Thanks to alexr and dario
> for presenting!
>
> (1) I discussed a high agent CPU usage issue when hitting the /containers
> endpoint:
>
> https://issues.apache.org/jira/browse/MESOS-8418
>
> This was resolved, but it didn't get attention for months until I noticed a
> recent complaint about it in Slack. It highlights the need to periodically
> check for new performance tickets in the backlog.
>
>
> (2) alexr presented slides on some ongoing work to improve the state
> serving performance:
>
>
> https://docs.google.com/presentation/d/10VczNGAPZDOYF1zd5b4qe-Q8Tnp-4pHrjOCF5netO3g
>
> This included measurements from clusters with many frameworks. The short
> term plan (hopefully in 1.7.0) is to investigate batching / parallel
> processing of state requests (still on the master actor), and halving the
> queueing time via authorizing outside of the master actor. There are
> potential longer term plans, but these short term improvements should take
> us pretty far, along with (3).
>
>
> (3) I presented some results from adapting our jsonify library to use
> rapidjson under the covers, and it cuts our state serving time in half:
>
>
> https://docs.google.com/spreadsheets/d/1tZ17ws88jIIhuY6kH1rVkR_QxNG8rYL4DX_T6Te_nQo
>
> The code is mainly done but there are a few things left to get it in a
> reviewable state.
>
>
> (4) I briefly mentioned some other performance work:
>
>   (a) Libprocess metrics scalability: Greg, Gilbert, and I undertook some
> benchmarking and made improvements to better handle a large number of
> metrics, in support of per-framework metrics:
>
> https://issues.apache.org/jira/browse/MESOS-9072 (and see related tickets)
>
> There's still more open work that can be done here, but a more critical
> user-facing improvement at this point is the migration to push gauges in
> the master and allocator:
>
> https://issues.apache.org/jira/browse/MESOS-8914
>
>   (b) JSON parsing cost was cut in half by avoiding conversion through an
> intermediate format and instead directly parsing into our data structures:
>
> https://issues.apache.org/jira/browse/MESOS-9067
>
>
> (5) Till, Kapil, Meng Zhu, Greg Mann, Gaston and I have been working on
> benchmarking and making performance improvements to the allocator to speed
> up allocation cycle time and to address "offer starvation". In our
> multi-framework scale testing we saw allocation cycle time go down from 15
> secs to 5 secs, and there's still lots of low-hanging fruit:
>
> https://issues.apache.org/jira/browse/MESOS-9087
>
> For offer starvation, we fixed an offer fragmentation issue due to quota
> "chopping" and we introduced the choice of a random weighted shuffle sorter
> as an alternative to ensure that high-share frameworks don't get starved.
> We may also investigate introducing a round-robin sorter that shuffles
> between rounds if needed:
>
> https://issues.apache.org/jira/browse/MESOS-8935
> https://issues.apache.org/jira/browse/MESOS-8936
>
>
> (6) Dario talked about the MPSC queue that was recently added to libprocess
> for use in Process event queues. This needs to be enabled at configure time,
> as is currently the case for the lock-free structures, and should provide a
> throughput improvement to libprocess. We still need to chart a path to
> turning these libprocess performance-enhancing features on by default.
>
>
> (7) I can draft a 1.7.0 performance improvements blog post that features
> all of these topics and more. We may need to pull some of the lengthier
> content out into separate blog posts, but I think from the user's
> perspective, highlighting what they get in 1.7.0 performance-wise will
> be nice.
>
> Agenda Doc:
>
> https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU
>
> Ben
>


[Performance WG] Meeting Notes - July 18

2018-07-18 Thread Benjamin Mahler
For folks that missed it, here are my own notes. Thanks to alexr and dario
for presenting!

(1) I discussed a high agent CPU usage issue when hitting the /containers
endpoint:

https://issues.apache.org/jira/browse/MESOS-8418

This was resolved, but it didn't get attention for months until I noticed a
recent complaint about it in Slack. It highlights the need to periodically
check for new performance tickets in the backlog.


(2) alexr presented slides on some ongoing work to improve the state
serving performance:

https://docs.google.com/presentation/d/10VczNGAPZDOYF1zd5b4qe-Q8Tnp-4pHrjOCF5netO3g

This included measurements from clusters with many frameworks. The short
term plan (hopefully in 1.7.0) is to investigate batching / parallel
processing of state requests (still on the master actor), and halving the
queueing time via authorizing outside of the master actor. There are
potential longer term plans, but these short term improvements should take
us pretty far, along with (3).
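
A rough sketch of the batching idea (illustrative only; StateRequest,
MasterActor, and serveStateBatch are made-up names, not the actual master
code): rather than serializing the full state once per queued request, the
actor can drain every pending /state request and serialize a single time,
answering all of the waiters with the same payload.

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <utility>

    // Hypothetical sketch of batched /state serving on a single actor.
    struct StateRequest {
      // Stand-in for the HTTP response promise held by the real code.
      std::function<void(const std::string&)> respond;
    };

    class MasterActor {
    public:
      void enqueue(StateRequest request) {
        pending.push(std::move(request));
      }

      // Runs on the actor: one expensive serialization pass services
      // every request that queued up while the actor was busy.
      void serveStateBatch() {
        if (pending.empty()) {
          return;
        }

        const std::string payload = serializeState();

        while (!pending.empty()) {
          pending.front().respond(payload);
          pending.pop();
        }
      }

    private:
      std::string serializeState() {
        // Placeholder for building the /state JSON from in-memory state.
        return "{\"version\": \"1.6.1\", \"frameworks\": []}";
      }

      std::queue<StateRequest> pending;
    };

    int main() {
      MasterActor master;
      master.enqueue({[](const std::string& s) { std::cout << s << "\n"; }});
      master.enqueue({[](const std::string& s) { std::cout << s << "\n"; }});
      master.serveStateBatch();  // Serializes once, responds to both.
      return 0;
    }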


(3) I presented some results from adapting our jsonify library to use
rapidjson under the covers, and it cuts our state serving time in half:

https://docs.google.com/spreadsheets/d/1tZ17ws88jIIhuY6kH1rVkR_QxNG8rYL4DX_T6Te_nQo

The code is mainly done but there are a few things left to get it in a
reviewable state.
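
For a sense of the approach, here is a minimal sketch (not the actual jsonify
change) of serializing a record with rapidjson's SAX-style Writer, which
streams fields straight into a string buffer instead of formatting through an
ostream:

    #include <iostream>
    #include <string>

    #include <rapidjson/stringbuffer.h>
    #include <rapidjson/writer.h>

    // Illustrative only: write a small "task" record directly into a
    // buffer, with no intermediate JSON object model in between.
    std::string serializeTask(const std::string& id, double cpus, bool running)
    {
      rapidjson::StringBuffer buffer;
      rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);

      writer.StartObject();
      writer.Key("id");
      writer.String(id.c_str(), static_cast<rapidjson::SizeType>(id.size()));
      writer.Key("cpus");
      writer.Double(cpus);
      writer.Key("running");
      writer.Bool(running);
      writer.EndObject();

      return std::string(buffer.GetString(), buffer.GetSize());
    }

    int main()
    {
      // Prints {"id":"task-42","cpus":0.5,"running":true}
      std::cout << serializeTask("task-42", 0.5, true) << std::endl;
      return 0;
    }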


(4) I briefly mentioned some other performance work:

  (a) Libprocess metrics scalability: Greg, Gilbert, and I undertook some
benchmarking and made improvements to better handle a large number of
metrics, in support of per-framework metrics:

https://issues.apache.org/jira/browse/MESOS-9072 (and see related tickets)

There's still more open work that can be done here, but a more critical
user-facing improvement at this point is the migration to push gauges in
the master and allocator:

https://issues.apache.org/jira/browse/MESOS-8914
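
To make the push gauge motivation concrete: a pull gauge computes its value on
every metrics snapshot, typically by dispatching onto the (busy) owning actor,
while a push gauge is updated by the owner whenever its state changes, so a
snapshot is just an atomic read. A conceptual sketch, not the libprocess API:

    #include <atomic>
    #include <cstdint>
    #include <functional>
    #include <utility>

    // Pull-style gauge: the value is computed on demand, so every
    // metrics snapshot pays the cost (in Mesos, an actor dispatch).
    class PullGauge {
    public:
      explicit PullGauge(std::function<double()> compute)
        : compute_(std::move(compute)) {}

      double value() const { return compute_(); }

    private:
      std::function<double()> compute_;
    };

    // Push-style gauge: the owner pushes updates as state changes, so
    // reading the value never touches the owning actor.
    class PushGauge {
    public:
      void increment() { value_.fetch_add(1, std::memory_order_relaxed); }
      void decrement() { value_.fetch_sub(1, std::memory_order_relaxed); }
      void set(int64_t v) { value_.store(v, std::memory_order_relaxed); }

      int64_t value() const { return value_.load(std::memory_order_relaxed); }

    private:
      std::atomic<int64_t> value_{0};
    };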

  (b) JSON parsing cost was cut in half by avoiding conversion through an
intermediate format and instead directly parsing into our data structures:

https://issues.apache.org/jira/browse/MESOS-9067
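
Conceptually, the old path parsed into a generic intermediate JSON value and
then converted that into our types; the new path parses straight into the
target type. A minimal sketch of the direct style using rapidjson's SAX
interface (the TaskInfo struct and its fields are made up for illustration;
this is not the Mesos parsing code):

    #include <iostream>
    #include <string>

    #include <rapidjson/reader.h>

    // Target in-memory type; we parse directly into it.
    struct TaskInfo {
      std::string name;
      double cpus = 0.0;
    };

    // SAX handler for a flat object like {"name": "...", "cpus": 0.5}.
    struct TaskInfoHandler
      : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, TaskInfoHandler> {
      TaskInfo result;
      std::string currentKey;

      bool Key(const char* str, rapidjson::SizeType length, bool) {
        currentKey.assign(str, length);
        return true;
      }

      bool String(const char* str, rapidjson::SizeType length, bool) {
        if (currentKey == "name") {
          result.name.assign(str, length);
        }
        return true;
      }

      bool Double(double d) {
        if (currentKey == "cpus") {
          result.cpus = d;
        }
        return true;
      }

      // Whole numbers arrive through the integer callbacks; forward them.
      bool Uint(unsigned u) { return Double(static_cast<double>(u)); }
      bool Int(int i) { return Double(static_cast<double>(i)); }
    };

    int main()
    {
      const char* json = "{\"name\": \"sleeper\", \"cpus\": 0.5}";

      TaskInfoHandler handler;
      rapidjson::Reader reader;
      rapidjson::StringStream stream(json);
      reader.Parse(stream, handler);

      std::cout << handler.result.name << " " << handler.result.cpus << "\n";
      return 0;
    }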


(5) Till, Kapil, Meng Zhu, Greg Mann, Gaston and I have been working on
benchmarking and making performance improvements to the allocator to speed
up allocation cycle time and to address "offer starvation". In our
multi-framework scale testing we saw allocation cycle time go down from 15
secs to 5 secs, and there's still lots of low-hanging fruit:

https://issues.apache.org/jira/browse/MESOS-9087

For offer starvation, we fixed an offer fragmentation issue due to quota
"chopping" and we introduced the choice of a random weighted shuffle sorter
as an alternative to ensure that high-share frameworks don't get starved.
We may also investigate introducing a round-robin sorter that shuffles
between rounds if needed:

https://issues.apache.org/jira/browse/MESOS-8935
https://issues.apache.org/jira/browse/MESOS-8936
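
For illustration, one standard way to implement a weighted random shuffle is
the Efraimidis-Spirakis keying trick: give each client a key of u^(1/weight),
with u drawn uniformly from (0, 1), and sort by key descending, so that
higher-weight clients tend to come earlier while every ordering remains
possible. A hedged sketch under that assumption (not the actual sorter code):

    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <limits>
    #include <random>
    #include <string>
    #include <utility>
    #include <vector>

    struct Client {
      std::string name;
      double weight;  // Higher weight => more likely to appear early.
    };

    // Weighted random shuffle: key = u^(1/weight), sorted descending.
    std::vector<Client> weightedShuffle(
        std::vector<Client> clients, std::mt19937& rng)
    {
      std::uniform_real_distribution<double> uniform(
          std::numeric_limits<double>::min(), 1.0);

      std::vector<std::pair<double, Client>> keyed;
      keyed.reserve(clients.size());

      for (Client& client : clients) {
        const double key = std::pow(uniform(rng), 1.0 / client.weight);
        keyed.emplace_back(key, std::move(client));
      }

      std::sort(keyed.begin(), keyed.end(),
                [](const auto& a, const auto& b) { return a.first > b.first; });

      std::vector<Client> result;
      result.reserve(keyed.size());
      for (auto& entry : keyed) {
        result.push_back(std::move(entry.second));
      }
      return result;
    }

    int main()
    {
      std::mt19937 rng(std::random_device{}());
      std::vector<Client> clients = {{"big", 3.0}, {"medium", 2.0}, {"small", 1.0}};

      for (const Client& c : weightedShuffle(clients, rng)) {
        std::cout << c.name << " ";  // e.g. "big small medium"
      }
      std::cout << std::endl;
      return 0;
    }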


(6) Dario talked about the MPSC queue that was recently added to libprocess
for use in Process event queues. This needs to be enabled at configure time,
as is currently the case for the lock-free structures, and should provide a
throughput improvement to libprocess. We still need to chart a path to
turning these libprocess performance-enhancing features on by default.
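
For context, MPSC (multi-producer single-consumer) matches the libprocess
event queue: any thread may enqueue events for a process, but only one thread
at a time dequeues and runs them. A toy, mutex-based illustration of that
contract (the queue that actually landed is lock-free; this only shows the
semantics):

    #include <deque>
    #include <mutex>
    #include <optional>
    #include <utility>

    template <typename T>
    class MpscQueue {
    public:
      // Safe to call concurrently from many producer threads.
      void enqueue(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(std::move(value));
      }

      // Called only by the single consumer thread driving the process.
      std::optional<T> dequeue() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) {
          return std::nullopt;
        }
        T value = std::move(items_.front());
        items_.pop_front();
        return value;
      }

    private:
      std::mutex mutex_;
      std::deque<T> items_;
    };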


(7) I can draft a 1.7.0 performance improvements blog post that features
all of these topics and more. We may need to pull some of the lengthier
content out into separate blog posts, but I think from the user's
perspective, highlighting what they get in 1.7.0 performance-wise will
be nice.

Agenda Doc:
https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU

Ben


Re: Backport Policy

2018-07-18 Thread Gilbert Song
Thanks for clarifying the backporting policy, BenM!

I totally agree with the changes proposed for the backporting policy, but I
realize there are two more scenarios that are not yet clear to me:

   - There are some bugs that are not fixable (due to legacy technical
   decisions), and we end up fixing the issue through a semantic/behavior
   change in a new release. Do we expect this semantic/behavior change to be
   backported?
   - There might be some bugs whose root cause is not yet known but that
   impact a couple of releases. If we decide to add some commits for debugging
   purposes (e.g., a new debugging endpoint, or more logging), should we also
   allow these patches to be backported?

For #2, I think we should do the backporting, but for #1, maybe more
discussion is needed since it relates to whether the user has to upgrade or
not.

Cheers,
Gilbert

On Tue, Jul 17, 2018 at 4:26 PM, Lawrence Rau  wrote:

> I don’t have a big stake in this; however, one opinion: if a large commercial
> enterprise is using a specific release that is working, the desire is often
> to upgrade only when necessary.  Necessary can mean a number of things,
> including new features; however, if a new feature is not needed, the
> compelling reason to upgrade is to fix a specific problem that is causing
> issues.  Thus keeping a maintenance release stable is very important, as is
> reducing the chance of introducing one problem while fixing another.
>
> Often a clear classification of the severity of the problem would dictate the
> need to make a change (yes, these can be subjective, but some guidance
> would be better than nothing).
>
> It might be good to give committers guidance to backport things that have a
> high impact on improving a problem: fixing a crashing bug, fixing a severe
> performance degradation, etc., where these issues have no easy/viable
> workaround.  Nice-to-have fixes aren’t always worth upgrading for.
>
> There can be an argument to respond with “then don’t upgrade”, but if the
> release picks up nice-to-haves and, several point releases later, a critical
> bug is fixed, then the org is forced to accept the risk of the nice-to-haves
> along with the fix.
>
> just an opinion.
> …larry
>
>
> On Jul 17, 2018, at 3:00 PM, Chun-Hung Hsiao  wrote:
>
> I just have a comment on a special case:
> If a fix for a flaky test is easy to backport,
> IMO we probably should backport it;
> otherwise, if someone backports another critical fix in the future,
> it would take them extra effort to check all the CI failures.
>
> On Mon, Jul 16, 2018 at 11:39 AM Vinod Kone  wrote:
>
>> I like how you summarized it, Greg, and I would vote for leaving the
>> decision to the committer too. In addition to what others mentioned, I
>> think the committer should have the responsibility because, if things break
>> in a point release (after it is released), it is the committer and
>> contributor who are on the hook to triage and fix it, not the release
>> manager.
>>
>> Having said that, if "during" the release process (i.e., cutting an RC)
>> these backports cause delays for the release manager in getting the release
>> out (e.g., CI flakiness introduced due to backports), the release manager
>> could be the ultimate arbiter on whether such a backport should be reverted
>> or fixed by the committer/contributor. Hopefully such issues are caught
>> well before a release process is started (e.g., by CI running against
>> release branches).
>>
>> On Mon, Jul 16, 2018 at 1:28 PM Jie Yu  wrote:
>>
>> > Greg, I like your idea of adding a prescriptive "policy" when evaluating
>> > whether a bug fix should be backported, and leaving the decision to the
>> > committer (because they have the most context, and it avoids a bottleneck
>> > in the process).
>> >
>> > - Jie
>> >
>> > On Mon, Jul 16, 2018 at 11:24 AM, Greg Mann  wrote:
>> >
>> > > My impression is that we have two opposing schools of thought here:
>> > >
>> > >    1. Backport as little as possible, to avoid unforeseen consequences
>> > >    2. Backport as much as proves practical, to eliminate bugs in
>> > >       supported versions
>> > >
>> > > Do other people agree with this assessment?
>> > >
>> > > If so, how can we find common ground? One possible solution would be
>> > > to leave the decision on backporting up to the committer, without
>> > > specifying a project-wide policy. This seems to be the status quo, and
>> > > would lead to some variation across committers regarding what types of
>> > > fixes are backported. We could also choose to delegate the decision to
>> > > the release manager; I favor leaving the decision with the committer,
>> > > to eliminate the burden on release managers.
>> > >
>> > > Here's a thought: rather than defining a prescriptive "policy" that we
>> > > expect committers to abide by, we could enumerate in the documentation
>> > > the competing concerns that we expect committers to consider when
>> > > making decisions on backports. The committing docs could read