Re: [VOTE] Move Apache Mesos to Attic

2021-04-06 Thread Meng Zhu
+1

It has been a pleasure working with you all!

On Mon, Apr 5, 2021 at 10:58 AM Vinod Kone  wrote:

> Hi folks,
>
> Based on the recent conversations
> <
> https://lists.apache.org/thread.html/raed89cc5ab78531c48f56aa1989e1e7eb05f89a6941e38e9bc8803ff%40%3Cuser.mesos.apache.org%3E
> >
> on our mailing list, it seems to me that the majority consensus among the
> existing PMC is to move the project to the attic <
> https://attic.apache.org/>
> and let the interested community members collaborate on a fork on GitHub.
>
> I would like to call a vote to dissolve the PMC and move the project to the
> attic.
>
> Please reply to this thread with your vote. Only binding votes from
> PMC/committers count towards the final tally but everyone in the community
> is encouraged to vote. See the process here.
>
> Thanks,
>


Re: [VOTE] Release Apache Mesos 1.8.1 (rc1)

2019-07-16 Thread Meng Zhu
+1

Tested on CentOS 7.4; the only failures were known flaky tests:

[  PASSED  ] 466 tests.
[  FAILED  ] 7 tests, listed below:
[  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
[  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
[  FAILED  ] DockerVolumeIsolatorTest.ROOT_CommandTaskNoRootfsWithVolumes
[  FAILED  ] DockerVolumeIsolatorTest.ROOT_CommandTaskNoRootfsSlaveRecovery
[  FAILED  ] DockerVolumeIsolatorTest.ROOT_EmptyCheckpointFileSlaveRecovery
[  FAILED  ] DockerVolumeIsolatorTest.ROOT_CommandTaskNoRootfsSingleVolumeMultipleContainers
[  FAILED  ] NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_TensorflowGpuImage

-Meng

On Wed, Jul 10, 2019 at 1:48 PM Vinod Kone  wrote:

> +1 (binding).
>
> Tested in ASF CI. One build failed due to a known flaky test:
> https://issues.apache.org/jira/browse/MESOS-9594
>
>
> *Revision*: 4ae06448466408d9ec96ede953208057609f0744
>
>- refs/tags/1.8.1-rc1
>
> Configuration Matrix (all builds ran with --verbose
> --disable-libtool-wrappers --disable-parallel-test-execution; "SSL" builds
> add --enable-libevent --enable-ssl; per-build logs are under
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/):
>
>   centos:7, SSL:          autotools: gcc Success, clang not run
>                           cmake:     gcc Success, clang not run
>   centos:7, non-SSL:      autotools: gcc Success, clang not run
>                           cmake:     gcc Success, clang not run
>   ubuntu:16.04, SSL:      autotools: gcc Success, clang Success
>                           cmake:     gcc Success, clang Failed
>   ubuntu:16.04, non-SSL:  autotools: gcc Success, clang Success
>                           (rest of the matrix truncated in the archive)

Upcoming changes to the `/role` endpoint and `GET_QUOTA` call

2019-06-21 Thread Meng Zhu
Hi:

We are making some changes to the responses of the `/role` endpoint and the
master `GET_QUOTA` call. These are necessary as part of the quota limits
work. Despite our efforts to keep these as backward compatible as possible,
there are some small incompatible tweaks. If you have tooling that depends
on these endpoints, please update it accordingly. Please check out the API
section of the design doc for more details as well as the rationale behind
these changes. Also, feel free to reach out if you have any questions or
concerns.

Changes to the `/role` endpoint:
- The `principal` field will be removed from the quota object.
- Resources with zero quantity will no longer be included in the
`guarantee` field.
- The `guarantee` field will continue to be filled. However, since we are
decoupling the quota guarantee from the limit, one can no longer assume
that the limit is the same as the guarantee. A separate `limit` field will
be introduced.

*Before*, the response might contain:
```
{
  "quota": {
"guarantee": {
  "cpus": 1,
  "disk": 0,
  "gpus": 0,
  "mem": 512
},
"principal": "test-principal",
"role": "foo"
  }
}
```
*After*:
```
{
  "quota": {
"guarantee": {
  "cpus": 1,
  "mem": 512
},
"limit": {
  "cpus": 1,
  "mem": 512
},
"role": "foo"
  }
}
```

Changes to the `GET_QUOTA` call:
The `QuotaInfo` field is going to be deprecated and replaced by
`QuotaConfig`, but we will continue to fill in as much of it as we can.
Similar to the `/role` endpoint above:
- The `principal` field will no longer be filled in the `QuotaInfo` object.
- The `guarantee` field will continue to be filled. However, since we are
decoupling the quota guarantee from the limit, one can no longer assume
that the limit will be the same as the guarantee.
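
A minimal sketch (assuming the JSON shapes shown above; the helper name is
ours) of tooling that reads an effective limit from either the old or the
new response format:

```python
# Sketch only: version-tolerant reading of the quota object, assuming the
# before/after JSON shapes shown above.
def effective_quota_limit(quota):
    # Post-change masters fill a separate `limit` field; prefer it.
    if "limit" in quota:
        return quota["limit"]
    # Pre-change masters only expose `guarantee`, where limit == guarantee.
    # Zero-quantity resources may be absent, so treat a missing resource as
    # "no guarantee" rather than erroring out.
    return quota.get("guarantee", {})

old_quota = {"guarantee": {"cpus": 1, "disk": 0, "gpus": 0, "mem": 512},
             "principal": "test-principal", "role": "foo"}
new_quota = {"guarantee": {"cpus": 1, "mem": 512},
             "limit": {"cpus": 1, "mem": 512}, "role": "foo"}

assert effective_quota_limit(new_quota) == {"cpus": 1, "mem": 512}
assert effective_quota_limit(old_quota)["cpus"] == 1
```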

Thanks,
Meng


Upcoming allocator change to clusters using oversubscribed resources with quota under DRF

2019-05-29 Thread Meng Zhu
Folks:

If you are not using oversubscribed resources along with quota under DRF
(all three at the same time), read no further. Just
stay tuned for the upcoming shiny new allocator with decoupled quota
guarantees and limits :)

OK, for the rest of you, you are truly advanced users! Here is the news.

As part of the tech debt cleanup in the allocator, we plan to remove the
quota role sorter in the allocator and only keep a single role sorter for
all the roles. This would simplify the allocator logic and help speed up
feature development.

This will result in one behavior change if you are using oversubscribed
resources with quota under DRF. Previously, in the quota allocation stage,
revocable resources were counted towards *neither* the total resource pool
*nor* a role's allocated resources when sorting with DRF. This is arguably
the right behavior. However, after the aforementioned removal, all
resources, both revocable and non-revocable, will be counted when
calculating DRF shares in the quota allocation stage. This means that a
quota role consuming a lot of revocable resources but not-so-many
non-revocable ones, which previously would have been sorted towards the
head of the queue, is now likely to be sorted towards the tail.
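
For concreteness, here is a toy illustration (made-up numbers, not from any
real cluster) of how a quota role's DRF dominant share can shift once
revocable resources are counted:

```python
# Toy DRF illustration with made-up numbers. The dominant share is the
# maximum over resources of (role allocation / total pool).
total = {"cpus": 100.0, "mem": 1000.0}   # non-revocable pool
revocable_total = {"cpus": 100.0}        # oversubscribed cpus

# A quota role holding 5 non-revocable cpus, 50 mem, and 50 revocable cpus.
alloc = {"cpus": 5.0, "mem": 50.0}
revocable_alloc = {"cpus": 50.0}

# Old behavior: revocable resources ignored in the quota stage.
old_share = max(alloc[r] / total[r] for r in total)  # 0.05

# New behavior: revocable resources counted in both allocation and pool.
new_share = max(
    (alloc.get(r, 0.0) + revocable_alloc.get(r, 0.0)) /
    (total[r] + revocable_total.get(r, 0.0))
    for r in total)                                  # 0.275

# The role's share jumps from 0.05 to 0.275, so it now sorts much later.
print(old_share, new_share)
```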

If you have concerns over this behavior change, feel free to chime in and
reach out.

Link to the ticket: MESOS-9802


-Meng


Re: [VOTE] Release Apache Mesos 1.8.0 (rc3)

2019-04-30 Thread Meng Zhu
+1
Tested on CentOS 7.4; only known flaky tests failed.

-Meng

On Tue, Apr 30, 2019 at 8:14 AM Alex Rukletsov  wrote:

> Modulo Jorge's comment (hope he'll come back soon),
>
> +1 (binding).
>
> This rc has been deployed on a cluster internally by us at Mesosphere and
> has been running without noticeable issues for a couple of days now.
>
> Alex.
>
> On Mon, Apr 29, 2019 at 10:05 PM Benno Evers 
> wrote:
>
> > Hi Jorge,
> >
> > I'm admittedly not too familiar with CUDA and tensorflow but the error
> > message you describe sounds to me more like a build issue, i.e. it sounds
> > like the version of the nvidia driver is different between the docker
> image
> > and the host system?
> >
> > Maybe you could continue investigating to see if this is related to the
> > release itself or caused by some external cause, and create a JIRA ticket
> > to capture your findings?
> >
> > Thanks,
> > Benno
> >
> > On Fri, Apr 26, 2019 at 9:55 PM Jorge Machado  wrote:
> >
> > > Hi all,
> > >
> > > Did someone test it on Ubuntu 18.04 + nvidia-docker2? We are having
> > > some issues using the cuda 10+ images when doing real processing. We
> > still
> > > need to check some things but basically we get:
> > >
> > > kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot
> > find working devices in this configuration
> > >
> > >
> > > Logs:
> > >
> > > I0424 13:27:14.00058630 executor.cpp:726] Forked command at 73
> > > Preparing rootfs at
> >
> '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
> > > Marked '/' as rslave
> > > Executing pre-exec command
> >
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
> > > Executing pre-exec command
> >
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
> > > Changing root to
> >
> /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
> > > 2019-04-24 13:27:18.346994: I
> > tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports
> > instructions that this TensorFlow binary was not compiled to use: AVX2
> FMA
> > > 2019-04-24 13:27:18.352203: E
> > tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to
> cuInit:
> > CUDA_ERROR_UNKNOWN: unknown error
> > > 2019-04-24 13:27:18.352243: I
> > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA
> > diagnostic information for host: __host__
> > > 2019-04-24 13:27:18.352252: I
> > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname:
> __host__
> > > 2019-04-24 13:27:18.352295: I
> > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported
> > version is: 410.48.0
> > 2019-04-24 13:27:18.352329: I
> > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported
> > version is: 418.56.0
> > 2019-04-24 13:27:18.352338: E
> > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version
> > 418.56.0 does not match DSO version 410.48.0 -- cannot find working
> > devices in this configuration
> > > 2019-04-24 13:27:18.374940: I
> > tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency:
> > 259392 Hz
> > > 2019-04-24 13:27:18.378793: I
> > tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f41e10
> > executing computations on platform Host. Devices:
> > > 2019-04-24 13:27:18.378821: I
> > tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device
> > (0): , 
> > > W0424 13:27:18.385210 140191267731200 deprecation.py:323] From
> >
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263:
> > colocate_with (from tensorflow.python.framework.ops) is deprecated and
> will
> > be removed in a future version.
> > > Instructions for updating:
> > > Colocations handled automatically by placer.
> > > W0424 13:27:18.399287 140191267731200 deprecation.py:323] From
> > /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129:
> > conv2d (from tensorflow.python.layers.convolutional) is deprecated and
> will
> > be removed in a future version.
> > > Instructions for updating:
> > > Use keras.layers.conv2d instead.
> > > W0424 13:27:18.433226 140191267731200 deprecation.py:323] From
> > /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261:
> > max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and
> > will be removed in a future version.
> > > Instructions for updating:
> > > Use keras.layers.max_pooling2d instead.
> > > 

Re: [VOTE] Release Apache Mesos 1.5.3 (rc1)

2019-03-13 Thread Meng Zhu
+1
Ran sudo make check on CentOS 7.4; only known flaky tests failed.

On Tue, Mar 12, 2019 at 4:44 PM Gilbert Song  wrote:

> +1 (binding).
>
> -Gilbert
>
> On Thu, Mar 7, 2019 at 10:09 AM Greg Mann  wrote:
>
> > +1 (binding)
> >
> > Ran through internal CI and observed only known flaky tests; almost all
> > configurations passed with no failures.
> >
> > Cheers,
> > Greg
> >
> > On Thu, Mar 7, 2019 at 1:55 AM Vinod Kone  wrote:
> >
> > > +1 (binding)
> > >
> > > Ran in ASF CI. Saw some flaky tests but otherwise looks good.
> > >
> > > *Revision*: b1dbba03af23b0222d11f2b7ae936d77ef42650d
> > >
> > >- refs/tags/1.5.3-rc1
> > >
> > > Configuration Matrix (all builds ran with --verbose
> > > --disable-libtool-wrappers --disable-parallel-test-execution; "SSL"
> > > builds add --enable-libevent --enable-ssl; per-build logs are under
> > > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/67/):
> > >
> > >   centos:7, SSL:          autotools: gcc Success, clang not run
> > >                           cmake:     gcc Success, clang not run
> > >   centos:7, non-SSL:      autotools: gcc Success, clang not run
> > >                           cmake:     gcc Success, clang not run
> > >   ubuntu:16.04, SSL:      autotools: gcc Success, clang Success
> > >                           cmake:     gcc Success, clang Success
> > >   ubuntu:16.04, non-SSL:  autotools: gcc Success, clang Success
> > >                           cmake:     gcc Success (rest truncated
> > >                           in the archive)

[RESULT][VOTE] Release Apache Mesos 1.4.3 (rc2)

2019-02-22 Thread Meng Zhu
Hi all,

The vote for Mesos 1.4.3 (rc2) has passed with the
following votes.

+1 (Binding)
--
Vinod Kone
Gastón Kleiman
Gilbert Song

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.4.3

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.4.3

The mesos-1.4.3.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Meng


Re: [VOTE] Release Apache Mesos 1.7.2 (rc1)

2019-02-22 Thread Meng Zhu
+1, ran through our internal CI; only flaky failures.

On Thu, Feb 21, 2019 at 11:41 AM Greg Mann  wrote:

> +1
>
> Built on CentOS 7.4 and ran all tests as root. Only 3 test failures were
> observed, all known flakes.
>
> Cheers,
> Greg
>
> On Wed, Feb 20, 2019 at 7:12 AM Vinod Kone  wrote:
>
>> +1
>>
>> Ran this on ASF CI.
>>
>> The red builds are due to a flaky infra issue and a known flaky test.
>>
>> *Revision*: 58cc918e9acc2865bb07047d3d2dff156d1708b2
>>
>>- refs/tags/1.7.2-rc1
>>
>> Configuration Matrix (all builds ran with --verbose
>> --disable-libtool-wrappers --disable-parallel-test-execution; "SSL"
>> builds add --enable-libevent --enable-ssl; per-build logs are under
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/66/):
>>
>>   centos:7, SSL:          autotools: gcc Failed, clang not run
>>                           cmake:     gcc Success, clang not run
>>   centos:7, non-SSL:      autotools: gcc Success, clang not run
>>                           cmake:     gcc Success, clang not run
>>   ubuntu:16.04, SSL:      autotools: gcc Success, clang Success
>>                           cmake:     gcc Success, clang Success
>>   ubuntu:16.04, non-SSL:  autotools: gcc Success, clang Success
>>                           cmake:     gcc Success (rest truncated
>>                           in the archive)

Re: [VOTE] Release Apache Mesos 1.6.2 (rc1)

2019-02-20 Thread Meng Zhu
+1 -- ran on CentOS 7.4 with only known flaky test failures

-Meng

On Wed, Feb 20, 2019 at 4:57 PM Gastón Kleiman  wrote:

> +1 (binding) — ran the build through Mesosphere's internal CI and only two
> known flaky tests failed.
>
> On Tue, Feb 19, 2019 at 11:56 AM Greg Mann  wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.6.2.
>>
>>
>> 1.6.2 includes a number of bug fixes since 1.6.1; the CHANGELOG for the
>> release is available at:
>>
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.2-rc1
>>
>> 
>>
>> The candidate for Mesos 1.6.2 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz
>>
>> The tag to be voted on is 1.6.2-rc1:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.2-rc1
>>
>> The SHA512 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz.sha512
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1246
>>
>> Please vote on releasing this package as Apache Mesos 1.6.2!
>>
>> The vote is open until Fri Feb 22 11:54 PST 2019, and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.6.2
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Greg
>>
>


[VOTE] Release Apache Mesos 1.4.3 (rc2)

2019-02-13 Thread Meng Zhu
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.4.3.

1.4.3 includes the following:

https://issues.apache.org/jira/issues/?filter=12345433

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.4.3-rc2


The candidate for Mesos 1.4.3 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc2/mesos-1.4.3.tar.gz

The tag to be voted on is 1.4.3-rc2:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.3-rc2

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc2/mesos-1.4.3.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc2/mesos-1.4.3.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1245

Please vote on releasing this package as Apache Mesos 1.4.3!

The vote is open until Mon Feb 18 18:27:30 PST 2019 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.4.3
[ ] -1 Do not release this package because ...

Thanks,
Meng


Re: [VOTE] Release Apache Mesos 1.7.1 (rc2)

2019-01-26 Thread Meng Zhu
+1

sudo make check on CentOS 7.4.

All failed tests are known to be flaky:

[  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
[  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
[  FAILED  ] NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_NvidiaDockerImage

-Meng

On Fri, Jan 18, 2019 at 2:59 PM Gilbert Song  wrote:

> +1 (binding).
>
> All tests passed except 5 failures (known flakiness) from our internal CI:
>
> FLAG=CMake,label=mesos-ec2-centos-7
>
>  mesos-ec2-centos-7-CMake.Mesos.CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
>
> FLAG=SSL,label=mesos-ec2-centos-7
>  mesos-ec2-centos-7-SSL.MESOS_TESTS_ABORTED.xml.[empty]
>
> FLAG=SSL,label=mesos-ec2-debian-9
>
>  
> mesos-ec2-debian-9-SSL.Mesos.FetcherCacheTest.CachedCustomOutputFileWithSubdirectory
>
> FLAG=SSL,label=mesos-ec2-ubuntu-16.04
>
>  
> mesos-ec2-ubuntu-16.04-SSL.Mesos.CniIsolatorTest.ROOT_INTERNET_CURL_LaunchCommandTask
>
> FLAG=SSL,label=mesos-ec2-centos-6
>
>  
> mesos-ec2-centos-6-SSL.Mesos.GarbageCollectorIntegrationTest.LongLivedDefaultExecutorRestart
>
> -Gilbert
>
> On Wed, Jan 16, 2019 at 2:24 PM Vinod Kone  wrote:
>
> > +1 (binding)
> >
> > Tested on ASF CI. Failing builds are due to a missing SSL dependency in
> > the Docker build file and a flaky test.
> >
> > *Revision*: d5678c3c5500cec72e22e775d9d048c55c128954
> >
> >- refs/tags/1.7.1-rc2
> >
> > Configuration Matrix (all builds ran with --verbose
> > --disable-libtool-wrappers --disable-parallel-test-execution; "SSL"
> > builds add --enable-libevent --enable-ssl; per-build logs are under
> > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/59/):
> >
> >   centos:7, SSL:          autotools: gcc Success, clang not run
> >                           cmake:     gcc Success, clang not run
> >   centos:7, non-SSL:      autotools: gcc Success, clang not run
> >                           cmake:     gcc Success, clang not run
> >   ubuntu:16.04, SSL:      autotools: gcc Failed, clang Failed
> >                           cmake:     gcc Failed, clang Failed
> >   ubuntu:16.04, non-SSL:  autotools: gcc Success (rest truncated
> >                           in the archive)

[VOTE] Release Apache Mesos 1.4.3 (rc1)

2019-01-25 Thread Meng Zhu
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.4.3.

1.4.3 includes the following:

https://issues.apache.org/jira/issues/?filter=12345433

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.4.3-rc1


The candidate for Mesos 1.4.3 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz

The tag to be voted on is 1.4.3-rc1:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.3-rc1

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1244

Please vote on releasing this package as Apache Mesos 1.4.3!

The vote is open until Mon Jan 30th 14:02:55 PST 2019 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.4.3
[ ] -1 Do not release this package because ...

Thanks,
Meng


Fwd: Quota 2.0 proposal

2019-01-25 Thread Meng Zhu
Hi folks:

During the design review meetings, the main discussion point was whether we
should allow setting quota guarantees for resources with specific
meta-data. The main use case is disks with profiles.

The current proposal in the doc is to only allow setting guarantees on
top-level resources such as "cpus" and "disk", while limits can be set on
any resource, even one with meta-data. There is a caveat, though: if limits
are set "underneath" the guarantee (e.g. a guarantee on disk co-exists with
a limit on disk with a specific profile), guarantees might not be satisfied
depending on the cluster usage.
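
To illustrate that caveat with made-up numbers: if most of a cluster's disk
carries the profile that the role's limit caps, a top-level disk guarantee
can become unsatisfiable.

```python
# Made-up numbers illustrating the guarantee-vs-profile-limit caveat.
total_disk = 100.0
fast_profile_disk = 80.0                  # subset of total_disk
other_disk = total_disk - fast_profile_disk

disk_guarantee = 50.0                     # guarantee on top-level "disk"
fast_profile_limit = 10.0                 # limit "underneath" the guarantee

# Upper bound on what the role can ever hold: all profile-less disk plus
# whatever the profile-scoped limit allows.
max_allocatable = other_disk + min(fast_profile_disk, fast_profile_limit)

print(max_allocatable >= disk_guarantee)  # False: 30 < 50, guarantee unmet
```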

This proposal does not support the use case of setting quota for disks with
profiles. This limitation and the caveat mentioned above are both due to
the quota propagation issue when there is a resource meta-data hierarchy as
explained in these related sections in the design doc
<https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit#heading=h.i4lsj45vylfu>
.

It looks like there are a few options here with regard to setting quotas on
disks with profiles:

1. Stick to the current proposal, but treat disks with a profile as a
top-level resource (think of this as something completely unrelated to
"disk", e.g. "cpus"), so that guarantees can be set on it.

2. Add support for setting guarantees on any meta-data resource, but with
restrictions such that once a guarantee or a limit is set on a resource
with meta-data, no more quotas can be configured for resources on the same
path in the meta-data hierarchy. For example, disk, disk with a fast
profile, and disk that comes from vendor A are all considered resources on
the same path. Once one type of resource has a quota, no other resource
types on the same path can have quotas (see the sketch after this list).

3. Add support for setting guarantees and limits on any meta-data
resources, but run the risk that guarantees might not be satisfied.

4. Add support for setting guarantees and limits on any meta-data
resources, and use a linear programming model to figure out how to
satisfy all the quotas.

5. Stick to the current proposal and do not support setting quotas on
disks with profiles.
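
Below is the sketch referenced in option 2. It is one plausible reading of
the "same path" restriction (our own modeling, not a proposed API): two
descriptors conflict unless their meta-data explicitly disagree on a shared
key.

```python
# Rough sketch of option 2's restriction: once one resource type on a
# meta-data path has a quota, no other type on the same path may get one.
# A descriptor is (top-level name, meta-data dict).

def same_path(a, b):
    (name_a, meta_a), (name_b, meta_b) = a, b
    if name_a != name_b:
        return False
    # Descriptors share a path unless they disagree on a shared key.
    shared = meta_a.keys() & meta_b.keys()
    return all(meta_a[k] == meta_b[k] for k in shared)

def can_set_quota(existing, new):
    return not any(same_path(d, new) for d in existing)

existing = [("disk", {"profile": "fast"})]
assert not can_set_quota(existing, ("disk", {}))               # plain disk
assert not can_set_quota(existing, ("disk", {"vendor": "A"}))  # may overlap
assert can_set_quota(existing, ("disk", {"profile": "slow"}))  # disjoint
assert can_set_quota(existing, ("cpus", {}))                   # other path
```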

Option 1 raises the question of whether we should treat resources like EBS
as something completely different from vanilla local disk. And if not (as
the option suggests), we need to update other parts of the system
accordingly; for example, endpoints, metrics, the allocator, etc. should
stop treating disk with a profile as "disk".

Option 2 seems too restrictive. It can be hard to reason about and unwieldy
for the user.

Option 3 would certainly be easy to use. But after setting up the
guarantees, users would expect them to be satisfied, which Mesos may not be
able to deliver. And when that happens, there is no easy explanation of why
the guarantees are not satisfied.

Option 4 allows and enforces all the guarantees optimally. However, it is
not clear what the performance implications of running an optimization
solver would be. Also, since guarantees are not part of the long-term plan
as we introduce priority tiers, we should ask whether it is worth the
complexity and effort.

Option 5 essentially kicks the can down the road, as the use case for
setting quotas on disks with profiles is not immediate. For the MVP, we
could stick to the design proposal and prepare to extend it when needs
arise (likely in the medium term).

Thoughts?

Thanks,
Meng

On Thu, Jan 24, 2019 at 9:58 AM Meng Zhu  wrote:

> After the API WG sync, we want to schedule a follow up meeting to discuss
> Quota 2.0 further. If you are interested, please join us at 12:30pm PST
> today (Jan 24th) with the zoom link below. Sorry for the short notice.
>
> -Meng
>
> Join Zoom Meeting https://zoom.us/j/574632536
> <https://www.google.com/url?q=https%3A%2F%2Fzoom.us%2Fj%2F574632536=D=1548784417513000=AFQjCNEiLMZoqWW2x5X0oH-AhrN2GlLAiQ>
> One tap mobile +16699006833,,574632536# US (San Jose)
> +16465588656,,574632536# US (New York) Dial by your location +1 669 900
> 6833 US (San Jose) +1 646 558 8656 US (New York) Meeting ID: 574 632 536
> Find your local number: https://zoom.us/u/acZYnvuO63
> <https://www.google.com/url?q=https%3A%2F%2Fzoom.us%2Fu%2FacZYnvuO63=D=1548784417513000=AFQjCNGCJXDosuVT9iEhjg_KeyoBZT4XxQ>
>
> On Sun, Jan 20, 2019 at 8:07 PM Meng Zhu  wrote:
>
>> Hi folks:
>>
>> I am excited to propose Quota 2.0 for better resource management on
>> Mesos, with explicit limits (decoupled from guarantee), generic quota
>> (which can be set on resources with metadata and on more generic resources
>> such as the number of containers) and bright shiny new APIs.
>>
>> You can find the design doc here
>> <https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit?usp=sharing>.
>> Please feel free to leave comments and suggestions.

Re: Quota 2.0 proposal

2019-01-24 Thread Meng Zhu
After the API WG sync, we want to schedule a follow up meeting to discuss
Quota 2.0 further. If you are interested, please join us at 12:30pm PST
today (Jan 24th) with the zoom link below. Sorry for the short notice.

-Meng

Join Zoom Meeting https://zoom.us/j/574632536
<https://www.google.com/url?q=https%3A%2F%2Fzoom.us%2Fj%2F574632536=D=1548784417513000=AFQjCNEiLMZoqWW2x5X0oH-AhrN2GlLAiQ>
One tap mobile +16699006833,,574632536# US (San Jose)
+16465588656,,574632536# US (New York) Dial by your location +1 669 900
6833 US (San Jose) +1 646 558 8656 US (New York) Meeting ID: 574 632 536
Find your local number: https://zoom.us/u/acZYnvuO63
<https://www.google.com/url?q=https%3A%2F%2Fzoom.us%2Fu%2FacZYnvuO63=D=1548784417513000=AFQjCNGCJXDosuVT9iEhjg_KeyoBZT4XxQ>

On Sun, Jan 20, 2019 at 8:07 PM Meng Zhu  wrote:

> Hi folks:
>
> I am excited to propose Quota 2.0 for better resource management on Mesos,
> with explicit limits (decoupled from guarantee), generic quota (which can
> be set on resources with metadata and on more generic resources such as the
> number of containers) and bright shiny new APIs.
>
> You can find the design doc here
> <https://docs.google.com/document/d/13vG5uH4YVwM79ErBPYAZfnqYFOBbUy2Lym0_9iAQ5Uk/edit?usp=sharing>.
> Please feel free to leave comments and suggestions.
>
> I have also put an agenda item for the upcoming API working group meeting
> on Tuesday (Jan 22nd, 11am PST), please join if you are interested.
>
> Thanks,
> Meng
>


Quota 2.0 proposal

2019-01-20 Thread Meng Zhu
Hi folks:

I am excited to propose Quota 2.0 for better resource management on Mesos,
with explicit limits (decoupled from guarantee), generic quota (which can
be set on resources with metadata and on more generic resources such as the
number of containers) and bright shiny new APIs.

You can find the design doc here.
Please feel free to leave comments and suggestions.

I have also put an agenda item for the upcoming API working group meeting
on Tuesday (Jan 22nd, 11am PST), please join if you are interested.

Thanks,
Meng


Re: Proposing `Quantity`

2018-12-31 Thread Meng Zhu
Hi James:

Thanks for the great feedback! We will keep your suggestions in mind when
doing specific designs such as the upcoming new API for setting limits.
Stay tuned!

-Meng

On Thu, Dec 20, 2018 at 12:55 PM James DeFelice 
wrote:

> I have concerns about using a "string" type for the value in Quantity. I
> left comments and suggestions in the doc. Thanks!
>
> On Mon, Dec 17, 2018 at 1:49 AM Meng Zhu  wrote:
>
> > Hello:
> >
> > After discussing with folks, we want to propose to use `string` in the
> > `Quantity` message and also leave out the name at the moment:
> >
> > message Quantity {
> >   required string value = 1;
> > }
> >
> > A sample use case for setting resource guarantee will be:
> >
> > message ResourceGuaranteeRequest {
> >   optional string role = 1;
> >   optional map<string, Quantity> resource_quantities = 2;
> > }
> >
> > I have updated the design doc here
> > <
> >
> https://docs.google.com/document/d/1WbRKmqsos1-IBJ9VjpT4kNIp2Peyvx2k4UphRaFVYFU/edit#
> > >
> > with more context, alternative proposals, and examples. Feel free to
> > leave comments and questions. Thanks!
> >
> > -Meng
> >
> > On Tue, Dec 11, 2018 at 12:20 PM Meng Zhu  wrote:
> >
> > > Hi:
> > >
> > > We are proposing to add a primitive type `Quantity` to facilitate
> several
> > > ongoing projects. The schema will be:
> > >
> > > message Quantity {
> > >   required string name = 1;
> > >   optional Value.Scalar scalar = 2;
> > > }
> > >
> > > You can find more details such as motivation, current, and future use
> > > cases in this design doc
> > > <
> >
> https://docs.google.com/document/d/1WbRKmqsos1-IBJ9VjpT4kNIp2Peyvx2k4UphRaFVYFU/edit?usp=sharing
> > >.
> > > Feel free to leave comments and questions.
> > >
> > > Thanks,
> > > Meng
> > >
> >
>
>
> --
> James DeFelice
> 585.241.9488 (voice)
> 650.649.6071 (fax)
>


Re: [VOTE] Release Apache Mesos 1.7.1 (rc1)

2018-12-28 Thread Meng Zhu
+1
`make check` passed on Ubuntu 18.04 with Clang 6.

-Meng

On Fri, Dec 21, 2018 at 2:48 PM Chun-Hung Hsiao  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.7.1.
>
>
> 1.7.1 includes the following:
>
> 
> * This is a bug fix release. Also includes performance and API
>   improvements:
>
>   * **Allocator**: Improved allocation cycle time substantially
> (see MESOS-9239 and MESOS-9249). These reduce the allocation
> cycle time in some benchmarks by 80%.
>
>   * **Scheduler API**: Improved the experimental `CREATE_DISK` and
> `DESTROY_DISK` operations for CSI volume recovery (see MESOS-9275
> and MESOS-9321). Storage local resource providers now return disk
> resources with the `source.vendor` field set, so frameworks need to
> upgrade the `Resource` protobuf definitions.
>
>   * **Scheduler API**: Offer operation feedbacks now present their agent
> IDs and resource provider IDs (see MESOS-9293).
>
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.1-rc1
>
> 
>
> The candidate for Mesos 1.7.1 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz
>
> The tag to be voted on is 1.7.1-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.1-rc1
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
>
> https://repository.apache.org/content/repositories/releases/org/apache/mesos/mesos/1.7.1-rc1/
>
> Please vote on releasing this package as Apache Mesos 1.7.1!
>
> To accommodate for the holidays, the vote is open until Mon Dec 31
> 14:00:00 PST 2018 and passes if a majority of at least 3 +1 PMC votes are
> cast.
>
> [ ] +1 Release this package as Apache Mesos 1.7.1
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Chun-Hung & Gaston
>


Re: Proposing `Quantity`

2018-12-16 Thread Meng Zhu
Hello:

After discussing with folks, we want to propose using `string` in the
`Quantity` message and also leaving out the name for the moment:

message Quantity {
  required string value = 1;
}

A sample use case for setting resource guarantee will be:

message ResourceGuaranteeRequest {
  optional string role = 1;
  optional map<string, Quantity> resource_quantities = 2;
}
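
For illustration only (hypothetical values, assuming a JSON rendering of
the messages above), such a request might look like:

```python
# Hypothetical JSON rendering of a ResourceGuaranteeRequest: quantities are
# keyed by resource name, and each Quantity carries a scalar value string.
request = {
    "role": "dev",
    "resource_quantities": {
        "cpus": {"value": "4"},
        "mem":  {"value": "1024"},   # megabytes, per Mesos convention
        "disk": {"value": "10240"},
    },
}
```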

I have updated the design doc here
<https://docs.google.com/document/d/1WbRKmqsos1-IBJ9VjpT4kNIp2Peyvx2k4UphRaFVYFU/edit#>
with more context, alternative proposals, and examples. Feel free to leave
comments and questions. Thanks!

-Meng

On Tue, Dec 11, 2018 at 12:20 PM Meng Zhu  wrote:

> Hi:
>
> We are proposing to add a primitive type `Quantity` to facilitate several
> ongoing projects. The schema will be:
>
> message Quantity {
>   required string name = 1;
>   optional Value.Scalar scalar = 2;
> }
>
> You can find more details such as motivation, current, and future use
> cases in this design doc
> <https://docs.google.com/document/d/1WbRKmqsos1-IBJ9VjpT4kNIp2Peyvx2k4UphRaFVYFU/edit?usp=sharing>.
> Feel free to leave comments and questions.
>
> Thanks,
> Meng
>


Proposing `Quantity`

2018-12-11 Thread Meng Zhu
Hi:

We are proposing to add a primitive type `Quantity` to facilitate several
ongoing projects. The schema will be:

message Quantity {
  required string name = 1;
  optional Value.Scalar scalar = 2;
}

You can find more details such as motivation, current, and future use cases
in this design doc.
Feel free to leave comments and questions.

Thanks,
Meng


Re: New scheduler API proposal: unsuppress and clear_filter

2018-12-10 Thread Meng Zhu
Thanks Ben. Some thoughts below:

>From a scheduler's perspective the difference between the two models is:
>
> (1) expressing "how much more" you need
> (2) expressing an offer "matcher"
>
> So:
>
> (1) covers the middle part of the demand quantity spectrum we currently
> have: unsuppressed -> infinite additional demand, suppressed -> 0
> additional demand, and now also unsuppressed w/ request of X -> X
> additional demand
>

I am not quite sure the middle ground (expressing "how much more") is
needed. Even with matchers, the framework may still find itself cycling
through several offers before finding the right resources. Setting an
"effective limit" will surely prolong this process. I guess the motivation
here is to avoid e.g. sending too many resources to a just-unsuppressed
framework that only wants to launch a small task. I would say the
inefficiency of flooding the framework with offers is tolerable as long as
the framework rejects most offers in time, since we are still making
progress. Even in cases where such limiting is desired (e.g. when the
number of frameworks is too large), I think it is more appropriate to rely
on operators to configure the cluster priority by e.g. setting limits than
to expect individual frameworks to perform such an altruistic action and
limit their own offers (while still having pending work).


> (2) is a global filtering mechanism to avoid getting offers in an unusable
> shape
>

Yeah, as you mentioned, I think we all agree that adding global matchers to
filter out undesired resources is a good direction, which I think is what
matters most here. The small difference lies in how the framework should
communicate this information: via a more declarative approach, or by
exposing the global matchers to frameworks directly.
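
For concreteness, a rough sketch (hypothetical names, our own modeling) of
the offer check the matcher model implies:

```python
# Rough sketch of the matcher model: an offer is sent to a framework only
# if some matcher matches it (or no matchers are registered) and no filter
# rejects it.
def should_offer(matchers, filters, resources):
    matched = (not matchers) or any(m(resources) for m in matchers)
    filtered = any(f(resources) for f in filters)
    return matched and not filtered

# Example: a matcher expressed as a minimum resource quantity, and a filter
# left over from declining a specific agent's resources.
matchers = [lambda r: r["cpus"] >= 1 and r["mem"] >= 10 * 1024]
filters = [lambda r: r.get("agent") == "agent-42"]

offer = {"agent": "agent-7", "cpus": 2, "mem": 32 * 1024}
assert should_offer(matchers, filters, offer)
```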


> They both solve inefficiencies we have, and they're complementary: a
> "request" could actually consist of (1) and (2), e.g. "I need an additional
> 10 cpus, 100GB mem, and I want offers to contain [1cpu, 10GB mem]".
>
> I'll schedule a meeting to discuss further. We should also make sure we
> come back to the original problem in this thread around REVIVE retries.
>
> On Mon, Dec 10, 2018 at 11:58 AM Benjamin Bannier <
> benjamin.bann...@mesosphere.io> wrote:
>
> > Hi Ben et al.,
> >
> > I'd expect frameworks to *always* know how to accept or decline offers in
> > general. More involved frameworks might know how to suppress offers. I
> > don't expect that any framework models filters and their associated
> > durations in detail (that's why I called them a Mesos implementation
> > detail) since there is not much benefit to a framework's primary goal of
> > running tasks as quickly as possible.
> >
> > > I couldn't quite tell how you were imagining this would work, but let
> me
> > spell out the two models that I've been considering, and you can tell me
> if
> > one of these matches what you had in mind or if you had a different model
> > in mind:
> >
> > > (1) "Effective limit" or "give me this much more" ...
> >
> > This sounds more like an operator-type than a framework-type API to me.
> > I'd assume that frameworks would not worry about their total limit the
> way
> > an operator would, but instead care about getting resources to run a
> > certain task at a point in time. I could also imagine this being easy to
> > use incorrectly as frameworks would likely need to understand their total
> > limit when issuing the call which could require state or coordination
> among
> > internal framework components (think: multi-purpose frameworks like
> > Marathon or Aurora).
> >
> > > (2) "Matchers" or "give me things that look like this": when a
> scheduler
> > expresses its "request" for a role, it would act as a "matcher" (opposite
> > of filter). When mesos is allocating resources, it only proceeds if
> > (requests.matches(resources) && !filters.filtered(resources)). The open
> > ended aspect here is what a matcher would consist of. Consider a case
> where
> > a matcher is a resource quantity and multiple are allowed; if any matcher
> > matches, the result is a match. This would be equivalent to letting
> > frameworks specify their own --min_allocatable_resources for a role
> (which
> > is something that has been considered). The "matchers" could be more
> > sophisticated: full resource objects just like filters (but global), full
> > resource objects but with quantities for non-scalar resources like ports,
> > etc.
> >
> > I was thinking in this direction, but what you described is more involved
> > than what I had in mind as a possible first attempt. I'd expect that
> > frameworks currently use `REVIVE` as a proxy for `REQUEST_RESOURCES`, not
> > as a way to manage their filter state tracked in the allocator. Assuming
> we
> > have some way to express resource quantities (i.e., MESOS-9314), we
> should
> > be able to improve on `REVIVE` by providing a `REQUEST_RESOURCES` which
> > clears all filters for resource containing the requested resources (or
> all
> > filters 

Re: New scheduler API proposal: unsuppress and clear_filter

2018-12-04 Thread Meng Zhu
Hi Benjamin:

Thanks for the great feedback.

I like the idea of giving frameworks more meaningful and fine-grained
control over which filters to remove, especially since this is likely to
help adoption. For example, letting the framework send an optional agentID
which instructs Mesos to only clear filters on that agent might help a task
launch with an agent constraint.

However, when it comes to framework-supplied desired resource profiles, we
should give it more thought. There is always the question of to what degree
we support the various meta-data in the resource schema. I feel the current
schema is too complex for expressing resource needs, let alone respecting
it in the allocator (even just for the purpose of removing filters). We
probably want to first introduce a more concise format (such as
resourceQuantity) for all purposes of specifying desired resource profiles
(clearing filters, quota guarantees, min_allocatable_resources, etc.) and
start from there.

I suggest just adding the optional agentID for now; we can always add
support for specifying resource requirements in the future. And since its
semantics are far from "requesting resources", I suggest keeping the name
CLEAR_FILTERS (or REMOVE_FILTERS).

What do you think?

-Meng

On Tue, Dec 4, 2018 at 1:50 AM Benjamin Bannier <
benjamin.bann...@mesosphere.io> wrote:

> Hi Meng,
>
> thanks for the proposal, I agree that the way these two aspects are
> currently entangled is an issue (e.g., for master/allocator performance
> reasons). At the same time, the workflow we currently expect frameworks to
> follow is conceptually not hard to grasp,
>
> (1) If framework has work then
> (i) put framework in unsuppressed state,
> (ii) decline not matching offers with a long filter duration.
> (2) If an offer matches, accept.
> (3) If there is no more work, suppress. GOTO (1).
>
> Here the framework does not need to track its filters across allocation
> cycles (they are an unexposed implementation detail of the hierarchical
> allocator anyway) which e.g., allows metaschedulers like Marathon or Apache
> Aurora to decouple the scheduling of different workloads. A downside of
> this interface is that
>
> * there is little incentive for frameworks to use SUPPRESS in addition to
> filters, and
> * unsuppression is all-or-nothing, forcing the master to send potentially
> all unused resources to one framework, even if it is only interested in a
> fraction. This can cause at least temporary non-optimal allocation
> behavior.
>
> It seems to me that even though adding UNSUPPRESS and CLEAR_FILTERS would
> give frameworks more control, it would only be a small improvement. In
> above framework workflow we would allow a small improvement if the
> framework knows that a new workload matches a previously running workflow
> (i.e., it can infer that no filters for the resources it is interested in
> is active) so that it can issue UNSUPPRESS instead of CLEAR_FILTERS.
> Incidentally, there seems little local benefit for frameworks to use these
> new calls as they’d mostly help the master and I’d imagine we wouldn’t want
> to imply that clearing filters would unsuppress the framework. This seems
> too little to me, and we run the danger that frameworks would just always
> pair UNSUPPRESS and CLEAR_FILTERS (or keep using REVIVE) to simplify their
> workflow. If we’d model the interface more along framework needs, there
> would be clear benefit which would help adoption.
>
> A more interesting call for me would be REQUEST_RESOURCES. It maps very
> well onto framework needs (e.g., “I want to launch a task requiring these
> resources”), and clearly communicates a requirement to the master so that
> it e.g., doesn’t need to remove all filters for a framework. It also seems
> to fit the allocator model pretty well which doesn’t explicitly expose
> filters. I believe implementing it should not be too hard if we'd restrict
> its semantics to only communicate to the master that a framework _is
> interested in a certain resource_ without promising that the framework
> _will get them in any amount of time_ (i.e., no need to rethink DRF
> fairness semantics in the hierarchical allocator). I also feel that if we
> have REQUEST_RESOURCES we would have some freedom to perform further
> improvements around filters in the master/allocator (e.g., filter
> compatification, work around increasing the default filter duration, …).
>
>
> A possible zeroth implementation for REQUEST_RESOURCES with the
> hierarchical allocator would be to have it remove any filters containing
> the requested resource and likely to unsuppress the framework. A
> REQUEST_RESOURCES call would hold an optional resource and an optional
> AgentID; the case where both are empty would map onto CLEAR_FILTERS.
>
>
> That being said, it mi

Re: New scheduler API proposal: unsuppress and clear_filter

2018-12-03 Thread Meng Zhu
See my comments inline.

On Mon, Dec 3, 2018 at 5:43 PM Vinod Kone  wrote:

> Thanks Meng for the explanation.
>
> I imagine most frameworks do not remember what stuff they filtered much
> less figure out how previously filtered stuff can satisfy new operations.
> That sounds complicated!
>

Frameworks do not need to remember what filters they currently have. Only
knowing the resource profiles of the current vs. the previous operation
would help a lot. But yeah, even this may be too much complexity.

>
> But I like your example. So a suggestion we could make to frameworks could
> be to use CLEAR_FILTERS when they have new work, e.g., scale up/down, new
> app (they might want to use this even if they aren't suppressed!); and to
> use UNSUPPRESS when they are rescheduling old work?
>

Yeah, these are the general guidelines.

I want to echo and reemphasize that CLEAR_FILTERS is orthogonal to
suppression. Frameworks should consider clearing filters regardless of
suppression.

Ideally, when there is new, different work, old irrelevant filters should be
cleared. This helps the framework get more offers and makes the allocator
run faster (filters can take up the bulk of allocation time when they build
up). On the flip side, calling CLEAR_FILTERS too often might also have
performance implications (esp. if the master/allocator actors are already
stressed).

Thoughts?
>
> On Mon, Dec 3, 2018 at 6:51 PM Meng Zhu  wrote:
>
> > Hi Vinod:
> >
> > Yeah, `CLEAR_FILTERS` sounds good.
> >
> > UNSUPPRESS should be used whenever a currently suppressed framework wants
> > to resume getting offers after a previous SUPPRESS call.
> >
> > As for `CLEAR_FILTERS`, the short (but not very useful) suggestion is to
> > call it whenever the framework wants to clear all the existing filters.
> >
> > To elaborate, a framework declines offers and accumulates filters when it
> > is trying to satisfy a particular set of requirements/constraints to
> > perform an operation. Once the operation is done and the next operation
> > comes, if the new operation has the same (or strictly more) resource
> > requirements/constraints compared to the last one, then it is more
> > efficient to KEEP the existing filters instead of getting useless offers
> > and rebuilding the filters again.
> >
> > On the other hand, if the requirements/constraints are different (i.e.
> > some of the previous requirements could be loosened), then it means the
> > existing filters no longer make sense. Then it might be a good idea to
> > clear all the existing filters to improve the chance of getting more
> > offers.
> >
> > Note, although we introduce `CLEAR_FILTERS` as part of decoupling the
> > `REVIVE` call, its usage should be independent of suppression/revival. The
> > decision to clear the filters only depends on whether the existing filters
> > make sense for the current operation constraints/requirements.
> >
> > Examples:
> > If a framework first launches a task, then wants to launch a replacement
> > task (because the first task failed), then it should keep the filters built
> > up during the first launch. However, if the framework wants to launch a
> > second task with a completely different resource profile, then clearing
> > filters might help to get more (otherwise filtered) offers and hence speed
> > up the deployment.
> >
> > -Meng
> >
> > On Mon, Dec 3, 2018 at 12:40 PM Vinod Kone  wrote:
> >
> > > Hi Meng,
> > >
> > > What would be the recommendation for framework authors on when to use
> > > UNSUPPRESS vs CLEAR_FILTER?
> > >
> > > Also, should it be CLEAR_FILTERS instead of CLEAR_FILTER?
> > >
> > > On Mon, Dec 3, 2018 at 2:26 PM Meng Zhu  wrote:
> > >
> > >> Hi:
> > >>
> > >> tl;dr: We are proposing to add two new V1 scheduler APIs: unsuppress and
> > >> clear_filter in order to decouple the dual-semantics of the current
> > >> revive call.
> > >>
> > >> As pointed out in the Mesos framework scalability guide
> > >> <http://mesos.apache.org/documentation/latest/app-framework-development-guide/#multi-scheduler-scalability>,
> > >> utilizing the suppress
> > >> <http://mesos.apache.org/documentation/latest/scheduler-http-api/#suppress>
> > >> call is the key to scaling your cluster to a large number of frameworks
> > >> <https://schd.ws/hosted_files/mesoscon18/84/Scaling%20Mesos%20to%20Thousands%20of%20Frameworks.pdf>.
> >

Re: New scheduler API proposal: unsuppress and clear_filter

2018-12-03 Thread Meng Zhu
Hi Vinod:

Yeah, `CLEAR_FILTERS` sounds good.

UNSUPPRESS should be used whenever a currently suppressed framework wants to
resume getting offers after a previous SUPPRESS call.

As for `CLEAR_FILTERS`, the short (but not very useful) suggestion is to
call it whenever the framework wants to clear all the existing filters.

To elaborate, a framework declines offers and accumulates filters when it is
trying to satisfy a particular set of requirements/constraints to perform
an operation. Once the operation is done and the next operation comes, if
the new operation has the same (or strictly more) resource
requirements/constraints compared to the last one, then it is more
efficient to KEEP the existing filters instead of getting useless offers
and rebuilding the filters again.

On the other hand, if the requirements/constraints are different (i.e. some
of the previous requirements could be loosened), then it means the existing
filters no longer make sense. Then it might be a good idea to clear all the
existing filters to improve the chance of getting more offers.

Note, although we introduce `CLEAR_FILTERS` as part of decoupling the
`REVIVE` call, its usage should be independent of suppression/revival. The
decision to clear the filters only depends on whether the existing filters
make sense for the current operation constraints/requirements.

Examples:
If a framework first launches a task, then wants to launch a replacement
task (because the first task failed), then it should keep the filters built
up during the first launch. However, if the framework wants to launch a
second task with a completely different resource profile, then clearing
filters might help to get more (otherwise filtered) offers and hence speed
up the deployment.
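
A minimal sketch of this guideline, assuming a hypothetical driver
interface and modeling a resource profile as a set of requirement strings
(both invented for illustration; not the actual Mesos API):

#include <algorithm>
#include <set>
#include <string>

using ResourceProfile = std::set<std::string>;

struct Driver
{
  void unsuppress() { /* send the proposed UNSUPPRESS call */ }
  void clearFilters() { /* send the proposed CLEAR_FILTERS call */ }
};

void onNewWork(
    Driver& driver,
    const ResourceProfile& next,      // requirements of the new operation
    const ResourceProfile& previous)  // requirements of the last operation
{
  // Resume offers in case we were suppressed; independent of filters.
  driver.unsuppress();

  // Same or strictly stronger requirements: keep the filters, since offers
  // declined for the previous operation would not fit the new one either.
  bool sameOrStronger = std::includes(
      next.begin(), next.end(), previous.begin(), previous.end());

  if (!sameOrStronger) {
    // Some previous requirements were loosened, so existing filters may
    // hide offers that are now useful.
    driver.clearFilters();
  }
}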

-Meng

On Mon, Dec 3, 2018 at 12:40 PM Vinod Kone  wrote:

> Hi Meng,
>
> What would be the recommendation for framework authors on when to use
> UNSUPPRESS vs CLEAR_FILTER?
>
> Also, should it be CLEAR_FILTERS instead of CLEAR_FILTER?
>
> On Mon, Dec 3, 2018 at 2:26 PM Meng Zhu  wrote:
>
>> Hi:
>>
>> tl;dr: We are proposing to add two new V1 scheduler APIs: unsuppress and
>> clear_filter in order to decouple the dual-semantics of the current revive
>> call.
>>
>> As pointed out in the Mesos framework scalability guide
>> <http://mesos.apache.org/documentation/latest/app-framework-development-guide/#multi-scheduler-scalability>,
>> utilizing the suppress
>> <http://mesos.apache.org/documentation/latest/scheduler-http-api/#suppress>
>> call is the key to scaling your cluster to a large number of frameworks
>> <https://schd.ws/hosted_files/mesoscon18/84/Scaling%20Mesos%20to%20Thousands%20of%20Frameworks.pdf>.
>> In short, when a framework is idling with no intention to launch any tasks,
>> it should suppress to inform Mesos to stop sending any more offers. And
>> the framework should revive
>> <http://mesos.apache.org/documentation/latest/scheduler-http-api/#revive>
>> when new work arrives. This way, the allocator will skip the framework when
>> performing resource allocations. As a result, thorny issues such as offer
>> starvation and resource fragmentation would be greatly mitigated.
>>
>> That being said, the suppress/revive calls currently are a little bit
>> unwieldy due to MESOS-9028
>> <https://issues.apache.org/jira/browse/MESOS-9028>:
>>
>> The revive call has two semantics. It unsuppresses the framework AND
>> clears all the existing filters. The latter makes the revive call
>> non-idempotent. And sometimes users may want to keep the existing filters
>> when reviving, which is not possible atm.
>>
>> To decouple the semantics, as suggested in the ticket, we propose to add
>> two new V1 scheduler calls:
>>
>> (1) `UNSUPPRESS` call requests Mesos to resume sending offers;
>> (2) `CLEAR_FILTER` call will explicitly clear all the existing filters.
>>
>> To make life easier, both calls will return 200 OK (as opposed to 202
>> returned by most existing scheduler calls, including `SUPPRESS` and
>> `REVIVE`).
>>
>> We will keep the revive call and its semantics (i.e. unsuppress AND
>> clear filters) for backward compatibility.
>>
>> Note, the changes are proposed for V1 API only. Thus, once the changes
>> are landed, framework developers are encouraged to move to V1 API to take
>> advantage of the new calls (among many other benefits).
>>
>> Any feedback/comments are welcome.
>>
>> -Meng
>>
>


New scheduler API proposal: unsuppress and clear_filter

2018-12-03 Thread Meng Zhu
Hi:

tl;dr: We are proposing to add two new V1 scheduler APIs: unsuppress and
clear_filter in order to decouple the dual-semantics of the current revive
call.

As pointed out in the Mesos framework scalability guide
<http://mesos.apache.org/documentation/latest/app-framework-development-guide/#multi-scheduler-scalability>,
utilizing the suppress
<http://mesos.apache.org/documentation/latest/scheduler-http-api/#suppress>
call is the key to scaling your cluster to a large number of frameworks
<https://schd.ws/hosted_files/mesoscon18/84/Scaling%20Mesos%20to%20Thousands%20of%20Frameworks.pdf>.
In short, when a framework is idling with no intention to launch any tasks,
it should suppress to inform Mesos to stop sending any more offers. And
the framework should revive
<http://mesos.apache.org/documentation/latest/scheduler-http-api/#revive>
when new work arrives. This way, the allocator will skip the framework when
performing resource allocations. As a result, thorny issues such as offer
starvation and resource fragmentation would be greatly mitigated.
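
As a sketch, this suppress-when-idle, revive-on-new-work loop might look as
follows, with hypothetical `Driver`, `Offer`, and `Work` stubs (the real v1
scheduler API differs):

#include <queue>
#include <vector>

// Hypothetical stubs for illustration only.
struct Work {};

struct Offer
{
  bool matches(const Work&) const { return true; }  // placeholder predicate
};

struct Driver
{
  void revive() {}    // resume receiving offers
  void suppress() {}  // stop receiving offers
  void accept(const Offer&) {}
  void decline(const Offer&, double filterSeconds) {}
};

// The scheduler calls driver.revive() whenever new work is queued; then,
// on each batch of offers:
void onOffers(
    Driver& driver,
    std::queue<Work>& work,
    const std::vector<Offer>& offers)
{
  for (const Offer& offer : offers) {
    if (!work.empty() && offer.matches(work.front())) {
      driver.accept(offer);  // launch on a matching offer
      work.pop();
    } else {
      driver.decline(offer, 3600.0);  // long filter on non-matching offers
    }
  }

  if (work.empty()) {
    driver.suppress();  // idle: stop offers until new work arrives
  }
}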

That being said, the suppress/revive calls currently are a little bit
unwieldy due to MESOS-9028
<https://issues.apache.org/jira/browse/MESOS-9028>:

The revive call has two semantics. It unsuppresses the framework AND clears
all the existing filters. The latter makes the revive call non-idempotent.
And sometimes users may want to keep the existing filters when reviving,
which is not possible atm.

To decouple the semantics, as suggested in the ticket, we propose to add
two new V1 scheduler calls:

(1) `UNSUPPRESS` call requests Mesos to resume sending offers;
(2) `CLEAR_FILTER` call will explicitly clear all the existing filters.

To make life easier, both calls will return 200 OK (as opposed to 202
returned by most existing scheduler calls, including `SUPPRESS` and
`REVIVE`).
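
For illustration, assuming the new calls take the same shape as existing v1
scheduler calls such as `SUPPRESS` (the framework ID below is made up), an
`UNSUPPRESS` exchange might look like:

POST /api/v1/scheduler HTTP/1.1
Host: master.example.com:5050
Content-Type: application/json

{
  "framework_id": {"value": "85d1eb86-6d42-4be8-8231-f91b7d6d2b36-0001"},
  "type": "UNSUPPRESS"
}

HTTP/1.1 200 OK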

We will keep the revive call and its semantics (i.e. unsuppress AND clear
filters) for backward compatibility.

Note, the changes are proposed for V1 API only. Thus, once the changes are
landed, framework developers are encouraged to move to V1 API to take
advantage of the new calls (among many other benefits).

Any feedback/comments are welcome.

-Meng


Re: [VOTE] Release Apache Mesos 1.5.2 (rc2)

2018-11-22 Thread Meng Zhu
+1
make check on Ubuntu 18.04

On Wed, Oct 31, 2018 at 4:26 PM Gilbert Song  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.5.2.
>
> 1.5.2 includes the following:
>
> 
>   * [MESOS-3790] - ZooKeeper connection should retry on `EAI_NONAME`.
>   * [MESOS-8128] - Make os::pipe file descriptors O_CLOEXEC.
>   * [MESOS-8418] - mesos-agent high cpu usage because of numerous
> /proc/mounts reads.
>   * [MESOS-8545] -
> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
>   * [MESOS-8568] - Command checks should always call
> `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`.
>   * [MESOS-8620] - Containers stuck in FETCHING possibly due to
> unresponsive server.
>   * [MESOS-8830] - Agent gc on old slave sandboxes could empty persistent
> volume data.
>   * [MESOS-8871] - Agent may fail to recover if the agent dies before
> image store cache checkpointed.
>   * [MESOS-8904] - Master crash when removing quota.
>   * [MESOS-8906] - `UriDiskProfileAdaptor` fails to update profile
> selectors.
>   * [MESOS-8907] - Docker image fetcher fails with HTTP/2.
>   * [MESOS-8917] - Agent leaking file descriptors into forked processes.
>   * [MESOS-8921] - Autotools don't work with newer OpenJDK versions.
>   * [MESOS-8935] - Quota limit "chopping" can lead to cpu-only and
> memory-only offers.
>   * [MESOS-8936] - Implement a Random Sorter for offer allocations.
>   * [MESOS-8942] - Master streaming API does not send (health) check
> updates for tasks.
>   * [MESOS-8945] - Master check failure due to CHECK_SOME(providerId).
>   * [MESOS-8947] - Improve the container preparing logging in
> IOSwitchboard and volume/secret isolator.
>   * [MESOS-8952] - process::await/collect n^2 performance issue.
>   * [MESOS-8963] - Executor crash trying to print container ID.
>   * [MESOS-8978] - Command executor calling setsid breaks the tty support.
>   * [MESOS-8980] - mesos-slave can deadlock with docker pull.
>   * [MESOS-8986] - `slave.available()` in the allocator is expensive and
> drags down allocation performance.
>   * [MESOS-8987] - Master asks agent to shutdown upon auth errors.
>   * [MESOS-9024] - Mesos master segfaults with stack overflow under load.
>   * [MESOS-9049] - Agent GC could unmount a dangling persistent volume
> multiple times.
>   * [MESOS-9116] - Launch nested container session fails due to incorrect
> detection of `mnt` namespace of command executor's task.
>   * [MESOS-9125] - Port mapper CNI plugin might fail with "Resource
> temporarily unavailable".
>   * [MESOS-9127] - Port mapper CNI plugin might deadlock iptables on the
> agent.
>   * [MESOS-9131] - Health checks launching nested containers while a
> container is being destroyed lead to unkillable tasks.
>   * [MESOS-9142] - CNI detach might fail due to missing network config
> file.
>   * [MESOS-9144] - Master authentication handling leads to request
> amplification.
>   * [MESOS-9145] - Master has a fragile burned-in 5s authentication
> timeout.
>   * [MESOS-9146] - Agent has a fragile burn-in 5s authentication timeout.
>   * [MESOS-9147] - Agent and scheduler driver authentication retry backoff
> time could overflow.
>   * [MESOS-9151] - Container stuck at ISOLATING due to FD leak.
>   * [MESOS-9170] - Zookeeper doesn't compile with newer gcc due to format
> error.
>   * [MESOS-9196] - Removing rootfs mounts may fail with EBUSY.
>   * [MESOS-9231] - `docker inspect` may return an unexpected result to
> Docker executor due to a race condition.
>   * [MESOS-9267] - Mesos agent crashes when CNI network is not configured
> but used.
>   * [MESOS-9279] - Docker Containerizer 'usage' call might be expensive if
> mount table is big.
>   * [MESOS-9283] - Docker containerizer actor can get backlogged with
> large number of containers.
>   * [MESOS-9305] - Create cgroup recursively to workaround systemd deleting
> cgroups_root.
>   * [MESOS-9308] - URI disk profile adaptor could deadlock.
>   * [MESOS-9334] - Container stuck at ISOLATING state due to libevent poll
> never returns.
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.2-rc2
>
> 
>
> The candidate for Mesos 1.5.2 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.2-rc2/mesos-1.5.2.tar.gz
>
> The tag to be voted on is 1.5.2-rc2:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.2-rc2
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.5.2-rc2/mesos-1.5.2.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.5.2-rc2/mesos-1.5.2.tar.gz.asc
>
> The PGP key used to sign the release is here:
> 

Proposing Minimum Capability to Safeguard Downgrade

2018-06-14 Thread Meng Zhu
Hi:

A common use case for downgrade is rolling back from problematic upgrades.
Mesos promises compatibility between any 1.x and 1.y versions of
masters/agents as long as new features are not used. However, currently
there is no easy way to tell whether any “new” features are being used, and
any incompatible downgrade would silently result in undefined behavior
instead of failing safe. This is not ideal.

We want to help operators make informed downgrade decisions and take
correct actions (e.g. deactivate the use of certain new features) if
necessary. To this end, we propose adding minimum component capability.
Please check out the doc below for more details. Feel free to comment in
the doc! Thanks!

JIRA: MESOS-8878 <https://issues.apache.org/jira/browse/MESOS-8878>
Design proposal

-Meng


Proposing change to the allocatable check in the allocator

2018-06-11 Thread Meng Zhu
Hi:

The allocatable check in the allocator (shown below) was originally
introduced to help alleviate the situation where a framework receives some
resources, but no cpu/memory, and thus cannot launch a task.

constexpr double MIN_CPUS = 0.01;
constexpr Bytes MIN_MEM = Megabytes(32);

bool HierarchicalAllocatorProcess::allocatable(
    const Resources& resources)
{
  Option<double> cpus = resources.cpus();
  Option<Bytes> mem = resources.mem();

  return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
         (mem.isSome() && mem.get() >= MIN_MEM);
}


Issues

However, there have been a couple of issues surfacing lately surrounding the
check.

- MESOS-8935 Quota limit "chopping" can lead to cpu-only and memory-only
offers.

We introduced fine-grained quota allocation (MESOS-7099) in Mesos 1.5. When
we allocate resources to a role, we'll "chop" the available resources of
the agent up to the quota limit for the role. However, this has the
unintended consequence of creating cpu-only and memory-only offers, even
though there might be other agents with both cpu and memory resources
available in the cluster.

- MESOS-8626 The 'allocatable' check in the allocator is problematic with
multi-role frameworks.

Consider roleA reserved cpu/memory on an agent and roleB reserved disk on
the same agent. A framework under both roleA and roleB will not be able to
get the reserved disk due to the allocatable check. With the introduction
of resource providers, similar situations will become more common.

Proposed change

Instead of hardcoding a one-size-fits-all value in Mesos, we are proposing
to add a new master flag min_allocatable_resources. It specifies one or
more scalar resource quantities that define the minimum allocatable
resources for the allocator. The allocator will only offer resources that
contain at least one of the specified resource quantities. The default
behavior *is backward compatible*, i.e. by default the flag is set to
“cpus:0.01|mem:32”.

Usage

The flag takes either a simple text of resource(s) delimited by a bar (|)
or a JSON array of JSON-formatted resources. Note, the input should be
“pure” scalar quantities, i.e. the specified resource(s) should only have
the name, type (set to scalar) and scalar fields set.

Examples:

- To eliminate cpu-only or memory-only offers due to the quota chopping,
we could set the flag to “cpus:0.01;mem:32”.
- To enable disk-only offers, we could set the flag to “disk:32”.
- For both, we could set the flag to “cpus:0.01;mem:32|disk:32”. Then the
allocator will only offer resources that at least contain
“cpus:0.01;mem:32” OR resources that at least contain “disk:32”. A sketch
of such a generalized check appears after this list.

Let me know what you think! Thanks!


-Meng


Re: Adding a `FLAKY` label to flaky unit tests

2018-03-29 Thread Meng Zhu
+1, the advantages are appealing.

Though I am afraid that this will probably reduce the incentive to fix
flaky tests.

-Meng

On Thu, Mar 29, 2018 at 9:45 AM, Benno Evers  wrote:

> Hi all,
>
> if you're regularly running Mesos unit tests, e.g. because you've set up a
> CI system, you probably noticed that there is a lot of noise in the results
> due to flaky tests.
>
> As a measure to ease the pain, what do you think about adding a `FLAKY`
> label to known flaky unit tests, similar to how we have `ROOT`, `INTERNET`,
> `DISABLED`, etc. right now?
>
> The advantages, in my opinion, would be:
>  - Looking at test results, it would be immediately visible whether a test
> failure was known flaky or not without going to JIRA
>  - People who want to reduce noise can disable all known flaky tests by a
> simple gtest filter
>  - People who want to can still run the flaky tests easier than if they get
> disabled outright
>  - With a little bit of scripting, it would be possible to add logic like
> "for flaky tests, run them 10 times and only report a failure if more than
> x% of the runs fail."
>
> What do you think?
>
> Best regards,
> --
> Benno Evers
> Software Engineer, Mesosphere
>


Re: Welcome Chun-Hung Hsiao as Mesos Committer and PMC Member

2018-03-12 Thread Meng Zhu
Congrats Chun! Well deserved!

On Mon, Mar 12, 2018 at 10:09 AM, Zhitao Li  wrote:

> Congrats, Chun!
>
> On Sun, Mar 11, 2018 at 11:47 PM, Gilbert Song 
> wrote:
>
> > Congrats, Chun!
> >
> > It is great to have you in the community!
> >
> > - Gilbert
> >
> > On Sun, Mar 11, 2018 at 4:40 PM, Andrew Schwartzmeyer <
> > and...@schwartzmeyer.com> wrote:
> >
> > > Congratulations Chun!
> > >
> > > I apologize for not also giving you a +1, as I certainly would have,
> but
> > > just discovered my mailing list isn't working. Just a heads up, don't
> let
> > > that happen to you too!
> > >
> > > I look forward to continuing to work with you.
> > >
> > > Cheers,
> > >
> > > Andy
> > >
> > >
> > > On 03/10/2018 9:14 pm, Jie Yu wrote:
> > >
> > >> Hi,
> > >>
> > >> I am happy to announce that the PMC has voted Chun-Hung Hsiao as a new
> > >> committer and member of PMC for the Apache Mesos project. Please join
> me
> > >> to
> > >> congratulate him!
> > >>
> > >> Chun has been an active contributor for the past year. His main
> > >> contributions to the project include:
> > >> * Designed and implemented gRPC client support to libprocess
> > (MESOS-7749)
> > >> * Designed and implemented Storage Local Resource Provider
> (MESOS-7235,
> > >> MESOS-8374)
> > >> * Implemented part of the CSI support (MESOS-7235, MESOS-8374)
> > >>
> > >> Chun is friendly and humble, but also intelligent, insightful, and
> > >> opinionated. I am confident that he will be a great addition to our
> > >> committer pool. Thanks Chun for all your contributions to the project
> so
> > >> far!
> > >>
> > >> His committer checklist can be found here:
> > >> https://docs.google.com/document/d/1FjroAvjGa5NdP29zM7-2eg6t
> > >> LPAzQRMUmCorytdEI_U/edit?usp=sharing
> > >>
> > >> - Jie
> > >>
> > >
> > >
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>


Re: Tasks may be explicitly dropped by agent in Mesos 1.5

2018-03-02 Thread Meng Zhu
CORRECTION:

This is a new behavior that only appears in the current 1.5.x branch. In
1.5.0, the Mesos agent still has the old behavior, namely, any reordered
tasks (to the same executor) are launched regardless.

On Fri, Mar 2, 2018 at 9:41 AM, Chun-Hung Hsiao <chhs...@mesosphere.io>
wrote:

> Gilbert I think you're right. The code path doesn't exist in 1.5.0.
>
> On Mar 2, 2018 9:36 AM, "Chun-Hung Hsiao" <chhs...@mesosphere.io> wrote:
>
> > This is a new behavior we have after solving MESOS-1720, and thus a new
> > problem only in 1.5.x. Prior to 1.5, reordered tasks (to the same
> executor)
> > will be launched because whoever comes first will launch the executor.
> > Since 1.5, one might be dropped.
> >
> > On Mar 1, 2018 4:36 PM, "Gilbert Song" <gilb...@mesosphere.io> wrote:
> >
> >> Meng,
> >>
> >> Could you double check if this is really an issue in Mesos 1.5.0
> release?
> >>
> >> MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was
> >> resolved
> >> after the 1.5 release (rc-2) and it seems like
> >> it is only at the master branch and 1.5.x branch (not 1.5.0).
> >>
> >> Did I miss anything?
> >>
> >> - Gilbert
> >>
> >> On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bmah...@apache.org>
> >> wrote:
> >>
> >> > Put another way, we currently don't guarantee in-order task delivery
> to
> >> > the executor. Due to the changes for MESOS-1720, one special case of
> >> task
> >> > re-ordering now leads to the re-ordered task being dropped (rather
> than
> >> > delivered out-of-order as before). Technically, this is strictly
> better.
> >> >
> >> > However, we'd like to start guaranteeing in-order task delivery.
> >> >
> >> > On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <m...@mesosphere.com> wrote:
> >> >
> >> >> Hi all:
> >> >>
> >> >> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
> >> >> if all three conditions are met:
> >> >> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
> >> >>  use the same executor.
> >> >> (2) The executor currently does not exist on the agent.
> >> >> (3) Due to some race conditions, these tasks are trying to launch
> >> >> on the agent in a different order from their original launch order.
> >> >>
> >> >> In this case, tasks that are trying to launch on the agent
> >> >> before the first task in the original order will be explicitly dropped
> >> >> by the agent (`TASK_DROPPED` or `TASK_LOST` will be sent).
> >> >>
> >> >> This bug will be fixed in 1.5.1. It is tracked in
> >> >> https://issues.apache.org/jira/browse/MESOS-8624
> >> >>
> >> >> 
> >> >>
> >> >> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
> >> >> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP` calls to a new
> >> >> executor. The master would specify that the first call is the one to
> >> >> launch a new executor through the `launch_executor` field in
> >> >> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
> >> >> use the existing executor launched by the first one.
> >> >>
> >> >> On the agent side, running a task/task group goes through a series of
> >> >> continuations: one is `collect()` on the future that unschedules
> >> >> frameworks from being GC'ed:
> >> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
> >> >> another is `collect()` on task authorization:
> >> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
> >> >> Since these `collect()` calls run on individual actors, the futures of
> >> >> the `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
> >> >> out-of-order, even if the futures these two `collect()` calls wait for
> >> >> are satisfied in order (which is true in these two cases).
> >> >>
> >> >> As a result, under some race conditions (probably under some heavy load
> >> >> conditions), tasks that rely on the previous task to launch the executor
> >> >> may get processed before the task that is supposed to launch the
> >> >> executor first, resulting in the tasks being explicitly dropped by the
> >> >> agent.
> >> >>
> >> >> -Meng
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >
>


Tasks may be explicitly dropped by agent in Mesos 1.5

2018-03-01 Thread Meng Zhu
Hi all:

TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
if all three conditions are met:
(1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
 use the same executor.
(2) The executor currently does not exist on the agent.
(3) Due to some race conditions, these tasks are trying to launch
on the agent in a different order from their original launch order.

In this case, tasks that are trying to launch on the agent
before the first task in the original order will be explicitly dropped by
the agent (`TASK_DROPPED` or `TASK_LOST` will be sent).

This bug will be fixed in 1.5.1. It is tracked in
https://issues.apache.org/jira/browse/MESOS-8624



In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
calls to a new executor. The master would specify that the first call is the
one to launch a new executor through the `launch_executor` field in
`RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
use the existing executor launched by the first one.

On the agent side, running a task/task group goes through a series of
continuations: one is `collect()` on the future that unschedules frameworks
from being GC'ed:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
another is `collect()` on task authorization:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
Since these `collect()` calls run on individual actors, the futures of the
`collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
out-of-order, even if the futures these two `collect()` calls wait for are
satisfied in order (which is true in these two cases).

As a result, under some race conditions (probably under some heavy load
conditions), tasks that rely on the previous task to launch the executor may
get processed before the task that is supposed to launch the executor
first, resulting in those tasks being explicitly dropped by the agent.
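
To illustrate the hazard, here is a self-contained toy model using
std::async threads in place of libprocess actors (this is not Mesos code).
The two "continuations" are triggered strictly in order, yet may complete
out of order:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main()
{
  std::promise<void> ready1, ready2;

  // Continuation for the first task (supposed to launch the executor).
  auto c1 = std::async(std::launch::async, [&] {
    ready1.get_future().wait();
    // Simulate this actor being busy, e.g. an agent under heavy load.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::cout << "task 1: launch executor" << std::endl;
  });

  // Continuation for the second task (expects the executor to exist).
  auto c2 = std::async(std::launch::async, [&] {
    ready2.get_future().wait();
    std::cout << "task 2: use existing executor" << std::endl;
  });

  ready1.set_value();  // preconditions are satisfied in order...
  ready2.set_value();

  c1.wait();
  c2.wait();

  // ...yet "task 2" typically prints first: the continuations completed
  // out of order, which is the reordering described above.
  return 0;
}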

-Meng