Re: [VOTE] Release Apache Mesos 1.9.0 (rc3)

2019-09-04 Thread Chun-Hung Hsiao
+1 (binding)

make distcheck on Ubuntu 16.04 and 18.04.

On 18.04 I got the following known failure:
[  FAILED  ] DockerFetcherPluginTest.INTERNET_CURL_FetchBlob

Also, the mesos-gtest-runner invoked by make distcheck does not seem to work
on either platform.

On Tue, Sep 3, 2019 at 1:34 PM Gilbert Song  wrote:

> +1 (binding).
>
> Tested on our internal CI. Green on most of the platforms:
> Configuration matrix (Mesos_CI-build 6388):
>
> Platform                Plain    SSL       CMake    Clang    BUILD_ISOLATORS
> mac                     Not run  Success   Not run  Not run  Not run
> mesos-ec2-centos-6      Not run  Unstable  Not run  Not run  Not run
> mesos-ec2-centos-7      Success  Success   Success  Not run  Not run
> mesos-ec2-debian-8      Not run  Unstable  Not run  Not run  Not run
> mesos-ec2-debian-9      Not run  Success   Not run  Not run  Not run
> mesos-ec2-ubuntu-14.04  Not run  Success   Not run  Not run  Not run
> mesos-ec2-ubuntu-16.04  Success  Success   Success  Success  Success
>
> Each result that ran links to
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/6388/FLAG=<configuration>,label=<platform>/
>
> The only two failed tests are known to be flaky:
>
>- mesos-ec2-centos-6-SSL.Mesos.ExecutorAuthorizationTest.FailedSubscribe
>- mesos-ec2-debian-8-SSL.MESOS_TESTS_ABORTED.xml.[empty]
>
> -Gilbert
>
> On Tue, Sep 3, 2019 at 10:43 AM Vinod Kone  wrote:
>
> > +1 (binding)
> >
> > Tested on ASF CI.
> >
> > *Revision*: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e
> >
> >- refs/tags/1.9.0-rc3
> >
> > Configuration Matrix gcc clang
> > centos:7 --verbose --disable-libtool-wrappers
> > --disable-parallel-test-execution --enable-libevent --enable-ssl
> autotools
> > [image: Success]
> > <
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/75/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop%7C%7Cbeam)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
> >
> > [image: Not run]
> > cmake
> > [image: Success]
> > <
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/75/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop%7C%7Cbeam)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
> >
> > [image: Not run]
> > --verbose --disable-libtool-wrappers --disable-parallel-test-execution
> > autotools
> > [image: Success]
> > <
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/75/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop%7C%7Cbeam)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
> >
> > [image: Not run]
> > cmake
> > [image: Success]
> > <
> 

Re: [VOTE] Release Apache Mesos 1.9.0 (rc2)

2019-08-29 Thread Chun-Hung Hsiao
-1 for https://issues.apache.org/jira/browse/MESOS-9956. I'm working on a
fix for it.

On Wed, Aug 28, 2019 at 4:13 AM Qian Zhang  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.9.0.
>
>
> 1.9.0 includes the following:
>
> 
> * Agent draining
> * Support configurable /dev/shm and IPC namespace.
> * Containerizer debug endpoint.
> * Add `no-new-privileges` isolator.
> * Client side SSL certificate verification in Libprocess.
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.9.0-rc2
>
> 
>
> The candidate for Mesos 1.9.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz
>
> The tag to be voted on is 1.9.0-rc2:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.9.0-rc2
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1256
>
> Please vote on releasing this package as Apache Mesos 1.9.0!
>
> The vote is open until  and passes if a majority of at least 3 +1 PMC
> votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.9.0
> [ ] -1 Do not release this package because ...
>
>
> Thanks,
> Qian and Gilbert
>


Re: [VOTE] Release Apache Mesos 1.8.0 (rc3)

2019-05-02 Thread Chun-Hung Hsiao
From the log you attached, it seems that you're using the Mesos containerizer,
so a docker pull won't affect Mesos. Can you verify if the error occurs
with the latest nvidia/cuda image?

On Wed, May 1, 2019, 4:25 PM Chun-Hung Hsiao  wrote:

> Hi Jorge,
>
> Can you provide the output of `docker run --rm -ti nvidia/cuda ls
> /usr/local/cuda-10.1/compat/`?
> It seems that the nvidia kernel driver installed on your host has version
> 418, but the image you're using is version 410.
> The latest `nvidia/cuda` image uses version 418 as well.
> Can you also do a `docker pull nvidia/cuda` then try again with Mesos 1.8?
>
> On Fri, Apr 26, 2019 at 1:03 PM Jorge Machado 
> wrote:
>
>> Hi all,
>>
>> Has anyone tested it on Ubuntu 18.04 + nvidia-docker2? We are having
>> some issues using the cuda 10+ images when doing real processing. We still
>> need to check some things but basically we get:
>> kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot
>> find working devices in this configuration
>>
>> Logs:
>> I0424 13:27:14.00058630 executor.cpp:726] Forked command at 73
>> Preparing rootfs at
>> '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
>> Marked '/' as rslave
>> Executing pre-exec command
>> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
>> Executing pre-exec command
>> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
>> Changing root to
>> /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
>> 2019-04-24 13:27:18.346994: I
>> tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports
>> instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
>> 2019-04-24 13:27:18.352203: E
>> tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit:
>> CUDA_ERROR_UNKNOWN: unknown error
>> 2019-04-24 13:27:18.352243: I
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA
>> diagnostic information for host: __host__
>> 2019-04-24 13:27:18.352252: I
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: __host__
>> 2019-04-24 13:27:18.352295: I
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported
>> version is: 410.48.0
>> 2019-04-24 13:27:18.352329: I
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported
>> version is: 418.56.0
>> 2019-04-24 13:27:18.352338: E
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version
>> 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices
>> in this configuration
>> 2019-04-24 13:27:18.374940: I
>> tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency:
>> 259392 Hz
>> 2019-04-24 13:27:18.378793: I
>> tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f41e10
>> executing computations on platform Host. Devices:
>> 2019-04-24 13:27:18.378821: I
>> tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device
>> (0): , 
>> W0424 13:27:18.385210 140191267731200 deprecation.py:323] From
>> /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263:
>> colocate_with (from tensorflow.python.framework.ops) is deprecated and will
>> be removed in a future version.
>> Instructions for updating:
>> Colocations handled automatically by placer.
>> W0424 13:27:18.399287 140191267731200 deprecation.py:323] From
>> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129:
>> conv2d (from tensorflow.python.layers.convolutional) is deprecated and will
>> be removed in a future version.
>> Instructions for updating:
>> Use keras.layers.conv2d instead.
>> W0424 13:27:18.433226 140191267731200 deprecation.py:323] From
>> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261:
>> max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and
>> will 

Re: [VOTE] Release Apache Mesos 1.8.0 (rc3)

2019-05-01 Thread Chun-Hung Hsiao
Hi Jorge,

Can you provide the output of `docker run --rm -ti nvidia/cuda ls
/usr/local/cuda-10.1/compat/`?
It seems that the nvidia kernel driver installed on your host has version
418, but the image you're using is version 410.
The latest `nvidia/cuda` image uses version 418 as well.
Can you also do a `docker pull nvidia/cuda` then try again with Mesos 1.8?
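
The two versions can also be read straight from the TensorFlow diagnostics in the quoted log. The following is only a minimal sketch, assuming the log text is available as a string; the helper names `driver_versions` and `versions_differ` are hypothetical and not part of Mesos or TensorFlow.

```python
import re


def driver_versions(log_text):
    """Return {'libcuda': ..., 'kernel': ...} parsed from the CUDA diagnostics."""
    # Undo mail wrapping: drop leading '>' quote markers, then join all lines
    # with single spaces so a message split across two lines becomes one.
    lines = (re.sub(r"^[>\s]+", "", line) for line in log_text.splitlines())
    flat = " ".join(" ".join(lines).split())
    pattern = r"(libcuda|kernel) reported version is: (\d+(?:\.\d+)*)"
    return {name: version for name, version in re.findall(pattern, flat)}


def versions_differ(versions):
    """True when both versions were found and they are not identical."""
    if "libcuda" not in versions or "kernel" not in versions:
        return False
    return versions["libcuda"] != versions["kernel"]


# Sample taken from the log quoted in this thread.
sample = """\
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported
> version is: 410.48.0
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported
> version is: 418.56.0
"""
versions = driver_versions(sample)
print(versions, versions_differ(versions))
```

A differing pair here is the same condition the log reports as "kernel version 418.56.0 does not match DSO version 410.48.0".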

On Fri, Apr 26, 2019 at 1:03 PM Jorge Machado  wrote:

> Hi all,
>
> Has anyone tested it on Ubuntu 18.04 + nvidia-docker2? We are having
> some issues using the cuda 10+ images when doing real processing. We still
> need to check some things but basically we get:
> kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot find
> working devices in this configuration
>
> Logs:
> I0424 13:27:14.00058630 executor.cpp:726] Forked command at 73
> Preparing rootfs at
> '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
> Marked '/' as rslave
> Executing pre-exec command
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
> Executing pre-exec command
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
> Changing root to
> /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
> 2019-04-24 13:27:18.346994: I
> tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports
> instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
> 2019-04-24 13:27:18.352203: E
> tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit:
> CUDA_ERROR_UNKNOWN: unknown error
> 2019-04-24 13:27:18.352243: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA
> diagnostic information for host: __host__
> 2019-04-24 13:27:18.352252: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: __host__
> 2019-04-24 13:27:18.352295: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported
> version is: 410.48.0
> 2019-04-24 13:27:18.352329: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported
> version is: 418.56.0
> 2019-04-24 13:27:18.352338: E
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version
> 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices
> in this configuration
> 2019-04-24 13:27:18.374940: I
> tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency:
> 259392 Hz
> 2019-04-24 13:27:18.378793: I
> tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f41e10
> executing computations on platform Host. Devices:
> 2019-04-24 13:27:18.378821: I
> tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device
> (0): , 
> W0424 13:27:18.385210 140191267731200 deprecation.py:323] From
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263:
> colocate_with (from tensorflow.python.framework.ops) is deprecated and will
> be removed in a future version.
> Instructions for updating:
> Colocations handled automatically by placer.
> W0424 13:27:18.399287 140191267731200 deprecation.py:323] From
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129:
> conv2d (from tensorflow.python.layers.convolutional) is deprecated and will
> be removed in a future version.
> Instructions for updating:
> Use keras.layers.conv2d instead.
> W0424 13:27:18.433226 140191267731200 deprecation.py:323] From
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261:
> max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and
> will be removed in a future version.
> Instructions for updating:
> Use keras.layers.max_pooling2d instead.
> W0424 13:27:20.197937 140191267731200 deprecation.py:323] From
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:209:
> to_float (from tensorflow.python.ops.math_ops) is deprecated and will be
> removed in a future version.
> Instructions for updating:
> Use tf.cast instead.
> W0424 13:27:20.312573 140191267731200 deprecation.py:323] From
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066:
> to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be
> removed in a future version.
> Instructions for updating:
> Use tf.cast instead.
> W0424 13:27:21.082763 140191267731200 deprecation.py:323] From
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238:
> __init__ (from tensorflow.python.training.supervisor) is deprecated 

Re: Mesos on ssl

2019-04-05 Thread Chun-Hung Hsiao
I'm not sure if this is related:
https://issues.apache.org/jira/browse/MESOS-7076

In summary, Ubuntu 18.04 ships libevent 2.1.x (for OpenSSL 1.1.x support),
but libevent 2.1.x has an unknown bug that caused some Mesos tests to fail.
As a workaround, the current Mesos master branch (soon to be 1.8) bundles
libevent 2.0.x with a magic patch from Debian 8 for OpenSSL 1.1.x support. So
Mesos 1.8 will be the first official release supporting SSL on Ubuntu 18.04.

That said, I'm not sure whether what you encountered is exactly the same bug
that caused the Mesos tests to fail. Just a guess ;)

On Fri, Apr 5, 2019, 12:58 AM Jorge Machado  wrote:

> Hi Guys,
>
> I'm having issues with Mesos builds from the tar.gz compared with a build
> from git master when using SSL.
> With a build from git, the SSL agent is fine; for example, the endpoint
> https://mesos-agent:5051/ returns a 404, which is fine.
> With a build from the tar.gz (1.7.1 or 1.7.2), the same endpoint does not
> work and it just hangs. No logs, nothing...
> I'm testing this on Ubuntu 18.04.
>
> Any tips?
> thanks
> Jorge
>
>
> Jorge Machado
> www.jmachado.me


Re: [VOTE] Release Apache Mesos 1.5.3 (rc1)

2019-03-14 Thread Chun-Hung Hsiao
+1 (binding)

`sudo make check` with `--enable-grpc --enable-ssl --enable-libevent` on
Ubuntu 16.04 with the following known test failures:
[  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
[  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
[  FAILED  ] CgroupsAnyHierarchyWithCpuAcctMemoryTest.ROOT_CGROUPS_Stat
[  FAILED  ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS

`sudo make check` with `--enable-grpc --enable-ssl --enable-libevent` on
macOS 10.14.2 with the following test failures that also fail on 1.5.2:
[  FAILED  ] ExamplesTest.PythonFramework
[  FAILED  ] FetcherCacheTest.LocalUncachedExtract
[  FAILED  ] FetcherCacheHttpTest.HttpMixed
[  FAILED  ]
HealthCheckTest.ROOT_INTERNET_CURL_HealthyTaskViaHTTPWithContainerImage
[  FAILED  ]
HealthCheckTest.ROOT_INTERNET_CURL_HealthyTaskViaHTTPSWithContainerImage
[  FAILED  ]
HealthCheckTest.ROOT_INTERNET_CURL_HealthyTaskViaTCPWithContainerImage
[  FAILED  ]
MesosContainerizer/DefaultExecutorTest.ROOT_INTERNET_CURL_DockerTaskWithFileURI/0,
where GetParam() = "mesos"
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_PersistentResources/1,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 26-76 6F-6C 75-6D 65-2F 73-61 6E-64
62-6F 78-5F 70-61 74-68 00-00 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_PersistentResources/2,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 31-00 00-00 00-00 00-00 24-00 00-00
00-00 00-00 80-8C D1-7B DC-7F 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskSandboxPersistentVolume/1,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 26-76 6F-6C 75-6D 65-2F 73-61 6E-64
62-6F 78-5F 70-61 74-68 00-00 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskSandboxPersistentVolume/2,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 31-00 00-00 00-00 00-00 24-00 00-00
00-00 00-00 80-8C D1-7B DC-7F 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TasksSharingViaSandboxVolumes/1,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 26-76 6F-6C 75-6D 65-2F 73-61 6E-64
62-6F 78-5F 70-61 74-68 00-00 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TasksSharingViaSandboxVolumes/2,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 31-00 00-00 00-00 00-00 24-00 00-00
00-00 00-00 70-6F D4-7B DC-7F 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskGroupsSharingViaSandboxVolumes/1,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 26-76 6F-6C 75-6D 65-2F 73-61 6E-64
62-6F 78-5F 70-61 74-68 00-00 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskGroupsSharingViaSandboxVolumes/2,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 31-00 00-00 00-00 00-00 24-00 00-00
00-00 00-00 80-8C D1-7B DC-7F 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_HealthCheckUsingPersistentVolume/1,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 26-76 6F-6C 75-6D 65-2F 73-61 6E-64
62-6F 78-5F 70-61 74-68 00-00 00-00>
[  FAILED  ]
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_HealthCheckUsingPersistentVolume/2,
where GetParam() = 48-byte object <0A-6C 69-6E 75-78 00-00 00-00 00-00
00-00 00-00 00-00 00-00 00-00 00-00 31-00 00-00 00-00 00-00 24-00 00-00
00-00 00-00 00-8C D4-7B DC-7F 00-00>
[  FAILED  ] ContentTypeAndSSLConfig/SchedulerSSLTest.RunTaskAndTeardown/0,
where GetParam() = (application/x-protobuf, "https")
[  FAILED  ] ContentTypeAndSSLConfig/SchedulerSSLTest.RunTaskAndTeardown/2,
where GetParam() = (application/json, "https")

On Wed, Mar 13, 2019 at 12:36 PM Meng Zhu  wrote:

> +1
> sudo make check on CentOS 7.4, only known flaky tests failed
>
> On Tue, Mar 12, 2019 at 4:44 PM Gilbert Song  wrote:
>
> > +1 (binding).
> >
> > -Gilbert
> >
> > On Thu, Mar 7, 2019 at 10:09 AM Greg Mann  wrote:
> >
> > > +1 (binding)
> > >
> > > Ran through internal CI and observed only known flaky tests; almost all
> > > configurations passed with no failures.
> > >
> > > Cheers,
> > > Greg
> > >
> > > On Thu, Mar 7, 2019 at 1:55 AM Vinod Kone wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Ran in ASF CI. Saw some flaky tests but otherwise looks good.
> > > >
> > > > *Revision*: b1dbba03af23b0222d11f2b7ae936d77ef42650d
> > > >
> > > >- refs/tags/1.5.3-rc1
> > > >
> > > > Configuration Matrix gcc clang
> > > > 

[RESULT][VOTE] Release Apache Mesos 1.7.1 (rc2)

2019-01-28 Thread Chun-Hung Hsiao
Hi all,

The vote for Mesos 1.7.1 (rc2) has passed with the
following votes.

+1 (Binding)
--
*** Vinod Kone
*** Gilbert Song
*** Meng Zhu

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.7.1

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.1

The mesos-1.7.1.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Chun-Hung & Gaston


Re: Discussion: Scheduler API for Operation Reconciliation

2019-01-24 Thread Chun-Hung Hsiao
SOS-9448).
>>
>> What is keeping us from moving forward with (III) at this point?
>>
>>
>> Cheers,
>>
>> Benjamin
>>
>> > On Jan 3, 2019, at 11:30 PM, Benno Evers  wrote:
>> >
>> > Hi Chun-Hung,
>> >
>> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per
>> > > node, then the master needs to maintain 20k entries for LRPs.
>> >
>> > How big would the required additional storage be in this scenario? Even
>> > if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
>> > for such a big cluster.
>> >
>> > In general, it seems hard to discuss the trade-offs between your
>> > proposals without looking at the users of that API - do you know if there
>> > are any frameworks out there that already use operation reconciliation,
>> > and if so, what do they do based on the reconciliation response?
>> >
>> > As far as I know, we don't have any formal guarantees on which
>> > operation status changes the framework will receive without
>> > reconciliation. So putting on my framework-implementer hat, it seems like
>> > I'd have no choice but to implement a continuously polling background loop
>> > anyway if I care about knowing the latest operation statuses. If this is
>> > indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to
>> > have little additional benefit.
>> >
>> > Best regards,
>> > Benno
>> >
>> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao wrote:
>> > Hi folks,
>> >
>> > Recently I've been discussing the problems of the current design of the
>> > experimental `RECONCILE_OPERATIONS` scheduler API with a couple of people.
>> > The discussion started from MESOS-9318
>> > <https://issues.apache.org/jira/browse/MESOS-9318>: when a framework
>> > receives an `OPERATION_UNKNOWN`, it doesn't know if it should retry the
>> > operation or not (further details described below). As the discussion
>> > evolved, we realized there are more issues to consider, design-wise and
>> > implementation-wise, so I'd like to reach out to the community to get
>> > valuable opinions from you guys.
>> >
>> > Before I jump right into the issues I'd like to discuss, let me fill you
>> > guys in with some background on operation reconciliation. Since the design
>> > of this feature was informed by the pre-existing implementation of task
>> > reconciliation, I'll begin there.
>> >
>> > *Task Reconciliation: Design*
>> >
>> > The scheduler API has a `RECONCILE` call for a framework to query the
>> > current statuses of its tasks. This call supports the following modes:
>> >
>> > - *Explicit reconciliation*: The framework specifies the list of tasks
>> >   it wants to know about, and expects status updates for these tasks.
>> >
>> > - *Implicit reconciliation*: The framework does not specify a list of
>> >   tasks, and simply expects status updates for all tasks the master
>> >   knows about.
>> >
>> > In both cases, the master looks into its in-memory task bookkeeping and
>> > sends *one or more `UPDATE` events* to respond to the reconciliation
>> > request.
>> >
>> > *Task Reconciliation: Problems*
>> >
>> > This API design of task reconciliation has the following shortcomings:
>> >
>> > - (1) There is no clear boundary of when the "reconciliation response"
>> >   ends, and thus there is *no 1-1 correspondence between the
>> >   reconciliation request and the response*. For explicit reconciliation,
>> >   the framework might wait for an extended period of time before it
>> >   receives all status updates; for implicit reconciliation, there is no
>> >   way for a framework to tell if it has learned about all of its tasks,
>> >   which could be inconvenient if the framework has lost its task
>> >   bookkeeping.
>> >
>> > - (2) The "reconciliation response" may be outdated. If an agent
>> >   reregisters after a task reconciliation has been responded to, *the
>> >   framework wouldn't learn about the tasks from this recovered agent*.
>> >   Mesos relies on the framework to call the `RECONCILE` call
>> >   *periodically* to get up-
Re: [VOTE] Release Apache Mesos 1.5.2 (rc3)

2019-01-16 Thread Chun-Hung Hsiao
+1 (binding)

`sudo make -j32 DISTCHECK_CONFIGURE_FLAGS='LIBS=-ldl --enable-ssl
--enable-libevent --enable-grpc' distcheck` on Ubuntu 16.04.
I got 4 known test failures on my machine:
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
[  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
[  FAILED  ] CgroupsAnyHierarchyWithCpuAcctMemoryTest.ROOT_CGROUPS_Stat
[  FAILED  ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS

However, with gcc 5.4.0, `LIBS=-ldl` is required for linking.
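
Comparing a run against a list of known failures can be scripted. This is a minimal sketch, assuming the test output has been captured as text; the function names and the regular expression are illustrative and not part of Mesos or GoogleTest.

```python
import re

# A failed case is reported as "[  FAILED  ] Suite.Case". Summary lines such as
# "[  FAILED  ] 4 tests, listed below:" carry no Suite.Case name and are skipped.
_FAILED_RE = re.compile(
    r"^\[\s*FAILED\s*\]\s+([A-Za-z_][\w/]*\.[\w/]+)", re.MULTILINE
)


def failed_tests(output):
    """Return the set of failed test names found in the captured output."""
    return set(_FAILED_RE.findall(output))


def unexpected_failures(output, known):
    """Failures that are not in the caller-supplied set of known failures."""
    return failed_tests(output) - set(known)


# Sample taken from the failures listed in this message.
sample = """\
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
[  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
[  FAILED  ] CgroupsAnyHierarchyWithCpuAcctMemoryTest.ROOT_CGROUPS_Stat
[  FAILED  ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS
"""
known = {
    "CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs",
    "CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen",
    "CgroupsAnyHierarchyWithCpuAcctMemoryTest.ROOT_CGROUPS_Stat",
    "CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS",
}
print(sorted(unexpected_failures(sample, known)))
```

An empty result means every failure in the run was already on the known list.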

On Wed, Jan 16, 2019 at 12:03 PM Vinod Kone  wrote:

> +1  (binding)
>
> Passed in ASF CI. Known flaky tests, but otherwise builds look good.
>
> *Revision*: 3088295d4156eb58d092ad9b3529b85fd33bd36e
>
>- refs/tags/1.5.2-rc3
>
> Configuration matrix:
>
> OS            Configuration                             Build tool  gcc      clang
> centos:7      --verbose --enable-libevent --enable-ssl  autotools   Failed   Not run
> centos:7      --verbose --enable-libevent --enable-ssl  cmake       Success  Not run
> centos:7      --verbose                                 autotools   Failed   Not run
> centos:7      --verbose                                 cmake       Success  Not run
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools   Failed   Success
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  cmake       Success  Success
> ubuntu:14.04  --verbose                                 autotools   Success  Success
> ubuntu:14.04  --verbose                                 cmake       Success  Success
>
>
> On Wed, Jan 16, 2019 at 11:04 AM Jie Yu  wrote:
>
>> +1
>>
>> make dist check on macOS Mojave
>>
>> On Tue, Jan 15, 2019 at 12:57 AM Gilbert Song  wrote:
>>
>>>  Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 1.5.2.
>>>
>>> 1.5.2 includes the following:
>>>
>>> 
>>> *Announce major bug fixes here*
>>> https://jira.apache.org/jira/issues/?filter=12345443
>>>
>>> The CHANGELOG for the release is available 

[VOTE] Release Apache Mesos 1.7.1 (rc2)

2019-01-15 Thread Chun-Hung Hsiao
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.7.1.


1.7.1 includes the following:

* This is a bug fix release. It also includes performance and API
  improvements:

  * **Allocator**: Improved allocation cycle time substantially
    (see MESOS-9239 and MESOS-9249). These reduce the allocation
    cycle time in some benchmarks by 80%.

  * **Scheduler API**: Improved the experimental `CREATE_DISK` and
    `DESTROY_DISK` operations for CSI volume recovery (see MESOS-9275
    and MESOS-9321). Storage local resource providers now return disk
    resources with the `source.vendor` field set, so frameworks need to
    upgrade the `Resource` protobuf definitions.

  * **Scheduler API**: Offer operation feedbacks now present their agent
    IDs and resource provider IDs (see MESOS-9293).


The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.1-rc2


The candidate for Mesos 1.7.1 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc2/mesos-1.7.1.tar.gz

The tag to be voted on is 1.7.1-rc2:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.1-rc2

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc2/mesos-1.7.1.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc2/mesos-1.7.1.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS
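
For anyone checking the tarball by hand, the checksum comparison can be sketched as follows. This is an illustration only: it assumes the tarball and the `.sha512` file have already been downloaded, and that the `.sha512` file begins with the hex digest (the usual `sha512sum` layout), which may not hold for every release. The function names are hypothetical. Verifying the `.asc` signature against KEYS needs a PGP tool and is not shown.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, sha512_file_text):
    """Compare the file's digest with the first token of the .sha512 text.

    Assumption: the published file starts with the hex digest; adjust the
    parsing if the file is laid out differently.
    """
    tokens = sha512_file_text.split()
    return bool(tokens) and tokens[0].lower() == sha512_of(path)
```

Usage would look like `matches_published("mesos-1.7.1.tar.gz", open("mesos-1.7.1.tar.gz.sha512").read())`, with both local file names being whatever the downloads were saved as.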

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1243/

Please vote on releasing this package as Apache Mesos 1.7.1!

The vote is open until Fri Jan 18 18:27:28 PST 2019 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.7.1
[ ] -1 Do not release this package because ...
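
The passing condition above can be written down as a small check. This is one reading of "a majority of at least 3 +1 PMC votes" (at least three binding +1 votes, and more binding +1 than -1 votes), not an official statement of the voting rules; the function name is hypothetical.

```python
def vote_passes(binding_plus_ones, binding_minus_ones, minimum=3):
    """At least `minimum` binding +1 votes, and more binding +1 than -1 votes."""
    return (
        binding_plus_ones >= minimum
        and binding_plus_ones > binding_minus_ones
    )


# Three binding +1 votes and no -1 votes.
print(vote_passes(3, 0))
```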

Thanks,
Chun-Hung & Gaston


Re: [VOTE] Release Apache Mesos 1.7.1 (rc1)

2019-01-03 Thread Chun-Hung Hsiao
Thanks Vinod. I'll take a look tomorrow.

I'm doing a -1 myself because of
https://issues.apache.org/jira/browse/MESOS-9508.
Once this is landed and the above issues are investigated I'll make another
cut.

On Wed, Jan 2, 2019 at 1:38 PM Vinod Kone  wrote:

> Also, another error:
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/57/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
>
>
> /mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0/src/core/tsi/ssl_transport_security.cc:
> In function 'tsi_result ssl_handshaker_extract_peer(tsi_handshaker*,
> tsi_peer*)':
>
> /mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0/src/core/tsi/ssl_transport_security.cc:1011:71:
> error: 'SSL_get0_alpn_selected' was not declared in this scope
>SSL_get0_alpn_selected(impl->ssl, &alpn_selected, &alpn_selected_len);
>^
>
> /mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0/src/core/tsi/ssl_transport_security.cc:
> In function 'tsi_result tsi_create_ssl_client_handshaker_factory(const
> tsi_ssl_pem_key_cert_pair*, const char*, const char*, const char**,
> uint16_t, tsi_ssl_client_handshaker_factory**)':
>
> /mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0/src/core/tsi/ssl_transport_security.cc:1417:73:
> error: 'SSL_CTX_set_alpn_protos' was not declared in this scope
>static_cast<int>(impl->alpn_protocol_list_length))) {
>  ^
>
> /mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0/src/core/tsi/ssl_transport_security.cc:
> In function 'tsi_result
> tsi_create_ssl_server_handshaker_factory_ex(const
> tsi_ssl_pem_key_cert_pair*, size_t, const char*,
> tsi_client_certificate_request_type, const char*, const char**,
> uint16_t, tsi_ssl_server_handshaker_factory**)':
>
> /mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0/src/core/tsi/ssl_transport_security.cc:1557:79:
> error: 'SSL_CTX_set_alpn_select_cb' was not declared in this scope
>
> server_handshaker_factory_alpn_callback, impl);
>
>  ^
> make[7]: *** [CMakeFiles/grpc.dir/src/core/tsi/ssl_transport_security.cc.o]
> Error 1
> make[7]: Leaving directory
> `/mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0-build'
> make[6]: *** [CMakeFiles/grpc.dir/all] Error 2
> make[6]: Leaving directory
> `/mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0-build'
> make[5]: *** [CMakeFiles/grpc.dir/rule] Error 2
> make[5]: Leaving directory
> `/mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0-build'
> make[4]: *** [grpc] Error 2
> make[4]: Leaving directory
> `/mesos/build/3rdparty/grpc-1.10.0/src/grpc-1.10.0-build'
> make[3]: *** [3rdparty/grpc-1.10.0/src/grpc-1.10.0-stamp/grpc-1.10.0-build]
> Error 2
> make[3]: Leaving directory `/mesos/build'
> make[2]: *** [3rdparty/CMakeFiles/grpc-1.10.0.dir/all] Error 2
> make[2]: *** Waiting for unfinished jobs
>
>
>
> On Wed, Jan 2, 2019 at 3:35 PM Vinod Kone  wrote:
>
> > I see an issue
> > <
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/57/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
> >
> > with clang compiler when running it in ASF CI. Is this a known issue?
> >
> > ../../src/resource_provider/storage/provider.cpp:3190:5: error:
> conditional expression is ambiguous; 'Future>' can
> be converted to 'Future>' and vice versa
> > ? createVolume(
> > ^ ~
> >
> >
> >
> > On Wed, Jan 2, 2019 at 2:11 PM Benjamin Mahler 
> wrote:
> >
> >> +1 (binding)
> >>
> >> make check passes on macOS 10.14.2
> >>
> >> $ clang++ --version
> >> Apple LLVM version 10.0.0 (clang-1000.10.44.4)
> >> Target: x86_64-apple-darwin18.2.0
> >> Thread model: posix
> >> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
> >>
> >> $ ./configure CC=clang CXX=clang++
> CXXFLAGS="-Wno-deprecated-declarations"
> >> --disable-python --disable-java --with-apr=/usr/local/opt/apr/libexec
> >> --with-svn=/usr/local/opt/subversion && make check -j12
> >> ...
> >> [  PASSED  ] 1956 tests.
> >>
> >> On Fri, Dec 21, 2018 at 5:48 PM Chun-Hung Hsiao 
> >> wrote:
> >>
> >> > Hi all,
> >> >

[VOTE] Release Apache Mesos 1.7.1 (rc1)

2018-12-21 Thread Chun-Hung Hsiao
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.7.1.


1.7.1 includes the following:

* This is a bug fix release. Also includes performance and API
  improvements:

  * **Allocator**: Improved allocation cycle time substantially
(see MESOS-9239 and MESOS-9249). These reduce the allocation
cycle time in some benchmarks by 80%.

  * **Scheduler API**: Improved the experimental `CREATE_DISK` and
`DESTROY_DISK` operations for CSI volume recovery (see MESOS-9275
and MESOS-9321). Storage local resource providers now return disk
resources with the `source.vendor` field set, so frameworks need to
upgrade the `Resource` protobuf definitions.

  * **Scheduler API**: Offer operation feedbacks now present their agent
IDs and resource provider IDs (see MESOS-9293).


The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.1-rc1


The candidate for Mesos 1.7.1 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz

The tag to be voted on is 1.7.1-rc1:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.1-rc1

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.1-rc1/mesos-1.7.1.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/releases/org/apache/mesos/mesos/1.7.1-rc1/
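
For anyone who wants to script the checksum comparison before voting, here is a minimal sketch. It assumes the tarball and its `.sha512` file have already been downloaded locally, and that the `.sha512` file starts with the hex digest (the layout `sha512sum` writes); both are assumptions, and the helper names `sha512_of` and `checksum_matches` are made up for illustration. It does not replace checking the PGP signature against the KEYS file.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(tarball_path, sha512_path):
    """Compare the tarball digest with the first token of the .sha512 file.

    Assumption: the first whitespace-separated token in the checksum file
    is the hex digest.
    """
    with open(sha512_path) as f:
        expected = f.read().split()[0].lower()
    return sha512_of(tarball_path) == expected
```

Something like `checksum_matches("mesos-1.7.1.tar.gz", "mesos-1.7.1.tar.gz.sha512")` would then return True only when the digests agree.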

Please vote on releasing this package as Apache Mesos 1.7.1!

To accommodate the holidays, the vote is open until Mon Dec 31 14:00:00
PST 2018 and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.7.1
[ ] -1 Do not release this package because ...

Thanks,
Chun-Hung & Gaston


Discussion: Scheduler API for Operation Reconciliation

2018-12-11 Thread Chun-Hung Hsiao
Hi folks,

Recently I've been discussing the problems with the current design of the
experimental `RECONCILE_OPERATIONS` scheduler API with a couple of people.
The discussion started from MESOS-9318: when a framework receives an
`OPERATION_UNKNOWN`, it doesn't know whether it should retry the operation
(further details below). As the discussion evolved, we realized there are
more issues to consider, design-wise and implementation-wise, so I'd like to
reach out to the community for your opinions.

Before I jump into the issues I'd like to discuss, let me fill you in on
some background on operation reconciliation. Since the design of this
feature was informed by the pre-existing implementation of task
reconciliation, I'll begin there.

*Task Reconciliation: Design*

The scheduler API has a `RECONCILE` call for a framework to query the
current statuses of its tasks. This call supports the following modes:

   - *Explicit reconciliation*: The framework specifies the list of tasks
   it wants to know about, and expects status updates for these tasks.

   - *Implicit reconciliation*: The framework does not specify a list of
   tasks, and simply expects status updates for all tasks the master knows
   about.

In both cases, the master looks into its in-memory task bookkeeping and
sends *one or more `UPDATE` events* to respond to the reconciliation request.
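
To make the two modes concrete, here is a toy model of the master side. It is only a sketch: the function name `reconcile`, the dict-based bookkeeping, and the use of `None` to mean "implicit" are illustrative assumptions, not the actual Mesos implementation, and the real master has more cases than a blanket `TASK_UNKNOWN` for tasks it does not know.

```python
def reconcile(master_tasks, requested_ids=None):
    """Return the UPDATE events a (toy) master would send for a RECONCILE.

    master_tasks:  dict of task ID -> last known task state (the master's
                   in-memory bookkeeping, simplified).
    requested_ids: list of task IDs for explicit reconciliation, or None
                   for implicit reconciliation (hypothetical convention).
    """
    if requested_ids is None:
        # Implicit: report every task the master knows about.
        ids = list(master_tasks)
    else:
        # Explicit: report only what the framework asked about.
        ids = list(requested_ids)

    return [
        {
            "type": "UPDATE",
            "task_id": task_id,
            # Simplification: anything not in the bookkeeping is unknown.
            "state": master_tasks.get(task_id, "TASK_UNKNOWN"),
        }
        for task_id in ids
    ]
```

Note that the result here is one list only because this is a single function call. In the real API the events arrive one by one on the event stream, so the framework has no marker telling it the response is complete.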

*Task Reconciliation: Problems*

This API design of task reconciliation has the following shortcomings:

   - (1) There is no clear boundary of when the "reconciliation response"
   ends, and thus there is *no 1-1 correspondence between the
   reconciliation request and the response*. For explicit reconciliation,
   the framework might wait for an extended period of time before it
   receives all status updates; for implicit reconciliation, there is no
   way for a framework to tell if it has learned about all of its tasks,
   which could be inconvenient if the framework has lost its task
   bookkeeping.

   - (2) The "reconciliation response" may be outdated. If an agent
   reregisters after a task reconciliation has been responded to, *the
   framework wouldn't learn about the tasks from this recovered agent*.
   Mesos relies on the framework to issue the `RECONCILE` call
   *periodically* to get up-to-date task statuses.



*Operation Reconciliation: Design & Problems*

When designing operation reconciliation, we made the `RECONCILE_OPERATIONS`
call a *synchronous request-response style call* that returns a 200 OK with
a list of operation statuses to avoid (1). However, this design does not
resolve (2), and also introduces new problems:

   - (3) *The synchronous response could race with the event stream* and
   the framework does not know which contains the latest operation status.

   - (4) To ensure scalability, the master does not manage local resource
   providers (LRPs); the agents do. So the master cannot tell if an LRP is
   temporarily unreachable/recovering or permanently gone. As a result, if
   the framework explicitly reconciles an LRP operation that the master
   does not know about, the master can only reply `OPERATION_UNKNOWN`, but
   then *the framework would not know if the operation would come back in
   the future*, and thus cannot decide if it should reissue another
   operation, which leads to MESOS-9318.

   Note that this is less of a problem for explicit task reconciliation,
   because in most cases the master can infer task statuses from agent
   statuses, and in the rare cases that it replies `TASK_UNKNOWN`, it is
   generally safe for the framework to relaunch another task.


*The Open Question*

Now, the big question here is: *are the benefits of a synchronous
request-response style `RECONCILE_OPERATIONS` call worth the complexity it
introduces* in order to address (3) and (4) in the code? To explain what the
complexity would be, let me lay out a couple of proposals we've been
discussing:

I. Keep `RECONCILE_OPERATIONS` synchronous

To address (3), we could add a *timestamp* to every operation status as
well as the reconciliation response, so the framework can infer which one
is the latest status, and if it receives a stale operation status update
after the reconciliation response, it can just ack the status update
without updating its bookkeeping. But the framework needs to deal with a
corner case: *when it receives a reconciliation response containing a
terminal operation status, it may or may not receive one or more status
updates for that operation later* because of the race.
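
To illustrate the timestamp idea on the framework side, here is a minimal sketch. The class name `OperationBookkeeping`, the numeric timestamps, and the assumption that timestamps from the reconciliation response and from the event stream are directly comparable are all hypothetical; no such field exists in the API today.

```python
class OperationBookkeeping:
    """Toy framework-side bookkeeping keyed by operation ID."""

    def __init__(self):
        # operation ID -> (timestamp, state) of the newest status seen.
        self.latest = {}

    def apply(self, operation_id, state, timestamp):
        """Record a status from either the reconciliation response or the
        event stream. Returns True if the bookkeeping changed, and False
        if the status was stale (the caller should still acknowledge a
        stale status update, just without acting on it).
        """
        current = self.latest.get(operation_id)
        if current is not None and timestamp <= current[0]:
            return False
        self.latest[operation_id] = (timestamp, state)
        return True
```

With this, a terminal status learned from the reconciliation response at time 20 is not overwritten by a late `OPERATION_PENDING` update stamped 10. The framework still cannot know whether such a late update will ever arrive, which is the corner case above.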


To address (4), we could either: (a) surface the unreachable/gone LRPs to
the master, or (b) forward the explicit reconciliation request to the
corresponding agent. The complexity of (a) is that *it might not be
scalable for the master to maintain the list of unreachable and gone LRPs*:
imagine 

Re: [VOTE] Release Apache Mesos 1.5.2 (rc2)

2018-11-26 Thread Chun-Hung Hsiao
-1 for https://issues.apache.org/jira/browse/MESOS-8623

I'm working on a fix.

On Thu, Nov 22, 2018 at 1:40 PM Meng Zhu  wrote:

> +1
> make check on Ubuntu 18.04
>
> On Wed, Oct 31, 2018 at 4:26 PM Gilbert Song 
> wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.5.2.
> >
> > 1.5.2 includes the following:
> >
> >
> 
> > *Announce major bug fixes here*
> >   * [MESOS-3790] - ZooKeeper connection should retry on `EAI_NONAME`.
> >   * [MESOS-8128] - Make os::pipe file descriptors O_CLOEXEC.
> >   * [MESOS-8418] - mesos-agent high cpu usage because of numerous
> > /proc/mounts reads.
> >   * [MESOS-8545] -
> > AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> >   * [MESOS-8568] - Command checks should always call
> > `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`.
> >   * [MESOS-8620] - Containers stuck in FETCHING possibly due to
> > unresponsive server.
> >   * [MESOS-8830] - Agent gc on old slave sandboxes could empty persistent
> > volume data.
> >   * [MESOS-8871] - Agent may fail to recover if the agent dies before
> > image store cache checkpointed.
> >   * [MESOS-8904] - Master crash when removing quota.
> >   * [MESOS-8906] - `UriDiskProfileAdaptor` fails to update profile
> > selectors.
> >   * [MESOS-8907] - Docker image fetcher fails with HTTP/2.
> >   * [MESOS-8917] - Agent leaking file descriptors into forked processes.
> >   * [MESOS-8921] - Autotools don't work with newer OpenJDK versions.
> >   * [MESOS-8935] - Quota limit "chopping" can lead to cpu-only and
> > memory-only offers.
> >   * [MESOS-8936] - Implement a Random Sorter for offer allocations.
> >   * [MESOS-8942] - Master streaming API does not send (health) check
> > updates for tasks.
> >   * [MESOS-8945] - Master check failure due to CHECK_SOME(providerId).
> >   * [MESOS-8947] - Improve the container preparing logging in
> > IOSwitchboard and volume/secret isolator.
> >   * [MESOS-8952] - process::await/collect n^2 performance issue.
> >   * [MESOS-8963] - Executor crash trying to print container ID.
> >   * [MESOS-8978] - Command executor calling setsid breaks the tty
> support.
> >   * [MESOS-8980] - mesos-slave can deadlock with docker pull.
> >   * [MESOS-8986] - `slave.available()` in the allocator is expensive and
> > drags down allocation performance.
> >   * [MESOS-8987] - Master asks agent to shutdown upon auth errors.
> >   * [MESOS-9024] - Mesos master segfaults with stack overflow under load.
> >   * [MESOS-9049] - Agent GC could unmount a dangling persistent volume
> > multiple times.
> >   * [MESOS-9116] - Launch nested container session fails due to incorrect
> > detection of `mnt` namespace of command executor's task.
> >   * [MESOS-9125] - Port mapper CNI plugin might fail with "Resource
> > temporarily unavailable".
> >   * [MESOS-9127] - Port mapper CNI plugin might deadlock iptables on the
> > agent.
> >   * [MESOS-9131] - Health checks launching nested containers while a
> > container is being destroyed lead to unkillable tasks.
> >   * [MESOS-9142] - CNI detach might fail due to missing network config
> > file.
> >   * [MESOS-9144] - Master authentication handling leads to request
> > amplification.
> >   * [MESOS-9145] - Master has a fragile burned-in 5s authentication
> > timeout.
> >   * [MESOS-9146] - Agent has a fragile burn-in 5s authentication timeout.
> >   * [MESOS-9147] - Agent and scheduler driver authentication retry
> backoff
> > time could overflow.
> >   * [MESOS-9151] - Container stuck at ISOLATING due to FD leak.
> >   * [MESOS-9170] - Zookeeper doesn't compile with newer gcc due to format
> > error.
> >   * [MESOS-9196] - Removing rootfs mounts may fail with EBUSY.
> >   * [MESOS-9231] - `docker inspect` may return an unexpected result to
> > Docker executor due to a race condition.
> >   * [MESOS-9267] - Mesos agent crashes when CNI network is not configured
> > but used.
> >   * [MESOS-9279] - Docker Containerizer 'usage' call might be expensive
> if
> > mount table is big.
> >   * [MESOS-9283] - Docker containerizer actor can get backlogged with
> > large number of containers.
> >   * [MESOS-9305] - Create cgroup recursively to workaround systemd
> deleting
> > cgroups_root.
> >   * [MESOS-9308] - URI disk profile adaptor could deadlock.
> >   * [MESOS-9334] - Container stuck at ISOLATING state due to libevent
> poll
> > never returns.
> >
> > The CHANGELOG for the release is available at:
> >
> >
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.2-rc2
> >
> >
> 
> >
> > The candidate for Mesos 1.5.2 release is available at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.5.2-rc2/mesos-1.5.2.tar.gz
> >
> > The tag to be voted on is 1.5.2-rc2:
> > 

[RESULT][VOTE] Release Apache Mesos 1.7.0 (rc3)

2018-09-19 Thread Chun-Hung Hsiao
Hi all,

The vote for Mesos 1.7.0 (rc3) has passed with the
following votes.

+1 (Binding)
--
*** Alex Rukletsov
*** Kapil Arya
*** James Peach

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.7.0

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0

The mesos-1.7.0.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Chun-Hung & Gaston


Re: Follow up to discussion regarding use : in paths on Windows (MESOS-9109)

2018-09-14 Thread Chun-Hung Hsiao
It seems we have the following issues w.r.t. path generation:

1. Path separators are disallowed:
This is general to all systems, so we'll need to add a
platform-independent check. But since no one's doing this, we can put it
into the backlog.
2. Other invalid characters on different platforms:
For now let's just focus on Windows, since Un*x doesn't have any
restriction other than /.
But since we're already working on this issue, how about resolving all
of 0x00-0x1F 0x7F " * / : < > ? | at once?
This can be Windows-specific for now, as proposed by Andy.
3. Other path constraints, e.g., invalid sequences of valid characters:
This is platform-dependent but the problem is there for both Un*x and
Windows. We can resolve this along with 1 later.

As long as the way we resolve 2 (i.e., the encoding/decoding process) won't
introduce any compatibility problem in the future,
I'm fine with only fixing 2 for now and following up with a cleanup later.
To be conservative, if we're sure that there's no existing framework using
% in its ID,
does it make sense to add a check for now to ensure that?
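
To show why the % question matters, here is a small sketch of percent-encoding only the characters listed in 2. The names `WINDOWS_RESERVED`, `encode_id` and `decode_id` are made up for illustration, and this is not the agent's actual code, just one possible reading of the proposal.

```python
import re

# Characters from item 2: 0x00-0x1F, 0x7F and " * / : < > ? |
# (hypothetical constant, analogous to a per-platform reserved set).
WINDOWS_RESERVED = (
    set(range(0x00, 0x20)) | {0x7F} | {ord(c) for c in '"*/:<>?|'}
)


def encode_id(identifier):
    """Percent-encode only the reserved characters; leave the rest as-is."""
    return "".join(
        "%{:02X}".format(ord(c)) if ord(c) in WINDOWS_RESERVED else c
        for c in identifier
    )


def decode_id(encoded):
    """Blindly decode every %XX sequence back to its character."""
    return re.sub(
        r"%([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        encoded,
    )
```

Since % itself is not encoded, the IDs `Hello:World` and `Hello%3AWorld` both map to the path component `Hello%3AWorld`, and blindly decoding it always yields `Hello:World`. That collision is why a check that no existing ID contains % (or reading the ID from the info file instead of the path) would be needed.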

On Tue, Sep 4, 2018 at 2:12 PM Andrew Schwartzmeyer <
and...@schwartzmeyer.com> wrote:

> I think your approach would be fairly sound. That is, to change the
> logic to read the IDs from the info file instead of the paths. But I
> also think we can punt this for now (as I do not think a task ID like
> 'Hello%3AWorld' is plausibly in use right now), and implement a fix for
> colons now that would remain compatible.
>
> If we add encode/decode logic for colons on Windows, we do not introduce
> backward compatibility issues on other platforms (as we'd constrain the
> change to Windows), and in the future, we can safely replace the decode
> logic with your approach. That is to say, we implement the encoding as
> sparingly as possible, but implement it now, because it's kind of
> required, and we implement the decoding only as a stop-gap until we
> replace this logic with reading from the info file instead. If we later
> find another character in use that also needs to be encoded, we can then
> abstract the single encoding into a per-platform encoding set.
>
> Does this seem reasonable?
>
> Thanks,
>
> Andy
>
> P.S. Sorry this took a while to get back to, I was out last week.
>
> On 08/23/2018 3:34 pm, Chun-Hung Hsiao wrote:
> > I'm a bit concerned about the recovery logic and backward
> > compatibility:
> > The changes we're making shouldn't affect existing users,
> > and we should try hard to avoid any future backward compatibility
> > problem.
> >
> > Say if there is already some custom framework using task ID
> > 'Hello%3AWorld',
> > then if we blindly decode the task path during recovery, we will get
> > the
> > wrong ID 'Hello:World'.
> > On the other hand, if we don't decode the task path during recovery,
> > then later on during checkpointing for the same task,
> > we shouldn't blindly encode the task ID, because it might create a
> > different path,
> > and we'll need to introduce some migration code to avoid such
> > duplication.
> >
> > Fortunately, we do checkpoint the executor IDs and task IDs in the info
> > files under the meta dir.
> > So I'm considering the following design to minimize the backward
> > compatibility issue we might have:
> > During recovery, we don't decode the recovered task path,
> > but get the executor/task ID from the info file instead of relying on
> > parsing the executor/task path.
> > When checkpointing, we only encode executor/task IDs if they contain
> > reserved characters.
> > The set of reserved characters could be defined as a platform-dependent
> > variable,
> > similar to what we have done for `PATH_SEPARATOR`.
> >
> > The above design would look a bit more complicated than just blindly
> > applying percent encoding
> > when constructing checkpoint paths, but it doesn't require extra
> > checkpoint migration logic,
> > and would keep the exact same behavior we have now for "normal"
> > executor/task IDs.
> >
> > What do you guys think? Please feel free to raise any concern :)
> > And we don't need to implement the whole thing for now.
> > For example, we could start with just dealing with colons,
> > and extend the implementation later on,
> > as long as the partial solution we're going to have right now doesn't
> > create future tech debts!
> >
> > Best,
> > Chun-Hung
> >
> > On Thu, Aug 23, 2018 at 1:42 PM Greg Mann  wrote:
> >
> >> Thanks Andy! Responses inlined below.
> >>
> >>
> >>
> >>> No: As the only character we've 

Re: [VOTE] Release Apache Mesos 1.7.0 (rc2)

2018-08-30 Thread Chun-Hung Hsiao
Hi all,

Because of https://issues.apache.org/jira/browse/MESOS-9193,
I'll vote -1 on RC2. I've put up some patches for Clang 3.5 support,
and will make another RC early next week.

Thanks,
Chun-Hung & Gaston

On Wed, Aug 29, 2018 at 7:18 PM Vinod Kone  wrote:

> I prefer 1) since you already have the fix.
>
> Thanks,
> Vinod
>
> > On Aug 29, 2018, at 8:44 PM, Chun-Hung Hsiao 
> wrote:
> >
> > I found two issues when compiling with clang 3.5:
> >
> > 1. The `-Wno-inconsistent-missing-override` option added in
> https://reviews.apache.org/r/67953/
> > is not recognized by clang 3.5.
> > 2. The same issue described in https://reviews.apache.org/r/55400/
> would make
> > `src/resource_provider/storage/provider.cpp` fail to compile.
> >
> > I put up two patches to resolve the above issues (no review posted yet):
> >
> https://github.com/chhsia0/mesos/commit/1f60aa3b3a7eede4a2a5ddf1288efff6a801ea97
> >
> https://github.com/chhsia0/mesos/commit/84d13a0468f34726e4a920915cdda7e0e0a829b8
> >
> > However, I'm not sure if this is worth blocking a release. We have 2
> options:
> > 1. Fail this vote and cut rc3 with the above patches to support clang
> 3.5.
> > 2. Keep rc2 but bump the version requirement for clang on the website.
> (If so, then the above patches are not needed.)
> >
> > I was wondering which option would be more appropriate so I'd like to
> ask for some feedback. Thanks!
> >
> >> On Wed, Aug 29, 2018 at 10:18 AM James Peach  wrote:
> >> +1 (binding)
> >>
> >> Built and tested on Fedora 28 (clang).
> >>
> >>> On Aug 24, 2018, at 4:42 PM, Chun-Hung Hsiao 
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
> >>>
> >>>
> >>> 1.7.0 includes the following:
> >>>
> 
> >>> * Performance Improvements:
> >>>   * Master `/state` endpoint: ~130% throughput improvement through
> RapidJSON
> >>>   * Allocator: Improved allocator cycle significantly
> >>>   * Agent `/containers` endpoint: Fixed a performance issue
> >>>   * Agent container launch / destroy throughput is significantly
> improved
> >>> * Containerization:
> >>>   * **Experimental** Supported docker image tarball fetching from HDFS
> >>>   * Added new `cgroups/all` and `linux/devices` isolators
> >>>   * Added metrics for `network/cni` isolator and docker pull latency
> >>> * Windows:
> >>>   * Added support to libprocess for the Windows Thread Pool API
> >>> * Multi-Framework Workloads:
> >>>   * **Experimental** Added per-framework metrics to the master
> >>>   * A new weighted random sorter was added as an alternative to the
> DRF sorter
> >>>
> >>> The CHANGELOG for the release is available at:
> >>>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc2
> >>>
> 
> >>>
> >>> The candidate for Mesos 1.7.0 release is available at:
> >>>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz
> >>>
> >>> The tag to be voted on is 1.7.0-rc2:
> >>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc2
> >>>
> >>> The SHA512 checksum of the tarball can be found at:
> >>>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.sha512
> >>>
> >>> The signature of the tarball can be found at:
> >>>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.asc
> >>>
> >>> The PGP key used to sign the release is here:
> >>> https://dist.apache.org/repos/dist/release/mesos/KEYS
> >>>
> >>> The JAR is in a staging repository here:
> >>> https://repository.apache.org/content/repositories/orgapachemesos-1233
> >>>
> >>> Please vote on releasing this package as Apache Mesos 1.7.0!
> >>>
> >>> The vote is open until Mon Aug 27 16:37:35 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
> >>>
> >>> [ ] +1 Release this package as Apache Mesos 1.7.0
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>> Thanks,
> >>> Chun-Hung & Gaston
> >>
>


Re: [VOTE] Release Apache Mesos 1.7.0 (rc2)

2018-08-29 Thread Chun-Hung Hsiao
I found two issues when compiling with clang 3.5:

1. The `-Wno-inconsistent-missing-override` option added in
https://reviews.apache.org/r/67953/
is not recognized by clang 3.5.
2. The same issue described in https://reviews.apache.org/r/55400/ would
make
`src/resource_provider/storage/provider.cpp` fail to compile.

I put up two patches to resolve the above issues (no review posted yet):
https://github.com/chhsia0/mesos/commit/1f60aa3b3a7eede4a2a5ddf1288efff6a801ea97
https://github.com/chhsia0/mesos/commit/84d13a0468f34726e4a920915cdda7e0e0a829b8

However, I'm not sure if this is worth blocking a release. We have 2
options:
1. Fail this vote and cut rc3 with the above patches to support clang 3.5.
2. Keep rc2 but bump the version requirement for clang on the website. (If
so, then the above patches are not needed.)

I was wondering which option would be more appropriate, so I'd like to ask
for some feedback. Thanks!

On Wed, Aug 29, 2018 at 10:18 AM James Peach  wrote:

> +1 (binding)
>
> Built and tested on Fedora 28 (clang).


Re: [VOTE] Release Apache Mesos 1.7.0 (rc2)

2018-08-28 Thread Chun-Hung Hsiao
Folks,

This is a gentle reminder for 1.7.0-rc2.
The vote is open until Wed Aug 29 23:59:59 PDT 2018 and passes if a
majority of at least 3 +1 PMC votes are cast.

Thanks!

On Fri, Aug 24, 2018, 4:45 PM Chun-Hung Hsiao  wrote:

> Hi all,
>
> Since there will be a weekend during the vote period,
> the vote will be open until Wed Aug 29 23:59:59 PDT 2018,
> so we can have more time testing.
>
> Best,
> Chun-Hung


Re: [VOTE] Release Apache Mesos 1.7.0 (rc2)

2018-08-24 Thread Chun-Hung Hsiao
Hi all,

Since there will be a weekend during the vote period,
the vote will be open until Wed Aug 29 23:59:59 PDT 2018,
so we can have more time testing.

Best,
Chun-Hung



[VOTE] Release Apache Mesos 1.7.0 (rc2)

2018-08-24 Thread Chun-Hung Hsiao
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.7.0.


1.7.0 includes the following:

* Performance Improvements:
  * Master `/state` endpoint: ~130% throughput improvement through RapidJSON
  * Allocator: Improved allocator cycle significantly
  * Agent `/containers` endpoint: Fixed a performance issue
  * Agent container launch / destroy throughput is significantly improved
* Containerization:
  * **Experimental** Supported docker image tarball fetching from HDFS
  * Added new `cgroups/all` and `linux/devices` isolators
  * Added metrics for `network/cni` isolator and docker pull latency
* Windows:
  * Added support to libprocess for the Windows Thread Pool API
* Multi-Framework Workloads:
  * **Experimental** Added per-framework metrics to the master
  * A new weighted random sorter was added as an alternative to the DRF sorter

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc2


The candidate for Mesos 1.7.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz

The tag to be voted on is 1.7.0-rc2:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc2

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1233

Please vote on releasing this package as Apache Mesos 1.7.0!

The vote is open until Mon Aug 27 16:37:35 PDT 2018 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.7.0
[ ] -1 Do not release this package because ...

Thanks,
Chun-Hung & Gaston


Re: Follow up to discussion regarding use : in paths on Windows (MESOS-9109)

2018-08-23 Thread Chun-Hung Hsiao
I'm a bit concerned about the recovery logic and backward compatibility:
The changes we're making shouldn't affect existing users,
and we should try hard to avoid any future backward compatibility problem.

Suppose some custom framework already uses the task ID 'Hello%3AWorld'.
If we blindly decode the task path during recovery, we will get the
wrong ID 'Hello:World'.
On the other hand, if we don't decode the task path during recovery,
then later on, when checkpointing the same task,
we shouldn't blindly encode the task ID either, because that might create a
different path,
and we would need to introduce migration code to avoid such duplication.

Fortunately, we do checkpoint the executor IDs and task IDs in the info
files under the meta dir.
So I'm considering the following design to minimize the backward
compatibility issue we might have:
During recovery, we don't decode the recovered task path,
but get the executor/task ID from the info file instead of relying on
parsing the executor/task path.
When checkpointing, we only encode executor/task IDs if they contain
reserved characters.
The set of reserved characters could be defined as a platform-dependent
variable,
similar to what we have done for `PATH_SEPARATOR`.

The above design looks a bit more complicated than just blindly
applying percent encoding
when constructing checkpoint paths, but it doesn't require extra
checkpoint migration logic,
and it keeps exactly the same behavior we have now for "normal"
executor/task IDs.
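
A minimal sketch of the design above, written in Python for brevity rather than in Mesos's C++. Everything named here is hypothetical: `RESERVED_CHARACTERS`, `encode_id_for_path`, `recover_task_id`, the `task_id` key, and the per-platform character sets are illustrative assumptions, not existing Mesos code. The last lines show why blindly decoding a recovered path would be wrong.

```python
from urllib.parse import unquote

# Hypothetical platform-dependent reserved sets (assumed values for illustration).
RESERVED_CHARACTERS = {
    "windows": set('<>:"/\\|?*'),
    "posix": set("/"),
}


def encode_id_for_path(task_id: str, platform: str) -> str:
    """Percent-encode only the characters reserved on the given platform."""
    reserved = RESERVED_CHARACTERS[platform]
    if not any(c in reserved for c in task_id):
        # "Normal" IDs produce exactly the same path as today.
        return task_id
    return "".join(
        "%{:02X}".format(ord(c)) if c in reserved else c for c in task_id
    )


def recover_task_id(path_component: str, info: dict) -> str:
    """Recover the ID from the checkpointed info, never by decoding the path."""
    # path_component is deliberately ignored: it may or may not be encoded.
    return info["task_id"]


# An ID containing a colon is encoded only where the colon is reserved.
windows_path = encode_id_for_path("Hello:World", "windows")
posix_path = encode_id_for_path("Hello:World", "posix")

# A pre-existing ID that already looks percent-encoded is left untouched.
legacy_path = encode_id_for_path("Hello%3AWorld", "windows")

# Blind decoding during recovery would turn the legacy ID into a different one.
blindly_decoded = unquote(legacy_path)
recovered = recover_task_id(legacy_path, {"task_id": "Hello%3AWorld"})
```

This sketch does not escape `%` itself, so on a platform where `:` is reserved the IDs 'Hello:World' and 'Hello%3AWorld' would map to the same path component; how to handle that case is not specified in this thread.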

What do you think? Please feel free to raise any concerns :)
And we don't need to implement the whole thing for now.
For example, we could start with just dealing with colons,
and extend the implementation later on,
as long as the partial solution we're going to have right now doesn't
create future tech debt!

Best,
Chun-Hung

On Thu, Aug 23, 2018 at 1:42 PM Greg Mann  wrote:

> Thanks Andy! Responses inlined below.
>
>
>
>> No: As the only character we've run into a problem with is `:`
>> (MESOS-9109), it might not be worth it to generalize this to solve a bunch
>> of problems that we haven't encountered.
>>
>>
> It's true that I'm not aware of other scenarios where
> filesystem-disallowed characters in task/executor IDs have caused issues
> for users, and this issue has existed for a long time. However, when
> feasible I would like to fix issues that we're aware of before they cause
> problems for users, rather than after. I would suggest that since we have
> one compelling case that we need to address now, it's worth formulating an
> approach for the general case, so that we can be sure any current work
> doesn't get in our way later on.
>
>
>> I'm somewhat comfortable doing so only for Windows, as we don't really
>> need to worry about the recovery scenario; but very uncomfortable about
>> doing so for Linux etc., for precisely that reason.
>>
>> So expanding this is definitely up for debate; but we must fix the bug
>> with `:`.
>>
>>
> Indeed, addressing the general case may prove to be much more complex - I
> can certainly identify with this situation, where a fix for a smaller issue
> turns into a big project :)
> It may turn out to be possible to implement a scoped-down solution for the
> colon case now, and extend it later on. I think it would be good if we
> could at least get an idea of how we want to handle the general case now,
> so that any short-term solutions can be a constructive step toward the
> long-term.
>
> Cheers,
> G
>


Re: [VOTE] Release Apache Mesos 1.7.0 (rc1)

2018-08-22 Thread Chun-Hung Hsiao
Hi all,

The URL for the JAR in the previous email is incorrect.
The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1232

Thanks,
Chun-Hung

On Tue, Aug 21, 2018 at 7:34 PM Chun-Hung Hsiao  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
>
>
> 1.7.0 includes the following:
>
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through
> RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF
> sorter
> * Bug fixes: 84 bugs fixed, including 20 critical ones.
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc1
>
> 
>
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz
>
> The tag to be voted on is 1.7.0-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc1
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
>
> https://repository.apache.org/service/local/repositories/orgapachemesos-1232/
>
> Please vote on releasing this package as Apache Mesos 1.7.0!
>
> The vote is open until Fri Aug 24 19:16:39 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Chun-Hung & Gaston
>


[VOTE] Release Apache Mesos 1.7.0 (rc1)

2018-08-21 Thread Chun-Hung Hsiao
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.7.0.


1.7.0 includes the following:

* Performance Improvements:
  * Master `/state` endpoint: ~130% throughput improvement through RapidJSON
  * Allocator: Improved allocator cycle significantly
  * Agent `/containers` endpoint: Fixed a performance issue
  * Agent container launch / destroy throughput is significantly improved
* Containerization:
  * **Experimental** Supported docker image tarball fetching from HDFS
  * Added new `cgroups/all` and `linux/devices` isolators
  * Added metrics for `network/cni` isolator and docker pull latency
* Windows:
  * Added support to libprocess for the Windows Thread Pool API
* Multi-Framework Workloads:
  * **Experimental** Added per-framework metrics to the master
  * A new weighted random sorter was added as an alternative to the DRF sorter
* Bug fixes: 84 bugs fixed, including 20 critical ones.

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc1


The candidate for Mesos 1.7.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz

The tag to be voted on is 1.7.0-rc1:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc1

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/service/local/repositories/orgapachemesos-1232/

Please vote on releasing this package as Apache Mesos 1.7.0!

The vote is open until Fri Aug 24 19:16:39 PDT 2018 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.7.0
[ ] -1 Do not release this package because ...

Thanks,
Chun-Hung & Gaston


Update: Mesos 1.7.0 Release

2018-08-13 Thread Chun-Hung Hsiao
Hi folks,

I just created a new 1.7.x branch from the master.
If you are committing patches for any 1.7.0 issues,
please backport them to the 1.7.x branch and update the CHANGELOG.

Currently there are still 12 unresolved issues targeting 1.7.0:
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12333125
Since 7 of them are blocker issues related to critical bugs that should be
fixed in 1.7.0,
Gaston and I have decided to postpone the 1.7.0 cut for one week.
We're aiming to cut the 1.7.0 release candidate on Monday, August 20th.
Hopefully this will give all of us enough time to resolve all 1.7.0 issues.
If you have any concerns, please feel free to let Gaston or me know.

The following lists all tickets that have been excluded from 1.7.0.
If you have any concerns about these tickets, please let us know as well.

Tickets untargeted from 1.7.0 since no progress has been made since 1.4.0:
  * [MESOS-6394] - Improvements to partition-aware Mesos frameworks.
  * [MESOS-6843] - Fetcher should not assume stdout/stderr in the sandbox.
  * [MESOS-7103] - Container Attach/Exec Improvements
  * [MESOS-7278] - Implement configuration reader/writer for the new CLI
  * [MESOS-7317] - Add master endpoint to deactivate / activate agent
  * [MESOS-7404] - Ensure hierarchical roles work with old Mesos agents
  * [MESOS-7473] - Use "-dev" prerelease label for version during
development
  * [MESOS-7563] - Make the HTTP command executor the default
implementation.
  * [MESOS-7705] - Reconsider restricting the resource format for
frameworks.
  * [MESOS-8275] - Remove use of ::_stat on Windows
  * [MESOS-8718] - Add the fields `ExposedPorts` and `Volumes` into Docker
v1 image spec
  * [MESOS-8789] - Role-related endpoints should display distinct offered
and allocated resources.
  * [MESOS-8790] - Deprecate Role::resources in favor of Role::allocated
and Role::offered.

Tickets retargeted to 1.8.0:
  * [MESOS-7141] - Support hook scripts to customize actions for
container's lifecycle
  * [MESOS-7428] - Report exit code of tasks from default and command
executors
  * [MESOS-7776] - Document `MESOS_CONTAINER_IP`
  * [MESOS-7882] - Mesos master rescinds all the in-flight offers from all
the registered agents when a new maintenance schedule is posted for a
subset of slaves
  * [MESOS-7950] - Update autotools and CMake to build in C++14 mode.
  * [MESOS-7967] - Make `mesos-execute` work with old-style resources
  * [MESOS-7974] - Accept "application/recordio" type is rejected for
master operator API SUBSCRIBE call
  * [MESOS-8068] - Non-revocable bursting over quota guarantees via limits.
  * [MESOS-8456] - Allocator should allow roles to burst above guarantees
but below limits.
  * [MESOS-8470] - CHECK failure in DRFSorter due to invalid framework name.
  * [MESOS-8509] - Launching a Docker container with `--restart=always` may
cause the Docker container is running after the task completes
  * [MESOS-8515] - Docker containerizer does not recover the executor pid
  * [MESOS-8516] - Deprecate mesos.native Python module.
  * [MESOS-8545] -
AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
  * [MESOS-8560] - Test resource provider selection for URI disk profile
adaptor.
  * [MESOS-8561] - Test profile checkpointing in SLRP
  * [MESOS-8582] - Add a way to make sure an agent always knows the full
framework information of all frameworks executing operations on its
resources
  * [MESOS-8621] - Add a `/debug` allocator endpoint to expose allocator
state for debugging.
  * [MESOS-8652] - Consider adding a `filesystem/csi` isolator.
  * [MESOS-8745] - Add a `LIST_RESOURCE_PROVIDER_CONFIGS` agent API call.
  * [MESOS-8824] - Send the task's latest "status update state" to
frameworks when an unreachable agent reregisters.
  * [MESOS-8972] - when choose docker image use user network all mesos
agent crash
  * [MESOS-9003] - Allow storage resource providers to consume given CSI
endpoints.
  * [MESOS-9004] - Add unit tests for dropping operations during SLRP
reconciliation.
  * [MESOS-9019] - Validate that container paths are unique in
`ContainerInfo.volumes`.
  * [MESOS-9116] - Launch nested container session fails due to incorrect
detection of `mnt` namespace of command executor's task.
  * [MESOS-9141] - Consider adding restrictions to disk profile names.

Thanks for your work!


Mesos 1.7.0 Release

2018-08-06 Thread Chun-Hung Hsiao
Hi folks,

We are considering cutting the 1.7.0 release on Monday, August 13th,
since there are not many blocker or critical issues targeting 1.7.0.
We currently have 1 blocker, 1 critical issue and 45 major issues on the 1.7.0
release dashboard:
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12333125

It's OK if we need to push the date back, but if anyone would like to land
anything in 1.7.0,
please let Gaston or me know in advance, mark the issues as blockers, and
make sure they
are targeting 1.7.0. If you're fine with not landing the issues you're working
on in 1.7.0,
please remove 1.7.0 from their target versions.

Thanks for all your work!


Re: 1.7 release manager?

2018-07-25 Thread Chun-Hung Hsiao
If there's no objection, Gastón and I will manage the 1.7 release together
then :)

We're aiming to do the release around mid August to keep up with the
release policy:
http://mesos.apache.org/documentation/latest/versioning/
If you're working on something that should land on 1.7,
please let us know so we could figure out a reasonable schedule. Thanks!

On Tue, Jul 17, 2018 at 2:49 PM Gastón Kleiman  wrote:

> I haven't been waiting for this, but I'd like to volunteer too. I've never
> been the release manager before, so I could help with some of the
> pre-release JIRA wrangling and drafting the announcements.
>
> -Gastón
>
> On Tue, Jul 17, 2018 at 2:26 PM Vinod Kone  wrote:
>
> > +dev
> >
> > -- Vinod
> >
> >
> > On Tue, Jul 17, 2018 at 4:20 PM Chun-Hung Hsiao 
> > wrote:
> >
> > > I could volunteer unless someone has been waiting for this :)
> > >
> > > On Tue, Jul 17, 2018 at 2:09 PM Greg Mann  wrote:
> > >
> > > > Hey folks!
> > > > The question just came up here in the office: who is managing the
> 1.7.0
> > > > release? 1.6.0 came out on May 11, so according to our quarterly
> > release
> > > > policy, we should aim for 1.7 to come out some time around
> mid-August.
> > > >
> > > > AFAIK, nobody has volunteered yet? I thought I'd start a thread to
> see
> > if
> > > > anybody is interested - any volunteers?
> > > >
> > > > Cheers,
> > > > Greg
> > > >
> > >
> >
>


Re: Operations Working Group - First Meeting

2018-07-17 Thread Chun-Hung Hsiao
Unfortunately the time conflicts with the CSI community sync so I'll have
to skip :(

On Tue, Jul 17, 2018 at 2:55 AM Abel Souza  wrote:

> Thank you for setting this up Gaston,
>
> Would you mind giving us a brief of what you have in mind for discussion?
>
> Thank you,
>
> Abel
>
> On 07/17/2018 10:52 AM, Matt Jarvis wrote:
>
> That's great news Gaston ! Let me know if you need any help from the
> Community team.
>
> Matt
>
> On Tue, 17 Jul 2018, 05:04 Gastón Kleiman,  wrote:
>
>> Hi all,
>>
>> Thank you for responding to my previous emails - I think that we have
>> quorum for a new working group!
>>
>> Many of you who have expressed interest seem to be in Europe, so I tried
>> to schedule the first meeting at a time that I hope will be friendly for
>> people in both GMT+1 and GMT-8:
>>
>> *Date:* Wednesday July 25th from 9:00-10:00 AM PDT
>> *Agenda:*
>> https://docs.google.com/document/d/1XjJfoksz70vbTvvT6FQ1t_J0SD1XIoipmYSvEHJfXt8/
>> *Zoom URL:* https://zoom.us/j/310132146
>> 
>>
>> You can also find the event in the Mesos Community Calendar.
>>
>> Feel free to add more topics to the agenda.
>>
>> Looking forward to meeting you all next week,
>>
>> -Gastón
>>
>
>


Re: Backport Policy

2018-07-17 Thread Chun-Hung Hsiao
I just have a comment on a special case:
If a fix for a flaky test is easy to backport,
IMO we probably should backport it,
otherwise if someone backports another critical fix in the future,
it would take them extra effort to check all CI failures.

On Mon, Jul 16, 2018 at 11:39 AM Vinod Kone  wrote:

> I like how you summarized it Greg and I would vote for leaving the decision
> to the committer too. In addition to what others mentioned, I think the
> committer should have the responsibility because if things break in a point
> release (after it is released), it is the committer and contributor who are
> on the hook to triage and fix it and not the release manager.
>
> Having said that, if "during" the release process (i.e., cutting an RC)
> these backports cause delays for a release manager in getting the release
> out (e.g., CI flakiness introduced due to backports), release manager could
> be the ultimate arbiter on whether such a backport should be reverted or
> fixed by the committer/contributor. Hopefully such issues are caught much
> before a release process is started (e.g., CI running against release
> branches).
>
> On Mon, Jul 16, 2018 at 1:28 PM Jie Yu  wrote:
>
> > Greg, I like your idea of adding a prescriptive "policy" when evaluating
> > whether a bug fix should be backported, and leave the decision to
> committer
> > (because they have the most context, and avoid a bottleneck in the
> > process).
> >
> > - Jie
> >
> > On Mon, Jul 16, 2018 at 11:24 AM, Greg Mann  wrote:
> >
> > > My impression is that we have two opposing schools of thought here:
> > >
> > >1. Backport as little as possible, to avoid unforeseen consequences
> > >2. Backport as much as proves practical, to eliminate bugs in
> > >supported versions
> > >
> > > Do other people agree with this assessment?
> > >
> > > If so, how can we find common ground? One possible solution would be to
> > > leave the decision on backporting up to the committer, without
> > specifying a
> > > project-wide policy. This seems to be the status quo, and would lead to
> > > some variation across committers regarding what types of fixes are
> > > backported. We could also choose to delegate the decision to the
> release
> > > manager; I favor leaving the decision with the committer, to eliminate
> > the
> > > burden on release managers.
> > >
> > > Here's a thought: rather than defining a prescriptive "policy" that we
> > > expect committers to abide by, we could enumerate in the documentation
> > the
> > > competing concerns that we expect committers to consider when making
> > > decisions on backports. The committing docs could read something like:
> > >
> > > "When bug fixes are committed to master, the committer should evaluate
> > the
> > > fix to determine whether or not it should be backported to supported
> > > versions. This is left to the committer, but they are expected to weigh
> > the
> > > following concerns when making the decision:
> > >
> > >- Every backported change comes with a risk of unintended
> > >consequences. The change should be carefully evaluated to ensure
> that
> > such
> > >side-effects are highly unlikely.
> > >- As the complexity of applying a backport increases due to merge
> > >conflicts, the likelihood of unintended consequences also increases.
> > Bug
> > >fixes which require extensive rebasing should only be backported
> when
> > the
> > >bug is critical enough to warrant the risk.
> > >- Users of supported versions benefit greatly from the resolution of
> > >bugs in point releases. Thus, whenever concerns #1 and #2 can be
> > allayed
> > >for a given bug fix, it should be backported."
> > >
> > >
> > > Cheers,
> > > Greg
> > >
> > >
> > > On Mon, Jul 16, 2018 at 3:06 AM, Alex Rukletsov 
> > > wrote:
> > >
> > >> Back porting as little as possible is the ultimate goal for me. My
> > >> reasons are closely aligned with what Andrew wrote above.
> > >>
> > >> If we agree on this strategy, the next question is how to enforce it.
> My
> > >> intuition is that committers will lean towards back porting their
> > patches
> > >> in arguable cases, because humans tend to overestimate the importance
> of
> > >> their personal work. Delegating the decision in such cases to a
> release
> > >> manager in my opinion will help us enforce the strategy of minimal
> > number
> > >> backports. As a bonus, the release manager will have a much better
> > >> understanding of what's going on with the release, keyword: "more
> > >> ownership".
> > >>
> > >> On Sat, Jul 14, 2018 at 12:07 AM, Andrew Schwartzmeyer <
> > >> and...@schwartzmeyer.com> wrote:
> > >>
> > >>> I believe I fall somewhere between Alex and Ben.
> > >>>
> > >>> As for deciding what to backport or not, I lean toward Alex's view of
> > >>> backporting as little as possible (and agree with his criteria). My
> > >>> reasoning is that all changes can have unforeseen consequences,
> which I
> > >>> believe is something to be 

Re: [VOTE] Release Apache Mesos 1.6.1 (rc2)

2018-07-13 Thread Chun-Hung Hsiao
+1 (binding)

Tested on our internal CI. All green.
Tested on my Mac with both autotools and CMake, with gRPC enabled.
Failed tests:

HealthCheckTest.ROOT_INTERNET_CURL_HealthyTaskViaHTTPWithContainerImage
HealthCheckTest.ROOT_INTERNET_CURL_HealthyTaskViaHTTPSWithContainerImage
HealthCheckTest.ROOT_INTERNET_CURL_HealthyTaskViaTCPWithContainerImage
FetcherCacheTest.LocalUncachedExtract
FetcherCacheHttpTest.HttpMixed
MesosContainerizer/DefaultExecutorTest.ROOT_INTERNET_CURL_DockerTaskWithFileURI
MesosContainerizer/DefaultExecutorTest.ROOT_LaunchGroupFailure
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_PersistentResources
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskSandboxPersistentVolume
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TasksSharingViaSandboxVolumes
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskGroupsSharingViaSandboxVolumes
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_HealthCheckUsingPersistentVolume

All of the above tests require the `filesystem/linux` isolator so are
supposed to fail on a Mac.

On Thu, Jul 12, 2018 at 8:50 AM Greg Mann  wrote:

> Yep, I trimmed the list to make it more digestible and provided a list of
> "notable" bug fixes. The entire list of changes can be found in the
> CHANGELOG.
>
> On Thu, Jul 12, 2018, 7:56 AM Chun-Hung Hsiao 
> wrote:
>
>> Seems you missed MESOS-9049. And this seems not just a bug fix release
>> because of MESOS-8934? ;)
>>
>> On Wed, Jul 11, 2018, 9:37 PM Greg Mann  wrote:
>>
>>> Whoops, I forgot to include the list of changes included in this release
>>> - sorry!
>>>
>>> 1.6.1-rc2 includes the following notable bug fixes:
>>>
>>>   * [MESOS-3790] - ZooKeeper connection should retry on `EAI_NONAME`.
>>>   * [MESOS-8830] - Agent gc on old slave sandboxes could empty
>>> persistent volume data
>>>   * [MESOS-8871] - Agent may fail to recover if the agent dies before
>>> image store cache checkpointed.
>>>   * [MESOS-8904] - Master crash when removing quota.
>>>   * [MESOS-8936] - Implement a Random Sorter for offer allocations.
>>>   * [MESOS-8945] - Master check failure due to CHECK_SOME(providerId).
>>>   * [MESOS-8963] - Executor crash trying to print container ID.
>>>   * [MESOS-8980] - mesos-slave can deadlock with docker pull.
>>>   * [MESOS-8986] - `slave.available()` in the allocator is expensive and
>>> drags down allocation performance.
>>>   * [MESOS-8987] - Master asks agent to shutdown upon auth errors.
>>>   * [MESOS-9002] - GCC 8.1 build failure in os::Fork::Tree.
>>>   * [MESOS-9024] - Mesos master segfaults with stack overflow under load.
>>>   * [MESOS-9025] - The container which joins CNI network and has
>>> checkpoint enabled will be mistakenly destroyed by agent.
>>>
>>> Cheers,
>>> Greg
>>>
>>> On Wed, Jul 11, 2018 at 6:15 PM, Greg Mann  wrote:
>>>
>>>> Hi all,
>>>>
>>>> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>>>>
>>>>
>>>> 1.6.1 includes the following:
>>>>
>>>> 
>>>> *Announce major features here*
>>>> *Announce major bug fixes here*
>>>>
>>>> The CHANGELOG for the release is available at:
>>>>
>>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc2
>>>>
>>>> 
>>>>
>>>> The candidate for Mesos 1.6.1 release is available at:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz
>>>>
>>>> The tag to be voted on is 1.6.1-rc2:
>>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc2
>>>>
>>>> The SHA512 checksum of the tarball can be found at:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.sha512
>>>>
>>>> The signature of the tarball can be found at:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.asc
>>>>
>>>> The PGP key used to sign the release is here:
>>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>>
>>>> The JAR is in a staging repository here:
>>>> https://repository.apache.org/content/repositories/orgapachemesos-1230
>>>>
>>>> Please vote on releasing this package as Apache Mesos 1.6.1!
>>>>
>>>> The vote is open until Mon Jul 16 18:15:00 PDT 2018 and passes if a
>>>> majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Mesos 1.6.1
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> Thanks,
>>>> Greg
>>>>
>>>
>>>


Re: [VOTE] Release Apache Mesos 1.6.1 (rc2)

2018-07-12 Thread Chun-Hung Hsiao
Seems you missed MESOS-9049. And this seems not just a bug fix release
because of MESOS-8934? ;)

On Wed, Jul 11, 2018, 9:37 PM Greg Mann  wrote:

> Whoops, I forgot to include the list of changes included in this release -
> sorry!
>
> 1.6.1-rc2 includes the following notable bug fixes:
>
>   * [MESOS-3790] - ZooKeeper connection should retry on `EAI_NONAME`.
>   * [MESOS-8830] - Agent gc on old slave sandboxes could empty persistent
> volume data
>   * [MESOS-8871] - Agent may fail to recover if the agent dies before
> image store cache checkpointed.
>   * [MESOS-8904] - Master crash when removing quota.
>   * [MESOS-8936] - Implement a Random Sorter for offer allocations.
>   * [MESOS-8945] - Master check failure due to CHECK_SOME(providerId).
>   * [MESOS-8963] - Executor crash trying to print container ID.
>   * [MESOS-8980] - mesos-slave can deadlock with docker pull.
>   * [MESOS-8986] - `slave.available()` in the allocator is expensive and
> drags down allocation performance.
>   * [MESOS-8987] - Master asks agent to shutdown upon auth errors.
>   * [MESOS-9002] - GCC 8.1 build failure in os::Fork::Tree.
>   * [MESOS-9024] - Mesos master segfaults with stack overflow under load.
>   * [MESOS-9025] - The container which joins CNI network and has
> checkpoint enabled will be mistakenly destroyed by agent.
>
> Cheers,
> Greg
>
> On Wed, Jul 11, 2018 at 6:15 PM, Greg Mann  wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>>
>>
>> 1.6.1 includes the following:
>>
>> 
>> *Announce major features here*
>> *Announce major bug fixes here*
>>
>> The CHANGELOG for the release is available at:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc2
>>
>> 
>>
>> The candidate for Mesos 1.6.1 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz
>>
>> The tag to be voted on is 1.6.1-rc2:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc2
>>
>> The SHA512 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.sha512
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1230
>>
>> Please vote on releasing this package as Apache Mesos 1.6.1!
>>
>> The vote is open until Mon Jul 16 18:15:00 PDT 2018 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.6.1
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Greg
>>
>
>


Re: [VOTE] Release Apache Mesos 1.6.1 (rc1)

2018-06-29 Thread Chun-Hung Hsiao
-1 on https://issues.apache.org/jira/browse/MESOS-8830.

This is a critical bug that would wipe out persistent data. I'm backporting
the fix to 1.4, 1.5 and 1.6.

On Fri, Jun 29, 2018 at 9:05 AM Greg Mann  wrote:

> The failures here are mostly command executor/default executor tests.
> Looking at the test output, it seems that the tasks in these tests failed
> to start successfully and send task status updates. I haven't seen this
> issue on our internal CI; I'll try to re-run the build on ASF CI and if the
> failures occur again, investigate why that environment is experiencing this
> problem.
>
> -Greg
>
> On Wed, Jun 27, 2018 at 1:58 PM, Vinod Kone  wrote:
>
>> Hmm. Lot of tests failed when I ran this through ASF CI. Not sure if all
>> of these are known flaky tests?
>>
>>
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/50/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
>>
>>
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/50/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
>>
>> On Wed, Jun 27, 2018 at 11:59 AM Jie Yu  wrote:
>>
>>> +1
>>>
>>> Passed on our internal CI that has the following matrix. I looked into
>>> the only failed test, looks to be a flaky test due to a race in the test.
>>>
>>>
>>>
>>> On Tue, Jun 26, 2018 at 7:02 PM, Greg Mann  wrote:
>>>
>>>> Hi all,
>>>>
>>>> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>>>>
>>>>
>>>> 1.6.1 includes the following:
>>>>
>>>> *Announce major features here*
>>>> *Announce major bug fixes here*
>>>>
>>>> The CHANGELOG for the release is available at:
>>>>
>>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc1
>>>>
>>>> The candidate for Mesos 1.6.1 release is available at:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz
>>>>
>>>> The tag to be voted on is 1.6.1-rc1:
>>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc1
>>>>
>>>> The SHA512 checksum of the tarball can be found at:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz.sha512
>>>>
>>>> The signature of the tarball can be found at:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz.asc
>>>>
>>>> The PGP key used to sign the release is here:
>>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>>
>>>> The JAR is in a staging repository here:
>>>> https://repository.apache.org/content/repositories/orgapachemesos-1229
>>>>
>>>> Please vote on releasing this package as Apache Mesos 1.6.1!
>>>>
>>>> The vote is open until Fri Jun 29 18:46:28 PDT 2018 and passes if a
>>>> majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Mesos 1.6.1
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> Thanks,
>>>> Greg
>>>>
>>>
>>>
>


Proposal: Changing `CREATE_VOLUME` and `CREATE_BLOCK` to `CREATE_DISK`.

2018-06-28 Thread Chun-Hung Hsiao
Hi folks,

*TL;DR*

I'm proposing a breaking API change on experimental offer operations, as
shown in the review request:
https://reviews.apache.org/r/67779/
Reasons:
1. "Volume" is overloaded and leads to conflicting/inconsistent naming.
2. The concept of "PATH" disks does not exist in CSI, which could be
problematic.

Please provide feedback or raise any concerns you have. Thanks!

*Introduction*

Mesos 1.5 introduced four new operations for better storage support through
CSI. These operations are:
  * CREATE_VOLUME converts RAW disks to MOUNT or PATH disks.
  * DESTROY_VOLUME converts MOUNT or PATH disks back to RAW disks.
  * CREATE_BLOCK converts RAW disks to BLOCK disks.
  * DESTROY_BLOCK converts BLOCK disks back to RAW disks.
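
As a rough illustration only, the four conversions listed above can be
modeled as a small state check. `DiskType`, `Operation` and `convert` are
hypothetical names made up for this sketch; they are not the Mesos protobuf
types or any actual Mesos function.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the Mesos protobuf types.
enum class DiskType { RAW, MOUNT, PATH, BLOCK };
enum class Operation {
  CREATE_VOLUME,
  DESTROY_VOLUME,
  CREATE_BLOCK,
  DESTROY_BLOCK
};

// Applies `op` to a disk of type `source`. `target` is only consulted by
// CREATE_VOLUME, which may produce either a MOUNT or a PATH disk.
// Returns true and stores the new type in `*result` if the operation is
// valid for `source`; returns false and leaves `*result` untouched otherwise.
bool convert(Operation op, DiskType source, DiskType target, DiskType* result)
{
  assert(result != nullptr);

  switch (op) {
    case Operation::CREATE_VOLUME:
      if (source != DiskType::RAW) return false;
      if (target != DiskType::MOUNT && target != DiskType::PATH) return false;
      *result = target;
      return true;
    case Operation::DESTROY_VOLUME:
      if (source != DiskType::MOUNT && source != DiskType::PATH) return false;
      *result = DiskType::RAW;
      return true;
    case Operation::CREATE_BLOCK:
      if (source != DiskType::RAW) return false;
      *result = DiskType::BLOCK;
      return true;
    case Operation::DESTROY_BLOCK:
      if (source != DiskType::BLOCK) return false;
      *result = DiskType::RAW;
      return true;
  }
  return false;
}
```

Every operation starts from or returns to a RAW disk, and the "volume" and
"block" pairs differ only in which non-RAW types they accept.
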
However, the following two issues are raised for these operations.

*Naming/terminology inconsistency*

In Mesos, we have loosely used the term "volume" to refer to persistent
volumes. For example, we have the following endpoints:
http://mesos.apache.org/documentation/latest/persistent-volume/#unversioned-operator-http-endpoints
We also have the corresponding ACLs: `ACL.CreateVolume` and
`ACL.DestroyVolume`.
However, the `CREATE_VOLUME` and `DESTROY_VOLUME` operations are not related
to persistent volumes at all, so it becomes hard to come up with intuitive
names for the corresponding ACLs.

On the other hand, we distinguish "volumes" from "blocks," which is
inconsistent with CSI. CSI has "block volumes" and "mount volumes," so when
an operator with CSI knowledge wants to use Mesos, the names may confuse
them.

Since these operations are still experimental, and AFAIK no third-party
framework other than ours uses these operations yet, I'd like
to propose a breaking API change to rename and combine these operations
into `CREATE_DISK` and `DESTROY_DISK`. Furthermore, we could refine the use
of "volume" to refer to "ROOT/PATH/MOUNT/BLOCK disk resources with metadata
that could be used by containers."

*PATH disks do not exist in CSI*

PATH disks are used to split local disks into smaller chunks so tasks can
use them concurrently and independently. We have isolators such as
"disk/du" or "xfs/disk" to enforce their usage capacity. This PATH concept,
however, does not exist in CSI, so relying on it could break with future CSI
changes. For example, to enforce usage, we will need to directly interact
with the filesystem on top of CSI volumes without involving CSI plugins.
Also, when we support non-local CSI volumes (such as EBS volumes) in the
future, those volumes cannot be split into small chunks to be
used on different agents.

Therefore, I propose that we remove PATH support for these storage
operations.

Best,
Chun-Hung


[Design Doc] External Resource Provider and CSI

2018-06-11 Thread Chun-Hung Hsiao
Folks,

As a natural extension to prior work [1, 2] to improve storage support in
Mesos, I'm working on the general design of external resource providers,
and the specific design for external storage support through CSI [3].
The goal is to enable Mesos to manage cluster-wide resources such as EBS
volumes and offer them to frameworks.

Please find the design doc through the following link:
https://docs.google.com/document/d/1c4allCaldqBOLlKzzgQiurvhqmPe59qqYswjtf4gf1s/edit?usp=sharing

We'll also discuss the design in tomorrow's API working group.
Thanks for taking the time to review the design and provide feedback!

Best,
Chun-Hung

[1]
https://docs.google.com/document/d/125YWqg_5BB5OY9a6M7LZcby5RSqBwo2PZzpVLuxYXh4/edit?usp=sharing
[2]
https://docs.google.com/document/d/1D4a-GNON8PCSIUx3pVoZXg1dUB_50ZAIlDFSnZ3xx6I/edit?usp=sharing
[3] https://github.com/container-storage-interface/spec


Re: [jira] [Commented] (MESOS-8927) Default executor cannot kill tasks if `LAUNCH_NESTED_CONTAINER` is stuck.

2018-05-16 Thread Chun-Hung Hsiao
I'm sorry for the duplicated messages. I accidentally pressed the wrong
keyboard shortcut twice :(

Unfortunately I don't have the log right now. IIRC the executor received
the `KILL` event because the log I saw contained this line:
https://github.com/apache/mesos/blob/7e11a2d39cc642944897d2480105dbfd860fa601/src/launcher/default_executor.cpp#L1236
But it didn't contain this line:
https://github.com/apache/mesos/blob/7e11a2d39cc642944897d2480105dbfd860fa601/src/launcher/default_executor.cpp#L1101

The log entries explaining why `LAUNCH_NESTED_CONTAINER` got stuck had
already been rotated out of the log file when I examined it.


On Wed, May 16, 2018 at 6:57 PM, Chun-Hung Hsiao <chhs...@mesosphere.io>
wrote:

> Unfortunately I don't have the log right now. IIRC the executor received
> the `KILL` event because the log I saw contained this line:
> https://github.com/apache/mesos/blob/7e11a2d39cc642944897d2480105dbfd860fa601/src/launcher/default_executor.cpp#L1236
> But it didn't contain this line:
>
> On Wed, May 16, 2018 at 6:18 PM, Vinod Kone <vi...@mesosphere.io> wrote:
>
>> Can you paste some logs here too if you have?
>>
>> On Wed, May 16, 2018 at 5:53 PM, Chun-Hung Hsiao (JIRA) <j...@apache.org>
>> wrote:
>>
>> >
>> > [ https://issues.apache.org/jira/browse/MESOS-8927?page=
>> > com.atlassian.jira.plugin.system.issuetabpanels:comment-
>> > tabpanel=16478318#comment-16478318 ]
>> >
>> > Chun-Hung Hsiao commented on MESOS-8927:
>> > 
>> >
>> > I'd like to add some notes here. This problem is actually nontrivial,
>> > because AFAIK we don't have a reliable way to kill a container at any
>> state.
>> >
>> > > Default executor cannot kill tasks if `LAUNCH_NESTED_CONTAINER` is
>> stuck.
>> > > 
>> > -
>> > >
>> > > Key: MESOS-8927
>> > >     URL: https://issues.apache.org/jira/browse/MESOS-8927
>> > > Project: Mesos
>> > >  Issue Type: Bug
>> > >  Components: executor
>> > >Affects Versions: 1.5.1, 1.6.0
>> > >Reporter: Chun-Hung Hsiao
>> > >Priority: Critical
>> > >  Labels: default-executor, mesosphere
>> > >
>> > > In the default executor, if the {{LAUNCH_NESTED_CONTAINER}} call never
>> > returns, {{container->launched}} won't be set, so a follow-up {{KILL}}
>> > event will be ignored:
>> > >  [https://github.com/apache/mesos/blob/40b40d9b73221388e583fc140280f1eb2b48b832/src/launcher/default_executor.cpp#L1091]
>> > > This could lead to tasks stuck in {{TASK_STARTING}}.
>> >
>> >
>> >
>> > --
>> > This message was sent by Atlassian JIRA
>> > (v7.6.3#76005)
>> >
>>
>
>


Re: [jira] [Commented] (MESOS-8927) Default executor cannot kill tasks if `LAUNCH_NESTED_CONTAINER` is stuck.

2018-05-16 Thread Chun-Hung Hsiao
Unfortunately I don't have the log right now. IIRC the executor received
the `KILL` event because the log I saw contained this line:
https://github.com/apache/mesos/blob/7e11a2d39cc642944897d2480105dbfd860fa601/src/launcher/default_executor.cpp#L1236
But it didn't contain this line:

On Wed, May 16, 2018 at 6:18 PM, Vinod Kone <vi...@mesosphere.io> wrote:

> Can you paste some logs here too if you have?
>
> On Wed, May 16, 2018 at 5:53 PM, Chun-Hung Hsiao (JIRA) <j...@apache.org>
> wrote:
>
> >
> > [ https://issues.apache.org/jira/browse/MESOS-8927?page=
> > com.atlassian.jira.plugin.system.issuetabpanels:comment-
> > tabpanel=16478318#comment-16478318 ]
> >
> > Chun-Hung Hsiao commented on MESOS-8927:
> > 
> >
> > I'd like to add some notes here. This problem is actually nontrivial,
> > because AFAIK we don't have a reliable way to kill a container at any
> state.
> >
> > > Default executor cannot kill tasks if `LAUNCH_NESTED_CONTAINER` is
> stuck.
> > > 
> > -
> > >
> > > Key: MESOS-8927
> > > URL: https://issues.apache.org/jira/browse/MESOS-8927
> > >     Project: Mesos
> > >  Issue Type: Bug
> > >  Components: executor
> > >Affects Versions: 1.5.1, 1.6.0
> > >Reporter: Chun-Hung Hsiao
> > >Priority: Critical
> > >  Labels: default-executor, mesosphere
> > >
> > > In the default executor, if the {{LAUNCH_NESTED_CONTAINER}} call never
> > returns, {{container->launched}} won't be set, so a follow-up {{KILL}}
> > event will be ignored:
> > >  [https://github.com/apache/mesos/blob/40b40d9b73221388e583fc140280f1eb2b48b832/src/launcher/default_executor.cpp#L1091]
> > > This could lead to tasks stuck in {{TASK_STARTING}}.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v7.6.3#76005)
> >
>


Re: [jira] [Commented] (MESOS-8927) Default executor cannot kill tasks if `LAUNCH_NESTED_CONTAINER` is stuck.

2018-05-16 Thread Chun-Hung Hsiao
Unfortunately I don't have any log. IIRC the executor received the
`KILL` event because this is printed:

On Wed, May 16, 2018 at 6:18 PM, Vinod Kone <vi...@mesosphere.io> wrote:

> Can you paste some logs here too if you have?
>
> On Wed, May 16, 2018 at 5:53 PM, Chun-Hung Hsiao (JIRA) <j...@apache.org>
> wrote:
>
> >
> > [ https://issues.apache.org/jira/browse/MESOS-8927?page=
> > com.atlassian.jira.plugin.system.issuetabpanels:comment-
> > tabpanel=16478318#comment-16478318 ]
> >
> > Chun-Hung Hsiao commented on MESOS-8927:
> > 
> >
> > I'd like to add some notes here. This problem is actually nontrivial,
> > because AFAIK we don't have a reliable way to kill a container at any
> state.
> >
> > > Default executor cannot kill tasks if `LAUNCH_NESTED_CONTAINER` is
> stuck.
> > > 
> > -
> > >
> > > Key: MESOS-8927
> > > URL: https://issues.apache.org/jira/browse/MESOS-8927
> > >     Project: Mesos
> > >  Issue Type: Bug
> > >  Components: executor
> > >Affects Versions: 1.5.1, 1.6.0
> > >Reporter: Chun-Hung Hsiao
> > >Priority: Critical
> > >  Labels: default-executor, mesosphere
> > >
> > > In the default executor, if the {{LAUNCH_NESTED_CONTAINER}} call never
> > returns, {{container->launched}} won't be set, so a follow-up {{KILL}}
> > event will be ignored:
> > >  [https://github.com/apache/mesos/blob/40b40d9b73221388e583fc140280f1eb2b48b832/src/launcher/default_executor.cpp#L1091]
> > > This could lead to tasks stuck in {{TASK_STARTING}}.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v7.6.3#76005)
> >
>


Re: [VOTE] Release Apache Mesos 1.6.0 (rc1)

2018-05-10 Thread Chun-Hung Hsiao
+1 (binding)

Tested on our internal CI (sudo make check) on Mac, CentOS 6/7, Debian 8/9
and Ubuntu 14/16/17, with gRPC/SSL disabled/enabled.
Also manually tested "make distcheck" w/ autotools, and "ninja check" w/
CMake on Mac and CentOS 7 with gRPC enabled.

Observed the following failures:
https://issues.apache.org/jira/browse/MESOS-8884
https://issues.apache.org/jira/browse/MESOS-8875

The first one is a flaky test, and the second one is related to
MESOS-2407, which is a known problem.

On Wed, May 9, 2018 at 11:00 AM, Vinod Kone  wrote:

> +1 (binding)
>
> Ran it on ASF CI. The only failures observed were known flaky command check
> tests.
>
> *Revision*: c7df5eadc075adcf525ea091f65786aaffb9b072
>
>- refs/tags/1.6.0-rc1
>
> Configuration Matrix (result per compiler):
>
> OS            Configuration                             Build tool  gcc      clang
> centos:7      --verbose --enable-libevent --enable-ssl  autotools   Failed   Not run
> centos:7      --verbose --enable-libevent --enable-ssl  cmake       Success  Not run
> centos:7      --verbose                                 autotools   Failed   Not run
> centos:7      --verbose                                 cmake       Success  Not run
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools   Failed   Success
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  cmake       Success  Success
> ubuntu:14.04  --verbose                                 autotools   Success  Success
> ubuntu:14.04  --verbose                                 cmake       Success  Success
>
>
> On Mon, May 7, 2018 at 8:48 PM, Greg Mann  wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.6.0.
> >
> >
> > 1.6.0 includes the following:
> > 
> > 
> > * Resizing of persistent volumes for agent default resources
> > * Offer operation feedback for resource provider 

Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Chun-Hung Hsiao
If we do option 1, then there will be no agent crash, since the master won't
send any unknown operation to an old agent, so option 2 is not strictly
necessary.

On Mon, Apr 16, 2018 at 2:12 PM, Silas Snider <swsni...@apple.com> wrote:

> I think we should definitely do option 2 regardless of whether we do
> option 1 as well, since although in this case it will still crash 1.5.0, at
> least in the future we won't have to have this worry again.
>
> On 4/16/18, 2:04 PM, "Chun-Hung Hsiao" <chhs...@apache.org> wrote:
>
> Hi all,
>
> As some might have already known, we are currently working on patches
> to
> implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].
>
> One problem surfaces is that, since the new operations are not
> supported in
> Mesos 1.5, they will lead to an agent crash during the operation
> application
> cycle if a Mesos 1.6 master send these operations to a Mesos 1.5 agent
> [2].
>
> We are now consider two possibilities to address this compatibility
> problem:
>
> 1) The Mesos 1.6 master should check the agent's Mesos version in
> `Master::accept` [3]. Moving forward, if we add new operations in
> future
> Mesos
> releases, we would have code like the following:
>
> ```
> Version slaveVersion = ...; // Get the Mesos version of the slave of
> the
> offer.
> switch (operation.type()) {
>   ...
>   case SOME_NEW_OPERATION: {
> if (slaveVersion < minVersionForSomeNewOperation) {
>   ... // Drop the operation.
> }
> break;
>   }
>   ...
> }
> ```
>
> Pros and cons:
> + The new operation won't go into the operation application cycle
> since it
> is
>   rejected in the very beginning. This means no resource metadata is
> touched.
> - Explicit slave version checks at master side make the code look not
> very
> clean,
>   and we will need to update this list every time we add a new
> operation.
>
> 2) Treat this issue as an agent crash bug. The Mesos master would
> forward
> the operation to the agent, regardless of the agent's Mesos version.
> In the
> agent,
> we deploy and backport the following logic in `Slave::applyOperation`
> [4]:
>
> ```
> if (message.operation_info().type() == OPERATION_UNKNOWN) {
>   ... // Drop the operation and trigger a re-registration or send an
>   // `UpdateSlaveMessage` to force the master to update the total
> resource of
>   // the slave.
> }
> ```
>
> Pros and cons:
> + Easier to add new operations since no new logic needs to be added for
> backward
>   Compability.
> - Since the old agent won't know whether the new operations are
> speculative
> or not,
>   a re-registration or an `UpdateSlaveMessage` is required.
> - Mesos 1.5.0 agents will still have the bug and crash when a new
> master
> sends a
>   new operation to them.
>
> Since both options are viable and there seems to be no clear winner,
> we'd
> like to
> check with the community to see which convention is preferable. Please
> let
> us know
> what you think. Thanks!
>
> Best,
> Chun-Hung
>
>
> [1] https://issues.apache.org/jira/browse/MESOS-4965
> [2] https://github.com/apache/mesos/blob/1.5.x/src/common/protobuf_utils.cpp#L851
> [3] https://github.com/apache/mesos/blob/master/src/master/master.cpp#L3899
> [4] https://github.com/apache/mesos/blob/1.5.x/src/slave/slave.cpp#L4359
>
>
>
>


Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Chun-Hung Hsiao
Are you suggesting that for every new operation we'll introduce a new
capability?

On Mon, Apr 16, 2018 at 2:14 PM, Vinod Kone <vinodk...@apache.org> wrote:

> Crashing the agent is definitely not a viable option IMO.
>
> Why can't we use agent capabilities instead of agent version and reject
> such operations at master? This is one of the main reasons we introduced
> the concept of framework, master, agent capabilities.
>
> On Mon, Apr 16, 2018 at 2:04 PM, Chun-Hung Hsiao <chhs...@apache.org>
> wrote:
>
> > Hi all,
> >
> > As some might have already known, we are currently working on patches to
> > implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].
> >
> > One problem surfaces is that, since the new operations are not supported
> in
> > Mesos 1.5, they will lead to an agent crash during the operation
> > application
> > cycle if a Mesos 1.6 master send these operations to a Mesos 1.5 agent
> [2].
> >
> > We are now consider two possibilities to address this compatibility
> > problem:
> >
> > 1) The Mesos 1.6 master should check the agent's Mesos version in
> > `Master::accept` [3]. Moving forward, if we add new operations in future
> > Mesos
> > releases, we would have code like the following:
> >
> > ```
> > Version slaveVersion = ...; // Get the Mesos version of the slave of the
> > offer.
> > switch (operation.type()) {
> >   ...
> >   case SOME_NEW_OPERATION: {
> > if (slaveVersion < minVersionForSomeNewOperation) {
> >   ... // Drop the operation.
> > }
> > break;
> >   }
> >   ...
> > }
> > ```
> >
> > Pros and cons:
> > + The new operation won't go into the operation application cycle since
> it
> > is
> >   rejected in the very beginning. This means no resource metadata is
> > touched.
> > - Explicit slave version checks at master side make the code look not
> very
> > clean,
> >   and we will need to update this list every time we add a new operation.
> >
> > 2) Treat this issue as an agent crash bug. The Mesos master would forward
> > the operation to the agent, regardless of the agent's Mesos version. In
> the
> > agent,
> > we deploy and backport the following logic in `Slave::applyOperation`
> [4]:
> >
> > ```
> > if (message.operation_info().type() == OPERATION_UNKNOWN) {
> >   ... // Drop the operation and trigger a re-registration or send an
> >   // `UpdateSlaveMessage` to force the master to update the total
> > resource of
> >   // the slave.
> > }
> > ```
> >
> > Pros and cons:
> > + Easier to add new operations since no new logic needs to be added for
> > backward
> >   Compability.
> > - Since the old agent won't know whether the new operations are
> speculative
> > or not,
> >   a re-registration or an `UpdateSlaveMessage` is required.
> > - Mesos 1.5.0 agents will still have the bug and crash when a new master
> > sends a
> >   new operation to them.
> >
> > Since both options are viable and there seems to be no clear winner, we'd
> > like to
> > check with the community to see which convention is preferable. Please
> let
> > us know
> > what you think. Thanks!
> >
> > Best,
> > Chun-Hung
> >
> >
> > [1] https://issues.apache.org/jira/browse/MESOS-4965
> > [2] https://github.com/apache/mesos/blob/1.5.x/src/common/protobuf_utils.cpp#L851
> > [3] https://github.com/apache/mesos/blob/master/src/master/master.cpp#L3899
> > [4] https://github.com/apache/mesos/blob/1.5.x/src/slave/slave.cpp#L4359
> >
>


Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Chun-Hung Hsiao
Hi all,

As some of you may already know, we are currently working on patches to
implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].

One problem that has surfaced is that, since the new operations are not
supported in Mesos 1.5, they will lead to an agent crash during the operation
application cycle if a Mesos 1.6 master sends these operations to a Mesos 1.5
agent [2].

We are now considering two possibilities to address this compatibility problem:

1) The Mesos 1.6 master should check the agent's Mesos version in
`Master::accept` [3]. Moving forward, if we add new operations in future Mesos
releases, we would have code like the following:

```
// Get the Mesos version of the slave of the offer.
Version slaveVersion = ...;
switch (operation.type()) {
  ...
  case SOME_NEW_OPERATION: {
    if (slaveVersion < minVersionForSomeNewOperation) {
      ... // Drop the operation.
    }
    break;
  }
  ...
}
```

Pros and cons:
+ The new operation won't enter the operation application cycle since it is
  rejected at the very beginning. This means no resource metadata is touched.
- Explicit slave version checks on the master side make the code less clean,
  and we will need to update this list every time we add a new operation.
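
To make option 1 a bit more concrete, here is a minimal self-contained
sketch of a table-driven version gate. Everything in it is an assumption for
illustration: `AgentVersion`, `OperationType`, `minimumAgentVersions` and
`agentSupports` are hypothetical names, not the stout `Version` class or any
existing code in `Master::accept`.

```cpp
#include <map>
#include <tuple>

// Hypothetical stand-in for a parsed agent version (not stout's `Version`).
struct AgentVersion {
  int majorNum;
  int minorNum;
  int patchNum;
};

bool operator<(const AgentVersion& a, const AgentVersion& b)
{
  return std::tie(a.majorNum, a.minorNum, a.patchNum) <
         std::tie(b.majorNum, b.minorNum, b.patchNum);
}

// Hypothetical subset of operation types, for illustration only.
enum class OperationType { RESERVE, CREATE, GROW_VOLUME, SHRINK_VOLUME };

// Minimum agent version for each operation added after Mesos 1.5.
// Operations missing from the table are assumed to work on every agent.
const std::map<OperationType, AgentVersion>& minimumAgentVersions()
{
  static const std::map<OperationType, AgentVersion> versions = {
    {OperationType::GROW_VOLUME, AgentVersion{1, 6, 0}},
    {OperationType::SHRINK_VOLUME, AgentVersion{1, 6, 0}},
  };
  return versions;
}

// Returns true if the master may forward `type` to an agent at `version`.
bool agentSupports(OperationType type, const AgentVersion& version)
{
  const auto it = minimumAgentVersions().find(type);
  if (it == minimumAgentVersions().end()) {
    return true;
  }
  return !(version < it->second);
}
```

Keeping the minimum versions in one table would not remove the need to
update the list for each new operation, but it would confine that update to
a single place instead of a growing `switch`.
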

2) Treat this issue as an agent crash bug. The Mesos master would forward
the operation to the agent, regardless of the agent's Mesos version. In the
agent, we deploy and backport the following logic in `Slave::applyOperation` [4]:

```
if (message.operation_info().type() == OPERATION_UNKNOWN) {
  ... // Drop the operation and trigger a re-registration or send an
      // `UpdateSlaveMessage` to force the master to update the total
      // resources of the slave.
}
```

Pros and cons:
+ Easier to add new operations since no new logic needs to be added for
  backward compatibility.
- Since the old agent won't know whether the new operations are speculative
  or not, a re-registration or an `UpdateSlaveMessage` is required.
- Mesos 1.5.0 agents will still have the bug and crash when a new master
  sends a new operation to them.
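
For comparison, option 2 could look roughly like the sketch below. It is a
simplified model, not the actual `Slave::applyOperation` code: `decode`,
`ApplyResult` and the numeric wire values are hypothetical, and the sketch
simply assumes that an operation type the agent was not built with shows up
as `UNKNOWN`.

```cpp
// Hypothetical operation types known to an older agent.
enum class OperationType { UNKNOWN = 0, RESERVE = 1, CREATE = 2 };

// Maps a raw wire value to an operation type this agent knows about.
// Any value the agent was not built with collapses to UNKNOWN.
OperationType decode(int wireValue)
{
  switch (wireValue) {
    case 1: return OperationType::RESERVE;
    case 2: return OperationType::CREATE;
    default: return OperationType::UNKNOWN;
  }
}

struct ApplyResult {
  bool applied;         // True if the agent changed its resources.
  bool resyncRequired;  // True if the master must refresh the agent's totals.
};

ApplyResult applyOperation(int wireValue)
{
  const OperationType type = decode(wireValue);

  if (type == OperationType::UNKNOWN) {
    // Drop the operation instead of crashing. The agent cannot tell whether
    // the master applied the operation speculatively, so it asks for a
    // resync of its total resources.
    return ApplyResult{false, true};
  }

  // ... apply the known operation here ...
  return ApplyResult{true, false};
}
```

The price of this approach is visible in `resyncRequired`: every dropped
operation costs a re-registration or an `UpdateSlaveMessage`.
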

Since both options are viable and there seems to be no clear winner, we'd
like to check with the community to see which convention is preferable. Please
let us know what you think. Thanks!

Best,
Chun-Hung


[1] https://issues.apache.org/jira/browse/MESOS-4965
[2] https://github.com/apache/mesos/blob/1.5.x/src/common/protobuf_utils.cpp#L851
[3] https://github.com/apache/mesos/blob/master/src/master/master.cpp#L3899
[4] https://github.com/apache/mesos/blob/1.5.x/src/slave/slave.cpp#L4359


Re: API Review: Resize (persistent) volume support

2018-03-19 Thread Chun-Hung Hsiao
From the perspective of resource allocation, GROW takes two resources and
merges them into one, while SHRINK takes one resource and splits it into two.
So, having two separate calls makes it explicit to the framework which
resources are being consumed.
Jie also mentioned in the comment
https://reviews.apache.org/r/66049/#comment279663 that specifying two
resources instead of one in GROW would make the validation clear.

I don't think allowing the operation to be applied more than once is a good
idea, so I'm thinking about the following validation:
1. The master checks that its resources contain the consumed resource(s).
2. The master forwards the operation to the agent.
3. The agent checks that its resources contain the consumed resource(s).
4. The agent applies the operation to update its resources, and returns a
resource conversion.
5. The master receives the resource conversion and applies it to update its
resources.
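
The steps above can be sketched as follows to show why a replayed GROW is
rejected: once the consumed resources are gone, the containment check fails.
This is only a toy model under my own assumptions; `Resource`, `Conversion`,
`contains`, `apply` and `growVolume` are hypothetical names and much simpler
than the real Mesos types.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified resource: an identifier plus a size in MB.
struct Resource {
  std::string id;
  long sizeMB;
};

bool operator==(const Resource& a, const Resource& b)
{
  return a.id == b.id && a.sizeMB == b.sizeMB;
}

// What an operation consumes and what it produces.
struct Conversion {
  std::vector<Resource> consumed;
  std::vector<Resource> converted;
};

// Returns true if every resource in `wanted` can be found in `pool`,
// matching each pool entry at most once.
bool contains(const std::vector<Resource>& pool,
              const std::vector<Resource>& wanted)
{
  std::vector<Resource> remaining = pool;
  for (const Resource& r : wanted) {
    auto it = std::find(remaining.begin(), remaining.end(), r);
    if (it == remaining.end()) return false;
    remaining.erase(it);
  }
  return true;
}

// Removes the consumed resources and adds the converted ones.
// Returns false and leaves `pool` untouched if validation fails.
bool apply(std::vector<Resource>& pool, const Conversion& c)
{
  if (!contains(pool, c.consumed)) return false;
  for (const Resource& r : c.consumed) {
    auto it = std::find(pool.begin(), pool.end(), r);
    assert(it != pool.end());
    pool.erase(it);
  }
  pool.insert(pool.end(), c.converted.begin(), c.converted.end());
  return true;
}

// Steps 1-5: the master validates, the agent validates and applies, then
// the master applies the conversion reported back by the agent.
bool growVolume(std::vector<Resource>& master,
                std::vector<Resource>& agent,
                const Resource& volume,
                const Resource& addition)
{
  const Conversion c{
      {volume, addition},
      {Resource{volume.id, volume.sizeMB + addition.sizeMB}}};

  if (!contains(master, c.consumed)) return false;  // Step 1.
  if (!apply(agent, c)) return false;               // Steps 2-4.
  return apply(master, c);                          // Step 5.
}
```

Calling `growVolume` a second time with the same arguments returns false,
because the original volume and the addition no longer exist in either view.
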

On Sun, Mar 18, 2018 at 8:16 PM, James Peach  wrote:

>
>
> > On Mar 16, 2018, at 11:12 AM, Zhitao Li  wrote:
> >
> > Hi everyone,
> >
> > Chun, Greg, Gastón and I are working on supporting resizing of persistent
> > volume[1]. See [2] for the design doc in length.
> >
> > The proposed new offer operation and corresponding operator API are in
> > following two patches:
> >
> > https://reviews.apache.org/r/66049/
> > https://reviews.apache.org/r/66052
> >
> > Our intention is to eventually support resizing of not only persistent
> > volumes, but also CSI volumes[3] introduced after Mesos 1.5 in the same
> set
> > of API, so we are declaring the API as experimental in its first release
> > version.
> >
> > We also want to make sure the API is reasonable to use to framework
> authors
> > and operators.
>
> Why do you have separate GROW/SHRINK operations? Could a RESIZE operation
> with a target size work?
>
> In all of these cases, is it possible for the operation to be applied more
> than once? Clearly, replaying a SHRINK would be bad. Applying RESIZE
> operations out of order would also be bad, but not in the same way.
>
> What is the response to this request?
>
> > Considering the above, both APIs need to include the original volume as
> > resource. Some alternatives on extra fields:
> > 1) size difference in Resource format: this may not be applicable in CSI
> > volume;
> > 2) size difference in Scalar value: this can be applicable in both CSI
> and
> > persistent volume case, since there is always a quantitive difference. We
> > can add extra CSI only fields once the spec is defined;
> > 3) target volume in `Resource` format: this may not be possible for any
> CSI
> > volume because the implementation could change certain metadata, so we
> did
> > not take this approach.
> >
> > Therefore, we are taking option 2) in current patches.
> >
> > Please let me know what you think. Thanks.
> >
> > [1] https://issues.apache.org/jira/browse/MESOS-4965
> > [2] https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#
> > [3] https://github.com/apache/mesos/blob/master/docs/csi.md
> >
> > --
> > Cheers,
> >
> > Zhitao Li
>
>


Re: API Review: Resize (persistent) volume support

2018-03-16 Thread Chun-Hung Hsiao
Thanks Zhitao for the summary. My thoughts are:

For `SHRINK_VOLUME`, I feel option 2 is appropriate, as it lets the
component that actually applies the operation decide what the resulting
free disk space becomes. Option 3 is also acceptable.

For `GROW_VOLUME`, I actually prefer option 1, and I think it can
handle more cases, including CSI volumes. To be more concrete, here is a
prototype I would suggest:
```
message GrowVolume {
  Resource volume = 1;
  Resource addition = 2;
}
```
Potentially, we may let a framework grow `volume` with either an
existing `PATH` volume or a `RAW` storage pool. Neither option 2 nor 3 can
provide such functionality, because neither can specify where the extra space
comes from.
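
Here is a small sketch of the ambiguity I have in mind. It is purely
illustrative: `Disk`, `SourceType`, `candidateSources` and `grow` are
hypothetical names, and the sizes are made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified disk resource; not the Mesos `Resource` message.
enum class SourceType { PATH, RAW };

struct Disk {
  std::string id;
  SourceType type;
  long sizeMB;
};

// Scalar-style request: only a size difference is given. Every free disk
// that is large enough is a possible source, so the result may be ambiguous.
std::vector<Disk> candidateSources(const std::vector<Disk>& freeDisks,
                                   long deltaMB)
{
  std::vector<Disk> candidates;
  for (const Disk& disk : freeDisks) {
    if (disk.sizeMB >= deltaMB) {
      candidates.push_back(disk);
    }
  }
  return candidates;
}

// Resource-style request: the caller names the addition, so the source of
// the extra space is explicit. Returns the grown volume.
Disk grow(const Disk& volume, const Disk& addition)
{
  assert(addition.sizeMB > 0);
  return Disk{volume.id, volume.type, volume.sizeMB + addition.sizeMB};
}
```

With one free `PATH` disk and one `RAW` pool that are both large enough, a
bare size difference leaves two candidates, whereas passing the addition as
a resource picks exactly one of them.
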

That said, I'm not sure if this is a valid concern since we don't have such
CSI functions yet. So input from folks would be very welcome!

On Mar 16, 2018 11:12 AM, "Zhitao Li"  wrote:

Hi everyone,

Chun, Greg, Gastón and I are working on supporting resizing of persistent
volume[1]. See [2] for the design doc in length.

The proposed new offer operation and corresponding operator API are in
 following two patches:

https://reviews.apache.org/r/66049/
https://reviews.apache.org/r/66052

Our intention is to eventually support resizing of not only persistent
volumes, but also CSI volumes[3] introduced after Mesos 1.5 in the same set
of API, so we are declaring the API as experimental in its first release
version.

We also want to make sure the API is reasonable to use to framework authors
and operators.

Considering the above, both APIs need to include the original volume as a
resource. Some alternatives for the extra fields:
1) size difference in `Resource` format: this may not be applicable to CSI
volumes;
2) size difference as a `Scalar` value: this is applicable to both CSI and
persistent volumes, since there is always a quantitative difference. We
can add extra CSI-only fields once the spec is defined;
3) target volume in `Resource` format: this may not be possible for CSI
volumes because the implementation could change certain metadata, so we did
not take this approach.

Therefore, we are taking option 2) in the current patches.

Please let me know what you think. Thanks.

[1] https://issues.apache.org/jira/browse/MESOS-4965
[2] https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#
[3] https://github.com/apache/mesos/blob/master/docs/csi.md

--
Cheers,

Zhitao Li


Re: Welcome Zhitao Li as Mesos Committer and PMC Member

2018-03-12 Thread Chun-Hung Hsiao
Congrats Zhitao!

On Mon, Mar 12, 2018 at 2:51 PM, Benjamin Mahler  wrote:

> Welcome Zhitao! Thanks for your contributions so far
>
> On Mon, Mar 12, 2018 at 2:02 PM, Gilbert Song  wrote:
>
> > Hi,
> >
> > I am excited to announce that the PMC has voted Zhitao Li as a new
> > committer and member of PMC for the Apache Mesos project. Please join me
> > to congratulate Zhitao!
> >
> > Zhitao has been an active contributor to Mesos for one and a half years.
> > His main contributions include:
> >
> >- Designed and implemented Container Image Garbage Collection
> >(MESOS-4945);
> >- Designed and implemented part of the HTTP Operator API (MESOS-6007);
> >- Reported and fixed a lot of bugs.
> >
> > Zhitao spares no effort to improve the project quality and to propose
> > ideas. Thank you Zhitao for all contributions!
> >
> > Here is his committer candidate checklist for your perusal:
> > https://docs.google.com/document/d/1HGz7iBdo1Q9z9c8fNRgNNLnj0XQ_PhDhjXLAfOx139s/
> >
> > Congrats Zhitao!
> >
> > Cheers,
> > Gilbert
> >
>


Re: Welcome Chun-Hung Hsiao as Mesos Committer and PMC Member

2018-03-12 Thread Chun-Hung Hsiao
Thanks guys! I'm honored to join the PMC.
Looking forward to more collaboration with the community!

On Mon, Mar 12, 2018 at 10:46 AM, Alex Evonosky <alex.evono...@gmail.com>
wrote:

> congratulations!
>
> On Mon, Mar 12, 2018 at 1:16 PM, Meng Zhu <m...@mesosphere.com> wrote:
>
>> Congrats Chun! Well deserved!
>>
>> On Mon, Mar 12, 2018 at 10:09 AM, Zhitao Li <zhitaoli...@gmail.com>
>> wrote:
>>
>>> Congrats, Chun!
>>>
>>> On Sun, Mar 11, 2018 at 11:47 PM, Gilbert Song <gilb...@mesosphere.io>
>>> wrote:
>>>
>>> > Congrats, Chun!
>>> >
>>> > It is great to have you in the community!
>>> >
>>> > - Gilbert
>>> >
>>> > On Sun, Mar 11, 2018 at 4:40 PM, Andrew Schwartzmeyer <
>>> > and...@schwartzmeyer.com> wrote:
>>> >
>>> > > Congratulations Chun!
>>> > >
>>> > > I apologize for not also giving you a +1, as I certainly would have,
>>> > > but just discovered my mailing list isn't working. Just a heads up,
>>> > > don't let that happen to you too!
>>> > >
>>> > > I look forward to continuing to work with you.
>>> > >
>>> > > Cheers,
>>> > >
>>> > > Andy
>>> > >
>>> > >
>>> > > On 03/10/2018 9:14 pm, Jie Yu wrote:
>>> > >
>>> > >> Hi,
>>> > >>
>>> > >> I am happy to announce that the PMC has voted Chun-Hung Hsiao as a
>>> > >> new committer and member of PMC for the Apache Mesos project. Please
>>> > >> join me to congratulate him!
>>> > >>
>>> > >> Chun has been an active contributor for the past year. His main
>>> > >> contributions to the project include:
>>> > >> * Designed and implemented gRPC client support to libprocess
>>> > >> (MESOS-7749)
>>> > >> * Designed and implemented Storage Local Resource Provider
>>> > >> (MESOS-7235, MESOS-8374)
>>> > >> * Implemented part of the CSI support (MESOS-7235, MESOS-8374)
>>> > >>
>>> > >> Chun is friendly and humble, but also intelligent, insightful, and
>>> > >> opinionated. I am confident that he will be a great addition to our
>>> > >> committer pool. Thanks Chun for all your contributions to the
>>> > >> project so far!
>>> > >>
>>> > >> His committer checklist can be found here:
>>> > >> https://docs.google.com/document/d/1FjroAvjGa5NdP29zM7-2eg6tLPAzQRMUmCorytdEI_U/edit?usp=sharing
>>> > >>
>>> > >> - Jie
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>> Cheers,
>>>
>>> Zhitao Li
>>>
>>
>>
>


Re: Tasks may be explicitly dropped by agent in Mesos 1.5

2018-03-02 Thread Chun-Hung Hsiao
Gilbert, I think you're right. The code path doesn't exist in 1.5.0.

On Mar 2, 2018 9:36 AM, "Chun-Hung Hsiao" <chhs...@mesosphere.io> wrote:

> This is new behavior introduced by the fix for MESOS-1720, and thus a new
> problem only in 1.5.x. Prior to 1.5, reordered tasks (to the same executor)
> would still be launched, because whichever came first would launch the
> executor. Since 1.5, one might be dropped.
>
> On Mar 1, 2018 4:36 PM, "Gilbert Song" <gilb...@mesosphere.io> wrote:
>
>> Meng,
>>
>> Could you double check if this is really an issue in Mesos 1.5.0 release?
>>
>> MESOS-1720 <https://issues.apache.org/jira/browse/MESOS-1720> was resolved
>> after the 1.5 release (rc-2) and it seems like it is only on the master
>> branch and the 1.5.x branch (not 1.5.0).
>>
>> Did I miss anything?
>>
>> - Gilbert
>>
>> On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler <bmah...@apache.org>
>> wrote:
>>
>> > Put another way, we currently don't guarantee in-order task delivery to
>> > the executor. Due to the changes for MESOS-1720, one special case of
>> > task re-ordering now leads to the re-ordered task being dropped (rather
>> > than delivered out-of-order as before). Technically, this is strictly
>> > better.
>> >
>> > However, we'd like to start guaranteeing in-order task delivery.
>> >
>> > On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu <m...@mesosphere.com> wrote:
>> >
>> >> Hi all:
>> >>
>> >> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
>> >> if all three conditions are met:
>> >> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
>> >>  use the same executor.
>> >> (2) The executor currently does not exist on the agent.
>> >> (3) Due to some race conditions, these tasks are trying to launch
>> >> on the agent in a different order from their original launch order.
>> >>
>> >> In this case, tasks that are trying to launch on the agent
>> >> before the first task in the original order will be explicitly dropped
>> >> by the agent (`TASK_DROPPED` or `TASK_LOST` will be sent).
>> >>
>> >> This bug will be fixed in 1.5.1. It is tracked in
>> >> https://issues.apache.org/jira/browse/MESOS-8624
>> >>
>> >> 
>> >>
>> >> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
>> >> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
>> >> calls to a new executor. The master would specify that the first call
>> >> is the one to launch a new executor through the `launch_executor` field
>> >> in `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
>> >> use the existing executor launched by the first one.
>> >>
>> >> On the agent side, running a task/task group goes through a series of
>> >> continuations. One is `collect()` on the futures that unschedule
>> >> frameworks from being GC'ed:
>> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
>> >> Another is `collect()` on task authorization:
>> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
>> >> Since these `collect()` calls run on individual actors, the futures of
>> >> the `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
>> >> out-of-order, even if the futures these two `collect()` calls wait for
>> >> are satisfied in order (which is true in these two cases).
>> >>
>> >> As a result, under some race conditions (probably under heavy load),
>> >> tasks that rely on a previous task to launch the executor may
>> >> get processed before the task that is supposed to launch the executor
>> >> first, resulting in the tasks being explicitly dropped by the agent.
>> >>
>> >> -Meng
>> >>
>> >>
>> >>
>> >
>>
>


Re: Tasks may be explicitly dropped by agent in Mesos 1.5

2018-03-02 Thread Chun-Hung Hsiao
This is new behavior introduced by the fix for MESOS-1720, and thus a new
problem only in 1.5.x. Prior to 1.5, reordered tasks (to the same executor)
would still be launched, because whichever came first would launch the
executor. Since 1.5, one might be dropped.

On Mar 1, 2018 4:36 PM, "Gilbert Song"  wrote:

> Meng,
>
> Could you double check if this is really an issue in Mesos 1.5.0 release?
>
> MESOS-1720 was resolved
> after the 1.5 release (rc-2) and it seems like
> it is only on the master branch and the 1.5.x branch (not 1.5.0).
>
> Did I miss anything?
>
> - Gilbert
>
> On Thu, Mar 1, 2018 at 4:22 PM, Benjamin Mahler 
> wrote:
>
> > Put another way, we currently don't guarantee in-order task delivery to
> > the executor. Due to the changes for MESOS-1720, one special case of task
> > re-ordering now leads to the re-ordered task being dropped (rather than
> > delivered out-of-order as before). Technically, this is strictly better.
> >
> > However, we'd like to start guaranteeing in-order task delivery.
> >
> > On Thu, Mar 1, 2018 at 2:56 PM, Meng Zhu  wrote:
> >
> >> Hi all:
> >>
> >> TLDR: In Mesos 1.5, tasks may be explicitly dropped by the agent
> >> if all three conditions are met:
> >> (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls
> >>  use the same executor.
> >> (2) The executor currently does not exist on the agent.
> >> (3) Due to some race conditions, these tasks are trying to launch
> >> on the agent in a different order from their original launch order.
> >>
> >> In this case, tasks that are trying to launch on the agent
> >> before the first task in the original order will be explicitly dropped
> >> by the agent (`TASK_DROPPED` or `TASK_LOST` will be sent).
> >>
> >> This bug will be fixed in 1.5.1. It is tracked in
> >> https://issues.apache.org/jira/browse/MESOS-8624
> >>
> >> 
> >>
> >> In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an
> >> ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
> >> calls to a new executor. The master would specify that the first call is
> >> the one to launch a new executor through the `launch_executor` field in
> >> `RunTaskMessage`/`RunTaskGroupMessage`, and the second one should
> >> use the existing executor launched by the first one.
> >>
> >> On the agent side, running a task/task group goes through a series of
> >> continuations. One is `collect()` on the futures that unschedule
> >> frameworks from being GC'ed:
> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
> >> Another is `collect()` on task authorization:
> >> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
> >> Since these `collect()` calls run on individual actors, the futures of
> >> the `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may return
> >> out-of-order, even if the futures these two `collect()` calls wait for
> >> are satisfied in order (which is true in these two cases).
> >>
> >> As a result, under some race conditions (probably under heavy load),
> >> tasks that rely on a previous task to launch the executor may
> >> get processed before the task that is supposed to launch the executor
> >> first, resulting in the tasks being explicitly dropped by the agent.
> >>
> >> -Meng
> >>
> >>
> >>
> >
>


Re: Collecting futures in the same actor in libprocess

2018-03-01 Thread Chun-Hung Hsiao
Some background for the bug AlexR and Meng found:

In https://issues.apache.org/jira/browse/MESOS-1720,
we introduced an ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
calls to a new executor.
The master would specify that the first call is the one to launch a new
executor through the `launch_executor` field in
`RunTaskMessage`/`RunTaskGroupMessage`,
and the second one should use the existing executor launched by the first
call.
On the agent side, it will drop any task that wants to launch an executor
that already exists,
or any task that wants to run on a non-existent executor.
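
The drop rule above can be sketched as a toy C++ model. `AgentModel`,
`RunTask` and `Outcome` are made-up names for illustration; only the
`launch_executor` flag comes from the messages discussed here, and the real
agent logic in `slave.cpp` is considerably more involved:

```cpp
#include <cassert>
#include <set>
#include <string>

enum class Outcome { LAUNCHED, DROPPED };

struct RunTask {
  std::string executorId;
  bool launchExecutor;  // models the `launch_executor` field
};

// Toy agent: remembers which executors exist and applies the drop rule.
class AgentModel {
public:
  Outcome run(const RunTask& task) {
    assert(!task.executorId.empty());
    const bool exists = executors_.count(task.executorId) > 0;

    // Asked to launch an executor that already exists: drop.
    if (task.launchExecutor && exists) return Outcome::DROPPED;

    // Asked to reuse an executor that does not exist yet: drop.
    if (!task.launchExecutor && !exists) return Outcome::DROPPED;

    if (task.launchExecutor) executors_.insert(task.executorId);
    return Outcome::LAUNCHED;
  }

private:
  std::set<std::string> executors_;
};
```

In this model, if the task with `launchExecutor == false` is processed before
the one with `launchExecutor == true`, it is dropped; in the original order
both are launched.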

Running a task/task group goes through a series of continuations.
One is `collect()` on the futures that unschedule frameworks from being
GC'ed:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
Another is `collect()` on task authorization:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
Since these `collect()` calls run on individual actors, the futures of the
`collect()` calls for
two `LAUNCH`/`LAUNCH_GROUP` calls may return out-of-order,
even if the futures these two `collect()` calls wait for are satisfied in
order (which is true).

The result is that, if this race condition is triggered,
the agent will try to run the second task/task group before the first one,
and since the executor is supposed to be launched by the first one,
the agent will end up sending `TASK_DROPPED` for the second call.

If we had an interface to make sure that `collect()` calls return in the
same order
as their dependent futures are satisfied, this could be avoided.
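
As a rough illustration of the guarantee being asked for (not a libprocess
patch), here is a standard-library-only C++ sketch. `InOrderReleaser`,
`enqueue()` and `complete()` are hypothetical names; the sketch is
single-threaded and ignores failed or discarded futures:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Releases continuations strictly in the order in which their tickets were
// issued, even when the underlying work completes out of order.
class InOrderReleaser {
public:
  // Reserve the next slot in the sequence and return its ticket.
  std::size_t enqueue() { return issued_++; }

  // Mark `ticket` as finished. Its continuation runs as soon as every
  // earlier ticket has been released.
  void complete(std::size_t ticket, std::function<void()> continuation) {
    assert(ticket >= next_ && ticket < issued_);
    ready_[ticket] = std::move(continuation);

    // Release the longest contiguous run of finished tickets.
    for (auto it = ready_.find(next_); it != ready_.end();
         it = ready_.find(next_)) {
      std::function<void()> run = std::move(it->second);
      ready_.erase(it);
      ++next_;
      run();
    }
  }

private:
  std::size_t issued_ = 0;  // number of tickets handed out
  std::size_t next_ = 0;    // next ticket allowed to run
  std::map<std::size_t, std::function<void()>> ready_;
};
```

If the second call completes first, its continuation is held back until the
first call completes, so the agent would see the two in their original order.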

On Mar 1, 2018 12:50 PM, "Benjamin Mahler" <bmah...@apache.org> wrote:

> Could you explain the problem in more detail?
>
> On Thu, Mar 1, 2018 at 12:15 PM Chun-Hung Hsiao <chhs...@mesosphere.io>
> wrote:
>
> > Hi all,
> >
> > Meng found a bug in `slave.cpp`, where the proper fix requires collecting
> > futures in order. Currently every `collect` call spawns its own actor,
> > so for two `collect` calls, even though their futures are satisfied in
> > order, they may finish out-of-order. So we need some libprocess changes
> > to have the ability to collect futures in the same actor. Here I have two
> > proposals:
> >
> > 1. Add a new `collect` interface that takes an actor as a parameter.
> >
> > 2. Introduce `process::Executor::collect()` for this.
> >
> > Any opinion on these two options?
> >
>


Collecting futures in the same actor in libprocess

2018-03-01 Thread Chun-Hung Hsiao
Hi all,

Meng found a bug in `slave.cpp`, where the proper fix requires collecting
futures in order. Currently every `collect` call spawns its own actor, so
for two `collect` calls, even though their futures are satisfied in order,
they may finish out-of-order. So we need some libprocess changes to have
the ability to collect futures in the same actor. Here I have two proposals:

1. Add a new `collect` interface that takes an actor as a parameter.

2. Introduce `process::Executor::collect()` for this.

Any opinion on these two options?


Re: API working group

2018-02-13 Thread Chun-Hung Hsiao
I'm in. In particular, I'd like to continue the work of integrating gRPC
into libprocess,
so we could have a gRPC-based API!


Re: [VOTE] Release Apache Mesos 1.5.0 (rc2)

2018-02-05 Thread Chun-Hung Hsiao
+1 (non-binding)

Tested with `make distcheck` with grpc disabled and enabled on mac.
Tested with `make distcheck DISTCHECK_CONFIGURE_FLAGS='--enable-grpc'` on
centos 7.

On Mon, Feb 5, 2018 at 8:33 PM, Vinod Kone  wrote:

> +1 (binding)
>
> Tested on ASF CI. The red builds were known flaky tests regarding
> checks/health checks.
>
> *Revision*: f7e3872b0359c6095f8eeaefe408cb7dcef5bb83
>
>- refs/tags/1.5.0-rc2
>
> Configuration Matrix (result per compiler):
>
> OS            Configuration                             Build      gcc      clang
> centos:7      --verbose --enable-libevent --enable-ssl  autotools  Failed   Not run
> centos:7      --verbose --enable-libevent --enable-ssl  cmake      Success  Not run
> centos:7      --verbose                                 autotools  Failed   Not run
> centos:7      --verbose                                 cmake      Success  Not run
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools  Success  Success
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  cmake      Success  Success
> ubuntu:14.04  --verbose                                 autotools  Success  Success
> ubuntu:14.04  --verbose                                 cmake      Success  Success
>
> On Sat, Feb 3, 2018 at 11:11 AM, Zhitao Li  wrote:
>
> > +1 (non-binding)
> >
> > Tested with running all tests on Debian/jessie server on AWS.
> >
> > On Fri, Feb 2, 2018 at 3:25 PM, Jie Yu  wrote:
> >
> >> +1
> >>
> >> Verified in our internal CI that `sudo make check` passed in CentOS 6,
> >> CentOS7, Debian 8, Ubuntu 14.04, Ubuntu 16.04 (both w/ or w/o SSL
> >> enabled).
> >>
> >>
> >> On Thu, Feb 1, 2018 at 5:36 PM, Gilbert Song 
> wrote:
> >>
> >> > Hi all,
> >> >
> >> > Please vote on releasing the following candidate as Apache Mesos
> >> > 1.5.0.
> >> >
> >> > 1.5.0 includes the following:
> >> > 
> >> > 

Re: [VOTE] Release Apache Mesos 1.5.0 (rc1)

2018-01-23 Thread Chun-Hung Hsiao
-1 for https://issues.apache.org/jira/browse/MESOS-8481

On Tue, Jan 23, 2018 at 9:38 AM, Jie Yu  wrote:

> +1
>
> Verified in our internal CI that `sudo make check` passed in CentOS 6,
> CentOS7, Debian 8, Ubuntu 14.04, Ubuntu 16.04 (both w/ or w/o SSL enabled).
>
> - Jie
>
> On Mon, Jan 22, 2018 at 9:17 PM, Sam  wrote:
>
> > +1
> >
> >
> > Regards,
> >
> >
> >
> > On Jan 23, 2018, at 11:15 AM, Gilbert Song  wrote:
> >
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.5.0.
> >
> > 1.5.0 includes the following:
> > 
> > 
> >   * Support Container Storage Interface (CSI).
> >   * Agent reconfiguration policy.
> >   * Auto GC docker images in Mesos Containerizer.
> >   * Standalone containers.
> >   * Support gRPC client.
> >   * Non-leading VOTING replica catch-up.
> >
> > The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.0-rc1
> > 
> > 
> >
> > The candidate for Mesos 1.5.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz
> >
> > The tag to be voted on is 1.5.0-rc1:
> > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.0-rc1
> >
> > The MD5 checksum of the tarball can be found at:
> > https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz.md5
> >
> > The signature of the tarball can be found at:
> > https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1221
> >
> > Please vote on releasing this package as Apache Mesos 1.5.0!
> >
> > The vote is open until Thu Jan 25 18:24:36 PST 2018 and passes if a
> > majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 1.5.0
> > [ ] -1 Do not release this package because ...
> >
> > Thanks,
> > Jie and Gilbert
> >
> >
>


Re: Adding process::Executor::execute()

2017-09-12 Thread Chun-Hung Hsiao
Thanks Ben. Just posted a patch: https://reviews.apache.org/r/62252/.

- Chun-Hung

On Mon, Sep 11, 2017 at 9:15 PM, Benjamin Hindman <b...@mesosphere.io>
wrote:

> Quick clarification: you'll have a single `process::Executor` and queue up
> all the rmdirs on that, correct? So you'll still tie up a worker thread,
> but only one of them.
>
> Either way it makes sense to add `process::Executor::execute()`. I'm
> happy to shepherd that for you Chun, send me a patch!
>
> On Mon, Sep 11, 2017 at 7:32 PM, Chun-Hung Hsiao <chhs...@mesosphere.io>
> wrote:
>
>> Hi,
>>
>> I'm thinking about extending `process::Executor` with a new `execute()`
>> interface.
>> The need for this new interface surfaced while I was working on
>> https://issues.apache.org/jira/browse/MESOS-7964
>> Summary:
>> 1. A disk GC might execute multiple `rmdirs` callbacks, and some of them
>> are heavy duty. We don't want to run them on `GarbageCollectorProcess`, so
>> that they don't block other events of the process.
>> Currently we do the following:
>> async(rmdirs).onAny(...);
>> 2. `async` puts each `rmdirs` callback in an actor. When there are many
>> heavy-duty `rmdirs` callbacks, the actors end up occupying all worker
>> threads and blocking other actors for minutes.
>>
>> Yan suggested that I use `process::Executor` so that:
>> 1. The `rmdirs` callbacks are not executed on `GarbageCollectorProcess`
>> 2. All `rmdirs` callbacks are executed on a single thread
>> Since the `Executor` class only contains a `defer()` function that
>> returns a `_Deferred` structure,
>> I'm doing the following:
>> executor.defer(rmdirs).operator std::function<Future()>()().onAny(...)
>>
>> Would it make sense to add another `execute()` function to directly
>> return a `Future`?
>>
>> - Chun-Hung
>>
>>
>
>
> --
> Benjamin Hindman
> Founder of Mesosphere and Co-Creator of Apache Mesos
> Mesosphere Inc.  <http://www.mesosphere.io/>
>
> Follow us on Twitter: @mesosphere <http://twitter.com/mesosphere>
>
>
>


Adding process::Executor::execute()

2017-09-11 Thread Chun-Hung Hsiao
Hi,

I'm thinking about extending `process::Executor` with a new `execute()`
interface.
The need for this new interface surfaced while I was working on
https://issues.apache.org/jira/browse/MESOS-7964
Summary:
1. A disk GC might execute multiple `rmdirs` callbacks, and some of them
are heavy duty. We don't want to run them on `GarbageCollectorProcess`, so
that they don't block other events of the process.
Currently we do the following:
async(rmdirs).onAny(...);
2. `async` puts each `rmdirs` callback in an actor. When there are many
heavy-duty `rmdirs` callbacks, the actors end up occupying all worker
threads and blocking other actors for minutes.

Yan suggested that I use `process::Executor` so that:
1. The `rmdirs` callbacks are not executed on `GarbageCollectorProcess`
2. All `rmdirs` callbacks are executed on a single thread
Since the `Executor` class only contains a `defer()` function that returns
a `_Deferred` structure,
I'm doing the following:
executor.defer(rmdirs).operator std::function<Future()>()().onAny(...)

Would it make sense to add another `execute()` function to directly return
a `Future`?
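
For comparison, here is a standard-library-only sketch of what an
`execute()` that returns a future directly could look like. `SerialExecutor`
is a hypothetical class, not libprocess code: it uses `std::future` instead
of `process::Future` and a dedicated worker thread instead of an actor, so
it only illustrates the shape of the interface:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Runs every submitted callable on one worker thread, in submission order.
class SerialExecutor {
public:
  SerialExecutor() : worker_([this] { loop(); }) {}

  ~SerialExecutor() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    ready_.notify_one();
    worker_.join();  // drains the remaining jobs before returning
  }

  SerialExecutor(const SerialExecutor&) = delete;
  SerialExecutor& operator=(const SerialExecutor&) = delete;

  // Queue `f` and hand back a future for its result.
  template <typename F>
  auto execute(F f) -> std::future<decltype(f())> {
    using R = decltype(f());
    auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
    std::future<R> result = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      assert(!stopping_);
      queue_.push([task] { (*task)(); });
    }
    ready_.notify_one();
    return result;
  }

private:
  void loop() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stopping and nothing left to run
        job = std::move(queue_.front());
        queue_.pop();
      }
      job();  // run outside the lock so callers can keep submitting
    }
  }

  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<std::function<void()>> queue_;
  bool stopping_ = false;
  std::thread worker_;  // declared last: starts after the members above exist
};
```

With this shape, heavy callbacks tie up exactly one thread, and the caller
gets a future back without going through a deferred-to-function conversion.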

- Chun-Hung


Deprecating `--disable-zlib` in libprocess

2017-08-08 Thread Chun-Hung Hsiao
Hi all,

In libprocess, we have an optional `--disable-zlib` flag, but it's
currently not used for conditional compilation: we always use zlib in
libprocess, and there's a requirement check in Mesos to make sure that zlib
exists.
Should this option be removed then?
Or is anyone working on a system without zlib?

Thanks for your opinions!
Chun-Hung