Re: Welcome Andrei Sekretenko as a new committer and PMC member!

2020-01-22 Thread Andrei Budnik
Congrats! Well deserved!

On Tue, Jan 21, 2020 at 10:42 PM Benjamin Mahler  wrote:

> Please join me in welcoming Andrei Sekretenko as the newest committer and
> PMC member!
>
> Andrei has been active in the project for almost a year at this point and
> has been a productive and collaborative member of the community.
>
> He has helped out a lot with allocator work, both with code and
> investigations of issues. He made improvements to multi-role framework
> scalability (which includes the addition of the UPDATE_FRAMEWORK call), and
> exposed metrics for per-role quota consumption.
>
> He has also investigated, identified, and followed up on important bugs.
> One such example is the message re-ordering issue he is currently working
> on: https://issues.apache.org/jira/browse/MESOS-10023
>
> Thanks for all your work so far Andrei, I'm looking forward to more of your
> contributions in the project.
>
> Ben
>


On supporting Docker `--live-restore` option for Mesos Docker executor

2019-10-02 Thread Andrei Budnik
Hi all,

Currently, the Mesos Docker executor treats a Docker task as TASK_FAILED when
the Docker daemon restarts. This causes problems for operators during cluster
maintenance. Starting with Docker 1.12, the daemon can be configured (via the
`--live-restore` option) so that containers remain running while the daemon
is unavailable.

We're proposing an improvement for the Mesos Docker executor to address
this problem.

The current design doc is:
https://docs.google.com/document/d/1JeLTr9L31S8eIg-6xpjedIUKvnfNake0kPTzxEwdUdI/


On adding a debug endpoint for Mesos containerizer

2019-06-04 Thread Andrei Budnik
Hi folks,

We have been encountering stuck containers for quite a long time.
Some of these issues are caused by external components such as CNI/CSI
plugins, custom Mesos modules, etc. There were also cases where a container
became stuck due to a Linux kernel bug. This variety of possible causes makes
stuck containers difficult to debug.

We are proposing a container debug endpoint for the Mesos agent [1], which
is based on a new mechanism for tracking pending libprocess futures [2].

Please review both of them.

[1] Container debug endpoint:
https://docs.google.com/document/d/1VtlKD6b8a22HzSdaJUeI7cPGuKd01vLwBJT4XfkeUDI
[2] Tracking libprocess futures:
https://docs.google.com/document/d/1Unu2pe0dRq3Z6XQ5S8lWZm2cU2REjfkUj0xk2ePQ0MY


Re: [VOTE] Release Apache Mesos 1.8.0 (rc2)

2019-04-23 Thread Andrei Budnik
+1

sudo make -j16 distcheck
DISTCHECK_CONFIGURE_FLAGS='--disable-libtool-wrappers
--disable-parallel-test-execution --enable-seccomp-isolator
--enable-launcher-sealing'
on Fedora 25

I gave +1, but some of the recently added tests are failing:
[  FAILED  ] VolumeGidManagerTest.ROOT_UNPRIVILEGED_USER_SlaveReboot
[  FAILED  ] CniIsolatorTest.VETH_VerifyResourceStatistics
[  FAILED  ] DockerVolumeIsolatorTest.ROOT_EmptyCheckpointFileSlaveRecovery


On Thu, Apr 18, 2019 at 3:00 PM Benno Evers  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.8.0.
>
>
> 1.8.0 includes the following:
>
> 
>  * Greatly reduced allocator cycle time.
>  * Operation feedback for v1 schedulers.
>  * Per-framework minimum allocatable resources.
>  * New CLI subcommands `task attach` and `task exec`.
>  * New `linux/seccomp` isolator.
>  * Support for Docker v2 Schema2 manifest format.
>  * XFS quota for persistent volumes.
>  * **Experimental** Support for the new CSI v1 API.
>
> In addition, 1.8.0-rc2 includes the following changes:
>
>
>  * Docker manifest v2s2 config with image GC.
>  * Expanded `highlights` section in the CHANGELOG.
>
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.8.0-rc2
>
> 
>
> The candidate for Mesos 1.8.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc2/mesos-1.8.0.tar.gz
>
> The tag to be voted on is 1.8.0-rc2:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.8.0-rc2
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc2/mesos-1.8.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc2/mesos-1.8.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1252
>
> Please vote on releasing this package as Apache Mesos 1.8.0!
>
> The vote is open until Wednesday, April 24th and passes if a majority of
> at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.8.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Benno and Joseph
>


Re: Welcome Benno Evers as committer and PMC member!

2019-01-31 Thread Andrei Budnik
Congratulations!

On Thu, Jan 31, 2019 at 2:41 AM Benjamin Mahler  wrote:

> Welcome Benno! Thanks for all the great contributions
>
> On Wed, Jan 30, 2019 at 6:21 PM Alex R  wrote:
>
> > Folks,
> >
> > Please welcome Benno Evers as an Apache committer and PMC member of
> > Apache Mesos!
> >
> > Benno has been active in the project for more than a year now and has made
> > significant contributions, including:
> >   * Agent reconfiguration, MESOS-1739
> >   * Memory profiling, MESOS-7944
> >   * "/state" performance improvements, MESOS-8345
> >
> > I have been working closely with Benno, paired up on, and shepherded some
> > of his work. Benno has very strong technical knowledge in several areas and
> > he is willing to share it with others and help his peers.
> >
> > Benno, thanks for all your contributions so far and looking forward to
> > continuing to work with you on the project!
> >
> > Alex.
> >
>


Re: [VOTE] Release Apache Mesos 1.5.2 (rc2)

2018-11-22 Thread Andrei Budnik
+1

On Thu, Nov 22, 2018 at 8:23 AM Jie Yu  wrote:

> +1
>
> > On Oct 31, 2018, at 4:26 PM, Gilbert Song  wrote:
> >
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.5.2.
> >
> > 1.5.2 includes the following:
> >
> 
> > *Announce major bug fixes here*
> >   * [MESOS-3790] - ZooKeeper connection should retry on `EAI_NONAME`.
> >   * [MESOS-8128] - Make os::pipe file descriptors O_CLOEXEC.
> >   * [MESOS-8418] - mesos-agent high cpu usage because of numerous /proc/mounts reads.
> >   * [MESOS-8545] - AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> >   * [MESOS-8568] - Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`.
> >   * [MESOS-8620] - Containers stuck in FETCHING possibly due to unresponsive server.
> >   * [MESOS-8830] - Agent gc on old slave sandboxes could empty persistent volume data.
> >   * [MESOS-8871] - Agent may fail to recover if the agent dies before image store cache checkpointed.
> >   * [MESOS-8904] - Master crash when removing quota.
> >   * [MESOS-8906] - `UriDiskProfileAdaptor` fails to update profile selectors.
> >   * [MESOS-8907] - Docker image fetcher fails with HTTP/2.
> >   * [MESOS-8917] - Agent leaking file descriptors into forked processes.
> >   * [MESOS-8921] - Autotools don't work with newer OpenJDK versions.
> >   * [MESOS-8935] - Quota limit "chopping" can lead to cpu-only and memory-only offers.
> >   * [MESOS-8936] - Implement a Random Sorter for offer allocations.
> >   * [MESOS-8942] - Master streaming API does not send (health) check updates for tasks.
> >   * [MESOS-8945] - Master check failure due to CHECK_SOME(providerId).
> >   * [MESOS-8947] - Improve the container preparing logging in IOSwitchboard and volume/secret isolator.
> >   * [MESOS-8952] - process::await/collect n^2 performance issue.
> >   * [MESOS-8963] - Executor crash trying to print container ID.
> >   * [MESOS-8978] - Command executor calling setsid breaks the tty support.
> >   * [MESOS-8980] - mesos-slave can deadlock with docker pull.
> >   * [MESOS-8986] - `slave.available()` in the allocator is expensive and drags down allocation performance.
> >   * [MESOS-8987] - Master asks agent to shutdown upon auth errors.
> >   * [MESOS-9024] - Mesos master segfaults with stack overflow under load.
> >   * [MESOS-9049] - Agent GC could unmount a dangling persistent volume multiple times.
> >   * [MESOS-9116] - Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.
> >   * [MESOS-9125] - Port mapper CNI plugin might fail with "Resource temporarily unavailable".
> >   * [MESOS-9127] - Port mapper CNI plugin might deadlock iptables on the agent.
> >   * [MESOS-9131] - Health checks launching nested containers while a container is being destroyed lead to unkillable tasks.
> >   * [MESOS-9142] - CNI detach might fail due to missing network config file.
> >   * [MESOS-9144] - Master authentication handling leads to request amplification.
> >   * [MESOS-9145] - Master has a fragile burned-in 5s authentication timeout.
> >   * [MESOS-9146] - Agent has a fragile burn-in 5s authentication timeout.
> >   * [MESOS-9147] - Agent and scheduler driver authentication retry backoff time could overflow.
> >   * [MESOS-9151] - Container stuck at ISOLATING due to FD leak.
> >   * [MESOS-9170] - Zookeeper doesn't compile with newer gcc due to format error.
> >   * [MESOS-9196] - Removing rootfs mounts may fail with EBUSY.
> >   * [MESOS-9231] - `docker inspect` may return an unexpected result to Docker executor due to a race condition.
> >   * [MESOS-9267] - Mesos agent crashes when CNI network is not configured but used.
> >   * [MESOS-9279] - Docker Containerizer 'usage' call might be expensive if mount table is big.
> >   * [MESOS-9283] - Docker containerizer actor can get backlogged with large number of containers.
> >   * [MESOS-9305] - Create cgroup recursively to workaround systemd deleting cgroups_root.
> >   * [MESOS-9308] - URI disk profile adaptor could deadlock.
> >   * [MESOS-9334] - Container stuck at ISOLATING state due to libevent poll never returns.
> >
> > The CHANGELOG for the release is available at:
> >
> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.2-rc2
> >
> > The candidate for Mesos 1.5.2 release is available at:
> >
> > https://dist.apache.org/repos/dist/dev/mesos/1.5.2-rc2/mesos-1.5.2.tar.gz
> >
> > The tag to be voted on is 1.5.2-rc2:
> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.2-rc2
> >
> > The SHA512 checksum of the tarball can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/mesos/1.5.2-rc2/mesos-1.5.2.tar.gz.sha512
> >
> > The signature of the ta

Supporting Seccomp in Mesos

2018-07-11 Thread Andrei Budnik
Hi Folks,

Here is the design doc for Seccomp support in Mesos:
https://docs.google.com/document/d/146FJJ0sDi1sp_HQxVUg-vhqVSTEsdCeD4If3b1xCeec

Seccomp is a security facility in the Linux kernel that allows a user to
specify syscall filtering rules per process. This design doc covers
various aspects of implementing Seccomp in Mesos, including the choice
of configuration format for Seccomp profiles.

Thanks for your time reviewing and providing feedback for the design!

Cheers,
Andrei


Update the *Minimum Linux Kernel version* supported on Mesos

2018-04-05 Thread Andrei Budnik
Hi All,

We would like to raise the minimum supported Linux kernel version from 2.6.23
to 2.6.28.
The Linux kernel has supported cgroups v1 since 2.6.24, but the `freezer`
cgroup functionality, which is needed to support nested containers, was only
merged in 2.6.28.

If anyone is using an older Linux kernel version, please let me know!
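
For anyone who wants to check their hosts against the proposed minimum, here
is a minimal C++ sketch of the version comparison. The names `KernelVersion`,
`parseKernelRelease` and `meetsProposedMinimum` are made up for this
illustration and are not Mesos APIs; the release string is assumed to have the
usual `major.minor.patch[-suffix]` shape, as printed by `uname -r`.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <tuple>

// (major, minor, patch); std::tuple compares lexicographically.
using KernelVersion = std::tuple<int, int, int>;

// Parses a release string such as "2.6.28" or "4.15.0-generic".
// Returns false if fewer than three numeric components are present.
inline bool parseKernelRelease(
    const std::string& release, KernelVersion* version)
{
  int a = 0, b = 0, c = 0;
  if (std::sscanf(release.c_str(), "%d.%d.%d", &a, &b, &c) != 3) {
    return false;
  }
  *version = std::make_tuple(a, b, c);
  return true;
}

// True if the kernel is at least 2.6.28, the proposed new minimum.
inline bool meetsProposedMinimum(const KernelVersion& version)
{
  return version >= KernelVersion(2, 6, 28);
}

// Illustrative usage with hard-coded release strings.
inline void exampleKernelCheck()
{
  KernelVersion version;
  assert(parseKernelRelease("2.6.23", &version));
  assert(!meetsProposedMinimum(version));  // The old minimum is too old.
  assert(parseKernelRelease("2.6.28", &version));
  assert(meetsProposedMinimum(version));
}
```

In a real check the release string would come from the running system (for
example the `release` field filled in by `uname(2)`) rather than from a
literal.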

Best,
Andrei


Re: Adding a `FLAKY` label to flaky unit tests

2018-03-29 Thread Andrei Budnik
I have a couple of questions:
1) What would be the criteria for removing the `FLAKY` label from a test? Who
will take care of removing this label?
2) Do we expect that most of our tests will eventually get the `FLAKY` label?

On Thu, Mar 29, 2018 at 7:35 PM, Meng Zhu  wrote:

> +1, the advantages are appealing.
>
> Though I am afraid that this will probably reduce the incentive to fix
> flaky tests.
>
> -Meng
>
> On Thu, Mar 29, 2018 at 9:45 AM, Benno Evers wrote:
>
> > Hi all,
> >
> > if you're regularly running Mesos unit tests, e.g. because you've set up a
> > CI system, you probably noticed that there is a lot of noise in the results
> > due to flaky tests.
> >
> > As a measure to ease the pain, what do you think about adding a `FLAKY`
> > label to known flaky unit tests, similar to how we have `ROOT`, `INTERNET`,
> > `DISABLED`, etc. right now?
> >
> > The advantages, in my opinion, would be:
> >  - Looking at test results, it would be immediately visible whether a test
> >    failure was known flaky or not without going to JIRA
> >  - People who want to reduce noise can disable all known flaky tests by a
> >    simple gtest filter
> >  - People who want to can still run the flaky tests easier than if they get
> >    disabled outright
> >  - With a little bit of scripting, it would be possible to add logic like
> >    "for flaky tests, run them 10 times and only report a failure if more than
> >    x% of the runs fail."
> >
> > What do you think?
> >
> > Best regards,
> > --
> > Benno Evers
> > Software Engineer, Mesosphere
> >
>
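
To make the last point of the quoted proposal concrete, the "run N times and
only report a failure above x%" rule could be sketched in C++ roughly as
follows. This is only an illustration: `shouldReportFailure` and its
parameters are hypothetical names, not part of Mesos or gtest, and the test is
modelled as a plain callable returning true on success.

```cpp
#include <cassert>
#include <functional>

// Runs `test` `runs` times and returns true (i.e. "report a failure") only
// if strictly more than `thresholdPercent` percent of the runs failed.
inline bool shouldReportFailure(
    const std::function<bool()>& test, int runs, double thresholdPercent)
{
  int failures = 0;
  for (int i = 0; i < runs; ++i) {
    if (!test()) {
      ++failures;
    }
  }
  return 100.0 * failures > thresholdPercent * runs;
}

// Illustrative usage.
inline void exampleFlakyPolicy()
{
  int calls = 0;
  // Stand-in for a flaky test: fails on every fifth invocation.
  auto flaky = [&calls]() { return (calls++ % 5) != 0; };

  // 2 of 10 runs fail (20%), which stays below a 30% threshold.
  assert(!shouldReportFailure(flaky, 10, 30.0));
}
```

The open questions above still apply: such a rule hides failures below the
threshold, so someone has to decide when a test is stable enough to lose the
label.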


Re: On fixing the FUTURE_DISPATCH macro

2017-06-07 Thread Andrei Budnik
Hey Michael,
An example of a flaky test is provided in [1].
Andrei
[1] https://gist.github.com/abudnik/242026538f9e4d6861e0b51408618161

On Sun, Jun 4, 2017 at 12:22 AM, Michael Park  wrote:

> This sounds good to me in principle. Could you explain why it leads to
> *flaky* tests? I only vaguely remember the issues here...
>
> On Fri, Jun 2, 2017 at 7:05 AM Andrei Budnik wrote:
>
> > MESOS-5886
> >
> > Problem description:
> >
> > Using FUTURE_DISPATCH might lead to flakiness or errors in tests.
> > FUTURE_DISPATCH
> > <https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L50>
> > uses DispatchMatcher
> > <https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L350>
> > to figure out whether a processed DispatchEvent is the one the user is
> > waiting for. Currently, we compare the std::type_info of function pointers,
> > which is not enough: different class methods with the same signature will be
> > matched (see MESOS-5886 for an example).
> >
> > A little bit of history on the issue.
> >
> > The initial implementation of DispatchMatcher used a stringified version of
> > the pointer-to-member function, which is equivalent to comparing two
> > pointer-to-member functions by value; in essence, this compares vtable
> > offsets (for virtual functions) or function addresses. This approach has an
> > issue: if two independent classes C1 and C2 have virtual functions with the
> > same vtable offsets, then DispatchMatcher might match them as the same
> > function under specific conditions (see
> > https://reviews.apache.org/r/28052/).
> >
> > To address the aforementioned problem (MESOS-2112), it was decided to use
> > type_info instead of function pointers for function matching. The type_info
> > of a class method includes information about the function signature, the
> > class name and the class namespace. However, type_info is not enough to
> > distinguish two different methods of the same class that have the same
> > signature. AlexR described a simple test that reproduces the bug in
> > MESOS-5886.
> >
> > Michael Park proposed a solution in
> > https://reviews.apache.org/r/28052/#comment106033:
> > keeping both the type_info and the value of the pointer-to-member function
> > in DispatchEvent allows us to uniquely identify class methods.
> >
> > We plan to follow MPark’s suggestion and additionally store the
> > pointer-to-member function in DispatchEvent. This will increase the memory
> > footprint of actors’ mailboxes, which is an acceptable consequence in our
> > opinion.
> >
> > Looking forward to comments and suggestions on the proposed change,
> >
> > Andrei
> >
>


Re: Added task status update reason for health checks

2017-06-07 Thread Andrei Budnik
Hey James,
We have a JIRA ticket for this, MESOS-5078, so documenting all TaskStatus
reasons is in our plans.
Andrei

On 2017-05-22 18:31 (+0200), James Peach wrote:
>
> > On May 22, 2017, at 5:28 AM, Andrei Budnik wrote:
> >
> > Hi All,
> >
> > The new reason is REASON_TASK_HEALTH_CHECK_STATUS_UPDATED.
> > The corresponding ticket is
> > https://issues.apache.org/jira/browse/MESOS-6905
>
> Is there any documentation about how executors ought to use this reason? Even
> a comment in the proto files would help executor authors use this
> consistently.
>
> J

On fixing the FUTURE_DISPATCH macro

2017-06-02 Thread Andrei Budnik
MESOS-5886



Problem description:

Using FUTURE_DISPATCH might lead to flakiness or errors in tests.
FUTURE_DISPATCH uses DispatchMatcher to figure out whether a processed
DispatchEvent is the one the user is waiting for. Currently, we compare the
std::type_info of function pointers, which is not enough: different class
methods with the same signature will be matched (see MESOS-5886 for an
example).



A little bit of history on the issue.



The initial implementation of DispatchMatcher used a stringified version of
the pointer-to-member function, which is equivalent to comparing two
pointer-to-member functions by value; in essence, this compares vtable
offsets (for virtual functions) or function addresses. This approach has an
issue: if two independent classes C1 and C2 have virtual functions with the
same vtable offsets, then DispatchMatcher might match them as the same
function under specific conditions (see
https://reviews.apache.org/r/28052/).



To address the aforementioned problem (MESOS-2112), it was decided to use
type_info instead of function pointers for function matching. The type_info
of a class method includes information about the function signature, the
class name and the class namespace. However, type_info is not enough to
distinguish two different methods of the same class that have the same
signature. AlexR described a simple test that reproduces the bug in
MESOS-5886.
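
The type_info limitation can be reproduced outside of libprocess with a few
lines of standard C++. The sketch below is only an illustration: `Actor` and
its methods are hypothetical and have nothing to do with the test in
MESOS-5886.

```cpp
#include <cassert>
#include <typeinfo>

// Hypothetical actor with two methods that share a signature.
struct Actor
{
  int start() { return 1; }
  int stop() { return 2; }
};

// Both pointers have the static type `int (Actor::*)()`, so typeid yields
// the same std::type_info for them.
inline bool sameTypeInfo()
{
  return typeid(&Actor::start) == typeid(&Actor::stop);
}

// Comparing the pointer-to-member values does tell the methods apart.
inline bool samePointerValue()
{
  return &Actor::start == &Actor::stop;
}

// Illustrative usage.
inline void exampleTypeInfoCollision()
{
  assert(sameTypeInfo());       // type_info cannot distinguish them.
  assert(!samePointerValue());  // The pointer values can.
}
```

A matcher that looks only at type_info would therefore treat a dispatch to
`Actor::stop` as a match when the test is waiting for `Actor::start`.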



Michael Park proposed a solution in
https://reviews.apache.org/r/28052/#comment106033:
keeping both the type_info and the value of the pointer-to-member function
in DispatchEvent allows us to uniquely identify class methods.



We plan to follow MPark’s suggestion and additionally store the
pointer-to-member function in DispatchEvent. This will increase the memory
footprint of actors’ mailboxes, which is an acceptable consequence in our
opinion.
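
As a rough illustration of the idea (not the actual libprocess change), the
pair of type_info and pointer value could be modelled as below. `MethodKey`,
`makeKey` and `Worker` are hypothetical names. Copying the raw bytes of the
pointer assumes its representation has no padding bytes, which is the case for
the Itanium C++ ABI used by GCC and Clang.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <typeinfo>

// Hypothetical stand-in for the matching data kept in a DispatchEvent.
struct MethodKey
{
  const std::type_info* type;  // Class and signature of the method.
  std::string value;           // Raw bytes of the pointer-to-member.
};

// Builds a key from a pointer to a non-const member function.
template <typename T, typename R, typename... Args>
MethodKey makeKey(R (T::*method)(Args...))
{
  std::string bytes(sizeof(method), '\0');
  std::memcpy(&bytes[0], &method, sizeof(method));
  return MethodKey{&typeid(method), bytes};
}

// Two keys match only if both the type and the pointer value agree.
inline bool operator==(const MethodKey& left, const MethodKey& right)
{
  return *left.type == *right.type && left.value == right.value;
}

// Hypothetical actor with two methods that share a signature.
struct Worker
{
  int start() { return 1; }
  int stop() { return 2; }
};

// Illustrative usage.
inline void exampleMethodKey()
{
  assert(makeKey(&Worker::start) == makeKey(&Worker::start));
  assert(!(makeKey(&Worker::start) == makeKey(&Worker::stop)));
}
```

The extra `value` member is where the memory cost mentioned above comes from:
every key carries `sizeof(method)` additional bytes on top of the type_info
pointer.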



Looking forward to comments and suggestions on the proposed change,

Andrei


Added task status update reason for health checks

2017-05-22 Thread Andrei Budnik
Hi All,

The new reason is REASON_TASK_HEALTH_CHECK_STATUS_UPDATED.
The corresponding ticket is https://issues.apache.org/jira/browse/MESOS-6905


Best,
Andrei Budnik