[jira] [Created] (MESOS-9776) Mention removal of *.json endpoints in 1.8.0 CHANGELOG

2019-05-08 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9776:
--

 Summary: Mention removal of *.json endpoints in 1.8.0 CHANGELOG
 Key: MESOS-9776
 URL: https://issues.apache.org/jira/browse/MESOS-9776
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


We should mention in the CHANGELOG and upgrade notes that the *.json endpoints 
that were deprecated in Mesos 0.25 were actually removed in Mesos 1.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9761) Mesos UI does not properly account for resources set via `--default-role`

2019-05-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9761:
--

 Summary: Mesos UI does not properly account for resources set via 
`--default-role`
 Key: MESOS-9761
 URL: https://issues.apache.org/jira/browse/MESOS-9761
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers
 Attachments: default_role_ui.png

In our cluster, we have two agents configured with  
"--default_role=slave_public" and 64 cpus each, for a total of 128 cpus 
allocated to this role. The right side of the screenshot shows one of them.

However, looking at the "Roles" tab in the Mesos UI, neither "Guarantee" nor 
"Limit" shows any resources for this role.

See attached screenshot for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9730) Executors cannot reconnect with agents using TLS1.3

2019-04-29 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829733#comment-16829733
 ] 

Benno Evers commented on MESOS-9730:


{noformat}
commit 4fa4f77549b43285cac974111a5a3f28828a19d8
Author: Stéphane Cottin 
Date:   Mon Apr 29 13:28:06 2019 +0200

Documented LIBPROCESS_SSL_ENABLE_TLS_V1_3.

Updated documentation about `LIBPROCESS_SSL_ENABLE_TLS_V1_3` and TLS1.3.

Review: https://reviews.apache.org/r/70563/

commit 712ee298800e257050d01b69abeaf3c4bc7d12ee
Author: Stéphane Cottin 
Date:   Mon Apr 29 13:27:04 2019 +0200

Added LIBPROCESS_SSL_ENABLE_TLS_V1_3 environment variable.

When building mesos with libopenssl >= 1.1.1, TLS1.3 is enabled by
default. This causes major communication issues between executors
and agents.

This patch adds a new `LIBPROCESS_SSL_ENABLE_TLS_V1_3` env var,
disabled by default. It should be changed to enabled by default when
full openssl >= 1.1 support will land.

Review: https://reviews.apache.org/r/70562/
{noformat}

Also backported the patches to 1.8.x branch.
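For reference, the mechanics behind the new flag boil down to a single OpenSSL
context option. A minimal sketch using the plain OpenSSL API (illustrative
only, not the actual libprocess code):

{code}
#include <openssl/ssl.h>

// Sketch: leave TLS 1.3 disabled unless explicitly enabled, mirroring
// the intent of `LIBPROCESS_SSL_ENABLE_TLS_V1_3`. SSL_OP_NO_TLSv1_3 is
// only defined for OpenSSL >= 1.1.1, hence the guard.
void configureTls13(SSL_CTX* ctx, bool enableTls13)
{
#ifdef SSL_OP_NO_TLSv1_3
  if (!enableTls13) {
    SSL_CTX_set_options(ctx, SSL_OP_NO_TLSv1_3);
  }
#endif
}
{code}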

> Executors cannot reconnect with agents using TLS1.3
> ---
>
> Key: MESOS-9730
> URL: https://issues.apache.org/jira/browse/MESOS-9730
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.8.0
>Reporter: Stéphane Cottin
>Assignee: Stéphane Cottin
>Priority: Major
>  Labels: integration, ssl
>
> TLS 1.3 support is enabled by default from openssl >= 1.1.0
> Executors do not reconnect with agents after restart when using TLS 1.3, and 
> I guess this should also affect master/slave communication.
> Suggested action:
> add a `LIBPROCESS_SSL_ENABLE_TLS_V1_3` environment variable with a `false` 
> default, and apply `SSL_OP_NO_TLSv1_3` ssl option when building with openssl 
> >= 1.1.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (MESOS-3394) Pull in glog 0.3.6 (when it's released)

2019-04-28 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-3394:
---
Comment: was deleted

(was: www.rtat.net)

> Pull in glog 0.3.6 (when it's released)
> ---
>
> Key: MESOS-3394
> URL: https://issues.apache.org/jira/browse/MESOS-3394
> Project: Mesos
>  Issue Type: Task
>  Components: cmake
>Reporter: Andrew Schwartzmeyer
>Priority: Major
>  Labels: arm64, build, cmake, freebsd, mesosphere, windows
>
> To build on Windows, we have to build glog on Windows. But, glog doesn't 
> build on Windows, so we had to submit a patch to the project. So, to build on 
> Windows, we download the patched version directly from the pull request that 
> was sent to the glog repository on GitHub.
> When these patches move upstream, we need to change this to point at the 
> "real" glog release instead of the pull request.
> (For details see the `CMakeLists.txt` in `3rdparty/libprocess/3rdparty`.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9745) Re-enable validation of protobuf unions in `ContainerInfo`

2019-04-26 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9745:
--

 Summary: Re-enable validation of protobuf unions in `ContainerInfo`
 Key: MESOS-9745
 URL: https://issues.apache.org/jira/browse/MESOS-9745
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


In MESOS-9740, we disabled protobuf union validation for `ContainerInfo` 
messages, since it was discovered that frameworks generating invalid protobuf 
of this kind currently exist in the wild.

However, that is somewhat unsatisfactory, since it reintroduces the issue 
originally described in MESOS-6874, i.e. Mesos not rejecting tasks where the 
`ContainerInfo` was accidentally malformed.

Ideally, we should implement a metric counting the number of tasks with 
malformed `ContainerInfo`s and re-enable validation after an appropriate 
warning period has passed.
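For illustration, the re-enabled check might look roughly like the following
sketch (the actual validation from MESOS-6874 is more general than the two
hand-written cases here):

{code}
#include <mesos/mesos.hpp>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

// Sketch: reject a `ContainerInfo` whose set fields contradict its `type`.
Option<Error> validateContainerInfo(const mesos::ContainerInfo& containerInfo)
{
  if (containerInfo.type() == mesos::ContainerInfo::MESOS &&
      containerInfo.has_docker()) {
    return Error(
        "Protobuf union `mesos.ContainerInfo` with `Type == MESOS`"
        " should not have the field `docker` set");
  }

  if (containerInfo.type() == mesos::ContainerInfo::DOCKER &&
      containerInfo.has_mesos()) {
    return Error(
        "Protobuf union `mesos.ContainerInfo` with `Type == DOCKER`"
        " should not have the field `mesos` set");
  }

  return None();
}
{code}

A metric would then simply count how often this function returns an `Error`
before the result is enforced.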



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters

2019-04-25 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826149#comment-16826149
 ] 

Benno Evers commented on MESOS-9740:


Preliminary review: https://reviews.apache.org/r/70538/

> Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents 
> from reregistering with 1.8+ masters
> ---
>
> Key: MESOS-9740
> URL: https://issues.apache.org/jira/browse/MESOS-9740
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Joseph Wu
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: foundations, mesosphere
>
> As part of MESOS-6874, the master now validates protobuf unions passed as 
> part of an {{ExecutorInfo::ContainerInfo}}.  This prevents a task from 
> specifying, for example, a {{ContainerInfo::MESOS}}, but filling out the 
> {{docker}} field (which is then ignored by the agent).
> However, if a task was already launched with an invalid protobuf union, the 
> same validation will happen when the agent tries to reregister with the 
> master.  In this case, if the master is upgraded to validate protobuf unions, 
> the agent reregistration will be rejected.
> {code}
> master.cpp:7201] Dropping re-registration of agent at 
> slave(1)@172.31.47.126:5051 because it sent an invalid re-registration: 
> Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the 
> field `docker` set.
> {code}
> This bug was found when upgrading a 1.7.x test cluster to 1.8.0.  When 
> MESOS-6874 was committed, I had assumed the invalid protobufs would be rare.  
> However, on the test cluster, 13/17 agents had at least one invalid 
> ContainerInfo when reregistering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9736) Error building libgrpc++ on Mac from a source tarball

2019-04-23 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9736:
--

 Summary: Error building libgrpc++ on Mac from a source tarball
 Key: MESOS-9736
 URL: https://issues.apache.org/jira/browse/MESOS-9736
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The following error was reported by [~tillt] while trying to build the 
`1.8.0-rc2` release candidate on a macOS machine:

{noformat}
make[2]: *** No rule to make target 
`../3rdparty/grpc-1.10.0/libs/opt/libgrpc++.a', needed by `libmesos.la'.  Stop.
{noformat}

Looking into the issue, the following theory was offered for the cause of 
the problem:
{quote}
I have the hunch that this isnt an macOS thing but instead a problem in our 
build setup which does (not intentionally) try to do certain things in parallel.
{quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9732) Python installation using `make install` fails inside a symlinked directory

2019-04-16 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9732:
--

 Summary: Python installation using `make install` fails inside a 
symlinked directory
 Key: MESOS-9732
 URL: https://issues.apache.org/jira/browse/MESOS-9732
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


I used to have a symlink pointing from `~/mesos` to `~/src/mesos`.

Then I attempted to `make install` from inside the `~/mesos/worktrees/release` 
directory on a build with python bindings enabled.

Now I don't have a symlink anymore.

{noformat}
bevers@poincare:~$ ls ~/src/mesos
3rdparty    compile     install-sh   mpi
aclocal.m4  config.guessLICENSE  NOTICE
ar-lib  config.sub  ltmain.shREADME.md
autom4te.cache  configure   m4   site
bin configure.acMakefile.am  src
bootstrap   depcomp Makefile.in  support
bootstrap.bat   docsmesos.pc.in  worktrees
CHANGELOG   Doxyfilemesos.sublime-project
cmake   etc_issue_orig  mesos.sublime-workspace
CMakeLists.txt  include missing
bevers@poincare:~$ ls ~/mesos
worktrees
bevers@poincare:~$ ls ~/mesos/worktrees/release/build/src/python/dist
mesos-1.8.0-py2.7.egg
mesos-1.8.0-py2-none-any.whl
mesos.cli-1.8.0-py2.7.egg
mesos.cli-1.8.0-py2-none-any.whl
mesos.executor-1.8.0-cp27-none-linux_x86_64.whl
mesos.executor-1.8.0-py2.7-linux-x86_64.egg
mesos.interface-1.8.0-py2.7.egg
mesos.interface-1.8.0-py2-none-any.whl
mesos.native-1.8.0-py2.7.egg
mesos.native-1.8.0-py2-none-any.whl
mesos.scheduler-1.8.0-cp27-none-linux_x86_64.whl
mesos.scheduler-1.8.0-py2.7-linux-x86_64.egg
{noformat}

The installation itself also fails with a predictable error:
{noformat}
OSError: [Errno 2] No such file or directory: 
'/home/bevers/mesos/worktrees/release/build/../src/python/executor/src/mesos/executor'
{noformat}

Leaving the system in a funny state as a side effect:
{noformat}
bevers@poincare:~/mesos/worktrees/release/build$ ls .
3rdparty  bin  config.log  config.lt  config.status  description-pak  include  
libtool  Makefile  mesos.pc  mpi  src
bevers@poincare:~/mesos/worktrees/release/build$ ls `pwd`
src
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9697) Release RPMs are not uploaded to bintray

2019-04-11 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814545#comment-16814545
 ] 

Benno Evers edited comment on MESOS-9697 at 4/11/19 9:12 AM:
-

After some investigation, here's my current understanding of the situation:

* The ASF Jenkins is successfully running the `Mesos/Packaging/CentOS` job ( 
https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/ ) against 
any branch that contains the file `support/jenkins/Jenkinsfile-packaging-centos`, 
i.e. currently branches 1.7.x, 1.8.x and master. This Jenkinsfile creates RPM 
packages for CentOS 6 and 7 as artifacts (using the script 
`support/packaging/centos/build-rpm-docker.sh`), but does not do anything with 
them, i.e. there is no connection to bintray. I don't know if there is any 
public download for the generated artifacts.

* There is another job `Mesos/Packaging/CentosRPMs` 
(https://builds.apache.org/job/Mesos/job/Packaging/job/CentosRPMs) defined in 
the ASF Jenkins that is not run automatically. For its setup, it's using the 
file `support/packaging/Jenkinsfile` from branch `bintray` on 
`http://github.com/karya0/mesos.git`. It is taking parameters `MESOS_RELEASE` 
and `MESOS_TAG` and will build centos 6/7 rpm packages for that release (I 
still don't understand where exactly it's taking the source code from) and 
afterwards upload them to bintray using credentials 
"karya_bintray_credentials". It was last run by [~karya] on  Feb 8, 2018 to 
produce Mesos 1.5.0 packages.

So it looks like this might not actually be broken, but rather just release 
managers not being aware that they are supposed to manually run this Jenkins 
job. I'd like to test that theory by triggering a 1.7.0 build of the latter 
job, but I don't seem to have permissions to do that on the ASF Jenkins.


was (Author: bennoe):
After some investigation, here's my current understanding of the situation:

* The ASF Jenkins is successfully running the `Mesos/Packaging/CentOS` job ( 
https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/ ) against 
any branch that contains the file `support/jenkins/Jenkinsfile-packaging-centos`, 
i.e. currently branches 1.7.x, 1.8.x and master. This Jenkinsfile creates RPM 
packages for CentOS 6 and 7 as artifacts (using the script 
`support/packaging/centos/build-rpm-docker.sh`), but does not do anything with 
them, i.e. there is no connection to bintray. I don't know if there is any 
public download for the generated artifacts.

* There is another job `Mesos/Packaging/CentosRPMs` 
(https://builds.apache.org/job/Mesos/job/Packaging/job/CentosRPMs) defined in 
the ASF Jenkins that is not run manually. For its setup, it's using the file 
`support/packaging/Jenkinsfile` from branch `bintray` on 
`http://github.com/karya0/mesos.git`. It is taking parameters `MESOS_RELEASE` 
and `MESOS_TAG` and will build centos 6/7 rpm packages for that release (I 
still don't understand where exactly it's taking the source code from) and 
afterwards upload them to bintray using credentials 
"karya_bintray_credentials". It was last run by [~karya] on  Feb 8, 2018 to 
produce Mesos 1.5.0 packages.

So it looks like this might not actually be broken, but rather just release 
managers not being aware that they are supposed to manually run this Jenkins 
job. I'd like to test that theory by triggering a 1.7.0 build of the latter 
job, but I don't seem to have permissions to do that on the ASF Jenkins.

> Release RPMs are not uploaded to bintray
> 
>
> Key: MESOS-9697
> URL: https://issues.apache.org/jira/browse/MESOS-9697
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.6.2, 1.7.2, 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Critical
>  Labels: foundations, integration, jenkins, packaging, rpm
>
> While we currently build release RPMs, e.g., 
> [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/],
>  these artifacts are not uploaded to bintray. Due to that RPM links on the 
> downloads page [http://mesos.apache.org/downloads/] are broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9697) Release RPMs are not uploaded to bintray

2019-04-10 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814545#comment-16814545
 ] 

Benno Evers commented on MESOS-9697:


After some investigation, here's my current understanding of the situation:

* The ASF Jenkins is successfully running the `Mesos/Packaging/CentOS` job ( 
https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/ ) against 
any branch that contains the file `support/jenkins/Jenkinsfile-packaging-centos`, 
i.e. currently branches 1.7.x, 1.8.x and master. This Jenkinsfile creates RPM 
packages for CentOS 6 and 7 as artifacts (using the script 
`support/packaging/centos/build-rpm-docker.sh`), but does not do anything with 
them, i.e. there is no connection to bintray. I don't know if there is any 
public download for the generated artifacts.

* There is another job `Mesos/Packaging/CentosRPMs` 
(https://builds.apache.org/job/Mesos/job/Packaging/job/CentosRPMs) defined in 
the ASF Jenkins that is not run manually. For its setup, it's using the file 
`support/packaging/Jenkinsfile` from branch `bintray` on 
`http://github.com/karya0/mesos.git`. It is taking parameters `MESOS_RELEASE` 
and `MESOS_TAG` and will build centos 6/7 rpm packages for that release (I 
still don't understand where exactly it's taking the source code from) and 
afterwards upload them to bintray using credentials 
"karya_bintray_credentials". It was last run by [~karya] on  Feb 8, 2018 to 
produce Mesos 1.5.0 packages.

So it looks like this might not actually be broken, but rather just release 
managers not being aware that they are supposed to manually run this Jenkins 
job. I'd like to test that theory by triggering a 1.7.0 build of the latter 
job, but I don't seem to have permissions to do that on the ASF Jenkins.

> Release RPMs are not uploaded to bintray
> 
>
> Key: MESOS-9697
> URL: https://issues.apache.org/jira/browse/MESOS-9697
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.6.2, 1.7.2, 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Critical
>  Labels: foundations, integration, jenkins, packaging, rpm
>
> While we currently build release RPMs, e.g., 
> [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/],
>  these artifacts are not uploaded to bintray. Due to that RPM links on the 
> downloads page [http://mesos.apache.org/downloads/] are broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9697) Release RPMs are not uploaded to bintray

2019-04-08 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812435#comment-16812435
 ] 

Benno Evers commented on MESOS-9697:


Changing priority to "Critical", since this does not have an associated target 
version (and is thus, technically, not blocking any release).

> Release RPMs are not uploaded to bintray
> 
>
> Key: MESOS-9697
> URL: https://issues.apache.org/jira/browse/MESOS-9697
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.6.2, 1.7.2, 1.8.0
>Reporter: Benjamin Bannier
>Priority: Blocker
>  Labels: integration, jenkins, packaging, rpm
>
> While we currently build release RPMs, e.g., 
> [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/],
>  these artifacts are not uploaded to bintray. Due to that RPM links on the 
> downloads page [http://mesos.apache.org/downloads/] are broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9565) Unit tests for destroying persistent volumes in SLRP.

2019-04-08 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812423#comment-16812423
 ] 

Benno Evers commented on MESOS-9565:


Status summary: the first 6 reviews of the chain posted above have been 
submitted; the remaining two are still pending due to the following review 
comment by [~bbannier]:

{quote}
These tests seem to have issues when executed under load. When putting extra 
stress on the system with stress-ng I was able to get e.g., 
CreateDestroyPersistentVolume to break after only 4 iterations
{quote}

> Unit tests for destroying persistent volumes in SLRP.
> -
>
> Key: MESOS-9565
> URL: https://issues.apache.org/jira/browse/MESOS-9565
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> The plan is to add/update the following unit tests to test persistent volume 
> destroy:
> * CreateDestroyDisk
> * CreateDestroyDiskWithRecovery
> * CreateDestroyPersistentMountVolume
> * CreateDestroyPersistentMountVolumeWithRecovery
> * CreateDestroyPersistentMountVolumeWithReboot
> * CreateDestroyPersistentBlockVolume
> * DestroyPersistentMountVolumeFailed
> * DestroyUnpublishedPersistentVolume
> * DestroyUnpublishedPersistentVolumeWithRecovery
> * DestroyUnpublishedPersistentVolumeWithReboot
> * RecoverPublishedPersistentVolumeFailed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9624) Bundle CSI spec v1.0 in Mesos.

2019-04-08 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812416#comment-16812416
 ] 

Benno Evers edited comment on MESOS-9624 at 4/8/19 1:15 PM:


Closing this since all related patches seem to have landed.

{noformat}
commit 3da54965d02a6bf0e4806bf2d4acebb3310d60f7
Author: Chun-Hung Hsiao chhs...@mesosphere.io
Date:   Thu Mar 28 21:26:04 2019 -0700

Bundled CSI spec 1.1.0.

Since the CSI v1 spec proto file depends on certain proto files in the
Protobuf library, we have to ensure the Protobuf library's include path
is in the proto paths of the `protoc` command when compiling the CSI
spec proto file. Specifically in Autotools, this path is passed through
the `PROTOBUF_PROTOCFLAGS` variable when building with an unbundled
protobuf library.

Review: https://reviews.apache.org/r/70360
{noformat}
{noformat}
commit 6ef64a3a6ff34975d58abbb0b78e2b402d39873c
Author: Chun-Hung Hsiao chhs...@mesosphere.io
Date:   Thu Mar 28 22:14:32 2019 -0700

Added spec inclusion header and type helpers for CSI v1.

Review: https://reviews.apache.org/r/70361
{noformat}



was (Author: bennoe):
Closing this since all related patches seem to have landed.

{noformat}
commit 3da54965d02a6bf0e4806bf2d4acebb3310d60f7
Author: Chun-Hung Hsiao chhs...@mesosphere.io
Date:   Thu Mar 28 21:26:04 2019 -0700

Bundled CSI spec 1.1.0.

Since the CSI v1 spec proto file depends on certain proto files in the
Protobuf library, we have to ensure the Protobuf library's include path
is in the proto paths of the `protoc` command when compiling the CSI
spec proto file. Specifically in Autotools, this path is passed through
the `PROTOBUF_PROTOCFLAGS` variable when building with an unbundled
protobuf library.

Review: https://reviews.apache.org/r/70360
{noformat}
commit 6ef64a3a6ff34975d58abbb0b78e2b402d39873c
Author: Chun-Hung Hsiao chhs...@mesosphere.io
Date:   Thu Mar 28 22:14:32 2019 -0700

Added spec inclusion header and type helpers for CSI v1.

Review: https://reviews.apache.org/r/70361
{noformat}


> Bundle CSI spec v1.0 in Mesos.
> --
>
> Key: MESOS-9624
> URL: https://issues.apache.org/jira/browse/MESOS-9624
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
> Fix For: 1.8.0
>
>
> We need to bundle both CSI v0 and v1 in Mesos. This requires some redesign of 
> the source code filesystem layout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9624) Bundle CSI spec v1.0 in Mesos.

2019-04-08 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812416#comment-16812416
 ] 

Benno Evers commented on MESOS-9624:


Closing this since all related patches seem to have landed.

{noformat}
commit 3da54965d02a6bf0e4806bf2d4acebb3310d60f7
Author: Chun-Hung Hsiao chhs...@mesosphere.io
Date:   Thu Mar 28 21:26:04 2019 -0700

Bundled CSI spec 1.1.0.

Since the CSI v1 spec proto file depends on certain proto files in the
Protobuf library, we have to ensure the Protobuf library's include path
is in the proto paths of the `protoc` command when compiling the CSI
spec proto file. Specifically in Autotools, this path is passed through
the `PROTOBUF_PROTOCFLAGS` variable when building with an unbundled
protobuf library.

Review: https://reviews.apache.org/r/70360
{noformat}
commit 6ef64a3a6ff34975d58abbb0b78e2b402d39873c
Author: Chun-Hung Hsiao chhs...@mesosphere.io
Date:   Thu Mar 28 22:14:32 2019 -0700

Added spec inclusion header and type helpers for CSI v1.

Review: https://reviews.apache.org/r/70361
{noformat}


> Bundle CSI spec v1.0 in Mesos.
> --
>
> Key: MESOS-9624
> URL: https://issues.apache.org/jira/browse/MESOS-9624
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> We need to bundle both CSI v0 and v1 in Mesos. This requires some redesign of 
> the source code filesystem layout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8257) Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path

2019-04-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810819#comment-16810819
 ] 

Benno Evers commented on MESOS-8257:


I removed the 1.8.0 target designation here and in the linked ticket, since it 
looks like there hasn't been any recent activity here; please feel free to 
revert as you see fit.

> Unified Containerizer "leaks" a target container mount path to the host FS 
> when the target resolves to an absolute path
> ---
>
> Key: MESOS-8257
> URL: https://issues.apache.org/jira/browse/MESOS-8257
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Jason Lai
>Assignee: Jason Lai
>Priority: Critical
>  Labels: bug, containerization, containerizer, mountpath
>
> If a target path under the root FS provisioned from an image resolves to an 
> absolute path, it will not appear in the container root FS after 
> {{pivot_root(2)}} is called.
> A typical example is that when the target path is under {{/var/run}} (e.g. 
> {{/var/run/some-dir}}), which is usually a symlink to an absolute path of 
> {{/run}} in Debian images, the target path will get resolved as and created 
> at {{/run/some-dir}} in the host root FS, after the container root FS gets 
> provisioned. The target path will get unmounted after {{pivot_root(2)}} as it 
> is part of the old root (host FS).
> A workaround is to use {{/run}} instead of {{/var/run}}, but absolute 
> symlinks need to be resolved within the scope of the container root FS path.
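A sketch of the scoped resolution suggested in the last paragraph
(illustrative only; it handles a single level of symlink, while a real
implementation would have to resolve recursively):

{code}
#include <limits.h>
#include <unistd.h>

#include <string>

// Sketch: resolve a possibly-symlinked mount target relative to the
// container root filesystem, so that e.g. "/var/run" -> "/run" maps to
// "<rootfs>/run" instead of leaking to the host's "/run".
std::string resolveInRootfs(
    const std::string& rootfs, const std::string& target)
{
  char buffer[PATH_MAX];
  ssize_t length =
    ::readlink((rootfs + target).c_str(), buffer, sizeof(buffer) - 1);

  if (length < 0) {
    return rootfs + target; // Not a symlink: use the path as-is.
  }

  buffer[length] = '\0';
  const std::string link(buffer);

  if (!link.empty() && link[0] == '/') {
    return rootfs + link; // Absolute symlink: re-anchor under the rootfs.
  }

  // Relative symlink: resolve against the target's parent directory.
  return rootfs + target.substr(0, target.rfind('/') + 1) + link;
}
{code}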



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9677) RPM packages should be built with launcher sealing

2019-04-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810759#comment-16810759
 ] 

Benno Evers commented on MESOS-9677:


The `memfd_create()` manpage states:
{quote}
The memfd_create() system call first appeared in Linux 3.17
{quote}

According to Wikipedia, CentOS 7 uses kernels from the 3.10 series:
https://en.wikipedia.org/wiki/CentOS#Latest_version_information

So I'm not sure if it will really be safe to enable this by default on CentOS 
7. [~gilbert], can you clarify this?
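One way to sidestep the kernel-version question would be a runtime probe
instead of a compile-time decision; a sketch (assuming the build host's
headers define `SYS_memfd_create`):

{code}
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sketch: probe whether the running kernel supports memfd_create(2).
// On kernels older than 3.17 (e.g. the 3.10 series shipped with
// CentOS 7) the syscall fails with ENOSYS.
bool supportsMemfdCreate()
{
  long fd = ::syscall(SYS_memfd_create, "probe", 0);
  if (fd >= 0) {
    ::close(static_cast<int>(fd));
    return true;
  }
  return errno != ENOSYS;
}
{code}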

> RPM packages should be built with launcher sealing
> --
>
> Key: MESOS-9677
> URL: https://issues.apache.org/jira/browse/MESOS-9677
> Project: Mesos
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.8.0
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: integration, mesosphere, packaging, rpm, storage
>
> We should consider enabling launcher sealing in the Mesos RPM packages. Since 
> this feature is built conditionally, it is hard to write e.g., module code 
> against Mesos packages since required functions might be missing (e.g., 
> [https://github.com/dcos/dcos-mesos-modules/commit/8ce70e6cc789054831daa3058647e326b2b11bc9]
>  cannot be linked against the default RPM package anymore). The RPM's target 
> platform centos7 should include a recent enough kernel for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9313) Document speculative offer operation semantics for framework writers.

2019-04-04 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809972#comment-16809972
 ] 

Benno Evers commented on MESOS-9313:


I'm not so sure that framework authors can just treat this as an opaque 
implementation detail, because I'd assume the `reason` field would differ 
between a task failing because it was launched on resources that were never 
actually reserved on the agent, and a task failing for other reasons.

Additionally, I think it's just better user experience to get people to 
understand *why* certain state transitions can happen, as opposed to just 
saying nothing is ever certain so deal with it.

That said, it doesn't look like anyone is currently working on this so I'm 
removing the 1.8 target version designation from this task.

> Document speculative offer operation semantics for framework writers.
> -
>
> Key: MESOS-9313
> URL: https://issues.apache.org/jira/browse/MESOS-9313
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere, operation-feedback, operations
>
> It recently came to my attention that a subset of offer operations (e.g. 
> RESERVE, UNRESERVE, et al.) are implemented speculatively within mesos 
> master. Meaning that the master will apply the resource conversion internally 
> **before** the conversion is checkpointed on the agent. The master may then 
> re-offer the converted resource to a framework -- even though the agent may 
> still not have checkpointed the resource conversion. If the checkpointing 
> process on the agent fails, then subsequent operations issued for the 
> falsely-offered resource will fail. Because the master essentially "lied" to 
> the framework about the true state of the supposedly-converted resource.
> It's also been explained to me that this case is expected to be rare. 
> However, it *can* impact the design/implementation of framework state 
> machines and so it's critical that this information be documented clearly - 
> outside of the C++ code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9675) Docker Manifest V2 Schema2 Support.

2019-04-04 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9675:
--

Assignee: Gilbert Song

> Docker Manifest V2 Schema2 Support.
> ---
>
> Key: MESOS-9675
> URL: https://issues.apache.org/jira/browse/MESOS-9675
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: containerization
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8068) Non-revocable bursting over quota guarantees via limits.

2019-04-04 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809815#comment-16809815
 ] 

Benno Evers commented on MESOS-8068:


Removed the 1.8.0 target version since it's not going to be completed for that 
version; feel free to revert as you see fit.

> Non-revocable bursting over quota guarantees via limits.
> 
>
> Key: MESOS-8068
> URL: https://issues.apache.org/jira/browse/MESOS-8068
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: multitenancy, resource-management
>
> Prior to introducing a revocable tier of allocation (see MESOS-4441), there 
> is a notion of whether a role can burst over its quota guarantee.
> We currently apply implicit limits in the following way:
> No quota guarantee set: (guarantee 0, no limit)
> Quota guarantee set: (guarantee G, limit G)
> That is, we only support burst-only without guarantee and guarantee-only 
> without burst. We do not support bursting over some non-zero guarantee: 
> (guarantee G, limit L >= G).
> The idea here is that we should make these implicit limits explicit to 
> clarify for users the distinction between guarantees and limits, and to 
> support bursting over the guarantee.
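To make the proposed distinction concrete, a hypothetical role configuration
might behave like this:

{noformat}
role "analytics": guarantee 10 cpus, limit 20 cpus

  - the allocator must always be able to satisfy 10 cpus for the role;
  - when spare capacity exists, the role may be allocated up to 20 cpus,
    i.e. it can burst 10 cpus over its guarantee;
  - the role is never allocated more than 20 cpus in total.
{noformat}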



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7428) Report exit code of tasks from default and command executors

2019-04-04 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809793#comment-16809793
 ] 

Benno Evers commented on MESOS-7428:


I'm removing the 1.8.0 target version since this hasn't been updated for a 
while. Please feel free to revert as you see fit.

> Report exit code of tasks from default and command executors
> 
>
> Key: MESOS-7428
> URL: https://issues.apache.org/jira/browse/MESOS-7428
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Eric Chung
>Priority: Major
>
> Use case: some tasks should only be retried if the exit code matches certain 
> user requirement.
> Based on [~gilbert], we already checkpoint the exit code in containerizer 
> now, and we need to clarify how to report exit code for executor containers 
> v.s. nested containers, and we should do this consistently for command and 
> default executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7776) Document `MESOS_CONTAINER_IP`

2019-04-04 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809776#comment-16809776
 ] 

Benno Evers commented on MESOS-7776:


I'm removing the target version designation for now, since it looks like this 
is currently not being worked on. Please revert as you see fit.

> Document `MESOS_CONTAINER_IP` 
> --
>
> Key: MESOS-7776
> URL: https://issues.apache.org/jira/browse/MESOS-7776
> Project: Mesos
>  Issue Type: Documentation
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>Priority: Major
>
> We introduced `MESOS_CONTAINER_IP` to inform tasks launched by the 
> default-executor about their container IP. This was done primarily to break 
> the dependency of the containers on `LIBPROCESS_IP` to learn their IP 
> addresses, which was misleading. 
> This change needs to be documented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call

2019-04-03 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808914#comment-16808914
 ] 

Benno Evers commented on MESOS-7974:


Re-targeted to 1.9.0.

> Accept "application/recordio" type is rejected for master operator API 
> SUBSCRIBE call
> -
>
> Key: MESOS-7974
> URL: https://issues.apache.org/jira/browse/MESOS-7974
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: James DeFelice
>Assignee: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> The agent operator API supports "application/recordio" for things like 
> attach-container-output, which streams objects back to the caller. I expected 
> the master operator API SUBSCRIBE call to work the same way, w/ 
> Accept/Content-Type headers for "recordio" and 
> Message-Accept/Message-Content-Type headers for json (or protobuf). This was 
> not the case.
> Looking again at the master operator API documentation, SUBSCRIBE docs 
> illustrate usage Accept and Content-Type headers for the "application/json" 
> type. Not a "recordio" type. So my experience, as per the docs, seems 
> expected. However, this is counter-intuitive since the whole point of adding 
> the new Message-prefixed headers was to help callers consistently request 
> (and differentiate) streaming responses from non-streaming responses in the 
> v1 API.
> Please fix the master operator API implementation to also support the 
> Message-prefixed headers w/ Accept/Content-Type set to "recordio".
> Observed on ubuntu w/ mesos package version 1.2.1-2.0.1
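For reference, the header combination the reporter expected for a streaming
master SUBSCRIBE, mirroring the agent's streaming calls, would be something
like:

{noformat}
POST /api/v1 HTTP/1.1
Content-Type: application/json
Accept: application/recordio
Message-Accept: application/json
{noformat}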



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9082) Avoid two trips through the master mailbox for state.json requests.

2019-04-03 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9082:
--

Assignee: (was: Benno Evers)

> Avoid two trips through the master mailbox for state.json requests.
> ---
>
> Key: MESOS-9082
> URL: https://issues.apache.org/jira/browse/MESOS-9082
> Project: Mesos
>  Issue Type: Task
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: foundations, mesosphere, performance
>
> Currently, a state.json request travels through the master's mailbox twice: 
> before authorization and after. This increases the overall state.json 
> response time by around 30%.
> To remove one mailbox trip, we can perform the initial portion (validation 
> and authorization) of state and /state off the master actor by using a 
> top-level {{Route}}, then dispatch onto the master actor only for json / 
> protobuf serialization. This should drop the authorization time down to near 
> 0 if it's indeed mostly queuing delay.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8148) Enforce text attribute value specification for zone and region values

2019-04-03 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-8148:
--

Assignee: (was: Benno Evers)

> Enforce text attribute value specification for zone and region values
> -
>
> Key: MESOS-8148
> URL: https://issues.apache.org/jira/browse/MESOS-8148
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Tim Harper
>Priority: Major
>
> Mesos has a specification for characters allowed by attribute values:
> http://mesos.apache.org/documentation/latest/attributes-resources/
> The specification is as follows:
> {code}
> scalar : floatValue
> floatValue : ( intValue ( "." intValue )? ) | ...
> intValue : [0-9]+
> range : "[" rangeValue ( "," rangeValue )* "]"
> rangeValue : scalar "-" scalar
> set : "{" text ( "," text )* "}"
> text : [a-zA-Z0-9_/.-]
> {code}
> Marathon is [implementing IN and IS 
> constraints|https://docs.google.com/document/d/e/2PACX-1vSFvPol0pcHC2Web7EaNU0oSDS5wrOWSgFcmuslYBtISV2NB2JZ_D-B4wpWy_Vutaf08m2LX6WZVy6s/pub],
>  and includes plans to support further attribute types as it makes sense to 
> do so (IE {{{a,b} IS {b,a}}}, {{5 IN [0-10]}}). In order 
> to do this, Marathon has adopted the Mesos attribute value specification and 
> will enforce it in the validation layer. As an example, it will be possible 
> to write things like:
> {code:java}
> "constraints": [
>   ["attribute", "IN", "{value-a,value-b,value-c}"]
> ]
> {code}
> Additionally, Marathon allows one to specify constraints on non-attribute 
> properties, such as region, hostname, or zone. If somebody specified a zone 
> value with a comma, then the user would not be able to use the Mesos set 
> value type specification to describe a set of zones in which an app should be 
> deployed, and, as a consequence, would result in additional complexity (IE: 
> Marathon would need to implement an escaping mechanism for this case).
> Ideally, the character space is confined to begin with. If the text type 
> specification is sufficient, then it seems simpler to re-use it rather than 
> create another one.
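For illustration, enforcing the `text` production above could be as simple as
the following sketch (assuming a value must consist of one or more allowed
characters):

{code}
#include <regex>
#include <string>

// Sketch: validate a value against the attribute `text` production,
// i.e. one or more characters from [a-zA-Z0-9_/.-].
bool isValidTextValue(const std::string& value)
{
  static const std::regex pattern("[a-zA-Z0-9_/.-]+");
  return std::regex_match(value, pattern);
}
{code}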



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9615) Example framework for feedback on agent default resources

2019-03-29 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805002#comment-16805002
 ] 

Benno Evers commented on MESOS-9615:


{noformat}
commit 1915150c6a83cd95197e25a68a6adf9b3ef5fb11
Author: Benno Evers 
Date:   Fri Mar 22 17:51:34 2019 +0100

Added new example framework for operation feedback.

This adds a new example framework showcasing a possible
implementation of the newly added operation feedback API.

Review: https://reviews.apache.org/r/70282
{noformat}

> Example framework for feedback on agent default resources
> -
>
> Key: MESOS-9615
> URL: https://issues.apache.org/jira/browse/MESOS-9615
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations, mesosphere
>
> We need a framework that can be used to test operations on agent default 
> resources which request operation feedback.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9687) Add the glog patch to pass microseconds via the LogSink interface.

2019-03-29 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804978#comment-16804978
 ] 

Benno Evers commented on MESOS-9687:


Interface extension landed in:
{noformat}
commit 8cba86825449c35733a0b4cf0d14284055c2cc30 (HEAD -> master, origin/master)
Author: Andrei Sekretenko 
Date:   Fri Mar 29 14:23:57 2019 +0100

Extended the glog LogSink interface to be able to log microseconds.

Extended the LogSink interface to be able to log microseconds.

This makes possible to solve a problem with modules implementing custom 
LogSink which currently log 00 instead of microseconds.

This is a backport of this patch: https://github.com/google/glog/pull/441 
to glog 0.3.3

Review: https://reviews.apache.org/r/70334/
{noformat}

Modules can now use the new interface method
{noformat}
virtual void send(LogSeverity severity, const char* full_filename,
const char* base_filename, int line,
const struct ::tm* tm_time,
const char* message, size_t message_len, int32 usecs) 
{noformat}

to include microseconds in the log output.
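A minimal sketch of a module-side sink using the extended overload (assuming
a glog build that carries the backported patch):

{code}
#include <cstdio>

#include <glog/logging.h>

// Sketch: a LogSink that prints the microseconds the old interface dropped.
class MicrosecondSink : public google::LogSink
{
public:
  virtual void send(google::LogSeverity severity, const char* full_filename,
                    const char* base_filename, int line,
                    const struct ::tm* tm_time,
                    const char* message, size_t message_len,
                    google::int32 usecs)
  {
    fprintf(stderr, "%02d:%02d:%02d.%06d %s:%d] %.*s\n",
            tm_time->tm_hour, tm_time->tm_min, tm_time->tm_sec,
            static_cast<int>(usecs), base_filename, line,
            static_cast<int>(message_len), message);
  }
};
{code}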

> Add the glog patch to pass microseconds via the LogSink interface.
> --
>
> Key: MESOS-9687
> URL: https://issues.apache.org/jira/browse/MESOS-9687
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Sekretenko
>Priority: Major
>
> Currently, custom LogSink implementations in the modules (for example, this 
> one:
>  [https://github.com/dcos/dcos-mesos-modules/blob/master/logsink/logsink.hpp] 
> )
>  are logging `00` instead of microseconds in the timestamp - simply 
> because the LogSink interface in glog has no place for microseconds.
> The proposed glog fix is here: [https://github.com/google/glog/pull/441]
> Getting this into glog release might take a long time (they released 0.4.0 
> recently, but the previous release 0.3.5 was two years ago), therefore it 
> makes sense to add this patch into Mesos build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9687) Add the glog patch to pass microseconds via the LogSink interface.

2019-03-29 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9687:
--

Assignee: Benno Evers

> Add the glog patch to pass microseconds via the LogSink interface.
> --
>
> Key: MESOS-9687
> URL: https://issues.apache.org/jira/browse/MESOS-9687
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Sekretenko
>Assignee: Benno Evers
>Priority: Major
>
> Currently, custom LogSink implementations in the modules (for example, this 
> one:
>  [https://github.com/dcos/dcos-mesos-modules/blob/master/logsink/logsink.hpp] 
> )
>  are logging `00` instead of microseconds in the timestamp - simply 
> because the LogSink interface in glog has no place for microseconds.
> The proposed glog fix is here: [https://github.com/google/glog/pull/441]
> Getting this into glog release might take a long time (they released 0.4.0 
> recently, but the previous release 0.3.5 was two years ago), therefore it 
> makes sense to add this patch into Mesos build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9687) Add the glog patch to pass microseconds via the LogSink interface.

2019-03-29 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9687:
--

Assignee: Andrei Sekretenko  (was: Benno Evers)

> Add the glog patch to pass microseconds via the LogSink interface.
> --
>
> Key: MESOS-9687
> URL: https://issues.apache.org/jira/browse/MESOS-9687
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Major
>
> Currently, custom LogSink implementations in the modules (for example, this 
> one:
>  [https://github.com/dcos/dcos-mesos-modules/blob/master/logsink/logsink.hpp] 
> )
>  are logging `00` instead of microseconds in the timestamp - simply 
> because the LogSink interface in glog has no place for microseconds.
> The proposed glog fix is here: [https://github.com/google/glog/pull/441]
> Getting this into glog release might take a long time (they released 0.4.0 
> recently, but the previous release 0.3.5 was two years ago), therefore it 
> makes sense to add this patch into Mesos build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9690) Framework registration can silently fail w/o visible error

2019-03-29 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804905#comment-16804905
 ] 

Benno Evers commented on MESOS-9690:


The authentication issues mentioned in the original ticket turned out to be a 
red herring, so I updated the ticket description and labels.

> Framework registration can silently fail w/o visible error
> --
>
> Key: MESOS-9690
> URL: https://issues.apache.org/jira/browse/MESOS-9690
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: foundations
>
> When running a v1 framework the master can sometimes respond with "503 
> Service Unavailable" to a SUBSCRIBE request, without any log message hinting 
> at what might be wrong even at log level `GLOG_v=4`. For example, this is 
> from an attempt to run the `OperationFeedbackFramework` against `mesos-local`:
> {noformat}
> I0328 18:17:53.273442  7793 scheduler.cpp:600] Sending SUBSCRIBE call to 
> http://127.0.1.1:36423/master/api/v1/scheduler
> I0328 18:17:53.273653  7797 leveldb.cpp:347] Persisting action (14 bytes) to 
> leveldb took 3.185352ms
> I0328 18:17:53.273695  7797 replica.cpp:712] Persisted action NOP at position 0
> I0328 18:17:53.274099  7798 containerizer.cpp:1123] Recovering isolators
> I0328 18:17:53.274602  7794 replica.cpp:695] Replica received learned notice 
> for position 0 from log-network(1)@127.0.1.1:36423
> I0328 18:17:53.274829  7798 containerizer.cpp:1162] Recovering provisioner
> I0328 18:17:53.275249  7795 process.cpp:3588] Handling HTTP event for process 
> 'master' with path: '/master/api/v1/scheduler'
> I0328 18:17:53.276659  7792 provisioner.cpp:494] Provisioner recovery complete
> I0328 18:17:53.277318  7796 slave.cpp:7602] Recovering executors
> I0328 18:17:53.277470  7796 slave.cpp:7755] Finished recovery
> I0328 18:17:53.277743  7794 leveldb.cpp:347] Persisting action (16 bytes) to 
> leveldb took 3.110989ms
> I0328 18:17:53.27  7794 replica.cpp:712] Persisted action NOP at position 0
> I0328 18:17:53.278400  7795 http.cpp:1105] HTTP POST for 
> /master/api/v1/scheduler from 127.0.0.1:45952
> I0328 18:17:53.278426  7793 task_status_update_manager.cpp:181] Pausing 
> sending task status updates
> I0328 18:17:53.278453  7794 log.cpp:570] Writer started with ending position 0
> I0328 18:17:53.278425  7798 status_update_manager_process.hpp:379] Pausing 
> operation status update manager
> I0328 18:17:53.278431  7796 slave.cpp:1258] New master detected at 
> master@127.0.1.1:36423
> I0328 18:17:53.278502  7796 slave.cpp:1312] No credentials provided. 
> Attempting to register without authentication
> I0328 18:17:53.278560  7796 slave.cpp:1323] Detecting new master
> W0328 18:17:53.279768  7791 scheduler.cpp:697] Received '503 Service 
> Unavailable' () for SUBSCRIBE
> {noformat}
> Regardless of the actual issue that caused the error response, I think at the 
> very least,
>  - the `mesos::scheduler::Mesos` class should either have a way to provide 
> some feedback to the user or retry itself, not silently swallow the error
> - our documentation should mention the possibility of this call returning 
> errors



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8241) Add metrics for offer operation feedback

2019-03-28 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804560#comment-16804560
 ] 

Benno Evers commented on MESOS-8241:


{noformat}
commit ede2a94ebaf9710516816bae7d012d926c533a59
Author: Benno Evers 
Date:   Thu Feb 28 18:02:56 2019 +0100

Added unit tests for offer operation feedback metrics.

This adds a set of checks to verify the metrics introduced
in the previous commit are working as intended.

Review: https://reviews.apache.org/r/70117

commit 18c401563c33022240fede63fbe3ec9b7bf4c385
Author: Benno Evers 
Date:   Thu Feb 28 18:03:27 2019 +0100

Added metrics for offer operation feedback.

This commit adds additional metrics counting the
number of operations in each state.

Unit tests are added in the subsequent commit.

Review: https://reviews.apache.org/r/70116

commit af2c47a5e680b5c3140fd7d4639750f476f1627c
Author: Benno Evers 
Date:   Thu Mar 7 17:51:22 2019 +0100

Added helper to test for metrics values.

This patch adds a new helper function to check
whether a given metric has some specified value.

Review: https://reviews.apache.org/r/70156

commit 5e4aa14a2b6c5c753248e642289c04a267aca074
Author: Benno Evers 
Date:   Thu Feb 28 18:01:47 2019 +0100

Updated comment about operations.

Review: https://reviews.apache.org/r/70115
{noformat}

> Add metrics for offer operation feedback
> 
>
> Key: MESOS-8241
> URL: https://issues.apache.org/jira/browse/MESOS-8241
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: foundations, mesosphere, operation-feedback
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9690) Framework registration on mesos-local fails w/o error unless http_framework_authenticators flag is set

2019-03-28 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9690:
--

 Summary: Framework registration on mesos-local fails w/o error 
unless http_framework_authenticators flag is set
 Key: MESOS-9690
 URL: https://issues.apache.org/jira/browse/MESOS-9690
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


When running a v1 framework against mesos-local without setting the 
"--http_framework_authenticators=basic" flag, the master will respond with "503 
Service Unavailable" to a SUBSCRIBE request, without any log message hinting at 
what might be wrong even at log level `GLOG_v=4`:

{noformat}
I0328 18:17:53.273442  7793 scheduler.cpp:600] Sending SUBSCRIBE call to 
http://127.0.1.1:36423/master/api/v1/scheduler
I0328 18:17:53.273653  7797 leveldb.cpp:347] Persisting action (14 bytes) to 
leveldb took 3.185352ms
I0328 18:17:53.273695  7797 replica.cpp:712] Persisted action NOP at position 0
I0328 18:17:53.274099  7798 containerizer.cpp:1123] Recovering isolators
I0328 18:17:53.274602  7794 replica.cpp:695] Replica received learned notice 
for position 0 from log-network(1)@127.0.1.1:36423
I0328 18:17:53.274829  7798 containerizer.cpp:1162] Recovering provisioner
I0328 18:17:53.275249  7795 process.cpp:3588] Handling HTTP event for process 
'master' with path: '/master/api/v1/scheduler'
I0328 18:17:53.276659  7792 provisioner.cpp:494] Provisioner recovery complete
I0328 18:17:53.277318  7796 slave.cpp:7602] Recovering executors
I0328 18:17:53.277470  7796 slave.cpp:7755] Finished recovery
I0328 18:17:53.277743  7794 leveldb.cpp:347] Persisting action (16 bytes) to 
leveldb took 3.110989ms
I0328 18:17:53.27  7794 replica.cpp:712] Persisted action NOP at position 0
I0328 18:17:53.278400  7795 http.cpp:1105] HTTP POST for 
/master/api/v1/scheduler from 127.0.0.1:45952
I0328 18:17:53.278426  7793 task_status_update_manager.cpp:181] Pausing sending 
task status updates
I0328 18:17:53.278453  7794 log.cpp:570] Writer started with ending position 0
I0328 18:17:53.278425  7798 status_update_manager_process.hpp:379] Pausing 
operation status update manager
I0328 18:17:53.278431  7796 slave.cpp:1258] New master detected at 
master@127.0.1.1:36423
I0328 18:17:53.278502  7796 slave.cpp:1312] No credentials provided. Attempting 
to register without authentication
I0328 18:17:53.278560  7796 slave.cpp:1323] Detecting new master
W0328 18:17:53.279768  7791 scheduler.cpp:697] Received '503 Service 
Unavailable' () for SUBSCRIBE
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9666) Specifying custom CXXFLAGS breaks Mesos build

2019-03-21 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9666:
--

 Summary: Specifying custom CXXFLAGS breaks Mesos build
 Key: MESOS-9666
 URL: https://issues.apache.org/jira/browse/MESOS-9666
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The environment variable CXXFLAGS (as well as CFLAGS and CPPFLAGS) is intended 
to give the user a way to add custom compiler flags to the build at both 
configure-time and build-time.

For example, a user wishing to use the address-sanitizer feature for a 
development build could run configure like
{noformat}
./configure CXXFLAGS="-fsanitize=address"
{noformat}
or a user wishing to investigate a particular framework binary might want to 
rebuild it with additional debug information:
{noformat}
make -C src/ dynamic-reservation-framework CXXFLAGS="-g3 -O0"
{noformat}

Therefore, providing custom CXXFLAGS should not break the build. However, we 
currently add some essential flags (like '-std=c++11') into CXXFLAGS, and a 
user specifying custom CXXFLAGS has to replicate all of these before they can 
provide their own.

Instead, we should try to restrict CXXFLAGS to some harmless default (e.g. '-g 
-O2') and move the essential flags into some other variable MESOS_CXXFLAGS that 
is always added to the Mesos build.
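A hypothetical Automake arrangement along these lines (names are illustrative;
`MESOS_CXXFLAGS` does not exist yet):

{noformat}
# Essential flags live in their own variable ...
MESOS_CXXFLAGS = -std=c++11 -Wall
AM_CXXFLAGS = $(MESOS_CXXFLAGS)

# ... so the user-facing CXXFLAGS stays free. Automake places CXXFLAGS
# after AM_CXXFLAGS on the compile line, so `make CXXFLAGS="-g3 -O0"`
# would no longer have to repeat -std=c++11.
{noformat}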



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6874) Agent silently ignores FS isolation when protobuf is malformed

2019-03-19 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796412#comment-16796412
 ] 

Benno Evers commented on MESOS-6874:


{noformat}
commit 93aca1eb0efcec941e19e976f683a35ecd9a840b
Author: Andrei Sekretenko 
Date:   Tue Mar 19 18:55:55 2019 +0100

Validated the match between Type and *Infos in the ContainerInfo.
[...]
{noformat}

> Agent silently ignores FS isolation when protobuf is malformed
> --
>
> Key: MESOS-6874
> URL: https://issues.apache.org/jira/browse/MESOS-6874
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.0
>Reporter: Michael Gummelt
>Assignee: Andrei Sekretenko
>Priority: Minor
>  Labels: foundations, newbie
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> cc [~vinodkone]
> I accidentally set my Mesos ContainerInfo to include a DockerInfo instead of 
> a MesosInfo:
> {code}
> executorInfoBuilder.setContainer(
>  Protos.ContainerInfo.newBuilder()
>  .setType(Protos.ContainerInfo.Type.MESOS)
>  .setDocker(Protos.ContainerInfo.DockerInfo.newBuilder()
>  
> .setImage(podSpec.getContainer().get().getImageName()))
> {code}
> I would have expected a validation error before or during containerization, 
> but instead, the agent silently decided to ignore filesystem isolation 
> altogether, and launch my executor on the host filesystem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9660) Documentation should mention constraints for `ACCEPT` calls

2019-03-18 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9660:
--

 Summary: Documentation should mention constraints for `ACCEPT` 
calls
 Key: MESOS-9660
 URL: https://issues.apache.org/jira/browse/MESOS-9660
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Our current documentation does not mention any constraints on the `ACCEPT` 
scheduler API call.

However, in addition to the trivial constraints (i.e. all operations must have 
valid resources and have required fields set), we also have a number of 
non-obvious constraints that should be documented.

One example is that all offer_ids in this call must belong to offers of the 
same agent.
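
A hedged illustration of that constraint (v1 scheduler API, C++; 
`offerFromAgent1` and `offerFromAgent2` are hypothetical offers received from 
two different agents):
{noformat}
Call call;
call.set_type(Call::ACCEPT);

Call::Accept* accept = call.mutable_accept();
accept->add_offer_ids()->CopyFrom(offerFromAgent1.id());

// Invalid: this offer belongs to a different agent, so the whole
// ACCEPT call will be rejected by the master.
accept->add_offer_ids()->CopyFrom(offerFromAgent2.id());
{noformat}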



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9657) Launching a command task twice can crash the agent

2019-03-15 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9657:
--

 Summary: Launching a command task twice can crash the agent
 Key: MESOS-9657
 URL: https://issues.apache.org/jira/browse/MESOS-9657
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


When launching a command task, we verify that the framework has no existing 
executor for that task:
{noformat}
  // We are dealing with command task; a new command executor will be
  // launched.
  CHECK(executor == nullptr);
{noformat}
and afterwards an executor is created with the same executor id as the task id:
{noformat}
  // (slave.cpp)
  // Either the master explicitly requests launching a new executor
  // or we are in the legacy case of launching one if there wasn't
  // one already. Either way, let's launch executor now.
  if (executor == nullptr) {
Try added = framework->addExecutor(executorInfo);
  [...]
{noformat}

This means that if we relaunch the task with the same task id before the 
executor is removed, it will crash the agent:
{noformat}
F0315 16:39:32.822818 38112 slave.cpp:2865] Check failed: executor == nullptr 
*** Check failure stack trace: ***
@ 0x7feb29a407af  google::LogMessage::Flush()
@ 0x7feb29a43c3f  google::LogMessageFatal::~LogMessageFatal()
@ 0x7feb28a5a886  mesos::internal::slave::Slave::__run()
@ 0x7feb28af4f0e  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEclEOS3_
@ 0x7feb2998a620  process::ProcessBase::consume()
@ 0x7feb29987675  process::ProcessManager::resume()
@ 0x7feb299a2d2b  
_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvE3$_8E6_M_runEv
@ 0x7feb2632f523  (unknown)
@ 0x7feb25e40594  start_thread
@ 0x7feb25b73e6f  __GI___clone
Aborted (core dumped)
{noformat}

Instead of crashing, the agent should just drop the task with an appropriate 
error in this case.
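
A minimal sketch of that behaviour (hypothetical, not an actual patch; `task` 
is the task being launched in `Slave::__run()`):
{noformat}
// Instead of CHECK(executor == nullptr):
if (executor != nullptr) {
  LOG(WARNING) << "Dropping task '" << task.task_id() << "'"
               << " because an executor with the same id already exists";

  // ...send a terminal status update (e.g. TASK_DROPPED) and return
  // instead of crashing the agent.
  return;
}
{noformat}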



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9656) Empty reservations fail with confusing error message

2019-03-15 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9656:
--

 Summary: Empty reservations fail with confusing error message
 Key: MESOS-9656
 URL: https://issues.apache.org/jira/browse/MESOS-9656
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


When attempting to apply a reserve operation containing empty resources, the 
operation fails during validation with the error message:
{noformat}
W0315 11:17:37.687129 25931 master.cpp:2292] Dropping UNRESERVE operation from 
framework e4cd5335-8af5-4db2-b6f8-07adbef1c6a3- (Operation Feedback 
Framework (C++)): Invalid resources: The resources have multiple resource 
providers: 
{noformat}

Instead, the error message should say that the reservation does not contain any 
resources.
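
A hedged sketch of the missing early check in the operation validation path 
(the function name and signature are assumptions):
{noformat}
Option<Error> validate(const Offer::Operation::Reserve& reserve)
{
  if (reserve.resources().empty()) {
    return Error("A reserve operation must contain at least one resource");
  }

  // ...existing checks, including the resource-provider consistency
  // check that currently produces the confusing message above.
  return None();
}
{noformat}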



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9652) URL handler lookup might miss root handlers

2019-03-14 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9652:
--

 Summary: URL handler lookup might miss root handlers
 Key: MESOS-9652
 URL: https://issues.apache.org/jira/browse/MESOS-9652
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


When looking up URL handlers, libprocess looks for the longest URL prefix that 
corresponds to an HTTP endpoint handler registered by the handling process.

For example, if a process set up the routes `/foo` and `/foo/bar`, an incoming 
HTTP request for `/foo/bar/baz` would be dispatched to the `/foo/bar` handler.

However, if a process registers the route `/`, the lookup will only succeed if 
the request is exactly for `/`, and a request for `/baz` will return a 404 Not 
Found response.

The root cause of this is the implementation of the handler lookup:
{noformat}
// ProcessBase::consume(HttpEvent&&)
name = strings::trim(name, strings::PREFIX, "/");
[...]
while (Path(name, '/').dirname() != name) {
  [...]
}
{noformat}

where `dirname()` returns "." when `name` does not contain any `/`.
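
A short illustration of the resulting behaviour, assuming a handler was 
registered for `/`:
{noformat}
// GET "/"    -> name == "" after trimming, the "/" handler matches.
// GET "/baz" -> name == "baz" after trimming; dirname() yields "."
//               (not "") for a name without any '/', so the lookup
//               loop never tries the root handler and the request
//               falls through to a 404 Not Found.
{noformat}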



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9650) Document the semantics of operation pipelining

2019-03-12 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9650:
--

 Summary: Document the semantics of operation pipelining
 Key: MESOS-9650
 URL: https://issues.apache.org/jira/browse/MESOS-9650
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


In our `Accept` protobuf, frameworks can specify multiple offer operations that 
are to be executed on the received offer:

https://github.com/apache/mesos/blob/40abcefab4f2887e61786365b46bc22155a2d1ff/include/mesos/scheduler/scheduler.proto#L317

However, the semantics of specifying multiple operations in this way are 
currently not documented anywhere I could find, except for a comment on that 
protobuf that the master will be "performing the specified operations in a 
sequential manner."

In particular, it is unclear what will happen if any operation in the sequence 
fails, at which point during one operation the master will move on to the next 
(e.g. if we have [RESERVE, LAUNCH, RESERVE], when exactly does the second 
reserve happen), and whether there are any restrictions on combining 
operations in this way.
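
For reference, such a call looks roughly like this (v1 scheduler API, C++; 
`reserve`, `launch` and `reserve2` are hypothetical pre-built 
`Offer::Operation` messages):
{noformat}
Call call;
call.set_type(Call::ACCEPT);

Call::Accept* accept = call.mutable_accept();
accept->add_offer_ids()->CopyFrom(offer.id());

// The master performs these "in a sequential manner"; the failure
// semantics between them are what this ticket asks to document.
accept->add_operations()->CopyFrom(reserve);
accept->add_operations()->CopyFrom(launch);
accept->add_operations()->CopyFrom(reserve2);
{noformat}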

While all of this can be figured out by reading the master source code, we 
should add some user-facing documentation about this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9645) Add a way to access a subset of metrics

2019-03-11 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9645:
--

 Summary: Add a way to access a subset of metrics
 Key: MESOS-9645
 URL: https://issues.apache.org/jira/browse/MESOS-9645
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Currently, the only way to access libprocess metrics is via the 
`metrics/snapshot` endpoint, which returns the current values of all installed 
metrics.

If the caller is only interested in a specific metric, or a subset of the 
metrics, this is wasteful in two ways: first, the process has to do extra work 
to collect all metrics, and second, the caller has to do extra work to filter 
out the unneeded ones.

Ideally, libprocess could use the request path to implement filtering, such 
that e.g. a request to
{noformat}
wget http://mesos.master:5050/metrics/allocator/mesos/
{noformat}
would return all metrics whose names begin with "allocator/mesos/", but I'm 
not sure that this is currently implementable.

Alternatively, a request parameter could be added to the same effect:
{noformat}
wget http://mesos.master:5050/metrics/snapshot?prefix=allocator/mesos/
{noformat}
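
A hedged sketch of how such a parameter could be handled inside the snapshot 
handler (the 'prefix' key and the surrounding handler shape are assumptions):
{noformat}
Option<std::string> prefix = request.url.query.get("prefix");

JSON::Object result;
foreachpair (const std::string& key, double value, snapshot) {
  if (prefix.isNone() || strings::startsWith(key, prefix.get())) {
    result.values[key] = value;
  }
}
{noformat}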



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9644) Marking an Agent as Gone Breaks Metrics Process in Unit Tests

2019-03-11 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9644:
--

 Summary: Marking an Agent as Gone Breaks Metrics Process in Unit 
Tests
 Key: MESOS-9644
 URL: https://issues.apache.org/jira/browse/MESOS-9644
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


When an agent is marked as gone, the master will tell that agent to shut down, 
which the agent attempts via
{noformat}
// slave.cpp:974
terminate(self());
{noformat}

However, terminating a process will only call `Slave::finalize()`, but *not* 
the destructor `Slave::~Slave()`.
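
The lifecycle mismatch, sketched in comments (based on libprocess semantics 
and the slave's `Metrics` struct):
{noformat}
// terminate(self()) causes Slave::finalize() to run and the process to
// stop, but ~Slave() never runs. Since ~Metrics(), which calls
// process::metrics::remove(...), only runs from ~Slave(), the slave's
// gauges stay registered in the global metrics process.
{noformat}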

In a standalone slave, this doesn't matter since terminating the slave process 
will cause the OS process to immediately exit as well.

However, in unit tests that is not the case, and since the slave was never 
properly destroyed, its metrics keys are still contained in the global metrics 
object. The pull gauges will then cause a deadlock the next time a metrics 
snapshot is requested, since their dispatches will be silently (for VLOG < 2) 
dropped:

{noformat}
I0311 11:08:53.329043 34499 authorization.cpp:135] Authorizing principal 'ANY' 
to GET the endpoint '/metrics/snapshot'
I0311 11:08:53.329067 34499 clock.cpp:435] Clock of 
local-authorizer(2)@66.70.182.167:35815 updated to 2019-03-11 
15:08:53.273557888+00:00
I0311 11:08:53.329121 34499 process.cpp:2880] Resuming 
local-authorizer(2)@66.70.182.167:35815 at 2019-03-11 15:08:53.273557888+00:00
I0311 11:08:53.329160 34496 process.cpp:2880] Resuming 
__auth_handlers__(2)@66.70.182.167:35815 at 2019-03-11 15:08:53.273557888+00:00
I0311 11:08:53.329260 34496 process.cpp:2880] Resuming 
metrics@66.70.182.167:35815 at 2019-03-11 15:08:53.273557888+00:00
I0311 11:08:53.353018 34486 process.cpp:2803] Dropping event for process 
slave(1)@66.70.182.167:35815
I0311 11:08:53.353040 34486 process.cpp:2803] Dropping event for process 
slave(1)@66.70.182.167:35815
I0311 11:08:53.353063 34486 process.cpp:2803] Dropping event for process 
slave(1)@66.70.182.167:35815
I0311 11:08:53.353080 34486 process.cpp:2803] Dropping event for process 
slave(1)@66.70.182.167:35815
I0311 11:08:53.353097 34486 process.cpp:2803] Dropping event for process 
slave(1)@66.70.182.167:35815
[...]
{noformat}

It's not immediately clear to me what the correct fix for this would be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8241) Add metrics for offer operation feedback

2019-03-04 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783566#comment-16783566
 ] 

Benno Evers commented on MESOS-8241:


I've opened a review for the scope that is outlined in the comment above at: 
https://reviews.apache.org/r/70116/

Some ideas I've had for further metrics that might become interesting:

Master-wide versions of the per-framework metrics we currently collect about 
operation types:
 - master/operations/create_disk/finished
 - master/operations/create_disk/dropped
 - [...]

A counter to see how many user-provided operations failed validation:
 - master/invalid_operations

A per-framework counter for the number of unacknowledged operations.

A counter for the total number of operation update retries.

> Add metrics for offer operation feedback
> 
>
> Key: MESOS-8241
> URL: https://issues.apache.org/jira/browse/MESOS-8241
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations, mesosphere, operation-feedback
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9611) Add `/machines` endpoint to show mapping between machines and agents

2019-02-26 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9611:
--

 Summary: Add `/machines` endpoint to show mapping between machines 
and agents
 Key: MESOS-9611
 URL: https://issues.apache.org/jira/browse/MESOS-9611
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


It is currently quite hard to get information about the machines known to the 
master. This can result in situations that are hard to debug for silly reasons, 
e.g. mistyping a machine id when posting a maintenance schedule.

It would be nice to have an endpoint that displays the current mapping between 
machine IDs and agents to the user. This could become a new endpoint like 
`/machines` or `/machine/info`, or be added as part of an existing one like 
`/maintenance/status`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9588) Add a way to view current offer filters

2019-02-20 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9588:
--

 Summary: Add a way to view current offer filters
 Key: MESOS-9588
 URL: https://issues.apache.org/jira/browse/MESOS-9588
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Looking at just Mesos, it's currently not possible to see which offer filters 
are active and for how long.

The closest one can get is to check whether a filter currently exists, either 
via the `metrics/snapshot` endpoint if per-framework metrics are enabled, or 
by scanning the master logs for this message
{noformat}
  VLOG(1) << "Filtered offer with " << resources
  << " on agent " << slaveId
  << " for role " << role
  << " of framework " << frameworkId;
{noformat}

However, that does not tell the user how long the filter has been there, which 
resources it contains, or how long it will stay.

Maybe MESOS-8621 would be a viable way to surface this information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9585) Agent host IP can differ between subpages in the WebUI

2019-02-19 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9585:
--

 Summary: Agent host IP can differ between subpages in the WebUI
 Key: MESOS-9585
 URL: https://issues.apache.org/jira/browse/MESOS-9585
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers
 Attachments: mesos_agent_ip.webm

Apparently, the WebUI obtains the agent host IP from different sources on the 
"Agents" tab and on the information page for an individual agent.

For example, in the attached video the host IP of the given agent is shown 
once as 172.31.3.68 (the correct IP) and once as 172.31.10.48.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9584) Inactive frameworks show incorrect "Registered Time" in Web UI

2019-02-19 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9584:
--

 Summary: Inactive frameworks show incorrect "Registered Time" in 
Web UI
 Key: MESOS-9584
 URL: https://issues.apache.org/jira/browse/MESOS-9584
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers
 Attachments: image-2019-02-19-16-48-04-927.png

Currently, inactive frameworks have their "Registered Time" shown as 
`1970-01-01` in the WebUI (see attached screenshot):

 !image-2019-02-19-16-48-04-927.png! 

Instead, this should probably be displayed as "-" to indicate that this field 
does not have a useful value for these frameworks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9490) Support accepting gzipped responses in libprocess

2019-02-18 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771167#comment-16771167
 ] 

Benno Evers commented on MESOS-9490:


[~bmahler], it seems I remembered wrong; after re-running the test above, it's 
actually not a CHECK failure but just a normal error:

{noformat}
[ RUN  ] MasterLoadTest.AcceptEncoding
I0218 10:45:26.316328 70511 cluster.cpp:174] Creating default 'local' authorizer
I0218 10:45:26.318068 70572 master.cpp:414] Master 
67635eb2-df26-4db8-a5e4-a5f3aa9f3ebc (core1.hw.ca1.mesosphere.com) started on 
66.70.182.167:46839
I0218 10:45:26.318110 70572 master.cpp:417] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/qKeUnl/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/qKeUnl/master" --zk_session_timeout="10secs"
I0218 10:45:26.319782 70572 master.cpp:466] Master only allowing authenticated 
frameworks to register
I0218 10:45:26.319829 70572 master.cpp:472] Master only allowing authenticated 
agents to register
I0218 10:45:26.319839 70572 master.cpp:478] Master only allowing authenticated 
HTTP frameworks to register
I0218 10:45:26.319851 70572 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/qKeUnl/credentials'
I0218 10:45:26.320096 70572 master.cpp:522] Using default 'crammd5' 
authenticator
I0218 10:45:26.320171 70572 authenticator.cpp:520] Initializing server SASL
I0218 10:45:26.325443 70572 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0218 10:45:26.325582 70572 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0218 10:45:26.325675 70572 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0218 10:45:26.325704 70572 master.cpp:603] Authorization enabled
I0218 10:45:26.329525 70572 master.cpp:2103] Elected as the leading master!
I0218 10:45:26.329560 70572 master.cpp:1638] Recovering from registrar
I0218 10:45:26.331326 70526 registrar.cpp:383] Successfully fetched the 
registry (0B) in 1.668864ms
I0218 10:45:26.331449 70526 registrar.cpp:487] Applied 1 operations in 38387ns; 
attempting to update the registry
I0218 10:45:26.331748 70530 registrar.cpp:544] Successfully updated the 
registry in 259072ns
I0218 10:45:26.331821 70530 registrar.cpp:416] Successfully recovered registrar
I0218 10:45:26.331980 70530 master.cpp:1752] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to reregister
I0218 10:45:26.334493 70554 http.cpp:1105] HTTP GET for /master//state from 
66.70.182.167:59082
I0218 10:45:26.335484 70552 http.cpp:1122] HTTP GET for /master//state from 
66.70.182.167:59082: '200 OK' after 2.06899ms
../../src/tests/master_load_tests.cpp:570: Failure
(response).failure(): Failed to decode response
I0218 10:45:26.336654 70511 master.cpp:1109] Master terminating
[  FAILED  ] MasterLoadTest.AcceptEncoding (22 ms)
{noformat}

> Support accepting gzipped responses in libprocess
> -
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently all libprocess endpoints support the serving of gzipped responses 
> when the client is requesting this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses, failing 
> with a decode error in this case.

[jira] [Comment Edited] (MESOS-9490) Support accepting gzipped responses in libprocess

2019-02-15 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769234#comment-16769234
 ] 

Benno Evers edited comment on MESOS-9490 at 2/15/19 11:57 AM:
--

[~bmahler], the full code which originally hit this issue is pasted in the 
linked issue, a more minimal version looks like this:
{noformat}
TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) {
 Try<process::Owned<cluster::Master>> master = StartMaster();

 Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
 Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};

 auto response = process::http::get(
 master.get()->pid,
 "/state",
 None(),
 authHeaders + acceptGzipHeaders);

 AWAIT_READY(response);
}
{noformat}

If I remember correctly, running this test leads to a segfault due to some 
internal CHECK failure.


was (Author: bennoe):
[~bmahler], the full code which originally hit this issue is pasted in the 
linked issue, a more minimal version looks like this:
{noformat}
TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) {
 Try<process::Owned<cluster::Master>> master = StartMaster();

 Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
 Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};

 auto response = process::http::get(
 master.get()->pid,
 "/state",
 None(),
 authHeaders + acceptGzipHeaders);

 AWAIT_READY(response);
}
{noformat}

> Support accepting gzipped responses in libprocess
> -
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently all libprocess endpoints support the serving of gzipped responses 
> when the client is requesting this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses, failing 
> with a decode error in this case.
> For symmetry, we should try to support compression in this case as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9490) Support accepting gzipped responses in libprocess

2019-02-15 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769234#comment-16769234
 ] 

Benno Evers commented on MESOS-9490:


[~bmahler], the full code which originally hit this issue is pasted in the 
linked issue, a more minimal version looks like this:
{noformat}
TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) {
 Try<process::Owned<cluster::Master>> master = StartMaster();

 Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
 Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};

 auto response = process::http::get(
 master.get()->pid,
 "/state",
 None(),
 authHeaders + acceptGzipHeaders);

 AWAIT_READY(response);
}
{noformat}

> Support accepting gzipped responses in libprocess
> -
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently all libprocess endpoints support the serving of gzipped responses 
> when the client is requesting this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses, failing 
> with a decode error in this case.
> For symmetry, we should try to support compression in this case as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9575) Mesos Web UI can't display relative timestamps in the future

2019-02-14 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9575:
--

 Summary: Mesos Web UI can't display relative timestamps in the 
future
 Key: MESOS-9575
 URL: https://issues.apache.org/jira/browse/MESOS-9575
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The `relativeDate()` function used by the Mesos WebUI 
(https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=src/webui/assets/libs/relative-date.js;hb=HEAD)
 is only able to handle dates in the past. All dates in the future are rendered 
as "just now".

This can be especially confusing when posting maintenance windows, where 
usually both dates are in the future.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9569) Missing master-side validation of UpdateOperationStatusMessage

2019-02-13 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9569:
--

 Summary: Missing master-side validation of 
UpdateOperationStatusMessage
 Key: MESOS-9569
 URL: https://issues.apache.org/jira/browse/MESOS-9569
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The master is currently not validating incoming 
`UpdateOperationStatusMessage`s, and is performing `CHECK()`s on the values of 
certain protobuf fields of the message.

This means a malformed HTTP request can trigger a master crash. This can be 
reproduced e.g. by executing code like this on a master host:
{noformat}
import urllib.request
rq = 
urllib.request.Request("http://localhost:5050/master/mesos.internal.UpdateOperationStatusMessage";,
 headers={"Libprocess-From": "foo@127.0.1.1:5052"}, method="POST", 
data=b'\x1a\x02\x10\x01*\x05\n\x03xxx')
rsp = urllib.request.urlopen(rq).read()
{noformat}

(where the posted data is just an UpdateOperationStatusMessage protobuf 
without a slave_id, serialized as a string)

{noformat}
F0213 13:14:22.507489 16492 master.cpp:8413] Check failed: 
update.has_slave_id() External resource provider is not supported yet
{noformat}

Looking at other internal messages, some of them already have a validation 
step implemented (e.g. RegisterSlaveMessage), so we should probably add 
something similar for this case.
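
A hedged sketch of what such a validation step could look like (the placement 
and exact wording are assumptions):
{noformat}
Option<Error> validate(const UpdateOperationStatusMessage& message)
{
  if (!message.has_slave_id()) {
    return Error("Expected the 'slave_id' field to be set");
  }

  return None();
}
{noformat}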



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9521) MasterAPITest.OperationUpdatesUponAgentGone is flaky

2019-01-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742276#comment-16742276
 ] 

Benno Evers commented on MESOS-9521:


Review: https://reviews.apache.org/r/69726/

The warning is known, but note the caveat that is printed right below it:
{noformat}
NOTE: You can safely ignore the above warning unless this call should not 
happen.  Do not suppress it by blindly adding an EXPECT_CALL() if you don't 
mean to enforce the call.  See 
https://github.com/google/googletest/blob/master/googlemock/docs/CookBook.md#knowing-when-to-expect
 for details.
{noformat}

I left it as-is, because the test does not really care about whether 
`disconnected()` is called or not.

> MasterAPITest.OperationUpdatesUponAgentGone is flaky
> 
>
> Key: MESOS-9521
> URL: https://issues.apache.org/jira/browse/MESOS-9521
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.8.0
> Environment: Fedora28, cmake w/ SSL
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: flaky, flaky-test
>
> The recently added test {{MasterAPITest.OperationUpdatesUponAgentGone}} is 
> flaky, e.g.,
> {noformat}../src/tests/api_tests.cpp:5051: Failure
> Value of: resources.empty()
>   Actual: true
> Expected: false
> ../3rdparty/libprocess/src/../include/process/gmock.hpp:504: Failure
> Actual function call count doesn't match EXPECT_CALL(filter->mock, filter(to, 
> testing::A()))...
> Expected args: message matcher (32-byte object  24-00 00-00 00-00 00-00 24-00 00-00 00-00 00-00 41-63 74-75 61-6C 20-66>, 
> 1-byte object )
>  Expected: to be called once
>Actual: never called - unsatisfied and active
> {noformat}
> I am able to reproduce this reliable in less than 10 iterations when running 
> the test in repetition under additional system stress.
> Even if the test does not fail it produces the following gmock warning,
> {noformat}
> GMOCK WARNING:
> Uninteresting mock function call - returning directly.
> Function call: disconnected()
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9394) Maintenance of machine A causes "Removing offers" for machine B.

2019-01-11 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740647#comment-16740647
 ] 

Benno Evers commented on MESOS-9394:


Both the analysis and the proposed change look correct to me - the current 
behaviour certainly does not match what the documentation at 
http://mesos.apache.org/documentation/latest/maintenance/#scheduling-maintenance
 suggests.

[~carlone], if you want to keep credit for the fix, I'd suggest going ahead 
and posting a patch to ReviewBoard; otherwise, if you prefer, I can do that 
for you.

> Maintenance of machine A causes "Removing offers" for machine B.
> 
>
> Key: MESOS-9394
> URL: https://issues.apache.org/jira/browse/MESOS-9394
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Assignee: longfei
>Priority: Major
>  Labels: maintenance
>
> If I schedule machine A in a maintenance call, the logic in 
> "___updateMaintenanceSchedule" will check all the master's machines. 
> Another machine (say machine B) not in the maintenance schedule will be set 
> to UP mode and "updateUnavailability" will be called. This results in 
> removing all offers of slaves on machine B.
> If I am using these offers to run some tasks, these tasks would be lost 
> with REASON_INVALID_OFFERS.
> I think a maintenance schedule should not affect machines not in it. Is that 
> right?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9472) Unblock operation feedback on agent default resources.

2019-01-09 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9472:
--

Assignee: Benno Evers

> Unblock operation feedback on agent default resources.
> --
>
> Key: MESOS-9472
> URL: https://issues.apache.org/jira/browse/MESOS-9472
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Gastón Kleiman
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations, mesosphere, operation-feedback
>
> # Remove {{CHECK}} marked with a TODO in {{Master::updateOperationStatus()}}.
> # Update {{Master::acknowledgeOperationStatus()}}, remove the CHECK requiring 
> a resource provider ID.
> # Remove validation in {{Option<Error> validate(const 
> mesos::scheduler::Call& call, const Option<Principal>& principal)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8783) Transition pending operations to OPERATION_UNREACHABLE when an agent is removed.

2019-01-04 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734500#comment-16734500
 ] 

Benno Evers commented on MESOS-8783:


Opened a review for the first paragraph here: 
https://reviews.apache.org/r/69669/

The second part needs a bit more consideration, and should probably be done in 
a separate ticket. It might not be necessary to send updates from the master 
when the agent reconnects, since at that point the agent can send the updated 
operation statuses itself.

> Transition pending operations to OPERATION_UNREACHABLE when an agent is 
> removed.
> 
>
> Key: MESOS-8783
> URL: https://issues.apache.org/jira/browse/MESOS-8783
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Gastón Kleiman
>Assignee: Benno Evers
>Priority: Critical
>  Labels: foundations, mesosphere
> Fix For: 1.8.0
>
>
> Pending operations on an agent should be transitioned to 
> `OPERATION_UNREACHABLE` when an agent is marked unreachable. We should also 
> make sure that we pro-actively send operation status updates for these 
> operations when the agent becomes unreachable.
> We should also make sure that we send new operation updates if/when the agent 
> reconnects - perhaps this is already accomplished with the existing operation 
> update logic in the agent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9506) Master will leak operations when agents are removed

2019-01-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9506:
--

 Summary: Master will leak operations when agents are removed
 Key: MESOS-9506
 URL: https://issues.apache.org/jira/browse/MESOS-9506
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Usually, offer operations are removed when the framework acknowledges
a terminal operation status update.

However, currently only operations on registered agents can be
acknowledged, so operations on agents which don't come back will be permanently 
leaked.
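
A comment-level sketch of the leak path (a hedged reading of the master code; 
field names are assumptions):
{noformat}
// An operation stays tracked until the framework's
// ACKNOWLEDGE_OPERATION_STATUS call removes it. Since such an
// acknowledgement is only accepted for operations on currently
// registered agents, an agent that never comes back leaves its
// operations tracked in the master forever.
{noformat}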



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9506) Master will leak operations when agents are removed

2019-01-02 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732203#comment-16732203
 ] 

Benno Evers commented on MESOS-9506:


https://reviews.apache.org/r/69597

> Master will leak operations when agents are removed
> 
>
> Key: MESOS-9506
> URL: https://issues.apache.org/jira/browse/MESOS-9506
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> Usually, offer operations are removed when the framework acknowledges
> a terminal operation status update.
> However, currently only operations on registered agents can be
> acknowledged, so operations on agents which don't come back will be 
> permanently leaked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9494) Add a unit test for the interaction between request batching and response compression

2018-12-19 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9494:
--

 Summary: Add a unit test for the interaction between request 
batching and response compression
 Key: MESOS-9494
 URL: https://issues.apache.org/jira/browse/MESOS-9494
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


As discussed in https://reviews.apache.org/r/69064/ , we should try to add a 
unit test that verifies that simultaneous requests with different 
`Accept-Encoding` headers produce different responses.

It could look like this:
{noformat}
TEST_F(MasterLoadTest, AcceptEncoding)
{
  MockAuthorizer authorizer;
  prepareCluster(&authorizer);

  Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
  Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};
  Headers acceptRawHeaders  = {{"Accept-Encoding", "raw"}};

  RequestDescriptor descriptor1;
  descriptor1.endpoint = "/state";
  descriptor1.headers = authHeaders + acceptGzipHeaders;

  RequestDescriptor descriptor2 = descriptor1;
  descriptor2.headers = authHeaders + acceptRawHeaders;

  auto responses = launchSimultaneousRequests({descriptor1, descriptor2});

  foreachpair (
  const RequestDescriptor& request,
  Future<http::Response>& response,
  responses)
  {
AWAIT_READY(response);

ASSERT_SOME(request.headers.get("Accept-Encoding"));
if (request.headers.get("Accept-Encoding").get() == "gzip") {
  ASSERT_SOME(response->headers.get("Content-Encoding"));
  EXPECT_EQ(response->headers.get("Content-Encoding").get(), "gzip");
} else {
  EXPECT_NONE(response->headers.get("Content-Encoding"));
}
  }

  // Ensure that we actually hit the metrics code path while executing
  // the test.
  JSON::Object metrics = Metrics();
  ASSERT_TRUE(metrics.values["master/http_cache_hits"].is<JSON::Number>());
  ASSERT_GT(
      metrics.values["master/http_cache_hits"]
        .as<JSON::Number>().as<uint64_t>(),
      0u);
}
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8782) Transition operations to OPERATION_GONE_BY_OPERATOR when marking an agent gone.

2018-12-19 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725230#comment-16725230
 ] 

Benno Evers commented on MESOS-8782:


Review: https://reviews.apache.org/r/69575/

> Transition operations to OPERATION_GONE_BY_OPERATOR when marking an agent 
> gone.
> ---
>
> Key: MESOS-8782
> URL: https://issues.apache.org/jira/browse/MESOS-8782
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Gastón Kleiman
>Assignee: Benno Evers
>Priority: Critical
>  Labels: foundations
> Fix For: 1.8.0
>
>
> The master should transition operations to the state 
> {{OPERATION_GONE_BY_OPERATOR}} when an agent is marked gone, sending an 
> operation status update to the frameworks that created them.
> We should also remove them from {{Master::frameworks}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9490) Support accepting gzipped responses in libprocess

2018-12-18 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9490:
--

 Summary: Support accepting gzipped responses in libprocess
 Key: MESOS-9490
 URL: https://issues.apache.org/jira/browse/MESOS-9490
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Currently all libprocess endpoints support serving gzipped responses when the 
client requests this with an `Accept-Encoding: gzip` header.

However, libprocess does not support receiving gzipped responses, failing with 
a decode error in this case.

For symmetry, we should try to support compression in this case as well.
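
A minimal sketch of the receiving side, assuming the response decoder can 
reuse stout's gzip support (the exact integration point is an assumption):
{noformat}
#include <stout/gzip.hpp>

if (response.headers.get("Content-Encoding") == Some(std::string("gzip"))) {
  Try<std::string> body = gzip::decompress(response.body);
  if (body.isError()) {
    return Error("Failed to decompress response body: " + body.error());
  }
  response.body = body.get();
}
{noformat}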



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9484) GroupTest.GroupDataWithDisconnect is flaky

2018-12-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9484:
--

 Summary: GroupTest.GroupDataWithDisconnect is flaky
 Key: MESOS-9484
 URL: https://issues.apache.org/jira/browse/MESOS-9484
 Project: Mesos
  Issue Type: Bug
 Environment: Mac OSX w/ libevent
Reporter: Benno Evers


Observed the following error in our CI:
{noformat}
../../src/tests/group_tests.cpp:129: Failure
data.get() is NONE
{noformat}

Full log:
{noformat}
[ RUN  ] GroupTest.GroupDataWithDisconnect
I1214 15:06:53.386937 398710208 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 51193
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:51193 sessionTimeout=1 
watcher=0x11a65f9a0 sessionId=0 sessionPasswd= context=0x7fcd06163550 
flags=0
2018-12-14 15:06:53,387:69505(0x74415000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:51193]
2018-12-14 15:06:53,389:69505(0x74415000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:51193], 
sessionId=0x167aef9004a, negotiated timeout=1
I1214 15:06:53.389168 60743680 group.cpp:341] Group process 
(zookeeper-group(40)@10.0.49.4:49309) connected to ZooKeeper
I1214 15:06:53.389210 60743680 group.cpp:831] Syncing group operations: queue 
size (joins, cancels, datas) = (1, 0, 0)
I1214 15:06:53.389227 60743680 group.cpp:419] Trying to create path '/test' in 
ZooKeeper
I1214 15:06:53.392253 398710208 zookeeper_test_server.cpp:116] Shutting down 
ZooKeeperTestServer on port 51193
2018-12-14 
15:06:53,393:69505(0x74415000):ZOO_ERROR@handle_socket_error_msg@1782: 
Socket [127.0.0.1:51193] zk retcode=-4, errno=64(Host is down): failed while 
receiving a server response
I1214 15:06:53.393187 59133952 group.cpp:452] Lost connection to ZooKeeper, 
attempting to reconnect ...
I1214 15:06:53.393661 59670528 group.cpp:700] Trying to get '/test/00' 
in ZooKeeper
2018-12-14 
15:06:53,393:69505(0x74415000):ZOO_ERROR@handle_socket_error_msg@1758: 
Socket [127.0.0.1:51193] zk retcode=-4, errno=61(Connection refused): server 
refused to accept the client
I1214 15:06:53.395321 398710208 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 51193
W1214 15:07:04.003191 59670528 group.cpp:495] Timed out waiting to connect to 
ZooKeeper. Forcing ZooKeeper session (sessionId=167aef9004a) expiration
I1214 15:07:04.003652 59670528 group.cpp:511] ZooKeeper session expired
2018-12-14 15:07:04,004:69505(0x738e8000):ZOO_INFO@zookeeper_close@2579: 
Freeing zookeeper resources for sessionId=0x167aef9004a

2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:51193 sessionTimeout=1 
watcher=0x11a65f9

[jira] [Created] (MESOS-9483) ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors is flaky

2018-12-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9483:
--

 Summary: ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors 
is flaky
 Key: MESOS-9483
 URL: https://issues.apache.org/jira/browse/MESOS-9483
 Project: Mesos
  Issue Type: Bug
 Environment: Mac OSX w/ libevent
Reporter: Benno Evers


Observed a failure with the following error:
{noformat}
../../src/tests/master_contender_detector_tests.cpp:409: Failure
Failed to wait 15secs for group1.join("data")
{noformat}

Full log:
{noformat}
[ RUN  ] ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors
I1214 15:03:56.036525 398710208 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 50199
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:50199 sessionTimeout=1 
watcher=0x11a65f9a0 sessionId=0 sessionPasswd= context=0x7fcd061125a0 
flags=0
2018-12-14 15:03:56,037:69505(0x74415000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:50199]
2018-12-14 15:03:56,039:69505(0x74415000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:50199], 
sessionId=0x167aef64b83, negotiated timeout=1
I1214 15:03:56.039242 60207104 group.cpp:341] Group process 
(zookeeper-group(14)@10.0.49.4:49309) connected to ZooKeeper
I1214 15:03:56.039286 60207104 group.cpp:831] Syncing group operations: queue 
size (joins, cancels, datas) = (1, 0, 0)
I1214 15:03:56.039309 60207104 group.cpp:395] Authenticating with ZooKeeper 
using digest
2018-12-14 15:04:05,989:69505(0x74415000):ZOO_WARN@zookeeper_interest@1597: 
Exceeded deadline by 6619ms
2018-12-14 
15:04:05,989:69505(0x74415000):ZOO_ERROR@handle_socket_error_msg@1702: 
Socket [127.0.0.1:50199] zk retcode=-7, errno=60(Operation timed out): 
connection to 127.0.0.1:50199 timed out (exceeded timeout by 3284ms)
2018-12-14 15:04:05,989:69505(0x74415000):ZOO_WARN@zookeeper_interest@1597: 
Exceeded deadline by 6619ms
I1214 15:04:05.990031 60207104 group.cpp:452] Lost connection to ZooKeeper, 
attempting to reconnect ...
2018-12-14 15:04:09,332:69505(0x74415000):ZOO_WARN@zookeeper_interest@1597: 
Exceeded deadline by 9963ms
2018-12-14 15:04:09,332:69505(0x74415000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:50199]
2018-12-14 
15:04:09,333:69505(0x74415000):ZOO_ERROR@handle_socket_error_msg@1800: 
Socket [127.0.0.1:50199] zk retcode=-112, errno=70(Stale NFS file handle): 
sessionId=0x167aef64b83 has expired.
I1214 15:04:09.333552 59670528 group.cpp:511] ZooKeeper session expired
2018-12-14 15:04:09,333:69505(0x738e8000):ZOO_INFO@zookeeper_close@2579: 
Freeing zookeeper resources for sessionId=0x167aef64b83

2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-12-14 15:04:09,333:69505(0x737

[jira] [Created] (MESOS-9478) ZooKeeperTest.Create is flaky

2018-12-14 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9478:
--

 Summary: ZooKeeperTest.Create is flaky
 Key: MESOS-9478
 URL: https://issues.apache.org/jira/browse/MESOS-9478
 Project: Mesos
  Issue Type: Bug
 Environment: Mac OSX w/ libeven
Reporter: Benno Evers


Observed the following test failure:
{noformat}
../../src/tests/zookeeper_tests.cpp:124
  Expected: ZNODEEXISTS
  Which is: -110
To be equal to: nonOwnerZk.create("/foo/bar/baz", "", 
zookeeper::EVERYONE_READ_CREATOR_ALL, 0, nullptr, true)
  Which is: -9
{noformat}

Full log:
{noformat}
[ RUN  ] ZooKeeperTest.Create
I1213 18:43:49.478912 222864832 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 57250
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:57250 sessionTimeout=1 
watcher=0x10fea4f00 sessionId=0 sessionPasswd= context=0x7fe4d5e7c680 
flags=0
2018-12-13 18:43:49,479:66260(0x7659e000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:57250]
2018-12-13 18:43:49,480:66260(0x7659e000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:57250], 
sessionId=0x167aa994066, negotiated timeout=1
2018-12-13 
18:43:52,819:66260(0x7659e000):ZOO_INFO@auth_completion_func@1327: 
Authentication scheme digest succeeded
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:57250 sessionTimeout=1 
watcher=0x10fea4f00 sessionId=0 sessionPasswd= context=0x7fe4d5cf7a20 
flags=0
2018-12-13 18:43:52,823:66260(0x76d36000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:57250]
2018-12-13 18:43:52,824:66260(0x76d36000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:57250], 
sessionId=0x167aa9940660001, negotiated timeout=1
2018-12-13 18:44:05,891:66260(0x7659e000):ZOO_WARN@zookeeper_interest@1597: 
Exceeded deadline by 9735ms
2018-12-13 
18:44:05,891:66260(0x7659e000):ZOO_ERROR@handle_socket_error_msg@1702: 
Socket [127.0.0.1:57250] zk retcode=-7, errno=60(Operation timed out): 
connection to 127.0.0.1:57250 timed out (exceeded timeout by 6402ms)
2018-12-13 18:44:05,891:66260(0x7659e000):ZOO_WARN@zookeeper_interest@1597: 
Exceeded deadline by 9735ms
2018-12-13 18:44:05,892:66260(0x76d36000):ZOO_WARN@zookeeper_interest@1597: 
Exceeded deadline by 9736ms
2018-12-13 
18:44:05,892:66260(0x76d36000):ZOO_ERROR@handle_socket_error_msg@1702: 
Socket [127.0.0.1:57250] zk retcode=-7, errno=60(Operation timed out): 
connection to 127.0.0.1:57250 timed out (exceeded timeout by 6402ms)
2018-12-13 18:44:05,892:66260(0x76d36000):ZOO_WARN@zookeeper_interest@1597: 
Exceeded deadline by 9736m

[jira] [Commented] (MESOS-9247) MasterAPITest.EventAuthorizationFiltering is flaky

2018-12-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721328#comment-16721328
 ] 

Benno Evers commented on MESOS-9247:


Observed the same failure today on a CentOS 7 build.

> MasterAPITest.EventAuthorizationFiltering is flaky
> --
>
> Key: MESOS-9247
> URL: https://issues.apache.org/jira/browse/MESOS-9247
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.0
>Reporter: Greg Mann
>Assignee: Till Toenshoff
>Priority: Minor
>  Labels: flaky, flaky-test, integration, mesosphere
> Attachments: MasterAPITest.EventAuthorizationFiltering.txt
>
>
> Saw this failure on a CentOS 6 SSL build in our internal CI. Build log 
> attached. For some reason, it seems that the initial {{TASK_ADDED}} event is 
> missed:
> {code}
> ../../src/tests/api_tests.cpp:2922
>   Expected: v1::master::Event::TASK_ADDED
>   Which is: TASK_ADDED
> To be equal to: event->get().type()
>   Which is: TASK_UPDATED
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9468) SlaveTest.AgentFailoverTerminatesHTTPExecutorWithNoTask is flaky

2018-12-11 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9468:
--

 Summary: SlaveTest.AgentFailoverTerminatesHTTPExecutorWithNoTask 
is flaky
 Key: MESOS-9468
 URL: https://issues.apache.org/jira/browse/MESOS-9468
 Project: Mesos
  Issue Type: Bug
 Environment: Mac OSX with ssl enabled
Reporter: Benno Evers


The following test failure was observed in an internal CI run:
{noformat}
../../src/tests/slave_tests.cpp:6341: Failure
Actual function call count doesn't match EXPECT_CALL(*slave.get()->mock(), 
_shutdownExecutor(_, _))...
 Expected: to be called once
   Actual: never called - unsatisfied and active
{noformat}

Full log:
{noformat}
[ RUN  ] SlaveTest.AgentFailoverTerminatesHTTPExecutorWithNoTask
I1210 16:20:13.298667 338650560 cluster.cpp:173] Creating default 'local' 
authorizer
I1210 16:20:13.36 238522368 master.cpp:414] Master 
4c470ddd-dc29-4d9c-9b46-8e7a8b6c7801 (Jenkinss-Mac-mini.local) started on 
10.0.49.4:54069
I1210 16:20:13.300034 238522368 master.cpp:417] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ntg04w/credentials"
 --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ntg04w/master"
 --zk_session_timeout="10secs"
I1210 16:20:13.300215 238522368 master.cpp:466] Master only allowing 
authenticated frameworks to register
I1210 16:20:13.300227 238522368 master.cpp:472] Master only allowing 
authenticated agents to register
I1210 16:20:13.300237 238522368 master.cpp:478] Master only allowing 
authenticated HTTP frameworks to register
I1210 16:20:13.300246 238522368 credentials.hpp:37] Loading credentials for 
authentication from 
'/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ntg04w/credentials'
I1210 16:20:13.300427 238522368 master.cpp:522] Using default 'crammd5' 
authenticator
I1210 16:20:13.300489 238522368 http.cpp:1017] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1210 16:20:13.300559 238522368 http.cpp:1017] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1210 16:20:13.300607 238522368 http.cpp:1017] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1210 16:20:13.300657 238522368 master.cpp:603] Authorization enabled
I1210 16:20:13.300863 237985792 whitelist_watcher.cpp:77] No whitelist given
I1210 16:20:13.300884 239058944 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1210 16:20:13.302809 235302912 master.cpp:2089] Elected as the leading master!
I1210 16:20:13.302834 235302912 master.cpp:1644] Recovering from registrar
I1210 16:20:13.302875 237985792 registrar.cpp:339] Recovering registrar
I1210 16:20:13.303133 237985792 registrar.cpp:383] Successfully fetched the 
registry (0B) in 08ns
I1210 16:20:13.303207 237985792 registrar.cpp:487] Applied 1 operations in 
24653ns; attempting to update the registry
I1210 16:20:13.303490 237985792 registrar.cpp:544] Successfully updated the 
registry in 258048ns
I1210 16:20:13.303539 237985792 registrar.cpp:416] Successfully recovered 
registrar
I1210 16:20:13.303692 236376064 master.cpp:1758] Recovered 0 agents from the 
registry (155B); allowing 10mins for agents to reregister
I1210 16:20:13.303723 235839488 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
W1210 16:20:13.306483 338650560 process.cpp:2829] Attempted to spawn already 
running process files@10.0.49.4:54069
I1210 16:20:13.307142 338650560 containerizer.cpp:305] Using isolation { 
environment_secre

[jira] [Created] (MESOS-9467) ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster is flaky

2018-12-11 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9467:
--

 Summary: 
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
 is flaky
 Key: MESOS-9467
 URL: https://issues.apache.org/jira/browse/MESOS-9467
 Project: Mesos
  Issue Type: Bug
 Environment: Mac OSX with ssl enabled
Reporter: Benno Evers


The following error was observed in an internal CI run:
{noformat}
../../src/tests/master_contender_detector_tests.cpp:872: Failure
Failed to wait 15secs for detected
{noformat}

Full log:
{noformat}
[ RUN  ] 
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
I1210 16:18:13.068011 338650560 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 54990
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:54990 sessionTimeout=1 
watcher=0x116d03e00 sessionId=0 sessionPasswd= context=0x7fd3883958d0 
flags=0
2018-12-10 16:18:13,068:28813(0x7ed1d000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:54990]
I1210 16:18:13.069262 236376064 contender.cpp:152] Joining the ZK group
2018-12-10 16:18:13,070:28813(0x7ed1d000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:54990], 
sessionId=0x1679aa0ddc9, negotiated timeout=1
I1210 16:18:13.070789 239058944 group.cpp:341] Group process 
(zookeeper-group(28)@10.0.49.4:54069) connected to ZooKeeper
I1210 16:18:13.070853 239058944 group.cpp:831] Syncing group operations: queue 
size (joins, cancels, datas) = (1, 0, 0)
I1210 16:18:13.070868 239058944 group.cpp:419] Trying to create path '/mesos' 
in ZooKeeper
I1210 16:18:13.073835 235839488 contender.cpp:268] New candidate (id='0') has 
entered the contest for leadership
I1210 16:18:13.074319 237985792 detector.cpp:152] Detected a new leader: 
(id='0')
I1210 16:18:13.074406 237449216 group.cpp:700] Trying to get 
'/mesos/json.info_00' in ZooKeeper
I1210 16:18:13.075139 239058944 zookeeper.cpp:262] A new leading master 
(UPID=@0.152.150.128:1) is detected
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:54990 sessionTimeout=1 
watcher=0x116d03e00 sessionId=0 sessionPasswd= context=0x7fd3886b40e0 
flags=0
2018-12-10 16:18:13,075:28813(0x7f944000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:54990]
I1210 16:18:13.076236 238522368 contender.cpp:152] Joining the ZK group
2018-12-10 16:18:13,077:28813(0x7f944000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:54990], 
sessionId=0x1679aa0ddc90001, negotiated timeout=1
I1210 16:18:13.077278 239058944 group.cpp:341] Group process 
(zookeeper-group

[jira] [Created] (MESOS-9466) FetcherCacheTest.LocalCachedMissing is flaky

2018-12-11 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9466:
--

 Summary: FetcherCacheTest.LocalCachedMissing is flaky
 Key: MESOS-9466
 URL: https://issues.apache.org/jira/browse/MESOS-9466
 Project: Mesos
  Issue Type: Bug
 Environment: Mac OSX with ssl enabled
Reporter: Benno Evers


Observed the following failure in an internal CI run:
{noformat}
../../src/tests/fetcher_cache_tests.cpp:722: Failure
Failed to wait 15secs for awaitFinished(task.get())
{noformat}


Full log:
{noformat}
[ RUN  ] FetcherCacheTest.LocalCachedMissing
I1210 16:16:09.364095 338650560 cluster.cpp:173] Creating default 'local' 
authorizer
I1210 16:16:09.365344 237985792 master.cpp:414] Master 
57f28035-e5fa-4e2a-8b8c-1caf1f9c85ca (Jenkinss-Mac-mini.local) started on 
10.0.49.4:54069
I1210 16:16:09.365368 237985792 master.cpp:417] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/OBl7Zi/credentials"
 --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/OBl7Zi/master"
 --zk_session_timeout="10secs"
I1210 16:16:09.365530 237985792 master.cpp:466] Master only allowing 
authenticated frameworks to register
I1210 16:16:09.365541 237985792 master.cpp:472] Master only allowing 
authenticated agents to register
I1210 16:16:09.365550 237985792 master.cpp:478] Master only allowing 
authenticated HTTP frameworks to register
I1210 16:16:09.365559 237985792 credentials.hpp:37] Loading credentials for 
authentication from 
'/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/OBl7Zi/credentials'
I1210 16:16:09.365763 237985792 master.cpp:522] Using default 'crammd5' 
authenticator
I1210 16:16:09.365819 237985792 http.cpp:1017] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1210 16:16:09.365888 237985792 http.cpp:1017] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1210 16:16:09.365967 237985792 http.cpp:1017] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1210 16:16:09.366027 237985792 master.cpp:603] Authorization enabled
I1210 16:16:09.366263 239058944 whitelist_watcher.cpp:77] No whitelist given
I1210 16:16:09.366286 237449216 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1210 16:16:09.368378 237985792 master.cpp:2089] Elected as the leading master!
I1210 16:16:09.368408 237985792 master.cpp:1644] Recovering from registrar
I1210 16:16:09.368455 235839488 registrar.cpp:339] Recovering registrar
I1210 16:16:09.368711 235839488 registrar.cpp:383] Successfully fetched the 
registry (0B) in 224us
I1210 16:16:09.368775 235839488 registrar.cpp:487] Applied 1 operations in 
23922ns; attempting to update the registry
I1210 16:16:09.369017 235839488 registrar.cpp:544] Successfully updated the 
registry in 218112ns
I1210 16:16:09.369065 235839488 registrar.cpp:416] Successfully recovered 
registrar
I1210 16:16:09.369207 238522368 master.cpp:1758] Recovered 0 agents from the 
registry (155B); allowing 10mins for agents to reregister
I1210 16:16:09.369225 236912640 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
W1210 16:16:09.369658 338650560 process.cpp:2829] Attempted to spawn already 
running process version@10.0.49.4:54069
I1210 16:16:09.370749 338650560 containerizer.cpp:305] Using isolation { 
environment_secret, filesystem/posix, posix/mem, posix/cpu }
I1210 16:16:09.371047 338650560 provisioner.cpp:298] Using default backend 
'copy'
W1210 16:16:09.372812 338650560 process.cpp:2829] Attempted to

[jira] [Created] (MESOS-9465) ProcessRemoteLinkTest.RemoteStaleLinkRelink is flaky again

2018-12-11 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9465:
--

 Summary: ProcessRemoteLinkTest.RemoteStaleLinkRelink is flaky again
 Key: MESOS-9465
 URL: https://issues.apache.org/jira/browse/MESOS-9465
 Project: Mesos
  Issue Type: Bug
 Environment: Mac OSX with SSL enabled
Reporter: Benno Evers


The test failed with the following error in our internal CI:
{noformat}
[ RUN  ] ProcessRemoteLinkTest.RemoteStaleLinkRelink
[warn] kq_init: detected broken kqueue; not using.: No such process
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1210 10:34:07.134811 351110592 process.cpp:1239] libprocess is initialized on 
10.0.49.4:58630 with 8 worker threads
I1210 10:34:07.137801 109821952 test_linkee.cpp:73] EXIT with status 0: 
../../../3rdparty/libprocess/src/tests/process_tests.cpp:1176: Failure
Mock function called more times than expected - returning directly.
Function call: exited(@0x7f9ef7f0d888 (1)@10.0.49.4:58631)
 Expected: to be called once
   Actual: called twice - over-saturated and active
W1210 10:34:07.139040 95457280 process.cpp:838] Failed to recv on socket 8 to 
peer 'unknown': Connection reset by peer
[  FAILED  ] ProcessRemoteLinkTest.RemoteStaleLinkRelink (22 ms)
{noformat}

Interestingly, looking at some context from the same CI run, it looks like many 
similar tests also had severe issues but still succeeded:
{noformat}
[ RUN  ] ProcessRemoteLinkTest.RemoteDoubleLinkRelink
[warn] kq_init: detected broken kqueue; not using.: No such process
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1210 10:34:06.945520 368641472 process.cpp:1239] libprocess is initialized on 
10.0.49.4:58618 with 8 worker threads
W1210 10:34:06.948437 95457280 process.cpp:838] Failed to recv on socket 8 to 
peer 'unknown': Connection reset by peer
W1210 10:34:06.948755 95457280 process.cpp:1423] Failed to recv on socket 11 to 
peer 'unknown': Connection reset by peer
[   OK ] ProcessRemoteLinkTest.RemoteDoubleLinkRelink (21 ms)
[ RUN  ] ProcessRemoteLinkTest.RemoteLinkLeak
[warn] kq_init: detected broken kqueue; not using.: No such process
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1210 10:34:06.966291 379131328 process.cpp:1239] libprocess is initialized on 
10.0.49.4:58623 with 8 worker threads
W1210 10:34:07.055934 300283328 process.cpp:1587] Failed to link to 
'10.0.49.4:58624', create socket: Failed to create socket: Too many open files
W1210 10:34:07.096643 95457280 process.cpp:838] Failed to recv on socket 8 to 
peer 'unknown': Connection reset by peer
[   OK ] ProcessRemoteLinkTest.RemoteLinkLeak (148 ms)
[ RUN  ] ProcessRemoteLinkTest.RemoteUseStaleLink
[warn] kq_init: detected broken kqueue; not using.: No such process
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1210 10:34:07.114372 219854272 process.cpp:1239] libprocess is initialized on 
10.0.49.4:58626 with 8 worker threads
W1210 10:34:07.117367 95457280 process.cpp:838] Failed to recv on socket 8 to 
peer 'unknown': Connection reset by peer
[   OK ] ProcessRemoteLinkTest.RemoteUseStaleLink (20 ms)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7217) CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky.

2018-12-11 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716997#comment-16716997
 ] 

Benno Evers commented on MESOS-7217:


Same again on Centos 7 - I'm starting to see a pattern ;)

{noformat}
Expected: (0.30) >= (cpuTime), actual: 0.3 vs 0.3
{noformat}

> CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky.
> 
>
> Key: MESOS-7217
> URL: https://issues.apache.org/jira/browse/MESOS-7217
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.8.0
> Environment: ubuntu-14.04, centos-7
>Reporter: Till Toenshoff
>Priority: Major
>  Labels: containerizer, flaky, flaky-test, mesosphere, test
>
> The test CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs appears to be flaky 
> on Ubuntu 14.04.
> When failing, the test shows the following:
> {noformat}
> 14:05:48  [ RUN  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
> 14:05:48  I0306 14:05:48.704794 27340 cluster.cpp:158] Creating default 
> 'local' authorizer
> 14:05:48  I0306 14:05:48.716588 27340 leveldb.cpp:174] Opened db in 
> 11.681905ms
> 14:05:48  I0306 14:05:48.718921 27340 leveldb.cpp:181] Compacted db in 
> 2.309404ms
> 14:05:48  I0306 14:05:48.718945 27340 leveldb.cpp:196] Created db iterator in 
> 3075ns
> 14:05:48  I0306 14:05:48.718951 27340 leveldb.cpp:202] Seeked to beginning of 
> db in 558ns
> 14:05:48  I0306 14:05:48.718955 27340 leveldb.cpp:271] Iterated through 0 
> keys in the db in 257ns
> 14:05:48  I0306 14:05:48.718966 27340 replica.cpp:776] Replica recovered with 
> log positions 0 -> 0 with 1 holes and 0 unlearned
> 14:05:48  I0306 14:05:48.719113 27361 recover.cpp:451] Starting replica 
> recovery
> 14:05:48  I0306 14:05:48.719172 27361 recover.cpp:477] Replica is in EMPTY 
> status
> 14:05:48  I0306 14:05:48.719460 27361 replica.cpp:673] Replica in EMPTY 
> status received a broadcasted recover request from 
> __req_res__(6807)@10.179.217.143:53643
> 14:05:48  I0306 14:05:48.719537 27363 recover.cpp:197] Received a recover 
> response from a replica in EMPTY status
> 14:05:48  I0306 14:05:48.719625 27365 recover.cpp:568] Updating replica 
> status to STARTING
> 14:05:48  I0306 14:05:48.720384 27361 master.cpp:380] Master 
> cb9586dc-a080-41eb-b5b8-88274f84a20a (ip-10-179-217-143.ec2.internal) started 
> on 10.179.217.143:53643
> 14:05:48  I0306 14:05:48.720404 27361 master.cpp:382] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/tzyTvK/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/tzyTvK/master" --zk_session_timeout="10secs"
> 14:05:48  I0306 14:05:48.720553 27361 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> 14:05:48  I0306 14:05:48.720559 27361 master.cpp:446] Master only allowing 
> authenticated agents to register
> 14:05:48  I0306 14:05:48.720562 27361 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> 14:05:48  I0306 14:05:48.720566 27361 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/tzyTvK/credentials'
> 14:05:48  I0306 14:05:48.720655 27361 master.cpp:504] Using default 'crammd5' 
> authenticator
> 14:05:48  I0306 14:05:48.720700 27361 http.cpp:887] Using default 'basic' 
> HTTP authenticator for realm 'mesos-master-readonly'
> 14:05:48  I0306 14:05:48.720767 27361 http.cpp:887] Using default 'basic' 
> HTTP authenticator for realm 'mesos-master-readwrite'
> 14:05:48  I0306 14:05:48.720808 27361 http.cpp:887] Using default 'basic' 
> HTTP authenticator for realm 'mesos-master-scheduler'
> 14:05:48  I0306 14:05:48.720875 27361 master.cpp:584] Authorization enabled
> 14:05:48  I0306 14:05:48.720995 27360 whitelist_watcher.cpp:77] No whitelist 
> given

[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.

2018-12-11 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716992#comment-16716992
 ] 

Benno Evers commented on MESOS-8096:


Observed the same today in 
`MesosContainerizer/DefaultExecutorTest.ROOT_ContainerStatusForTask/0`:
{noformat}
[ RUN  ] 
MesosContainerizer/DefaultExecutorTest.ROOT_ContainerStatusForTask/0
[...]
I1210 18:51:52.317384  2570 default_executor.cpp:1126] Killing task 
2506c623-0270-4126-aa0c-8eeda080e50d running in child container 
a1b3cf45-7361-484f-8095-4ae69dd5e777.17e50b81-46a7-4225-9c33-a0bf024618ec with 
SIGTERM signal
I1210 18:51:52.317389  2570 default_executor.cpp:1137] Scheduling escalation to 
SIGKILL in 3secs from now
I1210 18:51:52.317608  2570 default_executor.cpp:1126] Killing task 
40e69403-db71-4902-af53-746d445a7489 running in child container 
a1b3cf45-7361-484f-8095-4ae69dd5e777.c544a951-c629-492a-bc09-b1a6c72740e2 with 
SIGTERM signal
I1210 18:51:52.317620  2570 default_executor.cpp:1137] Scheduling escalation to 
SIGKILL in 3secs from now
I1210 18:51:52.318428 15462 process.cpp:3588] Handling HTTP event for process 
'slave(1107)' with path: '/slave(1107)/api/v1'
I1210 18:51:52.318593 15461 process.cpp:3588] Handling HTTP event for process 
'slave(1107)' with path: '/slave(1107)/api/v1'
*** Aborted at 1544467912 (unix time) try "date -d @1544467912" if you are 
using GNU date ***
I1210 18:51:52.319488 15461 http.cpp:1157] HTTP POST for /slave(1107)/api/v1 
from 172.16.10.38:60672
I1210 18:51:52.319586 15461 http.cpp:1157] HTTP POST for /slave(1107)/api/v1 
from 172.16.10.38:60673
I1210 18:51:52.319697 15461 http.cpp:2797] Processing KILL_NESTED_CONTAINER 
call for container 
'a1b3cf45-7361-484f-8095-4ae69dd5e777.17e50b81-46a7-4225-9c33-a0bf024618ec'
I1210 18:51:52.319808 15461 http.cpp:2797] Processing KILL_NESTED_CONTAINER 
call for container 
'a1b3cf45-7361-484f-8095-4ae69dd5e777.c544a951-c629-492a-bc09-b1a6c72740e2'
I1210 18:51:52.319927 15461 containerizer.cpp:2839] Sending Terminated to 
container 
a1b3cf45-7361-484f-8095-4ae69dd5e777.17e50b81-46a7-4225-9c33-a0bf024618ec in 
RUNNING state
I1210 18:51:52.320010 15460 containerizer.cpp:2839] Sending Terminated to 
container 
a1b3cf45-7361-484f-8095-4ae69dd5e777.c544a951-c629-492a-bc09-b1a6c72740e2 in 
RUNNING state
PC: @ 0x7fd51d72d013 mesos::v1::scheduler::Mesos::send()
*** SIGSEGV (@0x0) received by PID 23718 (TID 0x7fd50f38b700) from PID 0; stack 
trace: ***
@ 0x7fd4e614aabc (unknown)
@ 0x7fd4e614f751 (unknown)
@ 0x7fd4e6142f58 (unknown)
@ 0x7fd51a3ae890 (unknown)
@ 0x7fd51d72d013 mesos::v1::scheduler::Mesos::send()
@ 0x558cee3c1808 
_ZNK5mesos8internal5tests2v19scheduler23SendAcknowledgeActionP2INS_2v111FrameworkIDENS5_7AgentIDEE10gmock_ImplIFvPNS5_9scheduler5MesosERKNSA_12Event_UpdateEEE17gmock_PerformImplISC_SF_N7testing8internal12ExcessiveArgESL_SL_SL_SL_SL_SL_SL_EEvRKSt5tupleIJSC_SF_EET_T0_T1_T2_T3_T4_T5_T6_T7_T8_
@ 0x558cee3c1990 
_ZN5mesos8internal5tests2v19scheduler23SendAcknowledgeActionP2INS_2v111FrameworkIDENS5_7AgentIDEE10gmock_ImplIFvPNS5_9scheduler5MesosERKNSA_12Event_UpdateEEE7PerformERKSt5tupleIJSC_SF_EE
@ 0x558cee2c430f 
_ZN7testing8internal12DoBothActionI17PromiseArgActionPILi1EPN7process7PromiseIN5mesos2v19scheduler12Event_UpdateNS5_8internal5tests2v19scheduler23SendAcknowledgeActionP2INS6_11FrameworkIDENS6_7AgentID4ImplIFvPNS7_5MesosERKS8_EE7PerformERKSt5tupleIJSN_SP_EE
@ 0x558cee2e9f57 
testing::internal::FunctionMockerBase<>::UntypedPerformAction()
@ 0x558cef7b184f 
testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
@ 0x558cee3d075d 
mesos::internal::tests::scheduler::MockHTTPScheduler<>::events()
@ 0x558cee34cda0 std::_Function_handler<>::_M_invoke()
@ 0x7fd51d731098 process::AsyncExecutorProcess::execute<>()
@ 0x7fd51d74061b 
_ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvRKSt5queueIN5mesos2v19scheduler5EventESt5dequeISA_SaISA_ESE_SK_RSE_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSQ_FSN_T1_T2_EOT3_OT4_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteIS14_EEOSI_OSE_PNS1_11ProcessBaseEE_IS17_SI_SE_S1B_EEEDTclcl7forwardISN_Efp_Espcl7forwardIT0_Efp0_EEEOSN_DpOS1D_
@ 0x7fd51e5205d1 process::ProcessBase::consume()
@ 0x7fd51e537543 process::ProcessManager::resume()
@ 0x7fd51e53d116 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7fd51ab89970 (unknown)
@ 0x7fd51a3a7064 start_thread
@ 0x7fd51a0dc62d (unknown)
E1210 18:51:52.501421  2574 default_executor.cpp:801] Connection for waiting on 
child container 
a1b3cf45-7361-484f-8095-4ae69dd5e777.17e50b81-46a7-4225-9c33-a0bf024618ec of 
task '2506c623-0270-4126-aa0c-8eeda080e50d' interrupted: Disconnected
{noformat}

> Enqueueing events in MockH

[jira] [Created] (MESOS-9453) Libprocess does not handle "identity" encoding rules

2018-12-05 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9453:
--

 Summary: Libprocess does not handle "identity" encoding rules
 Key: MESOS-9453
 URL: https://issues.apache.org/jira/browse/MESOS-9453
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


[RFC 7231|https://tools.ietf.org/html/rfc7231#section-5.3.4], as well as the 
relevant [libprocess 
comment|https://github.com/apache/mesos/blob/dad74012fa02a7fbf61b09968d9b7e9c730b1c97/3rdparty/libprocess/src/http.cpp#L315-L325],
 mentions special handling of the "identity" encoding. 

However, this is currently ignored in Mesos, which can lead to incorrect 
behaviour in combination with MESOS-9451:
{noformat}
$ nc localhost 5050
GET /tasks HTTP/1.1
Accept-Encoding: gzip, identity;q=0 

HTTP/1.1 200 OK
Date: Wed, 05 Dec 2018 11:02:24 GMT
Content-Type: application/json
Content-Length: 12

{"tasks":[]}
{noformat}
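
For illustration, here is a minimal sketch of an Accept-Encoding parser that
honors "identity;q=0" per RFC 7231. This is not the libprocess implementation;
the helper names and parsing details here are hypothetical:
{noformat}
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parse e.g. "gzip, identity;q=0" into a map of
// {encoding -> qvalue}, defaulting the qvalue to 1.0.
std::map<std::string, double> parseAcceptEncoding(const std::string& header)
{
  std::map<std::string, double> result;
  std::istringstream stream(header);
  std::string entry;
  while (std::getline(stream, entry, ',')) {
    double q = 1.0;
    const std::size_t semicolon = entry.find(';');
    std::string name = entry.substr(0, semicolon);
    if (semicolon != std::string::npos) {
      const std::size_t eq = entry.find('=', semicolon);
      if (eq != std::string::npos) {
        q = std::stod(entry.substr(eq + 1));
      }
    }
    // Trim surrounding whitespace from the encoding name.
    const std::size_t begin = name.find_first_not_of(' ');
    const std::size_t end = name.find_last_not_of(' ');
    if (begin != std::string::npos) {
      result[name.substr(begin, end - begin + 1)] = q;
    }
  }
  return result;
}

// RFC 7231, section 5.3.4: "identity;q=0" (or "*;q=0" without an explicit
// "identity" entry) means an unencoded response is not acceptable.
bool identityForbidden(const std::string& header)
{
  std::map<std::string, double> encodings = parseAcceptEncoding(header);
  if (encodings.count("identity") > 0) {
    return encodings["identity"] == 0.0;
  }
  return encodings.count("*") > 0 && encodings["*"] == 0.0;
}
{noformat}
With a check like this in place, the request above ("Accept-Encoding: gzip,
identity;q=0") would have to be answered with a gzipped body (or arguably a
406 Not Acceptable) instead of the identity-encoded response shown.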



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9451) Libprocess endpoints can ignore required gzip compression

2018-12-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709943#comment-16709943
 ] 

Benno Evers commented on MESOS-9451:


Good point; I've opened MESOS-9453 to track our missing handling of the 
"identity" encoding.

> Libprocess endpoints can ignore required gzip compression
> -
>
> Key: MESOS-9451
> URL: https://issues.apache.org/jira/browse/MESOS-9451
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently, libprocess decides whether a response should be compressed by the 
> following conditional:
> {noformat}
> if (response.type == http::Response::BODY &&
> response.body.length() >= GZIP_MINIMUM_BODY_LENGTH &&
> !headers.contains("Content-Encoding") &&
> request.acceptsEncoding("gzip")) {
>   [...]
> {noformat}
> However, this implies that a request sent with the header "Accept-Encoding: 
> gzip" can not rely on actually getting a gzipped response, e.g. when the 
> response size is below the threshold:
> {noformat}
> $ nc localhost 5050
> GET /tasks HTTP/1.1
> Accept-Encoding: gzip
> HTTP/1.1 200 OK
> Date: Tue, 04 Dec 2018 12:49:56 GMT
> Content-Type: application/json
> Content-Length: 12
> {"tasks":[]}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9451) Libprocess endpoints can ignore required gzip compression

2018-12-04 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9451:
--

 Summary: Libprocess endpoints can ignore required gzip compression
 Key: MESOS-9451
 URL: https://issues.apache.org/jira/browse/MESOS-9451
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Currently, libprocess decides whether a response should be compressed by the 
following conditional:
{noformat}
if (response.type == http::Response::BODY &&
response.body.length() >= GZIP_MINIMUM_BODY_LENGTH &&
!headers.contains("Content-Encoding") &&
request.acceptsEncoding("gzip")) {
  [...]
{noformat}

However, this implies that a request sent with the header "Accept-Encoding: 
gzip" cannot rely on actually getting a gzipped response, e.g. when the 
response size is below the threshold:
{noformat}
$ nc localhost 5050
GET /tasks HTTP/1.1
Accept-Encoding: gzip

HTTP/1.1 200 OK
Date: Tue, 04 Dec 2018 12:49:56 GMT
Content-Type: application/json
Content-Length: 12

{"tasks":[]}
{noformat}
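
One possible direction (a sketch only, not a proposed patch; the boolean
parameters stand in for the expressions in the conditional above) would be to
drop the size threshold whenever the client marked the identity encoding as
unacceptable:
{noformat}
#include <cstddef>

// Compress when the body is large enough, or whenever an unencoded response
// would be unacceptable to the client (see MESOS-9453 for the hypothetical
// `identityForbidden` check), so that "Accept-Encoding: gzip" combined with
// "identity;q=0" is honored even for tiny bodies.
bool shouldGzip(
    std::size_t bodyLength,
    std::size_t minimumBodyLength,
    bool hasContentEncoding,
    bool acceptsGzip,
    bool identityForbidden)
{
  if (hasContentEncoding || !acceptsGzip) {
    return false;
  }
  return bodyLength >= minimumBodyLength || identityForbidden;
}
{noformat}
Note that for a plain "Accept-Encoding: gzip" (without "identity;q=0"),
skipping compression below the threshold is arguably permitted by RFC 7231,
since the identity encoding remains acceptable; the surprise is mainly that
this behavior is undocumented.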



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8045) Update Mesos executables output if there is a typo

2018-12-03 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-8045:
--

Resolution: Fixed
  Assignee: Benno Evers

This is resolved by MESOS-8728; now we only print the full help string when 
the "--help" option is specified.

> Update Mesos executables output if there is a typo
> --
>
> Key: MESOS-8045
> URL: https://issues.apache.org/jira/browse/MESOS-8045
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Armand Grillet
>Assignee: Benno Evers
>Priority: Minor
>
> Current output if a user makes a typo while using one of the Mesos 
> executables:
> {code}
> build (master) $ ./bin/mesos-master.sh --ip=127.0.0.1 --workdir=/tmp
> Failed to load unknown flag 'workdir'
> Usage: mesos-master [options]
>   --acls=VALUE
>The value could be a JSON-formatted string of ACLs
>   
>or a file path containing the JSON-formatted ACLs used
>   
>for authorization. Path could be of the form `file:///path/to/file`
>   
>or `/path/to/file`.
>   
>Note that if the flag `--authorizers` is provided with a value
>   
>different than `local`, the ACLs contents
>   
>will be ignored.
>   
>See the ACLs protobuf in acls.proto for the expected format.
>   
>Example:
>   
>{
>   
>  "register_frameworks": [
>   
>{
>   
>  "principals": { "type": "ANY" },
>   
>  "roles": { "values": ["a"] }
>   
>}
>   
>  ],
>   
>  "run_tasks": [
>   
>{
>   
>  "principals": { "values": ["a", "b"] },
>   
>  "users": { "values": ["c"] }
>   
>}
>   
>  ],
>   
>  "teardown_frameworks": [
>   
>{
>   
>  "principals": { "values": ["a", "b"] },
>   
>  "framework_principals": { "values": ["c"] }
>   
>}
>   
>  ],
>   
>  "set_quotas": [
>   
>{
>   
>  "principals": { "values": ["a"] },
>   
>  "roles": { "values": ["a", "b"] }
>   
>}
>   
>  ],
>   
>  "remove_quotas": [
>   
>{
>

[jira] [Commented] (MESOS-9022) Race condition in task updates could cause missing event in streaming

2018-12-03 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707413#comment-16707413
 ] 

Benno Evers commented on MESOS-9022:


Confirmed: this is caused by the same underlying problem as MESOS-9000 and 
should be solved by https://reviews.apache.org/r/67575/ .

> Race condition in task updates could cause missing event in streaming
> -
>
> Key: MESOS-9022
> URL: https://issues.apache.org/jira/browse/MESOS-9022
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Affects Versions: 1.6.0
>Reporter: Evelyn Liu
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: events, foundations, mesos, mesosphere, race-condition, 
> streaming
>
> Master sends update event of {{TASK_STARTING}} when task's latest state is 
> already {{TASK_FAILED}}. Then when it handles the update of {{TASK_FAILED}}, 
> {{sendSubscribersUpdate}} is set to {{false}} because of 
> [this|https://github.com/apache/mesos/blob/1.6.x/src/master/master.cpp#L10805].
>  The subscriber would not receive update event of {{TASK_FAILED}}.
> This happened when a task failed very fast. Is there a race condition while 
> handling task updates?
> {{*master log:*}}
> {code:java}
> I0622 13:08:29.189771 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.189801 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.190004 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_STARTING, 
> status update state: TASK_STARTING)
>  I0622 13:08:29.603857 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615643 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615669 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.615783 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_STARTING)
>  I0622 13:08:29.620837 84079 master.cpp:8345] Status update TASK_FAILED 
> (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.620853 84079 master.cpp:8402] Forwarding status update 
> TASK_FAILED (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.620923 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_FAILED)
>  I0622 13:08:29.630455 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.673051 84095 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status ac34f1e9-eaa4-4765-82ac-7398c2e6c835 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9272) SlaveTest.DefaultExecutorCommandInfo is flaky

2018-11-22 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696221#comment-16696221
 ] 

Benno Evers commented on MESOS-9272:


https://reviews.apache.org/r/69436

> SlaveTest.DefaultExecutorCommandInfo is flaky
> -
>
> Key: MESOS-9272
> URL: https://issues.apache.org/jira/browse/MESOS-9272
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed in an internal CI run (4499):
> {noformat}
> ../../src/tests/cluster.cpp:697
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { 743f1b4c-8ce0-4fd4-b952-a7bbc9788775 }
> {noformat}
> Full log:
> {noformat}
> [ RUN  ] SlaveTest.DefaultExecutorCommandInfo
> I0927 01:48:44.246218 11015 cluster.cpp:173] Creating default 'local' 
> authorizer
> I0927 01:48:44.247200 11037 master.cpp:413] Master 
> 56a99d2f-f8c8-4d21-a8f7-df452833cce0 (ip-172-16-10-254.ec2.internal) started 
> on 172.16.10.254:33398
> I0927 01:48:44.247223 11037 master.cpp:416] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7SQ2cR/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7SQ2cR/master" --zk_session_timeout="10secs"
> I0927 01:48:44.247354 11037 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> I0927 01:48:44.247364 11037 master.cpp:471] Master only allowing 
> authenticated agents to register
> I0927 01:48:44.247370 11037 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> I0927 01:48:44.247375 11037 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7SQ2cR/credentials'
> I0927 01:48:44.247453 11037 master.cpp:521] Using default 'crammd5' 
> authenticator
> I0927 01:48:44.247488 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0927 01:48:44.247519 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0927 01:48:44.247541 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0927 01:48:44.247668 11037 master.cpp:602] Authorization enabled
> I0927 01:48:44.247741 11036 hierarchical.cpp:182] Initialized hierarchical 
> allocator process
> I0927 01:48:44.247782 11036 whitelist_watcher.cpp:77] No whitelist given
> I0927 01:48:44.248339 11036 master.cpp:2083] Elected as the leading master!
> I0927 01:48:44.248358 11036 master.cpp:1638] Recovering from registrar
> I0927 01:48:44.248430 11036 registrar.cpp:339] Recovering registrar
> I0927 01:48:44.248623 11037 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 168960ns
> I0927 01:48:44.248658 11037 registrar.cpp:487] Applied 1 operations in 
> 6362ns; attempting to update the registry
> I0927 01:48:44.248767 11037 registrar.cpp:544] Successfully updated the 
> registry in 94208ns
> I0927 01:48:44.248795 11037 registrar.cpp:416] Successfully recovered 
> registrar
> I0927 01:48:44.248880 11036 hierarchical.cpp:220] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0927 01:48:44.248901 11037 master.cpp:1752] Recovered 0 agents from the 
> registry (176B); allowing 10mins for agents to reregister
> W0927 01:48:44.250870 11015 process.cpp:2810] Attempted to spawn already 
> running process files@172.16.10.254:33398
> I0927 01:48:44.251050 11015 cluster.cpp:485] Creating default

[jira] [Assigned] (MESOS-9272) SlaveTest.DefaultExecutorCommandInfo is flaky

2018-11-22 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9272:
--

Assignee: Benno Evers

> SlaveTest.DefaultExecutorCommandInfo is flaky
> -
>
> Key: MESOS-9272
> URL: https://issues.apache.org/jira/browse/MESOS-9272
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed in an internal CI run (4499):
> {noformat}
> ../../src/tests/cluster.cpp:697
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { 743f1b4c-8ce0-4fd4-b952-a7bbc9788775 }
> {noformat}
> Full log:
> {noformat}
> [ RUN  ] SlaveTest.DefaultExecutorCommandInfo
> I0927 01:48:44.246218 11015 cluster.cpp:173] Creating default 'local' 
> authorizer
> I0927 01:48:44.247200 11037 master.cpp:413] Master 
> 56a99d2f-f8c8-4d21-a8f7-df452833cce0 (ip-172-16-10-254.ec2.internal) started 
> on 172.16.10.254:33398
> I0927 01:48:44.247223 11037 master.cpp:416] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7SQ2cR/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7SQ2cR/master" --zk_session_timeout="10secs"
> I0927 01:48:44.247354 11037 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> I0927 01:48:44.247364 11037 master.cpp:471] Master only allowing 
> authenticated agents to register
> I0927 01:48:44.247370 11037 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> I0927 01:48:44.247375 11037 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7SQ2cR/credentials'
> I0927 01:48:44.247453 11037 master.cpp:521] Using default 'crammd5' 
> authenticator
> I0927 01:48:44.247488 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0927 01:48:44.247519 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0927 01:48:44.247541 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0927 01:48:44.247668 11037 master.cpp:602] Authorization enabled
> I0927 01:48:44.247741 11036 hierarchical.cpp:182] Initialized hierarchical 
> allocator process
> I0927 01:48:44.247782 11036 whitelist_watcher.cpp:77] No whitelist given
> I0927 01:48:44.248339 11036 master.cpp:2083] Elected as the leading master!
> I0927 01:48:44.248358 11036 master.cpp:1638] Recovering from registrar
> I0927 01:48:44.248430 11036 registrar.cpp:339] Recovering registrar
> I0927 01:48:44.248623 11037 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 168960ns
> I0927 01:48:44.248658 11037 registrar.cpp:487] Applied 1 operations in 
> 6362ns; attempting to update the registry
> I0927 01:48:44.248767 11037 registrar.cpp:544] Successfully updated the 
> registry in 94208ns
> I0927 01:48:44.248795 11037 registrar.cpp:416] Successfully recovered 
> registrar
> I0927 01:48:44.248880 11036 hierarchical.cpp:220] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0927 01:48:44.248901 11037 master.cpp:1752] Recovered 0 agents from the 
> registry (176B); allowing 10mins for agents to reregister
> W0927 01:48:44.250870 11015 process.cpp:2810] Attempted to spawn already 
> running process files@172.16.10.254:33398
> I0927 01:48:44.251050 11015 cluster.cpp:485] Creating default 'local' 
> authorizer
> I0927 01:48:44.251428 11035 slave.c

[jira] [Commented] (MESOS-9272) SlaveTest.DefaultExecutorCommandInfo is flaky

2018-11-22 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696219#comment-16696219
 ] 

Benno Evers commented on MESOS-9272:


Caused by: https://issues.apache.org/jira/browse/MESOS-9413

> SlaveTest.DefaultExecutorCommandInfo is flaky
> -
>
> Key: MESOS-9272
> URL: https://issues.apache.org/jira/browse/MESOS-9272
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed in an internal CI run (4499):
> {noformat}
> ../../src/tests/cluster.cpp:697
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { 743f1b4c-8ce0-4fd4-b952-a7bbc9788775 }
> {noformat}
> Full log:
> {noformat}
> [ RUN  ] SlaveTest.DefaultExecutorCommandInfo
> I0927 01:48:44.246218 11015 cluster.cpp:173] Creating default 'local' 
> authorizer
> I0927 01:48:44.247200 11037 master.cpp:413] Master 
> 56a99d2f-f8c8-4d21-a8f7-df452833cce0 (ip-172-16-10-254.ec2.internal) started 
> on 172.16.10.254:33398
> I0927 01:48:44.247223 11037 master.cpp:416] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7SQ2cR/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7SQ2cR/master" --zk_session_timeout="10secs"
> I0927 01:48:44.247354 11037 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> I0927 01:48:44.247364 11037 master.cpp:471] Master only allowing 
> authenticated agents to register
> I0927 01:48:44.247370 11037 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> I0927 01:48:44.247375 11037 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7SQ2cR/credentials'
> I0927 01:48:44.247453 11037 master.cpp:521] Using default 'crammd5' 
> authenticator
> I0927 01:48:44.247488 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0927 01:48:44.247519 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0927 01:48:44.247541 11037 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0927 01:48:44.247668 11037 master.cpp:602] Authorization enabled
> I0927 01:48:44.247741 11036 hierarchical.cpp:182] Initialized hierarchical 
> allocator process
> I0927 01:48:44.247782 11036 whitelist_watcher.cpp:77] No whitelist given
> I0927 01:48:44.248339 11036 master.cpp:2083] Elected as the leading master!
> I0927 01:48:44.248358 11036 master.cpp:1638] Recovering from registrar
> I0927 01:48:44.248430 11036 registrar.cpp:339] Recovering registrar
> I0927 01:48:44.248623 11037 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 168960ns
> I0927 01:48:44.248658 11037 registrar.cpp:487] Applied 1 operations in 
> 6362ns; attempting to update the registry
> I0927 01:48:44.248767 11037 registrar.cpp:544] Successfully updated the 
> registry in 94208ns
> I0927 01:48:44.248795 11037 registrar.cpp:416] Successfully recovered 
> registrar
> I0927 01:48:44.248880 11036 hierarchical.cpp:220] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0927 01:48:44.248901 11037 master.cpp:1752] Recovered 0 agents from the 
> registry (176B); allowing 10mins for agents to reregister
> W0927 01:48:44.250870 11015 process.cpp:2810] Attempted to spawn already 
> running process files@172.16.10.254:33398
> I0927 01:48:44.251050 11015 cluster.

[jira] [Created] (MESOS-9413) Composing containerizer has no way to wait for container removal

2018-11-22 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9413:
--

 Summary: Composing containerizer has no way to wait for container 
removal
 Key: MESOS-9413
 URL: https://issues.apache.org/jira/browse/MESOS-9413
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Inside the composing containerizer, destruction is ultimately implemented like 
this:
{noformat}
  return container->containerizer->destroy(containerId)
    .onAny(defer(self(), [=](const Future<Option<ContainerTermination>>&) {
  if (containers_.contains(containerId)) {
delete containers_.at(containerId);
containers_.erase(containerId);
  }
}));
{noformat}

This means that code trying to ensure that every container has been destroyed, 
like this
{noformat}
foreach (const ContainerID& containerId, containers.get()) {
  process::Future<Option<ContainerTermination>> termination =
    containerizer->destroy(containerId);

  AWAIT(termination);
}
ASSERT_TRUE(containerizer->empty());
{noformat}

is inherently racy, because the call to `empty()` might happen before the 
removal that gets deferred in the `.onAny()`-callback is executed.
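
Until the composing containerizer exposes a way to wait for the removal, test
code can only paper over the race, e.g. by polling with a deadline. A
workaround sketch using the names from the snippets above, not a fix for the
underlying problem:
{noformat}
// Poll until the deferred erase from the `.onAny()`-callback has run,
// bounded so that a genuinely leaked container still fails the test.
Duration waited = Duration::zero();
while (!containerizer->empty() && waited < Seconds(15)) {
  os::sleep(Milliseconds(10));
  waited += Milliseconds(10);
}

ASSERT_TRUE(containerizer->empty());
{noformat}
A proper fix would presumably complete the future returned by `destroy()` (or
provide a separate wait-style method) only after the bookkeeping in the
`.onAny()`-callback has finished.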



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9400) Allow application to learn the number of libprocess worker threads.

2018-11-19 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9400:
--

 Summary: Allow application to learn the number of libprocess 
worker threads.
 Key: MESOS-9400
 URL: https://issues.apache.org/jira/browse/MESOS-9400
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


The number of worker threads used by libprocess usually depends on the number 
of CPU cores on the machine, but can be overridden using the environment 
variable `LIBPROCESS_NUM_WORKER_THREADS`.

However, as far as I could tell, there is currently no way for an application 
using libprocess to learn the current number of worker threads.
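
Absent such an API, an application can only re-derive a best-effort guess. A
sketch; the fallback mirrors, but is not guaranteed to match, whatever
libprocess actually computes internally:
{noformat}
#include <cstddef>
#include <cstdlib>
#include <string>
#include <thread>

// Best-effort guess at the libprocess worker pool size: honor the
// environment override if set, otherwise fall back to the hardware
// thread count.
std::size_t guessWorkerThreads()
{
  if (const char* value = std::getenv("LIBPROCESS_NUM_WORKER_THREADS")) {
    return static_cast<std::size_t>(std::stoul(std::string(value)));
  }
  return std::thread::hardware_concurrency();
}
{noformat}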



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9391) Parallel test runner can exhaust system resources in combination with libtool wrappers

2018-11-15 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9391:
--

 Summary: Parallel test runner can exhaust system resources in 
combination with libtool wrappers
 Key: MESOS-9391
 URL: https://issues.apache.org/jira/browse/MESOS-9391
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The default autotools build currently enables both the parallel test 
runner (--enable-parallel-test-execution) and the use of libtool wrapper 
scripts (--enable-libtool-wrappers).

These have an unfortunate interaction: the wrapper scripts actually invoke the 
linker on first invocation, and the parallel test runner launches as many 
tests in parallel as there are processors (`nproc`), leading to that many 
concurrent invocations of the linker for a huge link, which can completely 
exhaust the available resources on the host machine.
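
Until this is resolved, the interaction can be avoided at configure time by
turning off either feature (assuming the standard autoconf `--disable-*`
counterparts of the flags named above):
{noformat}
$ ../configure --disable-parallel-test-execution
# or:
$ ../configure --disable-libtool-wrappers
{noformat}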



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9390) Warnings in AdaptedOperation prevent clang build

2018-11-15 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9390:
--

 Summary: Warnings in AdaptedOperation prevent clang build
 Key: MESOS-9390
 URL: https://issues.apache.org/jira/browse/MESOS-9390
 Project: Mesos
  Issue Type: Bug
 Environment: Fedora 28
Reporter: Benno Evers


When building the latest Mesos master using clang-8 as the compiler, the 
following warnings can be observed:
{noformat}
../../src/resource_provider/registrar.cpp:387:5: error: explicitly defaulted 
move constructor is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
AdaptedOperation(AdaptedOperation&&) = default;
^
../../src/resource_provider/registrar.cpp:374:28: note: move constructor of 
'AdaptedOperation' is implicitly deleted because base class 
'master::RegistryOperation' has a deleted move constructor
  class AdaptedOperation : public master::RegistryOperation
   ^
../../src/master/registrar.hpp:45:27: note: copy constructor of 
'RegistryOperation' is implicitly deleted because base class 
'process::Promise<bool>' has an inaccessible copy constructor
class RegistryOperation : public process::Promise<bool>
  ^
../../src/resource_provider/registrar.cpp:389:23: error: explicitly defaulted 
move assignment operator is implicitly deleted 
[-Werror,-Wdefaulted-function-deleted]
AdaptedOperation& operator=(AdaptedOperation&&) = default;
  ^
../../src/resource_provider/registrar.cpp:374:28: note: move assignment 
operator of 'AdaptedOperation' is implicitly deleted because base class 
'master::RegistryOperation' has a deleted move assignment operator
  class AdaptedOperation : public master::RegistryOperation
   ^
../../src/master/registrar.hpp:45:27: note: copy assignment operator of 
'RegistryOperation' is implicitly deleted because base class 
'process::Promise<bool>' has an inaccessible copy assignment operator
class RegistryOperation : public process::Promise<bool>
  ^
2 errors generated.
{noformat}

I tried looking into this, but I can't make sense of the warnings: the 
required move constructor and move assignment operator seem to be correctly 
declared in `Promise`:

{noformat}
// 3rdparty/libprocess/include/process/future.hpp
template <typename T>
class Promise
{
public:
  Promise();
  virtual ~Promise();

  explicit Promise(const T& t);

  Promise(Promise&& that) = default;
  Promise& operator=(Promise&&) = default;
[...]
};
{noformat}
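
For debugging, here is a minimal sketch that appears to reproduce the same
diagnostic shape. The private copy operations and the user-declared destructor
in the middle class are assumptions made for illustration, not facts taken
from the actual `Promise` or `RegistryOperation`:
{noformat}
// If the base's copy operations are inaccessible and the middle class
// declares a destructor (which suppresses its implicit move operations),
// then the derived class's explicitly defaulted move operations end up
// implicitly deleted; clang 8 flags this via -Wdefaulted-function-deleted.
template <typename T>
class Promise
{
public:
  Promise() = default;
  Promise(Promise&&) = default;
  Promise& operator=(Promise&&) = default;

private:
  Promise(const Promise&);             // inaccessible copy constructor
  Promise& operator=(const Promise&);  // inaccessible copy assignment
};

class RegistryOperation : public Promise<bool>
{
public:
  virtual ~RegistryOperation() = default;  // no implicit move operations
};

class AdaptedOperation : public RegistryOperation
{
public:
  AdaptedOperation(AdaptedOperation&&) = default;             // warns here
  AdaptedOperation& operator=(AdaptedOperation&&) = default;  // and here
};
{noformat}
If something along these lines is what happens in the real code, the defaulted
moves in `Promise` would not help, because `RegistryOperation` never gets
implicit move operations of its own in the first place.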



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9389) Cannot build python support using clang 8

2018-11-15 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9389:
--

 Summary: Cannot build python support using clang 8
 Key: MESOS-9389
 URL: https://issues.apache.org/jira/browse/MESOS-9389
 Project: Mesos
  Issue Type: Bug
 Environment: Fedora 28 w/ autotools build and clang
Reporter: Benno Evers


Trying to compile the latest Mesos master with python support enabled on a 
Fedora 28 machine leads to the following configuration error:

{noformat}
$ ../configure CC=clang CXX=clang++
[...]
checking whether we can build usable Python eggs... clang-8: error: unknown 
argument: '-fstack-clash-protection'
clang-8: error: unknown argument: '-fstack-clash-protection'
error: command 'clang' failed with exit status 1
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9224) De-duplicate read-only requests to master based on principal.

2018-11-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686836#comment-16686836
 ] 

Benno Evers commented on MESOS-9224:


After discussions with Alex and Greg, we failed to identify a way to 
deterministically trigger the batching functionality without either 
introducing some inherent test flakiness or major modifications to both Mesos 
and the testing code. The main problems I was running into:

 - A correctly working cache should, ideally, be undetectable from the outside, 
so there's the question of how to verify that the test code was actually 
hitting the cache. We thought about dynamically introducing new endpoints that 
just count how often they've been accessed, but it does not currently seem 
possible to introduce new routes or replace existing ones at runtime. 
Additionally, this has the problem that the dynamically introduced routes would 
not be cached.
 - The routines used to implement the de-duplication are currently all 
private. We can introduce public getters and setters or just directly open up 
master internals for use in tests, but that seems like a code smell. It's also 
hard to use `protected` here, because instantiating a new master instance is a 
messy process requiring lots of support code, all of which would need to be 
duplicated to use a subclass of the Mesos master.
 - Ideally, we should use the actual HTTP pipeline used by Mesos in our unit 
tests, including libprocess authentication and routing, so even if we could 
somehow directly access the Mesos master's HTTP internals, it's questionable 
whether we should do it.

I'm currently working on an alternate, slightly probabilistic kind of test that 
launches many requests at once and verifies that they still return the correct 
answers.
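
As a rough sketch of that idea (with a hypothetical `fetchState()` helper 
standing in for an authenticated read-only request against the test master; the 
real test would go through the actual HTTP stack):

{noformat}
// Probabilistic de-duplication test sketch, not actual Mesos test code.
#include <cassert>
#include <future>
#include <string>
#include <vector>

std::string fetchState()
{
  // Hypothetical stand-in: the real helper would issue an authenticated
  // GET /master/state and return the response body.
  return "{}";
}

int main()
{
  constexpr int kRequests = 256;
  std::vector<std::future<std::string>> responses;
  responses.reserve(kRequests);

  // Fire many identical requests concurrently so that some of them are
  // likely to be batched and served from the same cached result.
  for (int i = 0; i < kRequests; ++i) {
    responses.push_back(std::async(std::launch::async, fetchState));
  }

  // Whether or not a given request hit the cache must be unobservable:
  // every response has to be identical.
  const std::string expected = responses.front().get();
  for (int i = 1; i < kRequests; ++i) {
    assert(responses[i].get() == expected);
  }
}
{noformat}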

> De-duplicate read-only requests to master based on principal.
> -
>
> Key: MESOS-9224
> URL: https://issues.apache.org/jira/browse/MESOS-9224
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Major
>  Labels: performance
>
> "Identical" read-only requests can be batched and answered together. With 
> batching available (MESOS-9158), we can now deduplicate requests based on 
> principal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9224) De-duplicate read-only requests to master based on principal.

2018-10-18 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655743#comment-16655743
 ] 

Benno Evers commented on MESOS-9224:


A review chain with the required changes can be found at 
https://reviews.apache.org/r/68131

The one thing that is still missing is a set of unit tests, which is appended 
to the chain as a WIP commit but is proving unexpectedly hard due to the fuzzy 
interface between the HTTP handler and the master, and the caching being deeply 
buried in the master internals.

> De-duplicate read-only requests to master based on principal.
> -
>
> Key: MESOS-9224
> URL: https://issues.apache.org/jira/browse/MESOS-9224
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Major
>  Labels: performance
>
> "Identical" read-only requests can be batched and answered together. With 
> batching available (MESOS-9158), we can now deduplicate requests based on 
> principal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9329) CMake build on Fedora 28 fails due to libevent error

2018-10-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9329:
--

 Summary: CMake build on Fedora 28 fails due to libevent error
 Key: MESOS-9329
 URL: https://issues.apache.org/jira/browse/MESOS-9329
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Trying to build Mesos using cmake with the options 
{noformat}
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_SSL=1 -DENABLE_LIBEVENT=1
{noformat}

fails due to the following:
{noformat}
[  1%] Building C object CMakeFiles/event_extra.dir/bufferevent_openssl.c.o
/home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
 In function ‘bio_bufferevent_new’:
/home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3:
 error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’}
  b->init = 0;
   ^~
/home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
 At top level:
/home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:234:1:
 error: variable ‘methods_bufferevent’ has initializer but incomplete type
 static BIO_METHOD methods_bufferevent = {
[...]
{noformat}

Since the autotools build does not have issues when enabling libevent and SSL, 
it seems most likely that the `libevent-2.1.5-beta` version used by default in 
the CMake build is somehow connected to the error.
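
For context, `libevent-2.1.5-beta` predates OpenSSL 1.1, which made `BIO` an 
opaque type; newer libevent releases go through the accessor API instead of 
touching struct fields directly. A sketch of the difference (illustration only, 
not the actual libevent patch), assuming OpenSSL >= 1.1.0 headers:

{noformat}
#include <openssl/bio.h>

// Pre-1.1 style, as in libevent-2.1.5-beta -- fails against OpenSSL 1.1
// because 'struct bio_st' is no longer defined in the public headers:
//   b->init = 0;
//   b->ptr  = bufev;

// OpenSSL >= 1.1 style: the same state is set via accessors.
static int bio_bufferevent_new(BIO* b)
{
  BIO_set_init(b, 0);        // replaces b->init = 0
  BIO_set_data(b, nullptr);  // replaces b->ptr = NULL
  return 1;
}
{noformat}

So bumping the bundled libevent to a release with OpenSSL 1.1 support (or 
pointing the CMake build at the system libevent, as the autotools build does) 
should fix this.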



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9328) Mock slave in mesos tests does not compile using gcc 8

2018-10-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9328:
--

 Summary: Mock slave in mesos tests does not compile using gcc 8
 Key: MESOS-9328
 URL: https://issues.apache.org/jira/browse/MESOS-9328
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Attempting to compile the mesos tests on a Fedora 28 machine using gcc 8 
results in the following error:

{noformat}
../../3rdparty/libprocess/include/process/future.hpp: In instantiation of 
‘process::Future::Future(const U&) [with U = const 
testing::MatcherInterface&>&>*; 
T = Nothing]’:
/usr/include/c++/8/type_traits:920:12:   required from ‘struct 
std::is_constructible&, const 
testing::MatcherInterface&>&>*&>’
/usr/include/c++/8/type_traits:126:12:   required from ‘struct 
std::__and_&, const 
testing::MatcherInterface&>&>*&> >’
/usr/include/c++/8/tuple:485:68:   required from ‘static constexpr bool 
std::_TC<, _Elements>::_MoveConstructibleTuple() [with _UElements = 
{const testing::MatcherInterface&>&>*&}; bool  = true; _Elements = {const 
process::Future&}]’
/usr/include/c++/8/tuple:641:59:   required by substitution of ‘template&>::_NotSameTuple<_UElements ...>()), const 
process::Future&>::_MoveConstructibleTuple<_UElements ...>() && 
std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 
1), const process::Future&>::_NotSameTuple<_UElements ...>()), const 
process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements ...>()) 
&& (1 >= 1)), bool>::type  > constexpr std::tuple&>::tuple(_UElements&& ...) [with _UElements = {const 
testing::MatcherInterface&>&>*&}; typename std::enable_if<((std::_TC<((1 == 
sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 1), const 
process::Future&>::_NotSameTuple<_UElements ...>()), const 
process::Future&>::_MoveConstructibleTuple<_UElements ...>() && 
std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 
1), const process::Future&>::_NotSameTuple<_UElements ...>()), const 
process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements ...>()) 
&& (1 >= 1)), bool>::type  = 1]’
../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-matchers.h:485:10:
   required from ‘testing::Matcher testing::MakeMatcher(const 
testing::MatcherInterface*) [with T = const std::tuple&>&]’
../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-matchers.h:3732:43:
   required from ‘testing::Matcher testing::A() [with T = const 
std::tuple&>&]’
../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:893:47:
   required from 
‘testing::internal::TypedExpectation::TypedExpectation(testing::internal::FunctionMockerBase*,
 const char*, int, const string&, const ArgumentMatcherTuple&) [with F = 
void(const process::Future&); testing::internal::string = 
std::__cxx11::basic_string; 
testing::internal::TypedExpectation::ArgumentMatcherTuple = 
std::tuple&> >]’
../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9:
   required from ‘testing::internal::TypedExpectation& 
testing::internal::FunctionMockerBase::AddNewExpectation(const char*, int, 
const string&, const ArgumentMatcherTuple&) [with F = void(const 
process::Future&); testing::internal::string = 
std::__cxx11::basic_string; 
testing::internal::FunctionMockerBase::ArgumentMatcherTuple = 
std::tuple&> >]’
../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1273:43:
   required from ‘testing::internal::TypedExpectation& 
testing::internal::MockSpec::InternalExpectedAt(const char*, int, const 
char*, const char*) [with F = void(const process::Future&)]’
../../src/tests/mock_slave.cpp:141:3:   required from here
../../3rdparty/libprocess/include/process/future.hpp:1092:3: error: no matching 
function for call to ‘process::Future::set(const 
testing::MatcherInterface&>&>* 
const&)’
   set(u);
   ^~~
../../3rdparty/libprocess/include/process/future.hpp:1761:6: note: candidate: 
‘bool process::Future::set(const T&) [with T = Nothing]’
 bool Future::set(const T& t)
  ^
../../3rdparty/libprocess/include/process/future.hpp:1761:6: note:   no known 
conversion for argument 1 from ‘const testing::MatcherInterface&>&>* const’ to ‘const Nothing&’
../../3rdparty/libprocess/include/process/future.hpp:1754:6: note: candidate: 
‘bool process::Future::set(T&&) [with T = Nothing]’
 bool Future::set(T&& t)
  ^
../../3rdparty/libprocess/include/process/future.hpp:1754:6: note:   no known 
conversion for argument 1 from ‘const testing::MatcherInterface&>&>* const’ to ‘Nothing&&’
make[1]: *** [Makefile:10735: tests/mesos_tests-mock_slave.o] Error 1
{noformat}

The offending line looks like this:
{noformat}
  // mock_slave.cpp:141
  EXPECT_CALL(*this, __recover(_))
.WillRepeatedly(Invoke(this, &MockSlave::unmocked___recover));
{noformat}

At first glance, it looks like it is caused by additional compile-time [...]

[jira] [Commented] (MESOS-9323) Relocation errors against symbol id::UUID::random()

2018-10-17 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653343#comment-16653343
 ] 

Benno Evers commented on MESOS-9323:


After further investigation, this was caused by mixing `g++` as the default 
compiler with `lld` as the default linker.

I'm able to reproduce the unique symbol and the DTPOFF32 relocation using this 
example program:
{noformat}
$ cat thread_local.cpp
class C {
public:
  static void* foo() {
    static thread_local void* generator = nullptr;
    return generator;
  }
};

void* cfoo() {
  return C::foo();
}
$ g++ thread_local.cpp -c -O2 -fPIC
{noformat}
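
To double-check the special symbol, the object file can be inspected with nm; 
the function-local static from the inline member function should show up with a 
lowercase 'u' in the symbol-type column, which marks a GNU_UNIQUE symbol:

{noformat}
$ g++ thread_local.cpp -c -O2 -fPIC -o thread_local.o
$ nm thread_local.o   # 'u' in the type column marks a GNU_UNIQUE symbol
{noformat}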

But this in itself doesn't seem to be enough to trigger the error, so I still 
don't know the actual root cause of this problem.

> Relocation errors against symbol id::UUID::random()
> ---
>
> Key: MESOS-9323
> URL: https://issues.apache.org/jira/browse/MESOS-9323
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> Trying to build Mesos on a Fedora 28 machine using a combination of gcc 8.1 
> and lld 8-trunk results in the following error:
> {noformat}
> ld: error: can't create dynamic relocation R_X86_64_DTPOFF32 against symbol: 
> id::UUID::random()::generator in readonly segment; recompile object files 
> with -fPIC or pass '-Wl,-z,notext' to allow text relocations in the output
> >>> defined in 
> >>> ./.libs/libmesos_no_3rdparty.a(libmesos_no_3rdparty_la-checker_process.o)
> >>> referenced by uuid.hpp:43 (../../3rdparty/stout/include/stout/uuid.hpp:43)
> >>>   
> >>> lt15-libmesos_no_3rdparty_la-manager.o:(mesos::internal::ResourceProviderManagerProcess::newResourceProviderId())
> >>>  in archive ./.libs/libmesos_no_3rdparty.a
> ld: error: too many errors emitted, stopping now (use -error-limit=0 to see 
> all errors)
> {noformat}
> Both the linker and compiler flags already included `-fPIC`, so this part of 
> the error message seems bogus.
> I'm not sure if this is an issue of the compiler generating invalid object files 
> or the linker misunderstanding the created artifacts. However, the symbol 
> `id::UUID::random()::generator` is a very special case because it is a 
> function-local static in an inline function, causing gcc to generate a 
> special `GNU_UNIQUE` symbol, and also a thread-local variable leading to the 
> DTPOFF32 relocation.
> It seems like this combination of uncommon things is somehow tripping up one 
> of the involved tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9302) Mesos fails to build on Fedora 28

2018-10-16 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651692#comment-16651692
 ] 

Benno Evers commented on MESOS-9302:


Opened https://reviews.apache.org/r/69043/ to fix the issue by passing 
`-Wno-error` to the `cares` build.

> Mesos fails to build on Fedora 28
> -
>
> Key: MESOS-9302
> URL: https://issues.apache.org/jira/browse/MESOS-9302
> Project: Mesos
>  Issue Type: Bug
> Environment: gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Fedora 28
>Reporter: Benno Evers
>Priority: Major
>  Labels: build-failure
>
> Trying to compile a fresh Mesos checkout on a Fedora 28 system with the 
> following configuration flags:
> {noformat}
> ../configure --enable-debug --enable-optimize --disable-java --disable-python 
> --disable-libtool-wrappers --enable-ssl --enable-libevent --disable-werror
> {noformat}
> and the following compiler
> {noformat}
> [bev...@core1.hw.ca1 build]$ gcc --version
> gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> {noformat}
> fails the build due to two warnings (even though --disable-werror was passed):
> {noformat}
> make[4]: Entering directory '/home/bevers/mesos/build/3rdparty/grpc-1.10.0'
> [C]   Compiling third_party/cares/cares/ares_init.c
> third_party/cares/cares/ares_init.c: In function ‘ares_dup’:
> third_party/cares/cares/ares_init.c:301:17: error: argument to ‘sizeof’ in 
> ‘strncpy’ call is the same expression as the source; did you mean to use the 
> size of the destination? [-Werror=sizeof-pointer-memaccess]
>sizeof(src->local_dev_name));
>  ^
> third_party/cares/cares/ares_init.c: At top level:
> cc1: error: unrecognized command line option ‘-Wno-invalid-source-encoding’ 
> [-Werror]
> cc1: all warnings being treated as errors
> make[4]: *** [Makefile:2635: 
> /home/bevers/mesos/build/3rdparty/grpc-1.10.0/objs/opt/third_party/cares/cares/ares_init.o]
>  Error 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9323) Relocation errors against symbol id::UUID::random()

2018-10-16 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9323:
--

 Summary: Relocation errors against symbol id::UUID::random()
 Key: MESOS-9323
 URL: https://issues.apache.org/jira/browse/MESOS-9323
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Trying to build Mesos on a Fedora 28 machine using a combination of gcc 8.1 and 
lld 8-trunk results in the following error:
{noformat}
ld: error: can't create dynamic relocation R_X86_64_DTPOFF32 against symbol: 
id::UUID::random()::generator in readonly segment; recompile object files with 
-fPIC or pass '-Wl,-z,notext' to allow text relocations in the output
>>> defined in 
>>> ./.libs/libmesos_no_3rdparty.a(libmesos_no_3rdparty_la-checker_process.o)
>>> referenced by uuid.hpp:43 (../../3rdparty/stout/include/stout/uuid.hpp:43)
>>>   
>>> lt15-libmesos_no_3rdparty_la-manager.o:(mesos::internal::ResourceProviderManagerProcess::newResourceProviderId())
>>>  in archive ./.libs/libmesos_no_3rdparty.a

ld: error: too many errors emitted, stopping now (use -error-limit=0 to see all 
errors)
{noformat}

Both the linker and compiler flags already included `-fPIC`, so this part of 
the error message seems bogus.

I'm not sure if this is an issue of the compiler generating invalid object files 
or the linker misunderstanding the created artifacts. However, the symbol 
`id::UUID::random()::generator` is a very special case because it is a 
function-local static in an inline function, causing gcc to generate a special 
`GNU_UNIQUE` symbol, and also a thread-local variable leading to the DTPOFF32 
relocation.

It seems like this combination of uncommon things is somehow tripping up one of 
the involved tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9302) Mesos fails to build on Fedora 28

2018-10-09 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9302:
--

 Summary: Mesos fails to build on Fedora 28
 Key: MESOS-9302
 URL: https://issues.apache.org/jira/browse/MESOS-9302
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Trying to compile a fresh Mesos checkout on a Fedora 28 system with the 
following configuration flags:
{noformat}
../configure --enable-debug --enable-optimize --disable-java --disable-python 
--disable-libtool-wrappers --enable-ssl --enable-libevent --disable-werror
{noformat}
and the following compiler
{noformat}
[bev...@core1.hw.ca1 build]$ gcc --version
gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
{noformat}
fails the build due to two warnings (even though --disable-werror was passed):
{noformat}
make[4]: Entering directory '/home/bevers/mesos/build/3rdparty/grpc-1.10.0'
[C]   Compiling third_party/cares/cares/ares_init.c
third_party/cares/cares/ares_init.c: In function ‘ares_dup’:
third_party/cares/cares/ares_init.c:301:17: error: argument to ‘sizeof’ in 
‘strncpy’ call is the same expression as the source; did you mean to use the 
size of the destination? [-Werror=sizeof-pointer-memaccess]
   sizeof(src->local_dev_name));
 ^
third_party/cares/cares/ares_init.c: At top level:
cc1: error: unrecognized command line option ‘-Wno-invalid-source-encoding’ 
[-Werror]
cc1: all warnings being treated as errors
make[4]: *** [Makefile:2635: 
/home/bevers/mesos/build/3rdparty/grpc-1.10.0/objs/opt/third_party/cares/cares/ares_init.o]
 Error 1
{noformat}
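
The first of the two diagnostics is gcc 8's `-Wsizeof-pointer-memaccess` 
heuristic: it fires whenever the `sizeof` argument of `strncpy` is literally 
the same expression as the source argument, on the assumption that the 
destination size was meant. A minimal illustration (not the actual c-ares code):

{noformat}
// g++ -Wall -Werror -c example.cpp
#include <cstring>

struct Channel { char local_dev_name[32]; };

void dup_name(Channel* dst, const Channel* src)
{
  // gcc 8 warns here: the sizeof expression matches the *source* argument.
  // Writing sizeof(dst->local_dev_name) instead silences the warning.
  std::strncpy(dst->local_dev_name, src->local_dev_name,
               sizeof(src->local_dev_name));
}
{noformat}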



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9292) Rejected quotas should include a reason in their error message

2018-10-04 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9292:
--

 Summary: Rejected quotas should include a reason in their error 
message
 Key: MESOS-9292
 URL: https://issues.apache.org/jira/browse/MESOS-9292
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


If we reject a quota request due to not having enough available resources, we 
fail with the following error:
{noformat}
Not enough available cluster capacity to reasonably satisfy quota
request; the force flag can be used to override this check
{noformat}

but we don't print *which* resource was not available. This can be confusing to 
operators when quota was requested for multiple resources at once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9286) ZooKeeperTest.LeaderContender is flaky

2018-10-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9286:
--

 Summary: ZooKeeperTest.LeaderContender is flaky
 Key: MESOS-9286
 URL: https://issues.apache.org/jira/browse/MESOS-9286
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Observed in an internal CI run in a Mac environment.
{noformat}
../../src/tests/zookeeper_tests.cpp:307
Failed to wait 15secs for lostCandidacy
{noformat}

Sadly, the full build log was lost before it could be investigated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9285) DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithAbsolutePathVolume is flaky

2018-10-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9285:
--

 Summary: 
DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithAbsolutePathVolume
 is flaky
 Key: MESOS-9285
 URL: https://issues.apache.org/jira/browse/MESOS-9285
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Observed in an internal CI run (4432) in a Debian 8 environment:
{noformat}
../../src/tests/containerizer/docker_volume_isolator_tests.cpp:947
Failed to wait 15secs for statusStarting
{noformat}

Sadly, the full log seems to have been lost before it could be investigated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7217) CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky.

2018-10-02 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635541#comment-16635541
 ] 

Benno Evers commented on MESOS-7217:


Observed again today in run 4432 in a CentOS 7 environment.

> CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky.
> 
>
> Key: MESOS-7217
> URL: https://issues.apache.org/jira/browse/MESOS-7217
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1
> Environment: ubuntu-14.04
>Reporter: Till Toenshoff
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere, test
>
> The test CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs appears to be flaky 
> on Ubuntu 14.04.
> When failing, the test shows the following:
> {noformat}
> 14:05:48  [ RUN  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
> 14:05:48  I0306 14:05:48.704794 27340 cluster.cpp:158] Creating default 
> 'local' authorizer
> 14:05:48  I0306 14:05:48.716588 27340 leveldb.cpp:174] Opened db in 
> 11.681905ms
> 14:05:48  I0306 14:05:48.718921 27340 leveldb.cpp:181] Compacted db in 
> 2.309404ms
> 14:05:48  I0306 14:05:48.718945 27340 leveldb.cpp:196] Created db iterator in 
> 3075ns
> 14:05:48  I0306 14:05:48.718951 27340 leveldb.cpp:202] Seeked to beginning of 
> db in 558ns
> 14:05:48  I0306 14:05:48.718955 27340 leveldb.cpp:271] Iterated through 0 
> keys in the db in 257ns
> 14:05:48  I0306 14:05:48.718966 27340 replica.cpp:776] Replica recovered with 
> log positions 0 -> 0 with 1 holes and 0 unlearned
> 14:05:48  I0306 14:05:48.719113 27361 recover.cpp:451] Starting replica 
> recovery
> 14:05:48  I0306 14:05:48.719172 27361 recover.cpp:477] Replica is in EMPTY 
> status
> 14:05:48  I0306 14:05:48.719460 27361 replica.cpp:673] Replica in EMPTY 
> status received a broadcasted recover request from 
> __req_res__(6807)@10.179.217.143:53643
> 14:05:48  I0306 14:05:48.719537 27363 recover.cpp:197] Received a recover 
> response from a replica in EMPTY status
> 14:05:48  I0306 14:05:48.719625 27365 recover.cpp:568] Updating replica 
> status to STARTING
> 14:05:48  I0306 14:05:48.720384 27361 master.cpp:380] Master 
> cb9586dc-a080-41eb-b5b8-88274f84a20a (ip-10-179-217-143.ec2.internal) started 
> on 10.179.217.143:53643
> 14:05:48  I0306 14:05:48.720404 27361 master.cpp:382] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/tzyTvK/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/tzyTvK/master" --zk_session_timeout="10secs"
> 14:05:48  I0306 14:05:48.720553 27361 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> 14:05:48  I0306 14:05:48.720559 27361 master.cpp:446] Master only allowing 
> authenticated agents to register
> 14:05:48  I0306 14:05:48.720562 27361 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> 14:05:48  I0306 14:05:48.720566 27361 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/tzyTvK/credentials'
> 14:05:48  I0306 14:05:48.720655 27361 master.cpp:504] Using default 'crammd5' 
> authenticator
> 14:05:48  I0306 14:05:48.720700 27361 http.cpp:887] Using default 'basic' 
> HTTP authenticator for realm 'mesos-master-readonly'
> 14:05:48  I0306 14:05:48.720767 27361 http.cpp:887] Using default 'basic' 
> HTTP authenticator for realm 'mesos-master-readwrite'
> 14:05:48  I0306 14:05:48.720808 27361 http.cpp:887] Using default 'basic' 
> HTTP authenticator for realm 'mesos-master-scheduler'
> 14:05:48  I0306 14:05:48.720875 27361 master.cpp:584] Authorization enabled
> 14:05:48  I0306 14:05:48.720995 27360 whitelist_watcher.cpp:77] No whitelist 
> given
> 14:05:48  I0306 14:05:48.721005 27364 hierarchical.cpp:149] Initialized 
> hierarchical allocator pr

[jira] [Created] (MESOS-9280) Allow specification of static reservations relative to the total resources

2018-10-01 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9280:
--

 Summary: Allow specification of static reservations relative to 
the total resources
 Key: MESOS-9280
 URL: https://issues.apache.org/jira/browse/MESOS-9280
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


The current user interface for creating static reservations is described here:

http://mesos.apache.org/documentation/latest/reservation/

In summary, to create a static reservation, an operator needs to subdivide the 
available resources on an agent into reserved and unreserved resources, like 
this:
{noformat}
mesos-slave --resources="cpus:4;mem:2048;cpus(ads):8;mem(ads):4096" [...]
{noformat}

However, this can result in some awkward interactions when trying to change 
static reservations:

1) *Requirement of an explicit upper bound*. By default, an agent will offer 
all CPUs and all memory of its host machine. However, an agent with the above 
configuration running on a machine with e.g. 32 CPUs will still only offer 12 
of them, 8 for `ads` and 4 for general use.

For an operator planning to deploy configuration to a diverse set of machines, 
it seems necessary to write a script that queries the total amount of available 
resources and to re-run it periodically to capture hardware changes, 
duplicating functionality that Mesos already offers out of the box.

2) *Interaction with ranges*. A configuration like
{noformat}
mesos-slave --resources="ports:[0-32655];ports(__internal):[22-22]" [...]
{noformat}
will lead to the master still offering port 22 to all frameworks, because the 
master thinks that the reserved port is an additional item of the "ports" 
resource.
On the other hand, a configuration like
{noformat}
mesos-slave --resources="ports(__internal):[22-22]" [...]
{noformat}
leaves the master knowing only about the existence of the single, reserved port 
22.

Again, for an operator planning to reserve this port across a range of diverse 
agents, the only way seems to be writing a script that parses the existing 
configuration and then slices up the ranges like this:
{noformat}
mesos-slave --resources="ports:[0-21],[23-32655];ports(__internal):[22-22]" 
[...]
{noformat}


Ideally, it would be possible to specify static reservations as a subtraction 
from the total, i.e. being able to say "Reserve 4 GiB of memory for role X" 
instead of "Reserve 4 GiB for role X and 4 GiB for general use".

Doing so would probably require introducing some additional syntax to the 
resource specification strings.
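
One conceivable shape for such syntax, purely hypothetical and not implemented 
anywhere, would be a marker that carves the reservation out of the 
auto-detected total instead of defining an absolute split:

{noformat}
# Hypothetical syntax, for illustration only: '~' would mean "subtract this
# reservation from the detected total and offer the remainder unreserved".
mesos-slave --resources="mem(ads):~4096;ports(__internal):~[22-22]" [...]
{noformat}

With something along these lines, the agent would keep auto-detecting its 
totals, and both problems described above (the explicit upper bound and the 
range slicing) would disappear.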



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9276) SlaveRecoveryTest/0.Reboot is flaky

2018-09-28 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9276:
--

 Summary: SlaveRecoveryTest/0.Reboot is flaky
 Key: MESOS-9276
 URL: https://issues.apache.org/jira/browse/MESOS-9276
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Observed in an internal CI run (4502):
{noformat}
../../src/tests/slave_recovery_tests.cpp:2746: Failure
Failed to wait 15secs for executorStatus
{noformat}

Full log:
{noformat}
[ RUN  ] SlaveRecoveryTest/0.Reboot
I0927 12:33:33.620496 2560127808 cluster.cpp:173] Creating default 'local' 
authorizer
I0927 12:33:33.621817 75808768 master.cpp:413] Master 
b351e786-2364-4c2e-bb10-1efc3c97e509 (Jenkinss-Mac-mini.local) started on 
10.0.49.4:65455
I0927 12:33:33.621845 75808768 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/DW8BvT/credentials"
 --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/DW8BvT/master"
 --zk_session_timeout="10secs"
I0927 12:33:33.622007 75808768 master.cpp:465] Master only allowing 
authenticated frameworks to register
I0927 12:33:33.622015 75808768 master.cpp:471] Master only allowing 
authenticated agents to register
I0927 12:33:33.622020 75808768 master.cpp:477] Master only allowing 
authenticated HTTP frameworks to register
I0927 12:33:33.622026 75808768 credentials.hpp:37] Loading credentials for 
authentication from 
'/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/DW8BvT/credentials'
I0927 12:33:33.622184 75808768 master.cpp:521] Using default 'crammd5' 
authenticator
I0927 12:33:33.622243 75808768 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0927 12:33:33.622328 75808768 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0927 12:33:33.622391 75808768 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0927 12:33:33.622442 75808768 master.cpp:602] Authorization enabled
I0927 12:33:33.622640 74735616 whitelist_watcher.cpp:77] No whitelist given
I0927 12:33:33.622643 75272192 hierarchical.cpp:182] Initialized hierarchical 
allocator process
I0927 12:33:33.624191 77418496 master.cpp:2083] Elected as the leading master!
I0927 12:33:33.624217 77418496 master.cpp:1638] Recovering from registrar
I0927 12:33:33.624264 76881920 registrar.cpp:339] Recovering registrar
I0927 12:33:33.624541 76881920 registrar.cpp:383] Successfully fetched the 
registry (0B) in 255232ns
I0927 12:33:33.624619 76881920 registrar.cpp:487] Applied 1 operations in 
27286ns; attempting to update the registry
I0927 12:33:33.624822 76881920 registrar.cpp:544] Successfully updated the 
registry in 172032ns
I0927 12:33:33.624892 76881920 registrar.cpp:416] Successfully recovered 
registrar
I0927 12:33:33.625068 75272192 master.cpp:1752] Recovered 0 agents from the 
registry (155B); allowing 10mins for agents to reregister
I0927 12:33:33.625089 77955072 hierarchical.cpp:220] Skipping recovery of 
hierarchical allocator: nothing to recover
I0927 12:33:33.626883 2560127808 containerizer.cpp:305] Using isolation { 
environment_secret, filesystem/posix, posix/mem, posix/cpu }
I0927 12:33:33.627074 2560127808 provisioner.cpp:298] Using default backend 
'copy'
W0927 12:33:33.628770 2560127808 process.cpp:2810] Attempted to spawn already 
running process files@10.0.49.4:65455
I0927 12:33:33.629148 2560127808 cluster.cpp:485] Creating default 'local' 
authorizer
I0927 12:33:33.630077 75272192 slave.cpp:267] Mesos agent started on 
(525)@10.0.49.4:65455
I0927 12:33:33.630103 75272192

[jira] [Commented] (MESOS-9079) Test MasterTestPrePostReservationRefinement.LaunchGroup/0 is flaky.

2018-09-28 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631771#comment-16631771
 ] 

Benno Evers commented on MESOS-9079:


Observed the same for the `/1` variant (run 4504):

{noformat}
[ RUN  ] bool/MasterTestPrePostReservationRefinement.LaunchGroup/1
I0927 16:41:07.341975 2560127808 cluster.cpp:173] Creating default 'local' 
authorizer
I0927 16:41:07.343353 96841728 master.cpp:413] Master 
d8823df0-8625-4d84-9980-2c64d226d6f8 (Jenkinss-Mac-mini.local) started on 
10.0.49.4:56698
I0927 16:41:07.343381 96841728 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/BKYcbZ/credentials"
 --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/BKYcbZ/master"
 --zk_session_timeout="10secs"
I0927 16:41:07.343574 96841728 master.cpp:465] Master only allowing 
authenticated frameworks to register
I0927 16:41:07.343582 96841728 master.cpp:471] Master only allowing 
authenticated agents to register
I0927 16:41:07.343588 96841728 master.cpp:477] Master only allowing 
authenticated HTTP frameworks to register
I0927 16:41:07.343603 96841728 credentials.hpp:37] Loading credentials for 
authentication from 
'/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/BKYcbZ/credentials'
I0927 16:41:07.343760 96841728 master.cpp:521] Using default 'crammd5' 
authenticator
I0927 16:41:07.343873 96841728 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0927 16:41:07.343940 96841728 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0927 16:41:07.344009 96841728 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0927 16:41:07.344054 96841728 master.cpp:602] Authorization enabled
I0927 16:41:07.344269 95232000 whitelist_watcher.cpp:77] No whitelist given
I0927 16:41:07.344270 93085696 hierarchical.cpp:182] Initialized hierarchical 
allocator process
I0927 16:41:07.345897 95232000 master.cpp:2083] Elected as the leading master!
I0927 16:41:07.345923 95232000 master.cpp:1638] Recovering from registrar
I0927 16:41:07.345969 93085696 registrar.cpp:339] Recovering registrar
I0927 16:41:07.346166 93085696 registrar.cpp:383] Successfully fetched the 
registry (0B) in 175872ns
I0927 16:41:07.346269 93085696 registrar.cpp:487] Applied 1 operations in 
23524ns; attempting to update the registry
I0927 16:41:07.346478 93085696 registrar.cpp:544] Successfully updated the 
registry in 183040ns
I0927 16:41:07.346536 93085696 registrar.cpp:416] Successfully recovered 
registrar
I0927 16:41:07.346678 93622272 master.cpp:1752] Recovered 0 agents from the 
registry (155B); allowing 10mins for agents to reregister
I0927 16:41:07.346702 94695424 hierarchical.cpp:220] Skipping recovery of 
hierarchical allocator: nothing to recover
W0927 16:41:07.349237 2560127808 process.cpp:2810] Attempted to spawn already 
running process files@10.0.49.4:56698
I0927 16:41:07.349918 2560127808 containerizer.cpp:305] Using isolation { 
environment_secret, filesystem/posix, posix/mem, posix/cpu }
I0927 16:41:07.350147 2560127808 provisioner.cpp:298] Using default backend 
'copy'
I0927 16:41:07.351030 2560127808 cluster.cpp:485] Creating default 'local' 
authorizer
I0927 16:41:07.352041 93622272 slave.cpp:267] Mesos agent started on 
(905)@10.0.49.4:56698
I0927 16:41:07.352071 93622272 slave.cpp:268] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://"; 
--appc_store_dir="/var/folders/6w/rw03zh013y38ys6cyn8qppf80

[jira] [Created] (MESOS-9273) DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithReadOnlyVolume is flaky

2018-09-27 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9273:
--

 Summary: 
DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithReadOnlyVolume 
is flaky
 Key: MESOS-9273
 URL: https://issues.apache.org/jira/browse/MESOS-9273
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Observed in an internal CI run (4499):
{noformat}
../../src/tests/containerizer/docker_volume_isolator_tests.cpp:1361
Failed to wait 15secs for statusStarting
{noformat}

Full log:
{noformat}
[ RUN  ] 
DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithReadOnlyVolume
I0927 01:52:53.770812 13860 cluster.cpp:173] Creating default 'local' authorizer
I0927 01:52:53.771752  3593 master.cpp:413] Master 
1c890578-e87d-41a2-bb4c-5ed9b7e0d8ec (ip-172-16-10-139.ec2.internal) started on 
172.16.10.139:46305
I0927 01:52:53.771773  3593 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/X4P8mF/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/X4P8mF/master" --zk_session_timeout="10secs"
I0927 01:52:53.771903  3593 master.cpp:465] Master only allowing authenticated 
frameworks to register
I0927 01:52:53.771914  3593 master.cpp:471] Master only allowing authenticated 
agents to register
I0927 01:52:53.771920  3593 master.cpp:477] Master only allowing authenticated 
HTTP frameworks to register
I0927 01:52:53.771926  3593 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/X4P8mF/credentials'
I0927 01:52:53.771996  3593 master.cpp:521] Using default 'crammd5' 
authenticator
I0927 01:52:53.772053  3593 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0927 01:52:53.772120  3593 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0927 01:52:53.772158  3593 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0927 01:52:53.772189  3593 master.cpp:602] Authorization enabled
I0927 01:52:53.772347  3597 hierarchical.cpp:182] Initialized hierarchical 
allocator process
I0927 01:52:53.772367  3594 whitelist_watcher.cpp:77] No whitelist given
I0927 01:52:53.773003  3594 master.cpp:2083] Elected as the leading master!
I0927 01:52:53.773023  3594 master.cpp:1638] Recovering from registrar
I0927 01:52:53.773063  3594 registrar.cpp:339] Recovering registrar
I0927 01:52:53.773201  3596 registrar.cpp:383] Successfully fetched the 
registry (0B) in 117760ns
I0927 01:52:53.773241  3596 registrar.cpp:487] Applied 1 operations in 8146ns; 
attempting to update the registry
I0927 01:52:53.773360  3596 registrar.cpp:544] Successfully updated the 
registry in 102912ns
I0927 01:52:53.773396  3596 registrar.cpp:416] Successfully recovered registrar
I0927 01:52:53.773474  3596 master.cpp:1752] Recovered 0 agents from the 
registry (176B); allowing 10mins for agents to reregister
I0927 01:52:53.773562  3597 hierarchical.cpp:220] Skipping recovery of 
hierarchical allocator: nothing to recover
I0927 01:52:53.774943 13860 isolator.cpp:144] Initialized the docker volume 
information root directory at '/run/mesos/isolators/docker/volume'
I0927 01:52:53.776796 13860 linux_launcher.cpp:144] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
sh: 1: hadoop: not found
I0927 01:52:53.859550 13860 fetcher.cpp:66] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Hadoop 
client is not available, exit status: 32512
I0927 01:52:53.859833 13860 registry_puller.cpp:128] Creating registry puller 
with docker registry 'https://registry-1.docker.io'
I0927 01:52:53.860913 138

[jira] [Created] (MESOS-9272) SlaveTest.DefaultExecutorCommandInfo is flaky

2018-09-27 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9272:
--

 Summary: SlaveTest.DefaultExecutorCommandInfo is flaky
 Key: MESOS-9272
 URL: https://issues.apache.org/jira/browse/MESOS-9272
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Observed in an internal CI run (4499):
{noformat}
../../src/tests/cluster.cpp:697
Value of: containers->empty()
  Actual: false
Expected: true
Failed to destroy containers: { 743f1b4c-8ce0-4fd4-b952-a7bbc9788775 }
{noformat}

Full log:
{noformat}
[ RUN  ] SlaveTest.DefaultExecutorCommandInfo
I0927 01:48:44.246218 11015 cluster.cpp:173] Creating default 'local' authorizer
I0927 01:48:44.247200 11037 master.cpp:413] Master 
56a99d2f-f8c8-4d21-a8f7-df452833cce0 (ip-172-16-10-254.ec2.internal) started on 
172.16.10.254:33398
I0927 01:48:44.247223 11037 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/7SQ2cR/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/7SQ2cR/master" --zk_session_timeout="10secs"
I0927 01:48:44.247354 11037 master.cpp:465] Master only allowing authenticated 
frameworks to register
I0927 01:48:44.247364 11037 master.cpp:471] Master only allowing authenticated 
agents to register
I0927 01:48:44.247370 11037 master.cpp:477] Master only allowing authenticated 
HTTP frameworks to register
I0927 01:48:44.247375 11037 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/7SQ2cR/credentials'
I0927 01:48:44.247453 11037 master.cpp:521] Using default 'crammd5' 
authenticator
I0927 01:48:44.247488 11037 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0927 01:48:44.247519 11037 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0927 01:48:44.247541 11037 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0927 01:48:44.247668 11037 master.cpp:602] Authorization enabled
I0927 01:48:44.247741 11036 hierarchical.cpp:182] Initialized hierarchical 
allocator process
I0927 01:48:44.247782 11036 whitelist_watcher.cpp:77] No whitelist given
I0927 01:48:44.248339 11036 master.cpp:2083] Elected as the leading master!
I0927 01:48:44.248358 11036 master.cpp:1638] Recovering from registrar
I0927 01:48:44.248430 11036 registrar.cpp:339] Recovering registrar
I0927 01:48:44.248623 11037 registrar.cpp:383] Successfully fetched the 
registry (0B) in 168960ns
I0927 01:48:44.248658 11037 registrar.cpp:487] Applied 1 operations in 6362ns; 
attempting to update the registry
I0927 01:48:44.248767 11037 registrar.cpp:544] Successfully updated the 
registry in 94208ns
I0927 01:48:44.248795 11037 registrar.cpp:416] Successfully recovered registrar
I0927 01:48:44.248880 11036 hierarchical.cpp:220] Skipping recovery of 
hierarchical allocator: nothing to recover
I0927 01:48:44.248901 11037 master.cpp:1752] Recovered 0 agents from the 
registry (176B); allowing 10mins for agents to reregister
W0927 01:48:44.250870 11015 process.cpp:2810] Attempted to spawn already 
running process files@172.16.10.254:33398
I0927 01:48:44.251050 11015 cluster.cpp:485] Creating default 'local' authorizer
I0927 01:48:44.251428 11035 slave.cpp:267] Mesos agent started on 
(662)@172.16.10.254:33398
I0927 01:48:44.251672 11015 scheduler.cpp:189] Version: 1.8.0
I0927 01:48:44.251443 11035 slave.cpp:268] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://"; 
--appc_store_dir="/tmp/SlaveTest_DefaultExecutorCommandInfo_DsiR0M/store/appc" 
--authenticate_http_executors="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticatee="cra

[jira] [Created] (MESOS-9271) DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP is flaky

2018-09-27 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9271:
--

 Summary: 
DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP
 is flaky
 Key: MESOS-9271
 URL: https://issues.apache.org/jira/browse/MESOS-9271
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Observed in an internal CI run (4498):
{noformat}
../../src/tests/health_check_tests.cpp:2080
Failed to wait 15secs for statusHealthy
{noformat}

Full log:
{noformat}
[ RUN  ] 
NetworkProtocol/DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP/1
I0927 00:57:43.336710 27845 docker.cpp:1659] Running docker -H 
unix:///var/run/docker.sock inspect zhq527725/https-server:latest
I0927 00:57:43.340283 27845 docker.cpp:1659] Running docker -H 
unix:///var/run/docker.sock inspect alpine:latest
I0927 00:57:43.343433 27845 docker.cpp:1659] Running docker -H 
unix:///var/run/docker.sock inspect alpine:latest
I0927 00:57:43.857142 27845 cluster.cpp:173] Creating default 'local' authorizer
I0927 00:57:43.858705 19628 master.cpp:413] Master 
f9e9ac63-826d-4d08-b216-c5f352afc25d (ip-172-16-10-217.ec2.internal) started on 
172.16.10.217:32836
I0927 00:57:43.858727 19628 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/QIaitl/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/QIaitl/master" --zk_session_timeout="10secs"
I0927 00:57:43.858912 19628 master.cpp:465] Master only allowing authenticated 
frameworks to register
I0927 00:57:43.858942 19628 master.cpp:471] Master only allowing authenticated 
agents to register
I0927 00:57:43.858948 19628 master.cpp:477] Master only allowing authenticated 
HTTP frameworks to register
I0927 00:57:43.858955 19628 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/QIaitl/credentials'
I0927 00:57:43.859072 19628 master.cpp:521] Using default 'crammd5' 
authenticator
I0927 00:57:43.859141 19628 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0927 00:57:43.859200 19628 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0927 00:57:43.859246 19628 http.cpp:1037] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0927 00:57:43.859268 19628 master.cpp:602] Authorization enabled
I0927 00:57:43.859541 19629 hierarchical.cpp:182] Initialized hierarchical 
allocator process
I0927 00:57:43.859582 19629 whitelist_watcher.cpp:77] No whitelist given
I0927 00:57:43.860060 19628 master.cpp:2083] Elected as the leading master!
I0927 00:57:43.860078 19628 master.cpp:1638] Recovering from registrar
I0927 00:57:43.860117 19628 registrar.cpp:339] Recovering registrar
I0927 00:57:43.860285 19628 registrar.cpp:383] Successfully fetched the 
registry (0B) in 144128ns
I0927 00:57:43.860328 19628 registrar.cpp:487] Applied 1 operations in 8246ns; 
attempting to update the registry
I0927 00:57:43.860527 19624 registrar.cpp:544] Successfully updated the 
registry in 167168ns
I0927 00:57:43.860571 19624 registrar.cpp:416] Successfully recovered registrar
I0927 00:57:43.860698 19625 master.cpp:1752] Recovered 0 agents from the 
registry (176B); allowing 10mins for agents to reregister
I0927 00:57:43.860761 19625 hierarchical.cpp:220] Skipping recovery of 
hierarchical allocator: nothing to recover
W0927 00:57:43.863813 27845 process.cpp:2810] Attempted to spawn already 
running process files@172.16.10.217:32836
I0927 00:57:43.863989 27845 cluster.cpp:485] Creating default 'local' authorizer
I0927 00:57:43.864542 19628 slave.cpp:267] Mesos agent started on 
(1170)@172
