[jira] [Created] (MESOS-9776) Mention removal of *.json endpoints in 1.8.0 CHANGELOG
Benno Evers created MESOS-9776: -- Summary: Mention removal of *.json endpoints in 1.8.0 CHANGELOG Key: MESOS-9776 URL: https://issues.apache.org/jira/browse/MESOS-9776 Project: Mesos Issue Type: Improvement Reporter: Benno Evers We should mention in the CHANGELOG and upgrade notes that the *.json endpoints that were deprecated in Mesos 0.25 were actually removed in Mesos 1.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9761) Mesos UI does not properly account for resources set via `--default-role`
Benno Evers created MESOS-9761: -- Summary: Mesos UI does not properly account for resources set via `--default-role` Key: MESOS-9761 URL: https://issues.apache.org/jira/browse/MESOS-9761 Project: Mesos Issue Type: Improvement Reporter: Benno Evers Attachments: default_role_ui.png In our cluster, we have two agents configured with "--default_role=slave_public" and 64 cpus each, for a total of 128 cpus allocated to this role. The right side of the screenshot shows one of them. However, looking at the "Roles" tab in the Mesos UI, neither "Guarantee" nor "Limit" shows any resources for this role. See the attached screenshot for details. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9730) Executors cannot reconnect with agents using TLS1.3
[ https://issues.apache.org/jira/browse/MESOS-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829733#comment-16829733 ] Benno Evers commented on MESOS-9730: {noformat} commit 4fa4f77549b43285cac974111a5a3f28828a19d8 Author: Stéphane Cottin Date: Mon Apr 29 13:28:06 2019 +0200 Documented LIBPROCESS_SSL_ENABLE_TLS_V1_3. Updated documentation about `LIBPROCESS_SSL_ENABLE_TLS_V1_3` and TLS1.3. Review: https://reviews.apache.org/r/70563/ commit 712ee298800e257050d01b69abeaf3c4bc7d12ee Author: Stéphane Cottin Date: Mon Apr 29 13:27:04 2019 +0200 Added LIBPROCESS_SSL_ENABLE_TLS_V1_3 environment variable. When building mesos with libopenssl >= 1.1.1, TLS1.3 is enabled by default. This causes major communication issues between executors and agents. This patch adds a new `LIBPROCESS_SSL_ENABLE_TLS_V1_3` env var, disabled by default. It should be changed to enabled by default when full openssl >= 1.1 support will land. Review: https://reviews.apache.org/r/70562/ {noformat} Also backported the patches to 1.8.x branch. > Executors cannot reconnect with agents using TLS1.3 > --- > > Key: MESOS-9730 > URL: https://issues.apache.org/jira/browse/MESOS-9730 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.8.0 >Reporter: Stéphane Cottin >Assignee: Stéphane Cottin >Priority: Major > Labels: integration, ssl > > TLS 1.3 support is enabled by default from openssl >= 1.1.0 > Executors do not reconnect with agents after restart when using TLS 1.3, and > I guess this should also affect master/slave communication. > suggested action : > add a `LIBPROCESS_SSL_ENABLE_TLS_V1_3` environment variable with a `false` > default, and apply `SSL_OP_NO_TLSv1_3` ssl option when building with openssl > >= 1.1.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
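For reference, the mechanics behind the new variable are simple. A minimal sketch of how `SSL_OP_NO_TLSv1_3` can be applied when building against openssl >= 1.1.1 (the helper name and the flag parsing here are illustrative assumptions, not the actual libprocess code):
{code}
#include <openssl/ssl.h>

#include <cstdlib>
#include <string>

// Sketch: leave TLS 1.3 disabled unless LIBPROCESS_SSL_ENABLE_TLS_V1_3
// is explicitly set to "true". `SSL_OP_NO_TLSv1_3` only exists in
// openssl >= 1.1.1, hence the guard.
void maybeDisableTLSv1_3(SSL_CTX* ctx)
{
#ifdef SSL_OP_NO_TLSv1_3
  const char* enabled = std::getenv("LIBPROCESS_SSL_ENABLE_TLS_V1_3");
  if (enabled == nullptr || std::string(enabled) != "true") {
    // Off by default until full openssl >= 1.1 support lands.
    SSL_CTX_set_options(ctx, SSL_OP_NO_TLSv1_3);
  }
#endif
}
{code}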
[jira] [Issue Comment Deleted] (MESOS-3394) Pull in glog 0.3.6 (when it's released)
[ https://issues.apache.org/jira/browse/MESOS-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-3394: --- Comment: was deleted (was: www.rtat.net) > Pull in glog 0.3.6 (when it's released) > --- > > Key: MESOS-3394 > URL: https://issues.apache.org/jira/browse/MESOS-3394 > Project: Mesos > Issue Type: Task > Components: cmake >Reporter: Andrew Schwartzmeyer >Priority: Major > Labels: arm64, build, cmake, freebsd, mesosphere, windows > > To build on Windows, we have to build glog on Windows. But, glog doesn't > build on Windows, so we had to submit a patch to the project. So, to build on > Windows, we download the patched version directly from the pull request that > was sent to the glog repository on GitHub. > When these patches move upstream, we need to change this to point at the > "real" glog release instead of the pull request. > (For details see the `CMakeLists.txt` in `3rdparty/libprocess/3rdparty`.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9745) Re-enable validation of protobuf unions in `ContainerInfo`
Benno Evers created MESOS-9745: -- Summary: Re-enable validation of protobuf unions in `ContainerInfo` Key: MESOS-9745 URL: https://issues.apache.org/jira/browse/MESOS-9745 Project: Mesos Issue Type: Improvement Reporter: Benno Evers In MESOS-9740, we disabled protobuf union validation for `ContainerInfo` messages, since it was discovered that frameworks generating invalid protobuf of this kind currently exist in the wild. However, that is somewhat unsatisfactory since it re-enables the issue originally described in MESOS-6874, i.e. Mesos not rejecting tasks where the `ContainerInfo` was accidentally malformed. Ideally, we should implement a metric counting the number of tasks with malformed `ContainerInfo`s and re-enable validation after an appropriate warning period has passed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
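A minimal sketch of such a counter using the libprocess metrics API (the metric name below is made up for illustration; whatever name is chosen would show up in the master's /metrics/snapshot endpoint):
{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Hypothetical metric name -- counts tasks whose `ContainerInfo` fails
// the union validation but is (for now) tolerated.
process::metrics::Counter malformedContainerInfos(
    "master/tasks_with_malformed_container_info");

void registerMetric()
{
  process::metrics::add(malformedContainerInfos);
}

// Called from task validation wherever a malformed `ContainerInfo`
// is detected.
void countMalformedContainerInfo()
{
  ++malformedContainerInfos;
}
{code}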
[jira] [Commented] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters
[ https://issues.apache.org/jira/browse/MESOS-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826149#comment-16826149 ] Benno Evers commented on MESOS-9740: Preliminary review: https://reviews.apache.org/r/70538/ > Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents > from reregistering with 1.8+ masters > --- > > Key: MESOS-9740 > URL: https://issues.apache.org/jira/browse/MESOS-9740 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.8.0 >Reporter: Joseph Wu >Assignee: Benno Evers >Priority: Blocker > Labels: foundations, mesosphere > > As part of MESOS-6874, the master now validates protobuf unions passed as > part of an {{ExecutorInfo::ContainerInfo}}. This prevents a task from > specifying, for example, a {{ContainerInfo::MESOS}}, but filling out the > {{docker}} field (which is then ignored by the agent). > However, if a task was already launched with an invalid protobuf union, the > same validation will happen when the agent tries to reregister with the > master. In this case, if the master is upgraded to validate protobuf unions, > the agent reregistration will be rejected. > {code} > master.cpp:7201] Dropping re-registration of agent at > slave(1)@172.31.47.126:5051 because it sent an invalid re-registration: > Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the > field `docker` set. > {code} > This bug was found when upgrading a 1.7.x test cluster to 1.8.0. When > MESOS-6874 was committed, I had assumed the invalid protobufs would be rare. > However, on the test cluster, 13/17 agents had at least one invalid > ContainerInfo when reregistering. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
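The validation in question boils down to a check of this shape (an illustrative sketch using the error text quoted above, not the actual master code):
{code}
#include <mesos/mesos.pb.h>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

// Sketch of the protobuf union validation: the field that is filled in
// must match the declared `type`.
Option<Error> validateContainerInfoUnion(const mesos::ContainerInfo& info)
{
  if (info.type() == mesos::ContainerInfo::MESOS && info.has_docker()) {
    return Error(
        "Protobuf union `mesos.ContainerInfo` with `Type == MESOS`"
        " should not have the field `docker` set");
  }

  if (info.type() == mesos::ContainerInfo::DOCKER && info.has_mesos()) {
    return Error(
        "Protobuf union `mesos.ContainerInfo` with `Type == DOCKER`"
        " should not have the field `mesos` set");
  }

  return None();
}
{code}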
[jira] [Created] (MESOS-9736) Error building libgrpc++ on Mac from a source tarball
Benno Evers created MESOS-9736: -- Summary: Error building libgrpc++ on Mac from a source tarball Key: MESOS-9736 URL: https://issues.apache.org/jira/browse/MESOS-9736 Project: Mesos Issue Type: Bug Reporter: Benno Evers The following error was reported by [~tillt] trying to build the `1.8.0-rc2` release candidate on a MacOS machine: {noformat} make[2]: *** No rule to make target `../3rdparty/grpc-1.10.0/libs/opt/libgrpc++.a', needed by `libmesos.la'. Stop. {noformat} Looking into the issue, the following theory was offered for the cause of the problem: {quote} I have the hunch that this isn't a macOS thing but instead a problem in our build setup which does (not intentionally) try to do certain things in parallel. {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9732) Python installation using `make install` fails inside a symlinked directory
Benno Evers created MESOS-9732: -- Summary: Python installation using `make install` fails inside a symlinked directory Key: MESOS-9732 URL: https://issues.apache.org/jira/browse/MESOS-9732 Project: Mesos Issue Type: Bug Reporter: Benno Evers I used to have a symlink pointing from `~/mesos` to `~/src/mesos`. Then I attempted to `make install` from inside the `~/mesos/worktrees/release` directory on a build with python bindings enabled. Now I don't have a symlink anymore. {noformat} bevers@poincare:~$ ls ~/src/mesos 3rdparty compile install-sh mpi aclocal.m4 config.guess LICENSE NOTICE ar-lib config.sub ltmain.sh README.md autom4te.cache configure m4 site bin configure.ac Makefile.am src bootstrap depcomp Makefile.in support bootstrap.bat docs mesos.pc.in worktrees CHANGELOG Doxyfile mesos.sublime-project cmake etc_issue_orig mesos.sublime-workspace CMakeLists.txt include missing bevers@poincare:~$ ls ~/mesos worktrees bevers@poincare:~$ ls ~/mesos/worktrees/release/build/src/python/dist mesos-1.8.0-py2.7.egg mesos-1.8.0-py2-none-any.whl mesos.cli-1.8.0-py2.7.egg mesos.cli-1.8.0-py2-none-any.whl mesos.executor-1.8.0-cp27-none-linux_x86_64.whl mesos.executor-1.8.0-py2.7-linux-x86_64.egg mesos.interface-1.8.0-py2.7.egg mesos.interface-1.8.0-py2-none-any.whl mesos.native-1.8.0-py2.7.egg mesos.native-1.8.0-py2-none-any.whl mesos.scheduler-1.8.0-cp27-none-linux_x86_64.whl mesos.scheduler-1.8.0-py2.7-linux-x86_64.egg {noformat} The installation itself also fails with a predictable error: {noformat} OSError: [Errno 2] No such file or directory: '/home/bevers/mesos/worktrees/release/build/../src/python/executor/src/mesos/executor' {noformat} Leaving the system in a funny state as a side effect: {noformat} bevers@poincare:~/mesos/worktrees/release/build$ ls . 3rdparty bin config.log config.lt config.status description-pak include libtool Makefile mesos.pc mpi src bevers@poincare:~/mesos/worktrees/release/build$ ls `pwd` src {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9697) Release RPMs are not uploaded to bintray
[ https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814545#comment-16814545 ] Benno Evers edited comment on MESOS-9697 at 4/11/19 9:12 AM: - After some investigation, here's my current understanding of the situation: * The ASF Jenkins is successfully running the `Mesos/Packaging/CentOS` job ( https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/ ) to a branch that contains the file `support/jenkins/Jenkinsfile-packaging-centos`, i.e. currently branches 1.7.x, 1.8.x and master. This jenkinsfile creates rpm packages for centos 6 and 7 as artifacts (using the script `support/packaging/centos/build-rpm-docker.sh`), but does not do anything with them, i.e. there is no connection to bintray. I don't know if there is any public download for the generated artifacts. * There is another job `Mesos/Packaging/CentosRPMs` (https://builds.apache.org/job/Mesos/job/Packaging/job/CentosRPMs) defined in the ASF Jenkins that is not run automatically. For its setup, it's using the file `support/packaging/Jenkinsfile` from branch `bintray` on `http://github.com/karya0/mesos.git`. It is taking parameters `MESOS_RELEASE` and `MESOS_TAG` and will build centos 6/7 rpm packages for that release (I still don't understand where exactly it's taking the source code from) and afterwards upload them to bintray using credentials "karya_bintray_credentials". It was last run by [~karya] on Feb 8, 2018 to produce Mesos 1.5.0 packages. So it looks like this might not actually be broken, but rather just release managers not being aware that they are supposed to manually run this Jenkins job. I'd like to test that theory by triggering a 1.7.0 build of the latter job, but I don't seem to have permissions to do that on the ASF Jenkins. was (Author: bennoe): After some investigation, here's my current understanding of the situation: * The ASF Jenkins is successfully running the `Mesos/Packaging/CentOS` job ( https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/ ) to a branch that contains the file `support/jenkins/Jenkinsfile-packaging-centos`, i.e. currently branches 1.7.x, 1.8.x and master. This jenkinsfile creates rpm packages for centos 6 and 7 as artifacts (using the script `support/packaging/centos/build-rpm-docker.sh`), but does not do anything with them, i.e. there is no connection to bintray. I don't know if there is any public download for the generated artifacts. * The is another job `Mesos/Packaging/CentosRPMs` (https://builds.apache.org/job/Mesos/job/Packaging/job/CentosRPMs) defined in the ASF Jenkins that is not run manually. For its setup, its using the file `support/packaging/Jenkinsfile` from branch `bintray` on `http://github.com/karya0/mesos.git`. It is taking parameters `MESOS_RELEASE` and `MESOS_TAG` and will build centos 6/7 rpm packages for that release (I still don't understand where exactly it's taking the source code from) and afterwards upload them to bintray using credentials "karya_bintray_credentials". It was last run by [~karya] on Feb 8, 2018 to produce Mesos 1.5.0 packages. So it looks like this might not actually be broken, but rather just release managers not being aware that they are supposed to manually run this Jenkins job. I'd like to test that theory by triggering a 1.7.0 build of the latter job, but I don't seem to have permissions to do that on the ASF Jenkins. 
> Release RPMs are not uploaded to bintray > > > Key: MESOS-9697 > URL: https://issues.apache.org/jira/browse/MESOS-9697 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.6.2, 1.7.2, 1.8.0 >Reporter: Benjamin Bannier >Assignee: Benno Evers >Priority: Critical > Labels: foundations, integration, jenkins, packaging, rpm > > While we currently build release RPMs, e.g., > [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/], > these artifacts are not uploaded to bintray. Due to that RPM links on the > downloads page [http://mesos.apache.org/downloads/] are broken. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9697) Release RPMs are not uploaded to bintray
[ https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814545#comment-16814545 ] Benno Evers commented on MESOS-9697: After some investigation, here's my current understanding of the situation: * The ASF Jenkins is successfully running the `Mesos/Packaging/CentOS` job ( https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/ ) to a branch that contains the file `support/jenkins/Jenkinsfile-packaging-centos`, i.e. currently branches 1.7.x, 1.8.x and master. This jenkinsfile creates rpm packages for centos 6 and 7 as artifacts (using the script `support/packaging/centos/build-rpm-docker.sh`), but does not do anything with them, i.e. there is no connection to bintray. I don't know if there is any public download for the generated artifacts. * There is another job `Mesos/Packaging/CentosRPMs` (https://builds.apache.org/job/Mesos/job/Packaging/job/CentosRPMs) defined in the ASF Jenkins that is not run manually. For its setup, it's using the file `support/packaging/Jenkinsfile` from branch `bintray` on `http://github.com/karya0/mesos.git`. It is taking parameters `MESOS_RELEASE` and `MESOS_TAG` and will build centos 6/7 rpm packages for that release (I still don't understand where exactly it's taking the source code from) and afterwards upload them to bintray using credentials "karya_bintray_credentials". It was last run by [~karya] on Feb 8, 2018 to produce Mesos 1.5.0 packages. So it looks like this might not actually be broken, but rather just release managers not being aware that they are supposed to manually run this Jenkins job. I'd like to test that theory by triggering a 1.7.0 build of the latter job, but I don't seem to have permissions to do that on the ASF Jenkins. > Release RPMs are not uploaded to bintray > > > Key: MESOS-9697 > URL: https://issues.apache.org/jira/browse/MESOS-9697 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.6.2, 1.7.2, 1.8.0 >Reporter: Benjamin Bannier >Assignee: Benno Evers >Priority: Critical > Labels: foundations, integration, jenkins, packaging, rpm > > While we currently build release RPMs, e.g., > [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/], > these artifacts are not uploaded to bintray. Due to that RPM links on the > downloads page [http://mesos.apache.org/downloads/] are broken. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9697) Release RPMs are not uploaded to bintray
[ https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812435#comment-16812435 ] Benno Evers commented on MESOS-9697: Changing priority to "Critical", since this does not have an associated target version (and is thus, technically, not blocking any release). > Release RPMs are not uploaded to bintray > > > Key: MESOS-9697 > URL: https://issues.apache.org/jira/browse/MESOS-9697 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.6.2, 1.7.2, 1.8.0 >Reporter: Benjamin Bannier >Priority: Blocker > Labels: integration, jenkins, packaging, rpm > > While we currently build release RPMs, e.g., > [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/], > these artifacts are not uploaded to bintray. Due to that RPM links on the > downloads page [http://mesos.apache.org/downloads/] are broken. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9565) Unit tests for destroying persistent volumes in SLRP.
[ https://issues.apache.org/jira/browse/MESOS-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812423#comment-16812423 ] Benno Evers commented on MESOS-9565: Status summary: the first 6 reviews of the chain posted above have been submitted; the remaining two are still pending due to the following review comment by [~bbannier]: {quote} These tests seem to have issues when executed under load. When putting extra stress on the system with stress-ng I was able to get e.g., CreateDestroyPersistentVolume to break after only 4 iterations {quote} > Unit tests for destroying persistent volumes in SLRP. > - > > Key: MESOS-9565 > URL: https://issues.apache.org/jira/browse/MESOS-9565 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Labels: mesosphere, storage > > The plan is to add/update the following unit tests to test persistent volume > destroy: > * CreateDestroyDisk > * CreateDestroyDiskWithRecovery > * CreateDestroyPersistentMountVolume > * CreateDestroyPersistentMountVolumeWithRecovery > * CreateDestroyPersistentMountVolumeWithReboot > * CreateDestroyPersistentBlockVolume > * DestroyPersistentMountVolumeFailed > * DestroyUnpublishedPersistentVolume > * DestroyUnpublishedPersistentVolumeWithRecovery > * DestroyUnpublishedPersistentVolumeWithReboot > * RecoverPublishedPersistentVolumeFailed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9624) Bundle CSI spec v1.0 in Mesos.
[ https://issues.apache.org/jira/browse/MESOS-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812416#comment-16812416 ] Benno Evers edited comment on MESOS-9624 at 4/8/19 1:15 PM: Closing this since all related patches seem to have been landed. {noformat} commit 3da54965d02a6bf0e4806bf2d4acebb3310d60f7 Author: Chun-Hung Hsiao chhs...@mesosphere.io Date: Thu Mar 28 21:26:04 2019 -0700 Bundled CSI spec 1.1.0. Since the CSI v1 spec proto file depends on certain proto files in the Protobuf library, we have to ensure the Protobuf library's include path is in the proto paths of the `protoc` command when compiling the CSI spec proto file. Specifically in Autotools, this path is passed through the `PROTOBUF_PROTOCFLAGS` variable when building with an unbundled protobuf library. Review: https://reviews.apache.org/r/70360 {noformat} {noformat} commit 6ef64a3a6ff34975d58abbb0b78e2b402d39873c Author: Chun-Hung Hsiao chhs...@mesosphere.io Date: Thu Mar 28 22:14:32 2019 -0700 Added spec inclusion header and type helpers for CSI v1. Review: https://reviews.apache.org/r/70361 {noformat} was (Author: bennoe): Closing this since all related patches seem to have been landed. {noformat} commit 3da54965d02a6bf0e4806bf2d4acebb3310d60f7 Author: Chun-Hung Hsiao chhs...@mesosphere.io Date: Thu Mar 28 21:26:04 2019 -0700 Bundled CSI spec 1.1.0. Since the CSI v1 spec proto file depends on certain proto files in the Protobuf library, we have to ensure the Protobuf library's include path is in the proto paths of the `protoc` command when compiling the CSI spec proto file. Specifically in Autotools, this path is passed through the `PROTOBUF_PROTOCFLAGS` variable when building with an unbundled protobuf library. Review: https://reviews.apache.org/r/70360 {noformat} commit 6ef64a3a6ff34975d58abbb0b78e2b402d39873c Author: Chun-Hung Hsiao chhs...@mesosphere.io Date: Thu Mar 28 22:14:32 2019 -0700 Added spec inclusion header and type helpers for CSI v1. Review: https://reviews.apache.org/r/70361 {noformat} > Bundle CSI spec v1.0 in Mesos. > -- > > Key: MESOS-9624 > URL: https://issues.apache.org/jira/browse/MESOS-9624 > Project: Mesos > Issue Type: Task > Components: storage >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Critical > Labels: mesosphere, storage > Fix For: 1.8.0 > > > We need to bundle both CSI v0 and v1 in Mesos. This requires some redesign of > the source code filesystem layout. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9624) Bundle CSI spec v1.0 in Mesos.
[ https://issues.apache.org/jira/browse/MESOS-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812416#comment-16812416 ] Benno Evers commented on MESOS-9624: Closing this since all related patches seem to have been landed. {noformat} commit 3da54965d02a6bf0e4806bf2d4acebb3310d60f7 Author: Chun-Hung Hsiao chhs...@mesosphere.io Date: Thu Mar 28 21:26:04 2019 -0700 Bundled CSI spec 1.1.0. Since the CSI v1 spec proto file depends on certain proto files in the Protobuf library, we have to ensure the Protobuf library's include path is in the proto paths of the `protoc` command when compiling the CSI spec proto file. Specifically in Autotools, this path is passed through the `PROTOBUF_PROTOCFLAGS` variable when building with an unbundled protobuf library. Review: https://reviews.apache.org/r/70360 {noformat} commit 6ef64a3a6ff34975d58abbb0b78e2b402d39873c Author: Chun-Hung Hsiao chhs...@mesosphere.io Date: Thu Mar 28 22:14:32 2019 -0700 Added spec inclusion header and type helpers for CSI v1. Review: https://reviews.apache.org/r/70361 {noformat} > Bundle CSI spec v1.0 in Mesos. > -- > > Key: MESOS-9624 > URL: https://issues.apache.org/jira/browse/MESOS-9624 > Project: Mesos > Issue Type: Task > Components: storage >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Critical > Labels: mesosphere, storage > > We need to bundle both CSI v0 and v1 in Mesos. This requires some redesign of > the source code filesystem layout. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8257) Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path
[ https://issues.apache.org/jira/browse/MESOS-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810819#comment-16810819 ] Benno Evers commented on MESOS-8257: I removed the 1.8.0 target designation here and in the linked ticket since it looks like there hasn't been any recent activity here; please feel free to revert as you see fit. > Unified Containerizer "leaks" a target container mount path to the host FS > when the target resolves to an absolute path > --- > > Key: MESOS-8257 > URL: https://issues.apache.org/jira/browse/MESOS-8257 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Jason Lai >Assignee: Jason Lai >Priority: Critical > Labels: bug, containerization, containerizer, mountpath > > If a target path under the root FS provisioned from an image resolves to an > absolute path, it will not appear in the container root FS after > {{pivot_root(2)}} is called. > A typical example is that when the target path is under {{/var/run}} (e.g. > {{/var/run/some-dir}}), which is usually a symlink to an absolute path of > {{/run}} in Debian images, the target path will get resolved as and created > at {{/run/some-dir}} in the host root FS, after the container root FS gets > provisioned. The target path will get unmounted after {{pivot_root(2)}} as it > is part of the old root (host FS). > A workaround is to use {{/run}} instead of {{/var/run}}, but absolute > symlinks need to be resolved within the scope of the container root FS path. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9677) RPM packages should be built with launcher sealing
[ https://issues.apache.org/jira/browse/MESOS-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810759#comment-16810759 ] Benno Evers commented on MESOS-9677: The `memfd_create()` manpage says: {quote} The memfd_create() system call first appeared in Linux 3.17 {quote} According to Wikipedia, CentOS 7 uses kernels from the 3.10 series: https://en.wikipedia.org/wiki/CentOS#Latest_version_information So I'm not sure if it will really be safe to enable this by default on CentOS 7. [~gilbert], can you clarify this? > RPM packages should be built with launcher sealing > -- > > Key: MESOS-9677 > URL: https://issues.apache.org/jira/browse/MESOS-9677 > Project: Mesos > Issue Type: Task > Components: build >Affects Versions: 1.8.0 >Reporter: Benjamin Bannier >Priority: Major > Labels: integration, mesosphere, packaging, rpm, storage > > We should consider enabling launcher sealing in the Mesos RPM packages. Since > this feature is built conditionally, it is hard to write e.g., module code > against Mesos packages since required functions might be missing (e.g., > [https://github.com/dcos/dcos-mesos-modules/commit/8ce70e6cc789054831daa3058647e326b2b11bc9] > cannot be linked against the default RPM package anymore). The RPM's target > platform centos7 should include a recent enough kernel for this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
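If enabling this by default is still desired, a runtime probe would be one way to stay safe on 3.10-series kernels. A sketch (illustrative only; note that glibc only gained a `memfd_create()` wrapper in 2.27, hence the raw syscall):
{code}
#include <sys/syscall.h>
#include <unistd.h>

#include <cerrno>

// Sketch of a runtime check for memfd_create() support: attempt the
// syscall and treat ENOSYS (kernel < 3.17) as "not supported".
bool supportsMemfdCreate()
{
#ifdef __NR_memfd_create
  int fd = static_cast<int>(::syscall(__NR_memfd_create, "probe", 0u));
  if (fd >= 0) {
    ::close(fd);
    return true;
  }
  return errno != ENOSYS;
#else
  return false;
#endif
}
{code}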
[jira] [Commented] (MESOS-9313) Document speculative offer operation semantics for framework writers.
[ https://issues.apache.org/jira/browse/MESOS-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809972#comment-16809972 ] Benno Evers commented on MESOS-9313: I'm not so sure that framework authors can just treat this as an opaque implementation detail, because I'd assume the `reason` field would be different between a task failing because it was launched on supposedly reserved resources that were never reserved on the agent, and a task failing for other reasons. Additionally, I think it's just better user experience to get people to understand *why* certain state transitions can happen, as opposed to just saying nothing is ever certain so deal with it. That said, it doesn't look like anyone is currently working on this so I'm removing the 1.8 target version designation from this task. > Document speculative offer operation semantics for framework writers. > - > > Key: MESOS-9313 > URL: https://issues.apache.org/jira/browse/MESOS-9313 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: James DeFelice >Priority: Major > Labels: mesosphere, operation-feedback, operations > > It recently came to my attention that a subset of offer operations (e.g. > RESERVE, UNRESERVE, et al.) are implemented speculatively within mesos > master. Meaning that the master will apply the resource conversion internally > **before** the conversion is checkpointed on the agent. The master may then > re-offer the converted resource to a framework -- even though the agent may > still not have checkpointed the resource conversion. If the checkpointing > process on the agent fails, then subsequent operations issued for the > falsely-offered resource will fail. Because the master essentially "lied" to > the framework about the true state of the supposedly-converted resource. > It's also been explained to me that this case is expected to be rare. > However, it *can* impact the design/implementation of framework state > machines and so it's critical that this information be documented clearly - > outside of the C++ code base. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9675) Docker Manifest V2 Schema2 Support.
[ https://issues.apache.org/jira/browse/MESOS-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-9675: -- Assignee: Gilbert Song > Docker Manifest V2 Schema2 Support. > --- > > Key: MESOS-9675 > URL: https://issues.apache.org/jira/browse/MESOS-9675 > Project: Mesos > Issue Type: Epic > Components: containerization >Reporter: Gilbert Song >Assignee: Gilbert Song >Priority: Blocker > Labels: containerization > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8068) Non-revocable bursting over quota guarantees via limits.
[ https://issues.apache.org/jira/browse/MESOS-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809815#comment-16809815 ] Benno Evers commented on MESOS-8068: Removed the 1.8.0 target version since it's not going to be completed for that version; feel free to revert as you see fit. > Non-revocable bursting over quota guarantees via limits. > > > Key: MESOS-8068 > URL: https://issues.apache.org/jira/browse/MESOS-8068 > Project: Mesos > Issue Type: Epic > Components: allocation >Reporter: Benjamin Mahler >Priority: Major > Labels: multitenancy, resource-management > > Prior to introducing a revocable tier of allocation (see MESOS-4441), there > is a notion of whether a role can burst over its quota guarantee. > We currently apply implicit limits in the following way: > No quota guarantee set: (guarantee 0, no limit) > Quota guarantee set: (guarantee G, limit G) > That is, we only support burst-only without guarantee and > guarantee-only without burst. We do not support bursting over some non-zero > guarantee: (guarantee G, limit L >= G). > The idea here is that we should make these implicit limits explicit to > clarify for users the distinction between guarantees and limits, and to > support bursting over the guarantee. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7428) Report exit code of tasks from default and command executors
[ https://issues.apache.org/jira/browse/MESOS-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809793#comment-16809793 ] Benno Evers commented on MESOS-7428: I'm removing the 1.8.0 target version since this hasn't been updated for a while. Please feel free to revert as you see fit. > Report exit code of tasks from default and command executors > > > Key: MESOS-7428 > URL: https://issues.apache.org/jira/browse/MESOS-7428 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Zhitao Li >Assignee: Eric Chung >Priority: Major > > Use case: some tasks should only be retried if the exit code matches certain > user requirement. > Based on [~gilbert], we already checkpoint the exit code in containerizer > now, and we need to clarify how to report exit code for executor containers > v.s. nested containers, and we should do this consistently for command and > default executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7776) Document `MESOS_CONTAINER_IP`
[ https://issues.apache.org/jira/browse/MESOS-7776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809776#comment-16809776 ] Benno Evers commented on MESOS-7776: I'm removing the target version designation for now, since it looks like this is currently not being worked on. Please revert as you see fit. > Document `MESOS_CONTAINER_IP` > -- > > Key: MESOS-7776 > URL: https://issues.apache.org/jira/browse/MESOS-7776 > Project: Mesos > Issue Type: Documentation > Components: containerization >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan >Priority: Major > > We introduced `MESOS_CONTAINER_IP` to inform tasks launched by the > default-executor about their container IP. This was done primarily to break > the dependency of the containers on `LIBPROCESS_IP` to learn their IP > addresses, which was misleading. > This change needs to be documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call
[ https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808914#comment-16808914 ] Benno Evers commented on MESOS-7974: Re-targeted to 1.9.0. > Accept "application/recordio" type is rejected for master operator API > SUBSCRIBE call > - > > Key: MESOS-7974 > URL: https://issues.apache.org/jira/browse/MESOS-7974 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.2.1 >Reporter: James DeFelice >Assignee: Joseph Wu >Priority: Major > Labels: mesosphere > > The agent operator API supports "application/recordio" for things like > attach-container-output, which streams objects back to the caller. I expected > the master operator API SUBSCRIBE call to work the same way, w/ > Accept/Content-Type headers for "recordio" and > Message-Accept/Message-Content-Type headers for json (or protobuf). This was > not the case. > Looking again at the master operator API documentation, SUBSCRIBE docs > illustrate usage of Accept and Content-Type headers for the "application/json" > type, not a "recordio" type. So my experience, as per the docs, seems > expected. However, this is counter-intuitive since the whole point of adding > the new Message-prefixed headers was to help callers consistently request > (and differentiate) streaming responses from non-streaming responses in the > v1 API. > Please fix the master operator API implementation to also support the > Message-prefixed headers w/ Accept/Content-Type set to "recordio". > Observed on ubuntu w/ mesos package version 1.2.1-2.0.1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
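For reference, the expected streaming subscription would then look roughly like this (a sketch based on the header semantics described above; the host is a placeholder):
{noformat}
POST /api/v1 HTTP/1.1
Host: master.example.com:5050
Content-Type: application/json
Accept: application/recordio
Message-Accept: application/json

{"type": "SUBSCRIBE"}
{noformat}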
[jira] [Assigned] (MESOS-9082) Avoid two trips through the master mailbox for state.json requests.
[ https://issues.apache.org/jira/browse/MESOS-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-9082: -- Assignee: (was: Benno Evers) > Avoid two trips through the master mailbox for state.json requests. > --- > > Key: MESOS-9082 > URL: https://issues.apache.org/jira/browse/MESOS-9082 > Project: Mesos > Issue Type: Task >Reporter: Alexander Rukletsov >Priority: Major > Labels: foundations, mesosphere, performance > > Currently, a state.json request travels through the master's mailbox twice: > before authorization and after. This increases the overall state.json > response time by around 30%. > To remove one mailbox trip, we can perform the initial portion (validation > and authorization) of state and /state off the master actor by using a > top-level {{Route}}, then dispatch onto the master actor only for json / > protobuf serialization. This should drop the authorization time down to near > 0 if it's indeed mostly queuing delay. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8148) Enforce text attribute value specification for zone and region values
[ https://issues.apache.org/jira/browse/MESOS-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-8148: -- Assignee: (was: Benno Evers) > Enforce text attribute value specification for zone and region values > - > > Key: MESOS-8148 > URL: https://issues.apache.org/jira/browse/MESOS-8148 > Project: Mesos > Issue Type: Improvement >Reporter: Tim Harper >Priority: Major > > Mesos has a specification for characters allowed by attribute values: > http://mesos.apache.org/documentation/latest/attributes-resources/ > The specification is as follows: > {code} > scalar : floatValue > floatValue : ( intValue ( "." intValue )? ) | ... > intValue : [0-9]+ > range : "[" rangeValue ( "," rangeValue )* "]" > rangeValue : scalar "-" scalar > set : "{" text ( "," text )* "}" > text : [a-zA-Z0-9_/.-] > {code} > Marathon is [implementing IN and IS > constraints|https://docs.google.com/document/d/e/2PACX-1vSFvPol0pcHC2Web7EaNU0oSDS5wrOWSgFcmuslYBtISV2NB2JZ_D-B4wpWy_Vutaf08m2LX6WZVy6s/pub], > and includes plans to support further attribute types as it makes sense to > do so (IE {{{a,b} IS {b,a}}}, {{5 IN [0-10]}}). In order > to do this, Marathon has adopted the Mesos attribute value specification and > will enforce it in the validation layer. As an example, it will be possible > to write things like: > {code:java} > "constraints": [ > ["attribute", "IN", "{value-a,value-b,value-c}"] > ] > {code} > Additionally, Marathon allows one to specify constraints on non-attribute > properties, such as region, hostname, or zone. If somebody specified a zone > value with a comma, then the user would not be able to use the Mesos set > value type specification to describe a set of zones in which an app should be > deployed, which, as a consequence, would result in additional complexity (IE: > Marathon would need to implement an escaping mechanism for this case). > Ideally, the character space is confined to begin with. If the text type > specification is sufficient, then it seems simpler to re-use it rather than > create another one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9615) Example framework for feedback on agent default resources
[ https://issues.apache.org/jira/browse/MESOS-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805002#comment-16805002 ] Benno Evers commented on MESOS-9615: {noformat} commit 1915150c6a83cd95197e25a68a6adf9b3ef5fb11 Author: Benno Evers Date: Fri Mar 22 17:51:34 2019 +0100 Added new example framework for operation feedback. This adds a new example framework showcasing a possible implementation of the newly added operation feedback API. Review: https://reviews.apache.org/r/70282 {noformat} > Example framework for feedback on agent default resources > - > > Key: MESOS-9615 > URL: https://issues.apache.org/jira/browse/MESOS-9615 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Assignee: Benno Evers >Priority: Major > Labels: foundations, mesosphere > > We need a framework that can be used to test operations on agent default > resources which request operation feedback. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9687) Add the glog patch to pass microseconds via the LogSink interface.
[ https://issues.apache.org/jira/browse/MESOS-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804978#comment-16804978 ] Benno Evers commented on MESOS-9687: Interface extension landed in: {noformat} commit 8cba86825449c35733a0b4cf0d14284055c2cc30 (HEAD -> master, origin/master) Author: Andrei Sekretenko Date: Fri Mar 29 14:23:57 2019 +0100 Extended the glog LogSink interface to be able to log microseconds. Extended the LogSink interface to be able to log microseconds. This makes possible to solve a problem with modules implementing custom LogSink which currently log 00 instead of microseconds. This is a backport of this patch: https://github.com/google/glog/pull/441 to glog 0.3.3 Review: https://reviews.apache.org/r/70334/ {noformat} Modules can now use the new interface method {noformat} virtual void send(LogSeverity severity, const char* full_filename, const char* base_filename, int line, const struct ::tm* tm_time, const char* message, size_t message_len, int32 usecs) {noformat} to include microseconds in their log output. > Add the glog patch to pass microseconds via the LogSink interface. > -- > > Key: MESOS-9687 > URL: https://issues.apache.org/jira/browse/MESOS-9687 > Project: Mesos > Issue Type: Task >Reporter: Andrei Sekretenko >Priority: Major > > Currently, custom LogSink implementations in the modules (for example, this > one: > [https://github.com/dcos/dcos-mesos-modules/blob/master/logsink/logsink.hpp] > ) > are logging `00` instead of microseconds in the timestamp - simply > because the LogSink interface in glog has no place for microseconds. > The proposed glog fix is here: [https://github.com/google/glog/pull/441] > Getting this into glog release might take a long time (they released 0.4.0 > recently, but the previous release 0.3.5 was two years ago), therefore it > makes sense to add this patch into Mesos build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
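A minimal sketch of a module-side sink built against the patched glog described above (the class name and output formatting are illustrative; only the 8-argument signature is taken from the commit):
{code}
#include <glog/logging.h>

#include <iostream>
#include <string>

class MicrosecondLogSink : public google::LogSink
{
public:
  // The pre-existing interface method (no microseconds available);
  // forward to the extended one with zero usecs.
  virtual void send(
      google::LogSeverity severity,
      const char* full_filename,
      const char* base_filename,
      int line,
      const struct ::tm* tm_time,
      const char* message,
      size_t message_len)
  {
    send(severity, full_filename, base_filename, line, tm_time,
         message, message_len, 0);
  }

  // The extended method quoted above; `usecs` carries the microseconds
  // that were previously unavailable to sinks.
  virtual void send(
      google::LogSeverity severity,
      const char* full_filename,
      const char* base_filename,
      int line,
      const struct ::tm* tm_time,
      const char* message,
      size_t message_len,
      google::int32 usecs)
  {
    std::cerr << base_filename << ":" << line << " "
              << std::string(message, message_len)
              << " (+" << usecs << "us)" << std::endl;
  }
};
{code}
Sinks are registered as usual via `google::AddLogSink()`.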
[jira] [Assigned] (MESOS-9687) Add the glog patch to pass microseconds via the LogSink interface.
[ https://issues.apache.org/jira/browse/MESOS-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-9687: -- Assignee: Benno Evers > Add the glog patch to pass microseconds via the LogSink interface. > -- > > Key: MESOS-9687 > URL: https://issues.apache.org/jira/browse/MESOS-9687 > Project: Mesos > Issue Type: Task >Reporter: Andrei Sekretenko >Assignee: Benno Evers >Priority: Major > > Currently, custom LogSink implementations in the modules (for example, this > one: > [https://github.com/dcos/dcos-mesos-modules/blob/master/logsink/logsink.hpp] > ) > are logging `00` instead of microseconds in the timestamp - simply > because the LogSink interface in glog has no place for microseconds. > The proposed glog fix is here: [https://github.com/google/glog/pull/441] > Getting this into glog release might take a long time (they released 0.4.0 > recently, but the previous release 0.3.5 was two years ago), therefore it > makes sense to add this patch into Mesos build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9687) Add the glog patch to pass microseconds via the LogSink interface.
[ https://issues.apache.org/jira/browse/MESOS-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-9687: -- Assignee: Andrei Sekretenko (was: Benno Evers) > Add the glog patch to pass microseconds via the LogSink interface. > -- > > Key: MESOS-9687 > URL: https://issues.apache.org/jira/browse/MESOS-9687 > Project: Mesos > Issue Type: Task >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Major > > Currently, custom LogSink implementations in the modules (for example, this > one: > [https://github.com/dcos/dcos-mesos-modules/blob/master/logsink/logsink.hpp] > ) > are logging `00` instead of microseconds in the timestamp - simply > because the LogSink interface in glog has no place for microseconds. > The proposed glog fix is here: [https://github.com/google/glog/pull/441] > Getting this into glog release might take a long time (they released 0.4.0 > recently, but the previous release 0.3.5 was two years ago), therefore it > makes sense to add this patch into Mesos build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9690) Framework registration can silently fail w/o visible error
[ https://issues.apache.org/jira/browse/MESOS-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804905#comment-16804905 ] Benno Evers commented on MESOS-9690: The authentication issues mentioned in the original ticket turned out to be a red herring, so I updated the ticket description and labels. > Framework registration can silently fail w/o visible error > -- > > Key: MESOS-9690 > URL: https://issues.apache.org/jira/browse/MESOS-9690 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: foundations > > When running a v1 framework, the master can sometimes respond with "503 > Service Unavailable" to a SUBSCRIBE request, without any log message hinting > at what might be wrong even at log level `GLOG_v=4`. For example, this is > from an attempt to run the `OperationFeedbackFramework` against `mesos-local`: > {noformat} > I0328 18:17:53.273442 7793 scheduler.cpp:600] Sending SUBSCRIBE call to > http://127.0.1.1:36423/master/api/v1/scheduler > I0328 18:17:53.273653 7797 leveldb.cpp:347] Persisting action (14 bytes) to > leveldb took 3.185352ms > I0328 18:17:53.273695 7797 replica.cpp:712] Persisted action NOP at position > 0 > I0328 18:17:53.274099 7798 containerizer.cpp:1123] Recovering isolators > I0328 18:17:53.274602 7794 replica.cpp:695] Replica received learned notice > for position 0 from log-network(1)@127.0.1.1:36423 > I0328 18:17:53.274829 7798 containerizer.cpp:1162] Recovering provisioner > I0328 18:17:53.275249 7795 process.cpp:3588] Handling HTTP event for process > 'master' with path: '/master/api/v1/scheduler' > I0328 18:17:53.276659 7792 provisioner.cpp:494] Provisioner recovery complete > I0328 18:17:53.277318 7796 slave.cpp:7602] Recovering executors > I0328 18:17:53.277470 7796 slave.cpp:7755] Finished recovery > I0328 18:17:53.277743 7794 leveldb.cpp:347] Persisting action (16 bytes) to > leveldb took 3.110989ms > I0328 18:17:53.27 7794 replica.cpp:712] Persisted action NOP at position > 0 > I0328 18:17:53.278400 7795 http.cpp:1105] HTTP POST for > /master/api/v1/scheduler from 127.0.0.1:45952 > I0328 18:17:53.278426 7793 task_status_update_manager.cpp:181] Pausing > sending task status updates > I0328 18:17:53.278453 7794 log.cpp:570] Writer started with ending position 0 > I0328 18:17:53.278425 7798 status_update_manager_process.hpp:379] Pausing > operation status update manager > I0328 18:17:53.278431 7796 slave.cpp:1258] New master detected at > master@127.0.1.1:36423 > I0328 18:17:53.278502 7796 slave.cpp:1312] No credentials provided. > Attempting to register without authentication > I0328 18:17:53.278560 7796 slave.cpp:1323] Detecting new master > W0328 18:17:53.279768 7791 scheduler.cpp:697] Received '503 Service > Unavailable' () for SUBSCRIBE > {noformat} > Regardless of the actual issue that caused the error response, I think at the > very least, > - the `mesos::scheduler::Mesos` class should either have a way to provide > some feedback to the user or retry itself, not silently swallow the error > - our documentation should mention the possibility of this call returning > errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8241) Add metrics for offer operation feedback
[ https://issues.apache.org/jira/browse/MESOS-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804560#comment-16804560 ] Benno Evers commented on MESOS-8241: {noformat} commit ede2a94ebaf9710516816bae7d012d926c533a59 Author: Benno Evers Date: Thu Feb 28 18:02:56 2019 +0100 Added unit tests for offer operation feedback metrics. This adds a set of checks to verify the metrics introduced in the previous commit are working as intended. Review: https://reviews.apache.org/r/70117 commit 18c401563c33022240fede63fbe3ec9b7bf4c385 Author: Benno Evers Date: Thu Feb 28 18:03:27 2019 +0100 Added metrics for offer operation feedback. This commit adds additional metrics counting the number of operations in each state. Unit tests are added in the subsequent commit. Review: https://reviews.apache.org/r/70116 commit af2c47a5e680b5c3140fd7d4639750f476f1627c Author: Benno Evers Date: Thu Mar 7 17:51:22 2019 +0100 Added helper to test for metrics values. This patch adds a new helper function to check whether a given metric has some specified value. Review: https://reviews.apache.org/r/70156 commit 5e4aa14a2b6c5c753248e642289c04a267aca074 Author: Benno Evers Date: Thu Feb 28 18:01:47 2019 +0100 Updated comment about operations. Review: https://reviews.apache.org/r/70115 {noformat} > Add metrics for offer operation feedback > > > Key: MESOS-8241 > URL: https://issues.apache.org/jira/browse/MESOS-8241 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Assignee: Benno Evers >Priority: Blocker > Labels: foundations, mesosphere, operation-feedback > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9690) Framework registration on mesos-local fails w/o error unless http_framework_authenticators flag is set
Benno Evers created MESOS-9690: -- Summary: Framework registration on mesos-local fails w/o error unless http_framework_authenticators flag is set Key: MESOS-9690 URL: https://issues.apache.org/jira/browse/MESOS-9690 Project: Mesos Issue Type: Bug Reporter: Benno Evers When running a v1 framework against mesos-local without setting the "--http_framework_authenticators=basic" flag, the master will respond with "503 Service Unavailable" to a SUBSCRIBE request, without any log message hinting at what might be wrong even at log level `GLOG_v=4`: {noformat} I0328 18:17:53.273442 7793 scheduler.cpp:600] Sending SUBSCRIBE call to http://127.0.1.1:36423/master/api/v1/scheduler I0328 18:17:53.273653 7797 leveldb.cpp:347] Persisting action (14 bytes) to leveldb took 3.185352ms I0328 18:17:53.273695 7797 replica.cpp:712] Persisted action NOP at position 0 I0328 18:17:53.274099 7798 containerizer.cpp:1123] Recovering isolators I0328 18:17:53.274602 7794 replica.cpp:695] Replica received learned notice for position 0 from log-network(1)@127.0.1.1:36423 I0328 18:17:53.274829 7798 containerizer.cpp:1162] Recovering provisioner I0328 18:17:53.275249 7795 process.cpp:3588] Handling HTTP event for process 'master' with path: '/master/api/v1/scheduler' I0328 18:17:53.276659 7792 provisioner.cpp:494] Provisioner recovery complete I0328 18:17:53.277318 7796 slave.cpp:7602] Recovering executors I0328 18:17:53.277470 7796 slave.cpp:7755] Finished recovery I0328 18:17:53.277743 7794 leveldb.cpp:347] Persisting action (16 bytes) to leveldb took 3.110989ms I0328 18:17:53.27 7794 replica.cpp:712] Persisted action NOP at position 0 I0328 18:17:53.278400 7795 http.cpp:1105] HTTP POST for /master/api/v1/scheduler from 127.0.0.1:45952 I0328 18:17:53.278426 7793 task_status_update_manager.cpp:181] Pausing sending task status updates I0328 18:17:53.278453 7794 log.cpp:570] Writer started with ending position 0 I0328 18:17:53.278425 7798 status_update_manager_process.hpp:379] Pausing operation status update manager I0328 18:17:53.278431 7796 slave.cpp:1258] New master detected at master@127.0.1.1:36423 I0328 18:17:53.278502 7796 slave.cpp:1312] No credentials provided. Attempting to register without authentication I0328 18:17:53.278560 7796 slave.cpp:1323] Detecting new master W0328 18:17:53.279768 7791 scheduler.cpp:697] Received '503 Service Unavailable' () for SUBSCRIBE {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9666) Specifying custom CXXFLAGS breaks Mesos build
Benno Evers created MESOS-9666: -- Summary: Specifying custom CXXFLAGS breaks Mesos build Key: MESOS-9666 URL: https://issues.apache.org/jira/browse/MESOS-9666 Project: Mesos Issue Type: Bug Reporter: Benno Evers The environment variable CXXFLAGS (as well as CFLAGS and CPPFLAGS) is intended to give the user a way to add custom compiler flags to the build at both configure-time and build-time. For example, a user wishing to use the address-sanitizer feature for a development build could run configure like {noformat} ./configure CXXFLAGS="-fsanitize=address" {noformat} or a user wishing to investigate a particular binary might want to rebuild it with additional debug information: {noformat} make -C src/ dynamic-reservation-framework CXXFLAGS="-g3 -O0" {noformat} Therefore, providing custom CXXFLAGS should not break the build. However, we currently add some essential flags (like '-std=c++11') into CXXFLAGS, and a user specifying custom CXXFLAGS has to replicate all of these before they can provide their own. Instead, we should try to restrict CXXFLAGS to some harmless default (e.g. '-g -O2') and move essential flags into some other variable MESOS_CXXFLAGS that is always added to the Mesos build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6874) Agent silently ignores FS isolation when protobuf is malformed
[ https://issues.apache.org/jira/browse/MESOS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796412#comment-16796412 ] Benno Evers commented on MESOS-6874: {noformat} commit 93aca1eb0efcec941e19e976f683a35ecd9a840b Author: Andrei Sekretenko Date: Tue Mar 19 18:55:55 2019 +0100 Validated the match between Type and *Infos in the ContainerInfo. [...] {noformat} > Agent silently ignores FS isolation when protobuf is malformed > -- > > Key: MESOS-6874 > URL: https://issues.apache.org/jira/browse/MESOS-6874 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.1.0 >Reporter: Michael Gummelt >Assignee: Andrei Sekretenko >Priority: Minor > Labels: foundations, newbie > Time Spent: 40m > Remaining Estimate: 0h > > cc [~vinodkone] > I accidentally set my Mesos ContainerInfo to include a DockerInfo instead of > a MesosInfo: > {code} > executorInfoBuilder.setContainer( > Protos.ContainerInfo.newBuilder() > .setType(Protos.ContainerInfo.Type.MESOS) > .setDocker(Protos.ContainerInfo.DockerInfo.newBuilder() > > .setImage(podSpec.getContainer().get().getImageName())) > {code} > I would have expected a validation error before or during containerization, > but instead, the agent silently decided to ignore filesystem isolation > altogether, and launch my executor on the host filesystem. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9660) Documentation should mention constraints for `ACCEPT` calls
Benno Evers created MESOS-9660: -- Summary: Documentation should mention constraints for `ACCEPT` calls Key: MESOS-9660 URL: https://issues.apache.org/jira/browse/MESOS-9660 Project: Mesos Issue Type: Improvement Reporter: Benno Evers Our current documentation does not mention any constraints on the `ACCEPT` scheduler API call. However, in addition to the trivial constraints (i.e. all operations must have valid resources and have required fields set), we also have a number of non-obvious constraints that should be documented. One example is that all offer_ids in this call must belong to offers of the same agent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9657) Launching a command task twice can crash the agent
Benno Evers created MESOS-9657: -- Summary: Launching a command task twice can crash the agent Key: MESOS-9657 URL: https://issues.apache.org/jira/browse/MESOS-9657 Project: Mesos Issue Type: Bug Reporter: Benno Evers When launching a command task, we verify that the framework has no existing executor for that task: {noformat} // We are dealing with command task; a new command executor will be // launched. CHECK(executor == nullptr); {noformat} and afterwards an executor is created with the same executor id as the task id: {noformat} // (slave.cpp) // Either the master explicitly requests launching a new executor // or we are in the legacy case of launching one if there wasn't // one already. Either way, let's launch executor now. if (executor == nullptr) { Try added = framework->addExecutor(executorInfo); [...] {noformat} This means that if we relaunch the task with the same task id before the executor is removed, it will crash the agent: {noformat} F0315 16:39:32.822818 38112 slave.cpp:2865] Check failed: executor == nullptr *** Check failure stack trace: *** @ 0x7feb29a407af google::LogMessage::Flush() @ 0x7feb29a43c3f google::LogMessageFatal::~LogMessageFatal() @ 0x7feb28a5a886 mesos::internal::slave::Slave::__run() @ 0x7feb28af4f0e _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEclEOS3_ @ 0x7feb2998a620 process::ProcessBase::consume() @ 0x7feb29987675 process::ProcessManager::resume() @ 0x7feb299a2d2b _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvE3$_8E6_M_runEv @ 0x7feb2632f523 (unknown) @ 0x7feb25e40594 start_thread @ 0x7feb25b73e6f __GI___clone Aborted (core dumped) {noformat} Instead of crashing, the agent should just drop the task with an appropriate error in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
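The suggested fix would replace the `CHECK` with a validation failure along these lines (a sketch with a hypothetical helper name, not the actual agent code):
{code}
// Instead of: CHECK(executor == nullptr);
if (executor != nullptr) {
  // `dropTask()` is a hypothetical stand-in for sending a terminal
  // status update (e.g. TASK_DROPPED with an appropriate reason) and
  // bailing out of the launch path.
  dropTask(
      task,
      "Task ID collides with an existing executor of this framework");
  return;
}
{code}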
[jira] [Created] (MESOS-9656) Empty reservations fail with confusing error message
Benno Evers created MESOS-9656: -- Summary: Empty reservations fail with confusing error message Key: MESOS-9656 URL: https://issues.apache.org/jira/browse/MESOS-9656 Project: Mesos Issue Type: Bug Reporter: Benno Evers When attempting to apply a reserve operation containing empty resources, the operation fails during validation with the error message: {noformat} W0315 11:17:37.687129 25931 master.cpp:2292] Dropping UNRESERVE operation from framework e4cd5335-8af5-4db2-b6f8-07adbef1c6a3- (Operation Feedback Framework (C++)): Invalid resources: The resources have multiple resource providers: {noformat} Instead, the error message should say that the reservation does not contain any resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
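A minimal sketch of the clearer validation suggested above, assuming a hypothetical `validateReserve` entry point; the point is simply to check for empty resources before the resource-provider check:
{noformat}
#include <mesos/resources.hpp>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

// Hypothetical entry point: check for empty resources *before* the
// resource-provider check, so the user gets an actionable message.
Option<Error> validateReserve(const mesos::Resources& resources)
{
  if (resources.empty()) {
    return Error("A reserve operation must contain at least one resource");
  }

  // ... existing validation, e.g. that all resources belong to a
  // single resource provider, follows here ...

  return None();
}
{noformat}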
[jira] [Created] (MESOS-9652) URL handler lookup might miss root handlers
Benno Evers created MESOS-9652: -- Summary: URL handler lookup might miss root handlers Key: MESOS-9652 URL: https://issues.apache.org/jira/browse/MESOS-9652 Project: Mesos Issue Type: Bug Reporter: Benno Evers When looking up URL handlers, libprocess looks for the longest URL prefix that corresponds to an HTTP endpoint handler registered by the handling process. For example, if a process set up the routes `/foo` and `/foo/bar`, an incoming HTTP request for `/foo/bar/baz` would be dispatched onto the `/foo/bar` handler. However, if a process registers a route `/`, the lookup will only succeed if the request is exactly for `/`; a request for `/baz` will return a 404 Not Found response. The root cause of this is the implementation of the handler lookup: {noformat} // ProcessBase::consume(HttpEvent&&) name = strings::trim(name, strings::PREFIX, "/"); [...] while (Path(name, '/').dirname() != name) { [...] } {noformat} where `dirname()` returns "." when the input string `name` does not contain any `/`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
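The behaviour can be demonstrated with a small standalone program using stout's `Path`, mirroring the loop above:
{noformat}
#include <iostream>
#include <string>

#include <stout/path.hpp>

// For a request to "/baz", `name` is "baz" after trimming the leading
// slash. Path("baz", '/').dirname() is ".", which differs from "baz",
// so the loop runs once, leaving name == ".", and then terminates --
// the handler registered for the route "/" is never tried.
int main()
{
  std::string name = "baz";

  while (Path(name, '/').dirname() != name) {
    std::cout << "looking up handler for '" << name << "'" << std::endl;
    name = Path(name, '/').dirname();
  }

  // Prints "." here: the lookup never reached the root handler.
  std::cout << "lookup ended at '" << name << "'" << std::endl;

  return 0;
}
{noformat}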
[jira] [Created] (MESOS-9650) Document the semantics of operation pipelining
Benno Evers created MESOS-9650: -- Summary: Document the semantics of operation pipelining Key: MESOS-9650 URL: https://issues.apache.org/jira/browse/MESOS-9650 Project: Mesos Issue Type: Improvement Reporter: Benno Evers In our `Accept` protobuf, frameworks can specify multiple offer operations that are to be executed on the received offer: https://github.com/apache/mesos/blob/40abcefab4f2887e61786365b46bc22155a2d1ff/include/mesos/scheduler/scheduler.proto#L317 However, the semantics of specifying multiple operations in this way are currently not documented anywhere I could find, except for a comment on that protobuf that the master will be "performing the specified operations in a sequential manner." In particular, it is unclear what will happen if any operation in the sequence fails, at which particular point the master will move on to the next operation (e.g. if we have [RESERVE, LAUNCH, RESERVE], when exactly does the second reserve happen), and whether there are any restrictions on combining operations in this way. While all of this can be figured out by reading the master source code, we should add some user-facing documentation about this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
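For concreteness, this is roughly how a framework builds such a pipeline, here a RESERVE followed by a LAUNCH (v1 protobufs; the `buildAccept` helper is purely illustrative):
{noformat}
#include <mesos/v1/mesos.pb.h>
#include <mesos/v1/scheduler/scheduler.pb.h>

// Illustrative only: build an ACCEPT with a RESERVE followed by a
// LAUNCH. Whether the LAUNCH still runs if the RESERVE fails is
// exactly the undocumented part.
mesos::v1::scheduler::Call buildAccept(
    const mesos::v1::FrameworkID& frameworkId,
    const mesos::v1::OfferID& offerId,
    const mesos::v1::Resource& reservation,
    const mesos::v1::TaskInfo& task)
{
  mesos::v1::scheduler::Call call;
  call.set_type(mesos::v1::scheduler::Call::ACCEPT);
  call.mutable_framework_id()->CopyFrom(frameworkId);

  mesos::v1::scheduler::Call::Accept* accept = call.mutable_accept();
  accept->add_offer_ids()->CopyFrom(offerId);

  mesos::v1::Offer::Operation* reserve = accept->add_operations();
  reserve->set_type(mesos::v1::Offer::Operation::RESERVE);
  reserve->mutable_reserve()->add_resources()->CopyFrom(reservation);

  mesos::v1::Offer::Operation* launch = accept->add_operations();
  launch->set_type(mesos::v1::Offer::Operation::LAUNCH);
  launch->mutable_launch()->add_task_infos()->CopyFrom(task);

  return call;
}
{noformat}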
[jira] [Created] (MESOS-9645) Add a way to access a subset of metrics
Benno Evers created MESOS-9645: -- Summary: Add a way to access a subset of metrics Key: MESOS-9645 URL: https://issues.apache.org/jira/browse/MESOS-9645 Project: Mesos Issue Type: Improvement Reporter: Benno Evers Currently, the only way to access libprocess metrics is via the `metrics/snapshot` endpoint, which returns the current values of all installed metrics. If the caller is only interested in a specific metric, or a subset of the metrics, this is wasteful in two ways: first, the process has to do extra work to collect these metrics, and second, the caller has to do extra work to filter out the unneeded ones. Ideally libprocess could use the request path to implement filtering such that e.g. a request to {noformat} wget http://mesos.master:5050/metrics/allocator/mesos/ {noformat} would return all metrics whose name begins with "allocator/mesos/", but I'm not sure that this is currently implementable. Alternatively, a request parameter could be added to the same effect: {noformat} wget http://mesos.master:5050/metrics/snapshot?prefix=allocator/mesos/ {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
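A minimal sketch of the `?prefix=` variant, assuming the handler already has the full snapshot as a name-to-value map; all names here are hypothetical:
{noformat}
#include <map>
#include <string>

#include <stout/option.hpp>
#include <stout/strings.hpp>

// Hypothetical filtering step inside the snapshot handler: `snapshot`
// is the full name-to-value map that the endpoint would return today,
// `prefix` the value of the proposed `?prefix=` request parameter.
std::map<std::string, double> filterByPrefix(
    const std::map<std::string, double>& snapshot,
    const Option<std::string>& prefix)
{
  if (prefix.isNone()) {
    return snapshot;  // No filter given: current behaviour.
  }

  std::map<std::string, double> filtered;

  for (const auto& entry : snapshot) {
    if (strings::startsWith(entry.first, prefix.get())) {
      filtered.insert(entry);
    }
  }

  return filtered;
}
{noformat}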
[jira] [Created] (MESOS-9644) Marking an Agent as Gone Breaks Metrics Process in Unit Tests
Benno Evers created MESOS-9644: -- Summary: Marking an Agent as Gone Breaks Metrics Process in Unit Tests Key: MESOS-9644 URL: https://issues.apache.org/jira/browse/MESOS-9644 Project: Mesos Issue Type: Bug Reporter: Benno Evers When an agent is marked as gone, the master will tell that agent to shut down, which it does via {noformat} // slave.cpp:974 terminate(self()); {noformat} However, terminating a process will only call `Slave::finalize()`, but *not* the destructor `Slave::~Slave()`. In a standalone slave, this doesn't matter since terminating the slave process will cause the OS process to immediately exit as well. However, in unit tests that is not the case, and since the slave was never properly destructed, its metrics keys are still contained in the global metrics object. The pull gauges will then cause a deadlock the next time a metrics snapshot is requested, since their dispatches will be silently (for VLOG < 2) dropped: {noformat} I0311 11:08:53.329043 34499 authorization.cpp:135] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot' I0311 11:08:53.329067 34499 clock.cpp:435] Clock of local-authorizer(2)@66.70.182.167:35815 updated to 2019-03-11 15:08:53.273557888+00:00 I0311 11:08:53.329121 34499 process.cpp:2880] Resuming local-authorizer(2)@66.70.182.167:35815 at 2019-03-11 15:08:53.273557888+00:00 I0311 11:08:53.329160 34496 process.cpp:2880] Resuming __auth_handlers__(2)@66.70.182.167:35815 at 2019-03-11 15:08:53.273557888+00:00 I0311 11:08:53.329260 34496 process.cpp:2880] Resuming metrics@66.70.182.167:35815 at 2019-03-11 15:08:53.273557888+00:00 I0311 11:08:53.353018 34486 process.cpp:2803] Dropping event for process slave(1)@66.70.182.167:35815 I0311 11:08:53.353040 34486 process.cpp:2803] Dropping event for process slave(1)@66.70.182.167:35815 I0311 11:08:53.353063 34486 process.cpp:2803] Dropping event for process slave(1)@66.70.182.167:35815 I0311 11:08:53.353080 34486 process.cpp:2803] Dropping event for process slave(1)@66.70.182.167:35815 I0311 11:08:53.353097 34486 process.cpp:2803] Dropping event for process slave(1)@66.70.182.167:35815 [...] {noformat} It's not immediately clear to me what the correct fix for this would be. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
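One possible direction, explicitly not a vetted fix: unregister the metrics in `Slave::finalize()` rather than only in the destructor, so that `terminate(self())` in a test cannot leave dangling pull gauges (member names illustrative):
{noformat}
#include <process/metrics/metrics.hpp>

// Not a vetted fix -- just one direction: since terminate(self()) only
// triggers finalize() and never ~Slave(), unregister the metrics here
// so a unit test cannot end up with dangling pull gauges.
void Slave::finalize()
{
  // ... existing cleanup ...

  // Member names illustrative; the point is one remove() per metric
  // that the constructor added via process::metrics::add().
  process::metrics::remove(metrics.uptime_secs);
  process::metrics::remove(metrics.tasks_running);
  // ... and so on for the remaining gauges and counters ...
}
{noformat}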
[jira] [Commented] (MESOS-8241) Add metrics for offer operation feedback
[ https://issues.apache.org/jira/browse/MESOS-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783566#comment-16783566 ] Benno Evers commented on MESOS-8241: I've opened a review for the scope that is outlined in the comment above at: https://reviews.apache.org/r/70116/ Some ideas I've had for further metrics that might become interesting: Master-wide versions of the per-framework metrics we currently collect about operation types: - master/operations/create_disk/finished - master/operations/create_disk/dropped - [...] A counter to see how many user-provided operations failed validation: - master/invalid_operations A per-framework counter for the number of unacknowledged operations. A counter for the total number of operation update retries. > Add metrics for offer operation feedback > > > Key: MESOS-8241 > URL: https://issues.apache.org/jira/browse/MESOS-8241 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Assignee: Benno Evers >Priority: Major > Labels: foundations, mesosphere, operation-feedback > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
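As a sketch, one of the proposed counters wired up with libprocess metrics; the struct and usage are illustrative, only the metric name is taken from the list above:
{noformat}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Illustrative wrapper; only the metric name "master/invalid_operations"
// is taken from the list above.
struct OperationMetrics
{
  OperationMetrics()
    : invalid_operations("master/invalid_operations")
  {
    process::metrics::add(invalid_operations);
  }

  ~OperationMetrics()
  {
    process::metrics::remove(invalid_operations);
  }

  // Incremented whenever a user-provided operation fails validation:
  //
  //   if (error.isSome()) {
  //     metrics.invalid_operations++;
  //   }
  process::metrics::Counter invalid_operations;
};
{noformat}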
[jira] [Created] (MESOS-9611) Add `/machines` endpoint to show mapping between machines and agents
Benno Evers created MESOS-9611: -- Summary: Add `/machines` endpoint to show mapping between machines and agents Key: MESOS-9611 URL: https://issues.apache.org/jira/browse/MESOS-9611 Project: Mesos Issue Type: Improvement Reporter: Benno Evers It is currently quite hard to get information about the machines known to the master. This can result in situations that are hard to debug for silly reasons, e.g. mistyping a machine id when posting a maintenance schedule. It would be nice to have an endpoint that displays the current mapping between machine ids and agents to the user. This could become a new endpoint like `/machines` or `/machine/info`, or be added as part of an existing one like `/maintenance/status`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
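A rough sketch of what such an endpoint could return, using libprocess HTTP and stout JSON; the `machines_` map from machine id to agents and the route wiring are hypothetical:
{noformat}
#include <string>
#include <vector>

#include <process/future.hpp>
#include <process/http.hpp>

#include <stout/foreach.hpp>
#include <stout/hashmap.hpp>
#include <stout/json.hpp>

// Hypothetical data source: machine id -> agents on that machine.
hashmap<std::string, std::vector<std::string>> machines_;

process::Future<process::http::Response> machines(
    const process::http::Request& request)
{
  JSON::Array array;

  foreachpair (
      const std::string& machineId,
      const std::vector<std::string>& agents,
      machines_) {
    JSON::Object entry;
    entry.values["machine_id"] = machineId;

    JSON::Array agentList;
    foreach (const std::string& agent, agents) {
      agentList.values.push_back(JSON::String(agent));
    }

    entry.values["agents"] = agentList;
    array.values.push_back(entry);
  }

  return process::http::OK(array);
}
{noformat}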
[jira] [Created] (MESOS-9588) Add a way to view current offer filters
Benno Evers created MESOS-9588: -- Summary: Add a way to view current offer filters Key: MESOS-9588 URL: https://issues.apache.org/jira/browse/MESOS-9588 Project: Mesos Issue Type: Improvement Reporter: Benno Evers Looking at just Mesos, it's currently not possible to see which offer filters are active, or for how long. The closest one can get is to check whether a filter currently exists, either by looking at the `/metrics/snapshot` endpoint if per-framework metrics are enabled, or by scanning the master logs for this message {noformat} VLOG(1) << "Filtered offer with " << resources << " on agent " << slaveId << " for role " << role << " of framework " << frameworkId; {noformat} However, that does not tell the user how long the filter has been there, which resources it contains, or how long it will stay. Maybe MESOS-8621 would be a viable way to surface this information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9585) Agent host IP can differ between subpages in the WebUI
Benno Evers created MESOS-9585: -- Summary: Agent host IP can differ between subpages in the WebUI Key: MESOS-9585 URL: https://issues.apache.org/jira/browse/MESOS-9585 Project: Mesos Issue Type: Bug Reporter: Benno Evers Attachments: mesos_agent_ip.webm Apparently, the WebUI obtains the agent host IP from different sources for the "Agents" tab and for the information page of an individual agent. For example, in the attached video the host IP of the same agent is shown once as 172.31.3.68 (the correct IP) and once as 172.31.10.48. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9584) Inactive frameworks show incorrect "Registered Time" in Web UI
Benno Evers created MESOS-9584: -- Summary: Inactive frameworks show incorrect "Registered Time" in Web UI Key: MESOS-9584 URL: https://issues.apache.org/jira/browse/MESOS-9584 Project: Mesos Issue Type: Bug Reporter: Benno Evers Attachments: image-2019-02-19-16-48-04-927.png Currently, inactive frameworks have their "Registered Time" shown as `1970-01-01` in the WebUI (see attached screenshot): !image-2019-02-19-16-48-04-927.png! Instead, this should probably be displayed as "-" to indicate that this field does not have a useful value for these frameworks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9490) Support accepting gzipped responses in libprocess
[ https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771167#comment-16771167 ] Benno Evers commented on MESOS-9490: [~bmahler], seems I remembered wrong, after re-running the test above it's actually not a CHECK failure but just a normal error: {noformat} [ RUN ] MasterLoadTest.AcceptEncoding I0218 10:45:26.316328 70511 cluster.cpp:174] Creating default 'local' authorizer I0218 10:45:26.318068 70572 master.cpp:414] Master 67635eb2-df26-4db8-a5e4-a5f3aa9f3ebc (core1.hw.ca1.mesosphere.com) started on 66.70.182.167:46839 I0218 10:45:26.318110 70572 master.cpp:417] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/qKeUnl/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/qKeUnl/master" --zk_session_timeout="10secs" I0218 10:45:26.319782 70572 master.cpp:466] Master only allowing authenticated frameworks to register I0218 10:45:26.319829 70572 master.cpp:472] Master only allowing authenticated agents to register I0218 10:45:26.319839 70572 master.cpp:478] Master only allowing authenticated HTTP frameworks to register I0218 10:45:26.319851 70572 credentials.hpp:37] Loading credentials for authentication from '/tmp/qKeUnl/credentials' I0218 10:45:26.320096 70572 master.cpp:522] Using default 'crammd5' authenticator I0218 10:45:26.320171 70572 authenticator.cpp:520] Initializing server SASL I0218 10:45:26.325443 70572 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0218 10:45:26.325582 70572 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0218 10:45:26.325675 70572 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0218 10:45:26.325704 70572 master.cpp:603] Authorization enabled I0218 10:45:26.329525 70572 master.cpp:2103] Elected as the leading master! 
I0218 10:45:26.329560 70572 master.cpp:1638] Recovering from registrar I0218 10:45:26.331326 70526 registrar.cpp:383] Successfully fetched the registry (0B) in 1.668864ms I0218 10:45:26.331449 70526 registrar.cpp:487] Applied 1 operations in 38387ns; attempting to update the registry I0218 10:45:26.331748 70530 registrar.cpp:544] Successfully updated the registry in 259072ns I0218 10:45:26.331821 70530 registrar.cpp:416] Successfully recovered registrar I0218 10:45:26.331980 70530 master.cpp:1752] Recovered 0 agents from the registry (172B); allowing 10mins for agents to reregister I0218 10:45:26.334493 70554 http.cpp:1105] HTTP GET for /master//state from 66.70.182.167:59082 I0218 10:45:26.335484 70552 http.cpp:1122] HTTP GET for /master//state from 66.70.182.167:59082: '200 OK' after 2.06899ms ../../src/tests/master_load_tests.cpp:570: Failure (response).failure(): Failed to decode response I0218 10:45:26.336654 70511 master.cpp:1109] Master terminating [ FAILED ] MasterLoadTest.AcceptEncoding (22 ms) {noformat} > Support accepting gzipped responses in libprocess > - > > Key: MESOS-9490 > URL: https://issues.apache.org/jira/browse/MESOS-9490 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Priority: Major > Labels: libprocess > > Currently all libprocess endpoints support the serving of gzipped responses > when the client is requesting this with an `Accept-Encoding: gzip` header. > However, libprocess does not support receiving gzipped responses, failing > wit
[jira] [Comment Edited] (MESOS-9490) Support accepting gzipped responses in libprocess
[ https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769234#comment-16769234 ] Benno Evers edited comment on MESOS-9490 at 2/15/19 11:57 AM: -- [~bmahler], the full code which originally hit this issue is pasted in the linked issue, a more minimal version looks like this: {noformat} TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) { Try<Owned<cluster::Master>> master = StartMaster(); Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL); Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}}; auto response = process::http::get( master.get()->pid, "/state", None(), authHeaders + acceptGzipHeaders); AWAIT_READY(response); } {noformat} If I remember correctly, running this test leads to a segfault due to some internal CHECK failure. was (Author: bennoe): [~bmahler], the full code which originally hit this issue is pasted in the linked issue, a more minimal version looks like this: {noformat} TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) { Try<Owned<cluster::Master>> master = StartMaster(); Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL); Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}}; auto response = process::http::get( master.get()->pid, "/state", None(), authHeaders + acceptGzipHeaders); AWAIT_READY(response); } {noformat} > Support accepting gzipped responses in libprocess > - > > Key: MESOS-9490 > URL: https://issues.apache.org/jira/browse/MESOS-9490 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Priority: Major > Labels: libprocess > > Currently all libprocess endpoints support the serving of gzipped responses > when the client is requesting this with an `Accept-Encoding: gzip` header. > However, libprocess does not support receiving gzipped responses, failing > with a decode error in this case. > For symmetry, we should try to support compression in this case as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9490) Support accepting gzipped responses in libprocess
[ https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769234#comment-16769234 ] Benno Evers commented on MESOS-9490: [~bmahler], the full code which originally hit this issue is pasted in the linked issue, a more minimal version looks like this: {noformat} TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) { Try<Owned<cluster::Master>> master = StartMaster(); Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL); Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}}; auto response = process::http::get( master.get()->pid, "/state", None(), authHeaders + acceptGzipHeaders); AWAIT_READY(response); } {noformat} > Support accepting gzipped responses in libprocess > - > > Key: MESOS-9490 > URL: https://issues.apache.org/jira/browse/MESOS-9490 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Priority: Major > Labels: libprocess > > Currently all libprocess endpoints support the serving of gzipped responses > when the client is requesting this with an `Accept-Encoding: gzip` header. > However, libprocess does not support receiving gzipped responses, failing > with a decode error in this case. > For symmetry, we should try to support compression in this case as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9575) Mesos Web UI can't display relative timestamps in the future
Benno Evers created MESOS-9575: -- Summary: Mesos Web UI can't display relative timestamps in the future Key: MESOS-9575 URL: https://issues.apache.org/jira/browse/MESOS-9575 Project: Mesos Issue Type: Bug Reporter: Benno Evers The `relativeDate()` function used by the Mesos WebUI (https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=src/webui/assets/libs/relative-date.js;hb=HEAD) is only able to handle dates in the past. All dates in the future are rendered as "just now". This can be especially confusing when posting maintenance windows, where usually both dates are in the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9569) Missing master-side validation of UpdateOperationStatusMessage
Benno Evers created MESOS-9569: -- Summary: Missing master-side validation of UpdateOperationStatusMessage Key: MESOS-9569 URL: https://issues.apache.org/jira/browse/MESOS-9569 Project: Mesos Issue Type: Bug Reporter: Benno Evers The master is currently not validating incoming `UpdateOperationStatusMessage`s, and is performing `CHECK()`s on the values of certain protobuf fields of the message. This means a malformed HTTP request can trigger a master crash. This can be reproduced e.g. by executing code like this on a master host: {noformat} import urllib.request rq = urllib.request.Request("http://localhost:5050/master/mesos.internal.UpdateOperationStatusMessage", headers={"Libprocess-From": "foo@127.0.1.1:5052"}, method="POST", data=b'\x1a\x02\x10\x01*\x05\n\x03xxx') rsp = urllib.request.urlopen(rq).read() {noformat} (where the posted data is just an UpdateOperationStatusMessage protobuf without a slave_id, serialized as a string) {noformat} F0213 13:14:22.507489 16492 master.cpp:8413] Check failed: update.has_slave_id() External resource provider is not supported yet {noformat} Looking at other internal messages, some of them already have a validation step implemented (e.g. RegisterSlaveMessage), so we should probably add something similar for this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
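A minimal sketch of the missing validation step, mirroring the pattern used for `RegisterSlaveMessage`; the function shape is hypothetical and only covers the `slave_id` field from the crash above:
{noformat}
#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

#include "messages/messages.hpp"  // UpdateOperationStatusMessage.

// Hypothetical validation entry point, to be called before any field
// of the message is CHECK'd; further field checks would follow the
// same pattern.
Option<Error> validate(
    const mesos::internal::UpdateOperationStatusMessage& update)
{
  if (!update.has_slave_id()) {
    return Error("Expected 'slave_id' to be set");
  }

  return None();
}
{noformat}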
[jira] [Commented] (MESOS-9521) MasterAPITest.OperationUpdatesUponAgentGone is flaky
[ https://issues.apache.org/jira/browse/MESOS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742276#comment-16742276 ] Benno Evers commented on MESOS-9521: Review: https://reviews.apache.org/r/69726/ The warning is known, but due to the caveat that is printed right below the warning {noformat} NOTE: You can safely ignore the above warning unless this call should not happen. Do not suppress it by blindly adding an EXPECT_CALL() if you don't mean to enforce the call. See https://github.com/google/googletest/blob/master/googlemock/docs/CookBook.md#knowing-when-to-expect for details. {noformat} I left it, because the test does not really care about whether `disconnect()` is called or not. > MasterAPITest.OperationUpdatesUponAgentGone is flaky > > > Key: MESOS-9521 > URL: https://issues.apache.org/jira/browse/MESOS-9521 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.8.0 > Environment: Fedora28, cmake w/ SSL >Reporter: Benjamin Bannier >Priority: Major > Labels: flaky, flaky-test > > The recently added test {{MasterAPITest.OperationUpdatesUponAgentGone}} is > flaky, e.g., > {noformat}../src/tests/api_tests.cpp:5051: Failure > Value of: resources.empty() > Actual: true > Expected: false > ../3rdparty/libprocess/src/../include/process/gmock.hpp:504: Failure > Actual function call count doesn't match EXPECT_CALL(filter->mock, filter(to, > testing::A()))... > Expected args: message matcher (32-byte object 24-00 00-00 00-00 00-00 24-00 00-00 00-00 00-00 41-63 74-75 61-6C 20-66>, > 1-byte object ) > Expected: to be called once >Actual: never called - unsatisfied and active > {noformat} > I am able to reproduce this reliable in less than 10 iterations when running > the test in repetition under additional system stress. > Even if the test does not fail it produces the following gmock warning, > {noformat} > GMOCK WARNING: > Uninteresting mock function call - returning directly. > Function call: disconnected() > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9394) Maintenance of machine A causes "Removing offers" for machine B.
[ https://issues.apache.org/jira/browse/MESOS-9394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740647#comment-16740647 ] Benno Evers commented on MESOS-9394: Both the analysis and the proposed change look correct to me - the current behaviour certainly does not match what the documentation at http://mesos.apache.org/documentation/latest/maintenance/#scheduling-maintenance suggests. [~carlone], if you want to keep credit for the fix I'd suggest going ahead and posting a patch to ReviewBoard; otherwise, if you prefer, I can also do that for you. > Maintenance of machine A causes "Removing offers" for machine B. > > > Key: MESOS-9394 > URL: https://issues.apache.org/jira/browse/MESOS-9394 > Project: Mesos > Issue Type: Bug >Reporter: longfei >Assignee: longfei >Priority: Major > Labels: maintenance > > If I schedule machine A in a maintenance call, the logic in > "___updateMaintenanceSchedule" will check all the master's machines. > Another machine (say machine B) not in the maintenance schedule will be set to > UP Mode and call "updateUnavailability". This results in removing all offers > of slaves on machine B. > If I am using these offers to run some tasks, these tasks would be lost for > REASON_INVALID_OFFERS. > I think a maintenance schedule should not affect machines not in it. Is that > right? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9472) Unblock operation feedback on agent default resources.
[ https://issues.apache.org/jira/browse/MESOS-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-9472: -- Assignee: Benno Evers > Unblock operation feedback on agent default resources. > -- > > Key: MESOS-9472 > URL: https://issues.apache.org/jira/browse/MESOS-9472 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Gastón Kleiman >Assignee: Benno Evers >Priority: Major > Labels: foundations, mesosphere, operation-feedback > > # Remove {{CHECK}} marked with a TODO in {{Master::updateOperationStatus()}}. > # Update {{Master::acknowledgeOperationStatus()}}, remove the CHECK requiring > a resource provider ID. > # Remove validation in {{Option validate(mesos::scheduler::Call& call, > const Option& principal)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8783) Transition pending operations to OPERATION_UNREACHABLE when an agent is removed.
[ https://issues.apache.org/jira/browse/MESOS-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734500#comment-16734500 ] Benno Evers commented on MESOS-8783: Opened a review for the first paragraph here: https://reviews.apache.org/r/69669/ The second part needs a bit more consideration, and should probably be done in a separate ticket. It might not be necessary to send updates from the master when the agent reconnects, since at that point the agent can send the updated operation statuses itself. > Transition pending operations to OPERATION_UNREACHABLE when an agent is > removed. > > > Key: MESOS-8783 > URL: https://issues.apache.org/jira/browse/MESOS-8783 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.5.0, 1.6.0 >Reporter: Gastón Kleiman >Assignee: Benno Evers >Priority: Critical > Labels: foundations, mesosphere > Fix For: 1.8.0 > > > Pending operations on an agent should be transitioned to > `OPERATION_UNREACHABLE` when an agent is marked unreachable. We should also > make sure that we pro-actively send operation status updates for these > operations when the agent becomes unreachable. > We should also make sure that we send new operation updates if/when the agent > reconnects - perhaps this is already accomplished with the existing operation > update logic in the agent? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9506) Master will leak operations when agents are removed
Benno Evers created MESOS-9506: -- Summary: Master will leak operations when agents are removed Key: MESOS-9506 URL: https://issues.apache.org/jira/browse/MESOS-9506 Project: Mesos Issue Type: Bug Reporter: Benno Evers Usually, offer operations are removed when the framework acknowledges a terminal operation status update. However, currently only operations on registered agents can be acknowledged, so operations on agents which don't come back will be permanently leaked. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9506) Master will leak operations when agents are removed
[ https://issues.apache.org/jira/browse/MESOS-9506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732203#comment-16732203 ] Benno Evers commented on MESOS-9506: https://reviews.apache.org/r/69597 > Master will leak operations when agents are removed > > > Key: MESOS-9506 > URL: https://issues.apache.org/jira/browse/MESOS-9506 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > > Usually, offer operations are removed when the framework acknowledges > a terminal operation status update. > However, currently only operations on registered agents can be > acknowledged, so operations on agents which don't come back will be > permanently leaked. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9494) Add a unit test for the interaction between request batching and response compression
Benno Evers created MESOS-9494: -- Summary: Add a unit test for the interaction between request batching and response compression Key: MESOS-9494 URL: https://issues.apache.org/jira/browse/MESOS-9494 Project: Mesos Issue Type: Improvement Reporter: Benno Evers As discussed in https://reviews.apache.org/r/69064/ , we should try to add a unit test that verifies that simultaneous requests with different `Accept-Encoding` headers produce different responses. It could look like this: {noformat} TEST_F(MasterLoadTest, AcceptEncoding) { MockAuthorizer authorizer; prepareCluster(&authorizer); Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL); Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}}; Headers acceptRawHeaders = {{"Accept-Encoding", "raw"}}; RequestDescriptor descriptor1; descriptor1.endpoint = "/state"; descriptor1.headers = authHeaders + acceptGzipHeaders; RequestDescriptor descriptor2 = descriptor1; descriptor2.headers = authHeaders + acceptRawHeaders; auto responses = launchSimultaneousRequests({descriptor1, descriptor2}); foreachpair ( const RequestDescriptor& request, Future<http::Response>& response, responses) { AWAIT_READY(response); ASSERT_SOME(request.headers.get("Accept-Encoding")); if (request.headers.get("Accept-Encoding").get() == "gzip") { ASSERT_SOME(response->headers.get("Content-Encoding")); EXPECT_EQ(response->headers.get("Content-Encoding").get(), "gzip"); } else { EXPECT_NONE(response->headers.get("Content-Encoding")); } } // Ensure that we actually hit the metrics code path while executing // the test. JSON::Object metrics = Metrics(); ASSERT_TRUE(metrics.values["master/http_cache_hits"].is<JSON::Number>()); ASSERT_GT( metrics.values["master/http_cache_hits"].as<JSON::Number>().as<uint64_t>(), 0u); } {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8782) Transition operations to OPERATION_GONE_BY_OPERATOR when marking an agent gone.
[ https://issues.apache.org/jira/browse/MESOS-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725230#comment-16725230 ] Benno Evers commented on MESOS-8782: Review: https://reviews.apache.org/r/69575/ > Transition operations to OPERATION_GONE_BY_OPERATOR when marking an agent > gone. > --- > > Key: MESOS-8782 > URL: https://issues.apache.org/jira/browse/MESOS-8782 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.5.0, 1.6.0 >Reporter: Gastón Kleiman >Assignee: Benno Evers >Priority: Critical > Labels: foundations > Fix For: 1.8.0 > > > The master should transition operations to the state > {{OPERATION_GONE_BY_OPERATOR}} when an agent is marked gone, sending an > operation status update to the frameworks that created them. > We should also remove them from {{Master::frameworks}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9490) Support accepting gzipped responses in libprocess
Benno Evers created MESOS-9490: -- Summary: Support accepting gzipped responses in libprocess Key: MESOS-9490 URL: https://issues.apache.org/jira/browse/MESOS-9490 Project: Mesos Issue Type: Improvement Reporter: Benno Evers Currently all libprocess endpoints support the serving of gzipped responses when the client is requesting this with an `Accept-Encoding: gzip` header. However, libprocess does not support receiving gzipped responses, failing with a decode error in this case. For symmetry, we should try to support compression in this case as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
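A sketch of the missing client-side piece, using stout's `gzip::decompress` (available when building with zlib); where exactly this would hook into the libprocess response decoder is left open:
{noformat}
#include <string>

#include <process/http.hpp>

#include <stout/error.hpp>
#include <stout/gzip.hpp>
#include <stout/option.hpp>
#include <stout/stringify.hpp>
#include <stout/try.hpp>

// Decompress a fully received response when the server declared
// `Content-Encoding: gzip`.
Try<process::http::Response> decodeBody(process::http::Response response)
{
  Option<std::string> encoding = response.headers.get("Content-Encoding");

  if (encoding.isSome() && encoding.get() == "gzip") {
    Try<std::string> decompressed = gzip::decompress(response.body);

    if (decompressed.isError()) {
      return Error(
          "Failed to decompress response body: " + decompressed.error());
    }

    response.body = decompressed.get();
    response.headers["Content-Length"] = stringify(response.body.size());
  }

  return response;
}
{noformat}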
[jira] [Created] (MESOS-9484) GroupTest.GroupDataWithDisconnect is flaky
Benno Evers created MESOS-9484: -- Summary: GroupTest.GroupDataWithDisconnect is flaky Key: MESOS-9484 URL: https://issues.apache.org/jira/browse/MESOS-9484 Project: Mesos Issue Type: Bug Environment: Mac OSX w/ libevent Reporter: Benno Evers Observed the following error in our CI: {noformat} ../../src/tests/group_tests.cpp:129: Failure data.get() is NONE {noformat} Full log: {noformat} [ RUN ] GroupTest.GroupDataWithDisconnect I1214 15:06:53.386937 398710208 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 51193 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-12-14 15:06:53,387:69505(0x739ee000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:51193 sessionTimeout=1 watcher=0x11a65f9a0 sessionId=0 sessionPasswd= context=0x7fcd06163550 flags=0 2018-12-14 15:06:53,387:69505(0x74415000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:51193] 2018-12-14 15:06:53,389:69505(0x74415000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:51193], sessionId=0x167aef9004a, negotiated timeout=1 I1214 15:06:53.389168 60743680 group.cpp:341] Group process (zookeeper-group(40)@10.0.49.4:49309) connected to ZooKeeper I1214 15:06:53.389210 60743680 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0) I1214 15:06:53.389227 60743680 group.cpp:419] Trying to create path '/test' in ZooKeeper I1214 15:06:53.392253 398710208 zookeeper_test_server.cpp:116] Shutting down ZooKeeperTestServer on port 51193 2018-12-14 15:06:53,393:69505(0x74415000):ZOO_ERROR@handle_socket_error_msg@1782: Socket [127.0.0.1:51193] zk retcode=-4, errno=64(Host is down): failed while receiving a server response I1214 15:06:53.393187 59133952 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ... I1214 15:06:53.393661 59670528 group.cpp:700] Trying to get '/test/00' in ZooKeeper 2018-12-14 15:06:53,393:69505(0x74415000):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:51193] zk retcode=-4, errno=61(Connection refused): server refused to accept the client I1214 15:06:53.395321 398710208 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 51193 W1214 15:07:04.003191 59670528 group.cpp:495] Timed out waiting to connect to ZooKeeper. 
Forcing ZooKeeper session (sessionId=167aef9004a) expiration I1214 15:07:04.003652 59670528 group.cpp:511] ZooKeeper session expired 2018-12-14 15:07:04,004:69505(0x738e8000):ZOO_INFO@zookeeper_close@2579: Freeing zookeeper resources for sessionId=0x167aef9004a 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-12-14 15:07:04,004:69505(0x739ee000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:51193 sessionTimeout=1 watcher=0x11a65f9
[jira] [Created] (MESOS-9483) ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors is flaky
Benno Evers created MESOS-9483: -- Summary: ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors is flaky Key: MESOS-9483 URL: https://issues.apache.org/jira/browse/MESOS-9483 Project: Mesos Issue Type: Bug Environment: Mac OSX w/ libevent Reporter: Benno Evers Observed a failure with the following error: {noformat} ../../src/tests/master_contender_detector_tests.cpp:409: Failure Failed to wait 15secs for group1.join("data") {noformat} Full log: {noformat} [ RUN ] ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors I1214 15:03:56.036525 398710208 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 50199 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-12-14 15:03:56,036:69505(0x7396b000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:50199 sessionTimeout=1 watcher=0x11a65f9a0 sessionId=0 sessionPasswd= context=0x7fcd061125a0 flags=0 2018-12-14 15:03:56,037:69505(0x74415000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:50199] 2018-12-14 15:03:56,039:69505(0x74415000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:50199], sessionId=0x167aef64b83, negotiated timeout=1 I1214 15:03:56.039242 60207104 group.cpp:341] Group process (zookeeper-group(14)@10.0.49.4:49309) connected to ZooKeeper I1214 15:03:56.039286 60207104 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0) I1214 15:03:56.039309 60207104 group.cpp:395] Authenticating with ZooKeeper using digest 2018-12-14 15:04:05,989:69505(0x74415000):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 6619ms 2018-12-14 15:04:05,989:69505(0x74415000):ZOO_ERROR@handle_socket_error_msg@1702: Socket [127.0.0.1:50199] zk retcode=-7, errno=60(Operation timed out): connection to 127.0.0.1:50199 timed out (exceeded timeout by 3284ms) 2018-12-14 15:04:05,989:69505(0x74415000):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 6619ms I1214 15:04:05.990031 60207104 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ... 2018-12-14 15:04:09,332:69505(0x74415000):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 9963ms 2018-12-14 15:04:09,332:69505(0x74415000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:50199] 2018-12-14 15:04:09,333:69505(0x74415000):ZOO_ERROR@handle_socket_error_msg@1800: Socket [127.0.0.1:50199] zk retcode=-112, errno=70(Stale NFS file handle): sessionId=0x167aef64b83 has expired. 
I1214 15:04:09.333552 59670528 group.cpp:511] ZooKeeper session expired 2018-12-14 15:04:09,333:69505(0x738e8000):ZOO_INFO@zookeeper_close@2579: Freeing zookeeper resources for sessionId=0x167aef64b83 2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0 2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-12-14 15:04:09,333:69505(0x7375f000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-12-14 15:04:09,333:69505(0x737
[jira] [Created] (MESOS-9478) ZooKeeperTest.Create is flaky
Benno Evers created MESOS-9478: -- Summary: ZooKeeperTest.Create is flaky Key: MESOS-9478 URL: https://issues.apache.org/jira/browse/MESOS-9478 Project: Mesos Issue Type: Bug Environment: Mac OSX w/ libeven Reporter: Benno Evers Observed the following {noformat} ../../src/tests/zookeeper_tests.cpp:124 Expected: ZNODEEXISTS Which is: -110 To be equal to: nonOwnerZk.create("/foo/bar/baz", "", zookeeper::EVERYONE_READ_CREATOR_ALL, 0, nullptr, true) Which is: -9 {noformat} Full log: {noformat} [ RUN ] ZooKeeperTest.Create I1213 18:43:49.478912 222864832 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 57250 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-12-13 18:43:49,479:66260(0x75a71000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:57250 sessionTimeout=1 watcher=0x10fea4f00 sessionId=0 sessionPasswd= context=0x7fe4d5e7c680 flags=0 2018-12-13 18:43:49,479:66260(0x7659e000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:57250] 2018-12-13 18:43:49,480:66260(0x7659e000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:57250], sessionId=0x167aa994066, negotiated timeout=1 2018-12-13 18:43:52,819:66260(0x7659e000):ZOO_INFO@auth_completion_func@1327: Authentication scheme digest succeeded 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-12-13 18:43:52,823:66260(0x75c7d000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:57250 sessionTimeout=1 watcher=0x10fea4f00 sessionId=0 sessionPasswd= context=0x7fe4d5cf7a20 flags=0 2018-12-13 
18:43:52,823:66260(0x76d36000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:57250] 2018-12-13 18:43:52,824:66260(0x76d36000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:57250], sessionId=0x167aa9940660001, negotiated timeout=1 2018-12-13 18:44:05,891:66260(0x7659e000):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 9735ms 2018-12-13 18:44:05,891:66260(0x7659e000):ZOO_ERROR@handle_socket_error_msg@1702: Socket [127.0.0.1:57250] zk retcode=-7, errno=60(Operation timed out): connection to 127.0.0.1:57250 timed out (exceeded timeout by 6402ms) 2018-12-13 18:44:05,891:66260(0x7659e000):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 9735ms 2018-12-13 18:44:05,892:66260(0x76d36000):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 9736ms 2018-12-13 18:44:05,892:66260(0x76d36000):ZOO_ERROR@handle_socket_error_msg@1702: Socket [127.0.0.1:57250] zk retcode=-7, errno=60(Operation timed out): connection to 127.0.0.1:57250 timed out (exceeded timeout by 6402ms) 2018-12-13 18:44:05,892:66260(0x76d36000):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 9736m
[jira] [Commented] (MESOS-9247) MasterAPITest.EventAuthorizationFiltering is flaky
[ https://issues.apache.org/jira/browse/MESOS-9247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721328#comment-16721328 ] Benno Evers commented on MESOS-9247: Observed the same failure today on a CentOS 7 build. > MasterAPITest.EventAuthorizationFiltering is flaky > -- > > Key: MESOS-9247 > URL: https://issues.apache.org/jira/browse/MESOS-9247 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.7.0 >Reporter: Greg Mann >Assignee: Till Toenshoff >Priority: Minor > Labels: flaky, flaky-test, integration, mesosphere > Attachments: MasterAPITest.EventAuthorizationFiltering.txt > > > Saw this failure on a CentOS 6 SSL build in our internal CI. Build log > attached. For some reason, it seems that the initial {{TASK_ADDED}} event is > missed: > {code} > ../../src/tests/api_tests.cpp:2922 > Expected: v1::master::Event::TASK_ADDED > Which is: TASK_ADDED > To be equal to: event->get().type() > Which is: TASK_UPDATED > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9468) SlaveTest.AgentFailoverTerminatesHTTPExecutorWithNoTask is flaky
Benno Evers created MESOS-9468: -- Summary: SlaveTest.AgentFailoverTerminatesHTTPExecutorWithNoTask is flaky Key: MESOS-9468 URL: https://issues.apache.org/jira/browse/MESOS-9468 Project: Mesos Issue Type: Bug Environment: Mac OSX with ssl enabled Reporter: Benno Evers The following test failure was observed in an internal CI run: {noformat} ../../src/tests/slave_tests.cpp:6341: Failure Actual function call count doesn't match EXPECT_CALL(*slave.get()->mock(), _shutdownExecutor(_, _))... Expected: to be called once Actual: never called - unsatisfied and active {noformat} Full log: {noformat} [ RUN ] SlaveTest.AgentFailoverTerminatesHTTPExecutorWithNoTask I1210 16:20:13.298667 338650560 cluster.cpp:173] Creating default 'local' authorizer I1210 16:20:13.36 238522368 master.cpp:414] Master 4c470ddd-dc29-4d9c-9b46-8e7a8b6c7801 (Jenkinss-Mac-mini.local) started on 10.0.49.4:54069 I1210 16:20:13.300034 238522368 master.cpp:417] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ntg04w/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ntg04w/master" --zk_session_timeout="10secs" I1210 16:20:13.300215 238522368 master.cpp:466] Master only allowing authenticated frameworks to register I1210 16:20:13.300227 238522368 master.cpp:472] Master only allowing authenticated agents to register I1210 16:20:13.300237 238522368 master.cpp:478] Master only allowing authenticated HTTP frameworks to register I1210 16:20:13.300246 238522368 credentials.hpp:37] Loading credentials for authentication from '/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ntg04w/credentials' I1210 16:20:13.300427 238522368 master.cpp:522] Using default 'crammd5' authenticator I1210 16:20:13.300489 238522368 http.cpp:1017] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1210 16:20:13.300559 238522368 http.cpp:1017] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1210 16:20:13.300607 238522368 http.cpp:1017] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1210 16:20:13.300657 238522368 master.cpp:603] Authorization enabled I1210 16:20:13.300863 
237985792 whitelist_watcher.cpp:77] No whitelist given I1210 16:20:13.300884 239058944 hierarchical.cpp:175] Initialized hierarchical allocator process I1210 16:20:13.302809 235302912 master.cpp:2089] Elected as the leading master! I1210 16:20:13.302834 235302912 master.cpp:1644] Recovering from registrar I1210 16:20:13.302875 237985792 registrar.cpp:339] Recovering registrar I1210 16:20:13.303133 237985792 registrar.cpp:383] Successfully fetched the registry (0B) in 08ns I1210 16:20:13.303207 237985792 registrar.cpp:487] Applied 1 operations in 24653ns; attempting to update the registry I1210 16:20:13.303490 237985792 registrar.cpp:544] Successfully updated the registry in 258048ns I1210 16:20:13.303539 237985792 registrar.cpp:416] Successfully recovered registrar I1210 16:20:13.303692 236376064 master.cpp:1758] Recovered 0 agents from the registry (155B); allowing 10mins for agents to reregister I1210 16:20:13.303723 235839488 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover W1210 16:20:13.306483 338650560 process.cpp:2829] Attempted to spawn already running process files@10.0.49.4:54069 I1210 16:20:13.307142 338650560 containerizer.cpp:305] Using isolation { environment_secre
[jira] [Created] (MESOS-9467) ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster is flaky
Benno Evers created MESOS-9467: -- Summary: ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster is flaky Key: MESOS-9467 URL: https://issues.apache.org/jira/browse/MESOS-9467 Project: Mesos Issue Type: Bug Environment: Mac OSX with ssl enabled Reporter: Benno Evers The following error was observed in an internal CI run: {noformat} ../../src/tests/master_contender_detector_tests.cpp:872: Failure Failed to wait 15secs for detected {noformat} Full log: {noformat} [ RUN ] ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster I1210 16:18:13.068011 338650560 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 54990 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-12-10 16:18:13,068:28813(0x7e2f6000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:54990 sessionTimeout=1 watcher=0x116d03e00 sessionId=0 sessionPasswd= context=0x7fd3883958d0 flags=0 2018-12-10 16:18:13,068:28813(0x7ed1d000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:54990] I1210 16:18:13.069262 236376064 contender.cpp:152] Joining the ZK group 2018-12-10 16:18:13,070:28813(0x7ed1d000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:54990], sessionId=0x1679aa0ddc9, negotiated timeout=1 I1210 16:18:13.070789 239058944 group.cpp:341] Group process (zookeeper-group(28)@10.0.49.4:54069) connected to ZooKeeper I1210 16:18:13.070853 239058944 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0) I1210 16:18:13.070868 239058944 group.cpp:419] Trying to create path '/mesos' in ZooKeeper I1210 16:18:13.073835 235839488 contender.cpp:268] New candidate (id='0') has entered the contest for leadership I1210 16:18:13.074319 237985792 detector.cpp:152] Detected a new leader: (id='0') I1210 16:18:13.074406 237449216 group.cpp:700] Trying to get '/mesos/json.info_00' in ZooKeeper I1210 16:18:13.075139 239058944 zookeeper.cpp:262] A new leading master (UPID=@0.152.150.128:1) is detected 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-12-10 16:18:13,075:28813(0x7e273000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:54990 sessionTimeout=1 watcher=0x116d03e00 sessionId=0 sessionPasswd= context=0x7fd3886b40e0 flags=0 2018-12-10 16:18:13,075:28813(0x7f944000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:54990] I1210 16:18:13.076236 238522368 contender.cpp:152] Joining the ZK group 2018-12-10 16:18:13,077:28813(0x7f944000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:54990], sessionId=0x1679aa0ddc90001, negotiated timeout=1 I1210 16:18:13.077278 239058944 group.cpp:341] Group process (zookeeper-group
[jira] [Created] (MESOS-9466) FetcherCacheTest.LocalCachedMissing is flaky
Benno Evers created MESOS-9466: -- Summary: FetcherCacheTest.LocalCachedMissing is flaky Key: MESOS-9466 URL: https://issues.apache.org/jira/browse/MESOS-9466 Project: Mesos Issue Type: Bug Environment: Mac OSX with ssl enabled Reporter: Benno Evers Observed the following failure in an internal CI run: {noformat} ../../src/tests/fetcher_cache_tests.cpp:722: Failure Failed to wait 15secs for awaitFinished(task.get()) {noformat} Full log: {noformat} [ RUN ] FetcherCacheTest.LocalCachedMissing I1210 16:16:09.364095 338650560 cluster.cpp:173] Creating default 'local' authorizer I1210 16:16:09.365344 237985792 master.cpp:414] Master 57f28035-e5fa-4e2a-8b8c-1caf1f9c85ca (Jenkinss-Mac-mini.local) started on 10.0.49.4:54069 I1210 16:16:09.365368 237985792 master.cpp:417] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/OBl7Zi/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/OBl7Zi/master" --zk_session_timeout="10secs" I1210 16:16:09.365530 237985792 master.cpp:466] Master only allowing authenticated frameworks to register I1210 16:16:09.365541 237985792 master.cpp:472] Master only allowing authenticated agents to register I1210 16:16:09.365550 237985792 master.cpp:478] Master only allowing authenticated HTTP frameworks to register I1210 16:16:09.365559 237985792 credentials.hpp:37] Loading credentials for authentication from '/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/OBl7Zi/credentials' I1210 16:16:09.365763 237985792 master.cpp:522] Using default 'crammd5' authenticator I1210 16:16:09.365819 237985792 http.cpp:1017] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1210 16:16:09.365888 237985792 http.cpp:1017] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1210 16:16:09.365967 237985792 http.cpp:1017] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1210 16:16:09.366027 237985792 master.cpp:603] Authorization enabled I1210 16:16:09.366263 239058944 whitelist_watcher.cpp:77] No whitelist given I1210 16:16:09.366286 237449216 hierarchical.cpp:175] Initialized hierarchical allocator process I1210 
16:16:09.368378 237985792 master.cpp:2089] Elected as the leading master! I1210 16:16:09.368408 237985792 master.cpp:1644] Recovering from registrar I1210 16:16:09.368455 235839488 registrar.cpp:339] Recovering registrar I1210 16:16:09.368711 235839488 registrar.cpp:383] Successfully fetched the registry (0B) in 224us I1210 16:16:09.368775 235839488 registrar.cpp:487] Applied 1 operations in 23922ns; attempting to update the registry I1210 16:16:09.369017 235839488 registrar.cpp:544] Successfully updated the registry in 218112ns I1210 16:16:09.369065 235839488 registrar.cpp:416] Successfully recovered registrar I1210 16:16:09.369207 238522368 master.cpp:1758] Recovered 0 agents from the registry (155B); allowing 10mins for agents to reregister I1210 16:16:09.369225 236912640 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover W1210 16:16:09.369658 338650560 process.cpp:2829] Attempted to spawn already running process version@10.0.49.4:54069 I1210 16:16:09.370749 338650560 containerizer.cpp:305] Using isolation { environment_secret, filesystem/posix, posix/mem, posix/cpu } I1210 16:16:09.371047 338650560 provisioner.cpp:298] Using default backend 'copy' W1210 16:16:09.372812 338650560 process.cpp:2829] Attempted to
[jira] [Created] (MESOS-9465) ProcessRemoteLinkTest.RemoteStaleLinkRelink is flaky again
Benno Evers created MESOS-9465: -- Summary: ProcessRemoteLinkTest.RemoteStaleLinkRelink is flaky again Key: MESOS-9465 URL: https://issues.apache.org/jira/browse/MESOS-9465 Project: Mesos Issue Type: Bug Environment: Mac OSX with SSL enabled Reporter: Benno Evers The test failed with the following error in our internal CI: {noformat} [ RUN ] ProcessRemoteLinkTest.RemoteStaleLinkRelink [warn] kq_init: detected broken kqueue; not using.: No such process WARNING: Logging before InitGoogleLogging() is written to STDERR I1210 10:34:07.134811 351110592 process.cpp:1239] libprocess is initialized on 10.0.49.4:58630 with 8 worker threads I1210 10:34:07.137801 109821952 test_linkee.cpp:73] EXIT with status 0: ../../../3rdparty/libprocess/src/tests/process_tests.cpp:1176: Failure Mock function called more times than expected - returning directly. Function call: exited(@0x7f9ef7f0d888 (1)@10.0.49.4:58631) Expected: to be called once Actual: called twice - over-saturated and active W1210 10:34:07.139040 95457280 process.cpp:838] Failed to recv on socket 8 to peer 'unknown': Connection reset by peer [ FAILED ] ProcessRemoteLinkTest.RemoteStaleLinkRelink (22 ms) {noformat} Interestingly, looking at some context from the same CI run, it looks like many similar tests also had severe issues but still succeeded: {noformat} [ RUN ] ProcessRemoteLinkTest.RemoteDoubleLinkRelink [warn] kq_init: detected broken kqueue; not using.: No such process WARNING: Logging before InitGoogleLogging() is written to STDERR I1210 10:34:06.945520 368641472 process.cpp:1239] libprocess is initialized on 10.0.49.4:58618 with 8 worker threads W1210 10:34:06.948437 95457280 process.cpp:838] Failed to recv on socket 8 to peer 'unknown': Connection reset by peer W1210 10:34:06.948755 95457280 process.cpp:1423] Failed to recv on socket 11 to peer 'unknown': Connection reset by peer [ OK ] ProcessRemoteLinkTest.RemoteDoubleLinkRelink (21 ms) [ RUN ] ProcessRemoteLinkTest.RemoteLinkLeak [warn] kq_init: detected broken kqueue; not using.: No such process WARNING: Logging before InitGoogleLogging() is written to STDERR I1210 10:34:06.966291 379131328 process.cpp:1239] libprocess is initialized on 10.0.49.4:58623 with 8 worker threads W1210 10:34:07.055934 300283328 process.cpp:1587] Failed to link to '10.0.49.4:58624', create socket: Failed to create socket: Too many open files W1210 10:34:07.096643 95457280 process.cpp:838] Failed to recv on socket 8 to peer 'unknown': Connection reset by peer [ OK ] ProcessRemoteLinkTest.RemoteLinkLeak (148 ms) [ RUN ] ProcessRemoteLinkTest.RemoteUseStaleLink [warn] kq_init: detected broken kqueue; not using.: No such process WARNING: Logging before InitGoogleLogging() is written to STDERR I1210 10:34:07.114372 219854272 process.cpp:1239] libprocess is initialized on 10.0.49.4:58626 with 8 worker threads W1210 10:34:07.117367 95457280 process.cpp:838] Failed to recv on socket 8 to peer 'unknown': Connection reset by peer [ OK ] ProcessRemoteLinkTest.RemoteUseStaleLink (20 ms) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7217) CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716997#comment-16716997 ] Benno Evers commented on MESOS-7217: Same again on Centos 7 - I'm starting to see a pattern ;) {noformat} Expected: (0.30) >= (cpuTime), actual: 0.3 vs 0.3 {noformat} > CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky. > > > Key: MESOS-7217 > URL: https://issues.apache.org/jira/browse/MESOS-7217 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.1, 1.8.0 > Environment: ubuntu-14.04, centos-7 >Reporter: Till Toenshoff >Priority: Major > Labels: containerizer, flaky, flaky-test, mesosphere, test > > The test CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs appears to be flaky > on Ubuntu 14.04. > When failing, the test shows the following: > {noformat} > 14:05:48 [ RUN ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs > 14:05:48 I0306 14:05:48.704794 27340 cluster.cpp:158] Creating default > 'local' authorizer > 14:05:48 I0306 14:05:48.716588 27340 leveldb.cpp:174] Opened db in > 11.681905ms > 14:05:48 I0306 14:05:48.718921 27340 leveldb.cpp:181] Compacted db in > 2.309404ms > 14:05:48 I0306 14:05:48.718945 27340 leveldb.cpp:196] Created db iterator in > 3075ns > 14:05:48 I0306 14:05:48.718951 27340 leveldb.cpp:202] Seeked to beginning of > db in 558ns > 14:05:48 I0306 14:05:48.718955 27340 leveldb.cpp:271] Iterated through 0 > keys in the db in 257ns > 14:05:48 I0306 14:05:48.718966 27340 replica.cpp:776] Replica recovered with > log positions 0 -> 0 with 1 holes and 0 unlearned > 14:05:48 I0306 14:05:48.719113 27361 recover.cpp:451] Starting replica > recovery > 14:05:48 I0306 14:05:48.719172 27361 recover.cpp:477] Replica is in EMPTY > status > 14:05:48 I0306 14:05:48.719460 27361 replica.cpp:673] Replica in EMPTY > status received a broadcasted recover request from > __req_res__(6807)@10.179.217.143:53643 > 14:05:48 I0306 14:05:48.719537 27363 recover.cpp:197] Received a recover > response from a replica in EMPTY status > 14:05:48 I0306 14:05:48.719625 27365 recover.cpp:568] Updating replica > status to STARTING > 14:05:48 I0306 14:05:48.720384 27361 master.cpp:380] Master > cb9586dc-a080-41eb-b5b8-88274f84a20a (ip-10-179-217-143.ec2.internal) started > on 10.179.217.143:53643 > 14:05:48 I0306 14:05:48.720404 27361 master.cpp:382] Flags at startup: > --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/tzyTvK/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" > --registry_max_agent_count="102400" --registry_store_timeout="100secs" > --registry_strict="false" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > 
--work_dir="/tmp/tzyTvK/master" --zk_session_timeout="10secs" > 14:05:48 I0306 14:05:48.720553 27361 master.cpp:432] Master only allowing > authenticated frameworks to register > 14:05:48 I0306 14:05:48.720559 27361 master.cpp:446] Master only allowing > authenticated agents to register > 14:05:48 I0306 14:05:48.720562 27361 master.cpp:459] Master only allowing > authenticated HTTP frameworks to register > 14:05:48 I0306 14:05:48.720566 27361 credentials.hpp:37] Loading credentials > for authentication from '/tmp/tzyTvK/credentials' > 14:05:48 I0306 14:05:48.720655 27361 master.cpp:504] Using default 'crammd5' > authenticator > 14:05:48 I0306 14:05:48.720700 27361 http.cpp:887] Using default 'basic' > HTTP authenticator for realm 'mesos-master-readonly' > 14:05:48 I0306 14:05:48.720767 27361 http.cpp:887] Using default 'basic' > HTTP authenticator for realm 'mesos-master-readwrite' > 14:05:48 I0306 14:05:48.720808 27361 http.cpp:887] Using default 'basic' > HTTP authenticator for realm 'mesos-master-scheduler' > 14:05:48 I0306 14:05:48.720875 27361 master.cpp:584] Authorization enabled > 14:05:48 I0306 14:05:48.720995 27360 whitelist_watcher.cpp:77] No whitelist > given
[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.
[ https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716992#comment-16716992 ] Benno Evers commented on MESOS-8096: Observed the same today in `MesosContainerizer/DefaultExecutorTest.ROOT_ContainerStatusForTask/0`: {noformat} [ RUN ] MesosContainerizer/DefaultExecutorTest.ROOT_ContainerStatusForTask/0 [...] I1210 18:51:52.317384 2570 default_executor.cpp:1126] Killing task 2506c623-0270-4126-aa0c-8eeda080e50d running in child container a1b3cf45-7361-484f-8095-4ae69dd5e777.17e50b81-46a7-4225-9c33-a0bf024618ec with SIGTERM signal I1210 18:51:52.317389 2570 default_executor.cpp:1137] Scheduling escalation to SIGKILL in 3secs from now I1210 18:51:52.317608 2570 default_executor.cpp:1126] Killing task 40e69403-db71-4902-af53-746d445a7489 running in child container a1b3cf45-7361-484f-8095-4ae69dd5e777.c544a951-c629-492a-bc09-b1a6c72740e2 with SIGTERM signal I1210 18:51:52.317620 2570 default_executor.cpp:1137] Scheduling escalation to SIGKILL in 3secs from now I1210 18:51:52.318428 15462 process.cpp:3588] Handling HTTP event for process 'slave(1107)' with path: '/slave(1107)/api/v1' I1210 18:51:52.318593 15461 process.cpp:3588] Handling HTTP event for process 'slave(1107)' with path: '/slave(1107)/api/v1' *** Aborted at 1544467912 (unix time) try "date -d @1544467912" if you are using GNU date *** I1210 18:51:52.319488 15461 http.cpp:1157] HTTP POST for /slave(1107)/api/v1 from 172.16.10.38:60672 I1210 18:51:52.319586 15461 http.cpp:1157] HTTP POST for /slave(1107)/api/v1 from 172.16.10.38:60673 I1210 18:51:52.319697 15461 http.cpp:2797] Processing KILL_NESTED_CONTAINER call for container 'a1b3cf45-7361-484f-8095-4ae69dd5e777.17e50b81-46a7-4225-9c33-a0bf024618ec' I1210 18:51:52.319808 15461 http.cpp:2797] Processing KILL_NESTED_CONTAINER call for container 'a1b3cf45-7361-484f-8095-4ae69dd5e777.c544a951-c629-492a-bc09-b1a6c72740e2' I1210 18:51:52.319927 15461 containerizer.cpp:2839] Sending Terminated to container a1b3cf45-7361-484f-8095-4ae69dd5e777.17e50b81-46a7-4225-9c33-a0bf024618ec in RUNNING state I1210 18:51:52.320010 15460 containerizer.cpp:2839] Sending Terminated to container a1b3cf45-7361-484f-8095-4ae69dd5e777.c544a951-c629-492a-bc09-b1a6c72740e2 in RUNNING state PC: @ 0x7fd51d72d013 mesos::v1::scheduler::Mesos::send() *** SIGSEGV (@0x0) received by PID 23718 (TID 0x7fd50f38b700) from PID 0; stack trace: *** @ 0x7fd4e614aabc (unknown) @ 0x7fd4e614f751 (unknown) @ 0x7fd4e6142f58 (unknown) @ 0x7fd51a3ae890 (unknown) @ 0x7fd51d72d013 mesos::v1::scheduler::Mesos::send() @ 0x558cee3c1808 _ZNK5mesos8internal5tests2v19scheduler23SendAcknowledgeActionP2INS_2v111FrameworkIDENS5_7AgentIDEE10gmock_ImplIFvPNS5_9scheduler5MesosERKNSA_12Event_UpdateEEE17gmock_PerformImplISC_SF_N7testing8internal12ExcessiveArgESL_SL_SL_SL_SL_SL_SL_EEvRKSt5tupleIJSC_SF_EET_T0_T1_T2_T3_T4_T5_T6_T7_T8_ @ 0x558cee3c1990 _ZN5mesos8internal5tests2v19scheduler23SendAcknowledgeActionP2INS_2v111FrameworkIDENS5_7AgentIDEE10gmock_ImplIFvPNS5_9scheduler5MesosERKNSA_12Event_UpdateEEE7PerformERKSt5tupleIJSC_SF_EE @ 0x558cee2c430f _ZN7testing8internal12DoBothActionI17PromiseArgActionPILi1EPN7process7PromiseIN5mesos2v19scheduler12Event_UpdateNS5_8internal5tests2v19scheduler23SendAcknowledgeActionP2INS6_11FrameworkIDENS6_7AgentID4ImplIFvPNS7_5MesosERKS8_EE7PerformERKSt5tupleIJSN_SP_EE @ 0x558cee2e9f57 testing::internal::FunctionMockerBase<>::UntypedPerformAction() @ 0x558cef7b184f testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith() @ 
0x558cee3d075d mesos::internal::tests::scheduler::MockHTTPScheduler<>::events() @ 0x558cee34cda0 std::_Function_handler<>::_M_invoke() @ 0x7fd51d731098 process::AsyncExecutorProcess::execute<>() @ 0x7fd51d74061b _ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvRKSt5queueIN5mesos2v19scheduler5EventESt5dequeISA_SaISA_ESE_SK_RSE_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSQ_FSN_T1_T2_EOT3_OT4_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteIS14_EEOSI_OSE_PNS1_11ProcessBaseEE_IS17_SI_SE_S1B_EEEDTclcl7forwardISN_Efp_Espcl7forwardIT0_Efp0_EEEOSN_DpOS1D_ @ 0x7fd51e5205d1 process::ProcessBase::consume() @ 0x7fd51e537543 process::ProcessManager::resume() @ 0x7fd51e53d116 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv @ 0x7fd51ab89970 (unknown) @ 0x7fd51a3a7064 start_thread @ 0x7fd51a0dc62d (unknown) E1210 18:51:52.501421 2574 default_executor.cpp:801] Connection for waiting on child container a1b3cf45-7361-484f-8095-4ae69dd5e777.17e50b81-46a7-4225-9c33-a0bf024618ec of task '2506c623-0270-4126-aa0c-8eeda080e50d' interrupted: Disconnected {noformat} > Enqueueing events in MockH
[jira] [Created] (MESOS-9453) Libprocess does not handle "identity" encoding rules
Benno Evers created MESOS-9453: -- Summary: Libprocess does not handle "identity" encoding rules Key: MESOS-9453 URL: https://issues.apache.org/jira/browse/MESOS-9453 Project: Mesos Issue Type: Bug Reporter: Benno Evers Both [RFC 7231|https://tools.ietf.org/html/rfc7231#section-5.3.4] and the relevant [libprocess comment|https://github.com/apache/mesos/blob/dad74012fa02a7fbf61b09968d9b7e9c730b1c97/3rdparty/libprocess/src/http.cpp#L315-L325] mention special handling of the "identity" encoding. However, this is currently ignored in Mesos, which can lead to incorrect behaviour in combination with MESOS-9451:
{noformat}
$ nc localhost 5050
GET /tasks HTTP/1.1
Accept-Encoding: gzip, identity;q=0

HTTP/1.1 200 OK
Date: Wed, 05 Dec 2018 11:02:24 GMT
Content-Type: application/json
Content-Length: 12

{"tasks":[]}
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
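For illustration, here is a rough sketch of the q-value parsing that RFC 7231-compliant handling would require (a hypothetical standalone helper, not the libprocess implementation): an `identity;q=0` entry must make an uncompressed response unacceptable, no matter how small the body is.
{noformat}
#include <map>
#include <sstream>
#include <string>

// Parse an Accept-Encoding header such as "gzip, identity;q=0" into a
// map from content coding to its quality value (1.0 when unspecified).
std::map<std::string, double> parseAcceptEncoding(const std::string& header)
{
  std::map<std::string, double> codings;

  std::istringstream stream(header);
  std::string entry;

  while (std::getline(stream, entry, ',')) {
    double q = 1.0;

    size_t semicolon = entry.find(';');
    std::string name = entry.substr(0, semicolon);

    if (semicolon != std::string::npos) {
      size_t eq = entry.find('=', semicolon);
      if (eq != std::string::npos) {
        q = std::stod(entry.substr(eq + 1));
      }
    }

    // Trim the surrounding whitespace from the coding name.
    size_t begin = name.find_first_not_of(' ');
    size_t end = name.find_last_not_of(' ');
    if (begin != std::string::npos) {
      codings[name.substr(begin, end - begin + 1)] = q;
    }
  }

  return codings;
}
{noformat}
With such a map in hand, a server receiving "gzip, identity;q=0" would have to either gzip the response or answer 406 Not Acceptable; silently falling back to identity, as shown above, violates the RFC.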
[jira] [Commented] (MESOS-9451) Libprocess endpoints can ignore required gzip compression
[ https://issues.apache.org/jira/browse/MESOS-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709943#comment-16709943 ] Benno Evers commented on MESOS-9451: Good point, I've opened MESOS-9453 for our lack of handling of the "identity" encoding. > Libprocess endpoints can ignore required gzip compression > - > > Key: MESOS-9451 > URL: https://issues.apache.org/jira/browse/MESOS-9451 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: libprocess > > Currently, libprocess decides whether a response should be compressed by the > following conditional: > {noformat} > if (response.type == http::Response::BODY && > response.body.length() >= GZIP_MINIMUM_BODY_LENGTH && > !headers.contains("Content-Encoding") && > request.acceptsEncoding("gzip")) { > [...] > {noformat} > However, this implies that a request sent with the header "Accept-Encoding: > gzip" can not rely on actually getting a gzipped response, e.g. when the > response size is below the threshold: > {noformat} > $ nc localhost 5050 > GET /tasks HTTP/1.1 > Accept-Encoding: gzip > HTTP/1.1 200 OK > Date: Tue, 04 Dec 2018 12:49:56 GMT > Content-Type: application/json > Content-Length: 12 > {"tasks":[]} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9451) Libprocess endpoints can ignore required gzip compression
Benno Evers created MESOS-9451: -- Summary: Libprocess endpoints can ignore required gzip compression Key: MESOS-9451 URL: https://issues.apache.org/jira/browse/MESOS-9451 Project: Mesos Issue Type: Bug Reporter: Benno Evers Currently, libprocess decides whether a response should be compressed by the following conditional:
{noformat}
if (response.type == http::Response::BODY &&
    response.body.length() >= GZIP_MINIMUM_BODY_LENGTH &&
    !headers.contains("Content-Encoding") &&
    request.acceptsEncoding("gzip")) {
  [...]
{noformat}
However, this implies that a request sent with the header "Accept-Encoding: gzip" cannot rely on actually getting a gzipped response, e.g. when the response size is below the threshold:
{noformat}
$ nc localhost 5050
GET /tasks HTTP/1.1
Accept-Encoding: gzip

HTTP/1.1 200 OK
Date: Tue, 04 Dec 2018 12:49:56 GMT
Content-Type: application/json
Content-Length: 12

{"tasks":[]}
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
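For illustration, one possible tightening of the conditional (a sketch only, not a committed fix; `clientRequiresGzip` is a hypothetical flag that would come from q-value parsing as described in MESOS-9453):
{noformat}
// Keep the size threshold only as an optimization for clients that merely
// tolerate gzip; compress unconditionally for clients that require it.
const bool clientRequiresGzip = false;  // e.g. derived from "identity;q=0"

if (response.type == http::Response::BODY &&
    !headers.contains("Content-Encoding") &&
    request.acceptsEncoding("gzip") &&
    (clientRequiresGzip ||
     response.body.length() >= GZIP_MINIMUM_BODY_LENGTH)) {
  [...]
{noformat}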
[jira] [Assigned] (MESOS-8045) Update Mesos executables output if there is a typo
[ https://issues.apache.org/jira/browse/MESOS-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-8045: -- Resolution: Fixed Assignee: Benno Evers This is resolved by MESOS-8728; now we only print the full help string when the "--help" option is specified. > Update Mesos executables output if there is a typo > -- > > Key: MESOS-8045 > URL: https://issues.apache.org/jira/browse/MESOS-8045 > Project: Mesos > Issue Type: Improvement >Reporter: Armand Grillet >Assignee: Benno Evers >Priority: Minor > > Current output if a user makes a typo while using one of the Mesos > executables: > {code} > build (master) $ ./bin/mesos-master.sh --ip=127.0.0.1 --workdir=/tmp > Failed to load unknown flag 'workdir' > Usage: mesos-master [options] > --acls=VALUE >The value could be a JSON-formatted string of ACLs > >or a file path containing the JSON-formatted ACLs used > >for authorization. Path could be of the form `file:///path/to/file` > >or `/path/to/file`. > >Note that if the flag `--authorizers` is provided with a value > >different than `local`, the ACLs contents > >will be ignored. > >See the ACLs protobuf in acls.proto for the expected format. > >Example: > >{ > > "register_frameworks": [ > >{ > > "principals": { "type": "ANY" }, > > "roles": { "values": ["a"] } > >} > > ], > > "run_tasks": [ > >{ > > "principals": { "values": ["a", "b"] }, > > "users": { "values": ["c"] } > >} > > ], > > "teardown_frameworks": [ > >{ > > "principals": { "values": ["a", "b"] }, > > "framework_principals": { "values": ["c"] } > >} > > ], > > "set_quotas": [ > >{ > > "principals": { "values": ["a"] }, > > "roles": { "values": ["a", "b"] } > >} > > ], > > "remove_quotas": [ > >{ >
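For illustration, the resolved behaviour could look roughly like this (a hypothetical snippet in the style of the stout flags API, not the exact code that landed): on an unknown flag, print a one-line error plus a pointer to --help instead of the multi-page usage text.
{noformat}
#include <cstdlib>
#include <iostream>

#include <stout/flags.hpp>
#include <stout/try.hpp>

int main(int argc, char** argv)
{
  flags::FlagsBase flags;  // stands in for the real mesos-master flags

  Try<flags::Warnings> load = flags.load("MESOS_", argc, argv);

  if (load.isError()) {
    // Short error plus a hint, rather than the full help string.
    std::cerr << "Failed to load flags: " << load.error() << std::endl
              << "See 'mesos-master --help' for the list of options."
              << std::endl;
    return EXIT_FAILURE;
  }

  return EXIT_SUCCESS;
}
{noformat}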
[jira] [Commented] (MESOS-9022) Race condition in task updates could cause missing event in streaming
[ https://issues.apache.org/jira/browse/MESOS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707413#comment-16707413 ] Benno Evers commented on MESOS-9022: Confirmed, this is caused by the same underlying problem as MESOS-9000 and should be solved by https://reviews.apache.org/r/67575/ . > Race condition in task updates could cause missing event in streaming > - > > Key: MESOS-9022 > URL: https://issues.apache.org/jira/browse/MESOS-9022 > Project: Mesos > Issue Type: Bug > Components: HTTP API, master >Affects Versions: 1.6.0 >Reporter: Evelyn Liu >Assignee: Benno Evers >Priority: Blocker > Labels: events, foundations, mesos, mesosphere, race-condition, > streaming > > Master sends update event of {{TASK_STARTING}} when task's latest state is > already {{TASK_FAILED}}. Then when it handles the update of {{TASK_FAILED}}, > {{sendSubscribersUpdate}} is set to {{false}} because of > [this|https://github.com/apache/mesos/blob/1.6.x/src/master/master.cpp#L10805]. > The subscriber would not receive update event of {{TASK_FAILED}}. > This happened when a task failed very fast. Is there a race condition while > handling task updates? > {{*master log:*}} > {code:java} > I0622 13:08:29.189771 84079 master.cpp:8345] Status update TASK_STARTING > (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.189801 84079 master.cpp:8402] Forwarding status update > TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.190004 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_STARTING, > status update state: TASK_STARTING) > I0622 13:08:29.603857 84079 master.cpp:6195] Processing ACKNOWLEDGE call for > status eb091093-d303-4e82-b69f-e2ba1011ba76 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.615643 84079 master.cpp:8345] Status update TASK_STARTING > (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.615669 84079 master.cpp:8402] Forwarding status update > TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.615783 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status > update state: TASK_STARTING) > I0622 13:08:29.620837 84079 master.cpp:8345] Status update TASK_FAILED > (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.620853 84079 master.cpp:8402] Forwarding status update > TASK_FAILED (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 
4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.620923 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status > update state: TASK_FAILED) > I0622 13:08:29.630455 84079 master.cpp:6195] Processing ACKNOWLEDGE call for > status eb091093-d303-4e82-b69f-e2ba1011ba76 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.673051 84095 master.cpp:6195] Processing ACKNOWLEDGE call for > status ac34f1e9-eaa4-4765-82ac-7398c2e6c835 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9272) SlaveTest.DefaultExecutorCommandInfo is flaky
[ https://issues.apache.org/jira/browse/MESOS-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696221#comment-16696221 ] Benno Evers commented on MESOS-9272: https://reviews.apache.org/r/69436 > SlaveTest.DefaultExecutorCommandInfo is flaky > - > > Key: MESOS-9272 > URL: https://issues.apache.org/jira/browse/MESOS-9272 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > > Observed in an internal CI run (4499): > {noformat} > ../../src/tests/cluster.cpp:697 > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { 743f1b4c-8ce0-4fd4-b952-a7bbc9788775 } > {noformat} > Full log: > {noformat} > [ RUN ] SlaveTest.DefaultExecutorCommandInfo > I0927 01:48:44.246218 11015 cluster.cpp:173] Creating default 'local' > authorizer > I0927 01:48:44.247200 11037 master.cpp:413] Master > 56a99d2f-f8c8-4d21-a8f7-df452833cce0 (ip-172-16-10-254.ec2.internal) started > on 172.16.10.254:33398 > I0927 01:48:44.247223 11037 master.cpp:416] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="hierarchical" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/7SQ2cR/credentials" --filter_gpu_resources="true" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" > --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/7SQ2cR/master" --zk_session_timeout="10secs" > I0927 01:48:44.247354 11037 master.cpp:465] Master only allowing > authenticated frameworks to register > I0927 01:48:44.247364 11037 master.cpp:471] Master only allowing > authenticated agents to register > I0927 01:48:44.247370 11037 master.cpp:477] Master only allowing > authenticated HTTP frameworks to register > I0927 01:48:44.247375 11037 credentials.hpp:37] Loading credentials for > authentication from '/tmp/7SQ2cR/credentials' > I0927 01:48:44.247453 11037 master.cpp:521] Using default 'crammd5' > authenticator > I0927 01:48:44.247488 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0927 01:48:44.247519 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0927 01:48:44.247541 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0927 01:48:44.247668 11037 master.cpp:602] Authorization enabled > I0927 01:48:44.247741 
11036 hierarchical.cpp:182] Initialized hierarchical > allocator process > I0927 01:48:44.247782 11036 whitelist_watcher.cpp:77] No whitelist given > I0927 01:48:44.248339 11036 master.cpp:2083] Elected as the leading master! > I0927 01:48:44.248358 11036 master.cpp:1638] Recovering from registrar > I0927 01:48:44.248430 11036 registrar.cpp:339] Recovering registrar > I0927 01:48:44.248623 11037 registrar.cpp:383] Successfully fetched the > registry (0B) in 168960ns > I0927 01:48:44.248658 11037 registrar.cpp:487] Applied 1 operations in > 6362ns; attempting to update the registry > I0927 01:48:44.248767 11037 registrar.cpp:544] Successfully updated the > registry in 94208ns > I0927 01:48:44.248795 11037 registrar.cpp:416] Successfully recovered > registrar > I0927 01:48:44.248880 11036 hierarchical.cpp:220] Skipping recovery of > hierarchical allocator: nothing to recover > I0927 01:48:44.248901 11037 master.cpp:1752] Recovered 0 agents from the > registry (176B); allowing 10mins for agents to reregister > W0927 01:48:44.250870 11015 process.cpp:2810] Attempted to spawn already > running process files@172.16.10.254:33398 > I0927 01:48:44.251050 11015 cluster.cpp:485] Creating default
[jira] [Assigned] (MESOS-9272) SlaveTest.DefaultExecutorCommandInfo is flaky
[ https://issues.apache.org/jira/browse/MESOS-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-9272: -- Assignee: Benno Evers > SlaveTest.DefaultExecutorCommandInfo is flaky > - > > Key: MESOS-9272 > URL: https://issues.apache.org/jira/browse/MESOS-9272 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > > Observed in an internal CI run (4499): > {noformat} > ../../src/tests/cluster.cpp:697 > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { 743f1b4c-8ce0-4fd4-b952-a7bbc9788775 } > {noformat} > Full log: > {noformat} > [ RUN ] SlaveTest.DefaultExecutorCommandInfo > I0927 01:48:44.246218 11015 cluster.cpp:173] Creating default 'local' > authorizer > I0927 01:48:44.247200 11037 master.cpp:413] Master > 56a99d2f-f8c8-4d21-a8f7-df452833cce0 (ip-172-16-10-254.ec2.internal) started > on 172.16.10.254:33398 > I0927 01:48:44.247223 11037 master.cpp:416] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="hierarchical" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/7SQ2cR/credentials" --filter_gpu_resources="true" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" > --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/7SQ2cR/master" --zk_session_timeout="10secs" > I0927 01:48:44.247354 11037 master.cpp:465] Master only allowing > authenticated frameworks to register > I0927 01:48:44.247364 11037 master.cpp:471] Master only allowing > authenticated agents to register > I0927 01:48:44.247370 11037 master.cpp:477] Master only allowing > authenticated HTTP frameworks to register > I0927 01:48:44.247375 11037 credentials.hpp:37] Loading credentials for > authentication from '/tmp/7SQ2cR/credentials' > I0927 01:48:44.247453 11037 master.cpp:521] Using default 'crammd5' > authenticator > I0927 01:48:44.247488 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0927 01:48:44.247519 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0927 01:48:44.247541 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0927 01:48:44.247668 11037 master.cpp:602] Authorization enabled > I0927 01:48:44.247741 11036 hierarchical.cpp:182] Initialized hierarchical > 
allocator process > I0927 01:48:44.247782 11036 whitelist_watcher.cpp:77] No whitelist given > I0927 01:48:44.248339 11036 master.cpp:2083] Elected as the leading master! > I0927 01:48:44.248358 11036 master.cpp:1638] Recovering from registrar > I0927 01:48:44.248430 11036 registrar.cpp:339] Recovering registrar > I0927 01:48:44.248623 11037 registrar.cpp:383] Successfully fetched the > registry (0B) in 168960ns > I0927 01:48:44.248658 11037 registrar.cpp:487] Applied 1 operations in > 6362ns; attempting to update the registry > I0927 01:48:44.248767 11037 registrar.cpp:544] Successfully updated the > registry in 94208ns > I0927 01:48:44.248795 11037 registrar.cpp:416] Successfully recovered > registrar > I0927 01:48:44.248880 11036 hierarchical.cpp:220] Skipping recovery of > hierarchical allocator: nothing to recover > I0927 01:48:44.248901 11037 master.cpp:1752] Recovered 0 agents from the > registry (176B); allowing 10mins for agents to reregister > W0927 01:48:44.250870 11015 process.cpp:2810] Attempted to spawn already > running process files@172.16.10.254:33398 > I0927 01:48:44.251050 11015 cluster.cpp:485] Creating default 'local' > authorizer > I0927 01:48:44.251428 11035 slave.c
[jira] [Commented] (MESOS-9272) SlaveTest.DefaultExecutorCommandInfo is flaky
[ https://issues.apache.org/jira/browse/MESOS-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696219#comment-16696219 ] Benno Evers commented on MESOS-9272: Caused by: https://issues.apache.org/jira/browse/MESOS-9413 > SlaveTest.DefaultExecutorCommandInfo is flaky > - > > Key: MESOS-9272 > URL: https://issues.apache.org/jira/browse/MESOS-9272 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > > Observed in an internal CI run (4499): > {noformat} > ../../src/tests/cluster.cpp:697 > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { 743f1b4c-8ce0-4fd4-b952-a7bbc9788775 } > {noformat} > Full log: > {noformat} > [ RUN ] SlaveTest.DefaultExecutorCommandInfo > I0927 01:48:44.246218 11015 cluster.cpp:173] Creating default 'local' > authorizer > I0927 01:48:44.247200 11037 master.cpp:413] Master > 56a99d2f-f8c8-4d21-a8f7-df452833cce0 (ip-172-16-10-254.ec2.internal) started > on 172.16.10.254:33398 > I0927 01:48:44.247223 11037 master.cpp:416] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="hierarchical" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/7SQ2cR/credentials" --filter_gpu_resources="true" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" > --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/7SQ2cR/master" --zk_session_timeout="10secs" > I0927 01:48:44.247354 11037 master.cpp:465] Master only allowing > authenticated frameworks to register > I0927 01:48:44.247364 11037 master.cpp:471] Master only allowing > authenticated agents to register > I0927 01:48:44.247370 11037 master.cpp:477] Master only allowing > authenticated HTTP frameworks to register > I0927 01:48:44.247375 11037 credentials.hpp:37] Loading credentials for > authentication from '/tmp/7SQ2cR/credentials' > I0927 01:48:44.247453 11037 master.cpp:521] Using default 'crammd5' > authenticator > I0927 01:48:44.247488 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0927 01:48:44.247519 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0927 01:48:44.247541 11037 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0927 01:48:44.247668 11037 master.cpp:602] Authorization enabled 
> I0927 01:48:44.247741 11036 hierarchical.cpp:182] Initialized hierarchical > allocator process > I0927 01:48:44.247782 11036 whitelist_watcher.cpp:77] No whitelist given > I0927 01:48:44.248339 11036 master.cpp:2083] Elected as the leading master! > I0927 01:48:44.248358 11036 master.cpp:1638] Recovering from registrar > I0927 01:48:44.248430 11036 registrar.cpp:339] Recovering registrar > I0927 01:48:44.248623 11037 registrar.cpp:383] Successfully fetched the > registry (0B) in 168960ns > I0927 01:48:44.248658 11037 registrar.cpp:487] Applied 1 operations in > 6362ns; attempting to update the registry > I0927 01:48:44.248767 11037 registrar.cpp:544] Successfully updated the > registry in 94208ns > I0927 01:48:44.248795 11037 registrar.cpp:416] Successfully recovered > registrar > I0927 01:48:44.248880 11036 hierarchical.cpp:220] Skipping recovery of > hierarchical allocator: nothing to recover > I0927 01:48:44.248901 11037 master.cpp:1752] Recovered 0 agents from the > registry (176B); allowing 10mins for agents to reregister > W0927 01:48:44.250870 11015 process.cpp:2810] Attempted to spawn already > running process files@172.16.10.254:33398 > I0927 01:48:44.251050 11015 cluster.
[jira] [Created] (MESOS-9413) Composing containerizer has no way to wait for container removal
Benno Evers created MESOS-9413: -- Summary: Composing containerizer has no way to wait for container removal Key: MESOS-9413 URL: https://issues.apache.org/jira/browse/MESOS-9413 Project: Mesos Issue Type: Bug Reporter: Benno Evers Inside the composing containerizer, destruction is ultimately implemented like this:
{noformat}
return container->containerizer->destroy(containerId)
  .onAny(defer(self(), [=](const Future<Option<ContainerTermination>>&) {
    if (containers_.contains(containerId)) {
      delete containers_.at(containerId);
      containers_.erase(containerId);
    }
  }));
{noformat}
This means that code trying to ensure that every container is killed like this
{noformat}
foreach (const ContainerID& containerId, containers.get()) {
  process::Future<Option<ContainerTermination>> termination =
    containerizer->destroy(containerId);

  AWAIT(termination);
}

ASSERT_TRUE(containerizer->empty());
{noformat}
is inherently racy, because the call to `empty()` might happen before the removal that gets deferred in the `.onAny()`-callback is executed.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
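One conceivable direction for closing the race (a sketch only, under the assumption that callers of `destroy()` should observe the map cleanup; this is not the fix that eventually landed):
{noformat}
// Sketch: complete the returned future only after the container has been
// erased, so a caller that awaited destroy() cannot still see the entry.
// NOTE: unlike .onAny(), .then() does not run for failed or discarded
// futures, so a complete fix would also need to handle the error path.
return container->containerizer->destroy(containerId)
  .then(defer(self(), [=](const Option<ContainerTermination>& termination)
        -> Future<Option<ContainerTermination>> {
    if (containers_.contains(containerId)) {
      delete containers_.at(containerId);
      containers_.erase(containerId);
    }
    return termination;
  }));
{noformat}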
[jira] [Created] (MESOS-9400) Allow application to learn the number of libprocess worker threads.
Benno Evers created MESOS-9400: -- Summary: Allow application to learn the number of libprocess worker threads. Key: MESOS-9400 URL: https://issues.apache.org/jira/browse/MESOS-9400 Project: Mesos Issue Type: Improvement Reporter: Benno Evers The number of worker threads used by libprocess usually depends on the number of CPU cores on the machine, but can be overridden using the environment variable `LIBPROCESS_NUM_WORKER_THREADS`. However, as far as I could tell, there is currently no way for applications using libprocess to learn the current number of worker threads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
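In the meantime, a sketch of how an application might approximate the pool size from the outside (an assumption on my part: this mirrors the documented sizing logic, but the exact minimum and clamping inside libprocess may differ):
{noformat}
#include <algorithm>
#include <cstdlib>
#include <string>
#include <thread>

// Approximate libprocess's worker pool size: honour
// LIBPROCESS_NUM_WORKER_THREADS if set, otherwise fall back
// to the hardware concurrency with a small floor.
static unsigned int estimatedWorkerThreads()
{
  if (const char* env = std::getenv("LIBPROCESS_NUM_WORKER_THREADS")) {
    return static_cast<unsigned int>(std::stoul(env));
  }
  return std::max(8u, std::thread::hardware_concurrency());
}
{noformat}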
[jira] [Created] (MESOS-9391) Parallel test runner can exhaust system resources in combination with libtool wrappers
Benno Evers created MESOS-9391: -- Summary: Parallel test runner can exhaust system resources in combination with libtool wrappers Key: MESOS-9391 URL: https://issues.apache.org/jira/browse/MESOS-9391 Project: Mesos Issue Type: Bug Reporter: Benno Evers Using the default autotools build currently enables both the parallel test runner (--enable-parallel-test-execution) and the use of libtool wrapper scripts (--enable-libtool-wrappers). These have an unfortunate interaction: the wrapper scripts actually invoke the linker on their first invocation, and the parallel test runner starts `nproc` tests in parallel, leading to that many concurrent invocations of the linker for a huge link, which can completely exhaust the available resources on the host machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
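Until this is fixed, one possible workaround (an assumption on my part, not an officially documented recommendation) is to disable one of the two features at configure time:
{noformat}
../configure --enable-parallel-test-execution --disable-libtool-wrappers
{noformat}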
[jira] [Created] (MESOS-9390) Warnings in AdaptedOperation prevent clang build
Benno Evers created MESOS-9390: -- Summary: Warnings in AdaptedOperation prevent clang build Key: MESOS-9390 URL: https://issues.apache.org/jira/browse/MESOS-9390 Project: Mesos Issue Type: Bug Environment: Fedora 28 Reporter: Benno Evers Trying to build the latest Mesos master using clang-8 as a compiler, the following warnings can be observed:
{noformat}
../../src/resource_provider/registrar.cpp:387:5: error: explicitly defaulted move constructor is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
    AdaptedOperation(AdaptedOperation&&) = default;
    ^
../../src/resource_provider/registrar.cpp:374:28: note: move constructor of 'AdaptedOperation' is implicitly deleted because base class 'master::RegistryOperation' has a deleted move constructor
class AdaptedOperation : public master::RegistryOperation
                         ^
../../src/master/registrar.hpp:45:27: note: copy constructor of 'RegistryOperation' is implicitly deleted because base class 'process::Promise<bool>' has an inaccessible copy constructor
class RegistryOperation : public process::Promise<bool>
                          ^
../../src/resource_provider/registrar.cpp:389:23: error: explicitly defaulted move assignment operator is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
    AdaptedOperation& operator=(AdaptedOperation&&) = default;
                      ^
../../src/resource_provider/registrar.cpp:374:28: note: move assignment operator of 'AdaptedOperation' is implicitly deleted because base class 'master::RegistryOperation' has a deleted move assignment operator
class AdaptedOperation : public master::RegistryOperation
                         ^
../../src/master/registrar.hpp:45:27: note: copy assignment operator of 'RegistryOperation' is implicitly deleted because base class 'process::Promise<bool>' has an inaccessible copy assignment operator
class RegistryOperation : public process::Promise<bool>
                          ^
2 errors generated.
{noformat}
I tried looking into this, but I can't make sense of the warnings: the required move constructor and move assignment operator seem to be correctly declared in `Promise`:
{noformat}
// 3rdparty/libprocess/include/process/future.hpp
template <typename T>
class Promise
{
public:
  Promise();
  virtual ~Promise();
  explicit Promise(const T& t);

  Promise(Promise&& that) = default;
  Promise& operator=(Promise&&) = default;

  [...]
};
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
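For what it's worth, the mechanism clang is complaining about can be reproduced in a few standalone lines (a hypothetical example, not code from the Mesos tree):
{noformat}
// A defaulted move operation is *defined as deleted* when a base class
// cannot be moved; clang 8 diagnoses this under -Wdefaulted-function-deleted.
struct NoMove
{
  NoMove() = default;
  NoMove(const NoMove&) = delete;             // deleting the copy ctor also
  NoMove& operator=(const NoMove&) = delete;  // suppresses the implicit move
};

struct Base : NoMove {};  // Base's implicit move constructor is deleted too.

struct Derived : Base
{
  Derived() = default;
  Derived(Derived&&) = default;             // warning: implicitly deleted
  Derived& operator=(Derived&&) = default;  // warning: implicitly deleted
};
{noformat}
clang 8 emits the same pair of diagnostics for `Derived`, which would point at the `RegistryOperation`/`Promise` hierarchy rather than at `AdaptedOperation` itself.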
[jira] [Created] (MESOS-9389) Cannot build python support using clang 8
Benno Evers created MESOS-9389: -- Summary: Cannot build python support using clang 8 Key: MESOS-9389 URL: https://issues.apache.org/jira/browse/MESOS-9389 Project: Mesos Issue Type: Bug Environment: Fedora 28 w/ autotools build and clang Reporter: Benno Evers Trying to compile the latest Mesos master with Python support enabled on a Fedora 28 machine leads to the following configuration error:
{noformat}
$ ../configure CC=clang CXX=clang++
[...]
checking whether we can build usable Python eggs...
clang-8: error: unknown argument: '-fstack-clash-protection'
clang-8: error: unknown argument: '-fstack-clash-protection'
error: command 'clang' failed with exit status 1
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9224) De-duplicate read-only requests to master based on principal.
[ https://issues.apache.org/jira/browse/MESOS-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686836#comment-16686836 ] Benno Evers commented on MESOS-9224: After discussions with Alex and Greg, we failed to identify a way to deterministically trigger the batching functionality without either introducing some inherent test flakiness or major modifications of both Mesos and testing code. The main problems I was running into:

- A correctly working cache should, ideally, be undetectable from the outside, so there's the question of how to verify that the test code actually was hitting the cache. We thought about introducing new endpoints dynamically that just count how often they've been accessed, but it does not currently seem possible to introduce new routes or replace existing ones at runtime. Additionally, this has the problem that the dynamically introduced routes would not be cached.

- The routines used to implement the de-duplication are currently all private. We can introduce public getters and setters or just directly open up master internals for use in tests, but that seems like a code smell. It's also hard to use `protected` here, because instantiating a new master instance is a messy process requiring lots of support code, all of which would need to be duplicated to use a subclass of the Mesos master.

- Ideally, we should use the actual HTTP pipeline used by Mesos in our unit tests, including libprocess authentication and routing, so even if we could somehow directly access the mesos-master HTTP internals, it's questionable whether we should do it.

I'm currently working on an alternate, slightly probabilistic kind of test that tries to launch many requests at once and verifies that they still return the correct answers. > De-duplicate read-only requests to master based on principal. > - > > Key: MESOS-9224 > URL: https://issues.apache.org/jira/browse/MESOS-9224 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Major > Labels: performance > > "Identical" read-only requests can be batched and answered together. With > batching available (MESOS-9158), we can now deduplicate requests based on > principal. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
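For context, the de-duplication under test can be modelled roughly as follows (a self-contained sketch in which std::shared_future stands in for libprocess futures; all names are hypothetical): concurrent identical read-only requests from the same principal share a single in-flight computation.
{noformat}
#include <functional>
#include <future>
#include <map>
#include <string>
#include <tuple>

// A read-only request is identified by endpoint, canonicalized query
// string, and the authenticated principal ("" when unauthenticated).
using RequestKey = std::tuple<std::string, std::string, std::string>;

std::map<RequestKey, std::shared_future<std::string>> inflight;

// Identical concurrent requests share one shared_future, so the expensive
// handler body (e.g. serializing master state) runs only once per batch.
// NOTE: a real implementation must also erase entries once they complete.
std::shared_future<std::string> deduplicate(
    const RequestKey& key,
    std::function<std::string()> handler)
{
  auto it = inflight.find(key);
  if (it != inflight.end()) {
    return it->second;  // Duplicate: share the in-flight response.
  }

  std::shared_future<std::string> response =
    std::async(std::launch::deferred, std::move(handler)).share();

  inflight[key] = response;
  return response;
}
{noformat}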
[jira] [Commented] (MESOS-9224) De-duplicate read-only requests to master based on principal.
[ https://issues.apache.org/jira/browse/MESOS-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655743#comment-16655743 ] Benno Evers commented on MESOS-9224: A review chain with the required changes can be found at https://reviews.apache.org/r/68131 The one thing that is still missing is a set of unit tests, which is appended to the chain as a wip-commit but proves to be unexpectedly hard due to the fuzzy interface between the HTTP handler and the master, and the caching being deeply buried in the master internals. > De-duplicate read-only requests to master based on principal. > - > > Key: MESOS-9224 > URL: https://issues.apache.org/jira/browse/MESOS-9224 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Major > Labels: performance > > "Identical" read-only requests can be batched and answered together. With > batching available (MESOS-9158), we can now deduplicate requests based on > principal. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9329) CMake build on Fedora 28 fails due to libevent error
Benno Evers created MESOS-9329: -- Summary: CMake build on Fedora 28 fails due to libevent error Key: MESOS-9329 URL: https://issues.apache.org/jira/browse/MESOS-9329 Project: Mesos Issue Type: Bug Reporter: Benno Evers Trying to build Mesos using cmake with the options {noformat} cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_SSL=1 -DENABLE_LIBEVENT=1 {noformat} fails due to the following: {noformat} [ 1%] Building C object CMakeFiles/event_extra.dir/bufferevent_openssl.c.o /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c: In function ‘bio_bufferevent_new’: /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3: error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’} b->init = 0; ^~ /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c: At top level: /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:234:1: error: variable ‘methods_bufferevent’ has initializer but incomplete type static BIO_METHOD methods_bufferevent = { [...] {noformat} Since the autotools build does not have issues when enabling libevent and ssl, it seems most likely that the `libevent-2.1.5-beta` version used by default in the cmake build is somehow connected to the error message. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
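For what it's worth, the error pattern matches OpenSSL 1.1 making `BIO` an opaque type; a minimal illustration of the old versus new style (my reading of the failure, using the OpenSSL >= 1.1 accessor API):
{noformat}
#include <openssl/bio.h>

// Before OpenSSL 1.1, struct bio_st was public and libevent 2.1.5-beta
// pokes at its fields directly; from 1.1 onwards the type is incomplete
// outside the library and the accessor functions must be used instead.
static void mark_uninitialized(BIO* b)
{
  // b->init = 0;      // compiles only against OpenSSL < 1.1
  BIO_set_init(b, 0);  // OpenSSL >= 1.1 accessor
}
{noformat}
Newer libevent releases (2.1.8-stable and later) contain exactly this kind of port to the opaque-struct API, which would explain why bumping the bundled version fixes the cmake build.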
[jira] [Created] (MESOS-9328) Mock slave in mesos tests does not compile using gcc 8
Benno Evers created MESOS-9328: -- Summary: Mock slave in mesos tests does not compile using gcc 8 Key: MESOS-9328 URL: https://issues.apache.org/jira/browse/MESOS-9328 Project: Mesos Issue Type: Bug Reporter: Benno Evers Attempting to compile the mesos tests on a Fedora 28 machine using gcc 8 results in the following error: {noformat} ../../3rdparty/libprocess/include/process/future.hpp: In instantiation of ‘process::Future::Future(const U&) [with U = const testing::MatcherInterface&>&>*; T = Nothing]’: /usr/include/c++/8/type_traits:920:12: required from ‘struct std::is_constructible&, const testing::MatcherInterface&>&>*&>’ /usr/include/c++/8/type_traits:126:12: required from ‘struct std::__and_&, const testing::MatcherInterface&>&>*&> >’ /usr/include/c++/8/tuple:485:68: required from ‘static constexpr bool std::_TC<, _Elements>::_MoveConstructibleTuple() [with _UElements = {const testing::MatcherInterface&>&>*&}; bool = true; _Elements = {const process::Future&}]’ /usr/include/c++/8/tuple:641:59: required by substitution of ‘template&>::_NotSameTuple<_UElements ...>()), const process::Future&>::_MoveConstructibleTuple<_UElements ...>() && std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), const process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements ...>()) && (1 >= 1)), bool>::type > constexpr std::tuple&>::tuple(_UElements&& ...) [with _UElements = {const testing::MatcherInterface&>&>*&}; typename std::enable_if<((std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... (_UElements) == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), const process::Future&>::_MoveConstructibleTuple<_UElements ...>() && std::_TC<((1 == sizeof... (_UElements)) && std::_TC<(sizeof... 
(_UElements) == 1), const process::Future&>::_NotSameTuple<_UElements ...>()), const process::Future&>::_ImplicitlyMoveConvertibleTuple<_UElements ...>()) && (1 >= 1)), bool>::type = 1]’ ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-matchers.h:485:10: required from ‘testing::Matcher testing::MakeMatcher(const testing::MatcherInterface*) [with T = const std::tuple&>&]’ ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-matchers.h:3732:43: required from ‘testing::Matcher testing::A() [with T = const std::tuple&>&]’ ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:893:47: required from ‘testing::internal::TypedExpectation::TypedExpectation(testing::internal::FunctionMockerBase*, const char*, int, const string&, const ArgumentMatcherTuple&) [with F = void(const process::Future&); testing::internal::string = std::__cxx11::basic_string; testing::internal::TypedExpectation::ArgumentMatcherTuple = std::tuple&> >]’ ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9: required from ‘testing::internal::TypedExpectation& testing::internal::FunctionMockerBase::AddNewExpectation(const char*, int, const string&, const ArgumentMatcherTuple&) [with F = void(const process::Future&); testing::internal::string = std::__cxx11::basic_string; testing::internal::FunctionMockerBase::ArgumentMatcherTuple = std::tuple&> >]’ ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1273:43: required from ‘testing::internal::TypedExpectation& testing::internal::MockSpec::InternalExpectedAt(const char*, int, const char*, const char*) [with F = void(const process::Future&)]’ ../../src/tests/mock_slave.cpp:141:3: required from here ../../3rdparty/libprocess/include/process/future.hpp:1092:3: error: no matching function for call to ‘process::Future::set(const testing::MatcherInterface&>&>* const&)’ set(u); ^~~ ../../3rdparty/libprocess/include/process/future.hpp:1761:6: note: candidate: ‘bool process::Future::set(const T&) [with T = Nothing]’ bool Future::set(const T& t) ^ ../../3rdparty/libprocess/include/process/future.hpp:1761:6: note: no known conversion for argument 1 from ‘const testing::MatcherInterface&>&>* const’ to ‘const Nothing&’ ../../3rdparty/libprocess/include/process/future.hpp:1754:6: note: candidate: ‘bool process::Future::set(T&&) [with T = Nothing]’ bool Future::set(T&& t) ^ ../../3rdparty/libprocess/include/process/future.hpp:1754:6: note: no known conversion for argument 1 from ‘const testing::MatcherInterface&>&>* const’ to ‘Nothing&&’ make[1]: *** [Makefile:10735: tests/mesos_tests-mock_slave.o] Error 1 {noformat} The offending line looks like this: {noformat} // mock_slave.cpp:141 EXPECT_CALL(*this, __recover(_)) .WillRepeatedly(Invoke(this, &MockSlave::unmocked___recover)); {noformat} >From a first glance, it looks like it is caused by additional compile-tim
[jira] [Commented] (MESOS-9323) Relocation errors against symbol id::UUID::random()
[ https://issues.apache.org/jira/browse/MESOS-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653343#comment-16653343 ] Benno Evers commented on MESOS-9323: After further investigation, this was caused by mixing `g++` as default compiler with `lld` as default linker. I'm able to reproduce the unique symbol and the DTPOFF32 relocation using this example program:

{noformat}
$ cat thread_local.cpp
class C {
public:
  static void* foo() {
    static thread_local void* generator = nullptr;
    return generator;
  }
};

void* cfoo() { return C::foo(); }

$ g++ thread_local.cpp -c -O2 -fPIC
{noformat}

But this in itself doesn't seem to be enough to trigger the error, so I still don't know the actual root cause of this problem. > Relocation errors against symbol id::UUID::random() > --- > > Key: MESOS-9323 > URL: https://issues.apache.org/jira/browse/MESOS-9323 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > > Trying to build Mesos on a Fedora 28 machine using a combination of gcc 8.1 > and lld 8-trunk results in the following error: > {noformat} > ld: error: can't create dynamic relocation R_X86_64_DTPOFF32 against symbol: > id::UUID::random()::generator in readonly segment; recompile object files > with -fPIC or pass '-Wl,-z,notext' to allow text relocations in the output > >>> defined in > >>> ./.libs/libmesos_no_3rdparty.a(libmesos_no_3rdparty_la-checker_process.o) > >>> referenced by uuid.hpp:43 (../../3rdparty/stout/include/stout/uuid.hpp:43) > >>> > >>> lt15-libmesos_no_3rdparty_la-manager.o:(mesos::internal::ResourceProviderManagerProcess::newResourceProviderId()) > >>> in archive ./.libs/libmesos_no_3rdparty.a > ld: error: too many errors emitted, stopping now (use -error-limit=0 to see > all errors) > {noformat} > Both the linker and compiler flags already included `-fPIC`, so this part of > the error message seems bogus. > I'm not sure if this is an issue of the compiler generating invalid object files > or the linker misunderstanding the created artifacts. However, the symbol > `id::UUID::random()::generator` is a very special case because it is a > function-local static in an inline function, causing gcc to generate a > special `GNU_UNIQUE` symbol, and also a thread-local variable leading to the > DTPOFF32 relocation. > It seems like this combination of uncommon things is somehow tripping up one > of the involved tools. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
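For anyone trying to reproduce this, both unusual properties of the symbol can be checked directly on the object file with standard binutils (a sketch; exact output varies by binutils version):

{noformat}
# The GNU_UNIQUE binding shows up as "UNIQUE" in the symbol table:
$ readelf -sW thread_local.o | grep -i unique

# The thread-local relocation that lld complains about:
$ readelf -rW thread_local.o | grep DTPOFF
{noformat}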
[jira] [Commented] (MESOS-9302) Mesos fails to build on Fedora 28
[ https://issues.apache.org/jira/browse/MESOS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651692#comment-16651692 ] Benno Evers commented on MESOS-9302: Opened https://reviews.apache.org/r/69043/ to fix the issue by passing `-Wno-error` to the `cares` build. > Mesos fails to build on Fedora 28 > - > > Key: MESOS-9302 > URL: https://issues.apache.org/jira/browse/MESOS-9302 > Project: Mesos > Issue Type: Bug > Environment: gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5) > Fedora 28 >Reporter: Benno Evers >Priority: Major > Labels: build-failure > > Trying to compile a fresh Mesos checkout on a Fedora 28 system with the > following configuration flags: > {noformat} > ../configure --enable-debug --enable-optimize --disable-java --disable-python > --disable-libtool-wrappers --enable-ssl --enable-libevent --disable-werror > {noformat} > and the following compiler > {noformat} > [bev...@core1.hw.ca1 build]$ gcc --version > gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5) > Copyright (C) 2018 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. > {noformat} > fails the build due to two warnings (even though --disable-werror was passed): > {noformat} > make[4]: Entering directory '/home/bevers/mesos/build/3rdparty/grpc-1.10.0' > [C] Compiling third_party/cares/cares/ares_init.c > third_party/cares/cares/ares_init.c: In function ‘ares_dup’: > third_party/cares/cares/ares_init.c:301:17: error: argument to ‘sizeof’ in > ‘strncpy’ call is the same expression as the source; did you mean to use the > size of the destination? [-Werror=sizeof-pointer-memaccess] >sizeof(src->local_dev_name)); > ^ > third_party/cares/cares/ares_init.c: At top level: > cc1: error: unrecognized command line option ‘-Wno-invalid-source-encoding’ > [-Werror] > cc1: all warnings being treated as errors > make[4]: *** [Makefile:2635: > /home/bevers/mesos/build/3rdparty/grpc-1.10.0/objs/opt/third_party/cares/cares/ares_init.o] > Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
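For reference, the warning gcc 8 raises in the c-ares code can be reproduced with a few lines; the struct below is a hypothetical stand-in for the c-ares channel options:

{noformat}
#include <cstring>

struct Options
{
  char local_dev_name[64];
};

// gcc 8 warns here (-Wsizeof-pointer-memaccess): the `sizeof` argument
// is the same expression as the copy source. Since both fields are
// char[64] arrays the copy is actually benign, which is why demoting
// the diagnostic with `-Wno-error` in the cares build is a reasonable fix.
void dup_options(Options* dest, const Options* src)
{
  strncpy(dest->local_dev_name, src->local_dev_name,
          sizeof(src->local_dev_name));
}
{noformat}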
[jira] [Created] (MESOS-9323) Relocation errors against symbol id::UUID::random()
Benno Evers created MESOS-9323: -- Summary: Relocation errors against symbol id::UUID::random() Key: MESOS-9323 URL: https://issues.apache.org/jira/browse/MESOS-9323 Project: Mesos Issue Type: Bug Reporter: Benno Evers Trying to build Mesos on a Fedora 28 machine using a combination of gcc 8.1 and lld 8-trunk results in the following error: {noformat} ld: error: can't create dynamic relocation R_X86_64_DTPOFF32 against symbol: id::UUID::random()::generator in readonly segment; recompile object files with -fPIC or pass '-Wl,-z,notext' to allow text relocations in the output >>> defined in >>> ./.libs/libmesos_no_3rdparty.a(libmesos_no_3rdparty_la-checker_process.o) >>> referenced by uuid.hpp:43 (../../3rdparty/stout/include/stout/uuid.hpp:43) >>> >>> lt15-libmesos_no_3rdparty_la-manager.o:(mesos::internal::ResourceProviderManagerProcess::newResourceProviderId()) >>> in archive ./.libs/libmesos_no_3rdparty.a ld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors) {noformat} Both the linker and compiler flags already included `-fPIC`, so this part of the error message seems bogus. I'm not sure if this is an issue of the compiler generating invalid object files or the linker misunderstanding the created artifacts. However, the symbol `id::UUID::random()::generator` is a very special case because it is a function-local static in an inline function, causing gcc to generate a special `GNU_UNIQUE` symbol, and also a thread-local variable leading to the DTPOFF32 relocation. It seems like this combination of uncommon things is somehow tripping up one of the involved tools. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9302) Mesos fails to build on Fedora 28
Benno Evers created MESOS-9302: -- Summary: Mesos fails to build on Fedora 28 Key: MESOS-9302 URL: https://issues.apache.org/jira/browse/MESOS-9302 Project: Mesos Issue Type: Bug Reporter: Benno Evers Trying to compile a fresh Mesos checkout on a Fedora 28 system with the following configuration flags: {noformat} ../configure --enable-debug --enable-optimize --disable-java --disable-python --disable-libtool-wrappers --enable-ssl --enable-libevent --disable-werror {noformat} and the following compiler {noformat} [bev...@core1.hw.ca1 build]$ gcc --version gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5) Copyright (C) 2018 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. {noformat} fails the build due to two warnings (even though --disable-werror was passed): {noformat} make[4]: Entering directory '/home/bevers/mesos/build/3rdparty/grpc-1.10.0' [C] Compiling third_party/cares/cares/ares_init.c third_party/cares/cares/ares_init.c: In function ‘ares_dup’: third_party/cares/cares/ares_init.c:301:17: error: argument to ‘sizeof’ in ‘strncpy’ call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess] sizeof(src->local_dev_name)); ^ third_party/cares/cares/ares_init.c: At top level: cc1: error: unrecognized command line option ‘-Wno-invalid-source-encoding’ [-Werror] cc1: all warnings being treated as errors make[4]: *** [Makefile:2635: /home/bevers/mesos/build/3rdparty/grpc-1.10.0/objs/opt/third_party/cares/cares/ares_init.o] Error 1 {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9292) Rejected quotas should include a reason in their error message
Benno Evers created MESOS-9292: -- Summary: Rejected quotas should include a reason in their error message Key: MESOS-9292 URL: https://issues.apache.org/jira/browse/MESOS-9292 Project: Mesos Issue Type: Improvement Reporter: Benno Evers If we reject a quota request because there are not enough available resources, we fail with the following error: {noformat} Not enough available cluster capacity to reasonably satisfy quota request; the force flag can be used to override this check {noformat} but we don't print *which* resource was not available. This can be confusing to operators when the quota request covers multiple resources at once. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
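For illustration, a more actionable message could name the offending resource and the shortfall, along these lines (wording invented here for illustration; this is not an existing Mesos message):

{noformat}
Not enough available cluster capacity to reasonably satisfy quota
request: requested 'cpus:64' but only 'cpus:48' are unallocated;
the force flag can be used to override this check
{noformat}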
[jira] [Created] (MESOS-9286) ZooKeeperTest.LeaderContender is flaky
Benno Evers created MESOS-9286: -- Summary: ZooKeeperTest.LeaderContender is flaky Key: MESOS-9286 URL: https://issues.apache.org/jira/browse/MESOS-9286 Project: Mesos Issue Type: Bug Reporter: Benno Evers Observed in an internal CI run in a Mac environment. {noformat} ../../src/tests/zookeeper_tests.cpp:307 Failed to wait 15secs for lostCandidacy {noformat} Sadly, the full build log was lost before it could be investigated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
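A note on the failure message: "Failed to wait 15secs for lostCandidacy" is produced by libprocess's AWAIT_READY test helper, which waits on a future with a default 15-second timeout. A minimal sketch of the pattern (hypothetical test, not the actual ZooKeeperTest code):

{noformat}
#include <gtest/gtest.h>

#include <process/future.hpp>
#include <process/gtest.hpp>

#include <stout/nothing.hpp>

using process::Future;
using process::Promise;

// AWAIT_READY blocks until the future transitions, failing the test
// with "Failed to wait 15secs for <expression>" if it stays pending.
TEST(AwaitExample, PendingFutureTimesOut)
{
  Promise<Nothing> promise;
  Future<Nothing> lostCandidacy = promise.future();

  // Without this line, AWAIT_READY below fails after 15 seconds with
  // "Failed to wait 15secs for lostCandidacy".
  promise.set(Nothing());

  AWAIT_READY(lostCandidacy);
}
{noformat}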
[jira] [Created] (MESOS-9285) DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithAbsolutePathVolume is flaky
Benno Evers created MESOS-9285: -- Summary: DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithAbsolutePathVolume is flaky Key: MESOS-9285 URL: https://issues.apache.org/jira/browse/MESOS-9285 Project: Mesos Issue Type: Bug Reporter: Benno Evers Observed in an internal CI run (4432) in a Debian 8 environment: {noformat} ../../src/tests/containerizer/docker_volume_isolator_tests.cpp:947 Failed to wait 15secs for statusStarting {noformat} Sadly, the full log seems to have been lost before it could be investigated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7217) CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635541#comment-16635541 ] Benno Evers commented on MESOS-7217: Observed again today in run 4432 in a CentOS 7 environment. > CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs is flaky. > > > Key: MESOS-7217 > URL: https://issues.apache.org/jira/browse/MESOS-7217 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.1 > Environment: ubuntu-14.04 >Reporter: Till Toenshoff >Priority: Major > Labels: flaky, flaky-test, mesosphere, test > > The test CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs appears to be flaky > on Ubuntu 14.04. > When failing, the test shows the following: > {noformat} > 14:05:48 [ RUN ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs > 14:05:48 I0306 14:05:48.704794 27340 cluster.cpp:158] Creating default > 'local' authorizer > 14:05:48 I0306 14:05:48.716588 27340 leveldb.cpp:174] Opened db in > 11.681905ms > 14:05:48 I0306 14:05:48.718921 27340 leveldb.cpp:181] Compacted db in > 2.309404ms > 14:05:48 I0306 14:05:48.718945 27340 leveldb.cpp:196] Created db iterator in > 3075ns > 14:05:48 I0306 14:05:48.718951 27340 leveldb.cpp:202] Seeked to beginning of > db in 558ns > 14:05:48 I0306 14:05:48.718955 27340 leveldb.cpp:271] Iterated through 0 > keys in the db in 257ns > 14:05:48 I0306 14:05:48.718966 27340 replica.cpp:776] Replica recovered with > log positions 0 -> 0 with 1 holes and 0 unlearned > 14:05:48 I0306 14:05:48.719113 27361 recover.cpp:451] Starting replica > recovery > 14:05:48 I0306 14:05:48.719172 27361 recover.cpp:477] Replica is in EMPTY > status > 14:05:48 I0306 14:05:48.719460 27361 replica.cpp:673] Replica in EMPTY > status received a broadcasted recover request from > __req_res__(6807)@10.179.217.143:53643 > 14:05:48 I0306 14:05:48.719537 27363 recover.cpp:197] Received a recover > response from a replica in EMPTY status > 14:05:48 I0306 14:05:48.719625 27365 recover.cpp:568] Updating replica > status to STARTING > 14:05:48 I0306 14:05:48.720384 27361 master.cpp:380] Master > cb9586dc-a080-41eb-b5b8-88274f84a20a (ip-10-179-217-143.ec2.internal) started > on 10.179.217.143:53643 > 14:05:48 I0306 14:05:48.720404 27361 master.cpp:382] Flags at startup: > --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/tzyTvK/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" > --registry_max_agent_count="102400" --registry_store_timeout="100secs" > --registry_strict="false" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/tzyTvK/master" --zk_session_timeout="10secs" > 14:05:48 I0306 14:05:48.720553 27361 master.cpp:432] Master 
only allowing > authenticated frameworks to register > 14:05:48 I0306 14:05:48.720559 27361 master.cpp:446] Master only allowing > authenticated agents to register > 14:05:48 I0306 14:05:48.720562 27361 master.cpp:459] Master only allowing > authenticated HTTP frameworks to register > 14:05:48 I0306 14:05:48.720566 27361 credentials.hpp:37] Loading credentials > for authentication from '/tmp/tzyTvK/credentials' > 14:05:48 I0306 14:05:48.720655 27361 master.cpp:504] Using default 'crammd5' > authenticator > 14:05:48 I0306 14:05:48.720700 27361 http.cpp:887] Using default 'basic' > HTTP authenticator for realm 'mesos-master-readonly' > 14:05:48 I0306 14:05:48.720767 27361 http.cpp:887] Using default 'basic' > HTTP authenticator for realm 'mesos-master-readwrite' > 14:05:48 I0306 14:05:48.720808 27361 http.cpp:887] Using default 'basic' > HTTP authenticator for realm 'mesos-master-scheduler' > 14:05:48 I0306 14:05:48.720875 27361 master.cpp:584] Authorization enabled > 14:05:48 I0306 14:05:48.720995 27360 whitelist_watcher.cpp:77] No whitelist > given > 14:05:48 I0306 14:05:48.721005 27364 hierarchical.cpp:149] Initialized > hierarchical allocator pr
[jira] [Created] (MESOS-9280) Allow specification of static reservations relative to the total resources
Benno Evers created MESOS-9280: -- Summary: Allow specification of static reservations relative to the total resources Key: MESOS-9280 URL: https://issues.apache.org/jira/browse/MESOS-9280 Project: Mesos Issue Type: Improvement Reporter: Benno Evers The current user interface for creating static reservations is described here: http://mesos.apache.org/documentation/latest/reservation/ In summary, to create a static reservation, an operator needs to subdivide the available resources on an agent into reserved and unreserved resources, like this: {noformat} mesos-slave --resources="cpus:4;mem:2048;cpus(ads):8;mem(ads):4096" [...] {noformat} However, this can result in some awkward interactions when trying to change static reservations:

1) *Requirement of an explicit upper bound*. By default, an agent will offer all CPUs and all memory of its host machine. However, an agent with the above configuration running on a machine with e.g. 32 cpus will still only offer 12 of them, 8 for `ads` and 4 for general use. An operator planning to deploy this configuration to a diverse set of machines apparently needs to write a script that queries the total amount of available resources on each machine, and to re-run it periodically to capture hardware changes - duplicating functionality that Mesos already offers out-of-the-box.

2) *Interaction with ranges*. A configuration like {noformat} mesos-slave --resources="ports:[0-32655];ports(__internal):[22-22]" [...] {noformat} will lead to the master still offering port 22 to all frameworks, because the master thinks that the reserved port is an additional item of the "ports" resource. On the other hand, a configuration like {noformat} mesos-slave --resources="ports(__internal):[22-22]" [...] {noformat} leaves the master knowing only about the existence of the single, reserved port 22. Again, for an operator planning to reserve this port across a range of diverse agents, the only way seems to be a script that parses the existing configuration and then slices up the ranges like this: {noformat} mesos-slave --resources="ports:[0-21],[23-32655];ports(__internal):[22-22]" [...] {noformat}

Ideally, it would be possible to specify static reservations as a subtraction from the total, i.e. being able to say "Reserve 4 GiB of memory for role X" instead of "Reserve 4 GiB for role X and 4 GiB for general use". Doing so would probably require introducing some additional syntax to the resource specification strings; see the sketch below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
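One possible shape for such subtractive syntax, purely illustrative (neither this flag nor this notation exists in Mesos today):

{noformat}
# Hypothetical flag: reserve the listed amounts out of the agent's
# auto-detected totals and leave the remainder unreserved, instead of
# enumerating both the reserved and the unreserved side explicitly.
mesos-slave --reserve_from_total="cpus(ads):8;mem(ads):4096;ports(__internal):[22-22]" [...]
{noformat}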
[jira] [Created] (MESOS-9276) SlaveRecoveryTest/0.Reboot is flaky
Benno Evers created MESOS-9276: -- Summary: SlaveRecoveryTest/0.Reboot is flaky Key: MESOS-9276 URL: https://issues.apache.org/jira/browse/MESOS-9276 Project: Mesos Issue Type: Bug Reporter: Benno Evers Observed in an internal CI run: (4502) {noformat} ../../src/tests/slave_recovery_tests.cpp:2746: Failure Failed to wait 15secs for executorStatus {noformat} Full log: {noformat} [ RUN ] SlaveRecoveryTest/0.Reboot I0927 12:33:33.620496 2560127808 cluster.cpp:173] Creating default 'local' authorizer I0927 12:33:33.621817 75808768 master.cpp:413] Master b351e786-2364-4c2e-bb10-1efc3c97e509 (Jenkinss-Mac-mini.local) started on 10.0.49.4:65455 I0927 12:33:33.621845 75808768 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/DW8BvT/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/DW8BvT/master" --zk_session_timeout="10secs" I0927 12:33:33.622007 75808768 master.cpp:465] Master only allowing authenticated frameworks to register I0927 12:33:33.622015 75808768 master.cpp:471] Master only allowing authenticated agents to register I0927 12:33:33.622020 75808768 master.cpp:477] Master only allowing authenticated HTTP frameworks to register I0927 12:33:33.622026 75808768 credentials.hpp:37] Loading credentials for authentication from '/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/DW8BvT/credentials' I0927 12:33:33.622184 75808768 master.cpp:521] Using default 'crammd5' authenticator I0927 12:33:33.622243 75808768 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0927 12:33:33.622328 75808768 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0927 12:33:33.622391 75808768 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0927 12:33:33.622442 75808768 master.cpp:602] Authorization enabled I0927 12:33:33.622640 74735616 whitelist_watcher.cpp:77] No whitelist given I0927 12:33:33.622643 75272192 hierarchical.cpp:182] Initialized hierarchical allocator process I0927 12:33:33.624191 77418496 master.cpp:2083] Elected as the leading master! 
I0927 12:33:33.624217 77418496 master.cpp:1638] Recovering from registrar I0927 12:33:33.624264 76881920 registrar.cpp:339] Recovering registrar I0927 12:33:33.624541 76881920 registrar.cpp:383] Successfully fetched the registry (0B) in 255232ns I0927 12:33:33.624619 76881920 registrar.cpp:487] Applied 1 operations in 27286ns; attempting to update the registry I0927 12:33:33.624822 76881920 registrar.cpp:544] Successfully updated the registry in 172032ns I0927 12:33:33.624892 76881920 registrar.cpp:416] Successfully recovered registrar I0927 12:33:33.625068 75272192 master.cpp:1752] Recovered 0 agents from the registry (155B); allowing 10mins for agents to reregister I0927 12:33:33.625089 77955072 hierarchical.cpp:220] Skipping recovery of hierarchical allocator: nothing to recover I0927 12:33:33.626883 2560127808 containerizer.cpp:305] Using isolation { environment_secret, filesystem/posix, posix/mem, posix/cpu } I0927 12:33:33.627074 2560127808 provisioner.cpp:298] Using default backend 'copy' W0927 12:33:33.628770 2560127808 process.cpp:2810] Attempted to spawn already running process files@10.0.49.4:65455 I0927 12:33:33.629148 2560127808 cluster.cpp:485] Creating default 'local' authorizer I0927 12:33:33.630077 75272192 slave.cpp:267] Mesos agent started on (525)@10.0.49.4:65455 I0927 12:33:33.630103 75272192
[jira] [Commented] (MESOS-9079) Test MasterTestPrePostReservationRefinement.LaunchGroup/0 is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631771#comment-16631771 ] Benno Evers commented on MESOS-9079: Observed the same for the `/1` variant: (Run 4504) {noformat} [ RUN ] bool/MasterTestPrePostReservationRefinement.LaunchGroup/1 I0927 16:41:07.341975 2560127808 cluster.cpp:173] Creating default 'local' authorizer I0927 16:41:07.343353 96841728 master.cpp:413] Master d8823df0-8625-4d84-9980-2c64d226d6f8 (Jenkinss-Mac-mini.local) started on 10.0.49.4:56698 I0927 16:41:07.343381 96841728 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1000secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/BKYcbZ/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/BKYcbZ/master" --zk_session_timeout="10secs" I0927 16:41:07.343574 96841728 master.cpp:465] Master only allowing authenticated frameworks to register I0927 16:41:07.343582 96841728 master.cpp:471] Master only allowing authenticated agents to register I0927 16:41:07.343588 96841728 master.cpp:477] Master only allowing authenticated HTTP frameworks to register I0927 16:41:07.343603 96841728 credentials.hpp:37] Loading credentials for authentication from '/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/BKYcbZ/credentials' I0927 16:41:07.343760 96841728 master.cpp:521] Using default 'crammd5' authenticator I0927 16:41:07.343873 96841728 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0927 16:41:07.343940 96841728 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0927 16:41:07.344009 96841728 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0927 16:41:07.344054 96841728 master.cpp:602] Authorization enabled I0927 16:41:07.344269 95232000 whitelist_watcher.cpp:77] No whitelist given I0927 16:41:07.344270 93085696 hierarchical.cpp:182] Initialized hierarchical allocator process I0927 16:41:07.345897 95232000 master.cpp:2083] Elected as the leading master! 
I0927 16:41:07.345923 95232000 master.cpp:1638] Recovering from registrar I0927 16:41:07.345969 93085696 registrar.cpp:339] Recovering registrar I0927 16:41:07.346166 93085696 registrar.cpp:383] Successfully fetched the registry (0B) in 175872ns I0927 16:41:07.346269 93085696 registrar.cpp:487] Applied 1 operations in 23524ns; attempting to update the registry I0927 16:41:07.346478 93085696 registrar.cpp:544] Successfully updated the registry in 183040ns I0927 16:41:07.346536 93085696 registrar.cpp:416] Successfully recovered registrar I0927 16:41:07.346678 93622272 master.cpp:1752] Recovered 0 agents from the registry (155B); allowing 10mins for agents to reregister I0927 16:41:07.346702 94695424 hierarchical.cpp:220] Skipping recovery of hierarchical allocator: nothing to recover W0927 16:41:07.349237 2560127808 process.cpp:2810] Attempted to spawn already running process files@10.0.49.4:56698 I0927 16:41:07.349918 2560127808 containerizer.cpp:305] Using isolation { environment_secret, filesystem/posix, posix/mem, posix/cpu } I0927 16:41:07.350147 2560127808 provisioner.cpp:298] Using default backend 'copy' I0927 16:41:07.351030 2560127808 cluster.cpp:485] Creating default 'local' authorizer I0927 16:41:07.352041 93622272 slave.cpp:267] Mesos agent started on (905)@10.0.49.4:56698 I0927 16:41:07.352071 93622272 slave.cpp:268] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://"; --appc_store_dir="/var/folders/6w/rw03zh013y38ys6cyn8qppf80
[jira] [Created] (MESOS-9273) DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithReadOnlyVolume is flaky
Benno Evers created MESOS-9273: -- Summary: DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithReadOnlyVolume is flaky Key: MESOS-9273 URL: https://issues.apache.org/jira/browse/MESOS-9273 Project: Mesos Issue Type: Bug Reporter: Benno Evers Observed in an internal CI run (4499): {noformat} ../../src/tests/containerizer/docker_volume_isolator_tests.cpp:1361 Failed to wait 15secs for statusStarting {noformat} Full log: {noformat} [ RUN ] DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithReadOnlyVolume I0927 01:52:53.770812 13860 cluster.cpp:173] Creating default 'local' authorizer I0927 01:52:53.771752 3593 master.cpp:413] Master 1c890578-e87d-41a2-bb4c-5ed9b7e0d8ec (ip-172-16-10-139.ec2.internal) started on 172.16.10.139:46305 I0927 01:52:53.771773 3593 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/X4P8mF/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/X4P8mF/master" --zk_session_timeout="10secs" I0927 01:52:53.771903 3593 master.cpp:465] Master only allowing authenticated frameworks to register I0927 01:52:53.771914 3593 master.cpp:471] Master only allowing authenticated agents to register I0927 01:52:53.771920 3593 master.cpp:477] Master only allowing authenticated HTTP frameworks to register I0927 01:52:53.771926 3593 credentials.hpp:37] Loading credentials for authentication from '/tmp/X4P8mF/credentials' I0927 01:52:53.771996 3593 master.cpp:521] Using default 'crammd5' authenticator I0927 01:52:53.772053 3593 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0927 01:52:53.772120 3593 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0927 01:52:53.772158 3593 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0927 01:52:53.772189 3593 master.cpp:602] Authorization enabled I0927 01:52:53.772347 3597 hierarchical.cpp:182] Initialized hierarchical allocator process I0927 01:52:53.772367 3594 whitelist_watcher.cpp:77] No whitelist given I0927 01:52:53.773003 3594 master.cpp:2083] Elected as the leading master! 
I0927 01:52:53.773023 3594 master.cpp:1638] Recovering from registrar I0927 01:52:53.773063 3594 registrar.cpp:339] Recovering registrar I0927 01:52:53.773201 3596 registrar.cpp:383] Successfully fetched the registry (0B) in 117760ns I0927 01:52:53.773241 3596 registrar.cpp:487] Applied 1 operations in 8146ns; attempting to update the registry I0927 01:52:53.773360 3596 registrar.cpp:544] Successfully updated the registry in 102912ns I0927 01:52:53.773396 3596 registrar.cpp:416] Successfully recovered registrar I0927 01:52:53.773474 3596 master.cpp:1752] Recovered 0 agents from the registry (176B); allowing 10mins for agents to reregister I0927 01:52:53.773562 3597 hierarchical.cpp:220] Skipping recovery of hierarchical allocator: nothing to recover I0927 01:52:53.774943 13860 isolator.cpp:144] Initialized the docker volume information root directory at '/run/mesos/isolators/docker/volume' I0927 01:52:53.776796 13860 linux_launcher.cpp:144] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher sh: 1: hadoop: not found I0927 01:52:53.859550 13860 fetcher.cpp:66] Skipping URI fetcher plugin 'hadoop' as it could not be created: Failed to create HDFS client: Hadoop client is not available, exit status: 32512 I0927 01:52:53.859833 13860 registry_puller.cpp:128] Creating registry puller with docker registry 'https://registry-1.docker.io' I0927 01:52:53.860913 138
[jira] [Created] (MESOS-9272) SlaveTest.DefaultExecutorCommandInfo is flaky
Benno Evers created MESOS-9272: -- Summary: SlaveTest.DefaultExecutorCommandInfo is flaky Key: MESOS-9272 URL: https://issues.apache.org/jira/browse/MESOS-9272 Project: Mesos Issue Type: Bug Reporter: Benno Evers Observed in an internal CI run (4499): {noformat} ../../src/tests/cluster.cpp:697 Value of: containers->empty() Actual: false Expected: true Failed to destroy containers: { 743f1b4c-8ce0-4fd4-b952-a7bbc9788775 } {noformat} Full log: {noformat} [ RUN ] SlaveTest.DefaultExecutorCommandInfo I0927 01:48:44.246218 11015 cluster.cpp:173] Creating default 'local' authorizer I0927 01:48:44.247200 11037 master.cpp:413] Master 56a99d2f-f8c8-4d21-a8f7-df452833cce0 (ip-172-16-10-254.ec2.internal) started on 172.16.10.254:33398 I0927 01:48:44.247223 11037 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/7SQ2cR/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7SQ2cR/master" --zk_session_timeout="10secs" I0927 01:48:44.247354 11037 master.cpp:465] Master only allowing authenticated frameworks to register I0927 01:48:44.247364 11037 master.cpp:471] Master only allowing authenticated agents to register I0927 01:48:44.247370 11037 master.cpp:477] Master only allowing authenticated HTTP frameworks to register I0927 01:48:44.247375 11037 credentials.hpp:37] Loading credentials for authentication from '/tmp/7SQ2cR/credentials' I0927 01:48:44.247453 11037 master.cpp:521] Using default 'crammd5' authenticator I0927 01:48:44.247488 11037 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0927 01:48:44.247519 11037 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0927 01:48:44.247541 11037 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0927 01:48:44.247668 11037 master.cpp:602] Authorization enabled I0927 01:48:44.247741 11036 hierarchical.cpp:182] Initialized hierarchical allocator process I0927 01:48:44.247782 11036 whitelist_watcher.cpp:77] No whitelist given I0927 01:48:44.248339 11036 master.cpp:2083] Elected as the leading master! 
I0927 01:48:44.248358 11036 master.cpp:1638] Recovering from registrar I0927 01:48:44.248430 11036 registrar.cpp:339] Recovering registrar I0927 01:48:44.248623 11037 registrar.cpp:383] Successfully fetched the registry (0B) in 168960ns I0927 01:48:44.248658 11037 registrar.cpp:487] Applied 1 operations in 6362ns; attempting to update the registry I0927 01:48:44.248767 11037 registrar.cpp:544] Successfully updated the registry in 94208ns I0927 01:48:44.248795 11037 registrar.cpp:416] Successfully recovered registrar I0927 01:48:44.248880 11036 hierarchical.cpp:220] Skipping recovery of hierarchical allocator: nothing to recover I0927 01:48:44.248901 11037 master.cpp:1752] Recovered 0 agents from the registry (176B); allowing 10mins for agents to reregister W0927 01:48:44.250870 11015 process.cpp:2810] Attempted to spawn already running process files@172.16.10.254:33398 I0927 01:48:44.251050 11015 cluster.cpp:485] Creating default 'local' authorizer I0927 01:48:44.251428 11035 slave.cpp:267] Mesos agent started on (662)@172.16.10.254:33398 I0927 01:48:44.251672 11015 scheduler.cpp:189] Version: 1.8.0 I0927 01:48:44.251443 11035 slave.cpp:268] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://"; --appc_store_dir="/tmp/SlaveTest_DefaultExecutorCommandInfo_DsiR0M/store/appc" --authenticate_http_executors="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticatee="cra
[jira] [Created] (MESOS-9271) DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP is flaky
Benno Evers created MESOS-9271: -- Summary: DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP is flaky Key: MESOS-9271 URL: https://issues.apache.org/jira/browse/MESOS-9271 Project: Mesos Issue Type: Bug Reporter: Benno Evers Observed in an internal CI run (4498): {noformat} ../../src/tests/health_check_tests.cpp:2080 Failed to wait 15secs for statusHealthy {noformat} Full log: {noformat} [ RUN ] NetworkProtocol/DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP/1 I0927 00:57:43.336710 27845 docker.cpp:1659] Running docker -H unix:///var/run/docker.sock inspect zhq527725/https-server:latest I0927 00:57:43.340283 27845 docker.cpp:1659] Running docker -H unix:///var/run/docker.sock inspect alpine:latest I0927 00:57:43.343433 27845 docker.cpp:1659] Running docker -H unix:///var/run/docker.sock inspect alpine:latest I0927 00:57:43.857142 27845 cluster.cpp:173] Creating default 'local' authorizer I0927 00:57:43.858705 19628 master.cpp:413] Master f9e9ac63-826d-4d08-b216-c5f352afc25d (ip-172-16-10-217.ec2.internal) started on 172.16.10.217:32836 I0927 00:57:43.858727 19628 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/QIaitl/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/QIaitl/master" --zk_session_timeout="10secs" I0927 00:57:43.858912 19628 master.cpp:465] Master only allowing authenticated frameworks to register I0927 00:57:43.858942 19628 master.cpp:471] Master only allowing authenticated agents to register I0927 00:57:43.858948 19628 master.cpp:477] Master only allowing authenticated HTTP frameworks to register I0927 00:57:43.858955 19628 credentials.hpp:37] Loading credentials for authentication from '/tmp/QIaitl/credentials' I0927 00:57:43.859072 19628 master.cpp:521] Using default 'crammd5' authenticator I0927 00:57:43.859141 19628 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0927 00:57:43.859200 19628 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0927 00:57:43.859246 19628 http.cpp:1037] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0927 00:57:43.859268 19628 master.cpp:602] Authorization enabled I0927 
00:57:43.859541 19629 hierarchical.cpp:182] Initialized hierarchical allocator process I0927 00:57:43.859582 19629 whitelist_watcher.cpp:77] No whitelist given I0927 00:57:43.860060 19628 master.cpp:2083] Elected as the leading master! I0927 00:57:43.860078 19628 master.cpp:1638] Recovering from registrar I0927 00:57:43.860117 19628 registrar.cpp:339] Recovering registrar I0927 00:57:43.860285 19628 registrar.cpp:383] Successfully fetched the registry (0B) in 144128ns I0927 00:57:43.860328 19628 registrar.cpp:487] Applied 1 operations in 8246ns; attempting to update the registry I0927 00:57:43.860527 19624 registrar.cpp:544] Successfully updated the registry in 167168ns I0927 00:57:43.860571 19624 registrar.cpp:416] Successfully recovered registrar I0927 00:57:43.860698 19625 master.cpp:1752] Recovered 0 agents from the registry (176B); allowing 10mins for agents to reregister I0927 00:57:43.860761 19625 hierarchical.cpp:220] Skipping recovery of hierarchical allocator: nothing to recover W0927 00:57:43.863813 27845 process.cpp:2810] Attempted to spawn already running process files@172.16.10.217:32836 I0927 00:57:43.863989 27845 cluster.cpp:485] Creating default 'local' authorizer I0927 00:57:43.864542 19628 slave.cpp:267] Mesos agent started on (1170)@172