[VOTE] Release Apache Mesos 1.0.2 (rc2)

2016-10-31 Thread Vinod Kone
Hi all,


Please vote on releasing the following candidate as Apache Mesos 1.0.2.


This is a bug fix release.


The CHANGELOG for the release is available at:

https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.0.2-rc2




The candidate for Mesos 1.0.2 release is available at:

https://dist.apache.org/repos/dist/dev/mesos/1.0.2-rc2/mesos-1.0.2.tar.gz


The tag to be voted on is 1.0.2-rc2:

https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.0.2-rc2


The MD5 checksum of the tarball can be found at:

https://dist.apache.org/repos/dist/dev/mesos/1.0.2-rc2/mesos-1.0.2.tar.gz.md5


The signature of the tarball can be found at:

https://dist.apache.org/repos/dist/dev/mesos/1.0.2-rc2/mesos-1.0.2.tar.gz.asc


The PGP key used to sign the release is here:

https://dist.apache.org/repos/dist/release/mesos/KEYS


The JAR is up in Maven in a staging repository here:

https://repository.apache.org/content/repositories/orgapachemesos-1164


Please vote on releasing this package as Apache Mesos 1.0.2!


The vote is open until Thu Nov  3 16:34:20 PDT 2016 and passes if a
majority of at least 3 +1 PMC votes are cast.


[ ] +1 Release this package as Apache Mesos 1.0.2

[ ] -1 Do not release this package because ...


Thanks,


Re: Jenkins Build Labels

2016-10-31 Thread Vinod Kone
I can't remember now what the issue was with ubuntu-us1 and ubuntu-6. I
can remove the labels and see if things continue to work ok.

Are all `Hadoop` nodes now tagged with `docker` as well?

On Mon, Oct 31, 2016 at 12:23 PM, Daniel Takamori  wrote:

> Greetings from the Infra Team,
> We've noticed that some subset of your builds use the label:
> "(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-6)"; recently we've completed
> puppetizing the build nodes so they should all match.  Are there problems
> with ubuntu-us1 and ubuntu-6 that haven't been addressed?
>
> Cheers,
> -Pono
>


Jenkins Build Labels

2016-10-31 Thread Daniel Takamori
Greetings from the Infra Team,
We've noticed that some subset of your builds use the label: 
"(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-6)"; recently we've completed 
puppetizing the build nodes so they should all match.  Are there problems with 
ubuntu-us1 and ubuntu-6 that haven't been addressed?

Cheers,
-Pono


Re: Build failed in Jenkins: Mesos » autotools,gcc,--verbose,GLOG_v=1 MESOS_VERBOSE=1,ubuntu:14.04,(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-6) #2853

2016-10-31 Thread Alex R
From the log:
I1031 13:57:54.254748 29433 sched.cpp:820] Sending SUBSCRIBE call to
master@172.17.0.2:34385
I1031 13:58:18.482736 29433 sched.cpp:853] Will retry registration in
3.204867476secs if necessary

Looks like this VM experienced lag: roughly 24 seconds elapsed between
sending the SUBSCRIBE call and scheduling the registration retry.
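
For context, the ~3.2s figure in the second log line is the driver's
randomized registration backoff. A rough, hypothetical illustration of
how such an interval can arise (this is NOT the actual sched.cpp logic;
the base and cap values are made up):

  #include <algorithm>
  #include <cstdlib>
  #include <iostream>

  // Randomized exponential backoff: double a base window per attempt,
  // cap it, then pick a random point inside the window so that many
  // schedulers don't all retry in lockstep.
  double backoff(int attempt, double base = 2.0, double cap = 60.0)
  {
    const double window = std::min(cap, base * (1 << attempt));
    return window * (static_cast<double>(std::rand()) / RAND_MAX);
  }

  int main()
  {
    std::srand(42);
    for (int attempt = 0; attempt < 4; ++attempt) {
      std::cout << "attempt " << attempt << ": retry in "
                << backoff(attempt) << "s\n";
    }
    return 0;
  }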

On 31 October 2016 at 15:04, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See <truncated Jenkins URL:
> .../COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-6)/2853/changes>
>
> Changes:
>
> [alexr] Added Hubert Asamer to contributors list.
>
> --
> [...truncated 215386 lines...]
> I1031 14:02:18.818614 29414 containerizer.cpp:201] Using isolation:
> posix/cpu,posix/mem,filesystem/posix,network/cni
> W1031 14:02:18.819149 29414 backend.cpp:76] Failed to create 'aufs'
> backend: AufsBackend requires root privileges, but is running as user mesos
> W1031 14:02:18.819303 29414 backend.cpp:76] Failed to create 'bind'
> backend: BindBackend requires root privileges
> I1031 14:02:18.821573 29434 slave.cpp:208] Mesos agent started on (636)@
> 172.17.0.2:34385
> I1031 14:02:18.821596 29434 slave.cpp:209] Flags at startup: --acls=""
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc"
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true"
> --authenticatee="crammd5" --authentication_backoff_factor="1secs"
> --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false"
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> --cgroups_limit_swap="false" --cgroups_root="mesos" 
> --container_disk_watch_interval="15secs"
> --containerizers="mesos" --credential="/tmp/Endpoint_SlaveEndpointTest_
> AuthorizedRequest_1_kOzKJN/credential" --default_role="*"
> --disk_watch_interval="1mins" --docker="docker"
> --docker_kill_orphans="true" --docker_registry="https://
> registry-1.docker.io" --docker_remove_delay="6hrs"
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns"
> --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_
> dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false"
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs"
> --fetcher_cache_dir="/tmp/Endpoint_SlaveEndpointTest_
> AuthorizedRequest_1_kOzKJN/fetch" --fetcher_cache_size="2GB"
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> --hadoop_home="" --help="false" --hostname_lookup="true"
> --http_authenticators="basic" --http_command_executor="false"
> --http_credentials="/tmp/Endpoint_SlaveEndpointTest_
> AuthorizedRequest_1_kOzKJN/http_credentials" 
> --image_provisioner_backend="copy"
> --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem"
> --launcher="posix" --launcher_dir="/mesos/mesos-1.2.0/_build/src"
> --logbufsecs="0" --logging_level="INFO" 
> --max_completed_executors_per_framework="150"
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
> --perf_interval="1mins" --qos_correction_interval_min="0ns"
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> --registration_backoff_factor="10ms" --resources="cpus:2;gpus:0;
> mem:1024;disk:1024;ports:[31000-32000]" --revocable_cpu_low_priority="true"
> --runtime_dir="/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_kOzKJN"
> --sandbox_directory="/mnt/mesos/sandbox" --strict="true"
> --switch_user="true" --systemd_enable_support="true"
> --systemd_runtime_directory="/run/systemd/system" --version="false"
> --work_dir="/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_CWNg3F"
> I1031 14:02:18.822060 29434 credentials.hpp:86] Loading credential for
> authentication from '/tmp/Endpoint_SlaveEndpointTest_
> AuthorizedRequest_1_kOzKJN/credential'
> I1031 14:02:18.822191 29434 slave.cpp:346] Agent using credential for:
> test-principal
> I1031 14:02:18.822214 29434 credentials.hpp:37] Loading credentials for
> authentication from '/tmp/Endpoint_SlaveEndpointTest_
> AuthorizedRequest_1_kOzKJN/http_credentials'
> I1031 14:02:18.822412 29434 http.cpp:887] Using default 'basic' HTTP
> authenticator for realm 'mesos-agent-readonly'
> I1031 14:02:18.822532 29434 http.cpp:887] Using default 'basic' HTTP
> authenticator for realm 'mesos-agent-readwrite'
> I1031 14:02:18.823534 29434 slave.cpp:533] Agent resources: cpus(*):2;
> mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> I1031 14:02:18.823609 29434 slave.cpp:541] Agent attributes: [  ]
> I1031 14:02:18.823626 29434 slave.cpp:546] Agent hostname: 114ab47ff4d5
> I1031 14:02:18.824906 29446 state.cpp:57] Recovering state from
> '/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_CWNg3F/meta'
> I1031 14:02:18.825286 29441 status_update_manager.cpp:203] Recovering
> status update manager
> I1031 14:02:18.825700 29443 containerizer.cpp:557] Recovering containerizer
> I1031 

Transition TASK_KILLING -> TASK_RUNNING

2016-10-31 Thread Alex Rukletsov
We've recently discovered a bug that may lead to a task being transitioned
from the TASK_KILLING state back to TASK_RUNNING. More information is
available in MESOS-6457 [1]. We plan to fix it in 1.2.0 and will backport
the fix to all supported versions.

[1] https://issues.apache.org/jira/browse/MESOS-6457


Re: Build failed in Jenkins: Mesos » autotools,gcc,--verbose --enable-libevent --enable-ssl,GLOG_v=1 MESOS_VERBOSE=1,centos:7,(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-6) #2852

2016-10-31 Thread Neil Conway
I spent a little while looking into this. The
"PersistentVolumeEndpointsTest.OfferCreateThenEndpointRemove" test
fails on the following expectations:

https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1562
https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1564
https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1573

All of these seem quite innocent: similar or identical preamble code
occurs in many test cases. Looking at the log, it seems the scheduler
begins the authentication process but authentication times out:

12:27:56.527899 31618 sched.cpp:226] Version: 1.2.0
12:27:56.528548 31638 sched.cpp:330] New master detected at
master@172.17.0.2:48653
12:27:56.528661 31638 sched.cpp:396] Authenticating with master
master@172.17.0.2:48653
12:27:56.528681 31638 sched.cpp:403] Using default CRAM-MD5 authenticatee
12:28:01.529717 31637 sched.cpp:526] Authentication timed out
12:28:10.795253 31637 sched.cpp:466] Failed to authenticate with
master master@172.17.0.2:48653: Authentication discarded

In the scheduler driver, we fail the "authenticating" future at
12:28:01, but it is ~9 seconds before the associated `onAny` callback
is invoked to schedule retrying authentication; by the time the retry
backoff timeout expires, we've exceeded the 15 second Future timeout
in the test case.
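
To make the timing concrete (a back-of-the-envelope sketch; the 15s
figure is the default expectation-future deadline in the libprocess
test helpers, the rest is read straight off the log):

  #include <iostream>

  int main()
  {
    // Offsets in seconds from 12:27:56.5, when authentication started.
    const double authTimedOut = 5.0;   // 12:28:01.5 in the log
    const double onAnyInvoked = 14.3;  // 12:28:10.8 in the log
    const double testDeadline = 15.0;  // default future timeout in tests

    // The retry is only scheduled once `onAny` runs, so even a
    // zero-length backoff leaves well under a second for the whole
    // re-authentication round trip before the expectation gives up.
    std::cout << "slack remaining: " << (testDeadline - onAnyInvoked)
              << "s\n";  // ~0.7s
    return 0;
  }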

Note that between 12:28:01.5 and 12:28:10.8, there is essentially
nothing happening:

W1031 12:28:01.529717 31637 sched.cpp:526] Authentication timed out
W1031 12:28:01.529752 31645 master.cpp:6789] Authentication timed out
I1031 12:28:10.794798 31652 status_update_manager.cpp:203] Recovering
status update manager
W1031 12:28:10.795033 31645 master.cpp:6769] Failed to authenticate
scheduler-877be3e9-9dc1-4de1-bf3e-a19b77b3d124@172.17.0.2:48653:
Authentication discarded
I1031 12:28:10.794939 31647 authenticator.cpp:432] Authentication
session cleanup for crammd5-authenticatee(655)@172.17.0.2:48653
I1031 12:28:10.795253 31637 sched.cpp:466] Failed to authenticate with
master master@172.17.0.2:48653: Authentication discarded

So I think the most likely culprit is VM lag.

We can try to work around this by increasing some of the timeouts for
the test expectation futures, but of course that is just a kludge: if
we're going to experience random ~9.5 second VM-wide pauses, the tests
are likely to continue to be flaky unless we make more widespread
changes (e.g., increasing ALL expectation future timeouts).
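
For reference, the kludge would look something like this (sketch only;
AWAIT_READY's 15-second default and AWAIT_READY_FOR both come from
libprocess's process/gtest.hpp, but treat the exact call sites as
assumptions - the real expectations are in
persistent_volume_endpoints_tests.cpp):

  #include <process/gtest.hpp>   // AWAIT_READY, AWAIT_READY_FOR
  #include <stout/duration.hpp>  // Seconds

  // Inside a test body, for an expectation future such as `offers`:
  //
  //   AWAIT_READY(offers);                   // today: 15s default deadline
  //   AWAIT_READY_FOR(offers, Seconds(60));  // kludge: widened deadline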

Neil


On Mon, Oct 31, 2016 at 8:34 AM, Apache Jenkins Server
 wrote:
> See <truncated Jenkins build URL>
>
> Changes:
>
> [alexr] Updated the stale comment in agent flags.
>
> --
> [...truncated 219320 lines...]
> W1031 12:32:10.921492 31618 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges, but is running as user mesos
> W1031 12:32:10.921664 31618 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1031 12:32:10.925060 31647 slave.cpp:208] Mesos agent started on 
> (635)@172.17.0.2:48653
> I1031 12:32:10.925091 31647 slave.cpp:209] Flags at startup: --acls="" 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" 
> --credential="/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_j6HfxC/credential"
>  --default_role="*" --disk_watch_interval="1mins" --docker="docker" 
> --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
> --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
> --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/Endpoint_SlaveEndpointTest_AuthorizedRequest_1_j6HfxC/fetch"
>  --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
> --gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" 
> 

[VOTE] Release Apache Mesos 1.1.0 (rc2)

2016-10-31 Thread Till Toenshoff
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.1.0.


1.1.0 includes the following:

  * [MESOS-2449] - **Experimental** support for launching a group of tasks
via a new `LAUNCH_GROUP` Offer operation. Mesos will guarantee that either
all tasks or none of the tasks in the group are delivered to the executor.
Executors receive the task group via a new `LAUNCH_GROUP` event (see the
sketch after this list).

  * [MESOS-2533] - **Experimental** support for HTTP and HTTPS health checks.
Executors may now use the updated `HealthCheck` protobuf to implement
HTTP(S) health checks. Both default executors (command and docker) leverage
the `curl` binary for sending HTTP(S) requests and connect to `127.0.0.1`,
hence a task must listen on all interfaces. On Linux, for BRIDGE and USER
network modes, the docker executor enters the task's network namespace.
(An example health check appears in the sketch after this list.)

  * [MESOS-3421] - **Experimental** Support sharing of resources across
containers. Currently persistent volumes are the only resources allowed to
be shared.

  * [MESOS-3567] - **Experimental** support for TCP health checks. Executors
may now use the updated `HealthCheck` protobuf to implement TCP health
checks. Both default executors (command and docker) connect to `127.0.0.1`,
hence a task must listen on all interfaces. On Linux, for BRIDGE and USER
network modes, the docker executor enters the task's network namespace.

  * [MESOS-4324] - Allow tasks to access persistent volumes as read-only or
read-write. Mesos doesn't allow persistent volumes to be created as
read-only, but starting with 1.1 tasks can use the volumes as read-only.
This is mainly motivated by shared persistent volumes, but applies to
regular persistent volumes as well.

  * [MESOS-5275] - **Experimental** support for Linux capabilities. Frameworks
or operators now have fine-grained control over the capabilities that a
container may have. This allows a container to run as root, but not have all
the privileges associated with the root user (e.g., CAP_SYS_ADMIN).

  * [MESOS-5344] - **Experimental** support for partition-aware Mesos
frameworks. In previous Mesos releases, when an agent is partitioned from
the master and then reregisters with the cluster, all tasks running on the
agent are terminated and the agent is shut down. In Mesos 1.1, partitioned
agents will no longer be shut down when they reregister with the master. By
default, tasks running on such agents will still be killed (for backward
compatibility); however, frameworks can opt in to the new PARTITION_AWARE
capability. If they do, their tasks will not be killed when a partition
is healed. This allows frameworks to define their own policies for how to
handle partitioned tasks. Enabling the PARTITION_AWARE capability also
introduces a new set of task states: TASK_UNREACHABLE, TASK_DROPPED,
TASK_GONE, TASK_GONE_BY_OPERATOR, and TASK_UNKNOWN. These new states are
intended to eventually replace the TASK_LOST state.

  * [MESOS-6077] - **Experimental** A new default executor is introduced which
frameworks can use to launch task groups as nested containers. All the
nested containers share resources like cpu, memory, network, and volumes.

  * [MESOS-6014] - **Experimental** A new port-mapper CNI plugin, the
`mesos-cni-port-mapper`, has been introduced. For Mesos containers, with
the CNI port-mapper plugin, users can now expose container ports through
host ports using DNAT. This is especially useful when Mesos containers are
attached to isolated CNI networks such as private bridge networks, and the
services running in the container need to be exposed outside these
isolated networks.
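
To illustrate the first two items above, here is a hedged sketch of how
a framework might attach an HTTP health check to a task and submit the
task group via the new `LAUNCH_GROUP` operation using the v1 API. Field
names follow mesos.proto as of this release, but the port, path, and
helper function are made up for illustration:

  #include <vector>

  #include <mesos/v1/mesos.hpp>

  // Sketch: give the first task an HTTP health check, then wrap all
  // tasks in a task group inside a LAUNCH_GROUP offer operation.
  // Assumes `tasks` is non-empty.
  mesos::v1::Offer::Operation launchGroup(
      const mesos::v1::ExecutorInfo& executor,
      std::vector<mesos::v1::TaskInfo> tasks)
  {
    // The default executors probe via `curl` against 127.0.0.1, so the
    // task must listen on all interfaces for this check to pass.
    mesos::v1::HealthCheck health;
    health.set_type(mesos::v1::HealthCheck::HTTP);
    health.mutable_http()->set_port(8080);      // hypothetical port
    health.mutable_http()->set_path("/health"); // hypothetical path
    tasks[0].mutable_health_check()->CopyFrom(health);

    mesos::v1::Offer::Operation operation;
    operation.set_type(mesos::v1::Offer::Operation::LAUNCH_GROUP);
    operation.mutable_launch_group()->mutable_executor()->CopyFrom(executor);

    // Mesos delivers the group atomically: either every task below
    // reaches the executor, or none of them do.
    mesos::v1::TaskGroupInfo* group =
        operation.mutable_launch_group()->mutable_task_group();
    for (const mesos::v1::TaskInfo& task : tasks) {
      group->add_tasks()->CopyFrom(task);
    }

    return operation;
  }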


The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.0-rc2


The candidate for Mesos 1.1.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc2/mesos-1.1.0.tar.gz

The tag to be voted on is 1.1.0-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.0-rc2

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc2/mesos-1.1.0.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc2/mesos-1.1.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1162

Please vote on releasing this package as Apache Mesos 1.1.0!

The vote is open until Thu Nov  3 14:46:55 CET 2016 and passes if a majority of 
at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.0

[ ] -1 Do not release this package because ...

[GitHub] mesos pull request #174: Update contributors.yaml -> Hubert Asamer

2016-10-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/mesos/pull/174


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] mesos pull request #174: Update contributors.yaml -> Hubert Asamer

2016-10-31 Thread rukletsov
Github user rukletsov commented on a diff in the pull request:

https://github.com/apache/mesos/pull/174#discussion_r85733844
  
--- Diff: docs/contributors.yaml ---
@@ -597,3 +597,12 @@
 - xingz...@cn.ibm.com
   jira_user: dongdong
   reviewboard_user: dongdong
+  
--- End diff --

This line contains trailing whitespaces.




[GitHub] mesos pull request #174: Update contributors.yaml -> Hubert Asamer

2016-10-31 Thread rukletsov
Github user rukletsov commented on a diff in the pull request:

https://github.com/apache/mesos/pull/174#discussion_r85733901
  
--- Diff: docs/contributors.yaml ---
@@ -597,3 +597,12 @@
 - xingz...@cn.ibm.com
   jira_user: dongdong
   reviewboard_user: dongdong
+  
+- name: Hubert Asamer
--- End diff --

Entries are sorted alphabetically in this file.




Test failures in Apache Jenkins

2016-10-31 Thread Alex Rukletsov
Folks,

I observe a lot of flaky tests in Apache Jenkins. They seem rather random
and not tied to particular machines (saw failures on H1 and on H2).
Moreover, there are no tickets for them and I haven't seen any of those
failures in our internal CI.

Does anyone have an idea about any recent changes in the test harness,
libprocess, or anything else that could lead to this? It's probably not
related to MESOS-6180, because not all failures are future-timeout
induced.

For example, in the last day I saw these guys failing:
ReconciliationTest.RecoveredAgent [1]
MasterTest.TaskLabels [2]
RoleTest.ImplicitRoleRegister [3]
ReconciliationTest.ImplicitTerminalTask [4]
ReservationTest.BadACLDropReserve [5]
ReservationTest.CompatibleCheckpointedResources [6]
ContentType/SchedulerHttpApiTest.Subscribe/0 [7]

[1] https://goo.gl/cs88BD
[2] https://goo.gl/gTzKUV
[3] https://goo.gl/7pGaQG
[4] https://goo.gl/ccq38D
[5] https://goo.gl/0R1eOO
[6] https://goo.gl/xKQzUt
[7] https://goo.gl/HZmiGJ