Re: [VOTE] Release Apache Mesos 1.3.0 (rc3)

2017-05-30 Thread Gastón Kleiman
On Tue, May 30, 2017 at 3:43 PM, Neil Conway  wrote:

> On Tue, May 30, 2017 at 2:36 PM, Vinod Kone  wrote:
> > Ran on ASF CI.
> >
> > Found following issues.
> >
> > Failed test: CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
> >  Release/36/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--
> enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
> 20MESOS_VERBOSE=1,OS=centos:7,label_exp=(docker%7C%7CHadoop)
> &&(!ubuntu-us1)&&(!ubuntu-eu2)/console>
>
> Attached is the test log for this failure. From a quick look, seems as
> though the agent starts to launch the task, including forking the
> child process, but no subsequent task status updates or error messages
> are observed. Gaston, have you seen this before?
>
> I filed https://issues.apache.org/jira/browse/MESOS-7589 to track this.
>

Nope, I haven't seen this before.

It looks like the executor wasn't launched or something went wrong at a
very early stage of the executor's launch.

This is what the logs from a successful run look like:

I0530 20:19:52.679759 11282 launcher.cpp:140] Forked child with pid '11304' for container '703bded9-43de-4950-986f-0eaec7bbe664'
I0530 20:19:52.692080 11268 fetcher.cpp:324] Starting to fetch URIs for container: 703bded9-43de-4950-986f-0eaec7bbe664, directory: /tmp/CommandExecutorCheckTest_CommandCheckDeliveredAndReconciled_fqE44y/slaves/7c74b451-3e64-4c95-8570-ad3b5b60962b-S0/frameworks/7c74b451-3e64-4c95-8570-ad3b5b60962b-/executors/55b387b7-c1f1-47d5-9f19-2d676cd4ef2e/runs/703bded9-43de-4950-986f-0eaec7bbe664
I0530 20:19:53.517201 11268 hierarchical.cpp:2095] Filtered offer with cpus(*):1.9; mem(*):992; disk(*):992; ports(*):[31000-32000] on agent 7c74b451-3e64-4c95-8570-ad3b5b60962b-S0 for role * of framework 7c74b451-3e64-4c95-8570-ad3b5b60962b-
I0530 20:19:53.517283 11268 hierarchical.cpp:1861] No allocations performed
I0530 20:19:53.517318 11268 hierarchical.cpp:1951] No inverse offers to send out!
I0530 20:19:53.517367 11268 hierarchical.cpp:1445] Performed allocation for 1 agents in 654478ns
I0530 20:19:54.518506 11277 hierarchical.cpp:2095] Filtered offer with cpus(*):1.9; mem(*):992; disk(*):992; ports(*):[31000-32000] on agent 7c74b451-3e64-4c95-8570-ad3b5b60962b-S0 for role * of framework 7c74b451-3e64-4c95-8570-ad3b5b60962b-
I0530 20:19:54.518584 11277 hierarchical.cpp:1861] No allocations performed
I0530 20:19:54.518616 11277 hierarchical.cpp:1951] No inverse offers to send out!
I0530 20:19:54.518662 11277 hierarchical.cpp:1445] Performed allocation for 1 agents in 568919ns
I0530 20:19:55.520282 11276 hierarchical.cpp:2095] Filtered offer with cpus(*):1.9; mem(*):992; disk(*):992; ports(*):[31000-32000] on agent 7c74b451-3e64-4c95-8570-ad3b5b60962b-S0 for role * of framework 7c74b451-3e64-4c95-8570-ad3b5b60962b-
I0530 20:19:55.520364 11276 hierarchical.cpp:1861] No allocations performed
I0530 20:19:55.520409 11276 hierarchical.cpp:1951] No inverse offers to send out!
I0530 20:19:55.520463 11276 hierarchical.cpp:1445] Performed allocation for 1 agents in 621097ns
I0530 20:19:56.521884 11271 hierarchical.cpp:2095] Filtered offer with cpus(*):1.9; mem(*):992; disk(*):992; ports(*):[31000-32000] on agent 7c74b451-3e64-4c95-8570-ad3b5b60962b-S0 for role * of framework 7c74b451-3e64-4c95-8570-ad3b5b60962b-
I0530 20:19:56.521966 11271 hierarchical.cpp:1861] No allocations performed
I0530 20:19:56.522001 11271 hierarchical.cpp:1951] No inverse offers to send out!
I0530 20:19:56.522047 11271 hierarchical.cpp:1445] Performed allocation for 1 agents in 597601ns
I0530 20:19:57.523320 11267 hierarchical.cpp:2095] Filtered offer with cpus(*):1.9; mem(*):992; disk(*):992; ports(*):[31000-32000] on agent 7c74b451-3e64-4c95-8570-ad3b5b60962b-S0 for role * of framework 7c74b451-3e64-4c95-8570-ad3b5b60962b-
I0530 20:19:57.523398 11267 hierarchical.cpp:1861] No allocations performed
I0530 20:19:57.523432 11267 hierarchical.cpp:1951] No inverse offers to send out!
I0530 20:19:57.523478 11267 hierarchical.cpp:1445] Performed allocation for 1 agents in 573802ns
I0530 20:19:57.641808 11358 exec.cpp:162] Version: 1.4.0
I0530 20:19:57.648437 11268 slave.cpp:3809] Got registration for executor '55b387b7-c1f1-47d5-9f19-2d676cd4ef2e' of framework 7c74b451-3e64-4c95-8570-ad3b5b60962b- from executor(1)@192.99.40.208:43797
I0530 20:19:57.650589 11279 slave.cpp:2526] Sending queued task '55b387b7-c1f1-47d5-9f19-2d676cd4ef2e' to executor '55b387b7-c1f1-47d5-9f19-2d676cd4ef2e' of framework 7c74b451-3e64-4c95-8570-ad3b5b60962b- at executor(1)@192.99.40.208:43797
I0530 20:19:57.650804 11356 exec.cpp:237] Executor registered on agent 7c74b451-3e64-4c95-8570-ad3b5b60962b-S0
Received SUBSCRIBED event
Subscribed execu

Coverity Scan: Analysis completed for Mesos

2017-05-30 Thread scan-admin

Your request for analysis of Mesos has been completed successfully.
The results are available at 
https://u2389337.ct.sendgrid.net/wf/click?upn=08onrYu34A-2BWcWUl-2F-2BfV0V05UPxvVjWch-2Bd2MGckcRZ-2B0hUmbDL5L44V5w491gwG_yCAaqzzx-2F-2BA2mRMpk03t3x9hscHw355FKzcsrEtTtpHLGIeZ-2BhgsTUQpK5WT8ysqC2k4gqhmIi6A55gPfdTRhHbFVyR-2FsrpzzOLE2-2F5qRN2zOtq7rfCuvPkUZX18M1w8eyccylq6jWfopDD6kViXh07VFhkSP1nSvnlC8907hZaOoB4gZuH-2B80ciG6ZUohm1EFavkD9ksJLBRRjHaya-2FQbJG-2Fj-2FL80OGtUFJ6Ta9AvA-3D

Analysis Summary:
   New defects found: 15
   Defects eliminated: 0

If you have difficulty understanding any defects, email us at 
scan-ad...@coverity.com,
or post your question to StackOverflow
at 
https://u2389337.ct.sendgrid.net/wf/click?upn=OgIsEqWzmIl4S-2FzEUMxLXL-2BukuZt9UUdRZhgmgzAKchwAzH1nH3073xDEXNRgHN6q227lMNIWoOb8ZgSjAjKcg-3D-3D_yCAaqzzx-2F-2BA2mRMpk03t3x9hscHw355FKzcsrEtTtpHLGIeZ-2BhgsTUQpK5WT8ysqTMiri8nPSG1wc4RzzkAgsVo2QuNKxOn4hukikuAGKXHR9i3SNbqDlgeRjTTjI9Dg04oZmptxBZNGrZDDnROsGFRZB8oc3KXrJy-2BJjrbm2Ejb04NrKor9AkPZoOF0oLeOoHo4ff92-2BbfmNbk-2FPQUg2sQId36dgsfJYDH1YoyBFZk-3D


Re: [VOTE] Release Apache Mesos 1.3.0 (rc3)

2017-05-30 Thread Neil Conway
On Tue, May 30, 2017 at 2:36 PM, Vinod Kone  wrote:
> Ran on ASF CI.
>
> Found following issues.
>
> Failed test: CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
> 

Attached is the test log for this failure. From a quick look, seems as
though the agent starts to launch the task, including forking the
child process, but no subsequent task status updates or error messages
are observed. Gaston, have you seen this before?

I filed https://issues.apache.org/jira/browse/MESOS-7589 to track this.

> Failed test: OneWayPartitionTest.MasterToSlave
> 

Looking into this now.

Neil
[ RUN  ] CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
I0525 16:55:27.473031  2250 cluster.cpp:162] Creating default 'local' authorizer
I0525 16:55:27.475637  2280 master.cpp:436] Master 
b97c30ee-bdab-4879-ba55-5d32f822c038 (305d67e5598a) started on 172.17.0.2:40622
I0525 16:55:27.475666  2280 master.cpp:438] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/UqWhgS/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-1.3.0/_inst/share/mesos/webui" 
--work_dir="/tmp/UqWhgS/master" --zk_session_timeout="10secs"
I0525 16:55:27.475993  2280 master.cpp:488] Master only allowing authenticated 
frameworks to register
I0525 16:55:27.476006  2280 master.cpp:502] Master only allowing authenticated 
agents to register
I0525 16:55:27.476013  2280 master.cpp:515] Master only allowing authenticated 
HTTP frameworks to register
I0525 16:55:27.476022  2280 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/UqWhgS/credentials'
I0525 16:55:27.476297  2280 master.cpp:560] Using default 'crammd5' 
authenticator
I0525 16:55:27.476441  2280 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0525 16:55:27.476702  2280 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0525 16:55:27.476845  2280 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0525 16:55:27.476972  2280 master.cpp:640] Authorization enabled
I0525 16:55:27.477180  2272 whitelist_watcher.cpp:77] No whitelist given
I0525 16:55:27.477191  2270 hierarchical.cpp:158] Initialized hierarchical 
allocator process
I0525 16:55:27.480708  2284 master.cpp:2161] Elected as the leading master!
I0525 16:55:27.480739  2284 master.cpp:1700] Recovering from registrar
I0525 16:55:27.480864  2285 registrar.cpp:345] Recovering registrar
I0525 16:55:27.481547  2285 registrar.cpp:389] Successfully fetched the 
registry (0B) in 645888ns
I0525 16:55:27.481663  2285 registrar.cpp:493] Applied 1 operations in 22833ns; 
attempting to update the registry
I0525 16:55:27.482271  2285 registrar.cpp:550] Successfully updated the 
registry in 556032ns
I0525 16:55:27.482409  2285 registrar.cpp:422] Successfully recovered registrar
I0525 16:55:27.482946  2277 hierarchical.cpp:185] Skipping recovery of 
hierarchical allocator: nothing to recover
I0525 16:55:27.482945  2286 master.cpp:1799] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
I0525 16:55:27.487871  2250 containerizer.cpp:221] Using isolation: 
posix/cpu,posix/mem,filesystem/posix,network/cni
W0525 16:55:27.488440  2250 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges
W0525 16:55:27.488562  225

Re: Isolating metrics collection from master/agent slowness

2017-05-30 Thread Zhitao Li
Hi Benjamin,

Thanks for the response. I agree that we should get started on all 4
options as early as we can.

The reason I'm pushing for option 5) is that we probably won't get
everything fixed within 1-2 minor release cycles, and I badly need a way to
work around the problem. Also, with a sampled metrics routine, we can
measure how long the metrics collection cycle itself takes, which gives us
a good starting point for quantifying the entire problem we are dealing
with.

If you don't mind, I'll merge the conversation from the other thread into
this one, file issues and try to find time to work on them.
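
As a rough illustration of the sampling/caching idea (option 5), here is a minimal C++ sketch. It assumes a hypothetical `SampledMetrics` class that is not part of libprocess or Mesos, and the metric name `metrics/sample_cycle_secs` is made up for this example. Gauges are evaluated once per sampling cycle, readers are served from the cache, and the duration of the cycle is itself recorded as a metric. In a real design each gauge would still have to dispatch onto the actor that owns the state; the sketch only shows the caching and timing parts.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical helper for illustration only; not an existing Mesos API.
class SampledMetrics {
public:
  using Gauge = std::function<double()>;

  // Registers a gauge callback under `name`.
  void add(const std::string& name, Gauge gauge) {
    assert(gauge != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    gauges_[name] = std::move(gauge);
  }

  // Evaluates every gauge once, caches the results, and records how long
  // the whole cycle took. Intended to be called at a fixed rate.
  void sample() {
    std::map<std::string, Gauge> gauges;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      gauges = gauges_;
    }

    // Gauges run outside the lock, so a slow gauge never blocks snapshot().
    const auto start = std::chrono::steady_clock::now();
    std::map<std::string, double> values;
    for (const auto& entry : gauges) {
      values[entry.first] = entry.second();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    // Made-up metric name: duration of this sampling cycle in seconds.
    values["metrics/sample_cycle_secs"] = elapsed.count();

    std::lock_guard<std::mutex> lock(mutex_);
    cache_ = std::move(values);
  }

  // Returns the last cached snapshot without invoking any gauge.
  std::map<std::string, double> snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_;
  }

private:
  mutable std::mutex mutex_;
  std::map<std::string, Gauge> gauges_;
  std::map<std::string, double> cache_;
};
```

With this shape, any number of clients can poll `snapshot()` at whatever rate they like; the cost on the busy process is bounded by the sampling rate, at the price of serving values that may be one cycle stale.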

On Fri, May 26, 2017 at 7:42 PM, Benjamin Mahler  wrote:

> This is great, I would love to see the metrics collection be low latency.
>
> Here is the response I gave last time, with the 4 options I think we should
> tackle:
> https://www.mail-archive.com/dev@mesos.apache.org/msg37113.html
>
> IMHO we should defer looking into the 5th option of sampling/caching
> internally until we've thoroughly attempted to avoid tripping through the
> Process queue, because if we succeed it will become unnecessary.
>
> On Mon, May 22, 2017 at 1:26 PM, Zhitao Li  wrote:
>
> > Thanks for the feedback, James.
> >
> > Replying to your points inline:
> >
> > On Mon, May 22, 2017 at 10:56 AM, James Peach  wrote:
> >
> > >
> > > > On May 19, 2017, at 11:35 AM, Zhitao Li 
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I'd like to start a conversation to talk about metrics collection
> > > > endpoints (especially `/metrics/snapshot`) behavior.
> > > >
> > > > Right now, these endpoints are served from the same master/agent's
> > > > libprocess, and extensively use `gauge` to chain further callbacks to
> > > > collect various metrics (DRF allocator specifically adds several
> > > > metrics per role).
> > > >
> > > > This brings a problem when the system is under load: when the
> > > > master/allocator libprocess becomes busy, stats collection itself
> > > > becomes slow too. Flying dark when the system is under load is
> > > > especially painful for an operator.
> > >
> > > Yes, sampling metrics should approach zero cost.
> > >
> > > > I would like to explore the direction of isolating metric collection
> > > > even when the master is slow. A couple of ideas:
> > > >
> > > > - (short term) reduce usage of gauge and prefer counter (since I
> > > > believe they are less affected);
> > >
> > > I'd rather not squash the semantics for performance reasons. If a
> > > metric has gauge semantics, I don't think we should represent that as
> > > a Counter.
> > >
> >
> > I recall that I had a previous conversation with @bmahler and he thought
> > that certain gauges could be expressed as the differential of two
> > counters.
> >
> > We definitely cannot express a gauge as a Counter, because a gauge value
> > can decrease while a counter should always be treated as monotonically
> > increasing until process restart.
> >
> >
> > >
> > > > - alternative implementation of `gauge` which does not contend on
> > > > master/allocator's event queue;
> > >
> > > This is doable in some circumstances, but not always. For example,
> > > Master::_uptime_secs() doesn't need to run on the master queue, but
> > > Master::_outstanding_offers arguably does. The latter could be
> > > implemented by sampling a variable that is updated, but that's not
> > > very generic, so we should try to think of something better.
> > >
> >
> > I agree that this will not be a trivial cut. The fact that this is not
> > something trivially achievable is the primary reason I want to start
> > this conversation with the general dev community. We can absorb some
> > work to optimize certain hot paths (I suspect the role-specific ones in
> > the allocator are among them for us), but maintaining this in the long
> > term will definitely require all contributors to help.
> >
> > w.r.t. examples, it seems that Master::_outstanding_offers simply calls
> > hashmap::size() on a hashmap object, so if the underlying container type
> > conforms to C++11's thread safety requirements (
> > http://en.cppreference.com/w/cpp/container#Thread_safety), we should at
> > least be able to call the size() function with the understanding that we
> > might get a slightly stale value?
> >
> > I think a more interesting example is Master::_task_starting(): not only
> > is it not calculated from a simple const method, but the result is
> > actually generated by iterating over all tasks registered with the
> > master. This means the cost of calculating it is linear in the number of
> > tasks in the cluster.
> >
> >
> >
> > >
> > > > - serving metrics collection from a different libprocess routine.
> > >
> > > See MetricsProcess. One (mitigation?) approach would be to sample the
> > > metrics at a fixed rate and then serve the cached samples from the
> > > MetricsProcess. I expect most installations have multiple clients
> > sampling
> > > the metrics, so this would at l

Re: [VOTE] Release Apache Mesos 1.3.0 (rc3)

2017-05-30 Thread Vinod Kone
Ran on ASF CI.

Found the following issues.

Failed test: CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled



Failed test: OneWayPartitionTest.MasterToSlave


Can you confirm if these are known or new issues?

Thanks,

On Thu, May 25, 2017 at 2:20 AM, Michael Park  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.3.0.
>
>
> 1.3.0 includes the following:
> 
> 
>   - Multi-role framework support
>   - Executor authentication support
>   - Allow frameworks to modify their roles.
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.3.0-rc3
> 
> 
>
> The candidate for Mesos 1.3.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.3.0-rc3/mesos-1.3.0.tar.gz
>
> The tag to be voted on is 1.3.0-rc3:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.3.0-rc3
>
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.3.0-rc3/mesos-1.3.0.tar.gz.md5
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.3.0-rc3/mesos-1.3.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1198
>
> Please vote on releasing this package as Apache Mesos 1.3.0!
>
> The vote is open until Tue May 30 11:59:59 PDT 2017 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.3.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
>
> MPark & Neil
>


Re: Moving Mesos builds reqs from GCC 4.8.1+ to GCC 4.9.0+

2017-05-30 Thread Neil Conway
On Tue, May 30, 2017 at 12:58 PM, Michael Park  wrote:
> I'm all for moving to GCC 4.9+.
>
> I'd love to get C++14 and bump to GCC 5, but I think we should do an
> investigation for "reasonable availability" before we do that.

I agree, although I'd think a similar investigation is required to
move to GCC 4.9+.

To rephrase my previous point, I'd expect that most platforms fall
into two groups: either the system compiler is ancient, in which case
something like devtoolset is required anyway, or the system compiler
is relatively modern, in which case there is no difference between
depending on GCC >= 4.9 vs. GCC >= 5.

Neil
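
To make the two compiler groups above easier to check on a given build host, here is a small C++ sketch. The function names (`cxxStandard`, `gccVersion`, `meetsMinimumGcc`) are hypothetical and are not part of the Mesos build system; only the predefined macros `__cplusplus`, `__GNUC__`, `__GNUC_MINOR__`, `__GNUC_PATCHLEVEL__` and `__clang__` are real. Note that the value of `__cplusplus` depends on the `-std=` flag passed to the compiler, not only on the compiler version.

```cpp
#include <cassert>
#include <string>

// Reports which C++ standard the current translation unit is compiled as,
// based on the standard __cplusplus macro.
inline std::string cxxStandard() {
#if __cplusplus >= 201402L
  return "C++14 or later";
#elif __cplusplus >= 201103L
  return "C++11";
#else
  return "pre-C++11";
#endif
}

// Returns the GCC version encoded as major * 10000 + minor * 100 + patch,
// or 0 when the compiler is not GCC (clang also defines __GNUC__, so it is
// excluded explicitly).
inline int gccVersion() {
#if defined(__GNUC__) && !defined(__clang__)
  return __GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__;
#else
  return 0;
#endif
}

// True if the compiler is GCC and at least version major.minor.
inline bool meetsMinimumGcc(int major, int minor) {
  assert(major >= 0 && minor >= 0);
  return gccVersion() >= major * 10000 + minor * 100;
}
```

For example, `meetsMinimumGcc(4, 9)` and `meetsMinimumGcc(5, 0)` would give the same answer on any host in either of the two groups described above, and differ only on a host whose compiler is exactly in the 4.9.x series.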


Re: Moving Mesos builds reqs from GCC 4.8.1+ to GCC 4.9.0+

2017-05-30 Thread Svante Karlsson
C++14 would be nice, btw in addition to the list above - Ubuntu 16.04 has
gcc 5.4.0


Re: Moving Mesos builds reqs from GCC 4.8.1+ to GCC 4.9.0+

2017-05-30 Thread Michael Park
I'm all for moving to GCC 4.9+.

I'd love to get C++14 and bump to GCC 5, but I think we should do an
investigation
for "reasonable availability" before we do that.

Also, clang has supported C++14 /  since 3.5, which is our current
requirement.

On Tue, May 30, 2017 at 12:50 PM, Benjamin Mahler 
wrote:

> There's a spreadsheet linked in
> https://issues.apache.org/jira/browse/MESOS-2604 that captures which OSes
> we can support based on compiler version:
>
> https://docs.google.com/spreadsheets/d/1Ji8p3p_1JqUsMxE31mJqqztHf7LDx7mGMXh253azWpU/edit#gid=0
>
> Also, do we already have a minimum clang version for C++14 / ?
>
> On Tue, May 30, 2017 at 11:27 AM, Neil Conway 
> wrote:
>
> > It seems that if we moved to GCC 5, we'd also be able to move to C++14
> > (https://gcc.gnu.org/projects/cxx-status.html#cxx14).
> >
> > CentOS 6 users will need to install devtoolset anyway (which makes it
> > easy to get GCC 5 or 6), so I wonder if skipping directly to requiring
> > GCC 5 would be feasible?
> >
> > Neil
> >
> >
> > On Tue, May 30, 2017 at 11:17 AM, Jacob Janco 
> > wrote:
> > > Along with various additions and optimizations, support for 
> > would be nice to have. Thoughts on this?
> >
>


Re: Moving Mesos builds reqs from GCC 4.8.1+ to GCC 4.9.0+

2017-05-30 Thread Benjamin Mahler
There's a spreadsheet linked in
https://issues.apache.org/jira/browse/MESOS-2604 that captures which OSes
we can support based on compiler version:

https://docs.google.com/spreadsheets/d/1Ji8p3p_1JqUsMxE31mJqqztHf7LDx7mGMXh253azWpU/edit#gid=0

Also, do we already have a minimum clang version for C++14 / ?

On Tue, May 30, 2017 at 11:27 AM, Neil Conway  wrote:

> It seems that if we moved to GCC 5, we'd also be able to move to C++14
> (https://gcc.gnu.org/projects/cxx-status.html#cxx14).
>
> CentOS 6 users will need to install devtoolset anyway (which makes it
> easy to get GCC 5 or 6), so I wonder if skipping directly to requiring
> GCC 5 would be feasible?
>
> Neil
>
>
> On Tue, May 30, 2017 at 11:17 AM, Jacob Janco 
> wrote:
> > Along with various additions and optimizations, support for 
> would be nice to have. Thoughts on this?
>


Re: Moving Mesos builds reqs from GCC 4.8.1+ to GCC 4.9.0+

2017-05-30 Thread Neil Conway
It seems that if we moved to GCC 5, we'd also be able to move to C++14
(https://gcc.gnu.org/projects/cxx-status.html#cxx14).

CentOS 6 users will need to install devtoolset anyway (which makes it
easy to get GCC 5 or 6), so I wonder if skipping directly to requiring
GCC 5 would be feasible?

Neil


On Tue, May 30, 2017 at 11:17 AM, Jacob Janco  wrote:
> Along with various additions and optimizations, support for  would be 
> nice to have. Thoughts on this?


Moving Mesos builds reqs from GCC 4.8.1+ to GCC 4.9.0+

2017-05-30 Thread Jacob Janco
Along with various additions and optimizations, support for  would be 
nice to have. Thoughts on this?

Mesos on Windows needs your help

2017-05-30 Thread Li Li
With the joint effort from Mesosphere and Microsoft, the Windows build
performance *should* now be about equal to POSIX/Linux, ~76% of tests are
enabled on the ported Windows components, and Mesos container/Docker container
tasks launch successfully end to end.

We will start helping Mesos Windows customers deploy Windows agent nodes in
their test environments, and then productize these features as our next goals.
To be able to do that, we need a stable development environment.

Recently, there have been multiple regressions on Windows, ranging from build
issues to functionality issues. We have been chasing down these regressions,
fixing them and trying to push Mesos on Windows features forward. However,
this is not sustainable given how frequently the regressions occur.

To address this, we've enabled two engineering system features for Mesos on
Windows to catch regressions before and after each check-in:

  1.  The Windows reviewbot has been enabled to run all of the tests on Windows
for each PR. For details, please refer to https://reviews.apache.org/r/59116/.

  2.  The Windows build has been added to the CI system. The CI bot posts the
build status to the #windows channel after a PR is committed.

Build regressions are generally caught manually (i.e. git pull && cmake
--build .) or when the CI bot posts a failure in the #windows channel. For now,
these build regressions don't get sent to the
bui...@mesos.apache.org mailing list due to the
flakiness we're seeing in the builds@ mailing list.

For developers, if you do not have access to a Windows box, you have two
options:

1. Use the Windows Reviewbot. It runs in a loop (slightly differently from the
Ubuntu Reviewbot), but both reviewbots function the same way. Just push an
update to the last review in a chain, and the reviewbot will get around to it
eventually.
2. Spin up a Windows box in Azure, AWS or some other cloud with Windows Server
2016 + Docker + all the dependencies from
https://github.com/apache/mesos/blob/master/docs/windows.md.

We highly recommend that everyone check the Windows Reviewbot results before
checking in and monitor the Windows build status afterwards.

The engineering system effort above is just a starting point for preventing
regressions. We also need help from the Mesos dev community: when you check in
a fix, think about whether it could cause regressions on the Windows side and
verify your fix on Windows as well; when you design a feature, feel free to
involve us in your discussions to see how it should be designed for Windows.

Only with your help can we deliver Mesos successfully to both our Linux and
Windows customers.