Note about ".proto" files from Mesos 1.3.0+

2017-02-16 Thread Joseph Wu
Hi devs/contributors,

The next time you check out HEAD and open a .proto file, you may notice this
line at the top of the file (after the Apache license, of course):

syntax = "proto2";

This has been added to all our protobufs in order to allow different
versions of the protobuf compiler to process our protobufs.  This change does
NOT change anything about the generated code, the wire format, or
anything else.  The new line purely addresses a warning printed by protoc.

If you need to add any new protobufs, make sure you add the "syntax = ..."
line in the future, as in the sketch below.
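For illustration only, here is a minimal sketch of the top of a new .proto file; the package and message names are hypothetical and the Apache license header is abbreviated:

```
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  [... license header ...]

syntax = "proto2";

package mesos.example;  // Hypothetical package, for illustration only.

message Example {
  optional string name = 1;
}
```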

See this and related issues for some more details:
https://issues.apache.org/jira/browse/MESOS-6138

~Joseph


Task stuck in STAGING for a long time

2017-02-16 Thread wayne
centos7.3.1611
kernel-3.10.0-514.6.1.el7.x86_64
mesos 1.2.0
Docker version 1.13.1, build 092cba3

task id:
alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app

master log:
mesos-master.adca-mesos-1.vm.invalid-user.log.INFO.20170215-162534.3430663:I0216 19:18:15.687566 3430683 master.cpp:3352] Authorizing framework principal 'test' to launch task alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app
mesos-master.adca-mesos-1.vm.invalid-user.log.INFO.20170215-162534.3430663:I0216 19:18:15.691611 3430686 master.cpp:9063] Adding task alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app with resources cpus(*)(allocated: test):1; mem(*)(allocated: test):1024 on agent 5404b500-c733-4ad8-aaf9-c20247bbe4a2-S4 at slave(1)@10.101.90.84:5051 (adca-mesos-6.vm.elenet.me)
mesos-master.adca-mesos-1.vm.invalid-user.log.INFO.20170215-162534.3430663:I0216 19:18:15.691784 3430686 master.cpp:4426] Launching task alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app of framework 23a1421e-4981-40ae-9e51-2110b8a800ad- (appos-scheduler) with resources cpus(*)(allocated: test):1; mem(*)(allocated: test):1024 on agent 5404b500-c733-4ad8-aaf9-c20247bbe4a2-S4 at slave(1)@10.101.90.84:5051 (adca-mesos-6.vm.elenet.me)
mesos-master.adca-mesos-1.vm.invalid-user.log.INFO.20170215-162534.3430663:I0217 02:40:07.723784 3430697 master.cpp:6164] Status update TASK_FAILED (UUID: 75f91aff-a307-4e46-a396-23dfc3877722) for task alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app of framework 23a1421e-4981-40ae-9e51-2110b8a800ad- from agent 5404b500-c733-4ad8-aaf9-c20247bbe4a2-S4 at slave(1)@10.101.90.84:5051 (adca-mesos-6.vm.elenet.me)
mesos-master.adca-mesos-1.vm.invalid-user.log.INFO.20170215-162534.3430663:I0217 02:40:07.723908 3430697 master.cpp:6232] Forwarding status update TASK_FAILED (UUID: 75f91aff-a307-4e46-a396-23dfc3877722) for task alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app of framework 23a1421e-4981-40ae-9e51-2110b8a800ad-
mesos-master.adca-mesos-1.vm.invalid-user.log.INFO.20170215-162534.3430663:I0217 02:40:07.724366 3430697 master.cpp:8312] Updating the state of task alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app of framework 23a1421e-4981-40ae-9e51-2110b8a800ad- (latest state: TASK_FAILED, status update state: TASK_FAILED)
mesos-master.adca-mesos-1.vm.invalid-user.log.INFO.20170215-162534.3430663:I0217 02:40:07.727289 3430695 master.cpp:5102] Processing ACKNOWLEDGE call 75f91aff-a307-4e46-a396-23dfc3877722 for task alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app of framework 23a1421e-4981-40ae-9e51-2110b8a800ad- (appos-scheduler) on agent 5404b500-c733-4ad8-aaf9-c20247bbe4a2-S4
mesos-master.adca-mesos-1.vm.invalid-user.log.INFO.20170215-162534.3430663:I0217 02:40:07.727365 3430695 master.cpp:8406] Removing task alpha-common_lpt_qc_16_1_pc.1.20170216180430.251_1806_0_1_250_app with resources cpus(*)(allocated: test):1; mem(*)(allocated: test):1024 of framework 23a1421e-4981-40ae-9e51-2110b8a800ad- on agent 5404b500-c733-4ad8-aaf9-c20247bbe4a2-S4 at slave(1)@10.101.90.84:5051 (adca-mesos-6.vm.elenet.me)

Re: [VOTE] Release Apache Mesos 1.1.1 (rc1)

2017-02-16 Thread Till Toenshoff
-1 for https://issues.apache.org/jira/browse/MESOS-7133 
 


> On Feb 8, 2017, at 10:39 PM, Vinod Kone  wrote:
> 
> +1 (binding)
> 
> Tested on ASF CI.
> 
> Revision: 5d4c9962930c3f5c08e802caff40b670424cb091 (refs/tags/1.1.1-rc1)
>
> Configuration Matrix (built with gcc and clang):
>   centos:7      --verbose --enable-libevent --enable-ssl   autotools / cmake
>   centos:7      --verbose                                  autotools / cmake
>   ubuntu:14.04  --verbose --enable-libevent --enable-ssl   autotools / cmake
>   ubuntu:14.04  --verbose                                  autotools / cmake
>
> [build-result badges from the original HTML mail are not preserved in plain text]
> 
> On Wed, Feb 8, 2017 at 9:09 AM, Kapil Arya wrote:
> +1 binding.
> 
> Internal CI to build deb/rpm packages.
> 
> The deb/rpm binary packages are available at:
> http://open.mesosphere.com/downloads/mesos-rc/#apache-mesos-1.1.1-rc1 
> 
> 
> 
> On Tue, Feb 7, 2017 at 5:39 PM, Alex R wrote:
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.1.1.
> 
> 1.1.1 includes the following:
> 
> ** Bug
>   * [MESOS-6002] - The whiteout file cannot be removed correctly using aufs 
> backend.
>   * [MESOS-6010] - Docker registry puller shows decode error "No response 
> decoded".
>   * [MESOS-6142] - Frameworks may RESERVE for an arbitrary role.
>   * [MESOS-6360] - The handling of whiteout files in provisioner is not 
> correct.
>   * [MESOS-6411] - Add documentation for CNI port-mapper plugin.
>   * [MESOS-6526] - `mesos-containerizer launch --environment` exposes 
> executor env vars in `ps`.
>   * [MESOS-6

[GitHub] mesos issue #165: mesoscon eu - hackatron exercise - CI using travis

2017-02-16 Thread haosdent
Github user haosdent commented on the issue:

https://github.com/apache/mesos/pull/165
  
    Hi @dcaba, Mesos now uses Apache CI to run its tests instead of Travis CI;
please refer to https://issues.apache.org/jira/browse/MESOS-5655 for the details.
Thanks to @jfarrell, who has just disabled Travis CI for Mesos. cc
@vinodkone




Re: How to consistently handle default values for message types

2017-02-16 Thread Yan Xu
On Mon, Feb 13, 2017 at 5:27 PM, Benjamin Mahler  wrote:

> The way I think about this is that if the field is semantically required
>

"semantically required": good point and this should definitely be one of
the criteria.

I guess we still need to clarify this for each of these messages: there's
another class of "semantically required" fields that you DO need to set, and
we need to disambiguate the two.

I'll start by improving the comments on `Filters`.
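To make the receiver-side distinction concrete, here is a minimal C++ sketch (the include path is an assumption about where the generated headers live; `Filters` is the message quoted below):

```
#include <iostream>

#include <mesos/mesos.pb.h>  // Assumed location of the generated Filters message.

int main()
{
  mesos::Filters filters;

  // Accessing an unset optional scalar returns its declared default (5.0),
  // so an absent Filters effectively behaves like a "default" filter...
  std::cout << filters.refuse_seconds() << std::endl;  // Prints 5.

  // ...unless the receiver explicitly checks presence and gives "not set"
  // its own meaning.
  if (!filters.has_refuse_seconds()) {
    std::cout << "refuse_seconds was not explicitly set" << std::endl;
  }

  return 0;
}
```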



> then there can be a default. If not, then the optionality has meaning. For
> example, if there is always a notion of filtering, then having a default
> filter makes sense. But if the absence of a filter means no filtering
> occurs, then absence of the optional field has a meaning and we don't
> interpret the overall message to have a default value.
>
> Also, if we want to move to proto3 syntax at some point, we'll have to push
> our defaults into our API handling code rather than in the proto file
> AFAICT.
>
> On Thu, Feb 2, 2017 at 12:06 PM, Yan Xu  wrote:
>
> > With protobuf you can specify custom default values for scalar types
> > (proto2 at least) but not message types, e.g.,
> >
> > ```
> > message Filters {
> >   // Time to consider unused resources refused. Note that all unused
> >   // resources will be considered refused and use the default value
> >   // (below) regardless of whether Filters was passed to
> >   // SchedulerDriver::launchTasks. You MUST pass Filters with this
> >   // field set to change this behavior (i.e., get another offer which
> >   // includes unused resources sooner or later than the default).
> >   optional double refuse_seconds = 1 [default = 5.0];
> > }
> > ```
> >
> > However, the message `Filters` essentially has a default value because *all*
> > its fields have default values. It all depends on whether the receiver
> > chooses to check whether it is set, or directly accesses it and gets the
> > default values.
> >
> > When we reference the type in other messages, e.g.,
> >
> > ```
> >   message Accept {
> > repeated OfferID offer_ids = 1;
> > repeated Offer.Operation operations = 2;
> > optional Filters filters = 3;
> >   }
> > ```
> >
> > We are not explicitly telling users what's going to happen when `filters`
> > is not set. The master just directly uses it without checking.
> >
> > It does feel intuitive to me that "*if all the fields in a message have
> > default values, and it semantically feels like a config, then we can just
> > interpret them when unset as indication to use defaults*".
> >
> > However we probably should document it better.
> >
> > To generalize it further, for something like this with multiple fields
> >
> > ```
> > message ExponentialBackoff {
> >   optional double initial_interval_seconds = 1 [default = 0.5];
> >   optional double max_interval_seconds = 2 [default = 300.0];
> >   optional double randomization_factor = 3 [default = 0.5];
> >   optional double max_elapsed_seconds = 4 [default = 2592000.0];
> > }
> > ```
> >
> > we should be able to not require them to be set and assume the defaults?
> >
> > One step further, if the message has recursively nested messages with
> > default values, we can treat the parent message as having a default value
> > too?
> >
> > Thoughts?
> >
> > Yan
> >
>


[GitHub] mesos issue #165: mesoscon eu - hackatron exercise - CI using travis

2017-02-16 Thread dcaba
Github user dcaba commented on the issue:

https://github.com/apache/mesos/pull/165
  
Hi,

  I am a bit surprised Travis integration is still enabled (so all PRs have red
crosses)... let me know if you want me to resolve the conflicts to integrate
this (in addition to a basic Travis job, there are some other small amendments I
think are interesting). If not, I will just close this.




Re: Exponential Backoff

2017-02-16 Thread Benjamin Bannier
Hi Anindya,

thanks for that nice, systematic write-up. It makes it pretty clear that there
are some inconsistencies in how back-off is handled, and how a more systematic
approach could help.

I’d like to make a small remark here where I can use some more space than in 
the doc.

>> On Feb 12, 2017, at 9:03 PM, Anindya Sinha  wrote:
>> 
>> Reference: https://issues.apache.org/jira/browse/MESOS-7087 
>> 
>> 
>> Currently, we have at least 3 types of backoff such as:
>> 1) Exponential backoff with randomness, as in framework/agent registration.
>> 2) Exponential backoff with no randomness, as in status updates.
>> 3) Linear backoff with randomness, as in executor registration.

We had a small water cooler discussion about this, and were wondering if it 
would be worthwhile to also take the possibility of globally rate-limiting 
certain request kinds into account, e.g., of framework/agent registration 
requests regardless of the source. This might lead to improvements for any kind 
of activity caused by state changes affecting a large number of agents or 
frameworks. I give a more technical example below.

Also, I believe that when evaluating improvements to back-off, it would be a good
idea to examine, as a benchmark, the expected time difference between arrivals of
messages from different actors as a function of the back-off procedure (either by
checking the theoretical literature or by performing small Monte Carlo
simulations).


Cheers,

Benjamin


* * * 

# Technical example related to (1) above

Let’s say the following happens:

- A master failover occurs.
- All agents realize this pretty much simultaneously.
- All agents pretty much simultaneously start a registration procedure with the 
new master.

Now if there were no extra randomness introduced into the back-off (but there
is), the master would see registration attempts from all agents pretty much at
the same time. In large clusters this could flood the master beyond its ability
to handle these requests in a timely manner. That we deterministically space out
registration attempts by larger and larger times wouldn't help the master much
when it has to deal with a massive simultaneous registration load. Effectively,
the agents might still inadvertently be performing something like a coordinated
DDoS attack on the master by all retrying after the same time. Technically, the
underlying issue is that the expected time difference between arrival times of
registration attempts from different agents at the master would still be a Dirac
delta function (think: a pulse function with zero width sitting at zero).

Currently, the only tool protecting the master from having to handle a large
number of registration attempts is the extra randomness we insert on the sender
side. We pull this randomness from a uniform distribution. A uniform
distribution is a great choice here, since for a uniform distribution the tails
of the distribution are as fat as they can get. Fat tails lead to a wider
arrival-time-difference distribution at the master (it is a symmetric
triangular distribution now instead of a delta function, still centered around
zero though). A wider arrival-time distribution means that the probability of
registration attempts from different agents arriving close in time is lowered;
this is great, as it potentially gives the master more time to handle all the
requests.

The remaining issue is that even though we have spaced out requests in time by 
introducing randomness at the source, the most likely time difference between 
arrivals of two messages would still be zero (that’s just a consequence of 
statistics, the distribution for the difference of two independent random 
numbers from the same distribution is symmetric and centered around zero). We 
just have shifted some probability from smaller to larger time differences, but 
for sufficiently large clusters a master might still need to handle many more 
messages than it realistically can. Note that we use randomness at the source 
to space out requests from each other (independent random numbers), and that 
there might be no entity which could coordinate agents to collaboratively space 
out their requests more favorably for the master, e.g., in master failover 
there would be no master to coordinate the agents’ behavior.

I believe one possible solution for this would be back pressure: the master
rate-limiting messages *before it becomes overloaded* (e.g., decided by
examining something like the process' message queue size or the average time a
message stays in the queue, and dropping requests before performing any real
work on them). This would force clients into another backoff iteration, which
would additionally space out requests.
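To make the statistics above tangible, here is a small, self-contained Monte Carlo sketch (not Mesos code; the 10-second jitter window is an arbitrary assumption) showing that the difference between two independent retry times drawn from the same uniform window follows a symmetric triangular distribution centered at zero:

```
#include <cstdio>
#include <random>
#include <vector>

int main() {
  std::mt19937 rng(42);
  const double window = 10.0;  // Assumed jitter window in seconds, e.g. [0, b*2^(n-1)].
  std::uniform_real_distribution<double> jitter(0.0, window);

  const int trials = 1000000;
  std::vector<int> histogram(20, 0);  // Buckets covering [-window, window].

  for (int i = 0; i < trials; ++i) {
    // Retry times of two independent agents for the same retry attempt.
    double diff = jitter(rng) - jitter(rng);
    int bucket = static_cast<int>((diff + window) / (2 * window) * histogram.size());
    if (bucket >= 0 && bucket < static_cast<int>(histogram.size())) {
      ++histogram[bucket];
    }
  }

  // Print the histogram of arrival-time differences; it peaks at zero and
  // falls off linearly toward +/- window (a triangular distribution).
  for (size_t b = 0; b < histogram.size(); ++b) {
    double lo = -window + 2 * window * b / histogram.size();
    printf("%6.1fs .. %6.1fs: %d\n", lo, lo + 2 * window / histogram.size(), histogram[b]);
  }
  return 0;
}
```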

Re: Proposal for Mesos Build Improvements

2017-02-16 Thread Alexander Rojas
Actually, this is a policy I have never been a big fan of. In my experience,
just forward declaring as much as possible in the headers and only including in
compilation units tends to yield decent improvements in compilation time,
particularly for files like `mesos.cpp` or `slave.cpp` which indirectly end up
including almost every header in the project.
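As an illustration of that approach, a generic sketch (the `Foo` class and the header paths are hypothetical, not actual Mesos code):

```
// foo.hpp: forward-declare instead of including the full definition.
namespace mesos {

class Resources;  // Forward declaration; no #include needed in this header.

class Foo
{
public:
  // Only a reference to Resources appears here, so the forward declaration
  // suffices and foo.hpp stays cheap for other files to include.
  void update(const Resources& resources);
};

} // namespace mesos


// foo.cpp: the complete type is only needed in the compilation unit.
#include "foo.hpp"

#include <mesos/resources.hpp>  // Assumed header providing mesos::Resources.

namespace mesos {

void Foo::update(const Resources& resources)
{
  // ... implementation that uses the complete Resources type ...
}

} // namespace mesos
```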

Alexander Rojas
alexan...@mesosphere.io




> On 15 Feb 2017, at 20:12, Neil Conway  wrote:
> 
> On Tue, Feb 14, 2017 at 11:28 AM, Jeff Coffler
>  wrote:
>> For efficiency purposes, if a header file is included by 50% or more of the 
>> source files, it should be included in the precompiled header. If a header 
>> is included in fewer than 50% of the source files, then it can be separately 
>> included (and thus would not benefit from precompiled headers). Note that 
>> this is a guideline; even if a header is used by less than 50% of source 
>> files, if it's very large, we still may decide to throw it in the 
>> precompiled header.
> 
> It seems like this would have the effect of creating many false
> dependencies: if file X doesn't currently include header Y but Y is
> included in the precompiled header, the symbols in Y will now be
> visible when X is compiled. It would also mean that X would need to be
> recompiled when Y changes.
> 
> Related: the current policy is that headers and implementation files
> should try to include all of their dependencies, without relying on
> transitive includes. For example, if foo.cpp includes bar.hpp, which
> includes , but foo.cpp also uses , both foo.cpp and
> bar.hpp should "#include ". Adopting precompiled headers would
> mean making an exception to this policy, right?
> 
> I wonder if we should instead use headers like:
> 
> <- mesos_common.h ->
> #include <a>
> #include <b>
> #include <c>
> 
> <- xyz.cpp, which needs headers "b" and "d" ->
> #include "mesos_common.h"
> 
> #include <b>
> #include <d>
> 
> That way, the fact that "xyz.cpp" logically depends on <b> (but not
> <a> or <c>) is not obscured (in other words, Mesos should continue to
> compile if 'mesos_common.h' is replaced with an empty file). Does
> anyone know whether the header guard in <b> _should_ make the repeated
> inclusion of <b> relatively cheap?
> 
> Neil



Re: Exponential Backoff

2017-02-16 Thread Anindya Sinha
Would appreciate feedback/comments on this proposal.

Thanks
Anindya

> On Feb 12, 2017, at 9:03 PM, Anindya Sinha  wrote:
> 
> Reference: https://issues.apache.org/jira/browse/MESOS-7087 
> 
> 
> Currently, we have at least 3 types of backoff such as:
> 1) Exponential backoff with randomness, as in framework/agent registration.
> 2) Exponential backoff with no randomness, as in status updates.
> 3) Linear backoff with randomness, as in executor registration.
> 
> In framework registration, as an example, each retry interval ranges between [0 ..
> b*2^(n-1)] for the nth retry attempt, as long as each interval is less than 1 min.
> 
> For clusters with a large number of frameworks and/or agents, the randomness
> may not be enough, since the timeout can end up being very small for a
> substantial number of clients (agents and/or frameworks) due to the fact that
> the allowed range is [0 .. b*2^(n-1)] for all retry attempts, i.e., it always
> starts at 0.
> 
> The following doc looks at an enhancement to the existing scheme to ensure
> that the timeout values are not extremely small, and that every subsequent
> retry has a timeout value at least as large as that of the previous iteration.
> 
> https://docs.google.com/document/d/1nUxvh6BbB8jv5G-MvckGj9XzFYLBrUM0O5Go_Zmdftk/edit?usp=sharing
>  
> 
> 
> Feedback welcome.
> 
> Thanks
> Anindya
>
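
For reference, a minimal sketch of the jittered exponential scheme described in the quoted mail (the constants and function shape are illustrative assumptions, not the actual Mesos implementation):

```
#include <algorithm>
#include <chrono>
#include <cmath>
#include <random>

// The nth retry waits a uniformly random duration in [0, min(b * 2^(n-1), cap)],
// so the lower bound of the window is always zero -- which is exactly the
// property the proposal above wants to tighten.
std::chrono::duration<double> backoff(int attempt, std::mt19937& rng)
{
  const double b = 2.0;     // Assumed initial backoff factor, in seconds.
  const double cap = 60.0;  // Assumed cap of one minute, in seconds.

  const double window = std::min(b * std::pow(2.0, attempt - 1), cap);
  std::uniform_real_distribution<double> jitter(0.0, window);
  return std::chrono::duration<double>(jitter(rng));
}
```

For example, with these assumed constants, backoff(3, rng) draws from [0, 8] seconds, so some clients can still draw values close to zero on every attempt.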