[jira] [Commented] (MESOS-2275) Document header include rules in style guide

2015-10-21 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968658#comment-14968658
 ] 

Benjamin Bannier commented on MESOS-2275:
-

I think we probably would also want an example that makes it clearer whether in 
each component we use a pure lexicographic sort, or instead enforce some 
residual level of logical ordering, e.g. {{clang-format}} (from trunk) prefers 
lexicographical sort

{code}
#include 
#include 
{code}

while one could also imagine the opposite ordering which emphasizes {{foo.hpp}} 
as some sort of "heading header" (currently not supported by {{clang-format}}).
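
For illustration (the header names here are hypothetical), the two policies 
would differ like this:

{code}
// Pure lexicographic order ('-' sorts before '.'):
#include <foo-util.hpp>
#include <foo.hpp>

// "Heading header" order, putting the component's own header first:
#include <foo.hpp>
#include <foo-util.hpp>
{code}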

The Google style guide asks for "alphabetical ordering" which isn't helpful 
here.

> Document header include rules in style guide
> 
>
> Key: MESOS-2275
> URL: https://issues.apache.org/jira/browse/MESOS-2275
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>Assignee: Jan Schlicht
>Priority: Trivial
>  Labels: beginner, docathon, mesosphere
>
> We have several ways of sorting, grouping and ordering header includes in 
> Mesos. We should agree on a rule set and do a style scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3786) Backticks are not mentioned in Mesos C++ Style Guide

2015-10-21 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968642#comment-14968642
 ] 

Benjamin Bannier commented on MESOS-3786:
-

[~bmahler] Recent Doxygen (>=1.8.0) supports markdown, and backticks cause the 
enclosed text to be formatted as code. I suspect this is what was intended in 
the instances you refer to.
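
For illustration, a hypothetical declaration whose Doxygen comment uses 
backticks; with markdown enabled, the backticked names are rendered as inline 
code in the generated documentation:

{code}
/**
 * Invalidates the cached `ExecutorInfo` for the given `frameworkId`.
 * (Requires Doxygen >= 1.8.0 for the backticks to render as code.)
 */
void invalidate(const FrameworkID& frameworkId);
{code}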

> Backticks are not mentioned in Mesos C++ Style Guide
> 
>
> Key: MESOS-3786
> URL: https://issues.apache.org/jira/browse/MESOS-3786
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Minor
>  Labels: documentation, mesosphere
>
> As far as I can tell, current practice is to quote code excerpts and object 
> names with backticks when writing comments. For example:
> {code}
> // You know, `sadPanda` seems extra sad lately.
> std::string sadPanda;
> sadPanda = "   :'(   ";
> {code}
> However, I don't see this documented in our C++ style guide at all. It should 
> be added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968622#comment-14968622
 ] 

Steven Schlansker commented on MESOS-2186:
--

That's a bummer.  Thank you everyone for looking into this and for your time.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS, Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google Compute (where 
> the DNS is based on the machines currently running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master allowing unauthenticated slave

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968545#comment-14968545
 ] 

Neil Conway commented on MESOS-2186:


[~rgs]: I agree, not sure there's a better fix. You could imagine a client API 
which hands more control to the user (e.g., zk_create() doesn't take any 
hostnames, then zk_add_server() takes a single server and returns 
success/failure), but that probably ends up being similar to just having user 
code do hostname resolution, and then pass in IPs.
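
A hypothetical shape for such an API ({{zhandle_t}} and {{watcher_fn}} are the 
real libzookeeper types, but these two functions are illustrative and not part 
of the actual interface):

{code}
// Create a handle with no ensemble configured yet.
zhandle_t* zk_create(watcher_fn watcher, int recv_timeout, void* context);

// Add a single server; returns 0 on success, or an error if 'host' cannot be
// resolved. The caller decides whether a partial ensemble is acceptable.
int zk_add_server(zhandle_t* zh, const char* host);
{code}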

[~stevenschlansker]: I opened [MESOS-3790] to have Mesos retry Zk connection 
errors that return ENOENT. Otherwise, as far as I know there's nothing else we 
can do here on the Mesos side. If you disagree, please reopen.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS, Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google Compute (where 
> the DNS is based on the machines currently running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendT

[jira] [Created] (MESOS-3790) Zk connection should retry on EAI_NONAME

2015-10-21 Thread Neil Conway (JIRA)
Neil Conway created MESOS-3790:
--

 Summary: Zk connection should retry on EAI_NONAME
 Key: MESOS-3790
 URL: https://issues.apache.org/jira/browse/MESOS-3790
 Project: Mesos
  Issue Type: Bug
Reporter: Neil Conway
Assignee: Neil Conway
Priority: Minor


The zookeeper interface is designed to retry (once per second for up to ten 
minutes) if one or more of the Zookeeper hostnames can't be resolved (see 
[MESOS-1326] and [MESOS-1523]).

However, the current implementation assumes that a DNS resolution failure is 
indicated by zookeeper_init() returning NULL and errno being set to EINVAL (Zk 
translates getaddrinfo() failures into errno values). Yet the current Zk code 
does:

{code}
static int getaddrinfo_errno(int rc) {
    switch(rc) {
    case EAI_NONAME:
// ZOOKEEPER-1323 EAI_NODATA and EAI_ADDRFAMILY are deprecated in FreeBSD.
#if defined EAI_NODATA && EAI_NODATA != EAI_NONAME
    case EAI_NODATA:
#endif
        return ENOENT;
    case EAI_MEMORY:
        return ENOMEM;
    default:
        return EINVAL;
    }
}
{code}

getaddrinfo() returns EAI_NONAME when "the node or service is not known"; per 
discussion in [MESOS-2186], this seems to happen intermittently due to DNS 
failures.

Proposed fix: looking at errno is always going to be somewhat fragile, but if 
we're going to continue doing that, we should check for ENOENT as well as 
EINVAL.
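
A minimal sketch of that check, assuming the caller's {{servers}}, {{watcher}} 
and {{timeout}} variables; the surrounding retry loop in Mesos' 
{{zookeeper.cpp}} is elided:

{code}
#include <errno.h>
#include <zookeeper/zookeeper.h>

zhandle_t* zh = zookeeper_init(servers, watcher, timeout, NULL, NULL, 0);
if (zh == NULL && (errno == EINVAL || errno == ENOENT)) {
  // getaddrinfo_errno() maps EAI_NONAME to ENOENT and most other
  // getaddrinfo() failures to EINVAL, so treat both as retryable DNS
  // resolution failures instead of aborting.
}
{code}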



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.

2015-10-21 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968410#comment-14968410
 ] 

haosdent commented on MESOS-3787:
-

For CommandInfo, I think Mesos already sets them when starting a docker 
container in mesos-docker-executor: 
https://github.com/apache/mesos/blob/master/src/docker/docker.cpp#L421-L422. 
Does this not work for you?

> As a developer, I'd like to be able to expand environment variables through 
> the Docker executor.
> 
>
> Key: MESOS-3787
> URL: https://issues.apache.org/jira/browse/MESOS-3787
> Project: Mesos
>  Issue Type: Wish
>Reporter: John Garcia
>  Labels: mesosphere
>
> We'd like to have expanded variables usable in [the json files used to create 
> a Marathon app, hence] the Task's CommandInfo, so that the executor is able 
> to detect the correct values at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3233) Allow developers to decide whether an HTTP endpoint should use authentication

2015-10-21 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968314#comment-14968314
 ] 

Qian Zhang commented on MESOS-3233:
---

Why is this decided by the developer? Shouldn't it be up to the operator to 
control which HTTP endpoints need to use authentication?

> Allow developers to decide whether an HTTP endpoint should use authentication
> 
>
> Key: MESOS-3233
> URL: https://issues.apache.org/jira/browse/MESOS-3233
> Project: Mesos
>  Issue Type: Improvement
>  Components: security
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>  Labels: mesosphere, security
>
> Once HTTP Authentication is enabled, developers should be allowed to decide 
> which endpoints should require authentication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-21 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968275#comment-14968275
 ] 

Joseph Wu commented on MESOS-3771:
--

Looks like our JSON library will never catch this (it's more permissive), which 
is why none of our unit tests have caught this.

I agree that this is a problem though.  I'll see if I can get more eyes on this.

> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1, 0.26.0
>Reporter: Steven Schlansker
>Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF-8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regard to proper character encoding:
> {code}
> 0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
> 0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
> 0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
> 0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u0000\u0005ur\|
> 0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u0000\u000f[Lsca|
> 0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
> {code}
> I suspect this is because the HTTP API emits executorInfo.data directly:
> {code}
> JSON::Object model(const ExecutorInfo& executorInfo)
> {
>   JSON::Object object;
>   object.values["executor_id"] = executorInfo.executor_id().value();
>   object.values["name"] = executorInfo.name();
>   object.values["data"] = executorInfo.data();
>   object.values["framework_id"] = executorInfo.framework_id().value();
>   object.values["command"] = model(executorInfo.command());
>   object.values["resources"] = model(executorInfo.resources());
>   return object;
> }
> {code}
> I think this may be because the custom JSON processing library in stout has 
> no notion of a byte array.  I'm guessing that some implicit conversion makes 
> it get written as a String instead, but:
> {code}
> inline std::ostream& operator<<(std::ostream& out, const String& string)
> {
>   // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
>   // See RFC4627 for the JSON string specificiation.
>   return out << picojson::value(string.value).serialize();
> }
> {code}
> Thank you for any assistance here.  Our cluster is currently entirely down -- 
> the frameworks cannot handle parsing the invalid JSON produced (it is not 
> even valid UTF-8).
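
A minimal sketch of one possible mitigation for the {{model()}} overload quoted 
above, base64-encoding the raw bytes so the emitted JSON stays valid. This 
assumes stout's {{base64::encode()}} helper and is illustrative only, not 
necessarily the fix that will be adopted:

{code}
#include <stout/base64.hpp>

// Encode arbitrary (possibly non-UTF-8) bytes before emitting JSON.
object.values["data"] = base64::encode(executorInfo.data());
{code}

Consumers would then have to base64-decode the field, so this changes the 
endpoint's schema rather than merely fixing the escaping.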



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3788) Clarify NetworkInfo semantics for IP addresses and group policies.

2015-10-21 Thread Connor Doyle (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968260#comment-14968260
 ] 

Connor Doyle commented on MESOS-3788:
-

Submitted a patch to modify NetworkInfo (work-in-progress) on reviewboard: 
[r/39531|https://reviews.apache.org/r/39531].

> Clarify NetworkInfo semantics for IP addresses and group policies.
> --
>
> Key: MESOS-3788
> URL: https://issues.apache.org/jira/browse/MESOS-3788
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, isolation
>Affects Versions: 0.25.0
>Reporter: Connor Doyle
>  Labels: mesosphere
>
> In Mesos 0.25.0, a new message called NetworkInfo was introduced.  This 
> message allows framework authors to communicate with network isolation 
> modules via a first-class message type to request IP addresses and network 
> group isolation policies.
> Unfortunately, the structure is somewhat confusing to both framework authors 
> and module implementors.
> 1) It's unclear how IP addresses map to virtual interfaces inside the 
> container.
> 2) It's difficult for application developers to understand the final policy 
> when multiple IP addresses can be assigned with differing isolation policies.
> CC [~karya] [~benjaminhindman] [~spikecurtis]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3762) Refactor SSLTest fixture such that MesosTest can use the same helpers.

2015-10-21 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965883#comment-14965883
 ] 

Joseph Wu edited comment on MESOS-3762 at 10/22/15 12:06 AM:
-

Reviews for:
Step 1)
https://reviews.apache.org/r/39498/
https://reviews.apache.org/r/39499/
Step 2 & 3)
https://reviews.apache.org/r/39501/
Step 4) 
https://reviews.apache.org/r/39533/
https://reviews.apache.org/r/39534/


was (Author: kaysoky):
Reviews for:
Step 1)
https://reviews.apache.org/r/39498/
https://reviews.apache.org/r/39499/
Step 2 & 3)
https://reviews.apache.org/r/39501/

> Refactor SSLTest fixture such that MesosTest can use the same helpers.
> --
>
> Key: MESOS-3762
> URL: https://issues.apache.org/jira/browse/MESOS-3762
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> In order to write tests that exercise SSL with other components of Mesos, 
> such as the HTTP scheduler library, we need to use the setup/teardown logic 
> found in the {{SSLTest}} fixture.
> Currently, the test fixtures have separate inheritance structures like this:
> {code}
> SSLTest <- ::testing::Test
> MesosTest <- TemporaryDirectoryTest <- ::testing::Test
> {code}
> where {{::testing::Test}} is a gtest class.
> The plan is the following:
> # Change {{SSLTest}} to inherit from {{TemporaryDirectoryTest}}.  This will 
> require moving the setup (generation of keys and certs) from 
> {{SetUpTestCase}} to {{SetUp}}.  At the same time, *some* of the cleanup 
> logic in the SSLTest will not be needed.
> # Move the logic of generating keys/certs into helpers, so that individual 
> tests can call them when needed, much like {{MesosTest}}.
> # Write a child class of {{SSLTest}} which has the same functionality as the 
> existing {{SSLTest}}, for use by the existing tests that rely on {{SSLTest}} 
> or the {{RegistryClientTest}}.
> # Have {{MesosTest}} inherit from {{SSLTest}} (which might be renamed during 
> the refactor).  If Mesos is not compiled with {{--enable-ssl}}, then 
> {{SSLTest}} could be {{#ifdef}}'d into an empty class.
> The resulting structure should be like:
> {code}
> MesosTest <- SSLTest <- TemporaryDirectoryTest <- ::testing::Test
> ChildOfSSLTest /
> {code}
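
A rough C++ sketch of that target hierarchy (the {{#ifdef}} symbol and the 
class bodies are assumptions, not the final refactor):

{code}
#ifdef USE_SSL_SOCKET
// Full fixture: key/cert generation helpers move into per-test SetUp().
class SSLTest : public TemporaryDirectoryTest { /* key/cert helpers */ };
#else
// Without --enable-ssl the fixture collapses to an empty shim.
class SSLTest : public TemporaryDirectoryTest {};
#endif

class MesosTest : public SSLTest {};

// Preserves the old SSLTest behavior for the existing SSL and
// RegistryClientTest tests.
class ChildOfSSLTest : public SSLTest {};
{code}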



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-21 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968198#comment-14968198
 ] 

Vinod Kone commented on MESOS-3747:
---

Let me provide some history. 

"FrameworkInfo.user" should have been an optional field to begin with. As Marco 
mentioned, it only comes into play if "--switch-user" flag is set on the agent. 
More recently, we also added "CommandInfo.user" which takes precedence over 
"FrameworkInfo.user".

I would recommend the following:

1) Make FrameworkInfo.user optional in v1/mesos.proto.
2) Fix the agent to return the correct error message (instead of OOM!) if 
flag.switch_user is true and it cannot determine the user to run the 
task/executor under.


> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Assignee: Liqiang Lin
>Priority: Blocker
>
> When using libmesos a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent process is running as; this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host; instead it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e

[jira] [Commented] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky

2015-10-21 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968193#comment-14968193
 ] 

Anand Mazumdar commented on MESOS-3789:
---

[~gyliu] The tests are parameterized on the {{ContentType}}, which can be 
either {{application/x-protobuf}} or {{application/json}}.

To reproduce this, you might want to set {{--gtest_repeat=-1}} and 
{{--gtest_break_on_failure}} when running the test, so that it runs in a loop 
until it breaks.
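
For example, with an assumed test-runner path:

{noformat}
./bin/mesos-tests.sh \
  --gtest_filter=ContentType/SchedulerTest.Suppress/1 \
  --gtest_repeat=-1 --gtest_break_on_failure
{noformat}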

> ContentType/SchedulerTest.Suppress/1 is flaky
> --
>
> Key: MESOS-3789
> URL: https://issues.apache.org/jira/browse/MESOS-3789
> Project: Mesos
>  Issue Type: Bug
> Environment: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull
>Reporter: Vinod Kone
>Assignee: Guangya Liu
>
> Observed in ASF CI
> {code}
> [ RUN  ] ContentType/SchedulerTest.Suppress/1
> Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w'
> I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms
> I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms
> I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns
> I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in 
> 3107ns
> I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 504ns
> I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery
> I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status
> I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received 
> a broadcasted recover request from (10113)@172.17.3.153:57838
> I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1021 19:17:43.149783 30942 recover.cpp:566] Updating replica status to 
> STARTING
> I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 562580ns
> I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to 
> STARTING
> I1021 19:17:43.150841 30945 recover.cpp:475] Replica is in STARTING status
> I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status 
> received a broadcasted recover request from (10114)@172.17.3.153:57838
> I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING
> I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 275156ns
> I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to 
> VOTING
> I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos 
> group
> I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated
> I1021 19:17:43.154260 30940 master.cpp:376] Master 
> 242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on 
> 172.17.3.153:57838
> I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="25secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/master" 
> --zk_session_timeout="10secs"
> I1021 19:17:43.154597 30940 master.cpp:425] Master allowing unauthenticated 
> frameworks to register
> I1021 19:17:43.154605 30940 master.cpp:428] Master only allowing 
> authenticated slaves to register
> I1021 19:17:43.154611 30940 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials'
> I1021 19:17:43.154871 30940 master.cpp:467] Using default 'crammd5' 
> authenticator
> I1021 19:17:43.155226 30940 master.cpp:504] Authorization enabled
> I1021 19:17:43.155524 30947 whitelist_watcher.cpp:79] No whitelist given
> I1021 19:17:43.155642 30939 hierarchical.cp

[jira] [Commented] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky

2015-10-21 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968188#comment-14968188
 ] 

Guangya Liu commented on MESOS-3789:


Hi [~vi...@twitter.com] and [~anandmazumdar],

This is very similar to MESOS-3733; the difference is that MESOS-3733 fails at 
ContentType/SchedulerTest.Suppress/0 but not 
ContentType/SchedulerTest.Suppress/1. Can you please show in more detail what 
the difference between ContentType/SchedulerTest.Suppress/0 and 
ContentType/SchedulerTest.Suppress/1 is?

I also tried to reproduce this in my local env but failed; will check more.

{code}
I1021 19:17:43.270341 30954 slave.cpp:2284] Updated checkpointed resources from 
 to 
../../src/tests/scheduler_tests.cpp:1028: Failure
Value of: event.isPending()
  Actual: false
Expected: true
I1021 19:17:43.276475 30920 master.cpp:925] Master terminating
I1021 19:17:43.276880 30949 hierarchical.cpp:364] Removed slave 
242dc5ed-402d-4873-be6d-9bad1f3296f9-S0
I1021 19:17:43.277751 30945 hierarchical.cpp:220] Removed framework 
242dc5ed-402d-4873-be6d-9bad1f3296f9-
I1021 19:17:43.277863 30941 slave.cpp:3258] master@172.17.3.153:57838 exited
W1021 19:17:43.277899 30941 slave.cpp:3261] Master disconnected! Waiting for a 
new master to be elected
I1021 19:17:43.303658 30920 slave.cpp:606] Slave terminating
[  FAILED  ] ContentType/SchedulerTest.Suppress/1, where GetParam() = 
application/json (172 ms)
{code}

> ContentType/SchedulerTest.Suppress/1 is flaky
> --
>
> Key: MESOS-3789
> URL: https://issues.apache.org/jira/browse/MESOS-3789
> Project: Mesos
>  Issue Type: Bug
> Environment: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull
>Reporter: Vinod Kone
>Assignee: Guangya Liu
>
> Observed in ASF CI
> {code}
> [ RUN  ] ContentType/SchedulerTest.Suppress/1
> Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w'
> I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms
> I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms
> I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns
> I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in 
> 3107ns
> I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 504ns
> I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery
> I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status
> I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received 
> a broadcasted recover request from (10113)@172.17.3.153:57838
> I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1021 19:17:43.149783 30942 recover.cpp:566] Updating replica status to 
> STARTING
> I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 562580ns
> I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to 
> STARTING
> I1021 19:17:43.150841 30945 recover.cpp:475] Replica is in STARTING status
> I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status 
> received a broadcasted recover request from (10114)@172.17.3.153:57838
> I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING
> I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 275156ns
> I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to 
> VOTING
> I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos 
> group
> I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated
> I1021 19:17:43.154260 30940 master.cpp:376] Master 
> 242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on 
> 172.17.3.153:57838
> I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_time

[jira] [Issue Comment Deleted] (MESOS-3733) ContentType/SchedulerTest.Suppress/0 is flaky

2015-10-21 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-3733:
---
Comment: was deleted

(was: [~vi...@twitter.com] This is very similar to MESOS-3789; the difference 
is that MESOS-3789 fails at ContentType/SchedulerTest.Suppress/1 but not 
ContentType/SchedulerTest.Suppress/0. Can you please show in more detail what 
the difference between ContentType/SchedulerTest.Suppress/0 and 
ContentType/SchedulerTest.Suppress/1 is?

I also tried to reproduce this in my local env but failed; will check more.

{code}
I1021 19:17:43.270341 30954 slave.cpp:2284] Updated checkpointed resources from 
 to 
../../src/tests/scheduler_tests.cpp:1028: Failure
Value of: event.isPending()
  Actual: false
Expected: true
I1021 19:17:43.276475 30920 master.cpp:925] Master terminating
I1021 19:17:43.276880 30949 hierarchical.cpp:364] Removed slave 
242dc5ed-402d-4873-be6d-9bad1f3296f9-S0
I1021 19:17:43.277751 30945 hierarchical.cpp:220] Removed framework 
242dc5ed-402d-4873-be6d-9bad1f3296f9-
I1021 19:17:43.277863 30941 slave.cpp:3258] master@172.17.3.153:57838 exited
W1021 19:17:43.277899 30941 slave.cpp:3261] Master disconnected! Waiting for a 
new master to be elected
I1021 19:17:43.303658 30920 slave.cpp:606] Slave terminating
[  FAILED  ] ContentType/SchedulerTest.Suppress/1, where GetParam() = 
application/json (172 ms)
{code})

> ContentType/SchedulerTest.Suppress/0 is flaky
> -
>
> Key: MESOS-3733
> URL: https://issues.apache.org/jira/browse/MESOS-3733
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Anand Mazumdar
>Assignee: Guangya Liu
>  Labels: flaky-test
>
> Showed up on ASF CI:
> https://builds.apache.org/job/Mesos/931/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/console
> {code}
> [ RUN  ] ContentType/SchedulerTest.Suppress/0
> Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi'
> I1014 17:34:11.225731 27650 leveldb.cpp:176] Opened db in 2.974504ms
> I1014 17:34:11.226856 27650 leveldb.cpp:183] Compacted db in 980779ns
> I1014 17:34:11.227028 27650 leveldb.cpp:198] Created db iterator in 37641ns
> I1014 17:34:11.227159 27650 leveldb.cpp:204] Seeked to beginning of db in 
> 14959ns
> I1014 17:34:11.227283 27650 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 14672ns
> I1014 17:34:11.227449 27650 replica.cpp:746] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1014 17:34:11.228469 27680 recover.cpp:449] Starting replica recovery
> I1014 17:34:11.229202 27673 recover.cpp:475] Replica is in EMPTY status
> I1014 17:34:11.231384 27673 replica.cpp:642] Replica in EMPTY status received 
> a broadcasted recover request from (10262)@172.17.2.194:37545
> I1014 17:34:11.231745 27673 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1014 17:34:11.234242 27680 master.cpp:376] Master 
> 0cc41e7f-8d87-4c2f-9543-3f7198f9fdaf (23af00e0dbe0) started on 
> 172.17.2.194:37545
> I1014 17:34:11.234283 27680 master.cpp:378] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="25secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/master" 
> --zk_session_timeout="10secs"
> I1014 17:34:11.234679 27680 master.cpp:425] Master allowing unauthenticated 
> frameworks to register
> I1014 17:34:11.234694 27680 master.cpp:428] Master only allowing 
> authenticated slaves to register
> I1014 17:34:11.234705 27680 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/credentials'
> I1014 17:34:11.235251 27673 recover.cpp:566] Updating replica status to 
> STARTING
> I1014 17:34:11.235857 27680 master.cpp:467] Using default 'crammd5' 
> authenticator
> I1014 17:34:11.236006 27680 master.cpp:504] Authorization enabled
> I1014 17:34:11.236187 27673 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 729504ns
> I1014 17:34:11.236224 27673 r

[jira] [Commented] (MESOS-3733) ContentType/SchedulerTest.Suppress/0 is flaky

2015-10-21 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968185#comment-14968185
 ] 

Guangya Liu commented on MESOS-3733:


[~vi...@twitter.com] This is very similar to MESOS-3789; the difference is 
that MESOS-3789 fails at ContentType/SchedulerTest.Suppress/1 but not 
ContentType/SchedulerTest.Suppress/0. Can you please show in more detail what 
the difference between ContentType/SchedulerTest.Suppress/0 and 
ContentType/SchedulerTest.Suppress/1 is?

I also tried to reproduce this in my local env but failed; will check more.

{code}
I1021 19:17:43.270341 30954 slave.cpp:2284] Updated checkpointed resources from 
 to 
../../src/tests/scheduler_tests.cpp:1028: Failure
Value of: event.isPending()
  Actual: false
Expected: true
I1021 19:17:43.276475 30920 master.cpp:925] Master terminating
I1021 19:17:43.276880 30949 hierarchical.cpp:364] Removed slave 
242dc5ed-402d-4873-be6d-9bad1f3296f9-S0
I1021 19:17:43.277751 30945 hierarchical.cpp:220] Removed framework 
242dc5ed-402d-4873-be6d-9bad1f3296f9-
I1021 19:17:43.277863 30941 slave.cpp:3258] master@172.17.3.153:57838 exited
W1021 19:17:43.277899 30941 slave.cpp:3261] Master disconnected! Waiting for a 
new master to be elected
I1021 19:17:43.303658 30920 slave.cpp:606] Slave terminating
[  FAILED  ] ContentType/SchedulerTest.Suppress/1, where GetParam() = 
application/json (172 ms)
{code}

> ContentType/SchedulerTest.Suppress/0 is flaky
> -
>
> Key: MESOS-3733
> URL: https://issues.apache.org/jira/browse/MESOS-3733
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Anand Mazumdar
>Assignee: Guangya Liu
>  Labels: flaky-test
>
> Showed up on ASF CI:
> https://builds.apache.org/job/Mesos/931/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/console
> {code}
> [ RUN  ] ContentType/SchedulerTest.Suppress/0
> Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi'
> I1014 17:34:11.225731 27650 leveldb.cpp:176] Opened db in 2.974504ms
> I1014 17:34:11.226856 27650 leveldb.cpp:183] Compacted db in 980779ns
> I1014 17:34:11.227028 27650 leveldb.cpp:198] Created db iterator in 37641ns
> I1014 17:34:11.227159 27650 leveldb.cpp:204] Seeked to beginning of db in 
> 14959ns
> I1014 17:34:11.227283 27650 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 14672ns
> I1014 17:34:11.227449 27650 replica.cpp:746] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1014 17:34:11.228469 27680 recover.cpp:449] Starting replica recovery
> I1014 17:34:11.229202 27673 recover.cpp:475] Replica is in EMPTY status
> I1014 17:34:11.231384 27673 replica.cpp:642] Replica in EMPTY status received 
> a broadcasted recover request from (10262)@172.17.2.194:37545
> I1014 17:34:11.231745 27673 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1014 17:34:11.234242 27680 master.cpp:376] Master 
> 0cc41e7f-8d87-4c2f-9543-3f7198f9fdaf (23af00e0dbe0) started on 
> 172.17.2.194:37545
> I1014 17:34:11.234283 27680 master.cpp:378] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="25secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/master" 
> --zk_session_timeout="10secs"
> I1014 17:34:11.234679 27680 master.cpp:425] Master allowing unauthenticated 
> frameworks to register
> I1014 17:34:11.234694 27680 master.cpp:428] Master only allowing 
> authenticated slaves to register
> I1014 17:34:11.234705 27680 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/credentials'
> I1014 17:34:11.235251 27673 recover.cpp:566] Updating replica status to 
> STARTING
> I1014 17:34:11.235857 27680 master.cpp:467] Using default 'crammd5' 
> authenticator
> I1014 17:34:11.236006 27680 master.cpp:504] Authorization enabled
> I1014 17:34:11.236187 27673 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 729504ns
> I101

[jira] [Commented] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky

2015-10-21 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968182#comment-14968182
 ] 

Anand Mazumdar commented on MESOS-3789:
---

[~vinodkone] I am marking this as a dup in favor of 
https://issues.apache.org/jira/browse/MESOS-3733

[~gyliu] was already looking at this.

> ContentType/SchedulerTest.Suppress/1 is flaky
> --
>
> Key: MESOS-3789
> URL: https://issues.apache.org/jira/browse/MESOS-3789
> Project: Mesos
>  Issue Type: Bug
> Environment: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull
>Reporter: Vinod Kone
>Assignee: Guangya Liu
>
> Observed in ASF CI
> {code}
> [ RUN  ] ContentType/SchedulerTest.Suppress/1
> Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w'
> I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms
> I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms
> I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns
> I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in 
> 3107ns
> I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 504ns
> I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery
> I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status
> I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received 
> a broadcasted recover request from (10113)@172.17.3.153:57838
> I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1021 19:17:43.149783 30942 recover.cpp:566] Updating replica status to 
> STARTING
> I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 562580ns
> I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to 
> STARTING
> I1021 19:17:43.150841 30945 recover.cpp:475] Replica is in STARTING status
> I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status 
> received a broadcasted recover request from (10114)@172.17.3.153:57838
> I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING
> I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 275156ns
> I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to 
> VOTING
> I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos 
> group
> I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated
> I1021 19:17:43.154260 30940 master.cpp:376] Master 
> 242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on 
> 172.17.3.153:57838
> I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="25secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/master" 
> --zk_session_timeout="10secs"
> I1021 19:17:43.154597 30940 master.cpp:425] Master allowing unauthenticated 
> frameworks to register
> I1021 19:17:43.154605 30940 master.cpp:428] Master only allowing 
> authenticated slaves to register
> I1021 19:17:43.154611 30940 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials'
> I1021 19:17:43.154871 30940 master.cpp:467] Using default 'crammd5' 
> authenticator
> I1021 19:17:43.155226 30940 master.cpp:504] Authorization enabled
> I1021 19:17:43.155524 30947 whitelist_watcher.cpp:79] No whitelist given
> I1021 19:17:43.155642 30939 hierarchical.cpp:140] Initialized hierarchical 
> allocator process
> I1021 19:17:43.157397 30952 master.cpp:1609] The newly elected leader is 
> master@1

[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968180#comment-14968180
 ] 

Raul Gutierrez Segales edited comment on MESOS-2186 at 10/21/15 11:43 PM:
--

What would a sane behavior be? Say you give zookeeper_init() a list of 5 
hostnames: for how long should it retry if any of those lookups fail? Can it 
continue if most of them work? What's a good number?

If we can define this in a consistent way that works for everyone, I am happy 
to implement that behavior. But it's tricky to get right, hence it's usually 
better to just get off of DNS entirely if it's flaky (and pass in IP 
addresses).  


was (Author: rgs):
What would a sane behavior be? Say you give zookeeper_init() a list of 5 
hostnames: for how long should it retry if any of those lookups fail? Can it 
continue of most of them work? What's a good number?

If we can define this in a consistent way that works for everyone, I am happy 
to implement that behavior. But it's tricky to get right, hence it's usually 
better to just get off of DNS entirely if it's flaky (and pass in IP 
addresses).  

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS, Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google Compute (where 
> the DNS is based on the machines currently running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> De

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968180#comment-14968180
 ] 

Raul Gutierrez Segales commented on MESOS-2186:
---

What would a sane behavior be? Say you give zookeeper_init() a list of 5 
hostnames: for how long should it retry if any of those lookups fail? Can it 
continue if most of them work? What's a good number?

If we can define this in a consistent way that works for everyone, I am happy 
to implement that behavior. But it's tricky to get right, hence it's usually 
better to just get off DNS entirely if it's flaky (and pass in IP addresses 
instead).
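
For illustration only, here is a minimal sketch of one possible "continue if 
most of them work" policy. The helper below is hypothetical (it is not part of 
Mesos or the ZooKeeper C API); it pre-filters the ensemble down to the hosts 
that currently resolve, leaving the quorum decision to the caller:

{code}
// Hypothetical helper, not Mesos code: keep only the ensemble hosts that
// resolve right now. The caller can then decide whether enough of the
// ensemble survives (e.g. a majority) before calling zookeeper_init().
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <cstring>
#include <string>
#include <vector>

std::vector<std::string> resolvableHosts(const std::vector<std::string>& hosts)
{
  std::vector<std::string> resolved;
  for (const std::string& host : hosts) {
    struct addrinfo hints;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      // Accept IPv4 or IPv6.
    hints.ai_socktype = SOCK_STREAM;

    struct addrinfo* result = nullptr;
    if (getaddrinfo(host.c_str(), nullptr, &hints, &result) == 0) {
      resolved.push_back(host);       // This lookup succeeded; keep the host.
      freeaddrinfo(result);
    }                                 // On failure: skip the host, don't abort.
  }
  return resolved;
}
{code}

A caller could then require, say, a strict majority of the configured hosts to 
survive the filter before attempting to connect.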

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky

2015-10-21 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968174#comment-14968174
 ] 

Vinod Kone commented on MESOS-3789:
---

[~gyliu] can you take a look at this?

> ContentType/SchedulerTest.Suppress/1 is flaky
> --
>
> Key: MESOS-3789
> URL: https://issues.apache.org/jira/browse/MESOS-3789
> Project: Mesos
>  Issue Type: Bug
> Environment: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull
>Reporter: Vinod Kone
>Assignee: Guangya Liu
>
> Observed in ASF CI

[jira] [Created] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky

2015-10-21 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-3789:
-

 Summary: ContentType/SchedulerTest.Suppress/1 is flaky
 Key: MESOS-3789
 URL: https://issues.apache.org/jira/browse/MESOS-3789
 Project: Mesos
  Issue Type: Bug
 Environment: 
https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull
Reporter: Vinod Kone
Assignee: Guangya Liu


Observed in ASF CI

{code}
[ RUN  ] ContentType/SchedulerTest.Suppress/1
Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w'
I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms
I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms
I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns
I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in 3107ns
I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the db 
in 504ns
I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery
I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status
I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received a 
broadcasted recover request from (10113)@172.17.3.153:57838
I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from a 
replica in EMPTY status
I1021 19:17:43.149783 30942 recover.cpp:566] Updating replica status to STARTING
I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 562580ns
I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to 
STARTING
I1021 19:17:43.150841 30945 recover.cpp:475] Replica is in STARTING status
I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status 
received a broadcasted recover request from (10114)@172.17.3.153:57838
I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover response from a 
replica in STARTING status
I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING
I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 275156ns
I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to VOTING
I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos group
I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated
I1021 19:17:43.154260 30940 master.cpp:376] Master 
242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on 
172.17.3.153:57838
I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate="false" --authenticate_slaves="true" --authenticators="crammd5" 
--authorizers="local" 
--credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
--quiet="false" --recovery_slave_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_store_timeout="25secs" --registry_strict="true" 
--root_submissions="true" --slave_ping_timeout="15secs" 
--slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" 
--work_dir="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/master" 
--zk_session_timeout="10secs"
I1021 19:17:43.154597 30940 master.cpp:425] Master allowing unauthenticated 
frameworks to register
I1021 19:17:43.154605 30940 master.cpp:428] Master only allowing authenticated 
slaves to register
I1021 19:17:43.154611 30940 credentials.hpp:37] Loading credentials for 
authentication from 
'/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials'
I1021 19:17:43.154871 30940 master.cpp:467] Using default 'crammd5' 
authenticator
I1021 19:17:43.155226 30940 master.cpp:504] Authorization enabled
I1021 19:17:43.155524 30947 whitelist_watcher.cpp:79] No whitelist given
I1021 19:17:43.155642 30939 hierarchical.cpp:140] Initialized hierarchical 
allocator process
I1021 19:17:43.157397 30952 master.cpp:1609] The newly elected leader is 
master@172.17.3.153:57838 with id 242dc5ed-402d-4873-be6d-9bad1f3296f9
I1021 19:17:43.157438 30952 master.cpp:1622] Elected as the leading master!
I1021 19:17:43.157455 30952 master.cpp:1382] Recovering from registrar
I1021 19:17:43.157595 30943 registrar.cpp:309] Recovering registrar
I1021 19:17:43.158347 30950 log.cpp:661] Attempting to start the writer
I1021 19:17:43.159632 30949 replica.cpp:478] Replica received implicit promise 
request from (10115)@172.17.3.153:57838 with proposal 1
I1021 19:17:43.160238 30949 lev

[jira] [Commented] (MESOS-3786) Backticks are not mentioned in Mesos C++ Style Guide

2015-10-21 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968162#comment-14968162
 ] 

Greg Mann commented on MESOS-3786:
--

Ah OK; I was under the impression that backticks were the way to go. In either 
case, the correct convention should be documented in our style guide. Perhaps a 
thread on the dev list would help folks come to consensus on a policy?

> Backticks are not mentioned in Mesos C++ Style Guide
> 
>
> Key: MESOS-3786
> URL: https://issues.apache.org/jira/browse/MESOS-3786
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Minor
>  Labels: documentation, mesosphere
>
> As far as I can tell, current practice is to quote code excerpts and object 
> names with backticks when writing comments. For example:
> {code}
> // You know, `sadPanda` seems extra sad lately.
> std::string sadPanda;
> sadPanda = "   :'(   ";
> {code}
> However, I don't see this documented in our C++ style guide at all. It should 
> be added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3786) Backticks are not mentioned in Mesos C++ Style Guide

2015-10-21 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968149#comment-14968149
 ] 

Benjamin Mahler commented on MESOS-3786:


Ah... it should be single quotes for object names.

Looking through a grep, it looks like a number of recent patches introduced the 
backticks: maintenance, systemd, fetcher cache, master json, and a couple of 
others. It would be great to clean this up and prevent more backticks, but it's 
not a big deal. Unless I'm missing something... e.g., in Doxygen, do backticks 
affect rendering?
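
For concreteness, here are the two conventions under discussion side by side 
(illustrative only; the variable is just the example name from the issue 
description):

{code}
#include <string>

// Single-quote style for object names:
// You know, 'sadPanda' seems extra sad lately.
//
// Backtick style, which Markdown-aware tools render as inline code:
// You know, `sadPanda` seems extra sad lately.
std::string sadPanda = "   :'(   ";
{code}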

> Backticks are not mentioned in Mesos C++ Style Guide
> 
>
> Key: MESOS-3786
> URL: https://issues.apache.org/jira/browse/MESOS-3786
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Minor
>  Labels: documentation, mesosphere
>
> As far as I can tell, current practice is to quote code excerpts and object 
> names with backticks when writing comments. For example:
> {code}
> // You know, `sadPanda` seems extra sad lately.
> std::string sadPanda;
> sadPanda = "   :'(   ";
> {code}
> However, I don't see this documented in our C++ style guide at all. It should 
> be added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968143#comment-14968143
 ] 

Steven Schlansker commented on MESOS-2186:
--

Maybe this will end up being too hard to fix, since it seems to be a limitation 
of the ZK C API.  It's just surprising from an end user perspective that a 
single name failing to resolve (even when two are still happy) causes such a 
disruptive failure.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968131#comment-14968131
 ] 

Neil Conway commented on MESOS-2186:


If the DNS resolution failure lasts for a long time, zookeeper_init() will 
continue to return NULL and hence Mesos will still be unable to make progress.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968128#comment-14968128
 ] 

Steven Schlansker commented on MESOS-2186:
--

This is true in the case where the DNS resolution failure is temporary.  If it 
is not temporary, you are still SOL.  Imagine $JUNIOR_ADMIN removes one of the 
ZooKeeper nodes from DNS.  You may then have an inoperable Mesos cluster for a 
long time if you have aggressive DNS caching, even though a ZK quorum is still 
up and alive.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968099#comment-14968099
 ] 

Neil Conway commented on MESOS-2186:


Ah, okay. So the situation seems to be:

(1) zookeeper_init() returns NULL when getaddrinfo() fails, as intended.
(2) Mesos is _designed_ to loop and retry zookeeper_init(), but it doesn't do 
so here: we use a gross hack to determine whether the zookeeper_init() failure 
was due to a hostname resolution failure, and apparently that hack doesn't 
account for this case (we expect errno == EINVAL, but apparently we see ENOENT 
instead).
(3) Hence, we abort the process.

We can revise the condition we're checking in #2 slightly, but that is only 
intended as a convenience anyway: as discussed above, you should be running 
Mesos under process supervision and restarting it when it fails. (The question 
is just whether we do the retry loop in Mesos itself or in the process 
supervisor.) If Mesos exiting unexpectedly "compromises the 'high availability' 
of Mesos", your Mesos installation is not configured correctly.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Created] (MESOS-3788) Clarify NetworkInfo semantics for IP addresses and group policies.

2015-10-21 Thread Connor Doyle (JIRA)
Connor Doyle created MESOS-3788:
---

 Summary: Clarify NetworkInfo semantics for IP addresses and group 
policies.
 Key: MESOS-3788
 URL: https://issues.apache.org/jira/browse/MESOS-3788
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, isolation
Affects Versions: 0.25.0
Reporter: Connor Doyle


In Mesos 0.25.0, a new message called NetworkInfo was introduced.  This message 
allows framework authors to communicate with network isolation modules via a 
first-class message type to request IP addresses and network group isolation 
policies.

Unfortunately, the structure is somewhat confusing to both framework authors 
and module implementors.

1) It's unclear how IP addresses map to virtual interfaces inside the container.
2) It's difficult for application developers to understand the final policy 
when multiple IP addresses can be assigned with differing isolation policies.

CC [~karya] [~benjaminhindman] [~spikecurtis]
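
To make point 2 concrete, consider a purely hypothetical request (illustrative 
only; the field names are assumptions rather than the verified 0.25 schema) 
that asks for two addresses under conflicting group policies:

{code}
# Illustrative only; field names here are assumptions, not a verified schema.
# Two NetworkInfo entries with different group policies: which policy governs
# traffic on each resulting virtual interface is exactly what is unclear.
network_infos {
  protocol: IPv4
  groups: "frontend"   # Intended to be reachable by the 'frontend' group...
}
network_infos {
  protocol: IPv4
  groups: "backend"    # ...while this address should be 'backend' only.
}
{code}

The ambiguity is which address lands on which interface, and which group 
policy then applies between them.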



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968075#comment-14968075
 ] 

Raul Gutierrez Segales commented on MESOS-2186:
---

Yeah, at least for the 3.4 branch we'll probably not have the constructor 
(zookeeper_init) retry the failed getaddrinfo() calls, so it's up to the caller.

(ignore the part about the locks not being properly initialized mentioned in 
the description of ZOOKEEPER-1029; that has nothing to do with this bug).

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968076#comment-14968076
 ] 

Raul Gutierrez Segales commented on MESOS-2186:
---

I would think so... 

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968073#comment-14968073
 ] 

Steven Schlansker commented on MESOS-2186:
--

If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is 
unrelated, yeah?

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968072#comment-14968072
 ] 

Steven Schlansker commented on MESOS-2186:
--

If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is 
unrelated, yeah?

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master allowing unauthenticated slaves to register

[jira] [Issue Comment Deleted] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2186:
-
Comment: was deleted

(was: If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 
is unrelated, yeah?)

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968063#comment-14968063
 ] 

Neil Conway commented on MESOS-2186:


The check failure trace happens because the call to zookeeper_init() returns 
NULL; Mesos checks for this and aborts with an error and a stack trace.
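
For illustration, the fatal pattern is roughly the following minimal sketch 
(using the ZooKeeper C client API directly; this is not the actual Mesos 
source, which does the equivalent with a glog {{CHECK}} at the zookeeper.cpp:113 
shown in the log above):

{code}
// Minimal sketch, not the actual Mesos source: zookeeper_init() returns
// NULL when it cannot resolve a configured host, and a glog-style check
// on the result aborts the whole process.
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <iostream>

#include <zookeeper/zookeeper.h>

static void watcher(zhandle_t* zh, int type, int state,
                    const char* path, void* ctx) {}

int main()
{
  // 'badhost' stands in for an unresolvable ZK member.
  const char* servers = "zk1:2181,zk2:2181,badhost:2181";

  zhandle_t* zh = zookeeper_init(servers, watcher, 10000, NULL, NULL, 0);

  if (zh == NULL) {
    // Mesos does the moral equivalent of CHECK(zh != NULL), which logs
    // "Failed to create ZooKeeper, zookeeper_init: ..." and aborts.
    std::cerr << "Failed to create ZooKeeper, zookeeper_init: "
              << strerror(errno) << " [" << errno << "]" << std::endl;
    abort();
  }

  zookeeper_close(zh);
  return 0;
}
{code}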

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968056#comment-14968056
 ] 

Steven Schlansker commented on MESOS-2186:
--

Well, rgs above called into question whether that is truly the case.  
Additionally, at least as of now, the "check failure stack trace" is entirely in 
C++ code, seemingly not in the Zookeeper library (pure C).

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968034#comment-14968034
 ] 

Neil Conway edited comment on MESOS-2186 at 10/21/15 9:55 PM:
--

Hi Steven,

The current theory is that this is a problem with Zookeeper; from a quick look 
at the Zk bug ([ZOOKEEPER-1029]), that seems likely correct to me. When there 
is a Zookeeper patch for the problem, we can discuss whether to backport it to 
Mesos in the time before a new Zk stable release is made. Other than that, I'm 
not sure what else we can do.


was (Author: neilc):
Hi Steven,

The current theory is that this is a Zookeeper; from a quick look at the Zk bug 
([ZOOKEEPER-1029]), that seems likely correct to me. When there is a Zookeeper 
patch for the problem, we can discuss whether to backport it to Mesos in the 
time before a new Zk stable release is made. Other than that, I'm not sure what 
else we can do.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968034#comment-14968034
 ] 

Neil Conway commented on MESOS-2186:


Hi Steven,

The current theory is that this is a problem with Zookeeper; from a quick look 
at the Zk bug ([ZOOKEEPER-1029]), that seems likely correct to me. When there 
is a Zookeeper patch for the problem, we can discuss whether to backport it to 
Mesos in the time before a new Zk stable release is made. Other than that, I'm 
not sure what else we can do.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022
 ] 

Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:51 PM:


I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat

I1021 21:48:00.308338 32707 group.cpp:674] Trying to get 
'/wat/json.info_00' in ZooKeeper
I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master 
(UPID=master@127.0.1.1:5050) is detected
I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is 
master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e
I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master!
{code}

Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct 
though, so this should be recoverable:
{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat

I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
@ 0x7f9bec6044c2  google::LogMessage::Fail()
@ 0x7f9bec6044c2  google::LogMessage::Fail()
@ 0x7f9bec60440e  google::LogMessage::SendToLog()
@ 0x7f9bec60440e  google::LogMessage::SendToLog()
@ 0x7f9bec603e10  google::LogMessage::Flush()
@ 0x7f9bec603e10  google::LogMessage::Flush()
@ 0x7f9bec603c25  google::LogMessage::~LogMessage()
@ 0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec603c25  google::LogMessage::~LogMessage()
@ 0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec00b825  ZooKeeperProcess::initialize()
@ 0x7f9bec00b825  ZooKeeperProcess::initialize()
@ 0x7f9bec57053d  process::ProcessManager::resume()
@ 0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec57053d  process::ProcessManager::resume()
@ 0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec5779ed  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
@ 0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec577986  
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
@ 0x7f9be828ea40  (unknown)
@ 0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9be7aab182  start_thread
@ 0x7f9be77d847d  (unknown)
Aborted (core dumped)
{code}

[~rgs] I am very sorry if this does not end up being a ZK problem at all, I am 
no C++ expert.  I fully admit the linked ZK bug may not be the root cause.  But 
Mesos is still trivial to crash if one of the ZK members is not valid (even if 
a quorum is).
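
One conceivable mitigation, sketched below purely as an illustration (this is 
not something Mesos currently does), would be to pre-resolve the configured 
members with getaddrinfo() and drop the unresolvable ones before calling 
zookeeper_init(), so a single bad hostname cannot abort the process:

{code}
// Sketch only (an assumption, not current Mesos behavior): keep the ZK
// connection string usable when some configured hosts do not resolve.
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Returns only the "host:port" entries whose host currently resolves.
std::vector<std::string> resolvable(const std::vector<std::string>& members)
{
  std::vector<std::string> result;
  for (const std::string& member : members) {
    const std::string host = member.substr(0, member.find(':'));

    addrinfo* info = nullptr;
    if (getaddrinfo(host.c_str(), nullptr, nullptr, &info) == 0) {
      freeaddrinfo(info);
      result.push_back(member);
    } else {
      std::cerr << "Skipping unresolvable ZK member: " << member << std::endl;
    }
  }
  return result;
}

int main()
{
  std::vector<std::string> members = {
    "zk1.mycorp.com:2181", "zk2.mycorp.com:2181", "badhost.mycorp.com:2181"};

  // Rebuild the comma-separated server list from the members that resolve;
  // the result could then be handed to zookeeper_init().
  std::ostringstream servers;
  for (const std::string& member : resolvable(members)) {
    if (servers.tellp() > 0) servers << ',';
    servers << member;
  }
  std::cout << servers.str() << std::endl;
  return 0;
}
{code}

Whether silently dropping a member is acceptable for quorum semantics is of 
course debatable; the point is only that resolution failure need not be fatal.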



was (Author: stevenschlansker):
I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat

I1021 21:48:00.308338 32707 group.cpp:674] Trying to get 
'/w

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968027#comment-14968027
 ] 

Steven Schlansker commented on MESOS-2186:
--

I reopened the ticket since it is still a crasher in master.  I hope that is 
appropriate; I apologize in advance if not.  Not trying to be a stick in the 
mud, but this compromises the "high availability" of Mesos, which is a critical 
piece of infrastructure.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Updated] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2186:
-
Affects Version/s: 0.26.0

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).

[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022
 ] 

Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:47 PM:


I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat

I1021 21:48:00.308338 32707 group.cpp:674] Trying to get 
'/wat/json.info_00' in ZooKeeper
I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master 
(UPID=master@127.0.1.1:5050) is detected
I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is 
master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e
I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master!
{code}

Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct 
though, so this should be recoverable:
{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat

I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
Aborted (core dumped)
{code}

[~rgs] I am very sorry if this does not end up being a ZK problem at all, I am 
no C++ expert.  But Mesos is still trivial to crash if one of the ZK members is 
not valid (even if two are).



was (Author: stevenschlansker):
I am still able to easily reproduce this, even with master built from today:

{code}
./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat
{code}

Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct 
thou

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022
 ] 

Steven Schlansker commented on MESOS-2186:
--

I am still able to easily reproduce this, even with master built from today:

{code}
./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat
{code}

Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct 
though, so this should be recoverable:
{code}
./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat
I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
Aborted (core dumped)
{code}

[~rgs] I am very sorry if this does not end up being a ZK problem at all, I am 
no C++ expert.  But Mesos is still trivial to crash if one of the ZK members is 
not valid (even if two are).


> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from

[jira] [Commented] (MESOS-2386) Provide full filesystem isolation as a native mesos isolator

2015-10-21 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967969#comment-14967969
 ] 

Zameer Manji commented on MESOS-2386:
-

[~idownes] The linked design doc is not publicly viewable. Can you change the 
permissions on the document please?

> Provide full filesystem isolation as a native mesos isolator
> 
>
> Key: MESOS-2386
> URL: https://issues.apache.org/jira/browse/MESOS-2386
> Project: Mesos
>  Issue Type: Epic
>  Components: isolation
>Affects Versions: 0.22.1
>Reporter: Dominic Hamon
>Assignee: Ian Downes
>  Labels: mesosphere, twitter
>
> Design
> https://docs.google.com/a/twitter.com/document/d/1Fx5TS0LytV7u5MZExQS0-g-gScX2yKCKQg9UPFzhp6U/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967962#comment-14967962
 ] 

Steven Schlansker commented on MESOS-3771:
--

Okay, I have distilled down the reproduction case.
Using the Python test-framework with the following diff applied:

{code}
diff --git a/src/examples/python/test_framework.py 
b/src/examples/python/test_framework.py
index 6af6d22..95abb97 100755
--- a/src/examples/python/test_framework.py
+++ b/src/examples/python/test_framework.py
@@ -150,6 +150,7 @@ class TestScheduler(mesos.interface.Scheduler):
 print "but received", self.messagesReceived
 sys.exit(1)
 print "All tasks done, and all messages received, exiting"
+time.sleep(30)
 driver.stop()
 
 if __name__ == "__main__":
@@ -158,6 +159,7 @@ if __name__ == "__main__":
 sys.exit(1)
 
 executor = mesos_pb2.ExecutorInfo()
+executor.data = b'\xAC\xED'
 executor.executor_id.value = "default"
 executor.command.value = os.path.abspath("./test-executor")
 executor.name = "Test Executor (Python)"
{code}

If you run the test framework and, during the 30-second wait after it finishes, 
grab the {{/master/state.json}} endpoint, you will get a response that 
contains invalid UTF-8:

{code}
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start 
byte 0xac
 at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@54c8158d; line: 1, 
column: 6432]
{code}

I tested against both 0.24.1 and current master; both exhibit the bad behavior.
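
For what it's worth, any byte-safe escaping scheme would keep the output 
parseable. Below is a rough sketch of one possible approach (hypothetical code, 
not the stout API; {{escapeBytes}} is an invented name) that emits every byte 
outside printable ASCII as a {{\u00XX}} sequence:

{code}
// Hypothetical sketch of byte-safe JSON string escaping; not the stout
// implementation. Bytes outside printable ASCII become \u00XX sequences,
// so arbitrary ExecutorInfo.data content stays valid JSON.
#include <cstdio>
#include <string>

std::string escapeBytes(const std::string& data)
{
  std::string out;
  for (unsigned char c : data) {
    if (c == '"' || c == '\\') {
      out += '\\';
      out += static_cast<char>(c);
    } else if (c >= 0x20 && c < 0x7f) {
      out += static_cast<char>(c);  // Printable ASCII passes through.
    } else {
      char buffer[8];
      std::snprintf(buffer, sizeof(buffer), "\\u%04x", c);
      out += buffer;
    }
  }
  return out;
}

int main()
{
  // 0xAC 0xED is the Java serialization magic visible in the hex dump below.
  const std::string data("\xac\xed\x00\x05ur", 6);
  std::printf("\"data\":\"%s\"\n", escapeBytes(data).c_str());
  return 0;
}
{code}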

> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1, 0.26.0
>Reporter: Steven Schlansker
>Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regard to proper character encoding:
> {code}
> 0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
> 0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
> 0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
> 0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u0000\u0005ur\|
> 0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u0000\u000f[Lsca|
> 0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
> {code}
> I suspect this is because the HTTP api emits the executorInfo.data directly:
> {code}
> JSON::Object model(const ExecutorInfo& executorInfo)
> {
>   JSON::Object object;
>   object.values["executor_id"] = executorInfo.executor_id().value();
>   object.values["name"] = executorInfo.name();
>   object.values["data"] = executorInfo.data();
>   object.values["framework_id"] = executorInfo.framework_id().value();
>   object.values["command"] = model(executorInfo.command());
>   object.values["resources"] = model(executorInfo.resources());
>   return object;
> }
> {code}
> I think this may be because the custom JSON processing library in stout seems 
> to not have any idea of what a byte array is.  I'm guessing that some 
> implicit conversion makes it get written as a String instead, but:
> {code}
> inline std::ostream& operator<<(std::ostream& out, const String& string)
> {
>   // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
> // See RFC4627 for the JSON string specification.
>   return out << picojson::value(string.value).serialize();
> }
> {code}
> Thank you for any assistance here.  Our cluster is currently entirely down -- 
> the frameworks cannot handle parsing the invalid JSON produced (it is not 
> even valid utf-8)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-21 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-3771:
-
Affects Version/s: 0.26.0

> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1, 0.26.0
>Reporter: Steven Schlansker
>Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regard to proper character encoding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1563) Failed to configure on FreeBSD

2015-10-21 Thread David Forsythe (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967688#comment-14967688
 ] 

David Forsythe commented on MESOS-1563:
---

[~idownes] Great! Do you want to make a first pass and have me chop it up when 
I address feedback, or would you like me to chop it up before you review?

> Failed to configure on FreeBSD
> --
>
> Key: MESOS-1563
> URL: https://issues.apache.org/jira/browse/MESOS-1563
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.19.0
> Environment: FreeBSD-10/stable
>Reporter: Dmitry Sivachenko
>
> When trying to configure mesos on FreeBSD, I get the following error:
> configure: Setting up build environment for x86_64 freebsd10.0
> configure: error: "Mesos is currently unsupported on your platform."
> Why? Is there anything really Linux-specific inside? It's written in Java 
> after all.
> And MacOS is supported, but it is rather close to FreeBSD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.

2015-10-21 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-3787:
--
Description: We'd like to have expanded variables usable in [the json files 
used to create a Marathon app, hence] the Task's CommandInfo, so that the 
executor is able to detect the correct values at runtime.  (was: We'd like to 
have expanded variables usable in the json files used to create an app, so that 
the executor is able to detect the correct values at runtime.)

> As a developer, I'd like to be able to expand environment variables through 
> the Docker executor.
> 
>
> Key: MESOS-3787
> URL: https://issues.apache.org/jira/browse/MESOS-3787
> Project: Mesos
>  Issue Type: Wish
>Reporter: John Garcia
>  Labels: mesosphere
>
> We'd like to have expanded variables usable in [the json files used to create 
> a Marathon app, hence] the Task's CommandInfo, so that the executor is able 
> to detect the correct values at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.

2015-10-21 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-3787:
--
Labels: mesosphere  (was: )

> As a developer, I'd like to be able to expand environment variables through 
> the Docker executor.
> 
>
> Key: MESOS-3787
> URL: https://issues.apache.org/jira/browse/MESOS-3787
> Project: Mesos
>  Issue Type: Wish
>Reporter: John Garcia
>  Labels: mesosphere
>
> We'd like to have expanded variables usable in the json files used to create 
> an app, so that the executor is able to detect the correct values at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.

2015-10-21 Thread John Garcia (JIRA)
John Garcia created MESOS-3787:
--

 Summary: As a developer, I'd like to be able to expand environment 
variables through the Docker executor.
 Key: MESOS-3787
 URL: https://issues.apache.org/jira/browse/MESOS-3787
 Project: Mesos
  Issue Type: Wish
Reporter: John Garcia


We'd like to have expanded variables usable in the json files used to create an 
app, so that the executor is able to detect the correct values at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3786) Backticks are not mentioned in Mesos C++ Style Guide

2015-10-21 Thread Greg Mann (JIRA)
Greg Mann created MESOS-3786:


 Summary: Backticks are not mentioned in Mesos C++ Style Guide
 Key: MESOS-3786
 URL: https://issues.apache.org/jira/browse/MESOS-3786
 Project: Mesos
  Issue Type: Documentation
Reporter: Greg Mann
Assignee: Greg Mann
Priority: Minor


As far as I can tell, current practice is to quote code excerpts and object 
names with backticks when writing comments. For example:

{code}
// You know, `sadPanda` seems extra sad lately.
std::string sadPanda;
sadPanda = "   :'(   ";
{code}

However, I don't see this documented in our C++ style guide at all. It should 
be added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-21 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-3747:
--
Shepherd: Vinod Kone

I'll shepherd this.

> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Assignee: Liqiang Lin
>Priority: Blocker
>
> When using libmesos, a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent process is running as; this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host; instead it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.262684 19638 docker.cpp:734] No container info found, skipping 
> launch
> I1015 13:15:34.263478 19638 containerizer.cpp:640] Starting container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework 
> 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-'
> E1015 13:15:34.264516 19641 slave.cpp:3342] Container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework 
> 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-' failed to start: Failed to 
> prepare isolator: Faile

[jira] [Updated] (MESOS-3785) Use URI content modification time to trigger fetcher cache updates.

2015-10-21 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-3785:

Labels: mesosphere  (was: )

> Use URI content modification time to trigger fetcher cache updates.
> ---
>
> Key: MESOS-3785
> URL: https://issues.apache.org/jira/browse/MESOS-3785
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Bernd Mathiske
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> Instead of using checksums to trigger fetcher cache updates, we can for 
> starters use the content modification time (mtime), which is available for a 
> number of download protocols, e.g. HTTP and HDFS.
> Proposal: Instead of just fetching the content size, we fetch both size and 
> mtime together. As before, if there is no size, then caching fails and we 
> fall back on direct downloading to the sandbox. 
> Assuming a size is given, we compare the mtime from the fetch URI with the 
> mtime known to the cache. If it differs, we update the cache. (As a defensive 
> measure, a difference in size should also trigger an update.) 
> Not having an mtime available at the fetch URI is simply treated as a unique 
> valid mtime value that differs from all others. This means that when 
> initially there is no mtime, cache content remains valid until there is one. 
> Thereafter, a new lack of an mtime invalidates the cache once. In other 
> words: any change from no mtime to having one or back is the same as 
> encountering a new mtime.
> Note that this scheme does not require any new protobuf fields.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3506) Build instructions for CentOS 6.6 should include `sudo yum update`

2015-10-21 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967581#comment-14967581
 ] 

Greg Mann commented on MESOS-3506:
--

On a related note, can anybody confirm that the following command is necessary 
in the CentOS 6.6 install instructions:

{code}
sudo yum install -y tar wget which
{code}

The OS image I'm using already has these installed by default, so I'm inclined 
to remove that line from the docs unless we can confirm that it's needed.

> Build instructions for CentOS 6.6 should include `sudo yum update`
> --
>
> Key: MESOS-3506
> URL: https://issues.apache.org/jira/browse/MESOS-3506
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: documentation, mesosphere
>
> Neglecting to run {{sudo yum update}} on CentOS 6.6 currently causes the 
> build to break when building {{mesos-0.25.0.jar}}. The build instructions for 
> this platform on the Getting Started page should be changed accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3506) Build instructions for CentOS 6.6 should include `sudo yum update`

2015-10-21 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967570#comment-14967570
 ] 

Greg Mann commented on MESOS-3506:
--

I'll check it out and see if I can tell which dependency is causing the issue. 
Since this is CentOS 6, however, I think there's a case to be made for just 
saying `sudo yum update` in the docs: it's an old OS version, and I would 
imagine similar problems with other packages may crop up in the future.

> Build instructions for CentOS 6.6 should include `sudo yum update`
> --
>
> Key: MESOS-3506
> URL: https://issues.apache.org/jira/browse/MESOS-3506
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: documentation, mesosphere
>
> Neglecting to run {{sudo yum update}} on CentOS 6.6 currently causes the 
> build to break when building {{mesos-0.25.0.jar}}. The build instructions for 
> this platform on the Getting Started page should be changed accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1563) Failed to configure on FreeBSD

2015-10-21 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967492#comment-14967492
 ] 

Ian Downes commented on MESOS-1563:
---

[~dforsyth] it should be split into two. Although it lives in the same tree, we 
consider libprocess (and also stout) to be a separate library, and changes 
should be separated.

I'm just about to review now.

> Failed to configure on FreeBSD
> --
>
> Key: MESOS-1563
> URL: https://issues.apache.org/jira/browse/MESOS-1563
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.19.0
> Environment: FreeBSD-10/stable
>Reporter: Dmitry Sivachenko
>
> When trying to configure mesos on FreeBSD, I get the following error:
> configure: Setting up build environment for x86_64 freebsd10.0
> configure: error: "Mesos is currently unsupported on your platform."
> Why? Is there anything really Linux-specific inside? It's written in Java 
> after all.
> And MacOS is supported, but it is rather close to FreeBSD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3736) Support docker local store pull same image simultaneously

2015-10-21 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967449#comment-14967449
 ] 

Gilbert Song commented on MESOS-3736:
-

A quick note to list solutions to the questions above (a rough sketch follows):

1. Add a logic check: if this is the first call to get() for Image_A, associate 
a promise with metadataManager->get(). If it is not, check whether the promised 
future failed or was discarded; if so, overwrite the entry in the hash map.
2. Use 'stringify(image)' as the key.
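For illustration, here is a self-contained sketch of that idea using 
{{std::shared_future}} (the real code would use libprocess's Future/Promise and 
stout's hashmap; all names below are made up):

{code}
// Concurrent get() calls for the same image key share one in-flight pull.
#include <future>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

class ImageStore
{
public:
  std::shared_future<std::string> get(const std::string& key)
  {
    std::lock_guard<std::mutex> lock(mutex_);

    // Reuse the cached in-flight (or finished) pull if we have one.
    auto it = pulling_.find(key);
    if (it != pulling_.end()) {
      return it->second;
    }

    // First caller: launch the expensive pull once and cache the future.
    std::shared_future<std::string> future =
      std::async(std::launch::async, [key]() {
        return "layers-of-" + key;  // stand-in for untar + copy
      }).share();

    pulling_[key] = future;
    return future;
  }

private:
  std::mutex mutex_;
  std::map<std::string, std::shared_future<std::string>> pulling_;
};

int main()
{
  ImageStore store;
  auto a = store.get("busybox:latest");
  auto b = store.get("busybox:latest");  // shares the same pull as 'a'
  std::cout << a.get() << " / " << b.get() << std::endl;
  return 0;
}
{code}

(A production version would also replace a cached entry whose future failed or 
was discarded, as described in point 1 above.)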

> Support docker local store pull same image simultaneously 
> --
>
> Key: MESOS-3736
> URL: https://issues.apache.org/jira/browse/MESOS-3736
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> The current local store implements get() using the local puller. For all 
> requests of pulling same docker image at the same time, the local puller just 
> untar the image tarball as many times as those requests are, and cp all of 
> them to the same directory, which wastes time and bear high demand of 
> computation. We should be able to support the local store/puller only do 
> these for the first time, and the simultaneous pulling request should wait 
> for the promised future and get it once the first pulling finishes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3766) Can not kill task in Status STAGING

2015-10-21 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967444#comment-14967444
 ] 

Anand Mazumdar commented on MESOS-3766:
---

Looking at the logs, we can identify a few things:

1. The executors for the 2 tasks were indeed launched. After launch, they sent 
a registration request to the agent. The agent successfully registered them 
and sent the queued task it had back to each executor.
2. The executors never sent any status updates, e.g. {{TASK_RUNNING}}, for the 
tasks (they might have been stuck). It is very hard to tell why, since no 
{{VLOG}} messages from the executor driver were logged, owing to {{GLOG_v=1}} 
not being set in the executor environment variables.
3. Upon receiving {{KillTask}} messages from the scheduler, the agent kept 
forwarding them to the executor. These messages are all best effort (fire and 
forget), meaning that if the executor is hung/unresponsive, there is no way for 
us to know except by looking at the driver logs.

Since we did not have {{GLOG_v}} set, it is very hard to reason about whether 
the executor received the messages from the agent and, if so, why it did not 
act upon them, i.e. send a {{TASK_KILLED}} status update back to the agent.

> Can not kill task in Status STAGING
> ---
>
> Key: MESOS-3766
> URL: https://issues.apache.org/jira/browse/MESOS-3766
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.25.0
> Environment: OSX 
>Reporter: Matthias Veit
>Assignee: Niklas Quarfot Nielsen
> Attachments: master.log.zip, slave.log.zip
>
>
> I have created a simple Marathon Application with instance count 100 (100 
> tasks) with a simple sleep command. Before all tasks were running, I killed 
> all tasks. This operation was successful, except 2 tasks. These 2 tasks are 
> in state STAGING (according to the mesos UI). Marathon tries to kill those 
> tasks every 5 seconds (for over an hour now) - unsuccessfully.
> I picked one task and grepped the slave log:
> {noformat}
> I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour
> I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80
> I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container 
> '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr
> I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing 
> executor's forked pid 37096 to 
> '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks
> I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000
> I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame
> I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf

[jira] [Updated] (MESOS-1478) Replace Master/Slave terminology

2015-10-21 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-1478:
-
Epic Name: Agent Rename

> Replace Master/Slave terminology
> 
>
> Key: MESOS-1478
> URL: https://issues.apache.org/jira/browse/MESOS-1478
> Project: Mesos
>  Issue Type: Epic
>Reporter: Clark Breyman
>Assignee: Benjamin Hindman
>Priority: Minor
>  Labels: mesosphere
>
> Inspired by the comments on this PR:
> https://github.com/django/django/pull/2692
> TL;DR - Computers sharing work should be a good thing. Using the language of 
> human bondage and suffering is inappropriate in this context. It also has the 
> potential to alienate users and community members. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1478) Replace Master/Slave terminology

2015-10-21 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-1478:
-
Issue Type: Epic  (was: Wish)

> Replace Master/Slave terminology
> 
>
> Key: MESOS-1478
> URL: https://issues.apache.org/jira/browse/MESOS-1478
> Project: Mesos
>  Issue Type: Epic
>Reporter: Clark Breyman
>Assignee: Benjamin Hindman
>Priority: Minor
>  Labels: mesosphere
>
> Inspired by the comments on this PR:
> https://github.com/django/django/pull/2692
> TL;DR - Computers sharing work should be a good thing. Using the language of 
> human bondage and suffering is inappropriate in this context. It also has the 
> potential to alienate users and community members. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1607) Introduce optimistic offers.

2015-10-21 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967410#comment-14967410
 ] 

Joseph Wu commented on MESOS-1607:
--

We plan to release the MVP before the end of this year, so tentatively sometime 
between v0.26.0 and v0.28.0.

> Introduce optimistic offers.
> 
>
> Key: MESOS-1607
> URL: https://issues.apache.org/jira/browse/MESOS-1607
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, framework, master
>Reporter: Benjamin Hindman
>Assignee: Artem Harutyunyan
>  Labels: mesosphere
> Attachments: optimisitic-offers.pdf
>
>
> *Background*
> The current implementation of resource offers only enable a single framework 
> scheduler to make scheduling decisions for some available resources at a 
> time. In some circumstances, this is good, i.e., when we don't want other 
> framework schedulers to have access to some resources. However, in other 
> circumstances, there are advantages to letting multiple framework schedulers 
> attempt to make scheduling decisions for the _same_ allocation of resources 
> in parallel.
> If you think about this from a "concurrency control" perspective, the current 
> implementation of resource offers is _pessimistic_, the resources contained 
> within an offer are _locked_ until the framework scheduler that they were 
> offered to launches tasks with them or declines them. In addition to making 
> pessimistic offers we'd like to give out _optimistic_ offers, where the same 
> resources are offered to multiple framework schedulers at the same time, and 
> framework schedulers "compete" for those resources on a 
> first-come-first-serve basis (i.e., the first to launch a task "wins"). We've 
> always reserved the right to rescind resource offers using the 'rescind' 
> primitive in the API, and a framework scheduler should be prepared to launch 
> a task and have those tasks go lost because another framework already started 
> to use those resources.
> *Feature*
> We plan to take a step towards optimistic offers, by introducing primitives 
> that allow resources to be offered to multiple frameworks at once.  At first, 
> we will use these primitives to optimistically allocate resources that are 
> reserved for a particular framework/role but have not been allocated by that 
> framework/role.  
> The work with optimistic offers will closely resemble the existing 
> oversubscription feature.  Optimistically offered resources are likely to be 
> considered "revocable resources" (the concept that using resources not 
> reserved for you means you might get those resources revoked). In effect, we 
> may create something like a "spot" market for unused resources, driving 
> up utilization by letting frameworks that are willing to use revocable 
> resources run tasks.
> *Future Work*
> This ticket tracks the introduction of some aspects of optimistic offers.  
> Taken to the limit, one could imagine always making optimistic resource 
> offers. This bears a striking resemblance with the Google Omega model (an 
> isomorphism even). However, being able to configure what resources should be 
> allocated optimistically and what resources should be allocated 
> pessimistically gives even more control to a datacenter/cluster operator that 
> might want to, for example, never let multiple frameworks (roles) compete for 
> some set of resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3752) CentOS 6 dependency install fails at Maven

2015-10-21 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967396#comment-14967396
 ] 

Greg Mann commented on MESOS-3752:
--

Ding-Yi Chen updated the Maven package, and it now successfully installs on 
CentOS 6, but I'm currently seeing some compilation errors that might be 
related, so I'm leaving the ticket open for the time being.

> CentOS 6 dependency install fails at Maven
> --
>
> Key: MESOS-3752
> URL: https://issues.apache.org/jira/browse/MESOS-3752
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: documentation, installation, mesosphere
>
> It seems the Apache Maven dependencies have changed such that following the 
> Getting Started docs for CentOS 6.6 will fail at Maven installation:
> {code}
> ---> Package apache-maven.noarch 0:3.3.3-2.el6 will be installed
> --> Processing Dependency: java-devel >= 1:1.7.0 for package: 
> apache-maven-3.3.3-2.el6.noarch
> --> Finished Dependency Resolution
> Error: Package: apache-maven-3.3.3-2.el6.noarch (epel-apache-maven)
>Requires: java-devel >= 1:1.7.0
>Available: java-1.5.0-gcj-devel-1.5.0.0-29.1.el6.x86_64 (base)
>java-devel = 1.5.0
>Available: 
> 1:java-1.6.0-openjdk-devel-1.6.0.35-1.13.7.1.el6_6.x86_64 (base)
>java-devel = 1:1.6.0
>Available: 
> 1:java-1.6.0-openjdk-devel-1.6.0.36-1.13.8.1.el6_7.x86_64 (updates)
>java-devel = 1:1.6.0
>  You could try using --skip-broken to work around the problem
>  You could try running: rpm -Va --nofiles --nodigest
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3326) Make use of C++11 atomics

2015-10-21 Thread Joris Van Remoortere (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van Remoortere updated MESOS-3326:

Labels: mesosphere  (was: )

> Make use of C++11 atomics
> -
>
> Key: MESOS-3326
> URL: https://issues.apache.org/jira/browse/MESOS-3326
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
> Fix For: 0.26.0
>
>
> Now that we require C++11, we can make use of std::atomic. For example:
> * libprocess/process.cpp uses a bare int + __sync_synchronize() for "running"
> * __sync_synchronize() is used in logging.hpp in libprocess and fork.hpp in 
> stout
> * sched/sched.cpp uses a volatile int for "running" -- this is wrong, 
> "volatile" is not sufficient to ensure safe concurrent access
> * "volatile" is used in a few other places -- most are probably dubious but I 
> haven't looked closely
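
For illustration, a minimal standalone sketch of the intended replacement (not 
the actual patch): a shared "running" flag as {{std::atomic}} instead of a bare 
or volatile int with {{__sync_synchronize()}}:

{code}
#include <atomic>
#include <iostream>
#include <thread>

// std::atomic gives well-defined concurrent reads/writes; "volatile" does not.
std::atomic<bool> running(true);

int main()
{
  std::thread worker([]() {
    while (running.load()) {  // safe concurrent read
      std::this_thread::yield();
    }
  });

  running.store(false);  // safe concurrent write with proper ordering
  worker.join();
  std::cout << "stopped cleanly" << std::endl;
  return 0;
}
{code}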



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2953) git rebase --continue does not trigger hooks

2015-10-21 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967270#comment-14967270
 ] 

Joris Van Remoortere commented on MESOS-2953:
-

{code}
commit 8de47a8ef27288d7660ee3a5e40874def912b8c2
Author: haosdent huang 
Date:   Wed Oct 21 10:56:16 2015 -0400

Removed unnecessary exec in post-rewrite hook.

Review: https://reviews.apache.org/r/39506
{code}

> git rebase --continue does not trigger hooks
> 
>
> Key: MESOS-2953
> URL: https://issues.apache.org/jira/browse/MESOS-2953
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joris Van Remoortere
>Assignee: haosdent
>
> Currently there are no git hooks run when executing {{git rebase 
> --continue}}. We do run hooks on {{git commit}}.
> It would help prevent errors if we could also run some of these hooks on the 
> {{git rebase --continue}} flow as this one is rather common.
> I believe we can use the 'post-rewrite' hook to accomplish this. It will not 
> necessarily unwind the commit, but at least give us the opportunity to print 
> warning messages.
> If this is not desirable / feasible I would like to propose running the hooks 
> as part of post-reviews instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3785) Use URI content modification time to trigger fetcher cache updates.

2015-10-21 Thread Bernd Mathiske (JIRA)
Bernd Mathiske created MESOS-3785:
-

 Summary: Use URI content modification time to trigger fetcher 
cache updates.
 Key: MESOS-3785
 URL: https://issues.apache.org/jira/browse/MESOS-3785
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher
Reporter: Bernd Mathiske
Assignee: Benjamin Bannier


Instead of using checksums to trigger fetcher cache updates, we can for 
starters use the content modification time (mtime), which is available for a 
number of download protocols, e.g. HTTP and HDFS.

Proposal: Instead of just fetching the content size, we fetch both size and 
mtime together. As before, if there is no size, then caching fails and we fall 
back on direct downloading to the sandbox. 

Assuming a size is given, we compare the mtime from the fetch URI with the 
mtime known to the cache. If it differs, we update the cache. (As a defensive 
measure, a difference in size should also trigger an update.) 

Not having an mtime available at the fetch URI is simply treated as a unique 
valid mtime value that differs from all others. This means that when initially 
there is no mtime, cache content remains valid until there is one. Thereafter, 
a new lack of an mtime invalidates the cache once. In other words: any change 
from no mtime to having one or back is the same as encountering a new mtime.

Note that this scheme does not require any new protobuf fields.
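
For illustration, a sketch of the proposed update check (assumed types; the 
real code would use stout's Option and the fetcher cache's entry struct). A 
missing mtime compares equal to another missing mtime but differs from every 
concrete value, so gaining or losing an mtime triggers exactly one update:

{code}
#include <iostream>

// Minimal stand-in for a cache entry's metadata.
struct Metadata
{
  long size;
  bool hasMtime;
  long mtime;  // meaningful only if hasMtime is true
};

bool needsUpdate(const Metadata& cached, const Metadata& fetched)
{
  if (cached.size != fetched.size) {
    return true;  // defensive: a size change always invalidates
  }
  if (cached.hasMtime != fetched.hasMtime) {
    return true;  // gained or lost an mtime: counts as a new mtime
  }
  return cached.hasMtime && cached.mtime != fetched.mtime;
}

int main()
{
  Metadata cached{1024, true, 1445400000};
  Metadata same{1024, true, 1445400000};
  Metadata touched{1024, true, 1445403600};
  Metadata noMtime{1024, false, 0};

  std::cout << needsUpdate(cached, same)      // 0: unchanged, cache valid
            << needsUpdate(cached, touched)   // 1: new mtime, update
            << needsUpdate(cached, noMtime)   // 1: mtime disappeared, update once
            << needsUpdate(noMtime, noMtime)  // 0: still no mtime, stays valid
            << std::endl;                     // prints "0110"
  return 0;
}
{code}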




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3784) Replace Master/Slave Terminology Phase I - Update mesos-cli

2015-10-21 Thread Diana Arroyo (JIRA)
Diana Arroyo created MESOS-3784:
---

 Summary: Replace Master/Slave Terminology Phase I - Update 
mesos-cli 
 Key: MESOS-3784
 URL: https://issues.apache.org/jira/browse/MESOS-3784
 Project: Mesos
  Issue Type: Task
Reporter: Diana Arroyo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3783) Replace Master/Slave Terminology Phase I - Update documentation

2015-10-21 Thread Diana Arroyo (JIRA)
Diana Arroyo created MESOS-3783:
---

 Summary: Replace Master/Slave Terminology Phase I - Update 
documentation 
 Key: MESOS-3783
 URL: https://issues.apache.org/jira/browse/MESOS-3783
 Project: Mesos
  Issue Type: Task
Reporter: Diana Arroyo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags

2015-10-21 Thread Diana Arroyo (JIRA)
Diana Arroyo created MESOS-3781:
---

 Summary: Replace Master/Slave Terminology Phase I - Add duplicate 
agent flags 
 Key: MESOS-3781
 URL: https://issues.apache.org/jira/browse/MESOS-3781
 Project: Mesos
  Issue Type: Task
Reporter: Diana Arroyo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3782) Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create symlinks)

2015-10-21 Thread Diana Arroyo (JIRA)
Diana Arroyo created MESOS-3782:
---

 Summary: Replace Master/Slave Terminology Phase I - Add duplicate 
binaries (or create symlinks)
 Key: MESOS-3782
 URL: https://issues.apache.org/jira/browse/MESOS-3782
 Project: Mesos
  Issue Type: Task
Reporter: Diana Arroyo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3780) Replace Master/Slave Terminology Phase I - Update all strings output

2015-10-21 Thread Diana Arroyo (JIRA)
Diana Arroyo created MESOS-3780:
---

 Summary: Replace Master/Slave Terminology Phase I - Update all 
strings output
 Key: MESOS-3780
 URL: https://issues.apache.org/jira/browse/MESOS-3780
 Project: Mesos
  Issue Type: Task
Reporter: Diana Arroyo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3778) Replace Master/Slave Terminology Phase I - Add duplicate HTTP endpoints

2015-10-21 Thread Diana Arroyo (JIRA)
Diana Arroyo created MESOS-3778:
---

 Summary: Replace Master/Slave Terminology Phase I - Add duplicate 
HTTP endpoints
 Key: MESOS-3778
 URL: https://issues.apache.org/jira/browse/MESOS-3778
 Project: Mesos
  Issue Type: Task
Reporter: Diana Arroyo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3777) Replace Master/Slave Terminology Phase I - Modify public interfaces

2015-10-21 Thread Diana Arroyo (JIRA)
Diana Arroyo created MESOS-3777:
---

 Summary: Replace Master/Slave Terminology Phase I - Modify public 
interfaces 
 Key: MESOS-3777
 URL: https://issues.apache.org/jira/browse/MESOS-3777
 Project: Mesos
  Issue Type: Task
Reporter: Diana Arroyo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3779) Replace Master/Slave Terminology Phase I - Update webui

2015-10-21 Thread Diana Arroyo (JIRA)
Diana Arroyo created MESOS-3779:
---

 Summary: Replace Master/Slave Terminology Phase I - Update webui
 Key: MESOS-3779
 URL: https://issues.apache.org/jira/browse/MESOS-3779
 Project: Mesos
  Issue Type: Task
Reporter: Diana Arroyo






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2092) Make ACLs dynamic

2015-10-21 Thread Yong Qiao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Qiao Wang reassigned MESOS-2092:
-

Assignee: Yong Qiao Wang

> Make ACLs dynamic
> -
>
> Key: MESOS-2092
> URL: https://issues.apache.org/jira/browse/MESOS-2092
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Alexander Rukletsov
>Assignee: Yong Qiao Wang
>  Labels: mesosphere, newbie
>
> Master loads ACLs once during its launch and there is no way to update them 
> in a running master. Making them dynamic will allow for updating ACLs on the 
> fly, for example granting a new framework necessary rights.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3706) Tasks stuck in staging.

2015-10-21 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966818#comment-14966818
 ] 

haosdent commented on MESOS-3706:
-

As you say, stdout and stderr are empty. Could you try using strace or gdb to 
attach to the mesos-docker-executor and see where it hangs? For example, for 
the case you describe in the description, try
{noformat}
strace -p 35360
{noformat}

to find which syscall mesos-docker-executor is blocked on, and then post the 
result here.

> Tasks stuck in staging.
> ---
>
> Key: MESOS-3706
> URL: https://issues.apache.org/jira/browse/MESOS-3706
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Affects Versions: 0.23.0, 0.24.1
>Reporter: Jord Sonneveld
> Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, Screen Shot 
> 2015-10-12 at 9.24.32 AM.png, docker.txt, mesos-slave.INFO, 
> mesos-slave.INFO.2, mesos-slave.INFO.3, stderr, stdout
>
>
> I have a docker image which starts fine on all my slaves except for one.  On 
> that one, it is stuck in STAGING for a long time and never starts.  The INFO 
> log is full of messages like this:
> I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task 
> kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework 
> 20150109-172016-504433162-5050-19367-0002
> E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12: 
> Transport endpoint is not connected [107]
> kwe-vinland-work is the task that is stuck in staging.  It is launched by 
> marathon.  I have launched 161 instances successfully on my cluster.  But it 
> refuses to launch on this specific slave.
> These machines are all managed via ansible so their configurations are / 
> should be identical.  I have re-run my ansible scripts and rebooted the 
> machines to no avail.
> It's been in this state for almost 30 minutes.  You can see the mesos docker 
> executor is still running:
> jord@dalstgmesos03:~$ date
> Mon Oct 12 16:13:55 UTC 2015
> jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland
> root 35360  0.0  0.0 1070576 21476 ?   Ssl  15:46   0:00 
> mesos-docker-executor 
> --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe
>  --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox 
> --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe
>  --stop_timeout=0ns
> According to docker ps -a, nothing was ever even launched:
> jord@dalstgmesos03:/data/mesos$ sudo docker ps -a
> CONTAINER IDIMAGE  
> COMMAND  CREATED STATUS  PORTS
> NAMES
> 5c858b90b0a0registry.roger.dal.moz.com:5000/moz-statsd-v0.22   
> "/bin/sh -c ./start.s"   39 minutes ago  Up 39 minutes   
> 0.0.0.0:9125->8125/udp, 0.0.0.0:9126->8126/tcp   statsd-fe-influxdb
> d765ba3829fdregistry.roger.dal.moz.com:5000/moz-statsd-v0.22   
> "/bin/sh -c ./start.s"   41 minutes ago  Up 41 minutes   
> 0.0.0.0:8125->8125/udp, 0.0.0.0:8126->8126/tcp   statsd-repeater
> Those are the only two entries. Nothing about the kwe-vinland job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1832) Slave should accept PingSlaveMessage but not "PING" message.

2015-10-21 Thread Yong Qiao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966799#comment-14966799
 ] 

Yong Qiao Wang commented on MESOS-1832:
---

[~vinodkone], here is the related RR for this ticket: 
https://reviews.apache.org/r/39516/

> Slave should accept PingSlaveMessage but not "PING" message.
> 
>
> Key: MESOS-1832
> URL: https://issues.apache.org/jira/browse/MESOS-1832
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Yong Qiao Wang
>  Labels: mesosphere
>
> The slave handles both the "PING" message and PingSlaveMessage until 0.22.0 
> for backwards compatibility (https://reviews.apache.org/r/25867/).
> In 0.23.0, the slave no longer needs to handle "PING".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3224) Create a Mesos Contributor Newbie Guide

2015-10-21 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966771#comment-14966771
 ] 

haosdent commented on MESOS-3224:
-

Should we move this to a GitHub pull request or Review Board?

> Create a Mesos Contributor Newbie Guide
> ---
>
> Key: MESOS-3224
> URL: https://issues.apache.org/jira/browse/MESOS-3224
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Timothy Chen
>Assignee: Diana Arroyo
>
> Currently the website doesn't have a helpful guide for community users to 
> know how to start learning to contribute to Mesos, understand the concepts 
> and lower the barrier to get involved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3688) Get Container Name information when launching a container task

2015-10-21 Thread Raffaele Di Fazio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966746#comment-14966746
 ] 

Raffaele Di Fazio commented on MESOS-3688:
--

Do you have an update on this? I'm just checking whether you need more 
information from me.

> Get Container Name information when launching a container task
> --
>
> Key: MESOS-3688
> URL: https://issues.apache.org/jira/browse/MESOS-3688
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 0.24.1
>Reporter: Raffaele Di Fazio
>  Labels: mesosphere
>
> We want to get the Docker Name (or Docker ID, or both) when launching a 
> container task with mesos. The container name is generated by mesos itself 
> (i.e. mesos-77e5fde6-83e7-4618-a2dd-d5b10f2b4d25, obtained with "docker ps") 
> and it would be nice to expose this information to frameworks so that this 
> information can be used, for example by Marathon to give this information to 
> users via a REST API. 
> To go a bit in depth with our use case, we have files created by fluentd 
> logdriver that are named with the Docker Name or Docker ID (full or short), 
> and we need a mapping for the users of the REST API; thus the first step is 
> to make this information available from mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3776) Support SELinux docker volume modes

2015-10-21 Thread James Findley (JIRA)
James Findley created MESOS-3776:


 Summary: Support SELinux docker volume modes
 Key: MESOS-3776
 URL: https://issues.apache.org/jira/browse/MESOS-3776
 Project: Mesos
  Issue Type: Bug
  Components: docker
Reporter: James Findley
Priority: Minor


Since docker 1.7, two additional volume modes are supported on top of 'ro' and 
'rw': 'z' and 'Z'. These set the SELinux mode of the volume to be accessible 
from every container or just this container, respectively.

See 
http://www.projectatomic.io/blog/2015/06/using-volumes-with-docker-can-cause-problems-with-selinux/
 for more info on this.

It would be great if mesos were to support these volume modes for better 
container security.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3775) MasterAllocatorTest.SlaveLost is slow

2015-10-21 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-3775:
--

 Summary: MasterAllocatorTest.SlaveLost is slow
 Key: MESOS-3775
 URL: https://issues.apache.org/jira/browse/MESOS-3775
 Project: Mesos
  Issue Type: Bug
  Components: technical debt, test
Reporter: Alexander Rukletsov
Priority: Minor


The {{MasterAllocatorTest.SlaveLost}} test takes more than {{5s}} to complete. A 
brief look into the code hints that the stopped agent does not quit immediately 
(and hence its resources are not released by the allocator) because [it waits 
for the executor to 
terminate|https://github.com/apache/mesos/blob/master/src/tests/master_allocator_tests.cpp#L717].
 The {{5s}} timeout comes from the {{EXECUTOR_SHUTDOWN_GRACE_PERIOD}} agent constant.

Possible solutions:
* Do not wait until the stopped agent quits (can be flaky, needs deeper 
analysis).
* Decrease the agent's {{executor_shutdown_grace_period}} flag.
* Terminate the executor faster (this may require some refactoring since the 
executor driver is created in the {{TestContainerizer}} and we do not have 
direct access to it).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2936) Create a design document for Quota support in Master

2015-10-21 Thread Yong Qiao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966524#comment-14966524
 ] 

Yong Qiao Wang commented on MESOS-2936:
---

Hey, [~alexr], we are planning to add a separate endpoint /roles to add/remove 
roles in the Role Dynamic Configuration project (MESOS-3177). I noticed that the 
quota request design can also be used to add a role, so there is some overlap 
between these two projects. In my understanding, quota should be viewed as an 
attribute of a role, so in the quota management project the quota update 
action (PUT) should be enough to manage (add/update/remove) the quota of an 
existing role; if the role does not exist, you would first create it with the 
/roles endpoint before configuring its quota. So maybe we need to remove the 
quota request action from the quota management endpoint. [~alexr], any 
thoughts on this?

In addition, you are welcome to review the role dynamic configuration design; 
your comments are important to me. Thanks in advance.

> Create a design document for Quota support in Master
> 
>
> Key: MESOS-2936
> URL: https://issues.apache.org/jira/browse/MESOS-2936
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Create a design document for the Quota feature support in Mesos Master 
> (excluding allocator) to be shared with the Mesos community.
> Design Doc:
> https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2742) Architecture doc on global resources

2015-10-21 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966471#comment-14966471
 ] 

Klaus Ma commented on MESOS-2742:
-

[~jieyu]/[~vi...@twitter.com], the draft document for Global Resources has been 
uploaded; would you add some input on it?

> Architecture doc on global resources
> 
>
> Key: MESOS-2742
> URL: https://issues.apache.org/jira/browse/MESOS-2742
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Joerg Schad
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-21 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966463#comment-14966463
 ] 

Liqiang Lin commented on MESOS-3747:


RR: https://reviews.apache.org/r/39514/

> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Assignee: Liqiang Lin
>Priority: Blocker
>
> When using libmesos, a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent process is running as; this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host; instead it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.262684 19638 docker.cpp:734] No container info found, skipping 
> launch
> I1015 13:15:34.263478 19638 containerizer.cpp:640] Starting container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework 
> 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-'
> E1015 13:15:34.264516 19641 slave.cpp:3342] Container 
> '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework 
> 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-'

[jira] [Commented] (MESOS-3477) Add design doc for roles/weights configuration

2015-10-21 Thread Yong Qiao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966441#comment-14966441
 ] 

Yong Qiao Wang commented on MESOS-3477:
---

Hi [~adam-mesos], [~cmaloney], the design doc has been updated; could you give 
it another review? Any comments are welcome. Thanks!

> Add design doc for roles/weights configuration
> --
>
> Key: MESOS-3477
> URL: https://issues.apache.org/jira/browse/MESOS-3477
> Project: Mesos
>  Issue Type: Documentation
>  Components: master
>Reporter: Yong Qiao Wang
>Assignee: Yong Qiao Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string

2015-10-21 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966408#comment-14966408
 ] 

Liqiang Lin commented on MESOS-3747:


Yes. If --[no-]switch_user is false, tasks will run as the same user as the 
Mesos agent process. Neither the framework scheduler nor the Mesos master can 
know which Mesos agents have --[no-]switch_user set to true and which have it 
set to false. We should pass the framework user info anyway and let the Mesos 
agent decide whether to switch to the framework user or not. If the framework 
user does not exist on that Mesos agent, the framework tasks just fail, as 
[~gyliu] posted.

> HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
> -
>
> Key: MESOS-3747
> URL: https://issues.apache.org/jira/browse/MESOS-3747
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0, 0.24.1, 0.25.0
>Reporter: Ben Whitehead
>Assignee: Liqiang Lin
>Priority: Blocker
>
> When using libmesos, a framework can set its user to {{""}} (empty string) to 
> inherit the user the agent process is running as; this behavior now results 
> in a {{TASK_FAILED}}.
> Full messages and relevant agent logs below.
> The error returned to the framework tells me nothing about the user not 
> existing on the agent host; instead it tells me the container died due to OOM.
> {code:title=FrameworkInfo}
> call {
> type: SUBSCRIBE
> subscribe: {
> frameworkInfo: {
> user: "",
> name: "testing"
> }
> }
> }
> {code}
> {code:title=TaskInfo}
> call {
> framework_id { value: "20151015-125949-16777343-5050-20146-" },
> type: ACCEPT,
> accept { 
> offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }],
> operations { 
> type: LAUNCH, 
> launch { 
> task_infos [
> {
> name: "task-1",
> task_id: { value: "task-1" },
> agent_id: { value: 
> "20151015-125949-16777343-5050-20146-S0" },
> resources [
> { name: "cpus", type: SCALAR, scalar: { value: 
> 0.1 },  role: "*" },
> { name: "mem",  type: SCALAR, scalar: { value: 
> 64.0 }, role: "*" },
> { name: "disk", type: SCALAR, scalar: { value: 
> 0.0 },  role: "*" },
> ],
> command: { 
> environment { 
> variables [ 
> { name: "SLEEP_SECONDS" value: "15" } 
> ] 
> },
> value: "env | sort && sleep $SLEEP_SECONDS"
> }
> }
> ]
>  }
>  }
>  }
> }
> {code}
> {code:title=Update Status}
> event: {
> type: UPDATE,
> update: { 
> status: { 
> task_id: { value: "task-1" }, 
> state: TASK_FAILED,
> message: "Container destroyed while preparing isolators",
> agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, 
> timestamp: 1.444939217401241E9,
> executor_id: { value: "task-1" },
> source: SOURCE_AGENT, 
> reason: REASON_MEMORY_LIMIT,
> uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" 
> } 
> }
> }
> {code}
> {code:title=agent logs}
> I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b':
>  Failed to get user information for '': Success
> I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of 
> framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b'
> I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for 
> executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-
> I1015 13:15:34.262684 19638 docker.cpp:734] No container 

[jira] [Commented] (MESOS-3766) Can not kill task in Status STAGING

2015-10-21 Thread Matthias Veit (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966372#comment-14966372
 ] 

Matthias Veit commented on MESOS-3766:
--

[~nnielsen] Added the complete master and slave logs. I killed the mesos master 
and slave processes, so I can no longer query the endpoints.
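
For context, each "Asked to kill task" line in the attached logs corresponds to 
a framework-side kill call. A minimal v0-API sketch (the running {{driver}} is 
assumed; the task ID is taken from the log below):

{code}
#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

// 'driver' is an already-running mesos::MesosSchedulerDriver (assumed).
mesos::TaskID taskId;
taskId.set_value("app.dc98434b-7716-11e5-a5fc-1ea69edef42d");

// Marathon re-sends this roughly every 5 seconds while the task stays STAGING.
driver->killTask(taskId);
{code}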

> Can not kill task in Status STAGING
> ---
>
> Key: MESOS-3766
> URL: https://issues.apache.org/jira/browse/MESOS-3766
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.25.0
> Environment: OSX 
>Reporter: Matthias Veit
>Assignee: Niklas Quarfot Nielsen
> Attachments: master.log.zip, slave.log.zip
>
>
> I have created a simple Marathon application with instance count 100 (100 
> tasks) and a simple sleep command. Before all tasks were running, I killed 
> all tasks. This operation was successful except for 2 tasks. These 2 tasks are 
> in state STAGING (according to the Mesos UI). Marathon has been trying to kill 
> those tasks every 5 seconds (for over an hour now), unsuccessfully.
> I picked one task and grepped the slave log:
> {noformat}
> I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour
> I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80
> I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container 
> '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr
> I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing 
> executor's forked pid 37096 to 
> '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks
> I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000
> I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame
> I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:18.316463 3160

[jira] [Commented] (MESOS-3506) Build instructions for CentOS 6.6 should include `sudo yum update`

2015-10-21 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966366#comment-14966366
 ] 

Adam B commented on MESOS-3506:
---

That's a good point. We could instead just recommend `sudo yum update 
someDependency`, if we can figure out what the necessary dependency is.

> Build instructions for CentOS 6.6 should include `sudo yum update`
> --
>
> Key: MESOS-3506
> URL: https://issues.apache.org/jira/browse/MESOS-3506
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: documentation, mesosphere
>
> Neglecting to run {{sudo yum update}} on CentOS 6.6 currently causes the 
> build to break when building {{mesos-0.25.0.jar}}. The build instructions for 
> this platform on the Getting Started page should be changed accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3766) Can not kill task in Status STAGING

2015-10-21 Thread Matthias Veit (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Veit updated MESOS-3766:
-
Attachment: master.log.zip

> Can not kill task in Status STAGING
> ---
>
> Key: MESOS-3766
> URL: https://issues.apache.org/jira/browse/MESOS-3766
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.25.0
> Environment: OSX 
>Reporter: Matthias Veit
>Assignee: Niklas Quarfot Nielsen
> Attachments: master.log.zip, slave.log.zip
>
>
> I have created a simple Marathon application with instance count 100 (100 
> tasks) and a simple sleep command. Before all tasks were running, I killed 
> all tasks. This operation was successful except for 2 tasks. These 2 tasks are 
> in state STAGING (according to the Mesos UI). Marathon has been trying to kill 
> those tasks every 5 seconds (for over an hour now), unsuccessfully.
> I picked one task and grepped the slave log:
> {noformat}
> I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour
> I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80
> I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container 
> '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr
> I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing 
> executor's forked pid 37096 to 
> '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks
> I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000
> I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame
> I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:18.316463 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:23.337116 313872384

[jira] [Updated] (MESOS-3766) Can not kill task in Status STAGING

2015-10-21 Thread Matthias Veit (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Veit updated MESOS-3766:
-
Attachment: slave.log.zip

> Can not kill task in Status STAGING
> ---
>
> Key: MESOS-3766
> URL: https://issues.apache.org/jira/browse/MESOS-3766
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.25.0
> Environment: OSX 
>Reporter: Matthias Veit
>Assignee: Niklas Quarfot Nielsen
> Attachments: slave.log.zip
>
>
> I have created a simple Marathon application with instance count 100 (100 
> tasks) and a simple sleep command. Before all tasks were running, I killed 
> all tasks. This operation was successful except for 2 tasks. These 2 tasks are 
> in state STAGING (according to the Mesos UI). Marathon has been trying to kill 
> those tasks every 5 seconds (for over an hour now), unsuccessfully.
> I picked one task and grepped the slave log:
> {noformat}
> I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour
> I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80
> I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container 
> '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr
> I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing 
> executor's forked pid 37096 to 
> '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks
> I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000
> I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame
> I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:18.316463 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:23.337116 313872384 slave.cpp:1789] 

[jira] [Comment Edited] (MESOS-2315) Deprecate / Remove CommandInfo::ContainerInfo

2015-10-21 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966347#comment-14966347
 ] 

Adam B edited comment on MESOS-2315 at 10/21/15 7:13 AM:
-

Sounds about right, but...
The first example "with old ContainerInfo" doesn't actually use the old 
ContainerInfo. I don't remember the exact semantics for how various 
containerizers used the options field ([~tillt] might), but you should probably 
have something like:
{code}
CommandInfo::ContainerInfo containerInfo;
containerInfo.set_image("busybox");
task.mutable_command()->mutable_container()->CopyFrom(containerInfo);
{code}
Also, if you're going to use ContainerInfo::MesosInfo, you should probably set 
the image, or at least add a volume. Otherwise, you're just getting the default 
Mesos containerization behavior, which you get even without a ContainerInfo.
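
For contrast, a hedged sketch of the top-level replacement (field names assumed 
from mesos.proto, not verified against every release): a ContainerInfo of type 
MESOS with an image set, attached to the same {{task}} as above:
{code}
mesos::ContainerInfo container;
container.set_type(mesos::ContainerInfo::MESOS);

// Give the Mesos containerizer an actual image, per the comment above.
mesos::Image* image = container.mutable_mesos()->mutable_image();
image->set_type(mesos::Image::DOCKER);
image->mutable_docker()->set_name("busybox");

task.mutable_container()->CopyFrom(container);
{code}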


was (Author: adam-mesos):
Sounds about right, but...
The first example "with old ContainerInfo" doesn't actually use the old 
ContainerInfo. I don't remember the exact semantics for how various 
containerizers used the options field ([~tillt] might), but you should probably 
have something like:
```
CommandInfo::ContainerInfo containerInfo;
containerInfo.set_image("busybox");
task.mutable_command()->mutable_container()->CopyFrom(containerInfo);
```
Also, if you're going to use ContainerInfo::MesosInfo, you should probably set 
the image, or at least add a volume. Otherwise, you're just getting the default 
Mesos containerization behavior, which you get even without a ContainerInfo.

> Deprecate / Remove CommandInfo::ContainerInfo
> -
>
> Key: MESOS-2315
> URL: https://issues.apache.org/jira/browse/MESOS-2315
> Project: Mesos
>  Issue Type: Task
>Reporter: Ian Downes
>Assignee: Vaibhav Khanduja
>Priority: Minor
>  Labels: mesosphere, newbie
>
> IIUC this has been deprecated and all current code (except 
> examples/docker_no_executor_framework.cpp) uses the top-level ContainerInfo?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2315) Deprecate / Remove CommandInfo::ContainerInfo

2015-10-21 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966347#comment-14966347
 ] 

Adam B commented on MESOS-2315:
---

Sounds about right, but...
The first example "with old ContainerInfo" doesn't actually use the old 
ContainerInfo. I don't remember the exact semantics for how various 
containerizers used the options field ([~tillt] might), but you should probably 
have something like:
```
CommandInfo::ContainerInfo containerInfo;
containerInfo.set_image("busybox");
task.mutable_command()->mutable_container()->CopyFrom(containerInfo);
```
Also, if you're going to use ContainerInfo::MesosInfo, you should probably set 
the image, or at least add a volume. Otherwise, you're just getting the default 
Mesos containerization behavior, which you get even without a ContainerInfo.

> Deprecate / Remove CommandInfo::ContainerInfo
> -
>
> Key: MESOS-2315
> URL: https://issues.apache.org/jira/browse/MESOS-2315
> Project: Mesos
>  Issue Type: Task
>Reporter: Ian Downes
>Assignee: Vaibhav Khanduja
>Priority: Minor
>  Labels: mesosphere, newbie
>
> IIUC this has been deprecated and all current code (except 
> examples/docker_no_executor_framework.cpp) uses the top-level ContainerInfo?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)