[jira] [Commented] (MESOS-2275) Document header include rules in style guide
[ https://issues.apache.org/jira/browse/MESOS-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968658#comment-14968658 ] Benjamin Bannier commented on MESOS-2275:
-
We would probably also want an example that makes it clear whether, within each component, we use a pure lexicographical sort or instead enforce some residual level of logical ordering. E.g. {{clang-format}} (from trunk) prefers a lexicographical sort
{code}
#include
#include
{code}
while one could also imagine the opposite ordering, which emphasizes {{foo.hpp}} as some sort of "heading header" (currently not supported by {{clang-format}}). The Google style guide asks for "alphabetical ordering", which isn't helpful here.
> Document header include rules in style guide
>
> Key: MESOS-2275
> URL: https://issues.apache.org/jira/browse/MESOS-2275
> Project: Mesos
> Issue Type: Improvement
> Reporter: Niklas Quarfot Nielsen
> Assignee: Jan Schlicht
> Priority: Trivial
> Labels: beginner, docathon, mesosphere
>
> We have several ways of sorting, grouping and ordering header includes in
> Mesos. We should agree on a rule set and do a style scan.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3786) Backticks are not mentioned in Mesos C++ Style Guide
[ https://issues.apache.org/jira/browse/MESOS-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968642#comment-14968642 ] Benjamin Bannier commented on MESOS-3786:
-
[~bmahler] Recent Doxygen (>= 1.8.0) supports Markdown, so backticks cause the enclosed text to be formatted as code. I suspect this is what was intended in the instances you refer to.
> Backticks are not mentioned in Mesos C++ Style Guide
>
> Key: MESOS-3786
> URL: https://issues.apache.org/jira/browse/MESOS-3786
> Project: Mesos
> Issue Type: Documentation
> Reporter: Greg Mann
> Assignee: Greg Mann
> Priority: Minor
> Labels: documentation, mesosphere
>
> As far as I can tell, current practice is to quote code excerpts and object
> names with backticks when writing comments. For example:
> {code}
> // You know, `sadPanda` seems extra sad lately.
> std::string sadPanda;
> sadPanda = " :'( ";
> {code}
> However, I don't see this documented in our C++ style guide at all. It should
> be added.
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968622#comment-14968622 ] Steven Schlansker commented on MESOS-2186:
--
That's a bummer. Thank you, everyone, for your time and for looking into this.
> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
> Reporter: Daniel Hall
> Priority: Critical
> Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not
> resolve in DNS, Mesos will crash and refuse to start. We noticed this issue
> while we were rebuilding one of our zookeeper hosts in Google Compute (which
> bases its DNS on the machines currently running).
> Here is a log from a failed startup (hostnames and IP addresses have been
> sanitised).
> {noformat} > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 > 28627 main.cpp:292] Starting Mesos master > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 > 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: 
*** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 > 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, > zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 > 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 > (mesosmaster-2.internal) started on 10.x.x.x:5050 > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 > 28640 master.cpp:366] Master allowing unauthenticated frameworks to register > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 > 28640 master.cpp:371] Master allowing unauthenticated slave
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968545#comment-14968545 ] Neil Conway commented on MESOS-2186:
[~rgs]: I agree, not sure there's a better fix. You could imagine a client API which hands more control to the user (e.g., zk_create() doesn't take any hostnames, then zk_add_server() takes a single server and returns success/failure), but that probably ends up being similar to just having user code do hostname resolution, and then pass in IPs.
[~stevenschlansker]: I opened [MESOS-3790] to have Mesos retry Zk connection errors that return ENOENT. Otherwise, as far as I know there's nothing else we can do here on the Mesos side. If you disagree, please reopen.
> Mesos crashes if any configured zookeeper does not resolve.
[jira] [Created] (MESOS-3790) Zk connection should retry on EAI_NONAME
Neil Conway created MESOS-3790:
--
Summary: Zk connection should retry on EAI_NONAME
Key: MESOS-3790
URL: https://issues.apache.org/jira/browse/MESOS-3790
Project: Mesos
Issue Type: Bug
Reporter: Neil Conway
Assignee: Neil Conway
Priority: Minor

The zookeeper interface is designed to retry (once per second, for up to ten minutes) if one or more of the ZooKeeper hostnames can't be resolved (see [MESOS-1326] and [MESOS-1523]). However, the current implementation assumes that a DNS resolution failure is indicated by zookeeper_init() returning NULL with errno set to EINVAL (Zk translates getaddrinfo() failures into errno values). The current Zk code actually does:
{code}
static int getaddrinfo_errno(int rc)
{
  switch (rc) {
    case EAI_NONAME:
// ZOOKEEPER-1323 EAI_NODATA and EAI_ADDRFAMILY are deprecated in FreeBSD.
#if defined EAI_NODATA && EAI_NODATA != EAI_NONAME
    case EAI_NODATA:
#endif
      return ENOENT;
    case EAI_MEMORY:
      return ENOMEM;
    default:
      return EINVAL;
  }
}
{code}
getaddrinfo() returns EAI_NONAME when "the node or service is not known"; per the discussion in [MESOS-2186], this seems to happen intermittently due to DNS failures.

Proposed fix: looking at errno is always going to be somewhat fragile, but if we're going to continue doing that, we should check for ENOENT as well as EINVAL.
[jira] [Commented] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968410#comment-14968410 ] haosdent commented on MESOS-3787:
-
For CommandInfo, I think Mesos already sets these environment variables when starting a Docker container in mesos-docker-executor: https://github.com/apache/mesos/blob/master/src/docker/docker.cpp#L421-L422. Does this not work for you?
> As a developer, I'd like to be able to expand environment variables through
> the Docker executor.
>
> Key: MESOS-3787
> URL: https://issues.apache.org/jira/browse/MESOS-3787
> Project: Mesos
> Issue Type: Wish
> Reporter: John Garcia
> Labels: mesosphere
>
> We'd like to have expanded variables usable in [the json files used to create
> a Marathon app, hence] the Task's CommandInfo, so that the executor is able
> to detect the correct values at runtime.
[jira] [Commented] (MESOS-3233) Allow developers to decide whether a HTTP endpoint should use authentication
[ https://issues.apache.org/jira/browse/MESOS-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968314#comment-14968314 ] Qian Zhang commented on MESOS-3233:
---
Why is this decided by the developer? Shouldn't it be up to the operator to control which HTTP endpoints need to use authentication?
> Allow developers to decide whether a HTTP endpoint should use authentication
>
> Key: MESOS-3233
> URL: https://issues.apache.org/jira/browse/MESOS-3233
> Project: Mesos
> Issue Type: Improvement
> Components: security
> Reporter: Alexander Rojas
> Assignee: Alexander Rojas
> Labels: mesosphere, security
>
> Once HTTP Authentication is enabled, developers should be allowed to decide
> which endpoints should require authentication.
[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968275#comment-14968275 ] Joseph Wu commented on MESOS-3771:
--
Looks like our JSON library will never catch this (it's more permissive), which is why none of our unit tests have caught it. I agree that this is a problem, though. I'll see if I can get more eyes on this.
> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
> Issue Type: Bug
> Components: HTTP API
> Affects Versions: 0.24.1, 0.26.0
> Reporter: Steven Schlansker
> Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field. This field
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without
> any regard to proper character encoding:
> {code}
> 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.|
> 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac|
> 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".|
> 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\|
> 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca|
> 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00|
> {code}
> I suspect this is because the HTTP API emits the executorInfo.data directly:
> {code}
> JSON::Object model(const ExecutorInfo& executorInfo)
> {
>   JSON::Object object;
>   object.values["executor_id"] = executorInfo.executor_id().value();
>   object.values["name"] = executorInfo.name();
>   object.values["data"] = executorInfo.data();
>   object.values["framework_id"] = executorInfo.framework_id().value();
>   object.values["command"] = model(executorInfo.command());
>   object.values["resources"] = model(executorInfo.resources());
>   return object;
> }
> {code}
> I think this may be because the custom JSON processing library in stout seems
> to have no idea what a byte array is. I'm guessing that some implicit
> conversion makes it get written as a String instead, but:
> {code}
> inline std::ostream& operator<<(std::ostream& out, const String& string)
> {
>   // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
>   // See RFC4627 for the JSON string specification.
>   return out << picojson::value(string.value).serialize();
> }
> {code}
> Thank you for any assistance here. Our cluster is currently entirely down --
> the frameworks cannot handle parsing the invalid JSON produced (it is not
> even valid UTF-8).
[jira] [Commented] (MESOS-3788) Clarify NetworkInfo semantics for IP addresses and group policies.
[ https://issues.apache.org/jira/browse/MESOS-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968260#comment-14968260 ] Connor Doyle commented on MESOS-3788: - Submitted a patch to modify NetworkInfo (work-in-progress) on reviewboard: [r/39531|https://reviews.apache.org/r/39531]. > Clarify NetworkInfo semantics for IP addresses and group policies. > -- > > Key: MESOS-3788 > URL: https://issues.apache.org/jira/browse/MESOS-3788 > Project: Mesos > Issue Type: Improvement > Components: containerization, isolation >Affects Versions: 0.25.0 >Reporter: Connor Doyle > Labels: mesosphere > > In Mesos 0.25.0, a new message called NetworkInfo was introduced. This > message allows framework authors to communicate with network isolation > modules via a first-class message type to request IP addresses and network > group isolation policies. > Unfortunately, the structure is somewhat confusing to both framework authors > and module implementors. > 1) It's unclear how IP addresses map to virtual interfaces inside the > container. > 2) It's difficult for application developers to understand the final policy > when multiple IP addresses can be assigned with differing isolation policies. > CC [~karya] [~benjaminhindman] [~spikecurtis] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3762) Refactor SSLTest fixture such that MesosTest can use the same helpers.
[ https://issues.apache.org/jira/browse/MESOS-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965883#comment-14965883 ] Joseph Wu edited comment on MESOS-3762 at 10/22/15 12:06 AM:
-
Reviews for:
Step 1) https://reviews.apache.org/r/39498/ https://reviews.apache.org/r/39499/
Step 2 & 3) https://reviews.apache.org/r/39501/
Step 4) https://reviews.apache.org/r/39533/ https://reviews.apache.org/r/39534/

was (Author: kaysoky):
Reviews for:
Step 1) https://reviews.apache.org/r/39498/ https://reviews.apache.org/r/39499/
Step 2 & 3) https://reviews.apache.org/r/39501/

> Refactor SSLTest fixture such that MesosTest can use the same helpers.
>
> Key: MESOS-3762
> URL: https://issues.apache.org/jira/browse/MESOS-3762
> Project: Mesos
> Issue Type: Task
> Components: test
> Reporter: Joseph Wu
> Assignee: Joseph Wu
> Labels: mesosphere
>
> In order to write tests that exercise SSL with other components of Mesos,
> such as the HTTP scheduler library, we need to use the setup/teardown logic
> found in the {{SSLTest}} fixture.
> Currently, the test fixtures have separate inheritance structures like this:
> {code}
> SSLTest <- ::testing::Test
> MesosTest <- TemporaryDirectoryTest <- ::testing::Test
> {code}
> where {{::testing::Test}} is a gtest class.
> The plan is the following:
> # Change {{SSLTest}} to inherit from {{TemporaryDirectoryTest}}. This will
> require moving the setup (generation of keys and certs) from
> {{SetUpTestCase}} to {{SetUp}}. At the same time, *some* of the cleanup
> logic in the SSLTest will not be needed.
> # Move the logic of generating keys/certs into helpers, so that individual
> tests can call them when needed, much like {{MesosTest}}.
> # Write a child class of {{SSLTest}} which has the same functionality as the
> existing {{SSLTest}}, for use by the existing tests that rely on {{SSLTest}}
> or the {{RegistryClientTest}}.
> # Have {{MesosTest}} inherit from {{SSLTest}} (which might be renamed during
> the refactor). If Mesos is not compiled with {{--enable-ssl}}, then
> {{SSLTest}} could be {{#ifdef}}'d into an empty class.
> The resulting structure should be like:
> {code}
> MesosTest      <- SSLTest <- TemporaryDirectoryTest <- ::testing::Test
> ChildOfSSLTest /
> {code}
[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
[ https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968198#comment-14968198 ] Vinod Kone commented on MESOS-3747: --- Let me provide some history. "FrameworkInfo.user" should have been an optional field to begin with. As Marco mentioned, it only comes into play if "--switch-user" flag is set on the agent. More recently, we also added "CommandInfo.user" which takes precedence over "FrameworkInfo.user". I would recommend the following 1) Make FrameworkInfo.user optional in v1/mesos.proto. 2) Fix the agent to return the correct error message (instead of OOM!) if flag.switch_user is true and it cannot determine the user to run the task/executor under. > HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string > - > > Key: MESOS-3747 > URL: https://issues.apache.org/jira/browse/MESOS-3747 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.0, 0.24.1, 0.25.0 >Reporter: Ben Whitehead >Assignee: Liqiang Lin >Priority: Blocker > > When using libmesos a framework can set its user to {{""}} (empty string) to > inherit the user the agent processes is running as, this behavior now results > in a {{TASK_FAILED}}. > Full messages and relevant agent logs below. > The error returned to the framework tells me nothing about the user not > existing on the agent host instead it tells me the container died due to OOM. 
> {code:title=FrameworkInfo} > call { > type: SUBSCRIBE > subscribe: { > frameworkInfo: { > user: "", > name: "testing" > } > } > } > {code} > {code:title=TaskInfo} > call { > framework_id { value: "20151015-125949-16777343-5050-20146-" }, > type: ACCEPT, > accept { > offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }], > operations { > type: LAUNCH, > launch { > task_infos [ > { > name: "task-1", > task_id: { value: "task-1" }, > agent_id: { value: > "20151015-125949-16777343-5050-20146-S0" }, > resources [ > { name: "cpus", type: SCALAR, scalar: { value: > 0.1 }, role: "*" }, > { name: "mem", type: SCALAR, scalar: { value: > 64.0 }, role: "*" }, > { name: "disk", type: SCALAR, scalar: { value: > 0.0 }, role: "*" }, > ], > command: { > environment { > variables [ > { name: "SLEEP_SECONDS" value: "15" } > ] > }, > value: "env | sort && sleep $SLEEP_SECONDS" > } > } > ] > } > } > } > } > {code} > {code:title=Update Status} > event: { > type: UPDATE, > update: { > status: { > task_id: { value: "task-1" }, > state: TASK_FAILED, > message: "Container destroyed while preparing isolators", > agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, > timestamp: 1.444939217401241E9, > executor_id: { value: "task-1" }, > source: SOURCE_AGENT, > reason: REASON_MEMORY_LIMIT, > uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" > } > } > } > {code} > {code:title=agent logs} > I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- > I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- > W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory > '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b': > Failed to get user information for '': Success > I1015 
13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources > cpus(*):0.1; mem(*):32 in work directory > '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b' > I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for > executor task-1 of framework 'e
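Vinod's first recommendation would be a small schema change. A sketch of what it could look like in v1/mesos.proto (surrounding fields abbreviated; treat this as illustrative, not the final patch):

```protobuf
message FrameworkInfo {
  // Was `required`; an absent or empty user means "run as the user the
  // agent runs as" (only consulted when the agent has --switch_user set,
  // and overridden by CommandInfo.user when present).
  optional string user = 1;
  required string name = 2;
  // ... remaining fields unchanged ...
}
```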
[jira] [Commented] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968193#comment-14968193 ] Anand Mazumdar commented on MESOS-3789: --- [~gyliu] The tests are parameterized on the {{ContentType}} that can be either {{application/x-protobuf}} or {{application/json}}. For reproducing this, you might want to set {{--gtest_repeat=-1}} and {{--gtest_break_on_failure}} when running the test to run them in a loop. > ContentType/SchedulerTest.Suppress/1 is flalky > -- > > Key: MESOS-3789 > URL: https://issues.apache.org/jira/browse/MESOS-3789 > Project: Mesos > Issue Type: Bug > Environment: > https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull >Reporter: Vinod Kone >Assignee: Guangya Liu > > Observed in ASF CI > {code} > [ RUN ] ContentType/SchedulerTest.Suppress/1 > Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w' > I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms > I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms > I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns > I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in > 3107ns > I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the > db in 504ns > I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery > I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status > I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received > a broadcasted recover request from (10113)@172.17.3.153:57838 > I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1021 19:17:43.149783 30942 
recover.cpp:566] Updating replica status to > STARTING > I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 562580ns > I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to > STARTING > I1021 19:17:43.150841 30945 recover.cpp:475] Replica is in STARTING status > I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status > received a broadcasted recover request from (10114)@172.17.3.153:57838 > I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING > I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 275156ns > I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to > VOTING > I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos > group > I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated > I1021 19:17:43.154260 30940 master.cpp:376] Master > 242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on > 172.17.3.153:57838 > I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > 
--webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" > --work_dir="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/master" > --zk_session_timeout="10secs" > I1021 19:17:43.154597 30940 master.cpp:425] Master allowing unauthenticated > frameworks to register > I1021 19:17:43.154605 30940 master.cpp:428] Master only allowing > authenticated slaves to register > I1021 19:17:43.154611 30940 credentials.hpp:37] Loading credentials for > authentication from > '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials' > I1021 19:17:43.154871 30940 master.cpp:467] Using default 'crammd5' > authenticator > I1021 19:17:43.155226 30940 master.cpp:504] Authorization enabled > I1021 19:17:43.155524 30947 whitelist_watcher.cpp:79] No whitelist given > I1021 19:17:43.155642 30939 hierarchical.cp
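Anand's reproduction advice above corresponds to a gtest invocation along these lines (the test-runner path is illustrative; {{--gtest_repeat=-1}} repeats forever and {{--gtest_break_on_failure}} stops at the first failure):

```shell
# Run the single parameterized case in a loop until it fails.
./src/mesos-tests \
  --gtest_filter='ContentType/SchedulerTest.Suppress/1' \
  --gtest_repeat=-1 \
  --gtest_break_on_failure
```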
[jira] [Commented] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968188#comment-14968188 ] Guangya Liu commented on MESOS-3789:
Hi [~vi...@twitter.com] and [~anandmazumdar], this is very similar to MESOS-3733; the difference is that MESOS-3733 failed at ContentType/SchedulerTest.Suppress/1 but not at ContentType/SchedulerTest.Suppress/0. Can you please explain in more detail the difference between ContentType/SchedulerTest.Suppress/0 and ContentType/SchedulerTest.Suppress/1? I also tried to reproduce this in my local environment but could not; I will check further.
{code}
I1021 19:17:43.270341 30954 slave.cpp:2284] Updated checkpointed resources  from  to
../../src/tests/scheduler_tests.cpp:1028: Failure
Value of: event.isPending()
  Actual: false
Expected: true
I1021 19:17:43.276475 30920 master.cpp:925] Master terminating
I1021 19:17:43.276880 30949 hierarchical.cpp:364] Removed slave 242dc5ed-402d-4873-be6d-9bad1f3296f9-S0
I1021 19:17:43.277751 30945 hierarchical.cpp:220] Removed framework 242dc5ed-402d-4873-be6d-9bad1f3296f9-
I1021 19:17:43.277863 30941 slave.cpp:3258] master@172.17.3.153:57838 exited
W1021 19:17:43.277899 30941 slave.cpp:3261] Master disconnected!
Waiting for a new master to be elected I1021 19:17:43.303658 30920 slave.cpp:606] Slave terminating [ FAILED ] ContentType/SchedulerTest.Suppress/1, where GetParam() = application/json (172 ms) {code} > ContentType/SchedulerTest.Suppress/1 is flalky > -- > > Key: MESOS-3789 > URL: https://issues.apache.org/jira/browse/MESOS-3789 > Project: Mesos > Issue Type: Bug > Environment: > https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull >Reporter: Vinod Kone >Assignee: Guangya Liu > > Observed in ASF CI > {code} > [ RUN ] ContentType/SchedulerTest.Suppress/1 > Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w' > I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms > I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms > I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns > I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in > 3107ns > I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the > db in 504ns > I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery > I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status > I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received > a broadcasted recover request from (10113)@172.17.3.153:57838 > I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1021 19:17:43.149783 30942 recover.cpp:566] Updating replica status to > STARTING > I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 562580ns > I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to > STARTING > I1021 19:17:43.150841 30945 recover.cpp:475] 
Replica is in STARTING status > I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status > received a broadcasted recover request from (10114)@172.17.3.153:57838 > I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING > I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 275156ns > I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to > VOTING > I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos > group > I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated > I1021 19:17:43.154260 30940 master.cpp:376] Master > 242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on > 172.17.3.153:57838 > I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_time
[jira] [Issue Comment Deleted] (MESOS-3733) ContentType/SchedulerTest.Suppress/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-3733: --- Comment: was deleted (was: [~vi...@twitter.com] This is very similar with MESOS-3789 , the difference is MESOS-3789 is failed at ContentType/SchedulerTest.Suppress/1 but not ContentType/SchedulerTest.Suppress/0 . Can you please show more detail what is the difference of ContentType/SchedulerTest.Suppress/0 and ContentType/SchedulerTest.Suppress/1 ? I also tried to reproduce in my local env but failed to reproduce, will check more. {code} I1021 19:17:43.270341 30954 slave.cpp:2284] Updated checkpointed resources from to ../../src/tests/scheduler_tests.cpp:1028: Failure Value of: event.isPending() Actual: false Expected: true I1021 19:17:43.276475 30920 master.cpp:925] Master terminating I1021 19:17:43.276880 30949 hierarchical.cpp:364] Removed slave 242dc5ed-402d-4873-be6d-9bad1f3296f9-S0 I1021 19:17:43.277751 30945 hierarchical.cpp:220] Removed framework 242dc5ed-402d-4873-be6d-9bad1f3296f9- I1021 19:17:43.277863 30941 slave.cpp:3258] master@172.17.3.153:57838 exited W1021 19:17:43.277899 30941 slave.cpp:3261] Master disconnected! 
Waiting for a new master to be elected I1021 19:17:43.303658 30920 slave.cpp:606] Slave terminating [ FAILED ] ContentType/SchedulerTest.Suppress/1, where GetParam() = application/json (172 ms) {code}) > ContentType/SchedulerTest.Suppress/0 is flaky > - > > Key: MESOS-3733 > URL: https://issues.apache.org/jira/browse/MESOS-3733 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Anand Mazumdar >Assignee: Guangya Liu > Labels: flaky-test > > Showed up on ASF CI: > https://builds.apache.org/job/Mesos/931/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/console > {code} > [ RUN ] ContentType/SchedulerTest.Suppress/0 > Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi' > I1014 17:34:11.225731 27650 leveldb.cpp:176] Opened db in 2.974504ms > I1014 17:34:11.226856 27650 leveldb.cpp:183] Compacted db in 980779ns > I1014 17:34:11.227028 27650 leveldb.cpp:198] Created db iterator in 37641ns > I1014 17:34:11.227159 27650 leveldb.cpp:204] Seeked to beginning of db in > 14959ns > I1014 17:34:11.227283 27650 leveldb.cpp:273] Iterated through 0 keys in the > db in 14672ns > I1014 17:34:11.227449 27650 replica.cpp:746] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1014 17:34:11.228469 27680 recover.cpp:449] Starting replica recovery > I1014 17:34:11.229202 27673 recover.cpp:475] Replica is in EMPTY status > I1014 17:34:11.231384 27673 replica.cpp:642] Replica in EMPTY status received > a broadcasted recover request from (10262)@172.17.2.194:37545 > I1014 17:34:11.231745 27673 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1014 17:34:11.234242 27680 master.cpp:376] Master > 0cc41e7f-8d87-4c2f-9543-3f7198f9fdaf (23af00e0dbe0) started on > 172.17.2.194:37545 > I1014 17:34:11.234283 27680 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" 
--authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" > --work_dir="/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/master" > --zk_session_timeout="10secs" > I1014 17:34:11.234679 27680 master.cpp:425] Master allowing unauthenticated > frameworks to register > I1014 17:34:11.234694 27680 master.cpp:428] Master only allowing > authenticated slaves to register > I1014 17:34:11.234705 27680 credentials.hpp:37] Loading credentials for > authentication from > '/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/credentials' > I1014 17:34:11.235251 27673 recover.cpp:566] Updating replica status to > STARTING > I1014 17:34:11.235857 27680 master.cpp:467] Using default 'crammd5' > authenticator > I1014 17:34:11.236006 27680 master.cpp:504] Authorization enabled > I1014 17:34:11.236187 27673 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 729504ns > I1014 17:34:11.236224 27673 r
[jira] [Commented] (MESOS-3733) ContentType/SchedulerTest.Suppress/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968185#comment-14968185 ] Guangya Liu commented on MESOS-3733: [~vi...@twitter.com] This is very similar to MESOS-3789; the difference is that MESOS-3789 fails at ContentType/SchedulerTest.Suppress/1 but not ContentType/SchedulerTest.Suppress/0. Can you please give more detail on what the difference between ContentType/SchedulerTest.Suppress/0 and ContentType/SchedulerTest.Suppress/1 is? I also tried to reproduce this in my local env but could not; I will check more. {code} I1021 19:17:43.270341 30954 slave.cpp:2284] Updated checkpointed resources from to ../../src/tests/scheduler_tests.cpp:1028: Failure Value of: event.isPending() Actual: false Expected: true I1021 19:17:43.276475 30920 master.cpp:925] Master terminating I1021 19:17:43.276880 30949 hierarchical.cpp:364] Removed slave 242dc5ed-402d-4873-be6d-9bad1f3296f9-S0 I1021 19:17:43.277751 30945 hierarchical.cpp:220] Removed framework 242dc5ed-402d-4873-be6d-9bad1f3296f9- I1021 19:17:43.277863 30941 slave.cpp:3258] master@172.17.3.153:57838 exited W1021 19:17:43.277899 30941 slave.cpp:3261] Master disconnected! 
Waiting for a new master to be elected I1021 19:17:43.303658 30920 slave.cpp:606] Slave terminating [ FAILED ] ContentType/SchedulerTest.Suppress/1, where GetParam() = application/json (172 ms) {code} > ContentType/SchedulerTest.Suppress/0 is flaky > - > > Key: MESOS-3733 > URL: https://issues.apache.org/jira/browse/MESOS-3733 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Anand Mazumdar >Assignee: Guangya Liu > Labels: flaky-test > > Showed up on ASF CI: > https://builds.apache.org/job/Mesos/931/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/console > {code} > [ RUN ] ContentType/SchedulerTest.Suppress/0 > Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi' > I1014 17:34:11.225731 27650 leveldb.cpp:176] Opened db in 2.974504ms > I1014 17:34:11.226856 27650 leveldb.cpp:183] Compacted db in 980779ns > I1014 17:34:11.227028 27650 leveldb.cpp:198] Created db iterator in 37641ns > I1014 17:34:11.227159 27650 leveldb.cpp:204] Seeked to beginning of db in > 14959ns > I1014 17:34:11.227283 27650 leveldb.cpp:273] Iterated through 0 keys in the > db in 14672ns > I1014 17:34:11.227449 27650 replica.cpp:746] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1014 17:34:11.228469 27680 recover.cpp:449] Starting replica recovery > I1014 17:34:11.229202 27673 recover.cpp:475] Replica is in EMPTY status > I1014 17:34:11.231384 27673 replica.cpp:642] Replica in EMPTY status received > a broadcasted recover request from (10262)@172.17.2.194:37545 > I1014 17:34:11.231745 27673 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1014 17:34:11.234242 27680 master.cpp:376] Master > 0cc41e7f-8d87-4c2f-9543-3f7198f9fdaf (23af00e0dbe0) started on > 172.17.2.194:37545 > I1014 17:34:11.234283 27680 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" 
--authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" > --work_dir="/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/master" > --zk_session_timeout="10secs" > I1014 17:34:11.234679 27680 master.cpp:425] Master allowing unauthenticated > frameworks to register > I1014 17:34:11.234694 27680 master.cpp:428] Master only allowing > authenticated slaves to register > I1014 17:34:11.234705 27680 credentials.hpp:37] Loading credentials for > authentication from > '/tmp/ContentType_SchedulerTest_Suppress_0_qcnnQi/credentials' > I1014 17:34:11.235251 27673 recover.cpp:566] Updating replica status to > STARTING > I1014 17:34:11.235857 27680 master.cpp:467] Using default 'crammd5' > authenticator > I1014 17:34:11.236006 27680 master.cpp:504] Authorization enabled > I1014 17:34:11.236187 27673 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 729504ns > I101
[jira] [Commented] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968182#comment-14968182 ] Anand Mazumdar commented on MESOS-3789: --- [~vinodkone] I am marking this as dup in favor of https://issues.apache.org/jira/browse/MESOS-3733 [~gyliu] was already looking at this. > ContentType/SchedulerTest.Suppress/1 is flalky > -- > > Key: MESOS-3789 > URL: https://issues.apache.org/jira/browse/MESOS-3789 > Project: Mesos > Issue Type: Bug > Environment: > https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull >Reporter: Vinod Kone >Assignee: Guangya Liu > > Observed in ASF CI > {code} > [ RUN ] ContentType/SchedulerTest.Suppress/1 > Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w' > I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms > I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms > I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns > I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in > 3107ns > I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the > db in 504ns > I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery > I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status > I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received > a broadcasted recover request from (10113)@172.17.3.153:57838 > I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1021 19:17:43.149783 30942 recover.cpp:566] Updating replica status to > STARTING > I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb 
took 562580ns > I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to > STARTING > I1021 19:17:43.150841 30945 recover.cpp:475] Replica is in STARTING status > I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status > received a broadcasted recover request from (10114)@172.17.3.153:57838 > I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING > I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 275156ns > I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to > VOTING > I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos > group > I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated > I1021 19:17:43.154260 30940 master.cpp:376] Master > 242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on > 172.17.3.153:57838 > I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" > --work_dir="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/master" > 
--zk_session_timeout="10secs" > I1021 19:17:43.154597 30940 master.cpp:425] Master allowing unauthenticated > frameworks to register > I1021 19:17:43.154605 30940 master.cpp:428] Master only allowing > authenticated slaves to register > I1021 19:17:43.154611 30940 credentials.hpp:37] Loading credentials for > authentication from > '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials' > I1021 19:17:43.154871 30940 master.cpp:467] Using default 'crammd5' > authenticator > I1021 19:17:43.155226 30940 master.cpp:504] Authorization enabled > I1021 19:17:43.155524 30947 whitelist_watcher.cpp:79] No whitelist given > I1021 19:17:43.155642 30939 hierarchical.cpp:140] Initialized hierarchical > allocator process > I1021 19:17:43.157397 30952 master.cpp:1609] The newly elected leader is > master@1
[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968180#comment-14968180 ] Raul Gutierrez Segales edited comment on MESOS-2186 at 10/21/15 11:43 PM: -- What would a sane behavior be? Say you give zookeeper_init() a list of 5 hostnames: for how long should it retry if any of those lookups fail? Can it continue if most of them work? What's a good number? If we can define this in a consistent way that works for everyone, I am happy to implement that behavior. But it's tricky to get right, hence it's usually better to just get off of DNS entirely if it's flaky (and pass in IP addresses). was (Author: rgs): What would a sane behavior be? Say you give zookeeper_init() a list of 5 hostnames: for how long should it retry if any of those lookups fail? Can it continue of most of them work? What's a good number? If we can define this in a consistent way that works for everyone, I am happy to implement that behavior. But it's tricky to get right, hence it's usually better to just get off of DNS entirely if it's flaky (and pass in IP addresses). > Mesos crashes if any configured zookeeper does not resolve. > --- > > Key: MESOS-2186 > URL: https://issues.apache.org/jira/browse/MESOS-2186 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.21.0, 0.26.0 > Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6 > Mesos: 0.21.0-1.0.centos65 > CentOS: CentOS release 6.6 (Final) >Reporter: Daniel Hall >Priority: Critical > Labels: mesosphere > > When starting Mesos, if one of the configured zookeeper servers does not > resolve in DNS Mesos will crash and refuse to start. We noticed this issue > while we were rebuilding one of our zookeeper hosts in Google compute (which > bases the DNS on the machines running). > Here is a log from a failed startup (hostnames and ip addresses have been > sanitised). 
> {noformat} > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 > 28627 main.cpp:292] Starting Mesos master > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 > 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: 
*** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 > 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, > zookeeper_init: No such file or directory [2] > De
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968180#comment-14968180 ] Raul Gutierrez Segales commented on MESOS-2186: --- What would a sane behavior be? Say you give zookeeper_init() a list of 5 hostnames: for how long should it retry if any of those lookups fail? Can it continue if most of them work? What's a good number? If we can define this in a consistent way that works for everyone, I am happy to implement that behavior. But it's tricky to get right, hence it's usually better to just get off of DNS entirely if it's flaky (and pass in IP addresses). > Mesos crashes if any configured zookeeper does not resolve. > --- > > Key: MESOS-2186 > URL: https://issues.apache.org/jira/browse/MESOS-2186 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.21.0, 0.26.0 > Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6 > Mesos: 0.21.0-1.0.centos65 > CentOS: CentOS release 6.6 (Final) >Reporter: Daniel Hall >Priority: Critical > Labels: mesosphere > > When starting Mesos, if one of the configured zookeeper servers does not > resolve in DNS Mesos will crash and refuse to start. We noticed this issue > while we were rebuilding one of our zookeeper hosts in Google compute (which > bases the DNS on the machines running). > Here is a log from a failed startup (hostnames and ip addresses have been > sanitised). 
> {noformat} > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 > 28627 main.cpp:292] Starting Mesos master > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 > 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: 
*** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 > 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, > zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 > 28640 master.cpp:31
[jira] [Commented] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968174#comment-14968174 ] Vinod Kone commented on MESOS-3789: --- [~gyliu] can you take a look at this? > ContentType/SchedulerTest.Suppress/1 is flalky > -- > > Key: MESOS-3789 > URL: https://issues.apache.org/jira/browse/MESOS-3789 > Project: Mesos > Issue Type: Bug > Environment: > https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull >Reporter: Vinod Kone >Assignee: Guangya Liu > > Observed in ASF CI > {code} > [ RUN ] ContentType/SchedulerTest.Suppress/1 > Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w' > I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms > I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms > I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns > I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in > 3107ns > I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the > db in 504ns > I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery > I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status > I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received > a broadcasted recover request from (10113)@172.17.3.153:57838 > I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1021 19:17:43.149783 30942 recover.cpp:566] Updating replica status to > STARTING > I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 562580ns > I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to > STARTING > 
I1021 19:17:43.150841 30945 recover.cpp:475] Replica is in STARTING status > I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status > received a broadcasted recover request from (10114)@172.17.3.153:57838 > I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING > I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 275156ns > I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to > VOTING > I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos > group > I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated > I1021 19:17:43.154260 30940 master.cpp:376] Master > 242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on > 172.17.3.153:57838 > I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" > --work_dir="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/master" > --zk_session_timeout="10secs" > I1021 19:17:43.154597 30940 master.cpp:425] Master allowing unauthenticated > frameworks to 
register > I1021 19:17:43.154605 30940 master.cpp:428] Master only allowing > authenticated slaves to register > I1021 19:17:43.154611 30940 credentials.hpp:37] Loading credentials for > authentication from > '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials' > I1021 19:17:43.154871 30940 master.cpp:467] Using default 'crammd5' > authenticator > I1021 19:17:43.155226 30940 master.cpp:504] Authorization enabled > I1021 19:17:43.155524 30947 whitelist_watcher.cpp:79] No whitelist given > I1021 19:17:43.155642 30939 hierarchical.cpp:140] Initialized hierarchical > allocator process > I1021 19:17:43.157397 30952 master.cpp:1609] The newly elected leader is > master@172.17.3.153:57838 with id 242dc5ed-402d-4873-be6d-9bad1f3296f9 > I1021 19:17:43.157438 30952 master.cpp:1622]
[jira] [Created] (MESOS-3789) ContentType/SchedulerTest.Suppress/1 is flaky
Vinod Kone created MESOS-3789: - Summary: ContentType/SchedulerTest.Suppress/1 is flaky Key: MESOS-3789 URL: https://issues.apache.org/jira/browse/MESOS-3789 Project: Mesos Issue Type: Bug Environment: https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/965/consoleFull Reporter: Vinod Kone Assignee: Guangya Liu Observed in ASF CI {code} [ RUN ] ContentType/SchedulerTest.Suppress/1 Using temporary directory '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w' I1021 19:17:43.142560 30920 leveldb.cpp:176] Opened db in 2.484217ms I1021 19:17:43.143709 30920 leveldb.cpp:183] Compacted db in 1.002737ms I1021 19:17:43.143831 30920 leveldb.cpp:198] Created db iterator in 25419ns I1021 19:17:43.143971 30920 leveldb.cpp:204] Seeked to beginning of db in 3107ns I1021 19:17:43.144098 30920 leveldb.cpp:273] Iterated through 0 keys in the db in 504ns I1021 19:17:43.144224 30920 replica.cpp:748] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I1021 19:17:43.145409 30948 recover.cpp:449] Starting replica recovery I1021 19:17:43.146062 30943 recover.cpp:475] Replica is in EMPTY status I1021 19:17:43.148715 30942 replica.cpp:644] Replica in EMPTY status received a broadcasted recover request from (10113)@172.17.3.153:57838 I1021 19:17:43.149269 30943 recover.cpp:195] Received a recover response from a replica in EMPTY status I1021 19:17:43.149783 30942 recover.cpp:566] Updating replica status to STARTING I1021 19:17:43.150475 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 562580ns I1021 19:17:43.150565 30945 replica.cpp:323] Persisted replica status to STARTING I1021 19:17:43.150841 30945 recover.cpp:475] Replica is in STARTING status I1021 19:17:43.152133 30945 replica.cpp:644] Replica in STARTING status received a broadcasted recover request from (10114)@172.17.3.153:57838 I1021 19:17:43.152479 30945 recover.cpp:195] Received a recover 
response from a replica in STARTING status I1021 19:17:43.153056 30945 recover.cpp:566] Updating replica status to VOTING I1021 19:17:43.153539 30945 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 275156ns I1021 19:17:43.153623 30945 replica.cpp:323] Persisted replica status to VOTING I1021 19:17:43.153820 30943 recover.cpp:580] Successfully joined the Paxos group I1021 19:17:43.153996 30943 recover.cpp:464] Recover process terminated I1021 19:17:43.154260 30940 master.cpp:376] Master 242dc5ed-402d-4873-be6d-9bad1f3296f9 (79d8015cd9f0) started on 172.17.3.153:57838 I1021 19:17:43.154288 30940 master.cpp:378] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" --work_dir="/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/master" --zk_session_timeout="10secs" I1021 19:17:43.154597 30940 master.cpp:425] Master allowing unauthenticated frameworks to register I1021 19:17:43.154605 30940 master.cpp:428] Master only allowing authenticated slaves to register I1021 19:17:43.154611 30940 credentials.hpp:37] Loading credentials for authentication from '/tmp/ContentType_SchedulerTest_Suppress_1_QtJ44w/credentials' I1021 19:17:43.154871 30940 master.cpp:467] Using default 'crammd5' authenticator I1021 
19:17:43.155226 30940 master.cpp:504] Authorization enabled I1021 19:17:43.155524 30947 whitelist_watcher.cpp:79] No whitelist given I1021 19:17:43.155642 30939 hierarchical.cpp:140] Initialized hierarchical allocator process I1021 19:17:43.157397 30952 master.cpp:1609] The newly elected leader is master@172.17.3.153:57838 with id 242dc5ed-402d-4873-be6d-9bad1f3296f9 I1021 19:17:43.157438 30952 master.cpp:1622] Elected as the leading master! I1021 19:17:43.157455 30952 master.cpp:1382] Recovering from registrar I1021 19:17:43.157595 30943 registrar.cpp:309] Recovering registrar I1021 19:17:43.158347 30950 log.cpp:661] Attempting to start the writer I1021 19:17:43.159632 30949 replica.cpp:478] Replica received implicit promise request from (10115)@172.17.3.153:57838 with proposal 1 I1021 19:17:43.160238 30949 lev
[jira] [Commented] (MESOS-3786) Backticks are not mentioned in Mesos C++ Style Guide
[ https://issues.apache.org/jira/browse/MESOS-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968162#comment-14968162 ] Greg Mann commented on MESOS-3786: -- Ah OK; I was under the impression that backticks were the way to go. In either case, the correct convention should be documented in our style guide. Perhaps a thread on the dev list would help folks come to consensus on a policy? > Backticks are not mentioned in Mesos C++ Style Guide > > > Key: MESOS-3786 > URL: https://issues.apache.org/jira/browse/MESOS-3786 > Project: Mesos > Issue Type: Documentation >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Minor > Labels: documentation, mesosphere > > As far as I can tell, current practice is to quote code excerpts and object > names with backticks when writing comments. For example: > {code} > // You know, `sadPanda` seems extra sad lately. > std::string sadPanda; > sadPanda = " :'( "; > {code} > However, I don't see this documented in our C++ style guide at all. It should > be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3786) Backticks are not mentioned in Mesos C++ Style Guide
[ https://issues.apache.org/jira/browse/MESOS-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968149#comment-14968149 ] Benjamin Mahler commented on MESOS-3786: Ah.. should be single quotes for object names. Looking through a grep, it looks like a number of recent patches introduced the backticks: maintenance, systemd, fetcher cache, master json, and a couple of others. Would be great to clean this up and prevent more backticks, but not a big deal. Unless I'm missing something... e.g. in doxygen do backticks affect rendering?
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968143#comment-14968143 ] Steven Schlansker commented on MESOS-2186: -- Maybe this will end up being too hard to fix, since it seems to be a limitation of the ZK C API. It's just surprising from an end user perspective that a single name failing to resolve (even when two are still happy) causes such a disruptive failure. > Mesos crashes if any configured zookeeper does not resolve. > --- > > Key: MESOS-2186 > URL: https://issues.apache.org/jira/browse/MESOS-2186 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.21.0, 0.26.0 > Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6 > Mesos: 0.21.0-1.0.centos65 > CentOS: CentOS release 6.6 (Final) >Reporter: Daniel Hall >Priority: Critical > Labels: mesosphere > > When starting Mesos, if one of the configured zookeeper servers does not > resolve in DNS Mesos will crash and refuse to start. We noticed this issue > while we were rebuilding one of our zookeeper hosts in Google compute (which > bases the DNS on the machines running). > Here is a log from a failed startup (hostnames and ip addresses have been > sanitised). 
> {noformat} > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 > 28627 main.cpp:292] Starting Mesos master > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 > 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: 
*** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 > 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, > zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 > 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 > (mesosmaster-2.internal) started on 10.x.x.x:5050 > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 > 28640 master.cpp:366] M
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968131#comment-14968131 ] Neil Conway commented on MESOS-2186: If the DNS resolution failure lasts for a long time, zookeeper_init() will continue to return NULL and hence Mesos will still be unable to make progress.
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968128#comment-14968128 ] Steven Schlansker commented on MESOS-2186: -- This is true in the case the DNS resolution failure is temporary. If it is not temporary, you are still SOL. Imagine $JUNIOR_ADMIN removes one of the ZooKeeper nodes from DNS. You may then have an inoperable Mesos cluster for a long time if you have aggressive DNS caching, even though a ZK quorum is still up and alive.
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968099#comment-14968099 ] Neil Conway commented on MESOS-2186: Ah, okay. So the situation seems to be: (1) zookeeper_init() returns NULL when getaddrinfo() fails, as intended. (2) Mesos is _designed_ to loop and retry zookeeper_init(), but it doesn't do this: we use a gross hack to determine whether the zookeeper_init() failure was due to a hostname resolution failure, and apparently it doesn't account for this case (we're expecting errno == EINVAL, apparently we see ENOENT instead). (3) Hence, we abort the process. We can revise the condition we're checking in #2 slightly, but that is only intended as a convenience anyway: as discussed above, you should be running Mesos under process supervision and restarting it when it fails. (The question is just whether we do the retry loop in Mesos itself or in the process supervisor.) If Mesos exiting unexpectedly "compromises the 'high availability' of Mesos", your Mesos installation is not configured correctly.
[jira] [Created] (MESOS-3788) Clarify NetworkInfo semantics for IP addresses and group policies.
Connor Doyle created MESOS-3788: --- Summary: Clarify NetworkInfo semantics for IP addresses and group policies. Key: MESOS-3788 URL: https://issues.apache.org/jira/browse/MESOS-3788 Project: Mesos Issue Type: Improvement Components: containerization, isolation Affects Versions: 0.25.0 Reporter: Connor Doyle In Mesos 0.25.0, a new message called NetworkInfo was introduced. This message allows framework authors to communicate with network isolation modules via a first-class message type to request IP addresses and network group isolation policies. Unfortunately, the structure is somewhat confusing to both framework authors and module implementors. 1) It's unclear how IP addresses map to virtual interfaces inside the container. 2) It's difficult for application developers to understand the final policy when multiple IP addresses can be assigned with differing isolation policies. CC [~karya] [~benjaminhindman] [~spikecurtis]
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968075#comment-14968075 ] Raul Gutierrez Segales commented on MESOS-2186: --- Yeah, at least for the 3.4 branch we'll probably not have the constructor (zookeeper_init) retry the failed getaddrinfo() calls, so it's up to the caller. (ignore the part about the locks not properly initialized mentioned in the description of ZOOKEEPER-1029, that has nothing to do with this bug).
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968076#comment-14968076 ] Raul Gutierrez Segales commented on MESOS-2186: --- I would think so... > Mesos crashes if any configured zookeeper does not resolve. > --- > > Key: MESOS-2186 > URL: https://issues.apache.org/jira/browse/MESOS-2186 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.21.0, 0.26.0 > Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6 > Mesos: 0.21.0-1.0.centos65 > CentOS: CentOS release 6.6 (Final) >Reporter: Daniel Hall >Priority: Critical > Labels: mesosphere > > When starting Mesos, if one of the configured zookeeper servers does not > resolve in DNS Mesos will crash and refuse to start. We noticed this issue > while we were rebuilding one of our zookeeper hosts in Google compute (which > bases the DNS on the machines running). > Here is a log from a failed startup (hostnames and ip addresses have been > sanitised). 
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968073#comment-14968073 ] Steven Schlansker commented on MESOS-2186: -- If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is unrelated, yeah?
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968072#comment-14968072 ] Steven Schlansker commented on MESOS-2186: -- If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is unrelated, yeah?
[jira] [Issue Comment Deleted] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2186: - Comment: was deleted (was: If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is unrelated, yeah?)
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968063#comment-14968063 ] Neil Conway commented on MESOS-2186: The check failure trace happens because the call to zookeeper_init() returns NULL; Mesos checks for this and aborts with an error and a stack trace.
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968056#comment-14968056 ] Steven Schlansker commented on MESOS-2186: -- Well, rgs above called into question whether that is truly the case. Additionally at least as of now the "check failure stack trace" is entirely in C++ code, seemingly not in the Zookeeper library (pure C).
[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968034#comment-14968034 ] Neil Conway edited comment on MESOS-2186 at 10/21/15 9:55 PM: -- Hi Steven, The current theory is that this is a problem with Zookeeper; from a quick look at the Zk bug ([ZOOKEEPER-1029]), that seems likely correct to me. When there is a Zookeeper patch for the problem, we can discuss whether to backport it to Mesos in the time before a new Zk stable release is made. Other than that, I'm not sure what else we can do. was (Author: neilc): Hi Steven, The current theory is that this is a Zookeeper; from a quick look at the Zk bug ([ZOOKEEPER-1029]), that seems likely correct to me. When there is a Zookeeper patch for the problem, we can discuss whether to backport it to Mesos in the time before a new Zk stable release is made. Other than that, I'm not sure what else we can do.
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968034#comment-14968034 ] Neil Conway commented on MESOS-2186: Hi Steven, The current theory is that this is a problem with Zookeeper; from a quick look at the Zk bug ([ZOOKEEPER-1029]), that seems likely correct to me. When there is a Zookeeper patch for the problem, we can discuss whether to backport it to Mesos in the time before a new Zk stable release is made. Other than that, I'm not sure what else we can do.
[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ] Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:51 PM: I am still able to easily reproduce this, even with master built from today: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/wat/json.info_00' in ZooKeeper I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master (UPID=master@127.0.1.1:5050) is detected I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master! {code} Three configured ZK members, all is OK. Change one to be an unresolvable hostname -- two are still alive and correct though, so this should be recoverable: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group 2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** 2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603e10 
google::LogMessage::Flush() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec5779ed _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec577986 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x7f9be828ea40 (unknown) @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 
@ 0x7f9be7aab182 start_thread @ 0x7f9be77d847d (unknown) Aborted (core dumped) {code} [~rgs] I am very sorry if this does not end up being a ZK problem at all, I am no C++ expert. I fully admit the linked ZK bug may not be the root cause. But Mesos is still trivial to crash if one of the ZK members are not valid (even if a quorum are). was (Author: stevenschlansker): I am still able to easily reproduce this, even with master built from today: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/w
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968027#comment-14968027 ] Steven Schlansker commented on MESOS-2186: -- I reopened the ticket since it is still a crasher in master. I hope that is appropriate; I apologize in advance if not. Not trying to be a stick in the mud, but this compromises the "high availability" of Mesos, which is a critical piece of infrastructure.
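The failure mode in the logs above — zookeeper_init aborting the entire master when getaddrinfo fails for any one configured member, even with a live quorum remaining — could in principle be mitigated by filtering out unresolvable hosts before handing the connection string to the client. This is only an illustrative sketch of that idea in Python (the name `filter_resolvable` is hypothetical; Mesos's actual ZooKeeper handling is C++ and behaves as the traces show):

```python
import socket

def filter_resolvable(hosts):
    """Return only the host:port entries whose hostname resolves.

    Hypothetical mitigation sketch: instead of treating a DNS failure
    for any single configured ZooKeeper member as fatal (as the
    zookeeper_init crash above does), skip unresolvable members and
    connect to whatever quorum remains.
    """
    usable = []
    for hp in hosts:
        name = hp.rsplit(":", 1)[0]
        try:
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            continue  # an unresolvable member is skipped, not fatal
        usable.append(hp)
    return usable

# With one bad member, the two good ones would still be used:
print(filter_resolvable(["localhost:2181", "no-such-host.invalid:2181"]))
```

Whether skipping is the right policy (versus retrying resolution in the background) is a separate design question; the sketch only shows that a resolution failure need not abort the process.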
[jira] [Updated] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2186: - Affects Version/s: 0.26.0
[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ] Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:47 PM: I am still able to easily reproduce this, even with master built from today: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/wat/json.info_00' in ZooKeeper I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master (UPID=master@127.0.1.1:5050) is detected I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master! {code} Three configured ZK members, all is OK. Change one to be an unresolvable hostname -- two are still alive and correct though, so this should be recoverable: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group 2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** 2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603e10 
google::LogMessage::Flush() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec5779ed _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec577986 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x7f9be828ea40 (unknown) @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE 
@ 0x7f9be7aab182 start_thread @ 0x7f9be77d847d (unknown) Aborted (core dumped) {code} [~rgs] I am very sorry if this does not end up being a ZK problem at all, I am no C++ expert. But Mesos is still trivial to crash if one of the ZK members are not valid (even if two are). was (Author: stevenschlansker): I am still able to easily reproduce this, even with master built from today: {code} ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat {code} Three configured ZK members, all is OK. Change one to be an unresolvable hostname -- two are still alive and correct thou
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ] Steven Schlansker commented on MESOS-2186: -- I am still able to easily reproduce this, even with master built from today: {code} ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat {code} Three configured ZK members, all is OK. Change one to be an unresolvable hostname -- two are still alive and correct though, so this should be recoverable: {code} ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group 2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** 2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec56d9ae 
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec5779ed _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec577986 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x7f9be828ea40 (unknown) @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9be7aab182 start_thread @ 0x7f9be77d847d (unknown) Aborted (core dumped) {code} [~rgs] I am very sorry if this does not end up being a ZK problem at all, I am no C++ expert. But Mesos is still trivial to crash if one of the ZK members are not valid (even if two are). > Mesos crashes if any configured zookeeper does not resolve. 
[jira] [Commented] (MESOS-2386) Provide full filesystem isolation as a native mesos isolator
[ https://issues.apache.org/jira/browse/MESOS-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967969#comment-14967969 ] Zameer Manji commented on MESOS-2386: - [~idownes] The linked design doc is not publicly viewable. Can you change the permissions on the document please? > Provide full filesystem isolation as a native mesos isolator > > > Key: MESOS-2386 > URL: https://issues.apache.org/jira/browse/MESOS-2386 > Project: Mesos > Issue Type: Epic > Components: isolation >Affects Versions: 0.22.1 >Reporter: Dominic Hamon >Assignee: Ian Downes > Labels: mesosphere, twitter > > Design > https://docs.google.com/a/twitter.com/document/d/1Fx5TS0LytV7u5MZExQS0-g-gScX2yKCKQg9UPFzhp6U/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967962#comment-14967962 ] Steven Schlansker commented on MESOS-3771: -- Okay, I have distilled down the reproduction case. Using the Python test-framework with the following diff applied: {code} diff --git a/src/examples/python/test_framework.py b/src/examples/python/test_framework.py index 6af6d22..95abb97 100755 --- a/src/examples/python/test_framework.py +++ b/src/examples/python/test_framework.py @@ -150,6 +150,7 @@ class TestScheduler(mesos.interface.Scheduler): print "but received", self.messagesReceived sys.exit(1) print "All tasks done, and all messages received, exiting" +time.sleep(30) driver.stop() if __name__ == "__main__": @@ -158,6 +159,7 @@ if __name__ == "__main__": sys.exit(1) executor = mesos_pb2.ExecutorInfo() +executor.data = b'\xAC\xED' executor.executor_id.value = "default" executor.command.value = os.path.abspath("./test-executor") executor.name = "Test Executor (Python)" {code} if you run the test framework, and during the 30 second wait after it finishes, try to grab the {{/master/state.json}} endpoint, you will get a response that has invalid UTF8 in it: {code} Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@54c8158d; line: 1, column: 6432] {code} I tested against both 0.24.1 and current master, both exhibit the bad behavior. > Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII > handling > --- > > Key: MESOS-3771 > URL: https://issues.apache.org/jira/browse/MESOS-3771 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.1, 0.26.0 >Reporter: Steven Schlansker >Priority: Critical > > Spark encodes some binary data into the ExecutorInfo.data field. This field > is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. 
> If you have such a field, it seems that it is splatted out into JSON without > any regards to proper character encoding: > {code} > 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| > 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| > 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| > 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| > 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| > 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| > {code} > I suspect this is because the HTTP api emits the executorInfo.data directly: > {code} > JSON::Object model(const ExecutorInfo& executorInfo) > { > JSON::Object object; > object.values["executor_id"] = executorInfo.executor_id().value(); > object.values["name"] = executorInfo.name(); > object.values["data"] = executorInfo.data(); > object.values["framework_id"] = executorInfo.framework_id().value(); > object.values["command"] = model(executorInfo.command()); > object.values["resources"] = model(executorInfo.resources()); > return object; > } > {code} > I think this may be because the custom JSON processing library in stout seems > to not have any idea of what a byte array is. I'm guessing that some > implicit conversion makes it get written as a String instead, but: > {code} > inline std::ostream& operator<<(std::ostream& out, const String& string) > { > // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII. > // See RFC4627 for the JSON string specificiation. > return out << picojson::value(string.value).serialize(); > } > {code} > Thank you for any assistance here. Our cluster is currently entirely down -- > the frameworks cannot handle parsing the invalid JSON produced (it is not > even valid utf-8) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
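The report above boils down to the serializer emitting raw bytes into a JSON string. One possible direction (a sketch only — this is a hypothetical helper, not the actual stout/picojson code quoted in the ticket) is to escape every byte outside printable ASCII as a `\u00XX` sequence, so the emitted JSON is always valid ASCII-only JSON even when the input holds arbitrary binary data such as `ExecutorInfo.data`:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not the stout/picojson code quoted above):
// serialize a raw byte string as a JSON string that is always valid,
// ASCII-only JSON, even when the input holds arbitrary binary data.
// Bytes outside printable ASCII become \u00XX escapes; the input is
// treated as raw bytes, not as UTF-8 code points.
std::string escapeBytesAsJson(const std::string& input)
{
  std::string out = "\"";
  for (unsigned char c : input) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\t': out += "\\t";  break;
      default:
        if (c < 0x20 || c > 0x7e) {
          char buffer[8];  // "\u00ff" is 6 chars plus the terminator.
          std::snprintf(buffer, sizeof(buffer), "\\u%04x", c);
          out += buffer;
        } else {
          out += static_cast<char>(c);
        }
    }
  }
  out += "\"";
  return out;
}
```

Note the trade-off: byte-wise escaping keeps the output well-formed, but a consumer no longer sees multi-byte UTF-8 characters as such — for genuinely binary fields, base64-encoding the value would be the more conventional fix.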
[jira] [Updated] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-3771: - Affects Version/s: 0.26.0 > Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII > handling > --- > > Key: MESOS-3771 > URL: https://issues.apache.org/jira/browse/MESOS-3771 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.1, 0.26.0 >Reporter: Steven Schlansker >Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1563) Failed to configure on FreeBSD
[ https://issues.apache.org/jira/browse/MESOS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967688#comment-14967688 ] David Forsythe commented on MESOS-1563: --- [~idownes] Great! Do you want to make a first pass and have me chop it up when I address feedback, or would you like me to chop it up before you review? > Failed to configure on FreeBSD > -- > > Key: MESOS-1563 > URL: https://issues.apache.org/jira/browse/MESOS-1563 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.19.0 > Environment: FreeBSD-10/stable >Reporter: Dmitry Sivachenko > > When trying to configure mesos on FreeBSD, I get the following error: > configure: Setting up build environment for x86_64 freebsd10.0 > configure: error: "Mesos is currently unsupported on your platform." > Why? Is there anything really Linux-specific inside? It's written in Java > after all. > And MacOS is supported, but it is rather close to FreeBSD. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-3787: -- Description: We'd like to have expanded variables usable in [the json files used to create a Marathon app, hence] the Task's CommandInfo, so that the executor is able to detect the correct values at runtime. (was: We'd like to have expanded variables usable in the json files used to create an app, so that the executor is able to detect the correct values at runtime.) > As a developer, I'd like to be able to expand environment variables through > the Docker executor. > > > Key: MESOS-3787 > URL: https://issues.apache.org/jira/browse/MESOS-3787 > Project: Mesos > Issue Type: Wish >Reporter: John Garcia > Labels: mesosphere > > We'd like to have expanded variables usable in [the json files used to create > a Marathon app, hence] the Task's CommandInfo, so that the executor is able > to detect the correct values at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
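As an illustration of the requested behavior, the hypothetical helper below (not an existing Mesos API — name and semantics are assumptions for this sketch) substitutes `${NAME}` references in a CommandInfo-style value string with values from the executor's runtime environment:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch of "expanding environment variables through the
// executor": replace each ${NAME} in the command value with the value
// of NAME from the process environment; unknown variables expand to
// the empty string. Not an actual Mesos API.
std::string expandVariables(const std::string& value)
{
  std::string result;
  size_t pos = 0;
  while (pos < value.size()) {
    if (value[pos] == '$' && pos + 1 < value.size() && value[pos + 1] == '{') {
      size_t close = value.find('}', pos + 2);
      if (close != std::string::npos) {
        std::string name = value.substr(pos + 2, close - pos - 2);
        const char* v = std::getenv(name.c_str());
        if (v != nullptr) {
          result += v;
        }
        pos = close + 1;  // Skip past the closing brace.
        continue;
      }
    }
    result += value[pos++];
  }
  return result;
}
```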
[jira] [Updated] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-3787: -- Labels: mesosphere (was: ) > As a developer, I'd like to be able to expand environment variables through > the Docker executor. > > > Key: MESOS-3787 > URL: https://issues.apache.org/jira/browse/MESOS-3787 > Project: Mesos > Issue Type: Wish >Reporter: John Garcia > Labels: mesosphere > > We'd like to have expanded variables usable in the json files used to create > an app, so that the executor is able to detect the correct values at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.
John Garcia created MESOS-3787: -- Summary: As a developer, I'd like to be able to expand environment variables through the Docker executor. Key: MESOS-3787 URL: https://issues.apache.org/jira/browse/MESOS-3787 Project: Mesos Issue Type: Wish Reporter: John Garcia We'd like to have expanded variables usable in the json files used to create an app, so that the executor is able to detect the correct values at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3786) Backticks are not mentioned in Mesos C++ Style Guide
Greg Mann created MESOS-3786: Summary: Backticks are not mentioned in Mesos C++ Style Guide Key: MESOS-3786 URL: https://issues.apache.org/jira/browse/MESOS-3786 Project: Mesos Issue Type: Documentation Reporter: Greg Mann Assignee: Greg Mann Priority: Minor As far as I can tell, current practice is to quote code excerpts and object names with backticks when writing comments. For example: {code} // You know, `sadPanda` seems extra sad lately. std::string sadPanda; sadPanda = " :'( "; {code} However, I don't see this documented in our C++ style guide at all. It should be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
[ https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-3747: -- Shepherd: Vinod Kone i'll shepherd this. > HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string > - > > Key: MESOS-3747 > URL: https://issues.apache.org/jira/browse/MESOS-3747 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.0, 0.24.1, 0.25.0 >Reporter: Ben Whitehead >Assignee: Liqiang Lin >Priority: Blocker > > When using libmesos a framework can set its user to {{""}} (empty string) to > inherit the user the agent processes is running as, this behavior now results > in a {{TASK_FAILED}}. > Full messages and relevant agent logs below. > The error returned to the framework tells me nothing about the user not > existing on the agent host instead it tells me the container died due to OOM. > {code:title=FrameworkInfo} > call { > type: SUBSCRIBE > subscribe: { > frameworkInfo: { > user: "", > name: "testing" > } > } > } > {code} > {code:title=TaskInfo} > call { > framework_id { value: "20151015-125949-16777343-5050-20146-" }, > type: ACCEPT, > accept { > offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }], > operations { > type: LAUNCH, > launch { > task_infos [ > { > name: "task-1", > task_id: { value: "task-1" }, > agent_id: { value: > "20151015-125949-16777343-5050-20146-S0" }, > resources [ > { name: "cpus", type: SCALAR, scalar: { value: > 0.1 }, role: "*" }, > { name: "mem", type: SCALAR, scalar: { value: > 64.0 }, role: "*" }, > { name: "disk", type: SCALAR, scalar: { value: > 0.0 }, role: "*" }, > ], > command: { > environment { > variables [ > { name: "SLEEP_SECONDS" value: "15" } > ] > }, > value: "env | sort && sleep $SLEEP_SECONDS" > } > } > ] > } > } > } > } > {code} > {code:title=Update Status} > event: { > type: UPDATE, > update: { > status: { > task_id: { value: "task-1" }, > state: TASK_FAILED, > message: "Container destroyed while 
preparing isolators", > agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, > timestamp: 1.444939217401241E9, > executor_id: { value: "task-1" }, > source: SOURCE_AGENT, > reason: REASON_MEMORY_LIMIT, > uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" > } > } > } > {code} > {code:title=agent logs} > I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- > I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- > W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory > '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b': > Failed to get user information for '': Success > I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources > cpus(*):0.1; mem(*):32 in work directory > '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b' > I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for > executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6- > I1015 13:15:34.262684 19638 docker.cpp:734] No container info found, skipping > launch > I1015 13:15:34.263478 19638 containerizer.cpp:640] Starting container > '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework > 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-' > E1015 13:15:34.264516 19641 slave.cpp:3342] Container > '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework > 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-' failed to start: Failed to > prepare isolator: Faile
[jira] [Updated] (MESOS-3785) Use URI content modification time to trigger fetcher cache updates.
[ https://issues.apache.org/jira/browse/MESOS-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-3785: Labels: mesosphere (was: ) > Use URI content modification time to trigger fetcher cache updates. > --- > > Key: MESOS-3785 > URL: https://issues.apache.org/jira/browse/MESOS-3785 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Reporter: Bernd Mathiske >Assignee: Benjamin Bannier > Labels: mesosphere > > Instead of using checksums to trigger fetcher cache updates, we can for > starters use the content modification time (mtime), which is available for a > number of download protocols, e.g. HTTP and HDFS. > Proposal: Instead of just fetching the content size, we fetch both size and > mtime together. As before, if there is no size, then caching fails and we > fall back on direct downloading to the sandbox. > Assuming a size is given, we compare the mtime from the fetch URI with the > mtime known to the cache. If it differs, we update the cache. (As a defensive > measure, a difference in size should also trigger an update.) > Not having an mtime available at the fetch URI is simply treated as a unique > valid mtime value that differs from all others. This means that when > initially there is no mtime, cache content remains valid until there is one. > Thereafter, a new lack of an mtime invalidates the cache once. In other > words: any change from no mtime to having one or back is the same as > encountering a new mtime. > Note that this scheme does not require any new protobuf fields. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
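The invalidation rule proposed above can be sketched as follows (types and names are illustrative, not the actual fetcher cache code; a sentinel value of 0 stands in for "no mtime available at the URI", which makes "no mtime → mtime", "mtime → no mtime", and "mtime → different mtime" all trigger an update, while "no mtime → still no mtime" leaves the entry valid):

```cpp
// Sketch of the proposed mtime-based cache invalidation rule.
// A CacheEntry records what the cache last observed for a URI.
struct CacheEntry
{
  long long size;   // Content size known to the cache.
  long long mtime;  // Content mtime known to the cache; 0 if none was available.
};

// True if the cache entry must be refreshed from the URI.
bool needsUpdate(const CacheEntry& cached, long long uriSize, long long uriMtime)
{
  // A differing size is a defensive trigger even when the mtime matches.
  return cached.mtime != uriMtime || cached.size != uriSize;
}
```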
[jira] [Commented] (MESOS-3506) Build instructions for CentOS 6.6 should include `sudo yum update`
[ https://issues.apache.org/jira/browse/MESOS-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967581#comment-14967581 ] Greg Mann commented on MESOS-3506: -- On a related note, can anybody confirm that the following command is necessary in the CentOS 6.6 install instructions: {code} sudo yum install -y tar wget which {code} The OS image I'm using already has these installed by default, so I'm inclined to remove that line from the docs unless we can confirm that it's needed. > Build instructions for CentOS 6.6 should include `sudo yum update` > -- > > Key: MESOS-3506 > URL: https://issues.apache.org/jira/browse/MESOS-3506 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Greg Mann >Assignee: Greg Mann > Labels: documentation, mesosphere > > Neglecting to run {{sudo yum update}} on CentOS 6.6 currently causes the > build to break when building {{mesos-0.25.0.jar}}. The build instructions for > this platform on the Getting Started page should be changed accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3506) Build instructions for CentOS 6.6 should include `sudo yum update`
[ https://issues.apache.org/jira/browse/MESOS-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967570#comment-14967570 ] Greg Mann commented on MESOS-3506: -- I'll check it out and see if I can tell which dependency is causing the issue. Since this is CentOS 6, however, I think there's a case to be made for just saying `sudo yum update` in the docs, since it's an old OS version and I would imagine similar problems with other packages may crop up in the future. > Build instructions for CentOS 6.6 should include `sudo yum update` > -- > > Key: MESOS-3506 > URL: https://issues.apache.org/jira/browse/MESOS-3506 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Greg Mann >Assignee: Greg Mann > Labels: documentation, mesosphere > > Neglecting to run {{sudo yum update}} on CentOS 6.6 currently causes the > build to break when building {{mesos-0.25.0.jar}}. The build instructions for > this platform on the Getting Started page should be changed accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1563) Failed to configure on FreeBSD
[ https://issues.apache.org/jira/browse/MESOS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967492#comment-14967492 ] Ian Downes commented on MESOS-1563: --- [~dforsyth] it should be split into two. Although they live in the same tree, we consider libprocess (and also stout) to be a separate library, and changes should be separated. I'm just about to review now. > Failed to configure on FreeBSD > -- > > Key: MESOS-1563 > URL: https://issues.apache.org/jira/browse/MESOS-1563 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.19.0 > Environment: FreeBSD-10/stable >Reporter: Dmitry Sivachenko > > When trying to configure mesos on FreeBSD, I get the following error: > configure: Setting up build environment for x86_64 freebsd10.0 > configure: error: "Mesos is currently unsupported on your platform." > Why? Is there anything really Linux-specific inside? It's written in Java > after all. > And MacOS is supported, but it is rather close to FreeBSD. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3736) Support docker local store pull same image simultaneously
[ https://issues.apache.org/jira/browse/MESOS-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967449#comment-14967449 ] Gilbert Song commented on MESOS-3736: - A quick note listing solutions to the questions above: 1. By a logic check: if it is the first call to get() for Image_A, associate a promise with metadataManager->get(). If not, check whether that promised future failed or was discarded; if so, overwrite the entry in the hash map. 2. Use 'stringify(image)' as the key. > Support docker local store pull same image simultaneously > -- > > Key: MESOS-3736 > URL: https://issues.apache.org/jira/browse/MESOS-3736 > Project: Mesos > Issue Type: Improvement >Reporter: Gilbert Song >Assignee: Gilbert Song > Labels: mesosphere > > The current local store implements get() using the local puller. For concurrent > requests to pull the same docker image, the local puller currently untars the > image tarball once per request and copies each result to the same directory, > which wastes time and computation. The local store/puller should do this work > only for the first request; simultaneous pull requests should wait on the > promised future and receive the result once the first pull finishes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
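The deduplication scheme described in the ticket and comment can be sketched with a map from a stringified image reference to an in-flight result. This is an illustrative stand-in using `std::shared_future` rather than the libprocess `process::Future` the real store uses, and the class/method names are assumptions:

```cpp
#include <functional>
#include <future>
#include <map>
#include <string>

// Illustrative sketch (not the actual Mesos provisioner store):
// concurrent get() calls for the same image share one pull, keyed on
// a stringified image reference.
class LocalStore
{
public:
  std::shared_future<std::string> get(
      const std::string& imageKey,
      const std::function<std::string()>& pull)
  {
    auto it = pulling.find(imageKey);
    if (it != pulling.end()) {
      // A pull for this image is already in flight (or finished);
      // later callers share its result instead of untarring again.
      return it->second;
    }
    std::shared_future<std::string> future =
      std::async(std::launch::deferred, pull).share();
    pulling.emplace(imageKey, future);
    return future;
  }

private:
  // Key: stringify(image); value: the promised pull result.
  std::map<std::string, std::shared_future<std::string>> pulling;
};
```

Per point 1 in the comment above, a production version would also check whether a stored future failed or was discarded and overwrite that entry so a later request can retry the pull; the sketch omits that for brevity.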
[jira] [Commented] (MESOS-3766) Can not kill task in Status STAGING
[ https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967444#comment-14967444 ] Anand Mazumdar commented on MESOS-3766: --- Looking at the logs, we can identify a few things: 1. The executors for the 2 tasks were indeed launched. After launch, they sent a registration request to the agent. The agent successfully registered them, and sent the queued task it had back to the executor. 2. The executor never sent any status updates (e.g. {{TASK_RUNNING}}) for the task; it might have been stuck. It is very hard to tell why, since no {{VLOG}} messages from the executor driver were logged, owing to {{GLOG_v=1}} not being set in the executor environment variables. 3. The agent, upon receiving a {{KillTask}} message from the scheduler, kept forwarding it to the executor. These messages are all best effort (fire and forget), meaning that if the executor is hung/unresponsive, there is no way for us to know except by looking at the driver logs. Since we did not have {{GLOG_v}} set, it is very hard to reason about whether the Executor received messages from the Agent and why it did not act upon them, i.e. send a {{TASK_KILLED}} status update back to the agent. > Can not kill task in Status STAGING > --- > > Key: MESOS-3766 > URL: https://issues.apache.org/jira/browse/MESOS-3766 > Project: Mesos > Issue Type: Bug > Components: general >Affects Versions: 0.25.0 > Environment: OSX >Reporter: Matthias Veit >Assignee: Niklas Quarfot Nielsen > Attachments: master.log.zip, slave.log.zip > > > I have created a simple Marathon Application with instance count 100 (100 > tasks) with a simple sleep command. Before all tasks were running, I killed > all tasks. This operation was successful, except for 2 tasks. These 2 tasks are > in state STAGING (according to the mesos UI). Marathon tries to kill those > tasks every 5 seconds (for over an hour now) - unsuccessfully.
> I picked one task and grepped the slave log: > {noformat} > I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour > I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80 > I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container > '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr > I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing > executor's forked pid 37096 to > '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks > I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000 > I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame > I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:28.118443 313872384 
slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf
[jira] [Updated] (MESOS-1478) Replace Master/Slave terminology
[ https://issues.apache.org/jira/browse/MESOS-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-1478: - Epic Name: Agent Rename > Replace Master/Slave terminology > > > Key: MESOS-1478 > URL: https://issues.apache.org/jira/browse/MESOS-1478 > Project: Mesos > Issue Type: Epic >Reporter: Clark Breyman >Assignee: Benjamin Hindman >Priority: Minor > Labels: mesosphere > > Inspired by the comments on this PR: > https://github.com/django/django/pull/2692 > TL;DR - Computers sharing work should be a good thing. Using the language of > human bondage and suffering is inappropriate in this context. It also has the > potential to alienate users and community members. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1478) Replace Master/Slave terminology
[ https://issues.apache.org/jira/browse/MESOS-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-1478: - Issue Type: Epic (was: Wish) > Replace Master/Slave terminology > > > Key: MESOS-1478 > URL: https://issues.apache.org/jira/browse/MESOS-1478 > Project: Mesos > Issue Type: Epic >Reporter: Clark Breyman >Assignee: Benjamin Hindman >Priority: Minor > Labels: mesosphere > > Inspired by the comments on this PR: > https://github.com/django/django/pull/2692 > TL;DR - Computers sharing work should be a good thing. Using the language of > human bondage and suffering is inappropriate in this context. It also has the > potential to alienate users and community members. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1607) Introduce optimistic offers.
[ https://issues.apache.org/jira/browse/MESOS-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967410#comment-14967410 ] Joseph Wu commented on MESOS-1607: -- We plan to release the MVP before the end of this year, so tentatively sometime between v0.26.0 and v0.28.0. > Introduce optimistic offers. > > > Key: MESOS-1607 > URL: https://issues.apache.org/jira/browse/MESOS-1607 > Project: Mesos > Issue Type: Epic > Components: allocation, framework, master >Reporter: Benjamin Hindman >Assignee: Artem Harutyunyan > Labels: mesosphere > Attachments: optimisitic-offers.pdf > > > *Background* > The current implementation of resource offers only enables a single framework > scheduler to make scheduling decisions for some available resources at a > time. In some circumstances, this is good, i.e., when we don't want other > framework schedulers to have access to some resources. However, in other > circumstances, there are advantages to letting multiple framework schedulers > attempt to make scheduling decisions for the _same_ allocation of resources > in parallel. > If you think about this from a "concurrency control" perspective, the current > implementation of resource offers is _pessimistic_: the resources contained > within an offer are _locked_ until the framework scheduler that they were > offered to launches tasks with them or declines them. In addition to making > pessimistic offers, we'd like to give out _optimistic_ offers, where the same > resources are offered to multiple framework schedulers at the same time, and > framework schedulers "compete" for those resources on a > first-come-first-served basis (i.e., the first to launch a task "wins"). We've > always reserved the right to rescind resource offers using the 'rescind' > primitive in the API, and a framework scheduler should be prepared to launch > a task and have those tasks be lost because another framework already started > to use those resources. 
> *Feature* > We plan to take a step towards optimistic offers by introducing primitives > that allow resources to be offered to multiple frameworks at once. At first, > we will use these primitives to optimistically allocate resources that are > reserved for a particular framework/role but have not been allocated by that > framework/role. > The work with optimistic offers will closely resemble the existing > oversubscription feature. Optimistically offered resources are likely to be > considered "revocable resources" (the concept that using resources not > reserved for you means you might get those resources revoked). In effect, we > may create something like a "spot" market for unused resources, driving > up utilization by letting frameworks that are willing to use revocable > resources run tasks. > *Future Work* > This ticket tracks the introduction of some aspects of optimistic offers. > Taken to the limit, one could imagine always making optimistic resource > offers. This bears a striking resemblance to the Google Omega model (an > isomorphism even). However, being able to configure which resources should be > allocated optimistically and which resources should be allocated > pessimistically gives even more control to a datacenter/cluster operator who > might want to, for example, never let multiple frameworks (roles) compete for > some set of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3752) CentOS 6 dependency install fails at Maven
[ https://issues.apache.org/jira/browse/MESOS-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967396#comment-14967396 ] Greg Mann commented on MESOS-3752: -- Ding-Yi Chen updated the Maven package, and it now successfully installs on CentOS 6, but I'm currently having some compilation errors that might be related, so leaving the ticket open for the time being. > CentOS 6 dependency install fails at Maven > -- > > Key: MESOS-3752 > URL: https://issues.apache.org/jira/browse/MESOS-3752 > Project: Mesos > Issue Type: Documentation >Reporter: Greg Mann >Assignee: Greg Mann > Labels: documentation, installation, mesosphere > > It seems the Apache Maven dependencies have changed such that following the > Getting Started docs for CentOS 6.6 will fail at Maven installation: > {code} > ---> Package apache-maven.noarch 0:3.3.3-2.el6 will be installed > --> Processing Dependency: java-devel >= 1:1.7.0 for package: > apache-maven-3.3.3-2.el6.noarch > --> Finished Dependency Resolution > Error: Package: apache-maven-3.3.3-2.el6.noarch (epel-apache-maven) >Requires: java-devel >= 1:1.7.0 >Available: java-1.5.0-gcj-devel-1.5.0.0-29.1.el6.x86_64 (base) >java-devel = 1.5.0 >Available: > 1:java-1.6.0-openjdk-devel-1.6.0.35-1.13.7.1.el6_6.x86_64 (base) >java-devel = 1:1.6.0 >Available: > 1:java-1.6.0-openjdk-devel-1.6.0.36-1.13.8.1.el6_7.x86_64 (updates) >java-devel = 1:1.6.0 > You could try using --skip-broken to work around the problem > You could try running: rpm -Va --nofiles --nodigest > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3326) Make use of C++11 atomics
[ https://issues.apache.org/jira/browse/MESOS-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-3326: Labels: mesosphere (was: ) > Make use of C++11 atomics > - > > Key: MESOS-3326 > URL: https://issues.apache.org/jira/browse/MESOS-3326 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Assignee: Neil Conway > Labels: mesosphere > Fix For: 0.26.0 > > > Now that we require C++11, we can make use of std::atomic. For example: > * libprocess/process.cpp uses a bare int + __sync_synchronize() for "running" > * __sync_synchronize() is used in logging.hpp in libprocess and fork.hpp in > stout > * sched/sched.cpp uses a volatile int for "running" -- this is wrong, > "volatile" is not sufficient to ensure safe concurrent access > * "volatile" is used in a few other places -- most are probably dubious but I > haven't looked closely -- This message was sent by Atlassian JIRA (v6.3.4#6332)
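The `volatile int` "running" flag pattern the ticket calls out could be rewritten with `std::atomic` along these lines (a minimal sketch, not the actual Mesos/libprocess code):

```cpp
#include <atomic>
#include <thread>

// A "running" flag with real concurrent-access guarantees, replacing a
// bare `volatile int` or an int plus __sync_synchronize(). `volatile`
// only inhibits certain compiler optimizations; std::atomic additionally
// provides the inter-thread visibility and ordering the old code needed.
std::atomic<bool> running(true);

void worker()
{
  while (running.load(std::memory_order_acquire)) {
    // Do one unit of work per iteration.
    std::this_thread::yield();
  }
}

void stop()
{
  // This release store pairs with the acquire load in worker(): anything
  // written before it is visible to worker() once it observes `false`.
  running.store(false, std::memory_order_release);
}
```

For a simple shutdown flag, the default sequentially consistent operations (`running = false;` / `while (running)`) would also be correct; the explicit acquire/release orderings just document the minimal guarantee relied upon.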
[jira] [Commented] (MESOS-2953) git rebase --continue does not trigger hooks
[ https://issues.apache.org/jira/browse/MESOS-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967270#comment-14967270 ] Joris Van Remoortere commented on MESOS-2953: - {code} commit 8de47a8ef27288d7660ee3a5e40874def912b8c2 Author: haosdent huang Date: Wed Oct 21 10:56:16 2015 -0400 Removed unnecessary exec in post-rewrite hook. Review: https://reviews.apache.org/r/39506 {code} > git rebase --continue does not trigger hooks > > > Key: MESOS-2953 > URL: https://issues.apache.org/jira/browse/MESOS-2953 > Project: Mesos > Issue Type: Improvement >Reporter: Joris Van Remoortere >Assignee: haosdent > > Currently there are no git hooks run when executing {{git rebase > --continue}}. We do run hooks on {{git commit}}. > It would help prevent errors if we could also run some of these hooks on the > {{git rebase --continue}} flow as this one is rather common. > I believe we can use the 'post-rewrite' hook to accomplish this. It will not > necessarily unwind the commit, but at least give us the opportunity to print > warning messages. > If this is not desirable / feasible I would like to propose running the hooks > as part of post-reviews instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
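For reference, the {{post-rewrite}} mechanism discussed in this ticket can be sketched as below. This is a minimal illustration, not Mesos' actual hook: the style-check command is a placeholder, and only the hook path and the "amend"/"rebase" argument are standard git behavior.

```shell
# Install a minimal post-rewrite hook into a scratch directory to show the
# mechanism. Git invokes .git/hooks/post-rewrite after `git commit --amend`
# (first argument "amend") and after a rebase finishes -- including via
# `git rebase --continue` (first argument "rebase").
repo=$(mktemp -d)
mkdir -p "$repo/.git/hooks"

cat > "$repo/.git/hooks/post-rewrite" <<'EOF'
#!/bin/sh
if [ "$1" = "rebase" ]; then
    echo "post-rewrite: rebase finished, re-running checks" >&2
    # ./support/mesos-style.py   # placeholder for the project's real checks
fi
EOF
chmod +x "$repo/.git/hooks/post-rewrite"

# The hook is a plain script, so it can be exercised directly:
"$repo/.git/hooks/post-rewrite" rebase 2>&1
```

Note that, as the ticket says, post-rewrite cannot unwind the commit; it can only warn after the fact.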
[jira] [Created] (MESOS-3785) Use URI content modification time to trigger fetcher cache updates.
Bernd Mathiske created MESOS-3785: - Summary: Use URI content modification time to trigger fetcher cache updates. Key: MESOS-3785 URL: https://issues.apache.org/jira/browse/MESOS-3785 Project: Mesos Issue Type: Improvement Components: fetcher Reporter: Bernd Mathiske Assignee: Benjamin Bannier Instead of using checksums to trigger fetcher cache updates, we can for starters use the content modification time (mtime), which is available for a number of download protocols, e.g. HTTP and HDFS. Proposal: Instead of just fetching the content size, we fetch both size and mtime together. As before, if there is no size, then caching fails and we fall back on direct downloading to the sandbox. Assuming a size is given, we compare the mtime from the fetch URI with the mtime known to the cache. If it differs, we update the cache. (As a defensive measure, a difference in size should also trigger an update.) Not having an mtime available at the fetch URI is simply treated as a unique valid mtime value that differs from all others. This means that when initially there is no mtime, cache content remains valid until there is one. Thereafter, a renewed lack of an mtime invalidates the cache once. In other words: any change from no mtime to having one or back is the same as encountering a new mtime. Note that this scheme does not require any new protobuf fields. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
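The invalidation rule proposed above can be sketched in a few lines. The function name and the empty-string encoding of "no mtime" are illustrative choices for this sketch, not part of the actual fetcher code:

```shell
# Sketch of the proposed rule: a cache entry is stale whenever size or mtime
# differs, where an empty string stands in for "no mtime available at the
# URI". Any transition to or from "no mtime" therefore also invalidates,
# matching the "unique valid mtime value" wording in the proposal.
cache_is_stale() {
    cached_size=$1; cached_mtime=$2; new_size=$3; new_mtime=$4
    [ "$cached_size" != "$new_size" ] || [ "$cached_mtime" != "$new_mtime" ]
}

cache_is_stale 1024 "2015-10-20T10:00" 1024 "2015-10-21T09:00" && echo "refetch"
cache_is_stale 1024 "" 1024 "" || echo "still valid"
cache_is_stale 1024 "" 1024 "2015-10-21T09:00" && echo "mtime appeared: refetch"
```

The defensive size comparison is folded into the same predicate, so a size change alone also triggers an update, as the proposal asks.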
[jira] [Created] (MESOS-3784) Replace Master/Slave Terminology Phase I - Update mesos-cli
Diana Arroyo created MESOS-3784: --- Summary: Replace Master/Slave Terminology Phase I - Update mesos-cli Key: MESOS-3784 URL: https://issues.apache.org/jira/browse/MESOS-3784 Project: Mesos Issue Type: Task Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3783) Replace Master/Slave Terminology Phase I - Update documentation
Diana Arroyo created MESOS-3783: --- Summary: Replace Master/Slave Terminology Phase I - Update documentation Key: MESOS-3783 URL: https://issues.apache.org/jira/browse/MESOS-3783 Project: Mesos Issue Type: Task Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags
Diana Arroyo created MESOS-3781: --- Summary: Replace Master/Slave Terminology Phase I - Add duplicate agent flags Key: MESOS-3781 URL: https://issues.apache.org/jira/browse/MESOS-3781 Project: Mesos Issue Type: Task Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3782) Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create symlinks)
Diana Arroyo created MESOS-3782: --- Summary: Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create symlinks) Key: MESOS-3782 URL: https://issues.apache.org/jira/browse/MESOS-3782 Project: Mesos Issue Type: Task Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3780) Replace Master/Slave Terminology Phase I - Update all strings output
Diana Arroyo created MESOS-3780: --- Summary: Replace Master/Slave Terminology Phase I - Update all strings output Key: MESOS-3780 URL: https://issues.apache.org/jira/browse/MESOS-3780 Project: Mesos Issue Type: Task Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3778) Replace Master/Slave Terminology Phase I - Add duplicate HTTP endpoints
Diana Arroyo created MESOS-3778: --- Summary: Replace Master/Slave Terminology Phase I - Add duplicate HTTP endpoints Key: MESOS-3778 URL: https://issues.apache.org/jira/browse/MESOS-3778 Project: Mesos Issue Type: Task Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3777) Replace Master/Slave Terminology Phase I - Modify public interfaces
Diana Arroyo created MESOS-3777: --- Summary: Replace Master/Slave Terminology Phase I - Modify public interfaces Key: MESOS-3777 URL: https://issues.apache.org/jira/browse/MESOS-3777 Project: Mesos Issue Type: Task Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3779) Replace Master/Slave Terminology Phase I - Update webui
Diana Arroyo created MESOS-3779: --- Summary: Replace Master/Slave Terminology Phase I - Update webui Key: MESOS-3779 URL: https://issues.apache.org/jira/browse/MESOS-3779 Project: Mesos Issue Type: Task Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2092) Make ACLs dynamic
[ https://issues.apache.org/jira/browse/MESOS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Qiao Wang reassigned MESOS-2092: - Assignee: Yong Qiao Wang > Make ACLs dynamic > - > > Key: MESOS-2092 > URL: https://issues.apache.org/jira/browse/MESOS-2092 > Project: Mesos > Issue Type: Task > Components: security >Reporter: Alexander Rukletsov >Assignee: Yong Qiao Wang > Labels: mesosphere, newbie > > Master loads ACLs once during its launch and there is no way to update them > in a running master. Making them dynamic will allow for updating ACLs on the > fly, for example granting a new framework necessary rights. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3706) Tasks stuck in staging.
[ https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966818#comment-14966818 ] haosdent commented on MESOS-3706: - As you say, stdout and stderr are empty. Could you try using strace or gdb to attach to the mesos-docker-executor and see where it hangs? For example, using the pid from the description, run {noformat} strace -p 35360 {noformat} to find which syscall mesos-docker-executor blocks on, and then post the result here. > Tasks stuck in staging. > --- > > Key: MESOS-3706 > URL: https://issues.apache.org/jira/browse/MESOS-3706 > Project: Mesos > Issue Type: Bug > Components: docker, slave >Affects Versions: 0.23.0, 0.24.1 >Reporter: Jord Sonneveld > Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, Screen Shot > 2015-10-12 at 9.24.32 AM.png, docker.txt, mesos-slave.INFO, > mesos-slave.INFO.2, mesos-slave.INFO.3, stderr, stdout > > > I have a docker image which starts fine on all my slaves except for one. On > that one, it is stuck in STAGING for a long time and never starts. The INFO > log is full of messages like this: > I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task > kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework > 20150109-172016-504433162-5050-19367-0002 > E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12: > Transport endpoint is not connected [107] > kwe-vinland-work is the task that is stuck in staging. It is launched by > marathon. I have launched 161 instances successfully on my cluster. But it > refuses to launch on this specific slave. > These machines are all managed via ansible so their configurations are / > should be identical. I have re-run my ansible scripts and rebooted the > machines to no avail. > It's been in this state for almost 30 minutes. 
You can see the mesos docker > executor is still running: > jord@dalstgmesos03:~$ date > Mon Oct 12 16:13:55 UTC 2015 > jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland > root 35360 0.0 0.0 1070576 21476 ? Ssl 15:46 0:00 > mesos-docker-executor > --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe > --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox > --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe > --stop_timeout=0ns > According to docker ps -a, nothing was ever even launched: > jord@dalstgmesos03:/data/mesos$ sudo docker ps -a > CONTAINER IDIMAGE > COMMAND CREATED STATUS PORTS > NAMES > 5c858b90b0a0registry.roger.dal.moz.com:5000/moz-statsd-v0.22 > "/bin/sh -c ./start.s" 39 minutes ago Up 39 minutes > 0.0.0.0:9125->8125/udp, 0.0.0.0:9126->8126/tcp statsd-fe-influxdb > d765ba3829fdregistry.roger.dal.moz.com:5000/moz-statsd-v0.22 > "/bin/sh -c ./start.s" 41 minutes ago Up 41 minutes > 0.0.0.0:8125->8125/udp, 0.0.0.0:8126->8126/tcp statsd-repeater > Those are the only two entries. Nothing about the kwe-vinland job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1832) Slave should accept PingSlaveMessage but not "PING" message.
[ https://issues.apache.org/jira/browse/MESOS-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966799#comment-14966799 ] Yong Qiao Wang commented on MESOS-1832: --- [~vinodkone], append the related RR for this ticket: https://reviews.apache.org/r/39516/ > Slave should accept PingSlaveMessage but not "PING" message. > > > Key: MESOS-1832 > URL: https://issues.apache.org/jira/browse/MESOS-1832 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Yong Qiao Wang > Labels: mesosphere > > Slave handles both "PING" message and PingSlaveMessage in until 0.22.0 for > backwards compatibility (https://reviews.apache.org/r/25867/). > In 0.23.0, slave no longer needs handle "PING". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3224) Create a Mesos Contributor Newbie Guide
[ https://issues.apache.org/jira/browse/MESOS-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966771#comment-14966771 ] haosdent commented on MESOS-3224: - Move to github pull request or review board? > Create a Mesos Contributor Newbie Guide > --- > > Key: MESOS-3224 > URL: https://issues.apache.org/jira/browse/MESOS-3224 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Timothy Chen >Assignee: Diana Arroyo > > Currently the website doesn't have a helpful guide for community users to > know how to start learning to contribute to Mesos, understand the concepts > and lower the barrier to get involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3688) Get Container Name information when launching a container task
[ https://issues.apache.org/jira/browse/MESOS-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966746#comment-14966746 ] Raffaele Di Fazio commented on MESOS-3688: -- Do you have an update on this? I'm just checking whether you need more information from me. > Get Container Name information when launching a container task > -- > > Key: MESOS-3688 > URL: https://issues.apache.org/jira/browse/MESOS-3688 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 0.24.1 >Reporter: Raffaele Di Fazio > Labels: mesosphere > > We want to get the Docker Name (or Docker ID, or both) when launching a > container task with mesos. The container name is generated by mesos itself > (i.e. mesos-77e5fde6-83e7-4618-a2dd-d5b10f2b4d25, obtained with "docker ps") > and it would be nice to expose this information to frameworks so that this > information can be used, for example by Marathon to give this information to > users via a REST API. > To go a bit in depth with our use case, we have files created by fluentd > logdriver that are named with Docker Name or Docker ID (full or short) and we > need a mapping for the users of the REST API and thus the first step is to > make this information available from mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3776) Support SELinux docker volume modes
James Findley created MESOS-3776: Summary: Support SELinux docker volume modes Key: MESOS-3776 URL: https://issues.apache.org/jira/browse/MESOS-3776 Project: Mesos Issue Type: Bug Components: docker Reporter: James Findley Priority: Minor Since docker 1.7, two additional volume modes are supported on top of 'ro' and 'rw': 'z' and 'Z'. These set the SELinux mode of the volume to be accessible from every container or just this container, respectively. See http://www.projectatomic.io/blog/2015/06/using-volumes-with-docker-can-cause-problems-with-selinux/ for more info on this. It would be great if mesos were to support these volume modes for better container security. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
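For illustration, the 'z' and 'Z' modes attach as suffixes to the existing volume syntax. The docker invocations below are shown commented out since they need docker >= 1.7 on an SELinux-enabled host; the mode string itself is easy to inspect:

```shell
# How the extra modes appear on the docker CLI (docker >= 1.7):
#   docker run -v /data/on/host:/data:z ...  # relabel shared: any container may use it
#   docker run -v /data/on/host:/data:Z ...  # relabel private to this one container
# Supporting this in Mesos would mean accepting and passing through these
# suffixes in addition to the current 'ro'/'rw' modes.

vol="/data/on/host:/data:z"
mode=${vol##*:}        # strip everything up to the last ':'
echo "$mode"           # → z
```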
[jira] [Created] (MESOS-3775) MasterAllocatorTest.SlaveLost is slow
Alexander Rukletsov created MESOS-3775: -- Summary: MasterAllocatorTest.SlaveLost is slow Key: MESOS-3775 URL: https://issues.apache.org/jira/browse/MESOS-3775 Project: Mesos Issue Type: Bug Components: technical debt, test Reporter: Alexander Rukletsov Priority: Minor The {{MasterAllocatorTest.SlaveLost}} takes more than {{5s}} to complete. A brief look into the code hints that the stopped agent does not quit immediately (and hence its resources are not released by the allocator) because [it waits for the executor to terminate|https://github.com/apache/mesos/blob/master/src/tests/master_allocator_tests.cpp#L717]. The {{5s}} timeout comes from the {{EXECUTOR_SHUTDOWN_GRACE_PERIOD}} agent constant. Possible solutions: * Do not wait until the stopped agent quits (can be flaky, needs deeper analysis). * Decrease the agent's {{executor_shutdown_grace_period}} flag. * Terminate the executor faster (this may require some refactoring since the executor driver is created in the {{TestContainerizer}} and we do not have direct access to it). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2936) Create a design document for Quota support in Master
[ https://issues.apache.org/jira/browse/MESOS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966524#comment-14966524 ] Yong Qiao Wang commented on MESOS-2936: --- Hey [~alexr], we are planning to add a separate endpoint /roles to add/remove roles in the Dynamic Role Configuration project (MESOS-3177). I found that in the quota request design the endpoint can also be used to add a role, so there is some overlap between these two projects. In my understanding, quota should be viewed as an attribute of a role, so in the quota management project the quota update action (PUT) should be enough to manage (add/update/remove) the quota of an existing role; if the role does not exist, you would first create it via the /roles endpoint before configuring its quota. So maybe we need to remove the quota request action from the quota management endpoint. [~alexr], any thoughts on this? In addition, you are welcome to review the dynamic role configuration design; your comments are important to me. Thanks in advance. > Create a design document for Quota support in Master > > > Key: MESOS-2936 > URL: https://issues.apache.org/jira/browse/MESOS-2936 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: mesosphere > > Create a design document for the Quota feature support in Mesos Master > (excluding allocator) to be shared with the Mesos community. > Design Doc: > https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2742) Architecture doc on global resources
[ https://issues.apache.org/jira/browse/MESOS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966471#comment-14966471 ] Klaus Ma commented on MESOS-2742: - [~jieyu]/[~vi...@twitter.com], the draft document on Global Resources has been uploaded; would you add some input on it? > Architecture doc on global resources > > > Key: MESOS-2742 > URL: https://issues.apache.org/jira/browse/MESOS-2742 > Project: Mesos > Issue Type: Task >Reporter: Niklas Quarfot Nielsen >Assignee: Joerg Schad > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
[ https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966463#comment-14966463 ] Liqiang Lin commented on MESOS-3747: RR: https://reviews.apache.org/r/39514/ > HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string > - > > Key: MESOS-3747 > URL: https://issues.apache.org/jira/browse/MESOS-3747 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.0, 0.24.1, 0.25.0 >Reporter: Ben Whitehead >Assignee: Liqiang Lin >Priority: Blocker > > When using libmesos a framework can set its user to {{""}} (empty string) to > inherit the user the agent processes is running as, this behavior now results > in a {{TASK_FAILED}}. > Full messages and relevant agent logs below. > The error returned to the framework tells me nothing about the user not > existing on the agent host instead it tells me the container died due to OOM. > {code:title=FrameworkInfo} > call { > type: SUBSCRIBE > subscribe: { > frameworkInfo: { > user: "", > name: "testing" > } > } > } > {code} > {code:title=TaskInfo} > call { > framework_id { value: "20151015-125949-16777343-5050-20146-" }, > type: ACCEPT, > accept { > offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }], > operations { > type: LAUNCH, > launch { > task_infos [ > { > name: "task-1", > task_id: { value: "task-1" }, > agent_id: { value: > "20151015-125949-16777343-5050-20146-S0" }, > resources [ > { name: "cpus", type: SCALAR, scalar: { value: > 0.1 }, role: "*" }, > { name: "mem", type: SCALAR, scalar: { value: > 64.0 }, role: "*" }, > { name: "disk", type: SCALAR, scalar: { value: > 0.0 }, role: "*" }, > ], > command: { > environment { > variables [ > { name: "SLEEP_SECONDS" value: "15" } > ] > }, > value: "env | sort && sleep $SLEEP_SECONDS" > } > } > ] > } > } > } > } > {code} > {code:title=Update Status} > event: { > type: UPDATE, > update: { > status: { > task_id: { value: "task-1" }, > state: 
TASK_FAILED, > message: "Container destroyed while preparing isolators", > agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, > timestamp: 1.444939217401241E9, > executor_id: { value: "task-1" }, > source: SOURCE_AGENT, > reason: REASON_MEMORY_LIMIT, > uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" > } > } > } > {code} > {code:title=agent logs} > I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- > I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- > W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory > '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b': > Failed to get user information for '': Success > I1015 13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources > cpus(*):0.1; mem(*):32 in work directory > '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b' > I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for > executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6- > I1015 13:15:34.262684 19638 docker.cpp:734] No container info found, skipping > launch > I1015 13:15:34.263478 19638 containerizer.cpp:640] Starting container > '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework > 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-' > E1015 13:15:34.264516 19641 slave.cpp:3342] Container > '3958ff84-8dd9-4c3c-995d-5aba5250541b' for executor 'task-1' of framework > 'e4de5b96-41cc-4713-af44-7cffbdd63ba6-'
[jira] [Commented] (MESOS-3477) Add design doc for roles/weights configuration
[ https://issues.apache.org/jira/browse/MESOS-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966441#comment-14966441 ] Yong Qiao Wang commented on MESOS-3477: --- Hi [~adam-mesos], [~cmaloney], the design doc has been updated; could you give it another review? Any comments are welcome. Thanks! > Add design doc for roles/weights configuration > -- > > Key: MESOS-3477 > URL: https://issues.apache.org/jira/browse/MESOS-3477 > Project: Mesos > Issue Type: Documentation > Components: master >Reporter: Yong Qiao Wang >Assignee: Yong Qiao Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3747) HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string
[ https://issues.apache.org/jira/browse/MESOS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966408#comment-14966408 ] Liqiang Lin commented on MESOS-3747: Yes. If --[no-]switch_user is false, tasks will run as the same user as the Mesos agent process. Neither the framework scheduler nor the Mesos master can know which Mesos agents have --[no-]switch_user set to true and which have it set to false, so we should pass the framework user info along anyway and let the Mesos agent decide whether to switch to the framework user or not. If the framework user does not exist on that agent, just fail the framework's tasks, as [~gyliu] posted. > HTTP Scheduler API no longer allows FrameworkInfo.user to be empty string > - > > Key: MESOS-3747 > URL: https://issues.apache.org/jira/browse/MESOS-3747 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.0, 0.24.1, 0.25.0 >Reporter: Ben Whitehead >Assignee: Liqiang Lin >Priority: Blocker > > When using libmesos a framework can set its user to {{""}} (empty string) to > inherit the user the agent processes is running as, this behavior now results > in a {{TASK_FAILED}}. > Full messages and relevant agent logs below. > The error returned to the framework tells me nothing about the user not > existing on the agent host instead it tells me the container died due to OOM. 
> {code:title=FrameworkInfo} > call { > type: SUBSCRIBE > subscribe: { > frameworkInfo: { > user: "", > name: "testing" > } > } > } > {code} > {code:title=TaskInfo} > call { > framework_id { value: "20151015-125949-16777343-5050-20146-" }, > type: ACCEPT, > accept { > offer_ids: [{ value: "20151015-125949-16777343-5050-20146-O0" }], > operations { > type: LAUNCH, > launch { > task_infos [ > { > name: "task-1", > task_id: { value: "task-1" }, > agent_id: { value: > "20151015-125949-16777343-5050-20146-S0" }, > resources [ > { name: "cpus", type: SCALAR, scalar: { value: > 0.1 }, role: "*" }, > { name: "mem", type: SCALAR, scalar: { value: > 64.0 }, role: "*" }, > { name: "disk", type: SCALAR, scalar: { value: > 0.0 }, role: "*" }, > ], > command: { > environment { > variables [ > { name: "SLEEP_SECONDS" value: "15" } > ] > }, > value: "env | sort && sleep $SLEEP_SECONDS" > } > } > ] > } > } > } > } > {code} > {code:title=Update Status} > event: { > type: UPDATE, > update: { > status: { > task_id: { value: "task-1" }, > state: TASK_FAILED, > message: "Container destroyed while preparing isolators", > agent_id: { value: "20151015-125949-16777343-5050-20146-S0" }, > timestamp: 1.444939217401241E9, > executor_id: { value: "task-1" }, > source: SOURCE_AGENT, > reason: REASON_MEMORY_LIMIT, > uuid: "\237g()L\026EQ\222\301\261\265\\\221\224|" > } > } > } > {code} > {code:title=agent logs} > I1015 13:15:34.260592 19639 slave.cpp:1270] Got assigned task task-1 for > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- > I1015 13:15:34.260921 19639 slave.cpp:1386] Launching task task-1 for > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- > W1015 13:15:34.262243 19639 paths.cpp:423] Failed to chown executor directory > '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b': > Failed to get user information for '': Success > I1015 
13:15:34.262444 19639 slave.cpp:4852] Launching executor task-1 of > framework e4de5b96-41cc-4713-af44-7cffbdd63ba6- with resources > cpus(*):0.1; mem(*):32 in work directory > '/home/ben.whitehead/opt/mesos/work/slave/work_dir/slaves/e4de5b96-41cc-4713-af44-7cffbdd63ba6-S0/frameworks/e4de5b96-41cc-4713-af44-7cffbdd63ba6-/executors/task-1/runs/3958ff84-8dd9-4c3c-995d-5aba5250541b' > I1015 13:15:34.262581 19639 slave.cpp:1604] Queuing task 'task-1' for > executor task-1 of framework 'e4de5b96-41cc-4713-af44-7cffbdd63ba6- > I1015 13:15:34.262684 19638 docker.cpp:734] No container
[jira] [Commented] (MESOS-3766) Can not kill task in Status STAGING
[ https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966372#comment-14966372 ] Matthias Veit commented on MESOS-3766: -- [~nnielsen] Added the complete master and slave logs. I killed the mesos master and slave processes, so I can't query the endpoints any longer. > Can not kill task in Status STAGING > --- > > Key: MESOS-3766 > URL: https://issues.apache.org/jira/browse/MESOS-3766 > Project: Mesos > Issue Type: Bug > Components: general >Affects Versions: 0.25.0 > Environment: OSX >Reporter: Matthias Veit >Assignee: Niklas Quarfot Nielsen > Attachments: master.log.zip, slave.log.zip > > > I have created a simple Marathon Application with instance count 100 (100 > tasks) with a simple sleep command. Before all tasks were running, I killed > all tasks. This operation was successful, except 2 tasks. These 2 tasks are > in state STAGING (according to the mesos UI). Marathon tries to kill those > tasks every 5 seconds (for over an hour now) - unsuccessfully. 
> I picked one task and grepped the slave log: > {noformat} > I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour > I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80 > I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container > '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr > I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing > executor's forked pid 37096 to > '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks > I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000 > I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame > I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:28.118443 313872384 
slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:18.316463 3160
[jira] [Commented] (MESOS-3506) Build instructions for CentOS 6.6 should include `sudo yum update`
[ https://issues.apache.org/jira/browse/MESOS-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966366#comment-14966366 ] Adam B commented on MESOS-3506: --- That's a good point. We could instead just recommend `sudo yum update someDependency`, if we can figure out what that necessary dependency is. > Build instructions for CentOS 6.6 should include `sudo yum update` > -- > > Key: MESOS-3506 > URL: https://issues.apache.org/jira/browse/MESOS-3506 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Greg Mann >Assignee: Greg Mann > Labels: documentation, mesosphere > > Neglecting to run {{sudo yum update}} on CentOS 6.6 currently causes the > build to break when building {{mesos-0.25.0.jar}}. The build instructions for > this platform on the Getting Started page should be changed accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3766) Can not kill task in Status STAGING
[ https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Veit updated MESOS-3766: - Attachment: master.log.zip > Can not kill task in Status STAGING > --- > > Key: MESOS-3766 > URL: https://issues.apache.org/jira/browse/MESOS-3766 > Project: Mesos > Issue Type: Bug > Components: general >Affects Versions: 0.25.0 > Environment: OSX >Reporter: Matthias Veit >Assignee: Niklas Quarfot Nielsen > Attachments: master.log.zip, slave.log.zip > > > I have created a simple Marathon Application with instance count 100 (100 > tasks) with a simple sleep command. Before all tasks were running, I killed > all tasks. This operation was successful, except 2 tasks. These 2 tasks are > in state STAGING (according to the mesos UI). Marathon tries to kill those > tasks every 5 seconds (for over an hour now) - unsuccessfully. > I picked one task and grepped the slave log: > {noformat} > I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour > I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80 > I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container > '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr > I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing > executor's forked pid 37096 to > '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks > 
I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000 > I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame > I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 
80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:18.316463 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:23.337116 313872384
[jira] [Updated] (MESOS-3766) Can not kill task in Status STAGING
[ https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Veit updated MESOS-3766: - Attachment: slave.log.zip > Can not kill task in Status STAGING > --- > > Key: MESOS-3766 > URL: https://issues.apache.org/jira/browse/MESOS-3766 > Project: Mesos > Issue Type: Bug > Components: general >Affects Versions: 0.25.0 > Environment: OSX >Reporter: Matthias Veit >Assignee: Niklas Quarfot Nielsen > Attachments: slave.log.zip > > > I have created a simple Marathon Application with instance count 100 (100 > tasks) with a simple sleep command. Before all tasks were running, I killed > all tasks. This operation was successful, except 2 tasks. These 2 tasks are > in state STAGING (according to the mesos UI). Marathon tries to kill those > tasks every 5 seconds (for over an hour now) - unsuccessfully. > I picked one task and grepped the slave log: > {noformat} > I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour > I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80 > I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container > '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr > I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing > executor's forked pid 37096 to > '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks > I1020 12:39:41.744439 
313335808 slave.cpp:2379] Got registration for executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000 > I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor > 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame > I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 
12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:18.316463 316018688 slave.cpp:1789] Asked to kill task > app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework > 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- > I1020 12:41:23.337116 313872384 slave.cpp:1789]
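The reporter's triage step above ("I picked one task and grepped the slave log") can be sketched as follows. This is illustrative only: the two sample lines stand in for the attached slave.log, and the task id is the one quoted in the ticket.

```shell
# Recreate a two-line stand-in for the attached slave.log, then count how
# many entries mention the stuck task (the real log shows a kill request
# roughly every five seconds, indefinitely).
cat > slave.log <<'EOF'
I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
EOF
grep -c 'Asked to kill task app.dc98434b-7716-11e5-a5fc-1ea69edef42d' slave.log
```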
[jira] [Comment Edited] (MESOS-2315) Deprecate / Remove CommandInfo::ContainerInfo
[ https://issues.apache.org/jira/browse/MESOS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966347#comment-14966347 ] Adam B edited comment on MESOS-2315 at 10/21/15 7:13 AM: - Sounds about right, but.. The first example "with old ContainerInfo" doesn't actually use the old ContainerInfo. I don't remember the exact semantics for how various containerizers used the options field ([~tillt] might), but you should probably have something like: {code} CommandInfo::ContainerInfo containerInfo; containerInfo.set_image("busybox"); task.mutable_command()->mutable_container()->CopyFrom(containerInfo); {code} Also, if you're going to use ContainerInfo::MesosInfo, you should probably set the image, or at least add a volume. Otherwise, you're just getting the default Mesos containerization behavior, which you get even without a ContainerInfo. was (Author: adam-mesos): Sounds about right, but.. The first example "with old ContainerInfo" doesn't actually use the old ContainerInfo. I don't remember the exact semantics for how various containerizers used the options field ([~tillt] might), but you should probably have something like: ``` CommandInfo::ContainerInfo containerInfo; containerInfo.set_image("busybox"); task.mutable_command()->mutable_container()->CopyFrom(containerInfo); ``` Also, if you're going to use ContainerInfo::MesosInfo, you should probably set the image, or at least add a volume. Otherwise, you're just getting the default Mesos containerization behavior, which you get even without a ContainerInfo. > Deprecate / Remove CommandInfo::ContainerInfo > - > > Key: MESOS-2315 > URL: https://issues.apache.org/jira/browse/MESOS-2315 > Project: Mesos > Issue Type: Task >Reporter: Ian Downes >Assignee: Vaibhav Khanduja >Priority: Minor > Labels: mesosphere, newbie > > IIUC this has been deprecated and all current code (except > examples/docker_no_executor_framework.cpp) uses the top-level ContainerInfo? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2315) Deprecate / Remove CommandInfo::ContainerInfo
[ https://issues.apache.org/jira/browse/MESOS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966347#comment-14966347 ] Adam B commented on MESOS-2315: --- Sounds about right, but.. The first example "with old ContainerInfo" doesn't actually use the old ContainerInfo. I don't remember the exact semantics for how various containerizers used the options field ([~tillt] might), but you should probably have something like: ``` CommandInfo::ContainerInfo containerInfo; containerInfo.set_image("busybox"); task.mutable_command()->mutable_container()->CopyFrom(containerInfo); ``` Also, if you're going to use ContainerInfo::MesosInfo, you should probably set the image, or at least add a volume. Otherwise, you're just getting the default Mesos containerization behavior, which you get even without a ContainerInfo. > Deprecate / Remove CommandInfo::ContainerInfo > - > > Key: MESOS-2315 > URL: https://issues.apache.org/jira/browse/MESOS-2315 > Project: Mesos > Issue Type: Task >Reporter: Ian Downes >Assignee: Vaibhav Khanduja >Priority: Minor > Labels: mesosphere, newbie > > IIUC this has been deprecated and all current code (except > examples/docker_no_executor_framework.cpp) uses the top-level ContainerInfo? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
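The comment above contrasts the deprecated CommandInfo::ContainerInfo with the top-level ContainerInfo. A minimal sketch of the replacement path is below; it assumes the ContainerInfo, Volume, and TaskInfo messages from mesos.proto, requires the Mesos protobuf headers to compile, and the host path is hypothetical. Adding a volume is what makes this differ from the default Mesos containerization behavior, per the comment.

```cpp
// Sketch only: assumes the generated Mesos protobuf headers are available.
#include <mesos/mesos.pb.h>

void setTopLevelContainerInfo(mesos::TaskInfo* task)
{
  // Top-level ContainerInfo, replacing the deprecated
  // CommandInfo::ContainerInfo hung off task.command().
  mesos::ContainerInfo container;
  container.set_type(mesos::ContainerInfo::MESOS);

  // Add a volume so this actually changes behavior versus having no
  // ContainerInfo at all.
  mesos::Volume* volume = container.add_volumes();
  volume->set_container_path("data");
  volume->set_host_path("/var/data");  // hypothetical host path
  volume->set_mode(mesos::Volume::RW);

  task->mutable_container()->CopyFrom(container);
}
```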