[jira] [Commented] (MESOS-2735) Change the interaction between the slave and the resource estimator from polling to pushing
[ https://issues.apache.org/jira/browse/MESOS-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551804#comment-14551804 ]

Vinod Kone commented on MESOS-2735:
-----------------------------------

Thanks Ben for the comments.

I think the main motivations for the push model were to 1) make the slave logic that interfaces with the estimator simple to write and 2) make estimator modules simple to write. Originally, with the pull model, it looked like we would need 2 intervals within the slave: one for the slave sending estimates to the master and one for the slave getting estimates from the estimator. But if we assume that the estimators will be well behaved, then we don't need an interval for the latter.

The other issue, as you discussed in your comment, was about DoS. It *looked* like both the push and pull models had the same scope for DoS on the slave, so we didn't find a compelling reason to go for pull, because push was easier to implement on both sides of the interface. I said *looked* because, after processing your comments, I realized that the DoS behavior is different in push vs pull. In a push model a misbehaving estimator could cause head-of-line blocking of other messages enqueued on the slave's queue, whereas in the pull model head-of-line blocking is not possible because the next (deferred) pull will be enqueued behind all the other messages.

So, I'm OK going with pull for safety. Also, the composition argument can't be denied.

Btw, the inspiration for the push model came from the allocator (and to a lesser extent Mesos class), which I think is very close to the estimator in terms of interactions.

[~jieyu], ok with this?

> Change the interaction between the slave and the resource estimator from polling to pushing
> -------------------------------------------------------------------------------------------
>
> Key: MESOS-2735
> URL: https://issues.apache.org/jira/browse/MESOS-2735
> Project: Mesos
> Issue Type: Bug
> Reporter: Jie Yu
> Assignee: Jie Yu
> Labels: twitter
>
> This will make the semantics more clear. The resource estimator can control the speed of sending resources estimation to the slave.
> To avoid cyclic dependency, slave will register a callback with the resource estimator and the resource estimator will simply invoke that callback when there's a new estimation ready. The callback will be a defer to the slave's main event queue.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
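The head-of-line blocking difference between push and pull can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the slave's event queue is modelled as a plain `std::deque` of strings, nothing here uses libprocess, and the names `simulatePush`, `simulatePull`, and the message labels are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Toy model (assumption): the slave's event queue is a FIFO of labels.
using EventQueue = std::deque<std::string>;

// Push model: a misbehaving estimator enqueues `burst` estimates at once,
// and only then does an unrelated message arrive. Returns how many
// estimates are processed before that unrelated message is handled.
std::size_t simulatePush(std::size_t burst) {
  EventQueue queue;
  for (std::size_t i = 0; i < burst; ++i) {
    queue.push_back("estimate");
  }
  queue.push_back("other");

  std::size_t processedBefore = 0;
  while (!queue.empty()) {
    std::string message = queue.front();
    queue.pop_front();
    if (message == "other") {
      break;
    }
    ++processedBefore;
  }
  return processedBefore;
}

// Pull model: the slave handles one estimate and only then enqueues the
// next (deferred) pull, which lands behind every message already queued.
std::size_t simulatePull(std::size_t rounds) {
  assert(rounds > 0);
  EventQueue queue;
  queue.push_back("pull");
  queue.push_back("other");

  std::size_t processedBefore = 0;
  std::size_t pulls = 0;
  while (!queue.empty()) {
    std::string message = queue.front();
    queue.pop_front();
    if (message == "other") {
      break;
    }
    ++processedBefore;
    if (++pulls < rounds) {
      queue.push_back("pull");  // Enqueued behind "other".
    }
  }
  return processedBefore;
}
```

In this model `simulatePush(1000)` reports 1000 estimates ahead of the unrelated message, while `simulatePull(1000)` reports 1, however many rounds are requested.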
[jira] [Commented] (MESOS-354) Oversubscribe resources
[ https://issues.apache.org/jira/browse/MESOS-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551732#comment-14551732 ]

Joris Van Remoortere commented on MESOS-354:
--------------------------------------------

[~vinodkone] I think the allocator logic generally makes sense. I would just call out that we will likely want to treat revocable_available differently for resources coming from the resource estimator as opposed to optimistic offers. The reason for that is:
1) resource estimator updates are a "rude edit", as in they purely overwrite the revocable resources
2) resources from optimistic offers are increased / decreased based on allocation by the original owner of the resources.

The same way that we expect the Revocable resources to be flagged differently in the offer protobuf, I think we may want to either:
1) have separate pools of revocable resources available in the allocator for each source (lender?) of the resource
OR
2) ensure that all revocable resources are introduced into the allocator the same way (as in rude edits, or deltas).

In general, though, I think the behavior is common between them. What do you think?

> Oversubscribe resources
> -----------------------
>
> Key: MESOS-354
> URL: https://issues.apache.org/jira/browse/MESOS-354
> Project: Mesos
> Issue Type: Epic
> Components: isolation, master, slave
> Reporter: brian wickman
> Priority: Minor
> Labels: mesosphere, twitter
> Attachments: mesos_virtual_offers.pdf
>
> This proposal is predicated upon offer revocation.
> The idea would be to add a new "revoked" status either by (1) piggybacking off an existing status update (TASK_LOST or TASK_KILLED) or (2) introducing a new status update TASK_REVOKED.
> In order to augment an offer with metadata about revocability, there are options:
> 1) Add a revocable boolean to the Offer and
>    a) offer only one type of Offer per slave at a particular time
>    b) offer both revocable and non-revocable resources at the same time but require frameworks to understand that Offers can contain overlapping resources
> 2) Add a revocable_resources field on the Offer which is a superset of the regular resources field. By consuming > resources <= revocable_resources in a launchTask, the Task becomes a revocable task. If launching a task with < resources, the Task is non-revocable.
> The use cases for revocable tasks are batch tasks (e.g. hadoop/pig/mapreduce) and non-revocable tasks are online higher-SLA tasks (e.g. services.)
> Consider a non-revocable task that asks for 4 cores, 8 GB RAM and 20 GB of disk. One of these resources is a rate (4 cpu seconds per second) and two of them are fixed values (8GB and 20GB respectively, though disk resources can be further broken down into spindles - fixed - and iops - a rate.) In practice, these are the maximum resources in the respective dimensions that this task will use. In reality, we provision tasks at some factor below peak, and only hit peak resource consumption in rare circumstances or perhaps at a diurnal peak.
> In the meantime, we stand to gain from offering some constant factor of the difference between (reserved - actual) of non-revocable tasks as revocable resources, depending upon our tolerance for revocable task churn.
> The main challenge is coming up with an accurate short / medium / long-term prediction of resource consumption based upon current behavior.
> In many cases it would be OK to be sloppy:
> * CPU / iops / network IO are rates (compressible) and can often be OK below guarantees for brief periods of time while task revocation takes place
> * Memory slack can be provided by enabling swap and dynamically setting swap paging boundaries. Should swap ever be activated, that would be a signal to revoke.
> The master / allocator would piggyback on the slave heartbeat mechanism to learn of the amount of revocable resources available at any point in time.
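The "constant factor of the difference between (reserved - actual)" idea from the issue description can be written down as a one-line computation. The sketch below is only an illustration: the `Usage` struct, the `revocable` function, and the clamping at zero are assumptions made for the example and are not taken from Mesos.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-resource view of a non-revocable task.
struct Usage {
  double reserved;  // What the task asked for, e.g. 4 cores.
  double actual;    // What it is currently consuming.
};

// Amount that could be offered as revocable:
//   factor * (reserved - actual), clamped at zero.
// `factor` in [0, 1] stands for the tolerance for revocable task churn:
// a higher factor offers more slack but risks more revocations.
double revocable(const Usage& usage, double factor) {
  assert(factor >= 0.0 && factor <= 1.0);
  return factor * std::max(0.0, usage.reserved - usage.actual);
}
```

For a task that reserved 4 cores and currently uses 1, a factor of 0.5 yields 1.5 revocable cores; a task running above its reservation yields none.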
[jira] [Commented] (MESOS-2044) Use one IP address per container for network isolation
[ https://issues.apache.org/jira/browse/MESOS-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551652#comment-14551652 ] Swapnil Daingade commented on MESOS-2044: - We are trying to support network isolation between different YARN clusters running on Mesos as part of the Apache Myriad project. We tried using OpenVSwitch and Socketplane(Docker). See the design docs here. https://github.com/mesos/myriad/issues/96 https://docs.google.com/document/d/1uV2V0cSTngVfWs-5pYm2b9gOCYF4WSNkyzj2dm3bRnw/pub > Use one IP address per container for network isolation > -- > > Key: MESOS-2044 > URL: https://issues.apache.org/jira/browse/MESOS-2044 > Project: Mesos > Issue Type: Epic >Reporter: Cong Wang > > If there are enough IP addresses, either IPv4 or IPv6, we should use one IP > address per container, instead of the ugly port range based solution. One > problem with this is the IP address management, usually it is managed by a > DHCP server, maybe we need to manage them in mesos master/slave. > Also, maybe use macvlan instead of veth for better isolation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-994) Add an Option os::getenv() to stout
[ https://issues.apache.org/jira/browse/MESOS-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551599#comment-14551599 ] Benjamin Mahler commented on MESOS-994: --- [~tnachen] I saw you reviewed the first one, can you review the rest as well? :) > Add an Option os::getenv() to stout > --- > > Key: MESOS-994 > URL: https://issues.apache.org/jira/browse/MESOS-994 > Project: Mesos > Issue Type: Improvement > Components: stout, technical debt >Reporter: Ian Downes >Assignee: Greg Mann > Labels: newbie > > This would replace the common pattern of: > Option = os::hasenv() ? Option(os::getenv()) : None() -- This message was sent by Atlassian JIRA (v6.3.4#6332)
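The pattern in the issue description collapses a "has it / get it" pair into a single optional-returning call. A minimal sketch of that shape follows, assuming C++17 and using `std::optional` in place of stout's `Option`; the name `getenvOption` is hypothetical and this is not the stout implementation.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>

// Returns the value of the environment variable `key`, or std::nullopt
// when it is not set. An empty value is still a value.
std::optional<std::string> getenvOption(const std::string& key) {
  const char* value = std::getenv(key.c_str());
  if (value == nullptr) {
    return std::nullopt;
  }
  return std::string(value);
}
```

Callers can then write `getenvOption("HOME").value_or("/")` or test the result directly, instead of checking for presence and reading the variable in two separate steps.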
[jira] [Commented] (MESOS-2735) Change the interaction between the slave and the resource estimator from polling to pushing
[ https://issues.apache.org/jira/browse/MESOS-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551569#comment-14551569 ]

Benjamin Hindman commented on MESOS-2735:
-----------------------------------------

There are definitely differences in message queue behavior, one of which is significantly safer than the other.

There are two safety concerns that I can think of, one of which [~jieyu] has addressed here but I'll repeat to be sure I properly understood.
(1) Someone might write a ResourceEstimator that isn't asynchronous, causing the slave to "block" while the resource estimator estimates.
(2) The ResourceEstimator might cause a denial of service attack on the slave.

I understand the concern with (1) but I'm not too anxious about it. Why? It should be trivial to make a wrapper module which forces people to implement the ResourceEstimator to be asynchronous, either using `async` like you suggested or implementing a version of ResourceEstimator which wraps an actor (libprocess process). We'll only need to do this once and then other ResourceEstimator implementations can leverage this stuff.

On the other hand, I don't like the behavior of push because of (2). Fundamentally, if the slave can't keep up with the rate at which the ResourceEstimator is pushing then we could create a denial of service issue with the slave, i.e., it takes a long time to process non-ResourceEstimator messages because its queue is full of just ResourceEstimator messages. I'm more anxious about (2) than (1) because it's harder to find bugs in (2) than in (1): once you fix (1) it stays fixed forever, but any time you update the algorithm you affect the potential to cause (2).

Now, I acknowledge that implementing this as a pull versus push will make the implementation in the ResourceEstimator slightly more complicated, but not really. In particular, it should be trivial to always use a `Queue` to achieve the push semantics in any ResourceEstimator implementation, while still providing the pull semantics externally. Make sense?

Finally, one of the advantages of the pull model is that it's easier to reason about because we don't have "anonymous" lambdas that cause execution in some other random place in the code (i.e., you can easily see in the slave where the future that gets returned from `ResourceEstimator::estimate()` gets handled). In addition, the ResourceEstimator remains "functional" in the sense that it just has to return some value (or a future) from its functions versus invoking some callback that causes something to get run some other place (and in fact, may also block, so isn't it safer for the ResourceEstimator to invoke the callback in its own `async`?).

The invocation of `ResourceEstimator::estimate()` followed by the `.then` is a nice pattern that lets us compose with other things as well, which is harder to do with the lambda style callbacks and why we've avoided it where we've been able (in fact, I'm curious which place in the code you are imitating here?).

> Change the interaction between the slave and the resource estimator from polling to pushing
> -------------------------------------------------------------------------------------------
>
> Key: MESOS-2735
> URL: https://issues.apache.org/jira/browse/MESOS-2735
> Project: Mesos
> Issue Type: Bug
> Reporter: Jie Yu
> Assignee: Jie Yu
> Labels: twitter
>
> This will make the semantics more clear. The resource estimator can control the speed of sending resources estimation to the slave.
> To avoid cyclic dependency, slave will register a callback with the resource estimator and the resource estimator will simply invoke that callback when there's a new estimation ready. The callback will be a defer to the slave's main event queue.
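The suggestion of using a queue internally while exposing pull semantics externally could look roughly like the following. This is a deliberately simplified, single-threaded sketch assuming C++17: the real interface is described as returning a future from `ResourceEstimator::estimate()`, whereas this example substitutes a non-blocking `std::optional` so that it stays self-contained. `Estimate`, `QueueingEstimator`, and `push` are hypothetical names, not Mesos or libprocess APIs.

```cpp
#include <cassert>
#include <deque>
#include <optional>

// Hypothetical estimate payload.
struct Estimate {
  double cpus;
};

class QueueingEstimator {
public:
  // Internal "push" side: the estimation logic calls this whenever a new
  // estimate is ready. Nothing is delivered to the consumer at this point.
  void push(const Estimate& estimate) {
    estimates.push_back(estimate);
  }

  // External "pull" side: the consumer asks for the next estimate when it
  // is ready to handle one, so the consumer controls the rate.
  std::optional<Estimate> estimate() {
    if (estimates.empty()) {
      return std::nullopt;
    }
    Estimate next = estimates.front();
    estimates.pop_front();
    return next;
  }

private:
  std::deque<Estimate> estimates;
};
```

Because the consumer only ever sees one estimate per `estimate()` call, a burst of internal pushes accumulates inside the estimator rather than on the consumer's event queue. A production version would presumably bound the internal queue or keep only the latest estimate; that choice is outside what the comment specifies.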
[jira] [Commented] (MESOS-2652) Update Mesos containerizer to understand revocable cpu resources
[ https://issues.apache.org/jira/browse/MESOS-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551462#comment-14551462 ] Joris Van Remoortere commented on MESOS-2652: - Review for setting core affinity: https://reviews.apache.org/r/34442 Will base the SCHED_OTHER over SCHED_IDLE pre-emption test on this. > Update Mesos containerizer to understand revocable cpu resources > > > Key: MESOS-2652 > URL: https://issues.apache.org/jira/browse/MESOS-2652 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Ian Downes > Labels: twitter > > The CPU isolator needs to properly set limits for revocable and non-revocable > containers. > The proposed strategy is to use a two-way split of the cpu cgroup hierarchy > -- normal (non-revocable) and low priority (revocable) subtrees -- and to use > a biased split of CFS cpu.shares across the subtrees, e.g., a 20:1 split > (TBD). Containers would be present in only one of the subtrees. CFS quotas > will *not* be set on subtree roots, only cpu.shares. Each container would set > CFS quota and shares as done currently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
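To make the effect of a biased cpu.shares split concrete: when sibling cgroups are all busy, CFS divides CPU time between them in proportion to their shares. The helper below is a small arithmetic sketch of that proportion; `shareFraction` is a hypothetical name and the 20:1 ratio is the example value from the description, which is marked TBD there.

```cpp
#include <cassert>

// Fraction of contended CPU time a cgroup receives relative to a single
// sibling, given their cpu.shares weights. Only meaningful when both
// subtrees are runnable; an idle sibling leaves its time available.
double shareFraction(unsigned own, unsigned sibling) {
  assert(own + sibling > 0);
  return static_cast<double>(own) / (own + sibling);
}
```

With a 20:1 split, `shareFraction(1, 20)` gives the revocable subtree about 4.8% of the CPU under full contention, and `shareFraction(20, 1)` gives the normal subtree about 95.2%. Since the description says no CFS quota is set on the subtree roots, the revocable subtree can still use idle cycles when the normal subtree is not busy.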
[jira] [Created] (MESOS-2753) Enforce revocable CPU invariant in Master
Ian Downes created MESOS-2753: - Summary: Enforce revocable CPU invariant in Master Key: MESOS-2753 URL: https://issues.apache.org/jira/browse/MESOS-2753 Project: Mesos Issue Type: Task Components: isolation, master Affects Versions: 0.23.0 Reporter: Ian Downes Current implementation out for [review|https://reviews.apache.org/r/34310] only supports setting the priority of containers with revocable CPU if it's specified in the initial executor info resources. This should be enforced at the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2652) Update Mesos containerizer to understand revocable cpu resources
[ https://issues.apache.org/jira/browse/MESOS-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551381#comment-14551381 ] Ian Downes commented on MESOS-2652: --- Borg does prod and non-prod as coarse prioritization bands but supports different priorities within each. > Update Mesos containerizer to understand revocable cpu resources > > > Key: MESOS-2652 > URL: https://issues.apache.org/jira/browse/MESOS-2652 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Ian Downes > Labels: twitter > > The CPU isolator needs to properly set limits for revocable and non-revocable > containers. > The proposed strategy is to use a two-way split of the cpu cgroup hierarchy > -- normal (non-revocable) and low priority (revocable) subtrees -- and to use > a biased split of CFS cpu.shares across the subtrees, e.g., a 20:1 split > (TBD). Containers would be present in only one of the subtrees. CFS quotas > will *not* be set on subtree roots, only cpu.shares. Each container would set > CFS quota and shares as done currently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2636) Segfault in inline Try getIP(const std::string& hostname, int family)
[ https://issues.apache.org/jira/browse/MESOS-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551337#comment-14551337 ]

Benjamin Mahler commented on MESOS-2636:
----------------------------------------

net::hostname fix was committed:

{noformat}
commit 08e11d372afbb66907130998b485c185687fae34
Author: Chi Zhang
Date:   Tue May 19 15:03:23 2015 -0700

    Removed bad call to freeaddrinfo in net::hostname.

    Review: https://reviews.apache.org/r/34438
{noformat}

> Segfault in inline Try getIP(const std::string& hostname, int family)
> ---------------------------------------------------------------------
>
> Key: MESOS-2636
> URL: https://issues.apache.org/jira/browse/MESOS-2636
> Project: Mesos
> Issue Type: Bug
> Reporter: Chi Zhang
> Assignee: Chi Zhang
> Labels: twitter
> Fix For: 0.23.0
>
> We saw a segfault in production. Attaching the coredump, we see:
> Core was generated by `/usr/local/sbin/mesos-slave --port=5051 --resources=cpus:23;mem:70298;ports:[31'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x7f639867c77e in free () from /lib64/libc.so.6
> (gdb) bt
> #0 0x7f639867c77e in free () from /lib64/libc.so.6
> #1 0x7f63986c25d0 in freeaddrinfo () from /lib64/libc.so.6
> #2 0x7f6399deeafa in net::getIP (hostname="", family=2) at ./3rdparty/stout/include/stout/net.hpp:201
> #3 0x7f6399e1f273 in process::initialize (delegate=Unhandled dwarf expression opcode 0xf3
> ) at src/process.cpp:837
> #4 0x0042342f in main ()
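The backtrace shows the crash inside `freeaddrinfo()`, and the commit message says a bad call to it was removed. The exact faulty code is not shown in this thread, so the following is only a generic sketch of a defensive `getaddrinfo()` / `freeaddrinfo()` pairing on POSIX, assuming C++17; `resolveIPv4` is a hypothetical helper and not the stout `net::getIP` implementation.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cassert>
#include <optional>
#include <string>

// Resolves `hostname` to a textual IPv4 address, or std::nullopt on failure.
std::optional<std::string> resolveIPv4(const std::string& hostname) {
  struct addrinfo hints = {};
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_STREAM;

  // Initialized so that it never holds an indeterminate pointer.
  struct addrinfo* result = nullptr;

  int error = getaddrinfo(hostname.c_str(), nullptr, &hints, &result);
  if (error != 0 || result == nullptr) {
    // On failure `result` does not point at a list owned by the caller,
    // so freeaddrinfo() must not be called here.
    return std::nullopt;
  }

  char buffer[INET_ADDRSTRLEN];
  const auto* address =
      reinterpret_cast<const struct sockaddr_in*>(result->ai_addr);
  const char* text =
      inet_ntop(AF_INET, &address->sin_addr, buffer, sizeof(buffer));

  std::optional<std::string> ip;
  if (text != nullptr) {
    ip = std::string(text);
  }

  // Freed exactly once, and only after a successful lookup.
  freeaddrinfo(result);
  return ip;
}
```

The two points the sketch tries to show are that the result pointer starts out as `nullptr` and that `freeaddrinfo()` is reached only on the success path, after the last use of the list.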
[jira] [Commented] (MESOS-2044) Use one IP address per container for network isolation
[ https://issues.apache.org/jira/browse/MESOS-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551326#comment-14551326 ] Ian Downes commented on MESOS-2044: --- This JIRA is intended to address a single IP per container which is shared by the executor and all tasks within the container and is different to the host's. That's a very valid requirement though so please raise a separate ticket. > Use one IP address per container for network isolation > -- > > Key: MESOS-2044 > URL: https://issues.apache.org/jira/browse/MESOS-2044 > Project: Mesos > Issue Type: Epic >Reporter: Cong Wang > > If there are enough IP addresses, either IPv4 or IPv6, we should use one IP > address per container, instead of the ugly port range based solution. One > problem with this is the IP address management, usually it is managed by a > DHCP server, maybe we need to manage them in mesos master/slave. > Also, maybe use macvlan instead of veth for better isolation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2752) Add HTB queueing discipline wrapper class
Paul Brett created MESOS-2752: - Summary: Add HTB queueing discipline wrapper class Key: MESOS-2752 URL: https://issues.apache.org/jira/browse/MESOS-2752 Project: Mesos Issue Type: Bug Reporter: Paul Brett Assignee: Paul Brett Network isolator uses a Hierarchical Token Bucket (HTB) traffic control discipline on the egress filter inside each container as the root for adding traffic filters. A HTB wrapper is needed to access the network statistics for this interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2636) Segfault in inline Try getIP(const std::string& hostname, int family)
[ https://issues.apache.org/jira/browse/MESOS-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551289#comment-14551289 ]

Chi Zhang commented on MESOS-2636:
----------------------------------

https://reviews.apache.org/r/34438/

I did a round of grepping: getIP and hostname are the only 2 places that use freeaddrinfo.

> Segfault in inline Try getIP(const std::string& hostname, int family)
> ---------------------------------------------------------------------
>
> Key: MESOS-2636
> URL: https://issues.apache.org/jira/browse/MESOS-2636
> Project: Mesos
> Issue Type: Bug
> Reporter: Chi Zhang
> Assignee: Chi Zhang
> Labels: twitter
> Fix For: 0.23.0
>
> We saw a segfault in production. Attaching the coredump, we see:
> Core was generated by `/usr/local/sbin/mesos-slave --port=5051 --resources=cpus:23;mem:70298;ports:[31'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x7f639867c77e in free () from /lib64/libc.so.6
> (gdb) bt
> #0 0x7f639867c77e in free () from /lib64/libc.so.6
> #1 0x7f63986c25d0 in freeaddrinfo () from /lib64/libc.so.6
> #2 0x7f6399deeafa in net::getIP (hostname="", family=2) at ./3rdparty/stout/include/stout/net.hpp:201
> #3 0x7f6399e1f273 in process::initialize (delegate=Unhandled dwarf expression opcode 0xf3
> ) at src/process.cpp:837
> #4 0x0042342f in main ()
[jira] [Commented] (MESOS-2665) Fix queuing discipline wrapper in linux/routing/queueing
[ https://issues.apache.org/jira/browse/MESOS-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551270#comment-14551270 ] Paul Brett commented on MESOS-2665: --- Added: https://reviews.apache.org/r/34426/ > Fix queuing discipline wrapper in linux/routing/queueing > - > > Key: MESOS-2665 > URL: https://issues.apache.org/jira/browse/MESOS-2665 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Paul Brett >Assignee: Paul Brett >Priority: Critical > > qdisc search function is dependent on matching a single hard coded handle and > does not correctly test for interface, making the implementation fragile. > Additionally, the current setup scripts (using dynamically created shell > commands) do not match the hard coded handles. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551142#comment-14551142 ]

Ian Downes edited comment on MESOS-2254 at 5/19/15 8:18 PM:
------------------------------------------------------------

I presume he's referring to the slave flag {{--resource_monitoring_interval}} which currently defaults to {{RESOURCE_MONITORING_INTERVAL = Seconds(1)}} but which [~nnielsen] has marked as soon to be deprecated.

{noformat}
// TODO(nnielsen): Deprecate resource_monitoring_interval flag after
// Mesos 0.23.0.
Duration resource_monitoring_interval;
{noformat}

In the meantime, if this is causing performance issues then you could set {{--resource_monitoring_interval}} to something longer than the default.

was (Author: idownes):
I presume he's referring to the slave flag {{--resource_monitoring_interval}} which currently defaults to {{RESOURCE_MONITORING_INTERVAL = Seconds(1)}} but which [~nnielsen] has marked as soon to be deprecated.

{noformat}
// TODO(nnielsen): Deprecate resource_monitoring_interval flag after
// Mesos 0.23.0.
Duration resource_monitoring_interval;
{noformat}

In the meantime, if this is causing performance issues then you could set {{--resource_monitoring_internal}} to something longer than the default.

> Posix CPU isolator usage call introduce high cpu load
> -----------------------------------------------------
>
> Key: MESOS-2254
> URL: https://issues.apache.org/jira/browse/MESOS-2254
> Project: Mesos
> Issue Type: Bug
> Reporter: Niklas Quarfot Nielsen
>
> With more than 20 executors running on a slave with the posix isolator, we have seen a very high cpu load (over 200%).
> From profiling one thread (there were two, taking up all the cpu time.
The > total CPU time was over 200%): > {code} > Running Time SelfSymbol Name > 27133.0ms 47.8% 0.0 _pthread_body 0x1adb50 > 27133.0ms 47.8% 0.0 thread_start > 27133.0ms 47.8% 0.0 _pthread_start > 27133.0ms 47.8% 0.0_pthread_body > 27133.0ms 47.8% 0.0 process::schedule(void*) > 27133.0ms 47.8% 2.0 > process::ProcessManager::resume(process::ProcessBase*) > 27126.0ms 47.8% 1.0 > process::ProcessBase::serve(process::Event const&) > 27125.0ms 47.8% 0.0 > process::DispatchEvent::visit(process::EventVisitor*) const > 27125.0ms 47.8% 0.0 > process::ProcessBase::visit(process::DispatchEvent const&) > 27125.0ms 47.8% 0.0 std::__1::function (process::ProcessBase*)>::operator()(process::ProcessBase*) const > 27124.0ms 47.8% 0.0 > std::__1::__function::__func > process::dispatch mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, > mesos::ContainerID>(process::PID > const&, process::Future > (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), > mesos::ContainerID)::'lambda'(process::ProcessBase*), > std::__1::allocator > process::dispatch mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, > mesos::ContainerID>(process::PID > const&, process::Future > (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), > mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void > (process::ProcessBase*)>::operator()(process::ProcessBase*&&) > 27124.0ms 47.8% 1.0 > process::Future > process::dispatch mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, > mesos::ContainerID>(process::PID > const&, process::Future > (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), > mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) > const > 27060.0ms 47.7% 1.0 > mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID > const&) > 27046.0ms 47.7% 2.0 > mesos::internal::usage(int, bool, bool) > 27023.0ms 47.6% 2.0 os::pstree(Option) > 26748.0ms 
47.1% 23.0 os::processes() > 24809.0ms 43.7% 349.0 os::process(int) > 8199.0ms 14.4% 47.0 os::sysctl::string() > const > 7562.0ms 13.3% 7562.0__sysctl > {code} > We could see that usage() in usage/usage.cpp is causing this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551142#comment-14551142 ] Ian Downes commented on MESOS-2254: --- I presume he's referring to the slave flag {{--resource_monitoring_interval}}, which currently defaults to {{RESOURCE_MONITORING_INTERVAL = Seconds(1)}} but which [~nnielsen] has marked as soon to be deprecated.
{noformat}
// TODO(nnielsen): Deprecate resource_monitoring_interval flag after
// Mesos 0.23.0.
Duration resource_monitoring_interval;
{noformat}
In the meantime, if this is causing performance issues then you could set {{--resource_monitoring_interval}} to something longer than the default.
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551127#comment-14551127 ] Daniel Nugent commented on MESOS-2254: -- [~idownes] In that case, do you know what the rate limiting is that [~nnielsen] referred to?
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551118#comment-14551118 ] Ian Downes commented on MESOS-2254: --- [~nugend] No, --perf_interval is just for the perf isolator which uses a perf_event cgroup to efficiently run perf against a container. Unrelated to this.
[jira] [Commented] (MESOS-354) Oversubscribe resources
[ https://issues.apache.org/jira/browse/MESOS-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551042#comment-14551042 ] Vinod Kone commented on MESOS-354: --
This is the high-level idea of how the different components (described in the design doc) interact for oversubscription for the MVP.
--> Resource estimator sends an estimate of 'oversubscribable' resources to the slave.
--> Slave periodically checks if its cached value of 'revocable resources' (i.e., allocations of revocable containers + oversubscribable resources) has changed. If changed, slave forwards 'revocable resources' to the master.
--> Master rescinds outstanding revocable offers when it gets a new 'revocable resources' estimate and updates the allocator.
--> On receiving a 'revocable resources' update, allocator updates 'revocable_available' (revocable resources - revocable allocation) resources.
--> 'revocable_available' gets allocated to (and recovered from) frameworks in the same way as 'available' (regular resources).
--> When sending offers, master sends separate offers for revocable and regular resources.
Some salient features of this proposal:
--> Allocator changes are minimal.
--> Slave forwards estimates only when there is a change => low load on master.
--> Split offers allow master to rescind only revocable resources when necessary.
Thoughts?
> Oversubscribe resources > --- > > Key: MESOS-354 > URL: https://issues.apache.org/jira/browse/MESOS-354 > Project: Mesos > Issue Type: Epic > Components: isolation, master, slave >Reporter: brian wickman >Priority: Minor > Labels: mesosphere, twitter > Attachments: mesos_virtual_offers.pdf > > > This proposal is predicated upon offer revocation. > The idea would be to add a new "revoked" status either by (1) piggybacking > off an existing status update (TASK_LOST or TASK_KILLED) or (2) introducing a > new status update TASK_REVOKED. 
> In order to augment an offer with metadata about revocability, there are > options: > 1) Add a revocable boolean to the Offer and > a) offer only one type of Offer per slave at a particular time > b) offer both revocable and non-revocable resources at the same time but > require frameworks to understand that Offers can contain overlapping resources > 2) Add a revocable_resources field on the Offer which is a superset of the > regular resources field. By consuming > resources <= revocable_resources in > a launchTask, the Task becomes a revocable task. If launching a task with < > resources, the Task is non-revocable. > The use cases for revocable tasks are batch tasks (e.g. hadoop/pig/mapreduce) > and non-revocable tasks are online higher-SLA tasks (e.g. services.) > Consider a non-revocable task that asks for 4 cores, 8 GB RAM and 20 GB of disk. > One of these resources is a rate (4 cpu seconds per second) and two of them > are fixed values (8GB and 20GB respectively, though disk resources can be > further broken down into spindles - fixed - and iops - a rate.) In practice, > these are the maximum resources in the respective dimensions that this task > will use. In reality, we provision tasks at some factor below peak, and only > hit peak resource consumption in rare circumstances or perhaps at a diurnal > peak. > In the meantime, we stand to gain from offering some constant factor of > the difference between (reserved - actual) of non-revocable tasks as > revocable resources, depending upon our tolerance for revocable task churn. > The main challenge is coming up with an accurate short / medium / long-term > prediction of resource consumption based upon current behavior. > In many cases it would be OK to be sloppy: > * CPU / iops / network IO are rates (compressible) and can often be OK > below guarantees for brief periods of time while task revocation takes place > * Memory slack can be provided by enabling swap and dynamically setting > swap paging boundaries. 
Should swap ever be activated, that would be a > signal to revoke. > The master / allocator would piggyback on the slave heartbeat mechanism to > learn of the amount of revocable resources available at any point in time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
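The bookkeeping in the MESOS-354 comment above (the slave forwards 'revocable resources' only when its cached value changes, and the allocator derives 'revocable_available' as revocable resources minus revocable allocation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Cpus` scalar, `SlaveRevocableCache`, and `revocableAvailable` are hypothetical names, not Mesos types or APIs, and clamping at zero is an added choice that the comment does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical scalar stand-in for Mesos resources (CPUs only); the real
// code would operate on a much richer resources type.
using Cpus = double;

// Slave side: caches the last 'revocable resources' total and reports
// whether a fresh total differs, so that only changes reach the master.
class SlaveRevocableCache {
public:
  // Returns true when the new total should be forwarded to the master.
  bool update(Cpus revocableAllocated, Cpus oversubscribable) {
    assert(revocableAllocated >= 0.0 && oversubscribable >= 0.0);
    const Cpus total = revocableAllocated + oversubscribable;
    if (initialized_ && std::fabs(total - cached_) < 1e-9) {
      return false;  // Unchanged: nothing to forward.
    }
    cached_ = total;
    initialized_ = true;
    return true;
  }

  Cpus cached() const { return cached_; }

private:
  Cpus cached_ = 0.0;
  bool initialized_ = false;
};

// Allocator side:
//   revocable_available = revocable resources - revocable allocation.
// Clamping at zero is an assumption of this sketch.
inline Cpus revocableAvailable(Cpus revocableResources,
                               Cpus revocableAllocation) {
  return std::max(0.0, revocableResources - revocableAllocation);
}
```

The first call to `update` always reports a change, so the master learns the initial value; later calls report a change only when the total moves, which is what keeps the load on the master low.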
[jira] [Created] (MESOS-2751) Stopping the scheduler driver w/o failover requires a sleep to ensure the UnregisterFrameworkMessage is delivered.
Benjamin Mahler created MESOS-2751: -- Summary: Stopping the scheduler driver w/o failover requires a sleep to ensure the UnregisterFrameworkMessage is delivered. Key: MESOS-2751 URL: https://issues.apache.org/jira/browse/MESOS-2751 Project: Mesos Issue Type: Bug Components: framework Reporter: Benjamin Mahler Priority: Minor When the call to {{driver.stop(false)}} completes, the UnregisterFrameworkMessage will be sent asynchronously once the SchedulerProcess processes the dispatch event. This requires schedulers to sleep to ensure the message is processed: http://markmail.org/thread/yuzq5i3hkpttxc2s We could block on a Future result from the dispatch, if safe. But this still doesn't ensure the message is flushed out of libprocess. And without acknowledgements, we don't know if the master has successfully unregistered the framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2750) Extend queueing discipline wrappers to expose network isolator statistics
[ https://issues.apache.org/jira/browse/MESOS-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Brett updated MESOS-2750: -- Summary: Extend queueing discipline wrappers to expose network isolator statistics (was: Extend qeueing discipline wrappers to expose network isolator statistics) > Extend queueing discipline wrappers to expose network isolator statistics > - > > Key: MESOS-2750 > URL: https://issues.apache.org/jira/browse/MESOS-2750 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Paul Brett >Assignee: Paul Brett > > Export Traffic Control statistics in queueing library to enable reporting out > impact of network bandwidth statistics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2750) Extend qeueing discipline wrappers to expose network isolator statistics
Paul Brett created MESOS-2750: - Summary: Extend qeueing discipline wrappers to expose network isolator statistics Key: MESOS-2750 URL: https://issues.apache.org/jira/browse/MESOS-2750 Project: Mesos Issue Type: Bug Components: isolation Reporter: Paul Brett Assignee: Paul Brett Export Traffic Control statistics in queueing library to enable reporting out impact of network bandwidth statistics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2717) Qemu/KVM containerizer
[ https://issues.apache.org/jira/browse/MESOS-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550964#comment-14550964 ] Ian Downes commented on MESOS-2717: --- The containerizer interface was designed to support this and I'd be happy to shepherd any efforts. Some initial notes: 1) I agree, bridged networking would be simplest. 2) This could be done by the custom executor. Work is being started on making IP addresses a global resource. 3) The fetcher should be used. Patches for caching objects will soon be committed. 4) You could just run the VM inside cgroups/namespaces etc. and leverage the existing code for managing them. 5) Be aware that you need to architect the code so the slave can be restarted while VMs/containers are running, i.e., you'll need to re-establish said connections during slave recovery. > Qemu/KVM containerizer > -- > > Key: MESOS-2717 > URL: https://issues.apache.org/jira/browse/MESOS-2717 > Project: Mesos > Issue Type: Wish > Components: containerization >Reporter: Pierre-Yves Ritschard > > I think it would make sense for Mesos to have the ability to treat > hypervisors as containerizers and the most sensible one to start with would > probably be Qemu/KVM. > There are a few workloads that can require full-fledged VMs (the most obvious > one being Windows workloads). > The containerization code is well decoupled and seems simple enough, I can > definitely take a shot at it. VMs do bring some questions with them here is > my take on them: > 1. Routing, network strategy > == > The simplest approach here might very well be to go for bridged networks > and leave the setup and inter slave routing up to the administrator > 2. IP Address assignment > > At first, it can be up to the Frameworks to deal with IP assignment. 
> The simplest way to address this could be to have an executor running > on slaves providing the qemu/kvm containerizer which would instrument a DHCP > server and collect IP + Mac address resources from slaves. While it may be up > to the frameworks to provide this, an example should most likely be provided. > 3. VM Templates > == > VM templates should probably leverage the fetcher and could thus be copied > locally or fetched from HTTP(s) / HDFS. > 4. Resource limiting > > Mapping resource constraints to the qemu command line is probably the easiest > part. Additional command line should also be fetchable. For Unix VMs, the > sandbox could show the output of the serial console > 5. Libvirt / plain Qemu > = > I tend to favor limiting the number of necessary hoops to jump through and > would thus investigate working directly with Qemu, maintaining an open > connection to the monitor to assert status. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2340) Publish JSON in ZK instead of serialized MasterInfo
[ https://issues.apache.org/jira/browse/MESOS-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550958#comment-14550958 ] Marco Massenzio edited comment on MESOS-2340 at 5/19/15 6:30 PM: - So, this is the simplest code I could come up with (this needs refining, obviously!), in {{src/zookeeper/group.cpp}}:
{code}
// if label is not None, this is the MasterInfo being serialized
if (label.isSome()) {
  // TODO: how do we serialize MasterInfo to JSON? we only have the
  // raw serialized data here
  string json = "{\"value\": \"foobar\"}";
  string loc = result + ".json";
  string jsonResult;
  zk->create(
      loc,
      json,
      acl,
      ZOO_EPHEMERAL,
      &jsonResult);
  LOG(INFO) << "Added JSON data to " << jsonResult;
}
{code}
If I now start the Master, I can see both nodes in the {{/json/test}} folder.
{noformat}
$ ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/json/test --work_dir=/tmp/mesos --quorum=1
{noformat}
{noformat}
[zk: localhost:2181(CONNECTED) 8] ls /json/test
[log_replicas, info_10, info_10.json]
[zk: localhost:2181(CONNECTED) 9] get /json/test/info_10.json
{"value": "foobar"}
cZxid = 0xe6
ctime = Tue May 19 11:24:55 PDT 2015
mZxid = 0xe6
mtime = Tue May 19 11:24:55 PDT 2015
pZxid = 0xe6
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x14d496680460057
dataLength = 19
numChildren = 0
{noformat}
and there's no .json in {{log_replicas}}. I would like to get suggestions as to where to "inject" the JSON, as in the Group class we only get the serialized string, not the {{MasterInfo}} PB. There are obviously ways around this, but I'd like to come up with an extensible way.
was (Author: marco-mesos): so, this is the simplest code I could came up with (this needs refining, obviously!) (in {{src/zookeeper/group.cpp}}: {code} // if label is not None, this is the MasterInfo being serialized if (label.isSome()) { // TODO: how do we serialize MasterInfo to JSON? we only have the // raw serialized data here string json = "{\"value\": \"foobar\"}"; string loc = result + ".json"; string jsonResult; zk->create( loc, json, acl, ZOO_EPHEMERAL, &jsonResult); LOG(INFO) << "Added JSON data to " << jsonResult; } {info} If I now start the Master, I can see both nodes in the {{/test/json}} folder. {noformat} $ ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/json/test --work_dir=/tmp/mesos --quorum=1 {noformat} {noformat} [zk: localhost:2181(CONNECTED) 8] ls /json/test [log_replicas, info_10, info_10.json] [zk: localhost:2181(CONNECTED) 9] get /json/test/info_10.json {"value": "foobar"} cZxid = 0xe6 ctime = Tue May 19 11:24:55 PDT 2015 mZxid = 0xe6 mtime = Tue May 19 11:24:55 PDT 2015 pZxid = 0xe6 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x14d496680460057 dataLength = 19 numChildren = 0 {noformat} and there's no .json in {{log_replicas}}. I would like to get suggestions as to where to "inject" the JSON, as in the Group class, we only get the serialized String, not the {{MasterInfo}} PB. There are obviously way around this, but I'd like to come up with an extensible way.
> Publish JSON in ZK instead of serialized MasterInfo > --- > > Key: MESOS-2340 > URL: https://issues.apache.org/jira/browse/MESOS-2340 > Project: Mesos > Issue Type: Improvement >Reporter: Zameer Manji >Assignee: haosdent > > Currently to discover the master a client needs the ZK node location and > access to the MasterInfo protobuf so it can deserialize the binary blob in > the node. > I think it would be nice to publish JSON (like Twitter's ServerSets) so > clients are not tied to protobuf to do service discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
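The open question in the MESOS-2340 comment above is how to turn a MasterInfo into JSON rather than writing a placeholder string. A minimal sketch of one way to do it, assuming the structured object is available at the call site: `MasterInfoSketch` is a hypothetical plain struct standing in for the protobuf, its four fields are an assumption about which values a client needs for discovery, and `escapeJson` and `toJson` are illustrative helpers rather than Mesos or stout functions.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for the MasterInfo protobuf; the field set is an
// assumption made for this sketch.
struct MasterInfoSketch {
  std::string id;
  std::uint32_t ip;
  std::uint32_t port;
  std::string hostname;
};

// Escapes the characters that must not appear raw inside a JSON string:
// double quotes, backslashes, and control characters below 0x20.
std::string escapeJson(const std::string& s) {
  static const char* hex = "0123456789abcdef";
  std::string out;
  for (unsigned char c : s) {
    if (c == '"') {
      out += "\\\"";
    } else if (c == '\\') {
      out += "\\\\";
    } else if (c < 0x20) {
      out += "\\u00";
      out += hex[c >> 4];
      out += hex[c & 0x0F];
    } else {
      out += static_cast<char>(c);
    }
  }
  return out;
}

// Renders the struct as a compact JSON object with a fixed key order.
std::string toJson(const MasterInfoSketch& info) {
  std::ostringstream out;
  out << "{\"id\":\"" << escapeJson(info.id) << "\","
      << "\"ip\":" << info.ip << ","
      << "\"port\":" << info.port << ","
      << "\"hostname\":\"" << escapeJson(info.hostname) << "\"}";
  return out.str();
}
```

A string produced this way could replace the `"{\"value\": \"foobar\"}"` placeholder. Where the structured object should be made available is exactly the design question the comment raises, and this sketch does not answer it.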
[jira] [Commented] (MESOS-2340) Publish JSON in ZK instead of serialized MasterInfo
[ https://issues.apache.org/jira/browse/MESOS-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550958#comment-14550958 ] Marco Massenzio commented on MESOS-2340: So, this is the simplest code I could come up with (this needs refining, obviously!), in {{src/zookeeper/group.cpp}}:
{code}
// if label is not None, this is the MasterInfo being serialized
if (label.isSome()) {
  // TODO: how do we serialize MasterInfo to JSON? we only have the
  // raw serialized data here
  string json = "{\"value\": \"foobar\"}";
  string loc = result + ".json";
  string jsonResult;
  zk->create(
      loc,
      json,
      acl,
      ZOO_EPHEMERAL,
      &jsonResult);
  LOG(INFO) << "Added JSON data to " << jsonResult;
}
{code}
If I now start the Master, I can see both nodes in the {{/json/test}} folder.
{noformat}
$ ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/json/test --work_dir=/tmp/mesos --quorum=1
{noformat}
{noformat}
[zk: localhost:2181(CONNECTED) 8] ls /json/test
[log_replicas, info_10, info_10.json]
[zk: localhost:2181(CONNECTED) 9] get /json/test/info_10.json
{"value": "foobar"}
cZxid = 0xe6
ctime = Tue May 19 11:24:55 PDT 2015
mZxid = 0xe6
mtime = Tue May 19 11:24:55 PDT 2015
pZxid = 0xe6
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x14d496680460057
dataLength = 19
numChildren = 0
{noformat}
and there's no .json in {{log_replicas}}. I would like to get suggestions as to where to "inject" the JSON, as in the Group class we only get the serialized string, not the {{MasterInfo}} PB. There are obviously ways around this, but I'd like to come up with an extensible way.
[jira] [Commented] (MESOS-2636) Segfault in inline Try getIP(const std::string& hostname, int family)
[ https://issues.apache.org/jira/browse/MESOS-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550913#comment-14550913 ] Chi Zhang commented on MESOS-2636: -- [~cnstar9988] thanks for reporting! I will submit a fix. > Segfault in inline Try getIP(const std::string& hostname, int family) > - > > Key: MESOS-2636 > URL: https://issues.apache.org/jira/browse/MESOS-2636 > Project: Mesos > Issue Type: Bug >Reporter: Chi Zhang >Assignee: Chi Zhang > Labels: twitter > Fix For: 0.23.0 > > > We saw a segfault in production. Attaching the coredump, we see: > Core was generated by `/usr/local/sbin/mesos-slave --port=5051 > --resources=cpus:23;mem:70298;ports:[31'. > Program terminated with signal 11, Segmentation fault. > #0 0x7f639867c77e in free () from /lib64/libc.so.6 > (gdb) bt > #0 0x7f639867c77e in free () from /lib64/libc.so.6 > #1 0x7f63986c25d0 in freeaddrinfo () from /lib64/libc.so.6 > #2 0x7f6399deeafa in net::getIP (hostname="", family=2) at > ./3rdparty/stout/include/stout/net.hpp:201 > #3 0x7f6399e1f273 in process::initialize (delegate=Unhandled dwarf > expression opcode 0xf3 > ) at src/process.cpp:837 > #4 0x0042342f in main () -- This message was sent by Atlassian JIRA (v6.3.4#6332)
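The MESOS-2636 backtrace above shows the crash inside `freeaddrinfo()` called from `net::getIP` with an empty hostname. The ticket does not state the root cause, so the following is only a sketch of a defensive pattern for this kind of lookup, under the assumption that a result pointer was freed on a failure path without holding a valid list. `resolveIPv4` is a hypothetical helper, not the stout implementation.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cstring>
#include <string>

// Hypothetical helper: resolves 'hostname' to a dotted-quad IPv4 string.
// Returns false on any failure and leaves '*ip' untouched.
bool resolveIPv4(const std::string& hostname, std::string* ip) {
  struct addrinfo hints;
  std::memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET;

  // Initialize to NULL so that no error path can ever free garbage.
  struct addrinfo* result = NULL;

  const int error = getaddrinfo(hostname.c_str(), NULL, &hints, &result);
  if (error != 0 || result == NULL) {
    if (result != NULL) {
      freeaddrinfo(result);
    }
    return false;
  }

  // Convert the first returned address to text before releasing the list.
  char buffer[INET_ADDRSTRLEN];
  const struct sockaddr_in* address =
      reinterpret_cast<const struct sockaddr_in*>(result->ai_addr);
  const char* text =
      inet_ntop(AF_INET, &address->sin_addr, buffer, sizeof(buffer));

  freeaddrinfo(result);

  if (text == NULL) {
    return false;
  }
  *ip = text;
  return true;
}
```

The points being illustrated are that the result pointer starts out as NULL, that it is freed exactly once on every path, and that the address is copied out before the list is released.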
[jira] [Assigned] (MESOS-2748) /help generated links point to wrong URLs
[ https://issues.apache.org/jira/browse/MESOS-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent reassigned MESOS-2748: --- Assignee: haosdent > /help generated links point to wrong URLs > - > > Key: MESOS-2748 > URL: https://issues.apache.org/jira/browse/MESOS-2748 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.22.1 >Reporter: Marco Massenzio >Assignee: haosdent >Priority: Minor > > As reported by Michael Lunøe (see also MESOS-329 and > MESOS-913 for background): > {quote} > In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, > which is then converted to html through a javascript library > All endpoints point to {{/help/...}}, they need to work dynamically for > reverse proxy to do its thing. {{/mesos/help}} works, and displays the > endpoints, but they each need to go to their respective {{/help/...}} > endpoint. > Note that this needs to work both for master, and for slaves. I think the > route to slaves help is something like this: > {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please > double check this. > {quote} > The fix appears to be not too complex (as it would require to simply > manipulate the generated URL) but a quick skim of the code would suggest that > something more substantial may be desirable too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2748) /help generated links point to wrong URLs
[ https://issues.apache.org/jira/browse/MESOS-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550877#comment-14550877 ] haosdent commented on MESOS-2748: - Yes, thank you for the explanation. Let me try to fix it.
[jira] [Commented] (MESOS-2340) Publish JSON in ZK instead of serialized MasterInfo
[ https://issues.apache.org/jira/browse/MESOS-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550857#comment-14550857 ] haosdent commented on MESOS-2340: - I think creating the protobuf node and the JSON node at the same time may be clearer, and ZooKeeper has a multi-operation API: https://github.com/apache/zookeeper/blob/trunk/src/c/tests/TestMulti.cc#L282-L284 But it is also OK with me if you decide to add a separate process to watch the node. I will try to implement this once you have reached a conclusion. :-)
[jira] [Commented] (MESOS-2735) Change the interaction between the slave and the resource estimator from polling to pushing
[ https://issues.apache.org/jira/browse/MESOS-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550838#comment-14550838 ] Jie Yu commented on MESOS-2735: --- {quote} One of the advantages that we had discussed in the past was that the pull model enables us to move as fast as we possibly can, rather than just getting a bunch of messages queued up in the slave that we have to process. {quote} I don't think there is a difference in terms of queueing messages. The pull model also queues messages in the slave (e.g., 'estimator->oversubscribed().then(defer(...))' also queues messages in slave's queue). {quote} Even if we want to collect more fine-grained resource estimations a ResourceEstimator could do this and store this information until future polls. {quote} I think there's no fundamental difference between the pull and the push model. The are only two subtle differences between the two: 1) the push model makes less assumptions about the slave behavior. 2) the push model is safer in the face of bad behaved resource estimator. Let me elaborate both of them below: Regarding (1), let's use an example. Say we want to write a resource estimator which sends constant number of cpus (say 2 cpus) every 10 seconds. If we use a push model, we could just follow the [NoopResourceEstimatorProcess|https://github.com/apache/mesos/blob/master/src/slave/resource_estimator.cpp#L52] implementation in the code. Basically, we fork a libprocess and invoke the registered callback every 10 seconds with 2 cpus. Now, if we use a pull model, we first need to make an assumption that the slave pull the resource estimator as fast as it can without any delay. If there's a delay say 1 second, the resource estimator needs to adjust its internal delay to be 9 seconds so that the total interval between two estimations is 10 seconds apart. 
When implementing the `Future oversubscribed()` interface, the module writer needs to make another assumption about the slave: that the slave will not invoke the interface again while the previous estimation is still pending. This is important because otherwise the module writer needs to maintain a list of Promises (instead of just one). I just feel that there are too many implicit assumptions that the module writer needs to make in a pull model. Regarding (2), as I already stated in this ticket, since the slave invokes the interface ('oversubscribed()') in its own context, the module writer needs to make sure the implementation of the interface does not block, otherwise the slave will hang. An alternative is to use 'async' when invoking the interface in the slave. I just feel this is unnecessary if we use a push model. > Change the interaction between the slave and the resource estimator from > polling to pushing > > > Key: MESOS-2735 > URL: https://issues.apache.org/jira/browse/MESOS-2735 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Jie Yu > Labels: twitter > > This will make the semantics more clear. The resource estimator can control > the speed of sending resources estimation to the slave. > To avoid cyclic dependency, slave will register a callback with the resource > estimator and the resource estimator will simply invoke that callback when > there's a new estimation ready. The callback will be a defer to the slave's > main event queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
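The push model described in this ticket (the slave registers a callback, the estimator invokes it, and the callback only defers work onto the slave's event queue) can be sketched roughly as follows. This is a minimal illustration in plain C++, not libprocess and not the actual Mesos interface: `EventQueue`, `Slave`, `ConstantEstimator`, `Estimate`, and `tick()` are all hypothetical names, and `tick()` stands in for the estimator's own 10-second timer.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical estimate type: the number of cpus reported by the estimator.
struct Estimate { double cpus; };

// Stand-in for the slave's single-threaded event queue (an assumption;
// in Mesos this role is played by the libprocess process queue).
class EventQueue {
public:
  void enqueue(std::function<void()> event) {
    events_.push_back(std::move(event));
  }

  // Runs queued events in FIFO order, including events enqueued meanwhile.
  void drain() {
    while (!events_.empty()) {
      std::function<void()> event = std::move(events_.front());
      events_.pop_front();
      event();
    }
  }

private:
  std::deque<std::function<void()>> events_;
};

class Slave {
public:
  explicit Slave(EventQueue& queue) : queue_(queue) {}

  // The callback handed to the estimator. Invoking it never runs slave
  // logic directly; it only enqueues the update (the "defer" in the ticket).
  std::function<void(const Estimate&)> callback() {
    return [this](const Estimate& estimate) {
      queue_.enqueue([this, estimate] { received_.push_back(estimate); });
    };
  }

  const std::vector<Estimate>& received() const { return received_; }

private:
  EventQueue& queue_;
  std::vector<Estimate> received_;
};

// Push-model estimator: it decides when to report, the slave never polls.
class ConstantEstimator {
public:
  explicit ConstantEstimator(Estimate estimate) : estimate_(estimate) {}

  void initialize(std::function<void(const Estimate&)> callback) {
    callback_ = std::move(callback);
  }

  // Would be driven by the estimator's own timer (every 10 seconds in
  // the example from the comment above).
  void tick() {
    if (callback_) callback_(estimate_);
  }

private:
  Estimate estimate_;
  std::function<void(const Estimate&)> callback_;
};
```

In this sketch the estimator needs no knowledge of how often the slave would otherwise poll, and it keeps no list of pending promises; the slave, in turn, only sees estimates once it drains its own queue.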
[jira] [Commented] (MESOS-2741) Exposing Resources along with ResourceStatistics from resource monitor
[ https://issues.apache.org/jira/browse/MESOS-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550825#comment-14550825 ] haosdent commented on MESOS-2741: - And to implement this issue, how about changing the interface in Containerizer from {code} virtual process::Future usage( const ContainerID& containerId) = 0; {code} to {code} virtual process::Future usage( const ContainerID& containerId) = 0; {code} > Exposing Resources along with ResourceStatistics from resource monitor > -- > > Key: MESOS-2741 > URL: https://issues.apache.org/jira/browse/MESOS-2741 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu > Labels: mesosphere, twitter > > Right now, the resource monitor returns a Usage which contains ContainerId, > ExecutorInfo and ResourceStatistics. In order for resource estimator/qos > controller to calculate usage slack, or tell if a container is using > revocable resources or not, we need to expose the Resources that are > currently assigned to the container. > This requires us to change the containerizer interface to get the Resources > as well while calling 'usage()'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2748) /help generated links point to wrong URLs
[ https://issues.apache.org/jira/browse/MESOS-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550809#comment-14550809 ] Michael Lunøe edited comment on MESOS-2748 at 5/19/15 5:25 PM: --- [~haosd...@gmail.com] Yes, the problem is exactly the use of absolute paths. {{/mesos/help}} works (showing a page with URLs), but the URLs listed are absolute paths, i.e. {{/help/metrics}} or {{/help/\_\_processes\_\_}}. If it used relative paths, the links would show the correct paths instead: {{/mesos/help/metrics}} and {{/mesos/help/\_\_processes\_\_}}. Does that answer your question? was (Author: mlunoe): [~haosd...@gmail.com] Yes, the problem is exactly the use of absolute paths. {{/mesos/help}} works (showing a page with URLs), but the URLs listed are absolute paths, i.e. {{/help/metrics}} or {{/help/__processes__}}. If it used relative paths, the links would show the correct paths instead: {{/mesos/help/metrics}} and {{/mesos/help/__processes__}}. Does that answer your question? > /help generated links point to wrong URLs > - > > Key: MESOS-2748 > URL: https://issues.apache.org/jira/browse/MESOS-2748 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.22.1 >Reporter: Marco Massenzio >Priority: Minor > > As reported by Michael Lunøe (see also MESOS-329 and > MESOS-913 for background): > {quote} > In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, > which is then converted to html through a javascript library > All endpoints point to {{/help/...}}, they need to work dynamically for > reverse proxy to do its thing. {{/mesos/help}} works, and displays the > endpoints, but they each need to go to their respective {{/help/...}} > endpoint. > Note that this needs to work both for master, and for slaves. I think the > route to slaves help is something like this: > {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please > double check this. 
> {quote} > The fix appears to be not too complex (as it would require to simply > manipulate the generated URL) but a quick skim of the code would suggest that > something more substantial may be desirable too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2748) /help generated links point to wrong URLs
[ https://issues.apache.org/jira/browse/MESOS-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550809#comment-14550809 ] Michael Lunøe commented on MESOS-2748: -- [~haosd...@gmail.com] Yes, the problem is exactly the use of absolute paths. "/mesos/help" works (showing a page with URLs), but the URLs listed are absolute paths, i.e. "/help/metrics" or "/help/__processes__". If it used relative paths, the links would show the correct paths instead: "/mesos/help/metrics" and "/mesos/help/__processes__". Does that answer your question? > /help generated links point to wrong URLs > - > > Key: MESOS-2748 > URL: https://issues.apache.org/jira/browse/MESOS-2748 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.22.1 >Reporter: Marco Massenzio >Priority: Minor > > As reported by Michael Lunøe (see also MESOS-329 and > MESOS-913 for background): > {quote} > In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, > which is then converted to html through a javascript library > All endpoints point to {{/help/...}}, they need to work dynamically for > reverse proxy to do its thing. {{/mesos/help}} works, and displays the > endpoints, but they each need to go to their respective {{/help/...}} > endpoint. > Note that this needs to work both for master, and for slaves. I think the > route to slaves help is something like this: > {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please > double check this. > {quote} > The fix appears to be not too complex (as it would require to simply > manipulate the generated URL) but a quick skim of the code would suggest that > something more substantial may be desirable too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
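The difference between absolute and relative links behind a reverse proxy can be illustrated with a small sketch of how a browser resolves a link against the path of the page it is on. This is a simplified version of relative-reference resolution (no dot-segment handling, no query strings), and `resolve` is a hypothetical helper, not code from `help.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Resolves `link` against the path of the page that contains it.
// Simplified: ignores "." and ".." segments, queries, and fragments.
std::string resolve(const std::string& pagePath, const std::string& link) {
  // An absolute path replaces the page path entirely, so any proxy
  // prefix such as "/mesos" is lost.
  if (!link.empty() && link[0] == '/') {
    return link;
  }

  // A relative link is appended to everything up to and including the
  // last '/' of the page path, so the proxy prefix is preserved.
  const std::size_t slash = pagePath.rfind('/');
  const std::string directory =
    slash == std::string::npos ? "/" : pagePath.substr(0, slash + 1);

  return directory + link;
}
```

Under these assumptions, the page served at "/mesos/help" sends the user to "/help/metrics" when the link is absolute, but to "/mesos/help/metrics" when the link is written as "help/metrics"; the same relative link still resolves correctly when the page is reached directly at "/help".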
[jira] [Updated] (MESOS-2748) /help generated links point to wrong URLs
[ https://issues.apache.org/jira/browse/MESOS-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Lunøe updated MESOS-2748: - Description: As reported by Michael Lunøe (see also MESOS-329 and MESOS-913 for background): {quote} In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, which is then converted to html through a javascript library All endpoints point to {{/help/...}}, they need to work dynamically for reverse proxy to do its thing. {{/mesos/help}} works, and displays the endpoints, but they each need to go to their respective {{/help/...}} endpoint. Note that this needs to work both for master, and for slaves. I think the route to slaves help is something like this: {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please double check this. {quote} The fix appears to be not too complex (as it would require to simply manipulate the generated URL) but a quick skim of the code would suggest that something more substantial may be desirable too. was: As reported by Michael Lunøe (see also MESOS-329 and MESOS-913 for background): {quote} In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, which is then converted to html through a javascript library All endpoints point to {{/help/...}}, they need to work dynamically for reverse proxy to do its thing. {{/mesos/help}} works, and displays the endpoints, but they each need to go to their respective {{/mesos/help/...}} endpoint. Note that this needs to work both for master, and for slaves. I think the route to slaves help is something like this: {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please double check this. {quote} The fix appears to be not too complex (as it would require to simply manipulate the generated URL) but a quick skim of the code would suggest that something more substantial may be desirable too. 
> /help generated links point to wrong URLs > - > > Key: MESOS-2748 > URL: https://issues.apache.org/jira/browse/MESOS-2748 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.22.1 >Reporter: Marco Massenzio >Priority: Minor > > As reported by Michael Lunøe (see also MESOS-329 and > MESOS-913 for background): > {quote} > In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, > which is then converted to html through a javascript library > All endpoints point to {{/help/...}}, they need to work dynamically for > reverse proxy to do its thing. {{/mesos/help}} works, and displays the > endpoints, but they each need to go to their respective {{/help/...}} > endpoint. > Note that this needs to work both for master, and for slaves. I think the > route to slaves help is something like this: > {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please > double check this. > {quote} > The fix appears to be not too complex (as it would require to simply > manipulate the generated URL) but a quick skim of the code would suggest that > something more substantial may be desirable too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2588) Create pre-create hook before a Docker container launches
[ https://issues.apache.org/jira/browse/MESOS-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550779#comment-14550779 ] haosdent commented on MESOS-2588: - [~baotiao] Sorry for not updating this issue sooner. I have unassigned it now. > Create pre-create hook before a Docker container launches > - > > Key: MESOS-2588 > URL: https://issues.apache.org/jira/browse/MESOS-2588 > Project: Mesos > Issue Type: Bug > Components: docker >Reporter: Timothy Chen > > To be able to support custom actions to be called before launching a docker > container, we should create a hook that can be extensible and allow > module/hooks to be performed before a docker container is launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2588) Create pre-create hook before a Docker container launches
[ https://issues.apache.org/jira/browse/MESOS-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-2588: Assignee: (was: haosdent) > Create pre-create hook before a Docker container launches > - > > Key: MESOS-2588 > URL: https://issues.apache.org/jira/browse/MESOS-2588 > Project: Mesos > Issue Type: Bug > Components: docker >Reporter: Timothy Chen > > To be able to support custom actions to be called before launching a docker > container, we should create a hook that can be extensible and allow > module/hooks to be performed before a docker container is launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2588) Create pre-create hook before a Docker container launches
[ https://issues.apache.org/jira/browse/MESOS-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550765#comment-14550765 ] chenzongzhi commented on MESOS-2588: Hey haosdent, Adam Avilla. Do you have any plans for this issue? We really need this feature, so if you don't have time, maybe you can assign it to me. > Create pre-create hook before a Docker container launches > - > > Key: MESOS-2588 > URL: https://issues.apache.org/jira/browse/MESOS-2588 > Project: Mesos > Issue Type: Bug > Components: docker >Reporter: Timothy Chen >Assignee: haosdent > > To be able to support custom actions to be called before launching a docker > container, we should create a hook that can be extensible and allow > module/hooks to be performed before a docker container is launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2731) Allow frameworks to deploy storage drivers on demand.
[ https://issues.apache.org/jira/browse/MESOS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2731: -- Description: Certain storage options require storage drivers to access them including HDFS driver, Quobyte client, Database driver, and so on. When Tasks in Mesos require access to such storage they also need access to the respective driver on the node where they were scheduled to. As it is not desirable to deploy the driver onto all nodes in the cluster, it would be good to deploy the driver on demand. Use Cases: 1. Fetcher Cache pulling resources from user-provided URIs 2. Framework executors/tasks requiring r/w access to HDFS/DFS 3. Framework executors/tasks requiring r/w Databases access (requiring drivers) was: Certain storage options require storage drivers to access them including HDFS driver, Quobyte client, Database driver, and so on. When Tasks in Mesos require access to such storage they also need access to the respective driver on the node where they were scheduled to. As it is not desirable to deploy the driver onto all nodes in the cluster, it would be good to deploy the driver on demand. Use Cases: 1. Fetcher Cache accessing resources from user-provided URIs 2. Framework executors/tasks requiring access to HDFS/DFS 3. Framework executors/tasks requiring Databases access (requiring drivers) > Allow frameworks to deploy storage drivers on demand. > - > > Key: MESOS-2731 > URL: https://issues.apache.org/jira/browse/MESOS-2731 > Project: Mesos > Issue Type: Epic >Reporter: Joerg Schad > Labels: mesosphere > > Certain storage options require storage drivers to access them including HDFS > driver, Quobyte client, Database driver, and so on. > When Tasks in Mesos require access to such storage they also need access to > the respective driver on the node where they were scheduled to. > As it is not desirable to deploy the driver onto all nodes in the cluster, it > would be good to deploy the driver on demand. 
> Use Cases: > 1. Fetcher Cache pulling resources from user-provided URIs > 2. Framework executors/tasks requiring r/w access to HDFS/DFS > 3. Framework executors/tasks requiring r/w Databases access (requiring > drivers) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2731) Allow frameworks to deploy storage drivers on demand.
[ https://issues.apache.org/jira/browse/MESOS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2731: -- Labels: mesosphere (was: ) > Allow frameworks to deploy storage drivers on demand. > - > > Key: MESOS-2731 > URL: https://issues.apache.org/jira/browse/MESOS-2731 > Project: Mesos > Issue Type: Epic >Reporter: Joerg Schad > Labels: mesosphere > > Certain storage options require storage drivers to access them including HDFS > driver, Quobyte client, Database driver, and so on. > When Tasks in Mesos require access to such storage they also need access to > the respective driver on the node where they were scheduled to. > As it is not desirable to deploy the driver onto all nodes in the cluster, it > would be good to deploy the driver on demand. > Use Cases: > 1. Fetcher Cache accessing resources from user-provided URIs > 2. Framework executors/tasks requiring access to HDFS/DFS > 3. Framework executors/tasks requiring Databases access (requiring drivers) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2728) Introduce concept of cluster wide resources.
[ https://issues.apache.org/jira/browse/MESOS-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2728: -- Description: There are resources which are not provided by a single node. Consider, for example, the external network bandwidth of a cluster. Being a limited resource, it makes sense for Mesos to manage it, yet it is not a resource offered by any single node. A cluster-wide resource is still consumed by a task, and when that task completes, the resources are then available to be allocated to another framework/task. Use Cases: 1. Network Bandwidth 2. IP Addresses 3. Global Service Ports 4. Distributed File System Storage 5. Software Licences was: There are resources which are not provided by a single node. Consider, for example, the external network bandwidth of a cluster. Being a limited resource, it makes sense for Mesos to manage it, yet it is not a resource offered by any single node. Use Cases: 1. Network Bandwidth 2. IP Addresses 3. Global Service Ports 4. Distributed File System Storage 5. Software Licences > Introduce concept of cluster wide resources. > > > Key: MESOS-2728 > URL: https://issues.apache.org/jira/browse/MESOS-2728 > Project: Mesos > Issue Type: Epic >Reporter: Joerg Schad > Labels: mesosphere > > There are resources which are not provided by a single node. Consider, for > example, the external network bandwidth of a cluster. Being a limited resource, > it makes sense for Mesos to manage it, yet it is not a resource > offered by any single node. A cluster-wide resource is still consumed by a > task, and when that task completes, the resources are then available to be > allocated to another framework/task. > Use Cases: > 1. Network Bandwidth > 2. IP Addresses > 3. Global Service Ports > 4. Distributed File System Storage > 5. Software Licences -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2732) Expose Mount Tables
[ https://issues.apache.org/jira/browse/MESOS-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2732: -- Description: When there are multiple distributed/network-attached filesystems connected to a Mesos cluster, clients (e.g. the Mesos fetcher, or a Mesos task) of those filesystems need a clear way to distinguish between them and Mesos needs a way to direct requests to the correct (distributed) filesystem. _Use Cases_: - Multiple HDFS clusters on the same Mesos cluster - Connecting HDFS, MapRFS, Ceph, Lustre, GlusterFS, S3, GCS, and other SAN/NAS to a Mesos cluster - The Mesos fetcher may want to pull from any of the above. - An executor or task may want to read or write to multiple filesystems, within the same process. _Traditional Operating System Analogy_: Each line in Linux's fstab describes a different filesystem to mount into the root filesystem: 1. The device name or remote filesystem to be mounted. 2. The mount point, where the data is to be attached to the root file system. 3. The file system type or algorithm used to interpret the file system. 4. Options to be used when mounting (e.g. Read-Only). _What we need for each filesystem in the Mesos ecosystem_: 1. The metadata server or dfs/san entrypoint host:port 2. Mount point, where this filesystem fits into the universal Mesos-accessible filesystem namespace. 3. The protocol to speak, perhaps acceptable URI prefixes. 4. Options, ACLs for which frameworks/principals can access a particular filesystem, and how. was: When there are multiple distributed filesystems connected to a Mesos cluster, clients (e.g. the Mesos fetcher, or a Mesos task) of those filesystems need a clear way to distinguish between them and Mesos needs a way to direct requests to the correct (distributed) filesystem. 
#Use Cases: - Multiple HDFS clusters on the same Mesos cluster - Connecting HDFS, MapRFS, Ceph, Lustre, GlusterFS, S3, GCS, and other SAN/NAS to a Mesos cluster - The Mesos fetcher may want to pull from any of the above. - An executor or task may want to read or write to multiple filesystems, within the same process. #Traditional Operating System Analogy: Each line in Linux's fstab describes a different filesystem to mount into the root filesystem: 1. The device name or remote filesystem to be mounted. 2. The mount point, where the data is to be attached to the root file system. 3. The file system type or algorithm used to interpret the file system. 4. Options to be used when mounting (e.g. Read-Only). What we need for each filesystem in the Mesos ecosystem: 1. The metadata server or dfs/san entrypoint host:port 2. Mount point, where this filesystem fits into the universal Mesos-accessible filesystem namespace. 3. The protocol to speak, perhaps acceptable URI prefixes. 4. Options, ACLs for which frameworks/principals can access a particular filesystem, and how. > Expose Mount Tables > --- > > Key: MESOS-2732 > URL: https://issues.apache.org/jira/browse/MESOS-2732 > Project: Mesos > Issue Type: Epic >Reporter: Joerg Schad > Labels: mesosphere > > When there are multiple distributed/network-attached filesystems connected to > a Mesos cluster, clients (e.g. the Mesos fetcher, or a Mesos task) of those > filesystems need a clear way to distinguish between them and Mesos needs a > way to direct requests to the correct (distributed) filesystem. > _Use Cases_: > - Multiple HDFS clusters on the same Mesos cluster > - Connecting HDFS, MapRFS, Ceph, Lustre, GlusterFS, S3, GCS, and other > SAN/NAS to a Mesos cluster > - The Mesos fetcher may want to pull from any of the above. > - An executor or task may want to read or write to multiple filesystems, > within the same process. 
> _Traditional Operating System Analogy_: > Each line in Linux's fstab describes a different filesystem to mount into the > root filesystem: > 1. The device name or remote filesystem to be mounted. > 2. The mount point, where the data is to be attached to the root file system. > 3. The file system type or algorithm used to interpret the file system. > 4. Options to be used when mounting (e.g. Read-Only). > _What we need for each filesystem in the Mesos ecosystem_: > 1. The metadata server or dfs/san entrypoint host:port > 2. Mount point, where this filesystem fits into the universal > Mesos-accessible filesystem namespace. > 3. The protocol to speak, perhaps acceptable URI prefixes. > 4. Options, ACLs for which frameworks/principals can access a particular > filesystem, and how. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
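The four per-filesystem fields listed above map naturally onto a small table that can be searched by path, much like fstab. The following is only a sketch of that idea: `MountEntry`, `isUnder`, and `lookup` are hypothetical names, and longest-prefix matching on the mount point is an assumption about how requests might be directed, not something the ticket specifies.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of a hypothetical Mesos mount table, mirroring the four
// fields in the description.
struct MountEntry {
  std::string entrypoint;  // metadata server or dfs/san entrypoint host:port
  std::string mountPoint;  // location in the shared filesystem namespace
  std::string protocol;    // protocol to speak, e.g. a URI scheme
  std::string options;     // mount options, e.g. "ro" or "rw"
};

// True if `path` equals `mountPoint` or lies beneath it.
bool isUnder(const std::string& path, const std::string& mountPoint) {
  if (mountPoint == "/") {
    return !path.empty() && path[0] == '/';
  }
  if (path.compare(0, mountPoint.size(), mountPoint) != 0) {
    return false;
  }
  // Require a path-component boundary so "/hdfs/abc" is not under "/hdfs/a".
  return path.size() == mountPoint.size() || path[mountPoint.size()] == '/';
}

// Returns the entry with the longest matching mount point, or nullptr
// if no filesystem in the table covers `path`.
const MountEntry* lookup(
    const std::vector<MountEntry>& table, const std::string& path) {
  const MountEntry* best = nullptr;
  for (const MountEntry& entry : table) {
    if (isUnder(path, entry.mountPoint) &&
        (best == nullptr ||
         entry.mountPoint.size() > best->mountPoint.size())) {
      best = &entry;
    }
  }
  return best;
}
```

With a table holding two HDFS clusters mounted at, say, "/hdfs/a" and "/hdfs/b", a fetcher or task could resolve "/hdfs/a/data/file" to the first cluster's entrypoint without knowing about the second. ACLs per framework or principal would need an additional field and check, which this sketch leaves out.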
[jira] [Updated] (MESOS-2732) Expose Mount Tables
[ https://issues.apache.org/jira/browse/MESOS-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2732: -- Labels: mesosphere (was: ) > Expose Mount Tables > --- > > Key: MESOS-2732 > URL: https://issues.apache.org/jira/browse/MESOS-2732 > Project: Mesos > Issue Type: Epic >Reporter: Joerg Schad > Labels: mesosphere > > When there are multiple distributed filesystems connected to a Mesos cluster, > clients (e.g. the Mesos fetcher, or a Mesos task) of those filesystems need a > clear way to distinguish between them and Mesos needs a way to direct > requests to the correct (distributed) filesystem. > #Use Cases: > - Multiple HDFS clusters on the same Mesos cluster > - Connecting HDFS, MapRFS, Ceph, Lustre, GlusterFS, S3, GCS, and other > SAN/NAS to a Mesos cluster > - The Mesos fetcher may want to pull from any of the above. > - An executor or task may want to read or write to multiple filesystems, > within the same process. > #Traditional Operating System Analogy: > Each line in Linux's fstab describes a different filesystem to mount into the > root filesystem: > 1. The device name or remote filesystem to be mounted. > 2. The mount point, where the data is to be attached to the root file system. > 3. The file system type or algorithm used to interpret the file system. > 4. Options to be used when mounting (e.g. Read-Only). > What we need for each filesystem in the Mesos ecosystem: > 1. The metadata server or dfs/san entrypoint host:port > 2. Mount point, where this filesystem fits into the universal > Mesos-accessible filesystem namespace. > 3. The protocol to speak, perhaps acceptable URI prefixes. > 4. Options, ACLs for which frameworks/principals can access a particular > filesystem, and how. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2741) Exposing Resources along with ResourceStatistics from resource monitor
[ https://issues.apache.org/jira/browse/MESOS-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550708#comment-14550708 ] haosdent commented on MESOS-2741: - Is it `calculate usage slack` or `calculate usage stack`? > Exposing Resources along with ResourceStatistics from resource monitor > -- > > Key: MESOS-2741 > URL: https://issues.apache.org/jira/browse/MESOS-2741 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu > Labels: mesosphere, twitter > > Right now, the resource monitor returns a Usage which contains ContainerId, > ExecutorInfo and ResourceStatistics. In order for resource estimator/qos > controller to calculate usage slack, or tell if a container is using > revocable resources or not, we need to expose the Resources that are > currently assigned to the container. > This requires us to change the containerizer interface to get the Resources > as well while calling 'usage()'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2745) Add 'Path' to stout's user guide
[ https://issues.apache.org/jira/browse/MESOS-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550687#comment-14550687 ] haosdent commented on MESOS-2745: - Review board: https://reviews.apache.org/r/34416/ > Add 'Path' to stout's user guide > - > > Key: MESOS-2745 > URL: https://issues.apache.org/jira/browse/MESOS-2745 > Project: Mesos > Issue Type: Improvement >Reporter: Till Toenshoff > Labels: newbie > > stout's README does not yet include 'Path', we should fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2340) Publish JSON in ZK instead of serialized MasterInfo
[ https://issues.apache.org/jira/browse/MESOS-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550654#comment-14550654 ] Marco Massenzio commented on MESOS-2340: I'm not familiar with the {{multi}} operation, however, thinking a bit more about this, it turns out the solution should be simpler: post ephemeral node creation, create another "mirror" JSON-content znode, equally ephemeral, that will go away whenever the original PB-content znode does. This seems a simple enough approach (and, as such, I'm sure I'm overlooking something!) I'm looking into the code, and it seems to me that the {{GroupProcess::doJoin()}} is the place to do this (maybe?) > Publish JSON in ZK instead of serialized MasterInfo > --- > > Key: MESOS-2340 > URL: https://issues.apache.org/jira/browse/MESOS-2340 > Project: Mesos > Issue Type: Improvement >Reporter: Zameer Manji >Assignee: haosdent > > Currently to discover the master a client needs the ZK node location and > access to the MasterInfo protobuf so it can deserialize the binary blob in > the node. > I think it would be nice to publish JSON (like Twitter's ServerSets) so > clients are not tied to protobuf to do service discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2748) /help generated links point to wrong URLs
[ https://issues.apache.org/jira/browse/MESOS-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550642#comment-14550642 ] haosdent commented on MESOS-2748: - Hi, [~marco-mesos]. I am sorry, but I could not quite follow your idea here. Do you mean that the "/help" endpoint is an absolute path and does not work when the user wants it to appear as "/mesos/help" behind a reverse proxy? In nginx, one could add a rewrite rule to solve this problem. > /help generated links point to wrong URLs > - > > Key: MESOS-2748 > URL: https://issues.apache.org/jira/browse/MESOS-2748 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.22.1 >Reporter: Marco Massenzio >Priority: Minor > > As reported by Michael Lunøe (see also MESOS-329 and > MESOS-913 for background): > {quote} > In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, > which is then converted to html through a javascript library > All endpoints point to {{/help/...}}, they need to work dynamically for > reverse proxy to do its thing. {{/mesos/help}} works, and displays the > endpoints, but they each need to go to their respective {{/mesos/help/...}} > endpoint. > Note that this needs to work both for master, and for slaves. I think the > route to slaves help is something like this: > {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please > double check this. > {quote} > The fix appears to be not too complex (as it would require to simply > manipulate the generated URL) but a quick skim of the code would suggest that > something more substantial may be desirable too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2735) Change the interaction between the slave and the resource estimator from polling to pushing
[ https://issues.apache.org/jira/browse/MESOS-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550575#comment-14550575 ] Benjamin Hindman commented on MESOS-2735: - I'd also like to understand better why to go push instead of pull (poll). One of the advantages that we had discussed in the past was that the pull model enables us to move as fast as we possibly can, rather than just getting a bunch of messages queued up in the slave that we have to process. Even if we want to collect more fine-grained resource estimations a ResourceEstimator could do this and store this information until future polls. > Change the interaction between the slave and the resource estimator from > polling to pushing > > > Key: MESOS-2735 > URL: https://issues.apache.org/jira/browse/MESOS-2735 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Jie Yu > Labels: twitter > > This will make the semantics more clear. The resource estimator can control > the speed of sending resources estimation to the slave. > To avoid cyclic dependency, slave will register a callback with the resource > estimator and the resource estimator will simply invoke that callback when > there's a new estimation ready. The callback will be a defer to the slave's > main event queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2747) Add "watch" to the state abstraction
[ https://issues.apache.org/jira/browse/MESOS-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Connor Doyle updated MESOS-2747:

Description:
Use case: Frameworks that intend to survive failover tend to implement leader election. Adding the ability to listen for changes to a variable's value could be a first step towards reusable leader election libraries that don't depend on a particular backing store.
cc [~kozyraki]

was:
Use case: Frameworks that intend to survive failover tend to implement leader election. Watchable storage could be a first step towards reusable leader election libraries that don't depend on a particular backing store.
cc [~kozyraki]

> Add "watch" to the state abstraction
>
> Key: MESOS-2747
> URL: https://issues.apache.org/jira/browse/MESOS-2747
> Project: Mesos
> Issue Type: Wish
> Components: c++ api, java api
> Reporter: Connor Doyle
> Priority: Minor
> Labels: mesosphere
>
> Use case: Frameworks that intend to survive failover tend to implement leader
> election. Adding the ability to listen for changes to a variable's value
> could be a first step towards reusable leader election libraries that don't
> depend on a particular backing store.
> cc [~kozyraki]
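As a rough picture of what "listen for changes to a variable's value" could mean, here is a small in-memory Python sketch. It is a stand-in, not the Mesos state API: the class `WatchableState`, its `watch` method, and the callback signature are hypothetical, and a real backing store would also have to deal with concurrency and delivery guarantees.

```python
from typing import Callable, Dict, List, Optional

# A watcher receives the variable name and its new value.
Watcher = Callable[[str, bytes], None]


class WatchableState:
    """Hypothetical in-memory variable store with change notifications."""

    def __init__(self) -> None:
        self._values: Dict[str, bytes] = {}
        self._watchers: Dict[str, List[Watcher]] = {}

    def fetch(self, name: str) -> Optional[bytes]:
        """Return the current value, or None if the variable is unset."""
        return self._values.get(name)

    def watch(self, name: str, callback: Watcher) -> None:
        """Register a callback that runs whenever the value of name changes."""
        self._watchers.setdefault(name, []).append(callback)

    def store(self, name: str, value: bytes) -> None:
        """Write a value and notify watchers only if it actually changed."""
        changed = self._values.get(name) != value
        self._values[name] = value
        if changed:
            for callback in list(self._watchers.get(name, [])):
                callback(name, value)


if __name__ == "__main__":
    state = WatchableState()
    state.watch("leader", lambda name, value: print(name, value))
    state.store("leader", b"framework-1")
```

A leader election library built on such an interface would watch the variable that names the current leader and react when it changes, without knowing which store sits underneath.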
[jira] [Comment Edited] (MESOS-708) Static files missing "Last-Modified" HTTP headers
[ https://issues.apache.org/jira/browse/MESOS-708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320300#comment-14320300 ]

Alexander Rojas edited comment on MESOS-708 at 5/19/15 11:41 AM:

https://reviews.apache.org/r/34392/
https://reviews.apache.org/r/30032/

was (Author: arojas):
https://reviews.apache.org/r/30032/

> Static files missing "Last-Modified" HTTP headers
>
> Key: MESOS-708
> URL: https://issues.apache.org/jira/browse/MESOS-708
> Project: Mesos
> Issue Type: Improvement
> Components: libprocess, webui
> Affects Versions: 0.13.0
> Reporter: Ross Allen
> Assignee: Alexander Rojas
> Labels: mesosphere
>
> Static assets served by the Mesos master don't return "Last-Modified" HTTP
> headers. That means clients receive a 200 status code and re-download assets
> on every page request even if the assets haven't changed. Because AngularJS
> does most of the work, the downloading happens only when you navigate to
> Mesos master in your browser or use the browser's refresh.
> Example header for "mesos.css":
> HTTP/1.1 200 OK
> Date: Thu, 26 Sep 2013 17:18:52 GMT
> Content-Length: 1670
> Content-Type: text/css
> Clients sometimes use the "Date" header for the same effect as
> "Last-Modified", but the date is always the time of the response from the
> server, i.e. it changes on every request and makes the assets look new every
> time.
> The "Last-Modified" header should be added and should be the last modified
> time of the file. On subsequent requests for the same files, the master
> should return 304 responses with no content rather than 200 with the full
> files. It could save clients a lot of download time since Mesos assets are
> rather heavyweight.
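The behaviour requested in the issue (send "Last-Modified", answer 304 when the client's copy is current) can be sketched in Python using only the standard library. This is not the libprocess implementation from the linked reviews: the function `respond` and its parameters are hypothetical, and it assumes the client sends back the date in an "If-Modified-Since" request header.

```python
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def respond(
    mtime: float, body: bytes, if_modified_since: Optional[str] = None
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a static file.

    mtime is the file's last-modified time in seconds since the epoch.
    HTTP dates have one-second resolution, so mtime is truncated.
    """
    modified = int(mtime)
    headers = {"Last-Modified": formatdate(modified, usegmt=True)}

    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            if since.tzinfo is None:
                # Treat dates without a zone as UTC rather than local time.
                since = since.replace(tzinfo=timezone.utc)
            if modified <= since.timestamp():
                # The client's copy is still current: no body is sent.
                return 304, headers, b""
        except (TypeError, ValueError):
            # Unparseable header: fall through and send the full file.
            pass

    headers["Content-Length"] = str(len(body))
    return 200, headers, body


if __name__ == "__main__":
    status, headers, _ = respond(1380215932.0, b"body { margin: 0; }")
    print(status, headers["Last-Modified"])
    print(respond(1380215932.0, b"...", headers["Last-Modified"])[0])
```

The first call returns the full file with a "Last-Modified" header; repeating the request with that header value as "If-Modified-Since" yields a 304 with an empty body, which is the saving the issue describes.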