[jira] [Commented] (MESOS-3059) Allow http endpoint to dynamically change the slave attributes

2016-04-20 Thread Aaron Carey (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249903#comment-15249903 ]

Aaron Carey commented on MESOS-3059:


We would also love this feature!

> Allow http endpoint to dynamically change the slave attributes
> --------------------------------------------------------------
>
> Key: MESOS-3059
> URL: https://issues.apache.org/jira/browse/MESOS-3059
> Project: Mesos
> Issue Type: Wish
> Reporter: Nitin
> Labels: mesosphere
>
> It is well understood that changing attributes dynamically is not safe 
> without a restart, because the slave itself may not know which old framework 
> tasks running on it depended on the previous attributes. However, a full 
> restart deletes a lot of other history. We need a way to change attributes 
> dynamically with only a soft restart. 
> It would be good to expose a REST endpoint on either the slave or the 
> mesos-master that directly changes the state in ZooKeeper.
> USE-CASE
> We use slave attributes/roles to direct framework scheduling to specific 
> slaves according to each framework's requirements; the Mesos scheduler 
> creates offers only on the basis of resources.
> In our use case, we categorize our Spark frameworks or jobs, run with a 
> framework like Marathon, based on multiple factors. We want jobs or 
> frameworks belonging to one category to run on their own specific cluster of 
> resources, and we want to dynamically manage the slaves that form these 
> logical sub-clusters.
> Since the number of jobs that will be submitted, and when, is highly 
> dynamic, it makes sense to be able to dynamically assign roles or attributes 
> to slaves. It is not possible to gauge the requirements at the time of 
> cluster provisioning, and static role or attribute assignment leads to 
> sub-optimal use of the cluster.
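
To make the request concrete, here is a minimal sketch of what such an 
endpoint could look like from the operator's side. The /attributes path and 
payload shape are hypothetical (no such endpoint exists in Mesos today) and 
the agent address is a placeholder; attributes are currently only settable 
statically at agent start via a flag such as --attributes="rack:r1;pool:spark".

    # Hypothetical sketch: POST a new attribute map to a Mesos agent.
    # Neither the /attributes endpoint nor the payload shape exist in
    # Mesos; this only illustrates the API the ticket is asking for.
    import requests

    AGENT = "http://agent1.example.com:5051"  # placeholder agent address

    def set_attributes(attributes):
        """Ask the (hypothetical) agent to adopt new attributes."""
        resp = requests.post(f"{AGENT}/attributes", json=attributes)
        resp.raise_for_status()
        return resp.json()

    # Move this agent into the "spark" logical sub-cluster without a
    # full restart, per the use case above.
    print(set_attributes({"rack": "r1", "pool": "spark"}))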



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3548) Investigate federations of Mesos masters

2016-04-01 Thread Aaron Carey (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221357#comment-15221357 ]

Aaron Carey commented on MESOS-3548:


I'd love to see an overall design doc for this. Have you tested it over WAN? 
What are you using as the replicated policy store?

> Investigate federations of Mesos masters
> ----------------------------------------
>
> Key: MESOS-3548
> URL: https://issues.apache.org/jira/browse/MESOS-3548
> Project: Mesos
> Issue Type: Improvement
> Reporter: Neil Conway
> Labels: federation, mesosphere, multi-dc
>
> In a large Mesos installation, the operator might want to ensure that even if 
> the Mesos masters are inaccessible or have failed, new tasks can still be 
> scheduled (across multiple different frameworks). HA masters are only a 
> partial solution here: the masters might still be inaccessible due to a 
> correlated failure (e.g., ZooKeeper misconfiguration/human error).
> To address this, we could support the notion of "hierarchies" or 
> "federations" of Mesos masters. In a Mesos installation with 10k machines, 
> the operator might configure 10 Mesos masters (each of which might be HA) to 
> manage 1k machines each. Then an additional "meta-master" would manage the 
> allocation of cluster resources to the 10 masters. Hence, the failure of any 
> individual master would impact 1k machines at most. The meta-master might not 
> have a lot of work to do: e.g., it might be limited to occasionally 
> reallocating cluster resources among the 10 masters, or ensuring that newly 
> added cluster resources are allocated among the masters as appropriate. Thus, 
> the failure of the meta-master would not prevent any of the individual 
> masters from scheduling new tasks. A single framework instance probably 
> wouldn't be able to use more resources than have been assigned to a single 
> master, but that seems like a reasonable restriction.
> This feature might also be a good fit for a multi-datacenter deployment of 
> Mesos: each Mesos master instance would manage a single DC. Naturally, 
> reducing the traffic between frameworks and the meta-master would be 
> important for performance reasons in a configuration like this.
> Operationally, this might be simpler if Mesos processes were self-hosting 
> ([MESOS-3547]).
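
For illustration, the meta-master's core job as described above is largely 
bookkeeping: it owns the machine pool and occasionally reallocates it among 
the sub-masters. The sketch below is purely hypothetical (none of these names 
come from Mesos) and ignores HA, persistence, and the policy store asked about 
in the comment above.

    # Hypothetical sketch of meta-master bookkeeping: evenly reassign the
    # installation's machines across sub-masters. Real reallocation would
    # be policy-driven and incremental; this only shows the shape of it.
    def rebalance(masters, machines):
        """Return a dict mapping each sub-master to its slice of machines."""
        masters = sorted(masters)
        per_master, remainder = divmod(len(machines), len(masters))
        assignments, start = {}, 0
        for i, master in enumerate(masters):
            count = per_master + (1 if i < remainder else 0)
            assignments[master] = machines[start:start + count]
            start += count
        return assignments

    # 10 sub-masters, 10k machines -> roughly 1k machines each, so the
    # failure of any one sub-master affects at most its own slice.
    machines = ["node%d" % i for i in range(10000)]
    result = rebalance(["master%d" % i for i in range(10)], machines)
    print({m: len(nodes) for m, nodes in result.items()})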



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3548) Investigate federations of Mesos masters

2016-03-21 Thread Aaron Carey (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203909#comment-15203909 ]

Aaron Carey commented on MESOS-3548:


We're also very interested in this: we have datacentres distributed globally 
and are experimenting with ways to move workloads from one region to another 
during peak periods. This raises big questions for us with regard to storage 
and data locality, but having Mesos support multiple datacentres would be a 
huge step forward!

> Investigate federations of Mesos masters
> ----------------------------------------
>
> Key: MESOS-3548
> URL: https://issues.apache.org/jira/browse/MESOS-3548
> Project: Mesos
> Issue Type: Improvement
> Reporter: Neil Conway
> Labels: federation, mesosphere, multi-dc
>
> In a large Mesos installation, the operator might want to ensure that even if 
> the Mesos masters are inaccessible or have failed, new tasks can still be 
> scheduled (across multiple different frameworks). HA masters are only a 
> partial solution here: the masters might still be inaccessible due to a 
> correlated failure (e.g., ZooKeeper misconfiguration/human error).
> To address this, we could support the notion of "hierarchies" or 
> "federations" of Mesos masters. In a Mesos installation with 10k machines, 
> the operator might configure 10 Mesos masters (each of which might be HA) to 
> manage 1k machines each. Then an additional "meta-master" would manage the 
> allocation of cluster resources to the 10 masters. Hence, the failure of any 
> individual master would impact 1k machines at most. The meta-master might not 
> have a lot of work to do: e.g., it might be limited to occasionally 
> reallocating cluster resources among the 10 masters, or ensuring that newly 
> added cluster resources are allocated among the masters as appropriate. Thus, 
> the failure of the meta-master would not prevent any of the individual 
> masters from scheduling new tasks. A single framework instance probably 
> wouldn't be able to use more resources than have been assigned to a single 
> master, but that seems like a reasonable restriction.
> This feature might also be a good fit for a multi-datacenter deployment of 
> Mesos: each Mesos master instance would manage a single DC. Naturally, 
> reducing the traffic between frameworks and the meta-master would be 
> important for performance reasons in a configuration like this.
> Operationally, this might be simpler if Mesos processes were self-hosting 
> ([MESOS-3547]).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3507) As an operator, I want a way to inspect queued tasks in running schedulers

2015-09-24 Thread Aaron Carey (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906266#comment-14906266 ]

Aaron Carey commented on MESOS-3507:


As an interesting side note: it'd also be useful for us to see what kind of 
resources tasks are waiting for. For example, we have some machines with GPUs 
in them, which are exposed as a resource on the mesos-agent, so no amount of 
launching non-GPU instances would help those tasks!
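
For context, a sketch of the setup described here, assuming GPUs are 
advertised as a custom scalar resource at agent start (e.g. 
--resources="gpus:2;cpus:8;mem:65536"): a task requiring gpus > 0 can then 
only ever be offered such agents, so launching GPU-less instances can never 
unblock it. The hostname below is a placeholder, and the exact /state.json 
response shape may vary by Mesos version.

    # Read an agent's advertised resources from its state endpoint to
    # confirm the custom "gpus" resource is present. Hostname is a
    # placeholder; response fields may differ across Mesos versions.
    import requests

    state = requests.get("http://gpu-agent.example.com:5051/state.json").json()
    print(state["resources"])  # e.g. {"gpus": 2.0, "cpus": 8.0, "mem": 65536.0}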

> As an operator, I want a way to inspect queued tasks in running schedulers
> ---------------------------------------------------------------------------
>
> Key: MESOS-3507
> URL: https://issues.apache.org/jira/browse/MESOS-3507
> Project: Mesos
> Issue Type: Story
> Reporter: Niklas Quarfot Nielsen
>
> Currently, there is no uniform way of getting a notion of 'awaiting' tasks, 
> i.e. expressing that a framework has more work to do. This information is 
> useful for auto-scaling and anomaly detection systems. Schedulers tend to 
> expose this over their own HTTP endpoints, but the formats across schedulers 
> are most likely not compatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3507) As an operator, I want a way to inspect queued tasks in running schedulers

2015-09-24 Thread Aaron Carey (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906108#comment-14906108 ]

Aaron Carey commented on MESOS-3507:


The description sums up our use case pretty well: we have multiple different 
frameworks scheduling a variety of jobs of varying sizes (in terms of RAM and 
CPU). We wanted to create a scheduler-agnostic way of scaling agent nodes up 
(and possibly down). Currently we're relying on metrics like CPU/memory 
utilisation; whilst this is useful, it doesn't tell you how many tasks are 
waiting on resources. Knowing this would allow you to spin up more or fewer 
machines to cope with the load and give a better idea of pressure on the 
system.

As we're using Marathon and Chronos etc., we couldn't build autoscaling 
directly into those schedulers, and only having it built into our in-house 
framework would potentially miss many situations (e.g. if our in-house 
scheduler had no jobs waiting, it would assume everything was fine, while 
Marathon could still have several jobs waiting on resources).
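
As a stop-gap, something like the following could approximate that signal for 
Marathon specifically: poll its launch queue and count instances still waiting 
for offers. Marathon does expose a queue endpoint (/v2/queue), though the 
response shape sketched below may vary by version, and every other framework 
would need its own equivalent -- which is exactly the non-uniformity this 
ticket describes.

    # Count instances Marathon is still trying to launch, as a rough
    # autoscaling signal. The address is a placeholder and the response
    # shape may differ across Marathon versions.
    import requests

    MARATHON = "http://marathon.example.com:8080"  # placeholder address

    def waiting_instances():
        queue = requests.get(f"{MARATHON}/v2/queue").json().get("queue", [])
        # Each queue entry describes an app with `count` instances that
        # have not yet been matched to offers.
        return sum(entry.get("count", 0) for entry in queue)

    print("%d instances waiting on resources" % waiting_instances())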

> As an operator, I want a way to inspect queued tasks in running schedulers
> ---------------------------------------------------------------------------
>
> Key: MESOS-3507
> URL: https://issues.apache.org/jira/browse/MESOS-3507
> Project: Mesos
> Issue Type: Story
> Reporter: Niklas Quarfot Nielsen
>
> Currently, there is no uniform way of getting a notion of 'awaiting' tasks, 
> i.e. expressing that a framework has more work to do. This information is 
> useful for auto-scaling and anomaly detection systems. Schedulers tend to 
> expose this over their own HTTP endpoints, but the formats across schedulers 
> are most likely not compatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)