[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave

2014-06-26 Thread Tobias Weingartner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044782#comment-14044782
 ] 

Tobias Weingartner commented on MESOS-1529:
---

Reading point #3 above, I believe you mean =.  Otherwise you could wait 
forever for a ping that will arrive at some point in the future.  :)

I think in the end, the most robust solution will be for the master to not be 
responsible for initiating/opening any connections to frameworks and/or slaves. 
 If we do this, then staying connected would be the slave's (framework's) 
responsibility.

For example, using the HTTP CONNECT method, a slave could request direct 
access to a master's particular pid endpoint, something like:
{noformat}
CONNECT pid1@master HTTP/1.0
Content-Transfer-Encoding: application/x-mesos-protobuf-v1
Authorization: token=..., ...

{noformat}

With the server responding with (only during connection):
{noformat}
HTTP/1.1 200 Connection established
X-Welcome-Message: Welcome to the cloud

{noformat}

At this point, the connection becomes a plain binary TCP stream, which the 
master can use to send protobuf requests to the slave, including ping/pong, 
etc.  If multiple pid endpoints are required, they could be multiplexed over 
this single link: instead of connecting directly to a particular pid, you 
would connect to a mux pid, and the messages would then be routed to the 
correct pids.  Not sure if this makes any sense.
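
A minimal sketch of the handshake and the mux idea described above, in Python and without any real sockets. The CONNECT line and the header names are copied from the example in this comment; the frame layout (length-prefixed pid and payload) and all function names are assumptions made up for illustration, not an existing Mesos or libprocess wire format.

```python
import struct

def build_connect_request(pid: str, token: str) -> bytes:
    """Build the CONNECT request a slave would send (headers as in the comment)."""
    lines = [
        f"CONNECT {pid} HTTP/1.0",
        "Content-Transfer-Encoding: application/x-mesos-protobuf-v1",
        f"Authorization: token={token}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def connection_established(response: bytes) -> bool:
    """Return True if the status line of the response carries a 2xx code."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        return False
    return len(parts[1]) == 3 and parts[1].startswith("2")

_HEADER = "!HI"  # hypothetical frame header: pid length (2 bytes), payload length (4 bytes)

def mux_frame(pid: str, payload: bytes) -> bytes:
    """Wrap an already-serialized message so a mux pid can route it to `pid`."""
    name = pid.encode("utf-8")
    return struct.pack(_HEADER, len(name), len(payload)) + name + payload

def mux_unframe(buf: bytes):
    """Return (pid, payload, remaining bytes), or None if the frame is incomplete."""
    header_len = struct.calcsize(_HEADER)
    if len(buf) < header_len:
        return None
    name_len, payload_len = struct.unpack(_HEADER, buf[:header_len])
    end = header_len + name_len + payload_len
    if len(buf) < end:
        return None
    pid = buf[header_len:header_len + name_len].decode("utf-8")
    return pid, buf[header_len + name_len:end], buf[end:]
```

Once the 2xx response has been read, both sides would stop speaking HTTP and exchange only such frames on the same TCP connection.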

Anyway, I gather this would be a rather large rewrite, and changing protocols 
in a live system is... well, interesting.
Note: RFC 6455 (WebSocket) might be another option, albeit much more involved...

 Handle a network partition between Master and Slave
 ---

 Key: MESOS-1529
 URL: https://issues.apache.org/jira/browse/MESOS-1529
 Project: Mesos
  Issue Type: Bug
Reporter: Dominic Hamon

 If a network partition occurs between a Master and Slave, the Master will 
 remove the Slave (as it fails health check) and mark the tasks being run 
 there as LOST. However, the Slave is not aware that it has been removed so 
 the tasks will continue to run.
 (To clarify a little bit: neither the master nor the slave receives 'exited' 
 event, indicating that the connection between the master and slave is not 
 closed).
 There are at least two possible approaches to solving this issue:
 1. Introduce a health check from Slave to Master so they have a consistent 
 view of a network partition. We may still see this issue should a one-way 
 connection error occur.
 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the 
 Slave reappears and reconcile then. We'd still need to mark Slaves and tasks 
 as potentially lost (zombie state) but maybe the Scheduler can make a more 
 intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave

2014-06-25 Thread Tobias Weingartner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043853#comment-14043853
 ] 

Tobias Weingartner commented on MESOS-1529:
---

2) What does an exit event signify?  Why would we need to check that it was 
for a leading master?

3) How is the 75 seconds determined?  Does this lock us into a phased upgrade 
path if this timeout value needs to change?  If we get a ping from a 
non-leading master, we should likely ignore it and not immediately trigger 
re-registration.  IE: let the timeout take effect.
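
To make the last point concrete, here is a small sketch of a slave-side liveness tracker in which a ping from a non-leading master is simply dropped and only the timeout decides when to re-register. The class and method names are hypothetical, and the 75 second value is just the number under discussion.

```python
class MasterLiveness:
    """Tracks pings from the leading master (all names here are hypothetical)."""

    def __init__(self, leader: str, timeout_secs: float, now: float):
        self.leader = leader
        self.timeout_secs = timeout_secs
        self.last_ping = now

    def on_ping(self, sender: str, now: float) -> bool:
        """Record a ping; return False and change nothing if it is not from the leader."""
        if sender != self.leader:
            return False
        self.last_ping = now
        return True

    def should_reregister(self, now: float) -> bool:
        """Only the elapsed time since the leader's last ping triggers re-registration."""
        return now - self.last_ping >= self.timeout_secs
```

Passing `now` in explicitly keeps the sketch testable without a real clock.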



[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave

2014-06-25 Thread Tobias Weingartner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043985#comment-14043985
 ] 

Tobias Weingartner commented on MESOS-1529:
---

{quote}
An exited event signifies that a link between slave -- master is broken. 
This could be due to network partition or master failover. We need to check if 
it was from the leading master because, before exited event is received by 
the slave, the slave might have received a new master detected event from zk 
and re-registered with a new master. In that case, the slave can safely ignore 
the exited event.
{quote}
This sounds like a race: a slave may have multiple masters connected to it 
while a master fail-over is happening.
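
A sketch of the check described in the quote, assuming the exited event carries the pid of the peer whose link broke. Comparing that pid against the currently known leader makes the outcome independent of which event arrives first. The class and the return values are hypothetical.

```python
class SlaveLinkState:
    """Remembers the leading master and classifies 'exited' events (hypothetical names)."""

    def __init__(self):
        self.leader = None

    def on_new_master_detected(self, master_pid: str) -> None:
        """Called when the detector (e.g. via zk) announces a leading master."""
        self.leader = master_pid

    def on_exited(self, pid: str) -> str:
        """Ignore a broken link to anything but the current leader."""
        if pid != self.leader:
            return "ignore"
        self.leader = None
        return "wait-for-leader"
```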

{quote}
 | Does this lock us into a phased upgrade path if this timeout value needs to 
change?
I don't see why it would lock us into an upgrade path.
{quote}
What I meant here was: what if the operator decides that a 75s delay is too 
long, or too short, and it needs to be changed in a running cluster?  At that 
point, deploying the change looks more involved, possibly requiring the 
coordination of thousands of machines.  If the option is not surfaced to the 
operator (no flags, etc.), then if/when this single static number changes 
(say, to become adaptive based on the number of slaves), the change will 
likely require a lot of planning and prep.

I see this as having a constant in two places without one informing the other 
what the constant should be.  When it changes in one place, it can have a 
detrimental effect on the rest of the system.  Say a new master release is 
going to go with 150s pings due to load issues: if the masters roll before all 
the slaves have rolled to the new code, the slaves will end up flapping, etc.
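
One way to avoid the duplicated constant, sketched under assumptions: the master tells the slave which timeout to use when it acknowledges registration, and the slave falls back to its built-in default only if the field is absent. The `Registered` message and its `ping_timeout_secs` field are hypothetical, not existing Mesos protobufs, and "150s pings" is read here as the gap between pings that a slave can expect.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PING_TIMEOUT_SECS = 75.0  # the static value under discussion

@dataclass
class Registered:
    """Hypothetical registration reply sent by the master to the slave."""
    master_id: str
    ping_timeout_secs: Optional[float] = None  # None models an old master

def effective_timeout(reply: Registered) -> float:
    """Prefer the master's advertised timeout over the slave's built-in default."""
    if reply.ping_timeout_secs is not None:
        return reply.ping_timeout_secs
    return DEFAULT_PING_TIMEOUT_SECS

def would_flap(master_ping_gap_secs: float, slave_timeout_secs: float) -> bool:
    """A slave times out spuriously when pings are spaced wider than its timeout."""
    return master_ping_gap_secs > slave_timeout_secs
```

With the static default, a master pinging every 150s trips a 75s slave timeout on every cycle; with the advertised value, the mismatch cannot occur during a rolling upgrade.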



[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave

2014-06-24 Thread Tobias Weingartner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042284#comment-14042284
 ] 

Tobias Weingartner commented on MESOS-1529:
---

I don't think that #2 is an option.  We may be able to add extra 
information/messages to let the frameworks know that something has potentially 
been lost, and allow the frameworks to choose which side of CAP they land 
on.  With the current assumptions and implementation, I believe that modifying 
#2 would be a mistake.



[jira] [Created] (MESOS-1335) Make state.json information partially accessible as well as via text

2014-05-14 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1335:
-

 Summary: Make state.json information partially accessible as well 
as via text
 Key: MESOS-1335
 URL: https://issues.apache.org/jira/browse/MESOS-1335
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Reporter: Tobias Weingartner
Priority: Minor


The information returned by {{http://localhost:5051/slave(1)/state.json}} is 
rather voluminous, especially if you're trying to do certain simple things like 
finding out which version of a slave happens to be running.

Possible improvement: allow addressing portions of the endpoint:
{noformat}
curl -s 'localhost:5051/slave(1)/state/version.json'
curl -s 'localhost:5051/slave(1)/state/attributes.txt'
{noformat}

The above would return something like:
{noformat}
{"version": "0.18.0"}
{noformat}
{noformat}
/attributes/host some-hostname
/attributes/rack some-other-rack
/attributes/attr-name attr-value
{noformat}
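
As a rough sketch of how such addressing could work, assuming the full state is already available as a parsed JSON document: the name before the extension selects a subtree, and the extension picks the rendering. The function names are hypothetical; the output shapes follow the examples above.

```python
import json

def select(state: dict, path: str):
    """Walk a '/'-separated path into the parsed state document."""
    node = state
    for key in filter(None, path.split("/")):
        node = node[key]
    return node

def to_text(prefix: str, value) -> list:
    """Flatten a value into '<path> <value>' lines, one per leaf."""
    if isinstance(value, dict):
        lines = []
        for key in sorted(value):
            lines.extend(to_text(f"{prefix}/{key}", value[key]))
        return lines
    return [f"{prefix} {value}"]

def render(state: dict, resource: str) -> str:
    """Serve e.g. 'version.json' or 'attributes.txt' from the full state."""
    name, _, ext = resource.rpartition(".")
    value = select(state, name)
    if ext == "json":
        return json.dumps({name: value})
    if ext == "txt":
        return "\n".join(to_text(f"/{name}", value))
    raise ValueError(f"unsupported format: {resource!r}")
```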

Possibly an interim solution to the volume of data would be to pull out certain 
information into another endpoint (something like stats.json, maybe 
version.json or environment.json?).  In particular, the keys I'd be looking for 
would be:
{noformat}
attributes
flags
build_*
hostname
id
log_dir
master_hostname
pid
resources
start_time
version
{noformat}
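
For illustration, the interim endpoint could be little more than a filter over the existing state document. This sketch assumes the keys listed above are top-level keys and treats {{build_*}} as a glob; the function name and the sample values in the usage note are made up.

```python
from fnmatch import fnmatchcase

# Key patterns taken from the list above; 'build_*' is treated as a glob.
WANTED = [
    "attributes", "flags", "build_*", "hostname", "id", "log_dir",
    "master_hostname", "pid", "resources", "start_time", "version",
]

def summarize(state: dict, patterns=WANTED) -> dict:
    """Keep only the top-level keys that match one of the wanted patterns."""
    return {
        key: value
        for key, value in state.items()
        if any(fnmatchcase(key, pattern) for pattern in patterns)
    }
```

For example, `summarize({"version": "0.18.0", "build_user": "someone", "frameworks": []})` keeps `version` and `build_user` and drops the bulky `frameworks` list.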





[jira] [Commented] (MESOS-1253) Make HTTP endpoint browsable

2014-04-29 Thread Tobias Weingartner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984426#comment-13984426
 ] 

Tobias Weingartner commented on MESOS-1253:
---

Not necessarily a help endpoint, but they are the parents of the leaves.  So 
when I'm presented with:
{noformat}
http://host/parent/leaf
{noformat}
And I wish to explore for more/other things, I usually try a level up:
{noformat}
http://host/parent
{noformat}
And see what other options I may have.  A directory of places I can go 
to...

 Make HTTP endpoint browsable
 

 Key: MESOS-1253
 URL: https://issues.apache.org/jira/browse/MESOS-1253
 Project: Mesos
  Issue Type: Bug
  Components: statistics
Reporter: Tobias Weingartner
Priority: Minor

 A number of the paths in the master/slave do not have index pages, making the 
 ability to browse and cut the URL path down harder.  For example, if you're 
 looking at:
 {noformat}
 http://host:port/metrics/snapshot
 {noformat}
 And decide to see what other options there are for metrics, it's not easy to 
 get that by just cutting off the last URL path component.





[jira] [Commented] (MESOS-1253) Make HTTP endpoint browsable

2014-04-28 Thread Tobias Weingartner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983487#comment-13983487
 ] 

Tobias Weingartner commented on MESOS-1253:
---

I respectfully disagree.  :)



[jira] [Created] (MESOS-1258) 0.18.0-rc3: F0427 02:48:30.603756 62192 group.cpp:326] Check failed: state == CONNECTED || state == AUTHENTICATED || state == READY 1

2014-04-28 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1258:
-

 Summary: 0.18.0-rc3: F0427 02:48:30.603756 62192 group.cpp:326] 
Check failed: state == CONNECTED || state == AUTHENTICATED || state == READY 1
 Key: MESOS-1258
 URL: https://issues.apache.org/jira/browse/MESOS-1258
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Tobias Weingartner








[jira] [Created] (MESOS-1253) Make HTTP endpoint browsable

2014-04-27 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1253:
-

 Summary: Make HTTP endpoint browsable
 Key: MESOS-1253
 URL: https://issues.apache.org/jira/browse/MESOS-1253
 Project: Mesos
  Issue Type: Bug
  Components: statistics
Reporter: Tobias Weingartner
Priority: Minor


A number of the paths in the master/slave do not have index pages, making the 
ability to browse and cut the URL path down harder.  For example, if you're 
looking at:
{noformat}
http://host:port/metrics/snapshot
{noformat}
And decide to see what other options there are for metrics, it's not easy to 
get that by just cutting off the last URL path component.
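
A sketch of what an index page could be derived from, assuming the server can enumerate its registered routes: for a given parent path, list its immediate children. The route list in the usage note and the function name are hypothetical.

```python
def index(routes, parent: str) -> list:
    """Return the immediate children of `parent` among the registered routes."""
    base = "/" + parent.strip("/")
    if base != "/":
        base += "/"
    children = set()
    for route in routes:
        if route.startswith(base) and len(route) > len(base):
            # Keep only the next path component below the parent.
            children.add(base + route[len(base):].split("/", 1)[0])
    return sorted(children)
```

With routes such as `/metrics/snapshot` and `/state.json`, `index(routes, "/metrics")` lists `/metrics/snapshot`, and `index(routes, "/")` lists the top-level entries.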





[jira] [Created] (MESOS-1254) JSON endpoints should have url pointers to other locations

2014-04-27 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1254:
-

 Summary: JSON endpoints should have url pointers to other locations
 Key: MESOS-1254
 URL: https://issues.apache.org/jira/browse/MESOS-1254
 Project: Mesos
  Issue Type: Improvement
Reporter: Tobias Weingartner
Priority: Minor


Hitting an endpoint like:
{noformat}
http://host:port/state.json
{noformat}
returns a lot of information, including a list of slaves, etc.  However, if 
you'd like to hit a slave's JSON endpoint, you're basically left with grabbing 
the {{pid}} of the slave you're looking for and interpreting that string in 
order to build the URL of the slave's JSON endpoint.

Having a key - value pair that allows traversing from JSON endpoint to JSON 
endpoint should make this easier and less error prone.
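
This is roughly the string interpretation a client has to do today, assuming a pid of the form {{slave(1)@192.168.33.4:5051}}, together with a sketch of the proposed alternative where the server includes the link itself. The {{state_url}} key and both function names are hypothetical.

```python
def slave_state_url(pid: str) -> str:
    """Turn a pid such as 'slave(1)@192.168.33.4:5051' into a state.json URL."""
    name, sep, address = pid.partition("@")
    if not sep or not name or not address:
        raise ValueError(f"unexpected pid format: {pid!r}")
    return f"http://{address}/{name}/state.json"

def add_links(slave: dict) -> dict:
    """Return a copy of a slave entry with a hypothetical 'state_url' key added."""
    return {**slave, "state_url": slave_state_url(slave["pid"])}
```

If the master emitted such a key directly, clients could follow it without knowing how pids are structured.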





[jira] [Updated] (MESOS-1255) Master UI should show Mesos version

2014-04-27 Thread Tobias Weingartner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Weingartner updated MESOS-1255:
--

Description: 
The master UI should show the Mesos version on the main UI screen.

Would be awesome if there was a tab with the ability to browse software 
versions and/or attributes/resources in a coordinated easy fashion.

IE: being able to visually answer (and via curl to JSON endpoint) how many of 
the slaves and masters are which version of Mesos.  How many have which 
attributes and/or resources.

  was:
The master UI should show the Mesos version on the main UI screen.

Would be awesome if there was a tab with the ability to browse software 
versions and/or attributes/resources in a coordinated easy fashion.

IE: being able to visually answer (and via curl to JSON endpoint) how many of 
the slaves are which version of Mesos.  How many have which attributes and/or 
resources.


 Master UI should show Mesos version
 ---

 Key: MESOS-1255
 URL: https://issues.apache.org/jira/browse/MESOS-1255
 Project: Mesos
  Issue Type: Improvement
Reporter: Tobias Weingartner
Priority: Trivial

 The master UI should show the Mesos version on the main UI screen.
 Would be awesome if there was a tab with the ability to browse software 
 versions and/or attributes/resources in a coordinated easy fashion.
 IE: being able to visually answer (and via curl to JSON endpoint) how many of 
 the slaves and masters are which version of Mesos.  How many have which 
 attributes and/or resources.





[jira] [Created] (MESOS-1255) Master UI should show Mesos version

2014-04-27 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1255:
-

 Summary: Master UI should show Mesos version
 Key: MESOS-1255
 URL: https://issues.apache.org/jira/browse/MESOS-1255
 Project: Mesos
  Issue Type: Improvement
Reporter: Tobias Weingartner
Priority: Trivial


The master UI should show the Mesos version on the main UI screen.

Would be awesome if there was a tab with the ability to browse software 
versions and/or attributes/resources in a coordinated easy fashion.

IE: being able to visually answer (and via curl to JSON endpoint) how many of 
the slaves are which version of Mesos.  How many have which attributes and/or 
resources.
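
For the curl/JSON side of this, the aggregation itself is small. The sketch below assumes each slave entry in the master's JSON carries a {{version}} key and an {{attributes}} map; whether it does is exactly what this ticket asks for, so treat both keys as assumptions.

```python
from collections import Counter

def version_counts(slaves) -> Counter:
    """Count slaves per reported version; entries without one count as 'unknown'."""
    return Counter(slave.get("version", "unknown") for slave in slaves)

def attribute_counts(slaves) -> Counter:
    """Count slaves per (attribute name, value) pair."""
    counts = Counter()
    for slave in slaves:
        for name, value in slave.get("attributes", {}).items():
            counts[(name, value)] += 1
    return counts
```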





[jira] [Created] (MESOS-1225) Allow definition/use of shared resources

2014-04-19 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1225:
-

 Summary: Allow definition/use of shared resources
 Key: MESOS-1225
 URL: https://issues.apache.org/jira/browse/MESOS-1225
 Project: Mesos
  Issue Type: Improvement
  Components: allocation, containerization, framework, isolation, slave
Reporter: Tobias Weingartner
Priority: Minor


It would be nice to be able to define a set of shared resources for a set of 
slaves (such as IP addresses, power, rack bandwidth, etc) that would be managed 
by the master/slaves, and exported to the frameworks for their use.





[jira] [Created] (MESOS-1164) URL encoded urls do not work in slave

2014-03-31 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1164:
-

 Summary: URL encoded urls do not work in slave
 Key: MESOS-1164
 URL: https://issues.apache.org/jira/browse/MESOS-1164
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Tobias Weingartner
Priority: Minor


As shown here:
{noformat}
MiniMe:incubator-aurora weingart$ curl -X HEAD -sI 'http://192.168.33.4:5051/slave%281%29/state.json'
HTTP/1.1 404 Not Found
Date: Mon, 31 Mar 2014 06:17:38 GMT
Content-Length: 0

MiniMe:incubator-aurora weingart$ curl -X HEAD -sI 'http://192.168.33.4:5051/slave(1)/state.json'
HTTP/1.1 200 OK
Date: Mon, 31 Mar 2014 06:17:50 GMT
Content-Length: 8015
Content-Type: application/json

{noformat}
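
The two requests differ only in percent-encoding: {{%28}} and {{%29}} are the encoded forms of {{(}} and {{)}}. A sketch of the expected behaviour, assuming routes are looked up by path: decode the request path before matching. The route table and function names are hypothetical.

```python
from urllib.parse import unquote

def normalize_path(raw_path: str) -> str:
    """Percent-decode the request path before it is matched against routes."""
    return unquote(raw_path)

def lookup(routes: dict, raw_path: str):
    """Return the handler registered for the decoded path, or None."""
    return routes.get(normalize_path(raw_path))
```

A real router would need more care, for example with an encoded {{%2F}}, which decodes to a path separator.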






[jira] [Commented] (MESOS-780) Adding support for 3rd party performance and health monitoring.

2013-11-05 Thread Tobias Weingartner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814227#comment-13814227
 ] 

Tobias Weingartner commented on MESOS-780:
--

I don't think that pluggable support is a must.  Having an endpoint that you 
can query/scrape should be enough.  There is nothing preventing the running of 
an agent that scrapes these endpoints and then pushes the data (if push is 
wanted) or offers the data up in whatever manner is required by the health 
monitoring present within the infrastructure.

In many ways, I think that the support for 3rd party performance and health 
monitoring is already there.  Certainly there are improvements that can be made 
(exporting more information, etc), but I think that the basic framework is 
present and usable.
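
As an example of such an agent, here is a sketch of the conversion step for Graphite, assuming the scraped endpoint returns a flat JSON object of metric names to numbers. The output uses Graphite's plaintext format, one "path value timestamp" line per metric; the metric names in the usage note and the "mesos" prefix are made up.

```python
def to_graphite(snapshot: dict, prefix: str, timestamp: int) -> list:
    """Convert a flat {name: number} snapshot into Graphite plaintext lines."""
    lines = []
    for name in sorted(snapshot):
        value = snapshot[name]
        # Skip anything that is not a plain number (strings, booleans, nested data).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        path = f"{prefix}.{name.replace('/', '.')}"
        lines.append(f"{path} {value} {timestamp}")
    return lines
```

For a snapshot like `{"master/uptime_secs": 12.5}`, the agent would send `mesos.master.uptime_secs 12.5 <timestamp>` to the Graphite listener; fetching the snapshot and opening the socket are left out of the sketch.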

 Adding support for 3rd party performance and health monitoring.
 ---

 Key: MESOS-780
 URL: https://issues.apache.org/jira/browse/MESOS-780
 Project: Mesos
  Issue Type: Improvement
  Components: framework
Reporter: Bernardo Gomez Palacio

 User Story:
 As a SysAdmin I should be able to monitor Mesos (Masters and Slaves) with
 3rd party tools such as:
 * [Ganglia|http://ganglia.sourceforge.net/]
 * [Graphite|http://graphite.wikidot.com/]
 * [Nagios|http://www.nagios.org/]
 * [Zabbix|http://www.zabbix.com/]



--
This message was sent by Atlassian JIRA
(v6.1#6144)