[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044782#comment-14044782 ]

Tobias Weingartner commented on MESOS-1529:
---

Reading point #3 above, I believe you mean =. Otherwise you could wait forever for a ping that will arrive at some point in the future. :)

I think that, in the end, the most robust solution is for the master not to be responsible for initiating/opening any connections to frameworks and/or slaves. If we do this, then staying connected becomes the slave's (framework's) responsibility. For example, using the HTTP CONNECT method, a slave could request direct access to a particular pid endpoint on the master, something like:

{noformat}
CONNECT pid1@master HTTP/1.0
Content-Transfer-Encoding: application/x-mesos-protobuf-v1
Authorization: token=..., ...
{noformat}

With the server responding with (only during connection):

{noformat}
HTTP/1.1 200 Connection established
X-Welcome-Message: Welcome to the cloud
{noformat}

At this point, the connection becomes a pure binary TCP connection, which the master can now use to send protobuf-over-TCP requests, including ping/pong, etc. If multiple pid endpoints are required, they could possibly be multiplexed over this single link. Instead of connecting directly to a particular pid, you could connect to a mux pid, and the messages would then be shunted to the correct pids. Not sure if this makes any sense.

Anyway, I gather this would be a rather large rewrite, and changing protocols in a live system is... well, interesting.

Note: RFC 6455 (WebSocket) might be another option, albeit a much more involved one...
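To make the proposed handshake in the comment above concrete, here is a minimal Python sketch of the slave side. It only builds the CONNECT request and parses the reply; the function names, the token value, and the idea that leftover bytes after the header block already belong to the binary stream are assumptions for illustration, not anything Mesos implements.

```python
from typing import Tuple


def build_connect_request(pid: str, master: str, token: str) -> bytes:
    """Build the CONNECT request sketched in the comment (hypothetical helper)."""
    lines = [
        f"CONNECT {pid}@{master} HTTP/1.0",
        "Content-Transfer-Encoding: application/x-mesos-protobuf-v1",
        f"Authorization: token={token}",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_connect_response(raw: bytes) -> Tuple[int, bytes]:
    """Return (status code, bytes that follow the header block).

    Anything after the blank line is assumed to be the start of the
    binary protobuf stream.
    """
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("incomplete HTTP response header")
    status_line = head.split(b"\r\n", 1)[0].decode("ascii")
    _version, code, *_reason = status_line.split(" ", 2)
    return int(code), rest
```

A caller would send the request over a plain TCP socket, check for a 200 status, and from then on treat the socket as a raw byte stream.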
Handle a network partition between Master and Slave
---

Key: MESOS-1529
URL: https://issues.apache.org/jira/browse/MESOS-1529
Project: Mesos
Issue Type: Bug
Reporter: Dominic Hamon

If a network partition occurs between a Master and Slave, the Master will remove the Slave (as it fails health check) and mark the tasks being run there as LOST. However, the Slave is not aware that it has been removed so the tasks will continue to run. (To clarify a little bit: neither the master nor the slave receives 'exited' event, indicating that the connection between the master and slave is not closed).

There are at least two possible approaches to solving this issue:

1. Introduce a health check from Slave to Master so they have a consistent view of a network partition. We may still see this issue should a one-way connection error occur.
2. Be less aggressive about marking tasks and Slaves as lost. Wait until the Slave reappears and reconcile then. We'd still need to mark Slaves and tasks as potentially lost (zombie state) but maybe the Scheduler can make a more intelligent decision.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043853#comment-14043853 ] Tobias Weingartner commented on MESOS-1529: --- 2) What does an exit event signify? Why would we need to check that it was for a leading master? 3) How is the 75 seconds determined? Does this lock us into a phased upgrade path if this timeout value needs to change? If we get a ping from a non-leading master, we should likely ignore it and not immediately trigger re-registration. IE: let the timeout take effect.
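The suggestion in this comment, to ignore pings from a non-leading master and simply let the timeout run out, could look roughly like the Python sketch below. The class and field names are hypothetical, and the 75-second default is just the figure under discussion in the ticket.

```python
from dataclasses import dataclass


@dataclass
class PingWatchdog:
    """Slave-side bookkeeping for master pings (illustrative only)."""

    leader: str                 # pid of the master this slave believes is leading
    timeout_secs: float = 75.0  # value discussed in the ticket
    last_ping: float = 0.0      # time of the last ping accepted from the leader

    def on_ping(self, sender: str, now: float) -> bool:
        """Record a ping; return True only if it came from the leading master."""
        if sender != self.leader:
            # Ignore it: no re-registration here, the timeout decides.
            return False
        self.last_ping = now
        return True

    def expired(self, now: float) -> bool:
        """True once a full timeout has passed without a ping from the leader."""
        return now - self.last_ping >= self.timeout_secs
```

With this shape, a stray ping from an old master never resets the clock, so re-registration is only ever triggered by `expired()`.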
[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043985#comment-14043985 ]

Tobias Weingartner commented on MESOS-1529:
---

{quote}
An exited event signifies that a link between slave -- master is broken. This could be due to network partition or master failover. We need to check if it was from the leading master because, before exited event is received by the slave, the slave might have received a new master detected event from zk and re-registered with a new master. In that case, the slave can safely ignore the exited event.
{quote}

This sounds like it would be a race, given that multiple masters may be connected to a slave while a master fail-over is happening.

{quote}
| Does this lock us into a phased upgrade path if this timeout value needs to change?
I don't see why it would lock us into an upgrade path.
{quote}

What I meant here was: what happens if the operator decides that a 75s delay is too long, or too short, and it needs to be changed in a running cluster? At that point, deploying the change looks more involved, possibly requiring the coordination of thousands of machines. If the option is not surfaced to the operator (no flags, etc.), then if/when this single static number changes (for example, becoming adaptive based on the number of slaves), the modification will likely require a lot of planning and prep.

I see this as having a constant in two places without one informing the other what the constant should be. When it changes in one place, it can have a detrimental effect on the rest of the system. Say a new master release is going to go with 150s pings due to load issues: if the masters roll before all the slaves have rolled to the new code, they'll end up flapping, etc.
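One way to avoid the "constant in two places" problem raised in this comment is for the master to tell the slave which ping parameters it uses. The Python sketch below assumes a hypothetical `ping_policy` field in the registration reply, and assumes 75s is split as 15s times 5 missed pings; neither is taken from the ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PingPolicy:
    interval_secs: float  # how often the master pings
    max_missed: int       # missed pings tolerated before giving up

    @property
    def timeout_secs(self) -> float:
        return self.interval_secs * self.max_missed


# Built-in fallback; the 15s x 5 split of 75s is an assumption.
DEFAULT_POLICY = PingPolicy(interval_secs=15.0, max_missed=5)


def policy_from_registration(reply: dict) -> PingPolicy:
    """Prefer the values the master advertises; fall back to the default.

    `reply["ping_policy"]` is a hypothetical field for this sketch.
    """
    advertised = reply.get("ping_policy")
    if advertised is None:
        return DEFAULT_POLICY
    return PingPolicy(
        interval_secs=float(advertised["interval_secs"]),
        max_missed=int(advertised["max_missed"]),
    )
```

Under this scheme a master that moves to 150s pings would announce it on registration, so slaves still running older defaults would not flap during a rolling upgrade.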
[jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042284#comment-14042284 ] Tobias Weingartner commented on MESOS-1529: --- I don't think that #2 is an option. We may be able to add extra information/messages to let the frameworks know that something has been lost potentially, and allow the frameworks to choose which side of CAP they land on. With the current assumptions and implementation, I believe that modifying #2 would be a mistake.
[jira] [Created] (MESOS-1335) Make state.json information partially accessible as well as via text
Tobias Weingartner created MESOS-1335:
-

Summary: Make state.json information partially accessible as well as via text
Key: MESOS-1335
URL: https://issues.apache.org/jira/browse/MESOS-1335
Project: Mesos
Issue Type: Improvement
Components: master, slave
Reporter: Tobias Weingartner
Priority: Minor

The information returned by {{http://localhost:5051/slave(1)/state.json}} is rather voluminous, especially if you're trying to do something simple like finding out which version of a slave is running.

Possible improvement: allow addressing portions of the endpoint:

{noformat}
curl -s 'localhost:5051/slave(1)/state/version.json'
curl -s 'localhost:5051/slave(1)/state/attributes.txt'
{noformat}

The above would return something like:

{noformat}
{"version": "0.18.0"}
{noformat}

{noformat}
/attributes/host some-hostname
/attributes/rack some-other-rack
/attributes/attr-name attr-value
{noformat}

Possibly an interim solution to the volume of data would be to pull certain information out into another endpoint (something like stats.json, maybe version.json or environment.json?). In particular, the keys I'd be looking for would be:

{noformat}
attributes
flags
build_*
hostname
id
log_dir
master_hostname
pid
resources
start_time
version
{noformat}

--
This message was sent by Atlassian JIRA (v6.2#6252)
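As a rough illustration of the partial addressing and text output proposed in this ticket, the Python sketch below selects a sub-tree of an already-parsed state document by path and renders it as `path value` lines. The function names are made up for this example, and lists are simply stringified.

```python
from typing import Any, Iterator, Tuple


def select(state: dict, path: str) -> Any:
    """Walk a '/'-separated path into the parsed state document."""
    node: Any = state
    for key in filter(None, path.split("/")):
        node = node[key]
    return node


def flatten(node: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, value) pairs for every leaf under `node`."""
    if isinstance(node, dict):
        for key in sorted(node):
            yield from flatten(node[key], f"{prefix}/{key}")
    else:
        yield prefix, str(node)


def as_text(state: dict, path: str) -> str:
    """Render the selected sub-tree in the proposed '.txt' style."""
    base = "/" + path.strip("/")
    return "\n".join(f"{p} {v}" for p, v in flatten(select(state, path), base))
```

For example, `as_text(state, "attributes")` on a state document with `host` and `rack` attributes produces one `/attributes/... value` line per attribute.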
[jira] [Commented] (MESOS-1253) Make HTTP endpoint browsable
[ https://issues.apache.org/jira/browse/MESOS-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984426#comment-13984426 ]

Tobias Weingartner commented on MESOS-1253:
---

Not necessarily a help endpoint, but the parent of the leaves. So when I'm presented with:

{noformat}
http://host/parent/leaf
{noformat}

and I wish to explore for more/other things, I usually try a level up:

{noformat}
http://host/parent
{noformat}

and see what other options I may have. A directory of places I can go to...

Make HTTP endpoint browsable
---

Key: MESOS-1253
URL: https://issues.apache.org/jira/browse/MESOS-1253
Project: Mesos
Issue Type: Bug
Components: statistics
Reporter: Tobias Weingartner
Priority: Minor

A number of the paths in the master/slave do not have index pages, which makes it harder to browse by cutting the URL path down. For example, if you're looking at:

{noformat}
http://host:port/metrics/snapshot
{noformat}

and decide to see what other options there are for metrics, it's not easy to get that by just cutting off the last part of the URL path.

--
This message was sent by Atlassian JIRA (v6.2#6252)
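The "directory of places I can go to" idea can be sketched as a function that, given the set of registered route paths, lists the immediate children of a parent path. This is a Python illustration under the assumption that routes are available as plain strings; `/metrics/history` in the usage below is a made-up route.

```python
from typing import Iterable, List


def index_of(routes: Iterable[str], parent: str) -> List[str]:
    """Return the immediate children of `parent` among registered routes."""
    prefix = "/" + parent.strip("/")
    if prefix != "/":
        prefix += "/"
    children = set()
    for route in routes:
        if route.startswith(prefix) and len(route) > len(prefix):
            # Keep only the next path component below the parent.
            children.add(route[len(prefix):].split("/", 1)[0])
    return sorted(children)
```

With routes `/metrics/snapshot` and `/metrics/history` registered, asking for the index of `/metrics` would list both leaves, which is the page a user expects after cutting off the last URL component.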
[jira] [Commented] (MESOS-1253) Make HTTP endpoint browsable
[ https://issues.apache.org/jira/browse/MESOS-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983487#comment-13983487 ] Tobias Weingartner commented on MESOS-1253: --- I respectfully disagree. :)
[jira] [Created] (MESOS-1258) 0.18.0-rc3: F0427 02:48:30.603756 62192 group.cpp:326] Check failed: state == CONNECTED || state == AUTHENTICATED || state == READY 1
Tobias Weingartner created MESOS-1258: - Summary: 0.18.0-rc3: F0427 02:48:30.603756 62192 group.cpp:326] Check failed: state == CONNECTED || state == AUTHENTICATED || state == READY 1 Key: MESOS-1258 URL: https://issues.apache.org/jira/browse/MESOS-1258 Project: Mesos Issue Type: Bug Components: slave Reporter: Tobias Weingartner -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1253) Make HTTP endpoint browsable
Tobias Weingartner created MESOS-1253: - Summary: Make HTTP endpoint browsable Key: MESOS-1253 URL: https://issues.apache.org/jira/browse/MESOS-1253 Project: Mesos Issue Type: Bug Components: statistics Reporter: Tobias Weingartner Priority: Minor A number of the paths in the master/slave do not have index pages, making the ability to browse and cut the URL path down harder. For example, if you're looking at: {noformat} http://host:port/metrics/snapshot {noformat} And decided to see what other options there are for metrics, it's not easy to get that by just cutting out the last URL path part. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1254) JSON endpoints should have url pointers to other locations
Tobias Weingartner created MESOS-1254: - Summary: JSON endpoints should have url pointers to other locations Key: MESOS-1254 URL: https://issues.apache.org/jira/browse/MESOS-1254 Project: Mesos Issue Type: Improvement Reporter: Tobias Weingartner Priority: Minor Hitting an endpoint like: {noformat} http://host:port/state.json {noformat} Has a lot of information, including a list of slaves/etc. However, if you'd like to hit a slave's JSON endpoint, you're basically left with grabbing the {{pid}} of the slave you're looking for, and interpreting that string in order to create a URL where you can now get the slave's JSON endpoint. Having a key - value pair that allows traversing from JSON endpoint to JSON endpoint should make this easier and less error prone. -- This message was sent by Atlassian JIRA (v6.2#6252)
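Until such links exist, a client has to do the pid-to-URL translation itself. The Python sketch below shows that interpretation step, assuming pids of the form `name@host:port` (as in `slave(1)@192.168.33.4:5051`), and then adds a hypothetical `state_url` key to each slave entry to show what a traversable response might carry.

```python
def slave_state_url(pid: str, scheme: str = "http") -> str:
    """Turn a pid such as 'slave(1)@192.168.33.4:5051' into a state.json URL."""
    name, sep, address = pid.partition("@")
    if not sep or not name or not address:
        raise ValueError(f"not a pid of the form name@host:port: {pid!r}")
    return f"{scheme}://{address}/{name}/state.json"


def add_links(state: dict) -> dict:
    """Return a copy of `state` whose slave entries carry a 'state_url' key.

    'state_url' is a made-up key name for this sketch.
    """
    slaves = [
        dict(slave, state_url=slave_state_url(slave["pid"]))
        for slave in state.get("slaves", [])
    ]
    return dict(state, slaves=slaves)
```

If the server emitted such a key itself, clients would no longer need to know the pid format at all.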
[jira] [Updated] (MESOS-1255) Master UI should show Mesos version
[ https://issues.apache.org/jira/browse/MESOS-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tobias Weingartner updated MESOS-1255: -- Description: The master UI should show the Mesos version on the main UI screen. Would be awesome if there was a tab with the ability to browse software versions and/or attributes/resources in a coordinated easy fashion. IE: being able to visually answer (and via curl to JSON endpoint) how many of the slaves and masters are which version of Mesos. How many have which attributes and/or resources. was: The master UI should show the Mesos version on the main UI screen. Would be awesome if there was a tab with the ability to browse software versions and/or attributes/resources in a coordinated easy fashion. IE: being able to visually answer (and via curl to JSON endpoint) how many of the slaves are which version of Mesos. How many have which attributes and/or resources.
[jira] [Created] (MESOS-1255) Master UI should show Mesos version
Tobias Weingartner created MESOS-1255: - Summary: Master UI should show Mesos version Key: MESOS-1255 URL: https://issues.apache.org/jira/browse/MESOS-1255 Project: Mesos Issue Type: Improvement Reporter: Tobias Weingartner Priority: Trivial The master UI should show the Mesos version on the main UI screen. Would be awesome if there was a tab with the ability to browse software versions and/or attributes/resources in a coordinated easy fashion. IE: being able to visually answer (and via curl to JSON endpoint) how many of the slaves are which version of Mesos. How many have which attributes and/or resources. -- This message was sent by Atlassian JIRA (v6.2#6252)
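The "how many of the slaves are which version" question could be answered from a parsed master state document along the lines of the Python sketch below. It assumes each slave entry carries a `version` field, which is an assumption for this example and may not hold for the release discussed here.

```python
from collections import Counter


def versions(state: dict) -> Counter:
    """Count slaves per reported version; entries without one count as 'unknown'."""
    return Counter(
        slave.get("version", "unknown") for slave in state.get("slaves", [])
    )
```

The same pattern would extend to counting slaves per attribute or resource value.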
[jira] [Created] (MESOS-1225) Allow definition/use of shared resources
Tobias Weingartner created MESOS-1225: - Summary: Allow definition/use of shared resources Key: MESOS-1225 URL: https://issues.apache.org/jira/browse/MESOS-1225 Project: Mesos Issue Type: Improvement Components: allocation, containerization, framework, isolation, slave Reporter: Tobias Weingartner Priority: Minor It would be nice to be able to define a set of shared resources for a set of slaves (such as IP addresses, power, rack bandwidth, etc) that would be managed by the master/slaves, and exported to the frameworks for their use. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1164) URL encoded urls do not work in slave
Tobias Weingartner created MESOS-1164:
-

Summary: URL encoded urls do not work in slave
Key: MESOS-1164
URL: https://issues.apache.org/jira/browse/MESOS-1164
Project: Mesos
Issue Type: Bug
Components: slave
Reporter: Tobias Weingartner
Priority: Minor

As shown here:

{noformat}
MiniMe:incubator-aurora weingart$ curl -X HEAD -sI 'http://192.168.33.4:5051/slave%281%29/state.json'
HTTP/1.1 404 Not Found
Date: Mon, 31 Mar 2014 06:17:38 GMT
Content-Length: 0

MiniMe:incubator-aurora weingart$ curl -X HEAD -sI 'http://192.168.33.4:5051/slave(1)/state.json'
HTTP/1.1 200 OK
Date: Mon, 31 Mar 2014 06:17:50 GMT
Content-Length: 8015
Content-Type: application/json
{noformat}

--
This message was sent by Atlassian JIRA (v6.2#6252)
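The two requests above differ only in percent-encoding: `%28` and `%29` are the encoded forms of `(` and `)`. A Python sketch of the expected behaviour is to decode each path segment before route lookup; the `ROUTES` set and function names below are illustrative and not how the slave's HTTP layer is actually structured.

```python
from typing import Tuple
from urllib.parse import unquote


def path_segments(raw_path: str) -> Tuple[str, ...]:
    """Split on '/' first, then percent-decode each segment."""
    return tuple(unquote(segment) for segment in raw_path.split("/") if segment)


# Illustrative route table keyed by decoded segments.
ROUTES = {path_segments("/slave(1)/state.json")}


def is_known(raw_path: str) -> bool:
    """True if the request path matches a route after decoding."""
    return path_segments(raw_path) in ROUTES
```

Splitting before decoding keeps an encoded `%2F` inside a segment from being mistaken for a path separator.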
[jira] [Commented] (MESOS-780) Adding support for 3rd party performance and health monitoring.
[ https://issues.apache.org/jira/browse/MESOS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814227#comment-13814227 ]

Tobias Weingartner commented on MESOS-780:
--

I don't think that pluggable support is a must. Having an endpoint that you can query/scrape should be enough. There is nothing preventing you from running an agent that scrapes these endpoints and then either pushes the data (if push is wanted) or offers it up in whatever form the health monitoring present in the infrastructure requires.

In many ways, I think that the support for 3rd party performance and health monitoring is already there. Certainly there are improvements that can be made (exporting more information, etc.), but I think that the basic framework is present and usable.

Adding support for 3rd party performance and health monitoring.
---

Key: MESOS-780
URL: https://issues.apache.org/jira/browse/MESOS-780
Project: Mesos
Issue Type: Improvement
Components: framework
Reporter: Bernardo Gomez Palacio

User Story: As a SysAdmin I should be able to monitor Mesos (Masters and Slaves) with 3rd party tools such as:

* [Ganglia|http://ganglia.sourceforge.net/]
* [Graphite|http://graphite.wikidot.com/]
* [Nagios|http://www.nagios.org/]
* [Zabbix|http://www.zabbix.com/]

--
This message was sent by Atlassian JIRA (v6.1#6144)
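The scrape-then-push agent described in the comment can be very small. The Python sketch below covers only the conversion step, turning an already-fetched flat metrics document into Graphite plaintext lines (`path value timestamp`); fetching the endpoint and writing to a Graphite socket are left out, and the metric names and `mesos` prefix in the usage are illustrative.

```python
from typing import Dict, List


def to_graphite_lines(
    snapshot: Dict[str, float], prefix: str, timestamp: int
) -> List[str]:
    """Format a flat {name: value} snapshot as Graphite plaintext lines."""
    lines = []
    for name in sorted(snapshot):
        # Graphite uses '.' as its hierarchy separator.
        path = f"{prefix}.{name.replace('/', '.')}"
        lines.append(f"{path} {snapshot[name]} {timestamp}")
    return lines
```

An agent would call this on each scrape and send the newline-joined result to the Graphite plaintext listener; equivalent adapters for Ganglia, Nagios, or Zabbix would replace only this formatting step.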