[jira] [Commented] (MESOS-7085) Consider reducing processing of DECLINE calls log from info to debug
[ https://issues.apache.org/jira/browse/MESOS-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858431#comment-15858431 ] Steven Schlansker commented on MESOS-7085: -- More evidence of confusion over this in the ecosystem: https://github.com/mesosphere/marathon/issues/1917 > Consider reducing processing of DECLINE calls log from info to debug > > > Key: MESOS-7085 > URL: https://issues.apache.org/jira/browse/MESOS-7085 > Project: Mesos > Issue Type: Improvement > Components: master >Affects Versions: 1.0.1 >Reporter: Steven Schlansker > > The Mesos master gets resource decline messages as a normal matter of course. > It repeatedly logs the offers declined from schedulers. This is critical > diagnostics information, but unless your scheduler is broken or buggy, > usually uninteresting. > In our production environment this ended up being a significant fraction of > all logging. One of our operators got paged: > > Checking to see what I can delete. > > 90% of the 1.6GB mesos log file is taken up by these ( + we are also > > outputting this to syslog ) : > > I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for > > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework > > Singularity (Singularity) at > > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > > I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for > > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework > > Singularity (Singularity) at > > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > > I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for > > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework > > Singularity (Singularity) at > > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > ➢ wc -l > mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 > 6796024 > mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 > ➢ grep -c DECLINE > mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 > 5846770 > It seems that this line looks scary ("DECLINE" is a scary word to an > operator), is a huge percentage of log output, and is part of normal > operation. > Should it be reduced to DEBUG? Or could Mesos print it out in a time based > manner? ("654 offers declined in last 1 minute") -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7085) Consider reducing processing of DECLINE calls log from info to debug
[ https://issues.apache.org/jira/browse/MESOS-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-7085: - Description: The Mesos master gets resource decline messages as a normal matter of course. It repeatedly logs the offers declined from schedulers. This is critical diagnostics information, but unless your scheduler is broken or buggy, usually uninteresting. In our production environment this ended up being a significant fraction of all logging. One of our operators got paged: > Checking to see what I can delete. > 90% of the 1.6GB mesos log file is taken up by these ( + we are also > outputting this to syslog ) : > I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 ➢ wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 ➢ grep -c DECLINE mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 5846770 It seems that this line looks scary ("DECLINE" is a scary word to an operator), is a huge percentage of log output, and is part of normal operation. Should it be reduced to DEBUG? Or could Mesos print it out in a time based manner? ("654 offers declined in last 1 minute") was: The Mesos master gets resource decline messages as a normal matter of course. It repeatedly logs the offers declined from schedulers. This is critical diagnostics information, but unless your scheduler is broken or buggy, usually uninteresting. In our production environment this ended up being a significant fraction of all logging. One of our operators got paged: > Checking to see what I can delete. 
> 90% of the 1.6GB mesos log file is taken up by these ( + we are also > outputting this to syslog ) : > I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 ➢ wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 ➢ grep -c DECLINE mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 5846770 It seems that this line looks scary ("DECLINE") is a scary word to an operator, is a huge percentage of log output, and is part of normal operation. Should it be reduced to DEBUG? Or could Mesos print it out in a time based manner? ("654 offers declined in last 1 minute") > Consider reducing processing of DECLINE calls log from info to debug > > > Key: MESOS-7085 > URL: https://issues.apache.org/jira/browse/MESOS-7085 > Project: Mesos > Issue Type: Improvement > Components: master >Affects Versions: 1.0.1 >Reporter: Steven Schlansker > > The Mesos master gets resource decline messages as a normal matter of course. > It repeatedly logs the offers declined from schedulers. This is critical > diagnostics information, but unless your scheduler is broken or buggy, > usually uninteresting. > In our production environment this ended up being a significant fraction of > all logging. One of our operators got paged: > > Checking to see what I can delete. > > 90% of the 1.6GB mesos log file is taken up by these ( + we are also > > outputting this to syslog ) : > > I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for > > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework > > Singularity (Singularity) at
[jira] [Created] (MESOS-7085) Consider reducing processing of DECLINE calls log from info to debug
Steven Schlansker created MESOS-7085: Summary: Consider reducing processing of DECLINE calls log from info to debug Key: MESOS-7085 URL: https://issues.apache.org/jira/browse/MESOS-7085 Project: Mesos Issue Type: Improvement Components: master Affects Versions: 1.0.1 Reporter: Steven Schlansker The Mesos master gets resource decline messages as a normal matter of course. It repeatedly logs the offers declined from schedulers. This is critical diagnostics information, but unless your scheduler is broken or buggy, usually uninteresting. In our production environment this ended up being a significant fraction of all logging. One of our operators got paged: > Checking to see what I can delete. > 90% of the 1.6GB mesos log file is taken up by these ( + we are also > outputting this to syslog ) : > I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 > I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework > Singularity (Singularity) at > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844 ➢ wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 ➢ grep -c DECLINE mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812 5846770 It seems that this line looks scary ("DECLINE" is a scary word to an operator), is a huge percentage of log output, and is part of normal operation. Should it be reduced to DEBUG? Or could Mesos print it out in a time based manner? ("654 offers declined in last 1 minute") -- This message was sent by Atlassian JIRA (v6.3.15#6346)
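The time-based aggregation floated at the end of the report could look roughly like the following minimal sketch (the class and member names are illustrative, not Mesos code; the per-offer line itself could drop to VLOG(1) so it stays available under --v=1):

{code}
// Hypothetical aggregator: count DECLINE calls and emit one INFO summary
// per interval instead of one line per declined offer.
#include <chrono>
#include <cstddef>
#include <glog/logging.h>

class DeclineLogAggregator
{
public:
  void record(size_t offers)
  {
    declined += offers;
    const auto now = std::chrono::steady_clock::now();
    if (now - lastFlush >= std::chrono::minutes(1)) {
      LOG(INFO) << declined << " offers declined in last 1 minute";
      declined = 0;
      lastFlush = now;
    }
  }

private:
  size_t declined = 0;
  std::chrono::steady_clock::time_point lastFlush =
      std::chrono::steady_clock::now();
};
{code}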
[jira] [Created] (MESOS-6066) Operator SUBSCRIBE api should include timestamps
Steven Schlansker created MESOS-6066: Summary: Operator SUBSCRIBE api should include timestamps Key: MESOS-6066 URL: https://issues.apache.org/jira/browse/MESOS-6066 Project: Mesos Issue Type: Bug Components: HTTP API, json api Affects Versions: 1.0.0 Reporter: Steven Schlansker Events coming from the Mesos master are delivered asynchronously. While usually they are processed in a timely fashion, it really scares me that updates do not have a timestamp: {code} 301 { "task_updated": { "agent_id": { "value": "fdbb3ff5-47c2-4b49-a521-b52b9acf74dd-S14" }, "framework_id": { "value": "Singularity" }, "state": "TASK_KILLED", "task_id": { "value": "pp-demoservice-steven.2016.07.05T17.00.06-1471901722511-1-mesos_slave17_qa_uswest2.qasql.opentable.com-us_west_2b" } }, "type": "TASK_UPDATED" } {code} Events should have a timestamp that indicates the time that they happened at, otherwise your timestamps include delivery and processing delays. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
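For illustration, a hedged sketch of what the ticket asks for: stamp each event with wall-clock seconds at generation time so consumers can separate event time from delivery time. The set_timestamp call named in the comment below is hypothetical, not an existing field:

{code}
// Hypothetical: record event-generation time as seconds since the epoch.
#include <chrono>

double nowSeconds()
{
  using namespace std::chrono;
  return duration<double>(system_clock::now().time_since_epoch()).count();
}

// Where the master builds the event, something like
//   event.mutable_task_updated()->set_timestamp(nowSeconds());
// would let subscribers compute delivery/processing lag themselves.
{code}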
[jira] [Commented] (MESOS-4862) Setting failover_timeout in FrameworkInfo to Double.MAX_VALUE causes it to be set to zero
[ https://issues.apache.org/jira/browse/MESOS-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403028#comment-15403028 ] Steven Schlansker commented on MESOS-4862: -- Is this a duplicate of https://issues.apache.org/jira/browse/MESOS-1575 ? > Setting failover_timeout in FrameworkInfo to Double.MAX_VALUE causes it to be > set to zero > - > > Key: MESOS-4862 > URL: https://issues.apache.org/jira/browse/MESOS-4862 > Project: Mesos > Issue Type: Bug > Components: master, stout >Reporter: Timothy Chen > > Currently we expose framework failover_timeout as a double in Proto, and if > users set the failover_timeout to Double.MAX_VALUE, the Master will actually > set it to zero which is the complete opposite of the original intent. > The problem is that in stout/duration.hpp we only store down to the > nanoseconds with int64_t, and it gives an error when we pass double.max as it > goes out of the int64_t bounds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
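A sketch of the failure mode and a range-checked conversion (illustrative, not the actual stout code): seconds-as-double must be checked against int64_t nanosecond bounds before converting, since DBL_MAX * 1e9 is far outside them.

{code}
// Converting seconds (double) to nanoseconds (int64_t) overflows long
// before DBL_MAX, so the conversion must be range-checked instead of
// silently producing a bogus (here: zero) duration.
#include <cstdint>
#include <limits>

bool secondsToNanos(double seconds, int64_t* nanos)
{
  const double scaled = seconds * 1e9;
  if (scaled >= static_cast<double>(std::numeric_limits<int64_t>::max()) ||
      scaled <= static_cast<double>(std::numeric_limits<int64_t>::min())) {
    return false;  // Caller could clamp, or treat as "no failover timeout".
  }
  *nanos = static_cast<int64_t>(scaled);
  return true;
}
{code}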
[jira] [Created] (MESOS-5936) Operator SUBSCRIBE api should provide more task metadata than just state changes
Steven Schlansker created MESOS-5936: Summary: Operator SUBSCRIBE api should provide more task metadata than just state changes Key: MESOS-5936 URL: https://issues.apache.org/jira/browse/MESOS-5936 Project: Mesos Issue Type: Improvement Components: HTTP API, json api Affects Versions: 1.0.0 Reporter: Steven Schlansker I am starting to use the new Operator event subscription API to get notified of task changes. The initial TASK_STAGING event has a good amount of information, but unfortunately future events like TASK_RUNNING include almost no metadata. This means that any task information that cannot be determined until the task launches (in my particular case, the IP address assigned by the Docker containerizer) is not available through the event stream. Here is a gist of a single task that was launched, comparing the information in 'state.json' vs the subscribed events: https://gist.github.com/stevenschlansker/c1d32aa9ce37a73f9c4d64347397d3b8 Note particularly how the IP address never shows in the event stream. Task updates should expose the task information that changed. If this is too onerous, maybe the subscription call can take optional sets of data to include, with the first one being additional task metadata. A possible workaround is to use the task events to fetch 'state.json' separately, but this is inherently racy and totally undermines the utility of the event stream api. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
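The racy workaround from the last paragraph, sketched with transport and JSON parsing elided (the helper names are placeholders):

{code}
// Racy workaround: on a TASK_RUNNING event, poll state.json and join on
// the task ID. By the time the poll lands the task may have changed or
// vanished, which is why the ticket asks for the data in the event itself.
#include <string>

std::string fetchStateJson()  // Placeholder: GET /master/state.json.
{
  return "{}";
}

std::string lookupTaskIp(const std::string& taskId)
{
  const std::string state = fetchStateJson();
  // ... parse 'state', locate tasks[].id == taskId, return its IP ...
  (void)taskId;
  return "";
}
{code}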
[jira] [Commented] (MESOS-3866) The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors
[ https://issues.apache.org/jira/browse/MESOS-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398420#comment-15398420 ] Steven Schlansker commented on MESOS-3866: -- We just independently rediscovered this bug -- since the containerizer sets the variable on the agent side, if you are in a "live upgrade" situation where you upgrade the agent but not the library baked into your containers yet, an otherwise safe upgrade turns into a breaking change! > The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors > --- > > Key: MESOS-3866 > URL: https://issues.apache.org/jira/browse/MESOS-3866 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.25.0 >Reporter: Michael Gummelt > > It's set here: > https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L281 > And passed to the docker executor here: > https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L844 > This leaks the host path of the library into the docker image, which of > course can't see it. This is breaking DCOS Spark, which runs in a docker > image that has set its own value for MESOS_NATIVE_JAVA_LIBRARY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
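A sketch of one obvious guard (an assumption about the shape of a fix, not the committed patch): inject the variable only when the container's environment has not already set it.

{code}
#include <map>
#include <string>

// Only set MESOS_NATIVE_JAVA_LIBRARY when the image has not chosen its own
// value, so a host-side path never overrides the library baked into the
// container.
void maybeSetNativeLibrary(
    std::map<std::string, std::string>& env,
    const std::string& hostLibraryPath)
{
  if (env.count("MESOS_NATIVE_JAVA_LIBRARY") == 0) {
    env["MESOS_NATIVE_JAVA_LIBRARY"] = hostLibraryPath;
  }
}
{code}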
[jira] [Commented] (MESOS-5910) Operator SUBSCRIBE api should provide a method to get all events without requiring 100% uptime
[ https://issues.apache.org/jira/browse/MESOS-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394584#comment-15394584 ] Steven Schlansker commented on MESOS-5910: -- It seems that it actually gives you a current snapshot when you initially subscribe, so perhaps this really is only an issue during master failovers. So this is probably of somewhat lower importance than I thought, although correctly handling master failover without losing events is still desirable. > Operator SUBSCRIBE api should provide a method to get all events without > requiring 100% uptime > -- > > Key: MESOS-5910 > URL: https://issues.apache.org/jira/browse/MESOS-5910 > Project: Mesos > Issue Type: Improvement > Components: HTTP API, json api >Affects Versions: 1.0.0 >Reporter: Steven Schlansker > > The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of > events as they occur. This is going to be extremely useful for monitoring > and management jobs, as they can now have timely information about Mesos's > operation without requiring repeated polling or other ugly solutions. > Unfortunately, the SUBSCRIBE call always returns from the time the call is > made. This means that any consumer cannot reliably subscribe to "all > events"; if the application goes offline (network blip, code upgrade, etc) > all events during that downtime are lost. > You could instead have a cluster of applications receiving the events and > coordinating to deduplicate them to increase reliability, but this pushes a > lot of complexity into clients, and I suspect most users would not do this > correctly and would potentially lose events. > It would be extremely useful for a single client to be able to get a reliable > event stream without requiring a single HTTP connection to be 100% available. > One possible solution is to assign every event an ID. Then, extend the API > to take a "start position" in the log. The API immediately streams out all > events from the start event up until the tail of the log, and then continues > emitting new events as they occur. This provides a reliable way for a > consumer to get "at least once" semantics on events. The caveat is that the > consumer may only be down for as long as the master retains event history, > but this is a much easier pill to swallow. This is similar to etcd's "watch" > api, if you are looking for an actual implementation to reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5910) Operator SUBSCRIBE api should provide a method to get all events without requiring 100% uptime
[ https://issues.apache.org/jira/browse/MESOS-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-5910: - Description: The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of events as they occur. This is going to be extremely useful for monitoring and management jobs, as they can now have timely information about Mesos's operation without requiring repeated polling or other ugly solutions. Unfortunately, the SUBSCRIBE call always returns from the time the call is made. This means that any consumer cannot reliably subscribe to "all events"; if the application goes offline (network blip, code upgrade, etc) all events during that downtime are lost. You could instead have a cluster of applications receiving the events and coordinating to deduplicate them to increase reliability, but this pushes a lot of complexity into clients, and I suspect most users would not do this correctly and would potentially lose events. It would be extremely useful for a single client to be able to get a reliable event stream without requiring a single HTTP connection to be 100% available. One possible solution is to assign every event an ID. Then, extend the API to take a "start position" in the log. The API immediately streams out all events from the start event up until the tail of the log, and then continues emitting new events as they occur. This provides a reliable way for a consumer to get "at least once" semantics on events. The caveat is that the consumer may only be down for as long as the master retains event history, but this is a much easier pill to swallow. This is similar to etcd's "watch" api, if you are looking for an actual implementation to reference. was: The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of events as they occur. This is going to be extremely useful for monitoring and management jobs, as they can now have timely information about Mesos's operation without requiring repeated polling or other ugly solutions. Unfortunately, the SUBSCRIBE call always returns from the time the call is made. This means that any consumer cannot reliably subscribe to "all events"; if the application goes offline (network blip, code upgrade, etc) all events during that downtime are lost. You could instead have a cluster of applications receiving the events and coordinating to deduplicate them to increase reliability, but this pushes a lot of complexity into clients, and I suspect most users would not do this correctly and would potentially lose events. It would be extremely useful for a single client to be able to get a reliable event stream without requiring a single HTTP connection to be 100% available. One possible solution is to assign every event an ID. Then, extend the API to take a "start position" in the log. The API immediately streams out all events from the start event up until the tail of the log, and then continues emitting new events as they occur. This provides a reliable way for a consumer to get "at least once" semantics on events. The caveat is that the consumer may only be down for as long as the master retains event history, but this is a much easier pill to swallow. 
> Operator SUBSCRIBE api should provide a method to get all events without > requiring 100% uptime > -- > > Key: MESOS-5910 > URL: https://issues.apache.org/jira/browse/MESOS-5910 > Project: Mesos > Issue Type: Improvement > Components: HTTP API, json api >Affects Versions: 1.0.0 >Reporter: Steven Schlansker > > The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of > events as they occur. This is going to be extremely useful for monitoring > and management jobs, as they can now have timely information about Mesos's > operation without requiring repeated polling or other ugly solutions. > Unfortunately, the SUBSCRIBE call always returns from the time the call is > made. This means that any consumer cannot reliably subscribe to "all > events"; if the application goes offline (network blip, code upgrade, etc) > all events during that downtime are lost. > You could instead have a cluster of applications receiving the events and > coordinating to deduplicate them to increase reliability, but this pushes a > lot of complexity into clients, and I suspect most users would not do this > correctly and would potentially lose events. > It would be extremely useful for a single client to be able to get a reliable > event stream without requiring a single HTTP connection to be 100% available. > One possible solution is to assign every event an ID. Then, extend the API > to take a "start position" in the log. The API immediately streams out all > ev
[jira] [Created] (MESOS-5910) Operator SUBSCRIBE api should provide a method to get all events without requiring 100% uptime
Steven Schlansker created MESOS-5910: Summary: Operator SUBSCRIBE api should provide a method to get all events without requiring 100% uptime Key: MESOS-5910 URL: https://issues.apache.org/jira/browse/MESOS-5910 Project: Mesos Issue Type: Improvement Components: HTTP API, json api Affects Versions: 1.0.0 Reporter: Steven Schlansker The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of events as they occur. This is going to be extremely useful for monitoring and management jobs, as they can now have timely information about Mesos's operation without requiring repeated polling or other ugly solutions. Unfortunately, the SUBSCRIBE call always returns from the time the call is made. This means that any consumer cannot reliably subscribe to "all events"; if the application goes offline (network blip, code upgrade, etc) all events during that downtime are lost. You could instead have a cluster of applications receiving the events and coordinating to deduplicate them to increase reliability, but this pushes a lot of complexity into clients, and I suspect most users would not do this correctly and would potentially lose events. It would be extremely useful for a single client to be able to get a reliable event stream without requiring a single HTTP connection to be 100% available. One possible solution is to assign every event an ID. Then, extend the API to take a "start position" in the log. The API immediately streams out all events from the start event up until the tail of the log, and then continues emitting new events as they occur. This provides a reliable way for a consumer to get "at least once" semantics on events. The caveat is that the consumer may only be down for as long as the master retains event history, but this is a much easier pill to swallow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
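A consumer-side sketch of the proposal (the numeric event ID and the "start position" parameter are this ticket's suggestion, not an existing Mesos API; the transport is stubbed):

{code}
#include <cstdint>
#include <string>
#include <vector>

struct Event
{
  uint64_t id;  // Proposed: master assigns a monotonically increasing ID.
  std::string body;
};

// Stub: would issue SUBSCRIBE with the proposed start position and return
// replayed history followed by live events, failing on disconnect.
std::vector<Event> subscribeFrom(uint64_t start) { return {}; }

void consume()
{
  uint64_t lastSeen = 0;  // Persist durably across consumer restarts.
  while (true) {
    for (const Event& e : subscribeFrom(lastSeen + 1)) {
      // handle(e) ... at-least-once: duplicates possible after reconnect.
      lastSeen = e.id;
    }
    // Reconnect; events are only lost if the master has already pruned
    // history past lastSeen.
  }
}
{code}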
[jira] [Commented] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON
[ https://issues.apache.org/jira/browse/MESOS-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319175#comment-15319175 ] Steven Schlansker commented on MESOS-4642: -- I really think a documentation "fix" is a bad solution for this issue. This is an API that can be broken solely based on user-controlled data. Allowing a (potentially malicious) "isolated" process to cause Mesos APIs to produce semantically invalid responses is a bad behavior that is worthy of a breaking change IMO. Even if the cluster admin understands the gotcha, end users can break it unwittingly. We never intended to emit non-UTF8 data; that was an accident, so any existing documentation would have helped us understand and recover but could not have prevented this outage. > Mesos Agent Json API can dump binary data from log files out as invalid JSON > > > Key: MESOS-4642 > URL: https://issues.apache.org/jira/browse/MESOS-4642 > Project: Mesos > Issue Type: Bug > Components: json api, slave >Affects Versions: 0.27.0 >Reporter: Steven Schlansker >Priority: Critical > > One of our tasks accidentally started logging binary data to stderr. This > was not intentional and generally should not happen -- however, it causes > severe problems with the Mesos Agent "files/read.json" API, since it gladly > dumps this binary data out as invalid JSON. > {code} > # hexdump -C /path/to/task/stderr | tail > 0003d1f0 6f 6e 6e 65 63 74 69 6f 6e 0a 4e 45 54 3a 20 31 |onnection.NET: 1| > 0003d200 20 6f 6e 72 65 61 64 20 45 4e 4f 45 4e 54 20 32 | onread ENOENT 2| > 0003d210 39 35 34 35 36 20 32 35 31 20 32 39 35 37 30 37 |95456 251 295707| > 0003d220 0a 01 00 00 00 00 00 00 ac 57 65 64 2c 20 31 30 |.Wed, 10| > 0003d230 20 55 6e 72 65 63 6f 67 6e 69 7a 65 64 20 69 6e | Unrecognized in| > 0003d240 70 75 74 20 68 65 61 64 65 72 0a |put header.| > {code} > {code} > # curl > 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep=' > | hexdump -C > 7970 6e 65 63 74 69 6f 6e 5c 6e 4e 45 54 3a 20 31 20 |nection\nNET: 1 | > 7980 6f 6e 72 65 61 64 20 45 4e 4f 45 4e 54 20 32 39 |onread ENOENT 29| > 7990 35 34 35 36 20 32 35 31 20 32 39 35 37 30 37 5c |5456 251 295707\| > 79a0 6e 5c 75 30 30 30 31 5c 75 30 30 30 30 5c 75 30 |n\u0001\u\u0| > 79b0 30 30 30 5c 75 30 30 30 30 5c 75 30 30 30 30 5c |000\u\u\| > 79c0 75 30 30 30 30 5c 75 30 30 30 30 ac 57 65 64 2c |u\u.Wed,| > 79d0 20 31 30 20 55 6e 72 65 63 6f 67 6e 69 7a 65 64 | 10 Unrecognized| > 79e0 20 69 6e 70 75 74 20 68 65 61 64 65 72 5c 6e 22 | input header\n"| > 79f0 2c 22 6f 66 66 73 65 74 22 3a 32 32 30 34 34 33 |,"offset":220443| > 7a00 7d|}| > {code} > This causes downstream sadness: > {code} > ERROR [2016-02-10 18:55:12,303] > io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: > 0ee749630f8b26f1 > ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac > ! at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: > 1, column: 31181] > ! at > com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339) > ~[singularity-0.4.9.jar:0.4.9] > ! 
at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381) > ~[singularity-0.4
[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241456#comment-15241456 ] Steven Schlansker commented on MESOS-1865: -- 301 is supposed to be "permanent" -- whereas the leader will continue to move over time. Would 307 (Temporary Redirect) be more appropriate? > Redirect to the leader master when current master is not a leader > - > > Key: MESOS-1865 > URL: https://issues.apache.org/jira/browse/MESOS-1865 > Project: Mesos > Issue Type: Bug > Components: json api >Affects Versions: 0.20.1 >Reporter: Steven Schlansker >Assignee: haosdent > > Some of the API endpoints, for example /master/tasks.json, will return bogus > information if you query a non-leading master: > {code} > [steven@Anesthetize:~]% curl > http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [ > { > "executor_id": "", > "framework_id": "20140724-231003-419644938-5050-1707-", > "id": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "name": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "resources": { > "cpus": 0.25, > "disk": 0, > {code} > This is very hard for end-users to work around. For example if I query > "which master is leading" followed by "leader: which tasks are running" it is > possible that the leader fails over in between, leaving me with an incorrect > answer and no way to know that this happened. > In my opinion the API should return the correct response (by asking the > current leader?) or an error (500 Not the leader?) but it's unacceptable to > return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
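Sketched, the 307 behavior under discussion would look something like this (illustrative types and helpers, not the libprocess implementation):

{code}
#include <map>
#include <string>

struct HttpRequest { std::string path; };
struct HttpResponse
{
  int status = 200;
  std::map<std::string, std::string> headers;
  std::string body;
};

bool isLeader();               // Assumed: local leadership state.
std::string leaderHostport();  // Assumed: from ZooKeeper election data.
HttpResponse serveState(const HttpRequest& request);

HttpResponse handleStateRequest(const HttpRequest& request)
{
  if (!isLeader()) {
    HttpResponse response;
    response.status = 307;  // Temporary: leadership moves over time.
    response.headers["Location"] =
        "http://" + leaderHostport() + request.path;
    return response;
  }
  return serveState(request);
}
{code}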
[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239607#comment-15239607 ] Steven Schlansker commented on MESOS-1865: -- I also agree that fixing the "common" case by redirecting is okay -- "advanced" users that wish to query non-leading masters can easily ignore the redirect, and having 'curl' work reliably is of huge benefit. > Redirect to the leader master when current master is not a leader > - > > Key: MESOS-1865 > URL: https://issues.apache.org/jira/browse/MESOS-1865 > Project: Mesos > Issue Type: Bug > Components: json api >Affects Versions: 0.20.1 >Reporter: Steven Schlansker >Assignee: haosdent > > Some of the API endpoints, for example /master/tasks.json, will return bogus > information if you query a non-leading master: > {code} > [steven@Anesthetize:~]% curl > http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [ > { > "executor_id": "", > "framework_id": "20140724-231003-419644938-5050-1707-", > "id": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "name": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "resources": { > "cpus": 0.25, > "disk": 0, > {code} > This is very hard for end-users to work around. For example if I query > "which master is leading" followed by "leader: which tasks are running" it is > possible that the leader fails over in between, leaving me with an incorrect > answer and no way to know that this happened. > In my opinion the API should return the correct response (by asking the > current leader?) or an error (500 Not the leader?) but it's unacceptable to > return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239594#comment-15239594 ] Steven Schlansker commented on MESOS-1865: -- This issue came up on the mailing list again -- Guillermo wrote, {noformat} I have an URL mesos-master.mydomain.com pointing to the leader and that works fine because it returns the slave list which I need for my autoscaler. But I'm afraid if the master fails the URL will no longer be valid. So I added the three IPs to the router (AWS Route53) so it would round robin, but of course this will return an empty list sometimes because it hits a follower which returns empty. {noformat} > Redirect to the leader master when current master is not a leader > - > > Key: MESOS-1865 > URL: https://issues.apache.org/jira/browse/MESOS-1865 > Project: Mesos > Issue Type: Bug > Components: json api >Affects Versions: 0.20.1 >Reporter: Steven Schlansker >Assignee: haosdent > > Some of the API endpoints, for example /master/tasks.json, will return bogus > information if you query a non-leading master: > {code} > [steven@Anesthetize:~]% curl > http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [ > { > "executor_id": "", > "framework_id": "20140724-231003-419644938-5050-1707-", > "id": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "name": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "resources": { > "cpus": 0.25, > "disk": 0, > {code} > This is very hard for end-users to work around. For example if I query > "which master is leading" followed by "leader: which tasks are running" it is > possible that the leader fails over in between, leaving me with an incorrect > answer and no way to know that this happened. > In my opinion the API should return the correct response (by asking the > current leader?) or an error (500 Not the leader?) but it's unacceptable to > return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-1865) Redirect to the leader master when current master is not a leader
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239594#comment-15239594 ] Steven Schlansker edited comment on MESOS-1865 at 4/13/16 4:57 PM: --- This issue came up on the mailing list again -- Guillermo wrote, {quote} I have an URL mesos-master.mydomain.com pointing to the leader and that works fine because it returns the slave list which I need for my autoscaler. But I'm afraid if the master fails the URL will no longer be valid. So I added the three IPs to the router (AWS Route53) so it would round robin, but of course this will return an empty list sometimes because it hits a follower which returns empty. {quote} was (Author: stevenschlansker): This issue came up on the mailing list again -- Guillermo wrote, {noformat} I have an URL mesos-master.mydomain.com pointing to the leader and that works fine because it returns the slave list which I need for my autoscaler. But I'm afraid if the master fails the URL will no longer be valid. So I added the three IPs to the router (AWS Route53) so it would round robin, but of course this will return an empty list sometimes because it hits a follower which returns empty. {noformat} > Redirect to the leader master when current master is not a leader > - > > Key: MESOS-1865 > URL: https://issues.apache.org/jira/browse/MESOS-1865 > Project: Mesos > Issue Type: Bug > Components: json api >Affects Versions: 0.20.1 >Reporter: Steven Schlansker >Assignee: haosdent > > Some of the API endpoints, for example /master/tasks.json, will return bogus > information if you query a non-leading master: > {code} > [steven@Anesthetize:~]% curl > http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [ > { > "executor_id": "", > "framework_id": "20140724-231003-419644938-5050-1707-", > "id": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "name": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "resources": { > "cpus": 0.25, > "disk": 0, > {code} > This is very hard for end-users to work around. For example if I query > "which master is leading" followed by "leader: which tasks are running" it is > possible that the leader fails over in between, leaving me with an incorrect > answer and no way to know that this happened. > In my opinion the API should return the correct response (by asking the > current leader?) or an error (500 Not the leader?) but it's unacceptable to > return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5195) Docker executor: task logs lost on shutdown
Steven Schlansker created MESOS-5195: Summary: Docker executor: task logs lost on shutdown Key: MESOS-5195 URL: https://issues.apache.org/jira/browse/MESOS-5195 Project: Mesos Issue Type: Bug Components: containerization, docker Affects Versions: 0.27.2 Environment: Linux 4.4.2 "Ubuntu 14.04.2 LTS" Reporter: Steven Schlansker When you try to kill a task running in the Docker executor (in our case via Singularity), the task shuts down cleanly but the last logs to standard out / standard error are lost in teardown. For example, we run dumb-init. With debugging on, you can see it should write: {noformat} DEBUG("Forwarded signal %d to children.\n", signum); {noformat} If you attach strace to the process, you can see it clearly writes the text to stderr. But that message is lost and never is written to the sandbox 'stderr' file. We believe the issue starts here, in Docker executor.cpp: {code} void shutdown(ExecutorDriver* driver) { cout << "Shutting down" << endl; if (run.isSome() && !killed) { // The docker daemon might still be in progress starting the // container, therefore we kill both the docker run process // and also ask the daemon to stop the container. // Making a mutable copy of the future so we can call discard. Future<Option<int>>(run.get()).discard(); stop = docker->stop(containerName, stopTimeout); killed = true; } } {code} Notice how the "run" future is discarded *before* the Docker daemon is told to stop -- now what will discarding it do? {code} void commandDiscarded(const Subprocess& s, const string& cmd) { VLOG(1) << "'" << cmd << "' is being discarded"; os::killtree(s.pid(), SIGKILL); } {code} Oops, just sent SIGKILL to the entire process tree... You can see another (harmless?) side effect in the Docker daemon logs: it never gets a chance to kill the task: {noformat} ERROR Handler for DELETE /v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233 returned error: No such container: mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233 {noformat} I suspect that the fix is to wait for 'docker->stop()' to complete before discarding the 'run' future. Happy to provide more information if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
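A sketch of the reordering the reporter suspects is needed, reusing the names from the quoted snippet (the continuation plumbing and future types are assumptions, not a tested patch):

{code}
void shutdown(ExecutorDriver* driver)
{
  cout << "Shutting down" << endl;

  if (run.isSome() && !killed) {
    // Ask the daemon to stop the container FIRST so the task gets its
    // grace period and can flush stdout/stderr; only once the stop has
    // completed is it safe to discard 'run', since discarding SIGKILLs
    // the whole process tree via commandDiscarded().
    stop = docker->stop(containerName, stopTimeout);
    stop.onAny([=](const Future<Nothing>&) {
      // Making a mutable copy of the future so we can call discard.
      Future<Option<int>>(run.get()).discard();
    });
    killed = true;
  }
}
{code}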
[jira] [Commented] (MESOS-2840) MesosContainerizer support multiple image provisioners
[ https://issues.apache.org/jira/browse/MESOS-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159389#comment-15159389 ] Steven Schlansker commented on MESOS-2840: -- Hi, I'm sorry I can't comment too much on the Mesos internals, but I just wanted to throw in my two cents as an end user. The Docker daemon is continually a thorn in our side, their "daemon forks the container process" model introduces an unnecessary single point of failure and just generally is not stable enough to be in that position. We are extremely excited to try out this work and look forward to being early adopters and finding all the bugs ;) The design document looks well thought out and seems like an excellent approach. > MesosContainerizer support multiple image provisioners > -- > > Key: MESOS-2840 > URL: https://issues.apache.org/jira/browse/MESOS-2840 > Project: Mesos > Issue Type: Epic > Components: containerization, docker >Affects Versions: 0.23.0 >Reporter: Marco Massenzio >Assignee: Timothy Chen > Labels: mesosphere, twitter > > We want to utilize the Appc integration interfaces to further make > MesosContainerizers to support multiple image formats. > This allows our future work on isolators to support any container image > format. > Design > [open to public comments] > https://docs.google.com/document/d/1oUpJNjJ0l51fxaYut21mKPwJUiAcAdgbdF7SAdAW2PA/edit?usp=sharing > [original document, requires permission] > https://docs.google.com/a/twitter.com/document/d/1Fx5TS0LytV7u5MZExQS0-g-gScX2yKCKQg9UPFzhp6U/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON
[ https://issues.apache.org/jira/browse/MESOS-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151273#comment-15151273 ] Steven Schlansker commented on MESOS-4642: -- It seems that the first option, doing proper JSON encoding, is the correct fix to me. > Mesos Agent Json API can dump binary data from log files out as invalid JSON > > > Key: MESOS-4642 > URL: https://issues.apache.org/jira/browse/MESOS-4642 > Project: Mesos > Issue Type: Bug > Components: json api, slave >Affects Versions: 0.27.0 >Reporter: Steven Schlansker >Priority: Critical > > One of our tasks accidentally started logging binary data to stderr. This > was not intentional and generally should not happen -- however, it causes > severe problems with the Mesos Agent "files/read.json" API, since it gladly > dumps this binary data out as invalid JSON. > {code} > # hexdump -C /path/to/task/stderr | tail > 0003d1f0 6f 6e 6e 65 63 74 69 6f 6e 0a 4e 45 54 3a 20 31 |onnection.NET: 1| > 0003d200 20 6f 6e 72 65 61 64 20 45 4e 4f 45 4e 54 20 32 | onread ENOENT 2| > 0003d210 39 35 34 35 36 20 32 35 31 20 32 39 35 37 30 37 |95456 251 295707| > 0003d220 0a 01 00 00 00 00 00 00 ac 57 65 64 2c 20 31 30 |.Wed, 10| > 0003d230 20 55 6e 72 65 63 6f 67 6e 69 7a 65 64 20 69 6e | Unrecognized in| > 0003d240 70 75 74 20 68 65 61 64 65 72 0a |put header.| > {code} > {code} > # curl > 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep=' > | hexdump -C > 7970 6e 65 63 74 69 6f 6e 5c 6e 4e 45 54 3a 20 31 20 |nection\nNET: 1 | > 7980 6f 6e 72 65 61 64 20 45 4e 4f 45 4e 54 20 32 39 |onread ENOENT 29| > 7990 35 34 35 36 20 32 35 31 20 32 39 35 37 30 37 5c |5456 251 295707\| > 79a0 6e 5c 75 30 30 30 31 5c 75 30 30 30 30 5c 75 30 |n\u0001\u\u0| > 79b0 30 30 30 5c 75 30 30 30 30 5c 75 30 30 30 30 5c |000\u\u\| > 79c0 75 30 30 30 30 5c 75 30 30 30 30 ac 57 65 64 2c |u\u.Wed,| > 79d0 20 31 30 20 55 6e 72 65 63 6f 67 6e 69 7a 65 64 | 10 Unrecognized| > 79e0 20 69 6e 70 75 74 20 68 65 61 64 65 72 5c 6e 22 | input header\n"| > 79f0 2c 22 6f 66 66 73 65 74 22 3a 32 32 30 34 34 33 |,"offset":220443| > 7a00 7d|}| > {code} > This causes downstream sadness: > {code} > ERROR [2016-02-10 18:55:12,303] > io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: > 0ee749630f8b26f1 > ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac > ! at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: > 1, column: 31181] > ! at > com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523) > ~[singularity-0.4.9.jar:0.4.9] > ! 
at > com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1073) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.module.afterburner.deser.SuperSonicBeanDeserializer.deserializeFromObject(SuperSonicBeanDeserializer.java:196) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:142) > ~[singularity-0.4.9.jar:0.4.9] > ! at >
[jira] [Created] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON
Steven Schlansker created MESOS-4642: Summary: Mesos Agent Json API can dump binary data from log files out as invalid JSON Key: MESOS-4642 URL: https://issues.apache.org/jira/browse/MESOS-4642 Project: Mesos Issue Type: Bug Components: json api, slave Affects Versions: 0.27.0 Reporter: Steven Schlansker Priority: Critical One of our tasks accidentally started logging binary data to stderr. This was not intentional and generally should not happen -- however, it causes severe problems with the Mesos Agent "files/read.json" API, since it gladly dumps this binary data out as invalid JSON. {code} # hexdump -C /path/to/task/stderr | tail 0003d1f0 6f 6e 6e 65 63 74 69 6f 6e 0a 4e 45 54 3a 20 31 |onnection.NET: 1| 0003d200 20 6f 6e 72 65 61 64 20 45 4e 4f 45 4e 54 20 32 | onread ENOENT 2| 0003d210 39 35 34 35 36 20 32 35 31 20 32 39 35 37 30 37 |95456 251 295707| 0003d220 0a 01 00 00 00 00 00 00 ac 57 65 64 2c 20 31 30 |.Wed, 10| 0003d230 20 55 6e 72 65 63 6f 67 6e 69 7a 65 64 20 69 6e | Unrecognized in| 0003d240 70 75 74 20 68 65 61 64 65 72 0a |put header.| {code} {code} # curl 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep=' | hexdump -C 7970 6e 65 63 74 69 6f 6e 5c 6e 4e 45 54 3a 20 31 20 |nection\nNET: 1 | 7980 6f 6e 72 65 61 64 20 45 4e 4f 45 4e 54 20 32 39 |onread ENOENT 29| 7990 35 34 35 36 20 32 35 31 20 32 39 35 37 30 37 5c |5456 251 295707\| 79a0 6e 5c 75 30 30 30 31 5c 75 30 30 30 30 5c 75 30 |n\u0001\u\u0| 79b0 30 30 30 5c 75 30 30 30 30 5c 75 30 30 30 30 5c |000\u\u\| 79c0 75 30 30 30 30 5c 75 30 30 30 30 ac 57 65 64 2c |u\u.Wed,| 79d0 20 31 30 20 55 6e 72 65 63 6f 67 6e 69 7a 65 64 | 10 Unrecognized| 79e0 20 69 6e 70 75 74 20 68 65 61 64 65 72 5c 6e 22 | input header\n"| 79f0 2c 22 6f 66 66 73 65 74 22 3a 32 32 30 34 34 33 |,"offset":220443| 7a00 7d|}| {code} This causes downstream sadness: {code} ERROR [2016-02-10 18:55:12,303] io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 0ee749630f8b26f1 ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac ! at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: 1, column: 31181] ! at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523) ~[singularity-0.4.9.jar:0.4.9] ! 
at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1073) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.module.afterburner.deser.SuperSonicBeanDeserializer.deserializeFromObject(SuperSonicBeanDeserializer.java:196) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:142) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.module.afterburner.deser.SuperSonicBeanDeserializer.deserialize(SuperSonicBeanDeserializer.java:117) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3562) ~[singularity-0.4.9.jar:0.4.9] ! at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2648) ~[singularity-0.4.9.jar:0.4.9] ! at com.hubspot.singularity.data.SandboxManager.read(SandboxManager.java:97) ~[singularity-0.4.9.j
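For reference, "proper JSON encoding" in the sense later discussed on this ticket could look like the following sketch (illustrative, not the committed fix): escape every byte outside printable ASCII as \u00XX so the emitted document is always valid JSON, at the cost of treating log bytes as Latin-1 rather than UTF-8. The 0xac byte from the report above would come out as \u00ac.

{code}
#include <cstdio>
#include <string>

// Escape raw log bytes so they can always be embedded in a JSON string.
std::string jsonEscapeBytes(const std::string& bytes)
{
  std::string out;
  for (unsigned char c : bytes) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\t': out += "\\t";  break;
      default:
        if (c < 0x20 || c > 0x7e) {
          char buf[8];
          std::snprintf(buf, sizeof(buf), "\\u%04x", c);  // e.g. \u00ac
          out += buf;
        } else {
          out += static_cast<char>(c);
        }
    }
  }
  return out;
}
{code}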
[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970064#comment-14970064 ] Steven Schlansker commented on MESOS-3771: -- Sounds good to me. I think removing that field from JSON is fine for us. > Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII > handling > --- > > Key: MESOS-3771 > URL: https://issues.apache.org/jira/browse/MESOS-3771 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.1, 0.26.0 >Reporter: Steven Schlansker >Assignee: Joseph Wu >Priority: Critical > Labels: mesosphere > > Spark encodes some binary data into the ExecutorInfo.data field. This field > is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. > If you have such a field, it seems that it is splatted out into JSON without > any regards to proper character encoding: > {code} > 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| > 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| > 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| > 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| > 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| > 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| > {code} > I suspect this is because the HTTP api emits the executorInfo.data directly: > {code} > JSON::Object model(const ExecutorInfo& executorInfo) > { > JSON::Object object; > object.values["executor_id"] = executorInfo.executor_id().value(); > object.values["name"] = executorInfo.name(); > object.values["data"] = executorInfo.data(); > object.values["framework_id"] = executorInfo.framework_id().value(); > object.values["command"] = model(executorInfo.command()); > object.values["resources"] = model(executorInfo.resources()); > return object; > } > {code} > I think this may be because the custom JSON processing library in stout seems > to not have any idea of what a byte array is. I'm guessing that some > implicit conversion makes it get written as a String instead, but: > {code} > inline std::ostream& operator<<(std::ostream& out, const String& string) > { > // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII. > // See RFC4627 for the JSON string specificiation. > return out << picojson::value(string.value).serialize(); > } > {code} > Thank you for any assistance here. Our cluster is currently entirely down -- > the frameworks cannot handle parsing the invalid JSON produced (it is not > even valid utf-8) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
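If the field ever had to be kept, one conventional alternative (an illustration, not what Mesos shipped) is to base64-encode raw protobuf 'bytes' fields before emitting JSON, since base64 output is pure ASCII:

{code}
#include <string>

// Standard base64: the output stays valid inside a JSON string no matter
// what the 'bytes' field contains.
std::string base64(const std::string& in)
{
  static const char table[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  int val = 0;
  int bits = -6;
  for (unsigned char c : in) {
    val = (val << 8) + c;
    bits += 8;
    while (bits >= 0) {
      out += table[(val >> bits) & 0x3f];
      bits -= 6;
    }
  }
  if (bits > -6) {
    out += table[((val << 8) >> (bits + 8)) & 0x3f];
  }
  while (out.size() % 4 != 0) {
    out += '=';
  }
  return out;
}
{code}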
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968622#comment-14968622 ] Steven Schlansker commented on MESOS-2186: -- That's a bummer. Thank you, everyone, for looking and for your time. > Mesos crashes if any configured zookeeper does not resolve. > --- > > Key: MESOS-2186 > URL: https://issues.apache.org/jira/browse/MESOS-2186 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.21.0, 0.26.0 > Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6 > Mesos: 0.21.0-1.0.centos65 > CentOS: CentOS release 6.6 (Final) >Reporter: Daniel Hall >Priority: Critical > Labels: mesosphere > > When starting Mesos, if one of the configured zookeeper servers does not > resolve in DNS, Mesos will crash and refuse to start. We noticed this issue > while we were rebuilding one of our zookeeper hosts in Google compute (which > bases its DNS entries on the machines currently running). > Here is a log from a failed startup (hostnames and IP addresses have been > sanitised). > {noformat} > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 > 28627 main.cpp:292] Starting Mesos master > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 > 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422
28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 > 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, > zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 > 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 > (mesosmaster-2.internal) started on 10.x.x.x:5050 > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 > 28640 master.cpp:366] Master allowing unauthenticated frameworks to register > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 > 28640 master.cpp:371] Master allowing unauthenticated slaves to register
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968143#comment-14968143 ] Steven Schlansker commented on MESOS-2186: -- Maybe this will end up being too hard to fix, since it seems to be a limitation of the ZK C API. It's just surprising from an end user perspective that a single name failing to resolve (even when two are still happy) causes such a disruptive failure.
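The limitation discussed above amounts to an all-or-nothing resolution pass: the client resolves every host in the connection string up front, and a single getaddrinfo() failure aborts the whole initialization. A minimal C++ sketch of that failure mode (illustrative only; this is not the actual ZooKeeper client source):
{code}
#include <netdb.h>

#include <sstream>
#include <string>

// Resolve every "host:port" entry in a comma-separated connection
// string, failing wholesale on the first unresolvable name -- the
// same all-or-nothing behavior behind the ZOO_ERROR@getaddrs lines
// in the logs above.
bool resolveAll(const std::string& hosts)
{
  std::stringstream stream(hosts);
  std::string hostport;
  while (std::getline(stream, hostport, ',')) {
    const std::string host = hostport.substr(0, hostport.find(':'));
    addrinfo* result = nullptr;
    if (getaddrinfo(host.c_str(), nullptr, nullptr, &result) != 0) {
      return false; // One bad name fails the entire set.
    }
    freeaddrinfo(result);
  }
  return true;
}
{code}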
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968128#comment-14968128 ] Steven Schlansker commented on MESOS-2186: -- This is true in the case where the DNS resolution failure is temporary. If it is not temporary, you are still SOL. Imagine $JUNIOR_ADMIN removes one of the ZooKeeper nodes from DNS. You may then have an inoperable Mesos cluster for a long time if you have aggressive DNS caching, even though a ZK quorum is still up and alive.
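One hypothetical operator-side mitigation for the scenario above (a sketch only; nothing Mesos ships, and the helper name is made up) is to drop unresolvable members from the connection string before handing it to the client, so a stale DNS entry cannot take the master down while a quorum is still reachable:
{code}
#include <netdb.h>

#include <sstream>
#include <string>

// Keep only the "host:port" entries whose hostnames still resolve.
// filterResolvable() is a hypothetical helper, not Mesos code.
std::string filterResolvable(const std::string& hosts)
{
  std::stringstream stream(hosts);
  std::string hostport;
  std::string kept;
  while (std::getline(stream, hostport, ',')) {
    const std::string host = hostport.substr(0, hostport.find(':'));
    addrinfo* result = nullptr;
    if (getaddrinfo(host.c_str(), nullptr, nullptr, &result) == 0) {
      freeaddrinfo(result);
      kept += kept.empty() ? hostport : "," + hostport;
    }
  }
  return kept; // May be empty; the caller must handle that case.
}
{code}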
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968073#comment-14968073 ] Steven Schlansker commented on MESOS-2186: -- If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is unrelated, yeah?
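For reference, the contract under discussion: zookeeper_init() returns NULL and sets errno when the handle cannot be created, which is what the fatal "zookeeper_init: No such file or directory [2]" lines report (errno 2 is ENOENT). A sketch of a softer policy that retries instead of aborting (illustrative; not the Mesos code path):
{code}
#include <zookeeper/zookeeper.h>

#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <cstring>

static void watcher(
    zhandle_t* zh, int type, int state, const char* path, void* ctx) {}

// Retry zookeeper_init() a bounded number of times instead of
// treating the first NULL return as fatal.
zhandle_t* initWithRetry(const char* hosts, int attempts)
{
  for (int i = 0; i < attempts; i++) {
    zhandle_t* zh =
      zookeeper_init(hosts, watcher, 10000, nullptr, nullptr, 0);
    if (zh != nullptr) {
      return zh;
    }
    fprintf(stderr, "zookeeper_init: %s [%d]\n", strerror(errno), errno);
    sleep(1); // Back off before retrying.
  }
  return nullptr;
}
{code}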
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968072#comment-14968072 ] Steven Schlansker commented on MESOS-2186: -- If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is unrelated, yeah?
[jira] [Issue Comment Deleted] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2186: - Comment: was deleted (was: If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is unrelated, yeah?)
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968056#comment-14968056 ] Steven Schlansker commented on MESOS-2186: -- Well, rgs above called into question whether that is truly the case. Additionally, at least as of now, the "check failure stack trace" is entirely in C++ code, seemingly not in the ZooKeeper library (pure C).
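The "*** Check failure stack trace: ***" banner is indeed emitted by glog on the C++ side when a CHECK fails or a FATAL message is logged; the ZooKeeper C library only prints the ZOO_ERROR lines. A minimal sketch of the pattern that produces such a crash (illustrative; not the actual zookeeper.cpp source):
{code}
#include <glog/logging.h>
#include <zookeeper/zookeeper.h>

#include <cerrno>
#include <cstring>

static void watcher(
    zhandle_t* zh, int type, int state, const char* path, void* ctx) {}

void initialize(const char* servers)
{
  zhandle_t* zh =
    zookeeper_init(servers, watcher, 10000, nullptr, nullptr, 0);

  // On a NULL handle this logs at FATAL severity, prints the
  // "Check failure stack trace" banner, and aborts the process.
  CHECK(zh != nullptr)
    << "Failed to create ZooKeeper, zookeeper_init: "
    << strerror(errno) << " [" << errno << "]";
}
{code}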
[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ] Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:51 PM: I am still able to easily reproduce this, even with master built from today: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/wat/json.info_00' in ZooKeeper I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master (UPID=master@127.0.1.1:5050) is detected I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master! {code} Three configured ZK members, all is OK. Change one to be an unresolvable hostname -- two are still alive and correct though, so this should be recoverable: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group 2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** 2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec5779ed _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x7f9bec577b04 
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec577986 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x7f9be828ea40 (unknown) @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9be7aab182 start_thread @ 0x7f9be77d847d (unknown) Aborted (core dumped) {code} [~rgs] I am very sorry if this does not end up being a ZK problem at all; I am no C++ expert. I fully admit the linked ZK bug may not be the root cause. But Mesos is still trivial to crash if one of the ZK members is not valid (even if a quorum is).
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968027#comment-14968027 ] Steven Schlansker commented on MESOS-2186: -- I reopened the ticket since it is still a crasher in master. I hope that is appropriate; I apologize in advance if not. Not trying to be a stick in the mud, but this compromises the "high availability" of Mesos, which is a critical piece of infrastructure.
[jira] [Updated] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2186: - Affects Version/s: 0.26.0
[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ] Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:47 PM: I am still able to easily reproduce this, even with master built from today: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/wat/json.info_00' in ZooKeeper I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master (UPID=master@127.0.1.1:5050) is detected I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master! {code} Three configured ZK members, all is OK. Change one to be an unresolvable hostname -- two are still alive and correct though, so this should be recoverable: {code} $ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group 2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** 2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec5779ed _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x7f9bec577b04 
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec577986 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x7f9be828ea40 (unknown) @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9be7aab182 start_thread @ 0x7f9be77d847d (unknown) Aborted (core dumped) {code} [~rgs] I am very sorry if this does not end up being a ZK problem at all; I am no C++ expert. But Mesos is still trivial to crash if one of the ZK members is not valid (even if two are).
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ] Steven Schlansker commented on MESOS-2186: -- I am still able to easily reproduce this, even with master built from today: {code} ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat {code} Three configured ZK members, all is OK. Change one to be an unresolvable hostname -- two are still alive and correct though, so this should be recoverable: {code} ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group 2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** 2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2] *** Check failure stack trace: *** @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec6044c2 google::LogMessage::Fail() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec60440e google::LogMessage::SendToLog() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603e10 google::LogMessage::Flush() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec603c25 google::LogMessage::~LogMessage() @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec00b825 ZooKeeperProcess::initialize() @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec57053d process::ProcessManager::resume() @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f9bec5779ed _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f9bec577986 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x7f9be828ea40 (unknown) @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 
0x7f9be7aab182 start_thread @ 0x7f9be77d847d (unknown) Aborted (core dumped) {code} [~rgs] I am very sorry if this does not end up being a ZK problem at all; I am no C++ expert. But Mesos is still trivial to crash if one of the ZK members is not valid (even if two are).
[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967962#comment-14967962 ] Steven Schlansker commented on MESOS-3771: -- Okay, I have distilled down the reproduction case. Using the Python test-framework with the following diff applied: {code} diff --git a/src/examples/python/test_framework.py b/src/examples/python/test_framework.py index 6af6d22..95abb97 100755 --- a/src/examples/python/test_framework.py +++ b/src/examples/python/test_framework.py @@ -150,6 +150,7 @@ class TestScheduler(mesos.interface.Scheduler): print "but received", self.messagesReceived sys.exit(1) print "All tasks done, and all messages received, exiting" +time.sleep(30) driver.stop() if __name__ == "__main__": @@ -158,6 +159,7 @@ if __name__ == "__main__": sys.exit(1) executor = mesos_pb2.ExecutorInfo() +executor.data = b'\xAC\xED' executor.executor_id.value = "default" executor.command.value = os.path.abspath("./test-executor") executor.name = "Test Executor (Python)" {code} If you run the test framework and, during the 30-second wait after it finishes, try to grab the {{/master/state.json}} endpoint, you will get a response that has invalid UTF-8 in it: {code} Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@54c8158d; line: 1, column: 6432] {code} I tested against both 0.24.1 and current master; both exhibit the bad behavior. > Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII > handling > --- > > Key: MESOS-3771 > URL: https://issues.apache.org/jira/browse/MESOS-3771 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.1, 0.26.0 >Reporter: Steven Schlansker >Priority: Critical > > Spark encodes some binary data into the ExecutorInfo.data field. This field > is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. > If you have such a field, it seems that it is splatted out into JSON without > any regard to proper character encoding: > {code} > 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| > 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| > 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| > 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| > 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| > 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| > {code} > I suspect this is because the HTTP API emits the executorInfo.data directly: > {code} > JSON::Object model(const ExecutorInfo& executorInfo) > { > JSON::Object object; > object.values["executor_id"] = executorInfo.executor_id().value(); > object.values["name"] = executorInfo.name(); > object.values["data"] = executorInfo.data(); > object.values["framework_id"] = executorInfo.framework_id().value(); > object.values["command"] = model(executorInfo.command()); > object.values["resources"] = model(executorInfo.resources()); > return object; > } > {code} > I think this may be because the custom JSON processing library in stout seems > to not have any idea of what a byte array is. I'm guessing that some > implicit conversion makes it get written as a String instead, but: > {code} > inline std::ostream& operator<<(std::ostream& out, const String& string) > { > // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
> // See RFC4627 for the JSON string specificiation. > return out << picojson::value(string.value).serialize(); > } > {code} > Thank you for any assistance here. Our cluster is currently entirely down -- > the frameworks cannot handle parsing the invalid JSON produced (it is not > even valid utf-8) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
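One commonly proposed remedy for bugs of this shape (a sketch only; not what Mesos implements, and it would change the wire format of the "data" field) is to base64-encode the raw bytes before placing them in the JSON document, so the output is always valid UTF-8 regardless of payload:
{code}
#include <string>

// Standard base64 encoding of an arbitrary byte string; e.g. the
// 0xAC 0xED prefix from the report encodes to the plain-ASCII "rO0=".
std::string base64(const std::string& bytes)
{
  static const char table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

  std::string out;
  size_t i = 0;

  // Full 3-byte groups become four 6-bit table lookups.
  while (i + 2 < bytes.size()) {
    const unsigned n = (unsigned char)bytes[i] << 16 |
                       (unsigned char)bytes[i + 1] << 8 |
                       (unsigned char)bytes[i + 2];
    out += table[n >> 18];
    out += table[(n >> 12) & 63];
    out += table[(n >> 6) & 63];
    out += table[n & 63];
    i += 3;
  }

  // A 1- or 2-byte tail is padded with '='.
  if (i + 1 == bytes.size()) {
    const unsigned n = (unsigned char)bytes[i] << 16;
    out += table[n >> 18];
    out += table[(n >> 12) & 63];
    out += "==";
  } else if (i + 2 == bytes.size()) {
    const unsigned n = (unsigned char)bytes[i] << 16 |
                       (unsigned char)bytes[i + 1] << 8;
    out += table[n >> 18];
    out += table[(n >> 12) & 63];
    out += table[(n >> 6) & 63];
    out += '=';
  }
  return out;
}
{code}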
[jira] [Updated] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-3771: - Affects Version/s: 0.26.0 > Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII > handling > --- > > Key: MESOS-3771 > URL: https://issues.apache.org/jira/browse/MESOS-3771 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.1, 0.26.0 >Reporter: Steven Schlansker >Priority: Critical > > Spark encodes some binary data into the ExecutorInfo.data field. This field > is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. > If you have such a field, it seems that it is splatted out into JSON without > any regards to proper character encoding: > {code} > 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| > 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| > 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| > 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| > 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| > 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| > {code} > I suspect this is because the HTTP api emits the executorInfo.data directly: > {code} > JSON::Object model(const ExecutorInfo& executorInfo) > { > JSON::Object object; > object.values["executor_id"] = executorInfo.executor_id().value(); > object.values["name"] = executorInfo.name(); > object.values["data"] = executorInfo.data(); > object.values["framework_id"] = executorInfo.framework_id().value(); > object.values["command"] = model(executorInfo.command()); > object.values["resources"] = model(executorInfo.resources()); > return object; > } > {code} > I think this may be because the custom JSON processing library in stout seems > to not have any idea of what a byte array is. I'm guessing that some > implicit conversion makes it get written as a String instead, but: > {code} > inline std::ostream& operator<<(std::ostream& out, const String& string) > { > // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII. > // See RFC4627 for the JSON string specificiation. > return out << picojson::value(string.value).serialize(); > } > {code} > Thank you for any assistance here. Our cluster is currently entirely down -- > the frameworks cannot handle parsing the invalid JSON produced (it is not > even valid utf-8) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965945#comment-14965945 ] Steven Schlansker commented on MESOS-3771: -- Yeah, we could try to patch Spark. However I'm sure someone else will make exactly the same mistake down the road -- it seems to work as long as you use the protobuf api only. It really seems wrong to just assume that arbitrary bytes are valid UTF-8. Note that ASCII is a real misnomer here; the only things that matter are "arbitrary binary data" (the type of 'data') and "UTF8" (the format that the rendered JSON *must* be in). I don't see anywhere here that ASCII is relevant. Maybe it's possible to escape the 0xACED sequence we see with \uXXXX escapes, but I'm not sure that's possible, as those escapes produce UTF-16 codepoints, not binary data... > Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII > handling > --- > > Key: MESOS-3771 > URL: https://issues.apache.org/jira/browse/MESOS-3771 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.1 >Reporter: Steven Schlansker >Priority: Critical > > Spark encodes some binary data into the ExecutorInfo.data field. This field > is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. > If you have such a field, it seems that it is splatted out into JSON without > any regards to proper character encoding: > {code} > 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| > 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| > 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| > 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| > 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| > 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| > {code} > I suspect this is because the HTTP api emits the executorInfo.data directly: > {code} > JSON::Object model(const ExecutorInfo& executorInfo) > { > JSON::Object object; > object.values["executor_id"] = executorInfo.executor_id().value(); > object.values["name"] = executorInfo.name(); > object.values["data"] = executorInfo.data(); > object.values["framework_id"] = executorInfo.framework_id().value(); > object.values["command"] = model(executorInfo.command()); > object.values["resources"] = model(executorInfo.resources()); > return object; > } > {code} > I think this may be because the custom JSON processing library in stout seems > to not have any idea of what a byte array is. I'm guessing that some > implicit conversion makes it get written as a String instead, but: > {code} > inline std::ostream& operator<<(std::ostream& out, const String& string) > { > // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII. > // See RFC4627 for the JSON string specificiation. > return out << picojson::value(string.value).serialize(); > } > {code} > Thank you for any assistance here. Our cluster is currently entirely down -- > the frameworks cannot handle parsing the invalid JSON produced (it is not > even valid utf-8) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
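As a concrete illustration of the point about \uXXXX escapes (a standalone sketch assuming a UTF-8 execution character set, not Mesos code):
{code}
#include <cassert>
#include <string>

int main()
{
  // The raw byte 0xAC (first byte of Java's serialization magic 0xACED)
  // is an invalid UTF-8 start byte, so it cannot appear verbatim in JSON.
  std::string raw = "\xAC";

  // The escape \u00AC names the *code point* U+00AC; a conforming JSON
  // decoder produces its UTF-8 encoding, the two bytes 0xC2 0xAC --
  // which is not the original single byte.
  std::string decoded = "\u00AC";  // assumes a UTF-8 execution charset

  assert(decoded.size() == 2);
  assert(raw != decoded);  // round-tripping through \uXXXX loses the data
  return 0;
}
{code}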
[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965762#comment-14965762 ] Steven Schlansker commented on MESOS-3771: -- Similar, but potentially unrelated, issue: https://issues.apache.org/jira/browse/MESOS-3284 > Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII > handling > --- > > Key: MESOS-3771 > URL: https://issues.apache.org/jira/browse/MESOS-3771 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.1 >Reporter: Steven Schlansker >Priority: Critical > > Spark encodes some binary data into the ExecutorInfo.data field. This field > is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. > If you have such a field, it seems that it is splatted out into JSON without > any regards to proper character encoding: > {code} > 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| > 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| > 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| > 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| > 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| > 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| > {code} > I suspect this is because the HTTP api emits the executorInfo.data directly: > {code} > JSON::Object model(const ExecutorInfo& executorInfo) > { > JSON::Object object; > object.values["executor_id"] = executorInfo.executor_id().value(); > object.values["name"] = executorInfo.name(); > object.values["data"] = executorInfo.data(); > object.values["framework_id"] = executorInfo.framework_id().value(); > object.values["command"] = model(executorInfo.command()); > object.values["resources"] = model(executorInfo.resources()); > return object; > } > {code} > I think this may be because the custom JSON processing library in stout seems > to not have any idea of what a byte array is. I'm guessing that some > implicit conversion makes it get written as a String instead, but: > {code} > inline std::ostream& operator<<(std::ostream& out, const String& string) > { > // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII. > // See RFC4627 for the JSON string specificiation. > return out << picojson::value(string.value).serialize(); > } > {code} > Thank you for any assistance here. Our cluster is currently entirely down -- > the frameworks cannot handle parsing the invalid JSON produced (it is not > even valid utf-8) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
Steven Schlansker created MESOS-3771: Summary: Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling Key: MESOS-3771 URL: https://issues.apache.org/jira/browse/MESOS-3771 Project: Mesos Issue Type: Bug Components: HTTP API Affects Versions: 0.24.1 Reporter: Steven Schlansker Priority: Critical Spark encodes some binary data into the ExecutorInfo.data field. This field is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. If you have such a field, it seems that it is splatted out into JSON without any regards to proper character encoding: {quote} 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| {quote} I suspect this is because the HTTP api emits the executorInfo.data directly: {code} JSON::Object model(const ExecutorInfo& executorInfo) { JSON::Object object; object.values["executor_id"] = executorInfo.executor_id().value(); object.values["name"] = executorInfo.name(); object.values["data"] = executorInfo.data(); object.values["framework_id"] = executorInfo.framework_id().value(); object.values["command"] = model(executorInfo.command()); object.values["resources"] = model(executorInfo.resources()); return object; } {code} I think this may be because the custom JSON processing library in stout seems to not have any idea of what a byte array is. I'm guessing that some implicit conversion makes it get written as a String instead, but: {code} inline std::ostream& operator<<(std::ostream& out, const String& string) { // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII. // See RFC4627 for the JSON string specificiation. return out << picojson::value(string.value).serialize(); } {code} Thank you for any assistance here. Our cluster is currently entirely down -- the frameworks cannot handle parsing the invalid JSON produced (it is not even valid utf-8) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling
[ https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-3771: - Description: Spark encodes some binary data into the ExecutorInfo.data field. This field is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. If you have such a field, it seems that it is splatted out into JSON without any regards to proper character encoding: {code} 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| {code} I suspect this is because the HTTP api emits the executorInfo.data directly: {code} JSON::Object model(const ExecutorInfo& executorInfo) { JSON::Object object; object.values["executor_id"] = executorInfo.executor_id().value(); object.values["name"] = executorInfo.name(); object.values["data"] = executorInfo.data(); object.values["framework_id"] = executorInfo.framework_id().value(); object.values["command"] = model(executorInfo.command()); object.values["resources"] = model(executorInfo.resources()); return object; } {code} I think this may be because the custom JSON processing library in stout seems to not have any idea of what a byte array is. I'm guessing that some implicit conversion makes it get written as a String instead, but: {code} inline std::ostream& operator<<(std::ostream& out, const String& string) { // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII. // See RFC4627 for the JSON string specificiation. return out << picojson::value(string.value).serialize(); } {code} Thank you for any assistance here. Our cluster is currently entirely down -- the frameworks cannot handle parsing the invalid JSON produced (it is not even valid utf-8) was: Spark encodes some binary data into the ExecutorInfo.data field. This field is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. If you have such a field, it seems that it is splatted out into JSON without any regards to proper character encoding: {quote} 0006b0b0 2e 73 70 61 72 6b 2e 65 78 65 63 75 74 6f 72 2e |.spark.executor.| 0006b0c0 4d 65 73 6f 73 45 78 65 63 75 74 6f 72 42 61 63 |MesosExecutorBac| 0006b0d0 6b 65 6e 64 22 7d 2c 22 64 61 74 61 22 3a 22 ac |kend"},"data":".| 0006b0e0 ed 5c 75 30 30 30 30 5c 75 30 30 30 35 75 72 5c |.\u\u0005ur\| 0006b0f0 75 30 30 30 30 5c 75 30 30 30 66 5b 4c 73 63 61 |u\u000f[Lsca| 0006b100 6c 61 2e 54 75 70 6c 65 32 3b 2e cc 5c 75 30 30 |la.Tuple2;..\u00| {quote} I suspect this is because the HTTP api emits the executorInfo.data directly: {code} JSON::Object model(const ExecutorInfo& executorInfo) { JSON::Object object; object.values["executor_id"] = executorInfo.executor_id().value(); object.values["name"] = executorInfo.name(); object.values["data"] = executorInfo.data(); object.values["framework_id"] = executorInfo.framework_id().value(); object.values["command"] = model(executorInfo.command()); object.values["resources"] = model(executorInfo.resources()); return object; } {code} I think this may be because the custom JSON processing library in stout seems to not have any idea of what a byte array is. 
I'm guessing that some implicit conversion makes it get written as a String instead, but: {code} inline std::ostream& operator<<(std::ostream& out, const String& string) { // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII. // See RFC4627 for the JSON string specificiation. return out << picojson::value(string.value).serialize(); } {code} Thank you for any assistance here. Our cluster is currently entirely down -- the frameworks cannot handle parsing the invalid JSON produced (it is not even valid utf-8) > Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII > handling > --- > > Key: MESOS-3771 > URL: https://issues.apache.org/jira/browse/MESOS-3771 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.1 >Reporter: Steven Schlansker >Priority: Critical > > Spark encodes some binary data into the ExecutorInfo.data field. This field > is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data. > If you have such a field, it seems that it is splatted out into JSON without > any regards to proper character encoding: > {code} > 0006b0b0 2e 73 70 61 7
[jira] [Commented] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
[ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723936#comment-14723936 ] Steven Schlansker commented on MESOS-2684: -- Here is a similar presumed unintentional crasher that another user reported on the mailing list: tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory > mesos-slave should not abort when a single task has e.g. a 'mkdir' failure > -- > > Key: MESOS-2684 > URL: https://issues.apache.org/jira/browse/MESOS-2684 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.21.1 >Reporter: Steven Schlansker > Attachments: mesos-slave-restart.txt > > > mesos-slave can encounter a variety of problems while attempting to launch a > task. If the task fails, that is unfortunate, but not the end of the world. > Other tasks should not be affected. > However, if the task failure happens to trigger an assertion, the entire > slave comes crashing down: > F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left > on device Failed to create executor directory > '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' > Immediately afterwards, all tasks on this slave were declared TASK_KILLED > when mesos-slave restarted. > Something as simple as a 'mkdir' failing is not worthy of an assertion > failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
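For context, {{CHECK_SOME}} aborts the entire process whenever the wrapped operation returns an error. A standalone sketch of the alternative the report asks for -- fail just the one task and keep the slave alive; the function name and signature below are hypothetical, not the Mesos code:
{code}
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

// A hypothetical per-task launch step: on failure, report an error for
// this task instead of aborting the whole slave the way CHECK_SOME does.
bool createExecutorDirectory(const std::string& path, std::string* error)
{
  if (::mkdir(path.c_str(), 0755) != 0 && errno != EEXIST) {
    *error = "Failed to create executor directory '" + path + "': " +
             std::strerror(errno);
    return false;  // caller can mark just this task TASK_FAILED
  }
  return true;
}

int main()
{
  std::string error;
  if (!createExecutorDirectory("/nonexistent-parent/run", &error)) {
    // Log and fail the task; other tasks on the slave are unaffected.
    std::cerr << error << std::endl;
  }
  return 0;
}
{code}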
[jira] [Issue Comment Deleted] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
[ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2684: - Comment: was deleted (was: Here is a similar presumed unintentional crasher that another user reported on the mailing list: tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory ) > mesos-slave should not abort when a single task has e.g. a 'mkdir' failure > -- > > Key: MESOS-2684 > URL: https://issues.apache.org/jira/browse/MESOS-2684 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.21.1 >Reporter: Steven Schlansker > Attachments: mesos-slave-restart.txt > > > mesos-slave can encounter a variety of problems while attempting to launch a > task. If the task fails, that is unfortunate, but not the end of the world. > Other tasks should not be affected. > However, if the task failure happens to trigger an assertion, the entire > slave comes crashing down: > F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left > on device Failed to create executor directory > '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' > Immediately afterwards, all tasks on this slave were declared TASK_KILLED > when mesos-slave restarted. > Something as simple as a 'mkdir' failing is not worthy of an assertion > failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682072#comment-14682072 ] Steven Schlansker edited comment on MESOS-2186 at 8/11/15 4:54 PM: --- I strongly disagree with closing this bug, it is not fixed, and is a serious issue affecting multiple end users. We too have suffered production downtime directly attributable to this issue. The ZOOKEEPER- bug tracks the actual fix, IMO this bug then should track integrating a fixed library into Mesos. was (Author: stevenschlansker): I strongly disagree with closing this bug, it is not fixed, and is a serious issue affecting multiple end users. The ZOOKEEPER- bug tracks the actual fix, IMO this bug then should track integrating a fixed library into Mesos. > Mesos crashes if any configured zookeeper does not resolve. > --- > > Key: MESOS-2186 > URL: https://issues.apache.org/jira/browse/MESOS-2186 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.21.0 > Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6 > Mesos: 0.21.0-1.0.centos65 > CentOS: CentOS release 6.6 (Final) >Reporter: Daniel Hall >Priority: Critical > Labels: mesosphere > > When starting Mesos, if one of the configured zookeeper servers does not > resolve in DNS Mesos will crash and refuse to start. We noticed this issue > while we were rebuilding one of our zookeeper hosts in Google compute (which > bases the DNS on the machines running). > Here is a log from a failed startup (hostnames and ip addresses have been > sanitised). > {noformat} > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 > 28627 main.cpp:292] Starting Mesos master > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 > 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: 
*** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 > 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, > zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682072#comment-14682072 ] Steven Schlansker commented on MESOS-2186: -- I strongly disagree with closing this bug, it is not fixed, and is a serious issue affecting multiple end users. The ZOOKEEPER- bug tracks the actual fix, IMO this bug then should track integrating a fixed library into Mesos. > Mesos crashes if any configured zookeeper does not resolve. > --- > > Key: MESOS-2186 > URL: https://issues.apache.org/jira/browse/MESOS-2186 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.21.0 > Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6 > Mesos: 0.21.0-1.0.centos65 > CentOS: CentOS release 6.6 (Final) >Reporter: Daniel Hall >Priority: Critical > Labels: mesosphere > > When starting Mesos, if one of the configured zookeeper servers does not > resolve in DNS Mesos will crash and refuse to start. We noticed this issue > while we were rebuilding one of our zookeeper hosts in Google compute (which > bases the DNS on the machines running). > Here is a log from a failed startup (hostnames and ip addresses have been > sanitised). > {noformat} > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 > 28627 main.cpp:292] Starting Mesos master > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 > 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 > 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No > such file or directory > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: > Dec 9 22:54:54 mesosmaster-2 
mesos-master[28627]: F1209 22:54:54.097718 > 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such > file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to > create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 > 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, > zookeeper_init: No such file or directory [2] > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack > trace: *** > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 > google::LogMessage::Fail() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 > google::LogMessage::SendToLog() > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 > 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 > (mesosmaster-2.internal) started on 10.x.x.x:5050 > Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 > 28640 master.cpp:366] Master allowing unauthenticated fra
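The fatal line above is a CHECK around {{zookeeper_init()}}, which returns NULL when {{getaddrinfo()}} cannot resolve a configured host. A hedged sketch of retrying instead of aborting, using the real ZooKeeper C client entry point (the timeout and backoff values are arbitrary choices for illustration):
{code}
#include <cstdio>
#include <unistd.h>
#include <zookeeper/zookeeper.h>

// Hypothetical sketch: keep retrying zookeeper_init() when it fails
// (e.g., one configured host temporarily lacks a DNS record) instead of
// treating the failure as fatal, as the CHECK in zookeeper.cpp:113 does.
zhandle_t* initWithRetry(const char* hosts, watcher_fn watcher)
{
  while (true) {
    zhandle_t* zh =
      zookeeper_init(hosts, watcher, 10000, nullptr, nullptr, 0);
    if (zh != nullptr) {
      return zh;  // session handle created; connection proceeds async
    }
    perror("zookeeper_init failed; retrying in 5s");
    sleep(5);
  }
}
{code}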
[jira] [Commented] (MESOS-1375) Log rotation capable
[ https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545915#comment-14545915 ] Steven Schlansker commented on MESOS-1375: -- Sorry to sound frustrated. After digging a little bit, I realize that my issue is as much with Mesosphere's packaging of Mesos as it is the Mesos configuration itself. There are a couple of issues here that all come together to make it very hard to create a "production ready" logrotate setup. * GLOG's log rotation is wacky. It seems to rotate logs in part based on service restarts, so the interval between rotations is extremely irregular. We will have 10 log files created in quick succession if a slave has issues starting up (right now I have 20 files for a single day we had a lot of issues in). Other times, during periods of great stability but high task load, we will end up with a single log file covering most of a month that grows to 10GB * Mesosphere's init scripts do not allow easy customization of GLOG configuration (not that it is very configurable to start with) * Mesosphere's init scripts hardwire stdout / stderr from mesos-master and mesos-slave to go to syslog's user facility, which is overloaded by just about every project that uses syslog My ideal setup honestly would be to pipe process stdout / stderr through something like Apache's 'rotatelogs' command, or to improve the Mesos integration with 'logrotate' so it can signal properly and not need 'copytruncate', which has known race conditions. I tried the logrotate 'hack' linked above and we did not find much success over three or four iterations. It may be possible to get it working nicely, in which case maybe the only change needed is a documentation fix of "This is the official way to get Mesos log rotation to work" along with some user education. Happy to expand on any of these points if that would be helpful. Thanks! > Log rotation capable > > > Key: MESOS-1375 > URL: https://issues.apache.org/jira/browse/MESOS-1375 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.18.0 >Reporter: Damien Hardy > Labels: ops, twitter > > Please provide a way to let ops manage logs. > A log4j like configuration would be hard but make rotation capable without > restarting the service at least. > Based on external logrotate tool would be great : > * write to a constant log file name > * check for file change (recreated by logrotate) before write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
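For readers trying the logrotate route mentioned above, a sketch of what a {{copytruncate}}-based stanza looks like; the path and thresholds are invented for illustration, and this is not an official or endorsed config:
{code}
# Hypothetical /etc/logrotate.d/mesos-master -- illustration only.
/var/log/mesos/mesos-master.INFO {
    size 100M
    rotate 7
    compress
    missingok
    notifempty
    # copytruncate avoids having to signal the process, but it copies the
    # file and then truncates it, so lines written in between are lost --
    # the race condition mentioned in the comment above.
    copytruncate
}
{code}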
[jira] [Comment Edited] (MESOS-1375) Log rotation capable
[ https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533595#comment-14533595 ] Steven Schlansker edited comment on MESOS-1375 at 5/7/15 11:30 PM: --- A year later, and our disks are still filling up due to this issue. This is a really sad problem to have in 2015 :( was (Author: stevenschlansker): A year later, and our disks are still filling up due to this issue. This is a really sad problem to have in 2015 :( Standard GLOG mechanics are probably not good enough until MESOS-2193 gets some attention. > Log rotation capable > > > Key: MESOS-1375 > URL: https://issues.apache.org/jira/browse/MESOS-1375 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.18.0 >Reporter: Damien Hardy > Labels: ops, twitter > > Please provide a way to let ops manage logs. > A log4j like configuration would be hard but make rotation capable without > restarting the service at least. > Based on external logrotate tool would be great : > * write to a constant log file name > * check for file change (recreated by logrotate) before write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1375) Log rotation capable
[ https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533595#comment-14533595 ] Steven Schlansker commented on MESOS-1375: -- A year later, and our disks are still filling up due to this issue. This is a really sad problem to have in 2015 :( Standard GLOG mechanics are probably not good enough until MESOS-2193 gets some attention. > Log rotation capable > > > Key: MESOS-1375 > URL: https://issues.apache.org/jira/browse/MESOS-1375 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.18.0 >Reporter: Damien Hardy > Labels: ops, twitter > > Please provide a way to let ops manage logs. > A log4j like configuration would be hard but make rotation capable without > restarting the service at least. > Based on external logrotate tool would be great : > * write to a constant log file name > * check for file change (recreated by logrotate) before write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
[ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524149#comment-14524149 ] Steven Schlansker commented on MESOS-2684: -- Sorry if I wasn't clear, I mentioned that these files have no non-application output. For completeness: stderr: I0421 21:05:14.850749 13546 exec.cpp:132] Version: 0.21.1 I0421 21:05:14.862670 13559 exec.cpp:206] Executor registered on slave 20150327-194449-419644938-5050-1649-S71 stdout: Registered executor on 10.70.8.160 Starting task pp-request-bookings-teamcity.2015.04.02T15.58.28-1429650229399-2-10.70.8.160-us_west_2b Forked command at 13575 /bin/sh -c exit `docker wait mesos-8d3b46d5-99d6-4994-a7e4-df66aa34ae89` 2015-04-21T21:05:15.954Z, LOGGER ERROR, log client is undefined!, {"@timestamp":"2015-04-21T21:05:15.954Z","servicetype":"requestbookings","logname":"LOGGER ERROR","formatversion":"v1","type":"requestbookings-LOGGER ERROR-v1","host":"10.70.8.160","sequencenumber":1,"message":"log client is undefined!"} 2015-04-21T21:05:15.953Z, salesforceConnection, , {"@timestamp":"2015-04-21T21:05:15.953Z","servicetype":"requestbookings","logname":"salesforceConnection","formatversion":"v1","type":"requestbookings-salesforceConnection-v1","host":"10.70.8.160","sequencenumber":1,"durationMs":226} Connection to redis closed. It will reopen when logs will need sending. Connection to redis closed. It will reopen when logs will need sending. Yeah, those errors are not great, those pesky end users... but I don't see any executor output just the same > mesos-slave should not abort when a single task has e.g. a 'mkdir' failure > -- > > Key: MESOS-2684 > URL: https://issues.apache.org/jira/browse/MESOS-2684 > Project: Mesos > Issue Type: Bug > Components: docker, slave >Affects Versions: 0.21.1 >Reporter: Steven Schlansker > Attachments: mesos-slave-restart.txt > > > mesos-slave can encounter a variety of problems while attempting to launch a > task. If the task fails, that is unfortunate, but not the end of the world. > Other tasks should not be affected. > However, if the task failure happens to trigger an assertion, the entire > slave comes crashing down: > F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left > on device Failed to create executor directory > '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' > Immediately afterwards, all tasks on this slave were declared TASK_KILLED > when mesos-slave restarted. > Something as simple as a 'mkdir' failing is not worthy of an assertion > failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
[ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524121#comment-14524121 ] Steven Schlansker commented on MESOS-2684: -- We are using the built-in Docker executor. The attachment is the complete contents of mesos-slave.INFO.log up through the point in time when new containers start launching to replace the lost ones. I am not aware of separate executor logs, is there somewhere else I should look? The task stderr and stdout do not have any non-application output past executor launch. > mesos-slave should not abort when a single task has e.g. a 'mkdir' failure > -- > > Key: MESOS-2684 > URL: https://issues.apache.org/jira/browse/MESOS-2684 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.21.1 >Reporter: Steven Schlansker > Attachments: mesos-slave-restart.txt > > > mesos-slave can encounter a variety of problems while attempting to launch a > task. If the task fails, that is unfortunate, but not the end of the world. > Other tasks should not be affected. > However, if the task failure happens to trigger an assertion, the entire > slave comes crashing down: > F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left > on device Failed to create executor directory > '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' > Immediately afterwards, all tasks on this slave were declared TASK_KILLED > when mesos-slave restarted. > Something as simple as a 'mkdir' failing is not worthy of an assertion > failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
[ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2684: - Description: mesos-slave can encounter a variety of problems while attempting to launch a task. If the task fails, that is unfortunate, but not the end of the world. Other tasks should not be affected. However, if the task failure happens to trigger an assertion, the entire slave comes crashing down: F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on device Failed to create executor directory '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' Immediately afterwards, all tasks on this slave were declared TASK_KILLED when mesos-slave restarted. Something as simple as a 'mkdir' failing is not worthy of an assertion failure. was: mesos-slave can encounter a variety of problems while attempting to launch a task. If the task fails, that is unfortunate, but not the end of the world. Other tasks should not be affected. However, if the task failure happens to trigger an assertion, the entire slave comes crashing down: F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on device Failed to create executor directory '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' Immediately afterwards, all tasks on this slave were declared TASK_LOST when mesos-slave restarted. Something as simple as a 'mkdir' failing is not worthy of an assertion failure. > mesos-slave should not abort when a single task has e.g. a 'mkdir' failure > -- > > Key: MESOS-2684 > URL: https://issues.apache.org/jira/browse/MESOS-2684 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.21.1 >Reporter: Steven Schlansker > Attachments: mesos-slave-restart.txt > > > mesos-slave can encounter a variety of problems while attempting to launch a > task. If the task fails, that is unfortunate, but not the end of the world. > Other tasks should not be affected. > However, if the task failure happens to trigger an assertion, the entire > slave comes crashing down: > F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left > on device Failed to create executor directory > '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' > Immediately afterwards, all tasks on this slave were declared TASK_KILLED > when mesos-slave restarted. > Something as simple as a 'mkdir' failing is not worthy of an assertion > failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
[ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524006#comment-14524006 ] Steven Schlansker edited comment on MESOS-2684 at 5/1/15 9:35 PM: -- I've attached the log from slave restart. The FATAL error above was the last line written before the abort, this is the head of the new log file created on restart. I misspoke about LOST, it was actually KILLED. was (Author: stevenschlansker): I've attached the log from slave restart. The FATAL error above was the last line written before the abort, this is the head of the new log file created on restart. > mesos-slave should not abort when a single task has e.g. a 'mkdir' failure > -- > > Key: MESOS-2684 > URL: https://issues.apache.org/jira/browse/MESOS-2684 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.21.1 >Reporter: Steven Schlansker > Attachments: mesos-slave-restart.txt > > > mesos-slave can encounter a variety of problems while attempting to launch a > task. If the task fails, that is unfortunate, but not the end of the world. > Other tasks should not be affected. > However, if the task failure happens to trigger an assertion, the entire > slave comes crashing down: > F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left > on device Failed to create executor directory > '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' > Immediately afterwards, all tasks on this slave were declared TASK_LOST when > mesos-slave restarted. > Something as simple as a 'mkdir' failing is not worthy of an assertion > failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
[ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2684: - Attachment: mesos-slave-restart.txt I've attached the log from slave restart. The FATAL error above was the last line written before the abort, this is the head of the new log file created on restart. > mesos-slave should not abort when a single task has e.g. a 'mkdir' failure > -- > > Key: MESOS-2684 > URL: https://issues.apache.org/jira/browse/MESOS-2684 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.21.1 >Reporter: Steven Schlansker > Attachments: mesos-slave-restart.txt > > > mesos-slave can encounter a variety of problems while attempting to launch a > task. If the task fails, that is unfortunate, but not the end of the world. > Other tasks should not be affected. > However, if the task failure happens to trigger an assertion, the entire > slave comes crashing down: > F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left > on device Failed to create executor directory > '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' > Immediately afterwards, all tasks on this slave were declared TASK_LOST when > mesos-slave restarted. > Something as simple as a 'mkdir' failing is not worthy of an assertion > failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
Steven Schlansker created MESOS-2684: Summary: mesos-slave should not abort when a single task has e.g. a 'mkdir' failure Key: MESOS-2684 URL: https://issues.apache.org/jira/browse/MESOS-2684 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.21.1 Reporter: Steven Schlansker mesos-slave can encounter a variety of problems while attempting to launch a task. If the task fails, that is unfortunate, but not the end of the world. Other tasks should not be affected. However, if the task failure happens to trigger an assertion, the entire slave comes crashing down: F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on device Failed to create executor directory '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01' Immediately afterwards, all tasks on this slave were declared TASK_LOST when mesos-slave restarted. Something as simple as a 'mkdir' failing is not worthy of an assertion failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1865) Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494506#comment-14494506 ] Steven Schlansker commented on MESOS-1865: -- Yes. Or it should return the correct results. Really, it should do just about anything rather than returning a valid but incorrect result. > Mesos APIs for non-leading masters should return copies of the leader's state > or an error, not a success with incorrect information > --- > > Key: MESOS-1865 > URL: https://issues.apache.org/jira/browse/MESOS-1865 > Project: Mesos > Issue Type: Bug > Components: json api >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > Some of the API endpoints, for example /master/tasks.json, will return bogus > information if you query a non-leading master: > {code} > [steven@Anesthetize:~]% curl > http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [ > { > "executor_id": "", > "framework_id": "20140724-231003-419644938-5050-1707-", > "id": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "name": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "resources": { > "cpus": 0.25, > "disk": 0, > {code} > This is very hard for end-users to work around. For example if I query > "which master is leading" followed by "leader: which tasks are running" it is > possible that the leader fails over in between, leaving me with an incorrect > answer and no way to know that this happened. > In my opinion the API should return the correct response (by asking the > current leader?) or an error (500 Not the leader?) but it's unacceptable to > return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377941#comment-14377941 ] Steven Schlansker commented on MESOS-2162: -- There is a library linked above which may be a good starting point. Alternatively, if you want to start off simple, you could consider instead executing the 'rkt' command line tool rather than building it into the executor directly. I believe this is how Docker works now. It may actually be very straightforward to clone the Docker approach and replace all Docker options with their Rocket equivalents. > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: gsoc2015, mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
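A bare-bones illustration of the 'execute the rkt CLI' idea, in the spirit of how the Docker containerizer shells out to the docker binary; the rkt arguments and image name below are assumptions for illustration, not a tested invocation:
{code}
#include <cstdio>
#include <stdexcept>
#include <string>

// Run a command line and capture its combined output, the way a
// CLI-wrapping containerizer would drive an external tool.
std::string runCommand(const std::string& cmd)
{
  FILE* pipe = ::popen(cmd.c_str(), "r");
  if (pipe == nullptr) {
    throw std::runtime_error("popen failed for: " + cmd);
  }
  std::string output;
  char buffer[256];
  while (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
    output += buffer;
  }
  ::pclose(pipe);
  return output;
}

int main()
{
  // Hypothetical launch, analogous to `docker run`; a real containerizer
  // would also track the container and translate exit status.
  std::string out = runCommand("rkt run example.com/app:1.0 2>&1");
  std::printf("%s\n", out.c_str());
  return 0;
}
{code}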
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295983#comment-14295983 ] Steven Schlansker commented on MESOS-2162: -- I would love to help out in any way I can, but I am not much of a C++ guy. But at the very least I would happily test it, or if you have other suggestions for how I can help... > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295981#comment-14295981 ] Steven Schlansker commented on MESOS-2162: -- I would love to help out in any way I can, but I am not much of a C++ guy. But at the very least I would happily test it, or if you have other suggestions for how I can help... > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2162: - Comment: was deleted (was: I would love to help out in any way I can, but I am not much of a C++ guy. But at the very least I would happily test it, or if you have other suggestions for how I can help...) > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295944#comment-14295944 ] Steven Schlansker commented on MESOS-2162: -- This library may be a good starting point: https://github.com/cdaylward/libappc/ > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295939#comment-14295939 ] Steven Schlansker commented on MESOS-2162: -- Any possibility of getting this scheduled for an upcoming release? > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis
[ https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277742#comment-14277742 ] Steven Schlansker commented on MESOS-1949: -- Yes, it'd be good enough for this specific case. But this has been a pattern and I'm sure we'll find more cases as we go along :) > All log messages from master, slave, executor, etc. should be collected on a > per-task basis > --- > > Key: MESOS-1949 > URL: https://issues.apache.org/jira/browse/MESOS-1949 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > Currently through a task's lifecycle, various debugging information is > created at different layers of the Mesos ecosystem. The framework will log > task information, the master deals with resource allocation, the slave > actually allocates those resources, and the executor does the work of > launching the task. > If anything through that pipeline fails, the end user is left with little but > a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful > information (for example a "Docker pull failed because repository didn't > exist") is hidden in one of four or five different places, potentially spread > across as many different machines. This leads to unpleasant and repetitive > searching through logs looking for a clue to what went wrong. > Collating logs on a per-task basis would give the end user a much friendlier > way of figuring out exactly where in this process something went wrong, and > likely much faster resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis
[ https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277399#comment-14277399 ] Steven Schlansker edited comment on MESOS-1949 at 1/14/15 6:33 PM: --- Well, it's not quite as urgent as I thought. But there's still a lot of information that is hidden in log files and is very hard to correlate. For example, I had a task die with {code} I0106 20:08:04.998108 1625 docker.cpp:928] Starting container '78065406-449e-4103-85c1-bbfab09d7372' for task 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' (and executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a') of framework 'Singularity' E0106 20:08:05.221181 1624 slave.cpp:2787] Container '78065406-449e-4103-85c1-bbfab09d7372' for executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed to start: Port [4111] not included in resources E0106 20:08:05.277864 1622 slave.cpp:2882] Termination of executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed: Unknown container: 78065406-449e-4103-85c1-bbfab09d7372 {code} but the "message" field only has "Abnormal executor termination" Whenever something like this happens, application developers come to me -- they don't have the knowledge to trawl through Mesos logs (arguably a developer education problem, but the tools could help much more!). You can find the Mesos slave logs through the UI, but you have to do a lot of correlation yourself -- you have to find the right slave, dig through the messages looking only for the ones relevant to your task, etc. If all of the relevant logs to one task were collected in one place, this would be much easier. Makes sense? was (Author: stevenschlansker): Well, it's not quite as urgent as I thought. But there's still a lot of information that is hidden in log files and is very hard to correlate. For example, I had a task die with {code} I0106 20:08:04.998108 1625 docker.cpp:928] Starting container '78065406-449e-4103-85c1-bbfab09d7372' for task 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' (and executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a') of framework 'Singularity' E0106 20:08:05.221181 1624 slave.cpp:2787] Container '78065406-449e-4103-85c1-bbfab09d7372' for executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed to start: Port [4111] not included in resources E0106 20:08:05.277864 1622 slave.cpp:2882] Termination of executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed: Unknown container: 78065406-449e-4103-85c1-bbfab09d7372 {code} but the "message" field only has "Abnormal executor termination" Whenever something like this happens, application developers come to me -- they don't have any way to see the Mesos slave logs (no login permissions in general). You can find the Mesos slave logs through the UI, but you have to do a lot of correlation yourself -- you have to find the right slave, dig through the messages, etc. If all of the relevant logs to one task were collected in one place, this would be much easier. Makes sense? > All log messages from master, slave, executor, etc. 
should be collected on a > per-task basis > --- > > Key: MESOS-1949 > URL: https://issues.apache.org/jira/browse/MESOS-1949 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > Currently through a task's lifecycle, various debugging information is > created at different layers of the Mesos ecosystem. The framework will log > task information, the master deals with resource allocation, the slave > actually allocates those resources, and the executor does the work of > launching the task. > If anything through that pipeline fails, the end user is left with little but > a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful > information (for example a "Docker pull failed because repository didn't > exist") is hidden in one of four or five different places, potentially spread > across as many different machines. This leads to unpleasant and repetitive > searching through logs looking for a clue to what went wrong. > Collating logs on a per-task basis would give the end user a much friendlier > way of figuring out exactly where in this process something went wrong, and > likely much faster resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis
[ https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277399#comment-14277399 ] Steven Schlansker commented on MESOS-1949: -- Well, it's not quite as urgent as I thought. But there's still a lot of information that is hidden in log files and is very hard to correlate. For example, I had a task die with {code} I0106 20:08:04.998108 1625 docker.cpp:928] Starting container '78065406-449e-4103-85c1-bbfab09d7372' for task 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' (and executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a') of framework 'Singularity' E0106 20:08:05.221181 1624 slave.cpp:2787] Container '78065406-449e-4103-85c1-bbfab09d7372' for executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed to start: Port [4111] not included in resources E0106 20:08:05.277864 1622 slave.cpp:2882] Termination of executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed: Unknown container: 78065406-449e-4103-85c1-bbfab09d7372 {code} but the "message" field only has "Abnormal executor termination" Whenever something like this happens, application developers come to me -- they don't have any way to see the Mesos slave logs (no login permissions in general). You can find the Mesos slave logs through the UI, but you have to do a lot of correlation yourself -- you have to find the right slave, dig through the messages, etc. If all of the relevant logs to one task were collected in one place, this would be much easier. Makes sense? > All log messages from master, slave, executor, etc. should be collected on a > per-task basis > --- > > Key: MESOS-1949 > URL: https://issues.apache.org/jira/browse/MESOS-1949 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > Currently through a task's lifecycle, various debugging information is > created at different layers of the Mesos ecosystem. The framework will log > task information, the master deals with resource allocation, the slave > actually allocates those resources, and the executor does the work of > launching the task. > If anything through that pipeline fails, the end user is left with little but > a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful > information (for example a "Docker pull failed because repository didn't > exist") is hidden in one of four or five different places, potentially spread > across as many different machines. This leads to unpleasant and repetitive > searching through logs looking for a clue to what went wrong. > Collating logs on a per-task basis would give the end user a much friendlier > way of figuring out exactly where in this process something went wrong, and > likely much faster resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
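Until per-task collation exists, the closest stop-gap is to grep every daemon log on the relevant box for the task ID, since the master, slave, and executor all mention it in their messages. A minimal sketch of that approach; the task ID is taken from the comment above, and the log paths are assumptions that will vary by installation:
{code}
#!/bin/sh
# Crude per-task log collation: pull every master/slave log line that
# mentions one task ID into a single stream. The paths below are
# illustrative -- adjust to wherever glog writes on your hosts.
TASK_ID="ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a"

grep -h "$TASK_ID" \
    /var/log/mesos/mesos-master.INFO \
    /var/log/mesos/mesos-slave.INFO \
    | less
{code}
This still requires finding the right slave first, which is exactly the manual correlation the ticket wants to eliminate.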
[jira] [Updated] (MESOS-2212) Better handling of errors during `docker wait`
[ https://issues.apache.org/jira/browse/MESOS-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2212: - Description: Currently, the Docker containerizer executes an "exit $(docker wait $CONTAINER_NAME)". This misses a couple of edge cases in the 'docker wait' API -- notably, if an OOM condition occurs, it will return "-1" (which is not a valid exit code for sh, causing an error, see https://issues.apache.org/jira/browse/MESOS-2209). If a Docker container OOMs, the 'docker inspect' output will set 'State.OOMKilled' to 'true' and 'docker wait' will return -1. This should be handled more gracefully. In particular, setting the message to indicate that the OOM killer intervened would be very useful, as end users can then know the real reason their task died. {code} "State": { "Error": "", "ExitCode": -1, "FinishedAt": "2015-01-08T18:38:39.834089879Z", "OOMKilled": true, "Paused": false, "Pid": 0, "Restarting": false, "Running": false, "StartedAt": "2015-01-08T18:38:39.309034983Z" } {code} I've filed a bug on Docker as well: https://github.com/docker/docker/issues/9979 was: Currently, the Docker containerizer executes an "exit $(docker wait $CONTAINER_NAME)". This misses a couple of edge cases in the 'docker wait' API -- notably, if an OOM condition occurs, it will return "-1" (which is not a valid exit code for sh, causing an error, see https://issues.apache.org/jira/browse/MESOS-2209). If a Docker container OOMs, the 'docker inspect' output will set 'State.OOMKilled' to 'true' and 'docker wait' will return -1. This should be handled more gracefully. {code} "State": { "Error": "", "ExitCode": -1, "FinishedAt": "2015-01-08T18:38:39.834089879Z", "OOMKilled": true, "Paused": false, "Pid": 0, "Restarting": false, "Running": false, "StartedAt": "2015-01-08T18:38:39.309034983Z" } {code} I've filed a bug on Docker as well: https://github.com/docker/docker/issues/9979 > Better handling of errors during `docker wait` > -- > > Key: MESOS-2212 > URL: https://issues.apache.org/jira/browse/MESOS-2212 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.21.0 >Reporter: Steven Schlansker > > Currently, the Docker containerizer executes an "exit $(docker wait > $CONTAINER_NAME)". This misses a couple of edge cases in the 'docker wait' > API -- notably, if an OOM condition occurs, it will return "-1" (which is not > a valid exit code for sh, causing an error, see > https://issues.apache.org/jira/browse/MESOS-2209). > If a Docker container OOMs, the 'docker inspect' output will set > 'State.OOMKilled' to 'true' and 'docker wait' will return -1. This should be > handled more gracefully. In particular, setting the message to indicate that > the OOM killer intervened would be very useful, as end users can then know the > real reason their task died. > {code} > "State": { > "Error": "", > "ExitCode": -1, > "FinishedAt": "2015-01-08T18:38:39.834089879Z", > "OOMKilled": true, > "Paused": false, > "Pid": 0, > "Restarting": false, > "Running": false, > "StartedAt": "2015-01-08T18:38:39.309034983Z" > } > {code} > I've filed a bug on Docker as well: > https://github.com/docker/docker/issues/9979 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2212) Better handling of errors during `docker wait`
Steven Schlansker created MESOS-2212: Summary: Better handling of errors during `docker wait` Key: MESOS-2212 URL: https://issues.apache.org/jira/browse/MESOS-2212 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.21.0 Reporter: Steven Schlansker Currently, the Docker containerizer executes an "exit $(docker wait $CONTAINER_NAME)". This misses a couple of edge cases in the 'docker wait' API -- notably, if an OOM condition occurs, it will return "-1" (which is not a valid exit code for sh, causing an error, see https://issues.apache.org/jira/browse/MESOS-2209). If a Docker container OOMs, the 'docker inspect' output will set 'State.OOMKilled' to 'true' and 'docker wait' will return -1. This should be handled more gracefully. {code} "State": { "Error": "", "ExitCode": -1, "FinishedAt": "2015-01-08T18:38:39.834089879Z", "OOMKilled": true, "Paused": false, "Pid": 0, "Restarting": false, "Running": false, "StartedAt": "2015-01-08T18:38:39.309034983Z" } {code} I've filed a bug on Docker as well: https://github.com/docker/docker/issues/9979 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
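To make the suggestion concrete: instead of handing the raw 'docker wait' status straight to 'exit', a wrapper can consult 'docker inspect' for the OOM flag and clamp the status into the POSIX range. This is only a sketch of the idea, not the containerizer's actual code; $CONTAINER_NAME stands in for whatever name the containerizer assigns:
{code}
#!/bin/sh
# Wait for the container, but do not trust the raw status blindly.
status=$(docker wait "$CONTAINER_NAME")

# 'docker inspect' exposes State.OOMKilled, so an OOM kill can be reported
# explicitly instead of surfacing only a mysterious -1.
if [ "$(docker inspect --format '{{.State.OOMKilled}}' "$CONTAINER_NAME")" = "true" ]; then
    echo "Container $CONTAINER_NAME was killed by the OOM killer" >&2
fi

# Mask into the 0-255 range POSIX allows, so 'exit' never sees -1
# (the failure mode described in MESOS-2209).
exit $((status & 255))
{code}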
[jira] [Updated] (MESOS-2209) Mesos should not use negative exit codes
[ https://issues.apache.org/jira/browse/MESOS-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2209: - Description: POSIX restricts exit codes to an unsigned 8-bit integer. Mesos has a number of places where it exits with status -1, outside that range. With some shells (notably dash, which is /bin/sh on Debian based systems) this causes an error: {code} /bin/sh: 1: exit: Illegal number: -1 {code} An example of where this is done is in exec.cpp {code} void kill() { VLOG(1) << "Committing suicide by killing the process group"; // TODO(vinod): Invoke killtree without killing ourselves. // Kill the process group (including ourself). killpg(0, SIGKILL); // The signal might not get delivered immediately, so sleep for a // few seconds. Worst case scenario, exit abnormally. os::sleep(Seconds(5)); exit(-1); } {code} The code needs to be audited to always return valid exit codes. was: POSIX restricts exit codes to an unsigned 8-bit integer. Mesos has a number of places where it exits with status -1, outside that range. With some shells (notably dash, which is /bin/sh on Debian based systems) this causes an error: {code} /bin/sh: 1: exit: Illegal number: -1 {code} An example of where this is done is in exec.cpp {code} void kill() { VLOG(1) << "Committing suicide by killing the process group"; // TODO(vinod): Invoke killtree without killing ourselves. // Kill the process group (including ourself). killpg(0, SIGKILL); // The signal might not get delivered immediately, so sleep for a // few seconds. Worst case scenario, exit abnormally. os::sleep(Seconds(5)); exit(-1); } {code} > Mesos should not use negative exit codes > > > Key: MESOS-2209 > URL: https://issues.apache.org/jira/browse/MESOS-2209 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.21.0 >Reporter: Steven Schlansker >Priority: Minor > > POSIX restricts exit codes to an unsigned 8-bit integer. > Mesos has a number of places where it exits with status -1, outside that > range. With some shells (notably dash, which is /bin/sh on Debian based > systems) this causes an error: > {code} > /bin/sh: 1: exit: Illegal number: -1 > {code} > An example of where this is done is in exec.cpp > {code} > void kill() > { > VLOG(1) << "Committing suicide by killing the process group"; > // TODO(vinod): Invoke killtree without killing ourselves. > // Kill the process group (including ourself). > killpg(0, SIGKILL); > // The signal might not get delivered immediately, so sleep for a > // few seconds. Worst case scenario, exit abnormally. > os::sleep(Seconds(5)); > exit(-1); > } > {code} > The code needs to be audited to always return valid exit codes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2209) Mesos should not use negative exit codes
Steven Schlansker created MESOS-2209: Summary: Mesos should not use negative exit codes Key: MESOS-2209 URL: https://issues.apache.org/jira/browse/MESOS-2209 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.21.0 Reporter: Steven Schlansker Priority: Minor POSIX restricts exit codes to an unsigned 8-bit integer. Mesos has a number of places where it exits with status -1, outside that range. With some shells (notably dash, which is /bin/sh on Debian based systems) this causes an error: {code} /bin/sh: 1: exit: Illegal number: -1 {code} An example of where this is done is in exec.cpp {code} void kill() { VLOG(1) << "Committing suicide by killing the process group"; // TODO(vinod): Invoke killtree without killing ourselves. // Kill the process group (including ourself). killpg(0, SIGKILL); // The signal might not get delivered immediately, so sleep for a // few seconds. Worst case scenario, exit abnormally. os::sleep(Seconds(5)); exit(-1); } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
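The failure is easy to reproduce by hand, which is a quick way to verify any audit of the code. A sketch; the first result assumes /bin/sh is dash, as on Debian-derived systems:
{code}
# Where /bin/sh is dash, the out-of-range value is rejected outright,
# matching the error quoted above:
$ /bin/sh -c 'exit -1'
/bin/sh: 1: exit: Illegal number: -1

# bash instead truncates the status to its low 8 bits, which is why the
# bug can go unnoticed on systems where /bin/sh is bash:
$ bash -c 'exit -1'; echo $?
255
{code}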
[jira] [Commented] (MESOS-2024) mesos debian packaging should work on a java 8 install without java7
[ https://issues.apache.org/jira/browse/MESOS-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192288#comment-14192288 ] Steven Schlansker commented on MESOS-2024: -- Hm, this might actually be a Mesosphere issue (Mesos should really take the packaging into the main build!) Maybe this PR addresses the issue? https://github.com/mesosphere/mesos-deb-packaging/pull/24 > mesos debian packaging should work on a java 8 install without java7 > > > Key: MESOS-2024 > URL: https://issues.apache.org/jira/browse/MESOS-2024 > Project: Mesos > Issue Type: Improvement > Components: build >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > The mesos .deb file: > root@myhost:~# dpkg-query -s mesos > Version: 0.20.1-1.0.ubuntu1404 > Depends: java7-runtime-headless | java6-runtime-headless, libcurl3 > Recommends: zookeeper, zookeeperd, zookeeper-bin > We run java8, but installing the mesos package always drags in java7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2024) mesos debian packaging should work on a java 8 install without java7
[ https://issues.apache.org/jira/browse/MESOS-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-2024: - Summary: mesos debian packaging should work on a java 8 install without java7 (was: mesos debian packaging should work on a java 8 install) > mesos debian packaging should work on a java 8 install without java7 > > > Key: MESOS-2024 > URL: https://issues.apache.org/jira/browse/MESOS-2024 > Project: Mesos > Issue Type: Improvement > Components: build >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > The mesos .deb file: > root@myhost:~# dpkg-query -s mesos > Version: 0.20.1-1.0.ubuntu1404 > Depends: java7-runtime-headless | java6-runtime-headless, libcurl3 > Recommends: zookeeper, zookeeperd, zookeeper-bin > We run java8, but installing the mesos package always drags in java7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2024) mesos debian packaging should work on a java 8 install
Steven Schlansker created MESOS-2024: Summary: mesos debian packaging should work on a java 8 install Key: MESOS-2024 URL: https://issues.apache.org/jira/browse/MESOS-2024 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.20.1 Reporter: Steven Schlansker The mesos .deb file: root@myhost:~# dpkg-query -s mesos Version: 0.20.1-1.0.ubuntu1404 Depends: java7-runtime-headless | java6-runtime-headless, libcurl3 Recommends: zookeeper, zookeeperd, zookeeper-bin We run java8, but installing the mesos package always drags in java7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2023) mesos-execute should allow setting environment variables
Steven Schlansker created MESOS-2023: Summary: mesos-execute should allow setting environment variables Key: MESOS-2023 URL: https://issues.apache.org/jira/browse/MESOS-2023 Project: Mesos Issue Type: Improvement Affects Versions: 0.20.1 Reporter: Steven Schlansker mesos-execute does not allow setting various properties of the 'CommandInfo' protobuf. Most notably, being able to set environment variables and URIs would be very useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1976) Sandbox browse UI has path which is not selectable
Steven Schlansker created MESOS-1976: Summary: Sandbox browse UI has path which is not selectable Key: MESOS-1976 URL: https://issues.apache.org/jira/browse/MESOS-1976 Project: Mesos Issue Type: Bug Components: webui Affects Versions: 0.20.1 Reporter: Steven Schlansker Priority: Minor The Sandbox UI displays the path being browsed as a series of links. It is not possible to copy the path from this, it ends up being formatted as e.g. {code} mnt mesos slaves 20141022-230146-2500085258-5050-1554-3 frameworks Singularity executors ci-discovery-singularity-bridge-steven.2014.10.21T21.00.04-1414092693380-2-10-us_west_2a runs 554eebb3-126d-42bd-95c2-aa8282b05522 {code} instead of the expected {code} /mnt/mesos/slaves/20141022-230146-2500085258-5050-1554-3/frameworks/Singularity/executors/ci-discovery-singularity-bridge-steven.2014.10.21T21.00.04-1414092693380-2-10-us_west_2a/runs/554eebb3-126d-42bd-95c2-aa8282b05522 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis
[ https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178899#comment-14178899 ] Steven Schlansker commented on MESOS-1949: -- I did, and it turns out at least in Singularity's case I'd just missed it (hidden on a detail view I don't click often) https://github.com/HubSpot/Singularity/issues/266 > All log messages from master, slave, executor, etc. should be collected on a > per-task basis > --- > > Key: MESOS-1949 > URL: https://issues.apache.org/jira/browse/MESOS-1949 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > Currently through a task's lifecycle, various debugging information is > created at different layers of the Mesos ecosystem. The framework will log > task information, the master deals with resource allocation, the slave > actually allocates those resources, and the executor does the work of > launching the task. > If anything through that pipeline fails, the end user is left with little but > a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful > information (for example a "Docker pull failed because repository didn't > exist") is hidden in one of four or five different places, potentially spread > across as many different machines. This leads to unpleasant and repetitive > searching through logs looking for a clue to what went wrong. > Collating logs on a per-task basis would give the end user a much friendlier > way of figuring out exactly where in this process something went wrong, and > likely much faster resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis
[ https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178614#comment-14178614 ] Steven Schlansker commented on MESOS-1949: -- Interesting. It seems that neither Singularity nor Marathon expose the `message` field of TaskStatus anywhere useful? Would this be possible to expose via the Mesos web interface? Having the message would be great and solve 80% of the use cases here. I still think collating the logs is a useful feature, though, for when something more complex goes wrong. > All log messages from master, slave, executor, etc. should be collected on a > per-task basis > --- > > Key: MESOS-1949 > URL: https://issues.apache.org/jira/browse/MESOS-1949 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > Currently through a task's lifecycle, various debugging information is > created at different layers of the Mesos ecosystem. The framework will log > task information, the master deals with resource allocation, the slave > actually allocates those resources, and the executor does the work of > launching the task. > If anything through that pipeline fails, the end user is left with little but > a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful > information (for example a "Docker pull failed because repository didn't > exist") is hidden in one of four or five different places, potentially spread > across as many different machines. This leads to unpleasant and repetitive > searching through logs looking for a clue to what went wrong. > Collating logs on a per-task basis would give the end user a much friendlier > way of figuring out exactly where in this process something went wrong, and > likely much faster resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis
Steven Schlansker created MESOS-1949: Summary: All log messages from master, slave, executor, etc. should be collected on a per-task basis Key: MESOS-1949 URL: https://issues.apache.org/jira/browse/MESOS-1949 Project: Mesos Issue Type: Improvement Components: master, slave Affects Versions: 0.20.1 Reporter: Steven Schlansker Currently through a task's lifecycle, various debugging information is created at different layers of the Mesos ecosystem. The framework will log task information, the master deals with resource allocation, the slave actually allocates those resources, and the executor does the work of launching the task. If anything through that pipeline fails, the end user is left with little but a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful information (for example a "Docker pull failed because repository didn't exist") is hidden in one of four or five different places, potentially spread across as many different machines. This leads to unpleasant and repetitive searching through logs looking for a clue to what went wrong. Collating logs on a per-task basis would give the end user a much friendlier way of figuring out exactly where in this process something went wrong, and likely much faster resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1865) Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-1865: - Description: Some of the API endpoints, for example /master/tasks.json, will return bogus information if you query a non-leading master: {code} [steven@Anesthetize:~]% curl http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [] } [steven@Anesthetize:~]% curl http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [] } [steven@Anesthetize:~]% curl http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [ { "executor_id": "", "framework_id": "20140724-231003-419644938-5050-1707-", "id": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "name": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "resources": { "cpus": 0.25, "disk": 0, {code} This is very hard for end-users to work around. For example if I query "which master is leading" followed by "leader: which tasks are running" it is possible that the leader fails over in between, leaving me with an incorrect answer and no way to know that this happened. In my opinion the API should return the correct response (by asking the current leader?) or an error (500 Not the leader?) but it's unacceptable to return a successful wrong answer. was: Some of the API endpoints, for example /master/tasks.json, will return bogus information if you query a non-leading master: {code} [steven@Anesthetize:~]% curl http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [] } [steven@Anesthetize:~]% curl http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [] } [steven@Anesthetize:~]% curl http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [ { "executor_id": "", "framework_id": "20140724-231003-419644938-5050-1707-", "id": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "name": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "resources": { "cpus": 0.25, "disk": 0, {code} This is very hard for end-users to work around. For example if I query "which master is leading" followed by "master: which tasks are running" it is possible that the leader fails over in between, leaving me with an incorrect answer and no way to know that this happened. In my opinion the API should return the correct response (by asking the current leader?) or an error (500 Not the leader?) but it's unacceptable to return a successful wrong answer. > Mesos APIs for non-leading masters should return copies of the leader's state > or an error, not a success with incorrect information > --- > > Key: MESOS-1865 > URL: https://issues.apache.org/jira/browse/MESOS-1865 > Project: Mesos > Issue Type: Bug > Components: json api >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > Some of the API endpoints, for example /master/tasks.json, will return bogus > information if you query a non-leading master: > {code} > [steven@Anesthetize:~]% curl > http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . 
| head -n > 10 > { > "tasks": [ > { > "executor_id": "", > "framework_id": "20140724-231003-419644938-5050-1707-", > "id": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "name": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "resources": { > "cpus": 0.25, > "disk": 0, > {code} > This is very hard for end-users to work around. For example if I query > "which master is leading" followed by "leader: which tasks are running" it is > possible that the leader fails over in between, leaving me with an incorrect > answer and no way to know that this happened. > In my opinion the API should return the correct response (by asking the > current leader?) or an error (500 Not the leader?) but it's unacceptable to > return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1865) Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Schlansker updated MESOS-1865: - Description: Some of the API endpoints, for example /master/tasks.json, will return bogus information if you query a non-leading master: {code} [steven@Anesthetize:~]% curl http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [] } [steven@Anesthetize:~]% curl http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [] } [steven@Anesthetize:~]% curl http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { "tasks": [ { "executor_id": "", "framework_id": "20140724-231003-419644938-5050-1707-", "id": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "name": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "resources": { "cpus": 0.25, "disk": 0, {code} This is very hard for end-users to work around. For example if I query "which master is leading" followed by "master: which tasks are running" it is possible that the leader fails over in between, leaving me with an incorrect answer and no way to know that this happened. In my opinion the API should return the correct response (by asking the current leader?) or an error (500 Not the leader?) but it's unacceptable to return a successful wrong answer. was: Some of the API endpoints, for example /master/tasks.json, will return bogus information if you query a non-leading master: {quote} [steven@Anesthetize:~]% curl http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 10012 100120 0 21 0 --:--:-- --:--:-- --:--:--21 { "tasks": [] } [steven@Anesthetize:~]% curl http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 10012 100120 0105 0 --:--:-- --:--:-- --:--:-- 106 { "tasks": [] } [steven@Anesthetize:~]% curl http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 43081 100 430810 0 196k 0 --:--:-- --:--:-- --:--:-- 196k { "tasks": [ { "executor_id": "", "framework_id": "20140724-231003-419644938-5050-1707-", "id": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "name": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "resources": { "cpus": 0.25, "disk": 0, {quote} This is very hard for end-users to work around. For example if I query "which master is leading" followed by "master: which tasks are running" it is possible that the leader fails over in between, leaving me with an incorrect answer and no way to know that this happened. In my opinion the API should return the correct response (by asking the current leader?) or an error (500 Not the leader?) but it's unacceptable to return a successful wrong answer. 
> Mesos APIs for non-leading masters should return copies of the leader's state > or an error, not a success with incorrect information > --- > > Key: MESOS-1865 > URL: https://issues.apache.org/jira/browse/MESOS-1865 > Project: Mesos > Issue Type: Bug > Components: json api >Affects Versions: 0.20.1 >Reporter: Steven Schlansker > > Some of the API endpoints, for example /master/tasks.json, will return bogus > information if you query a non-leading master: > {code} > [steven@Anesthetize:~]% curl > http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [ > { > "executor_id": "", > "framework_id": "20140724-231003-419644938-5050-1707-", > "id": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "name": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "resources": { > "cpus": 0.25, > "disk": 0, > {code} > This is very hard for end-users to work around. For example if I query > "which master is leading" followed by "master: which tasks are running" it is > possible that the leader fails over in between, leaving me with an incorrect > answer and no way to know that this happened. > In my opinion the API should return the correct response (by asking the > current leader?) or an error (500 Not the leader?) but it's unacceptable to > return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1865) Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information
Steven Schlansker created MESOS-1865: Summary: Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information Key: MESOS-1865 URL: https://issues.apache.org/jira/browse/MESOS-1865 Project: Mesos Issue Type: Bug Components: json api Affects Versions: 0.20.1 Reporter: Steven Schlansker Some of the API endpoints, for example /master/tasks.json, will return bogus information if you query a non-leading master: {quote} [steven@Anesthetize:~]% curl http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 10012 100120 0 21 0 --:--:-- --:--:-- --:--:--21 { "tasks": [] } [steven@Anesthetize:~]% curl http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 10012 100120 0105 0 --:--:-- --:--:-- --:--:-- 106 { "tasks": [] } [steven@Anesthetize:~]% curl http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 43081 100 430810 0 196k 0 --:--:-- --:--:-- --:--:-- 196k { "tasks": [ { "executor_id": "", "framework_id": "20140724-231003-419644938-5050-1707-", "id": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "name": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", "resources": { "cpus": 0.25, "disk": 0, {quote} This is very hard for end-users to work around. For example if I query "which master is leading" followed by "master: which tasks are running" it is possible that the leader fails over in between, leaving me with an incorrect answer and no way to know that this happened. In my opinion the API should return the correct response (by asking the current leader?) or an error (500 Not the leader?) but it's unacceptable to return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
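Until the API itself behaves better, clients can at least detect the situation themselves: the master's /master/state.json reports both the pid of the responding master and the leader it currently believes in, so a script can refuse an answer that did not come from the leader. A hedged sketch, reusing a hostname from the report above and assuming the 'pid' and 'leader' fields as they appear in 0.20.x state.json output:
{code}
#!/bin/sh
# Only trust state that comes from the leading master.
MASTER="http://master1.mesos-vpcqa.otenv.com:5050"

state=$(curl -s "$MASTER/master/state.json")
pid=$(echo "$state" | jq -r .pid)
leader=$(echo "$state" | jq -r .leader)

if [ "$pid" = "$leader" ]; then
    # This master is leading; its view of the tasks is authoritative.
    echo "$state" | jq .frameworks
else
    echo "$MASTER is not the leader (leader is $leader); refusing stale answer" >&2
    exit 1
fi
{code}
This narrows the race but does not close it -- the leader can still fail over between the check and a follow-up query, which is why an in-server fix (redirect or 500) is the right answer.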
[jira] [Commented] (MESOS-1755) Add docker support to mesos-execute
[ https://issues.apache.org/jira/browse/MESOS-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120061#comment-14120061 ] Steven Schlansker commented on MESOS-1755: -- Hi Vinod and Timothy, sorry for the confusion. I looped back with Henning and Sean and this commit is indeed unrelated as you suspected. That said, it is a small commit, and I think it is worth including in 0.20.1 even if Singularity does not need it -- it will make testing things somewhat easier and it seems low-risk. > Add docker support to mesos-execute > --- > > Key: MESOS-1755 > URL: https://issues.apache.org/jira/browse/MESOS-1755 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Assignee: Timothy Chen > Fix For: 0.21.0 > > > The fix for this is already committed at https://reviews.apache.org/r/24808/. > I'm creating this ticket to track that this patch gets included in 0.20.1 > release, since apparently Singularity framework depends on this patch to work > with Docker !?!? > https://groups.google.com/forum/#!topic/singularity-users/GzzswbpI92E > [~tnachen]: Can you confirm if this has to be included in 0.20.1? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1554) Persistent resources support for storage-like services
[ https://issues.apache.org/jira/browse/MESOS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120041#comment-14120041 ] Steven Schlansker commented on MESOS-1554: -- It would be nice to be able to manage e.g. Amazon EBS (or generic SAN) volumes in this way. That would be very powerful indeed. > Persistent resources support for storage-like services > -- > > Key: MESOS-1554 > URL: https://issues.apache.org/jira/browse/MESOS-1554 > Project: Mesos > Issue Type: Epic > Components: general, hadoop >Reporter: Nikita Vetoshkin >Priority: Minor > > This question came up in [dev mailing > list|http://mail-archives.apache.org/mod_mbox/mesos-dev/201406.mbox/%3CCAK8jAgNDs9Fe011Sq1jeNr0h%3DE-tDD9rak6hAsap3PqHx1y%3DKQ%40mail.gmail.com%3E]. > It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use > Mesos to manage its instances. But right now if we'd like to restart an > instance (e.g. to spin up a new version) - all previous instance version > sandbox filesystem resources will be recycled by slave's garbage collector. > At the moment filesystem resources can be managed out of band - i.e. > instances can save their data in some database-specific place that various > instances can share (e.g. {{/var/lib/cassandra}}). > [~benjaminhindman] suggested an idea in the mailing list (though it still > needs some fleshing out): > {quote} > The idea originally came about because, even today, if we allocate some > file system space to a task/executor, and then that task/executor > terminates, we haven't officially "freed" those file system resources until > after we garbage collect the task/executor sandbox! (We keep the sandbox > around so a user/operator can get the stdout/stderr or anything else left > around from their task/executor.) > To solve this problem we wanted to be able to let a task/executor terminate > but not *give up* all of its resources, hence: persistent resources. > Pushing this concept even further you could imagine always reallocating > resources to a framework that had already been allocated those resources > for a previous task/executor. Looked at from another perspective, these are > "late-binding", or "lazy", resource reservations. > At one point in time we had considered just doing 'right-of-first-refusal' > for allocations after a task/executor terminate. But this is really > insufficient for supporting storage-like frameworks well (and likely even > harder to reliably implement than 'persistent resources' IMHO). > There are a ton of things that need to get worked out in this model, > including (but not limited to), how should a file system (or disk) be > exposed in order to be made persistent? How should persistent resources be > returned to a master? How many persistent resources can a framework get > allocated? > {quote} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-780) Adding support for 3rd party performance and health monitoring.
[ https://issues.apache.org/jira/browse/MESOS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097763#comment-14097763 ] Steven Schlansker commented on MESOS-780: - I've started a (very basic) Nagios check: https://github.com/opentable/nagios-mesos Hopefully someone finds it useful / contributes improvements :) > Adding support for 3rd party performance and health monitoring. > --- > > Key: MESOS-780 > URL: https://issues.apache.org/jira/browse/MESOS-780 > Project: Mesos > Issue Type: Improvement > Components: framework >Reporter: Bernardo Gomez Palacio > > User Story: > As a SysAdmin I should be able to monitor Mesos (Masters and Slaves) with > 3rd party tools such as: > * [Ganglia|http://ganglia.sourceforge.net/] > * [Graphite|http://graphite.wikidot.com/] > * [Nagios|http://www.nagios.org/] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave
[ https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071950#comment-14071950 ] Steven Schlansker commented on MESOS-1193: -- No, we aren't. Maybe we should be... > Check failed: promises.contains(containerId) crashes slave > -- > > Key: MESOS-1193 > URL: https://issues.apache.org/jira/browse/MESOS-1193 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.18.0 >Reporter: Tobi Knaup > > This was observed with four slaves on one machine, one framework (Marathon) > and around 100 tasks per slave. > I0404 17:58:58.298075 3939 mesos_containerizer.cpp:891] Executor for > container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited > I0404 17:58:58.298395 3938 slave.cpp:2047] Executor 'web_467-1396634277535' > of framework 201404041625-3823062160-55371-22555- has terminated with > signal Killed > E0404 17:58:58.298475 3929 slave.cpp:2320] Failed to unmonitor container for > executor web_467-1396634277535 of framework > 201404041625-3823062160-55371-22555-: Not monitored > I0404 17:58:58.299075 3938 slave.cpp:1643] Handling status update > TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > from @0.0.0.0:0 > I0404 17:58:58.299232 3932 status_update_manager.cpp:315] Received status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.299360 3932 status_update_manager.cpp:368] Forwarding status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > to master@144.76.223.227:5050 > I0404 17:58:58.306967 3932 status_update_manager.cpp:393] Received status > update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307049 3932 slave.cpp:2186] Cleaning up executor > 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307122 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a' > for gc 6.9644578667days in the future > I0404 17:58:58.307157 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535' > for gc 6.9644553185days in the future > F0404 17:58:58.597434 3938 mesos_containerizer.cpp:682] Check failed: > promises.contains(containerId) > *** Check failure stack trace: *** > @ 0x7f5209da6e5d google::LogMessage::Fail() > @ 0x7f5209da8c9d google::LogMessage::SendToLog() > @ 0x7f5209da6a4c google::LogMessage::Flush() > @ 0x7f5209da9599 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f5209ad9f88 > mesos::internal::slave::MesosContainerizerProcess::exec() > @ 0x7f5209af3b56 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS6_11ContainerIDEiSA_iEENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSH_FSF_T1_T2_ET3_T4_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f5209cd0bf2 process::ProcessManager::resume() > @ 0x7f5209cd0eec process::schedule() > @ 0x7f5208b48f6e start_thread > @ 0x7f52088739cd (unknown) -- This 
message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave
[ https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071032#comment-14071032 ] Steven Schlansker edited comment on MESOS-1193 at 7/22/14 10:45 PM: That sounds entirely plausible. We are running Docker containers. Given the somewhat unstable state of Docker support in Mesos, we are using our own Docker launching scripts. I had just updated a base image so all the slaves were busy executing a 'docker pull' to grab the new images. Given that the task is a shell script that executes this pull, it may well be past what Mesos thinks of as the "launch" phase. But it definitely was during a lengthy initialization step. It's worth noting that almost all of our jobs are Marathon tasks. I believe the log messages about Chronos are unrelated, we only have one or two things launching with it, and I don't think any were around the time of the crash. was (Author: stevenschlansker): That sounds entirely plausible. We are running Docker containers. Given the somewhat unstable state of Docker support in Mesos, we are using our own Docker launching scripts. I had just updated a base image so all the slaves were busy executing a 'docker pull' to grab the new images. Given that the task is a shell script that executes this pull, it may well be past what Mesos thinks of as the "launch" phase. But it definitely was during a lengthy initialization step. It's worth noting that almost all of our jobs are Marathon tasks. I believe (?) the log messages about Chronos are unrelated, we only have one or two things launching with it, and I don't think any were around the time of the crash. > Check failed: promises.contains(containerId) crashes slave > -- > > Key: MESOS-1193 > URL: https://issues.apache.org/jira/browse/MESOS-1193 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.18.0 >Reporter: Tobi Knaup > > This was observed with four slaves on one machine, one framework (Marathon) > and around 100 tasks per slave. 
> I0404 17:58:58.298075 3939 mesos_containerizer.cpp:891] Executor for > container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited > I0404 17:58:58.298395 3938 slave.cpp:2047] Executor 'web_467-1396634277535' > of framework 201404041625-3823062160-55371-22555- has terminated with > signal Killed > E0404 17:58:58.298475 3929 slave.cpp:2320] Failed to unmonitor container for > executor web_467-1396634277535 of framework > 201404041625-3823062160-55371-22555-: Not monitored > I0404 17:58:58.299075 3938 slave.cpp:1643] Handling status update > TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > from @0.0.0.0:0 > I0404 17:58:58.299232 3932 status_update_manager.cpp:315] Received status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.299360 3932 status_update_manager.cpp:368] Forwarding status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > to master@144.76.223.227:5050 > I0404 17:58:58.306967 3932 status_update_manager.cpp:393] Received status > update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307049 3932 slave.cpp:2186] Cleaning up executor > 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307122 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a' > for gc 6.9644578667days in the future > I0404 17:58:58.307157 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535' > for gc 6.9644553185days in the future > F0404 17:58:58.597434 3938 mesos_containerizer.cpp:682] Check failed: > promises.contains(containerId) > *** Check failure stack trace: *** > @ 0x7f5209da6e5d google::LogMessage::Fail() > @ 0x7f5209da8c9d google::LogMessage::SendToLog() > @ 0x7f5209da6a4c google::LogMessage::Flush() > @ 0x7f5209da9599 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f5209ad9f88 > mesos::internal::slave::MesosContainerizerProcess::exec() > @ 0x7f5209af3b56 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS
[jira] [Comment Edited] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave
[ https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071032#comment-14071032 ] Steven Schlansker edited comment on MESOS-1193 at 7/22/14 10:40 PM: That sounds entirely plausible. We are running Docker containers. Given the somewhat unstable state of Docker support in Mesos, we are using our own Docker launching scripts. I had just updated a base image so all the slaves were busy executing a 'docker pull' to grab the new images. Given that the task is a shell script that executes this pull, it may well be past what Mesos thinks of as the "launch" phase. But it definitely was during a lengthy initialization step. It's worth noting that almost all of our jobs are Marathon tasks. I believe (?) the log messages about Chronos are unrelated, we only have one or two things launching with it, and I don't think any were around the time of the crash. was (Author: stevenschlansker): That sounds entirely plausible. We are running Docker containers. Given the somewhat unstable state of Docker support in Mesos, we are using our own Docker launching scripts. I had just updated a base image so all the slaves were busy executing a 'docker pull' to grab the new images. Given that the task is a shell script that executes this pull, it may well be past what Mesos thinks of as the "launch" phase. But it definitely was during a lengthy initialization step. > Check failed: promises.contains(containerId) crashes slave > -- > > Key: MESOS-1193 > URL: https://issues.apache.org/jira/browse/MESOS-1193 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.18.0 >Reporter: Tobi Knaup > > This was observed with four slaves on one machine, one framework (Marathon) > and around 100 tasks per slave. 
> I0404 17:58:58.298075 3939 mesos_containerizer.cpp:891] Executor for > container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited > I0404 17:58:58.298395 3938 slave.cpp:2047] Executor 'web_467-1396634277535' > of framework 201404041625-3823062160-55371-22555- has terminated with > signal Killed > E0404 17:58:58.298475 3929 slave.cpp:2320] Failed to unmonitor container for > executor web_467-1396634277535 of framework > 201404041625-3823062160-55371-22555-: Not monitored > I0404 17:58:58.299075 3938 slave.cpp:1643] Handling status update > TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > from @0.0.0.0:0 > I0404 17:58:58.299232 3932 status_update_manager.cpp:315] Received status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.299360 3932 status_update_manager.cpp:368] Forwarding status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > to master@144.76.223.227:5050 > I0404 17:58:58.306967 3932 status_update_manager.cpp:393] Received status > update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307049 3932 slave.cpp:2186] Cleaning up executor > 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307122 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a' > for gc 6.9644578667days in the future > I0404 17:58:58.307157 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535' > for gc 6.9644553185days in the future > F0404 17:58:58.597434 3938 mesos_containerizer.cpp:682] Check failed: > promises.contains(containerId) > *** Check failure stack trace: *** > @ 0x7f5209da6e5d google::LogMessage::Fail() > @ 0x7f5209da8c9d google::LogMessage::SendToLog() > @ 0x7f5209da6a4c google::LogMessage::Flush() > @ 0x7f5209da9599 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f5209ad9f88 > mesos::internal::slave::MesosContainerizerProcess::exec() > @ 0x7f5209af3b56 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS6_11ContainerIDEiSA_iEENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSH_FSF_T1_T2_ET3_T4_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f5209cd0bf2 process::ProcessM
[jira] [Commented] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave
[ https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071032#comment-14071032 ] Steven Schlansker commented on MESOS-1193: -- That sounds entirely plausible. We are running Docker containers. Given the somewhat unstable state of Docker support in Mesos, we are using our own Docker launching scripts. I had just updated a base image so all the slaves were busy executing a 'docker pull' to grab the new images. Given that the task is a shell script that executes this pull, it may well be past what Mesos thinks of as the "launch" phase. But it definitely was during a lengthy initialization step. > Check failed: promises.contains(containerId) crashes slave > -- > > Key: MESOS-1193 > URL: https://issues.apache.org/jira/browse/MESOS-1193 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.18.0 >Reporter: Tobi Knaup > > This was observed with four slaves on one machine, one framework (Marathon) > and around 100 tasks per slave. > I0404 17:58:58.298075 3939 mesos_containerizer.cpp:891] Executor for > container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited > I0404 17:58:58.298395 3938 slave.cpp:2047] Executor 'web_467-1396634277535' > of framework 201404041625-3823062160-55371-22555- has terminated with > signal Killed > E0404 17:58:58.298475 3929 slave.cpp:2320] Failed to unmonitor container for > executor web_467-1396634277535 of framework > 201404041625-3823062160-55371-22555-: Not monitored > I0404 17:58:58.299075 3938 slave.cpp:1643] Handling status update > TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > from @0.0.0.0:0 > I0404 17:58:58.299232 3932 status_update_manager.cpp:315] Received status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.299360 3932 status_update_manager.cpp:368] Forwarding status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > to master@144.76.223.227:5050 > I0404 17:58:58.306967 3932 status_update_manager.cpp:393] Received status > update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307049 3932 slave.cpp:2186] Cleaning up executor > 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307122 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a' > for gc 6.9644578667days in the future > I0404 17:58:58.307157 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535' > for gc 6.9644553185days in the future > F0404 17:58:58.597434 3938 mesos_containerizer.cpp:682] Check failed: > promises.contains(containerId) > *** Check failure stack trace: *** > @ 0x7f5209da6e5d google::LogMessage::Fail() > @ 0x7f5209da8c9d google::LogMessage::SendToLog() > @ 0x7f5209da6a4c google::LogMessage::Flush() > @ 0x7f5209da9599 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f5209ad9f88 > 
mesos::internal::slave::MesosContainerizerProcess::exec() > @ 0x7f5209af3b56 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS6_11ContainerIDEiSA_iEENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSH_FSF_T1_T2_ET3_T4_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f5209cd0bf2 process::ProcessManager::resume() > @ 0x7f5209cd0eec process::schedule() > @ 0x7f5208b48f6e start_thread > @ 0x7f52088739cd (unknown) -- This message was sent by Atlassian JIRA (v6.2#6252)
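For context, the fatal line is a glog CHECK on a hash map lookup in the containerizer, so any code path that reaches exec() after the container's entry has been erased (for example, because the executor exited while a slow 'docker pull' was still running) aborts the whole slave. Below is a minimal sketch of the failing pattern next to a defensive alternative; the ContainerId alias and the promises map here are illustrative stand-ins, not the actual Mesos source.
{code}
// Illustrative sketch only: ContainerId and the promises map are stand-ins,
// not the real Mesos definitions.
#include <cstdlib>
#include <iostream>
#include <string>
#include <unordered_map>

using ContainerId = std::string;

// containerId -> "launch still in progress" marker
std::unordered_map<ContainerId, bool> promises;

// Reported behavior: exec() insists the container is still tracked. If the
// executor already exited and was cleaned up, the entry is gone and the
// CHECK aborts the entire slave process.
void execCrashing(const ContainerId& id) {
  if (promises.count(id) == 0) {
    std::cerr << "Check failed: promises.contains(containerId)" << std::endl;
    std::abort();  // stands in for glog's CHECK(...)
  }
  // ... continue the launch ...
}

// Defensive alternative: treat an unknown container as a per-container
// failure instead of a process-fatal invariant violation.
bool execDefensive(const ContainerId& id) {
  auto it = promises.find(id);
  if (it == promises.end()) {
    std::cerr << "Ignoring exec for unknown container " << id
              << " (likely destroyed while launching)" << std::endl;
    return false;
  }
  // ... continue the launch ...
  return true;
}
{code}
The point of the second variant is that a missing entry becomes an error scoped to one container rather than an invariant violation that takes down every task on the slave.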
[jira] [Comment Edited] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave
[ https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070592#comment-14070592 ] Steven Schlansker edited comment on MESOS-1193 at 7/22/14 6:08 PM: --- This same issue just took down our entire cluster this morning. Not cool! I wish I had more debugging information, but here's the last 10 log lines:
{code}
I0722 17:41:26.189750 1376 slave.cpp:2552] Cleaning up executor 'chronos.43852277-11c7-11e4-b541-1e5db1e5be60' of framework 201403072353-2969781770-5050-852-
I0722 17:41:26.189893 1381 gc.cpp:56] Scheduling '/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60/runs/941f20d9-ba74-4531-b64e-6c2b05b0277f' for gc 6.9780272days in the future
I0722 17:41:26.189980 1381 gc.cpp:56] Scheduling '/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60' for gc 6.9780201482days in the future
I0722 17:41:26.737553 1380 slave.cpp:933] Got assigned task chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 201403072353-2969781770-5050-852-
I0722 17:41:26.737844 1380 slave.cpp:1043] Launching task chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 201403072353-2969781770-5050-852-
I0722 17:41:26.739146 1375 mesos_containerizer.cpp:537] Starting container '17c2236a-1242-4c36-a6ed-54a31f687e8b' for executor 'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' of framework '201403072353-2969781770-5050-852-'
I0722 17:41:26.739151 1380 slave.cpp:1153] Queuing task 'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' for executor chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 of framework '201403072353-2969781770-5050-852-
I0722 17:41:26.748342 1375 launcher.cpp:117] Forked child with pid '12376' for container '17c2236a-1242-4c36-a6ed-54a31f687e8b'
I0722 17:41:26.752080 1380 mesos_containerizer.cpp:647] Fetching URIs for container '17c2236a-1242-4c36-a6ed-54a31f687e8b' using command '/usr/local/libexec/mesos/mesos-fetcher'
F0722 17:41:52.215634 1377 mesos_containerizer.cpp:862] Check failed: promises.contains(containerId)
{code}
{code}
I0722 17:56:35.702491 13428 main.cpp:126] Build: 2014-06-09 21:08:25 by root
I0722 17:56:35.702517 13428 main.cpp:128] Version: 0.19.0
I0722 17:56:35.702535 13428 main.cpp:131] Git tag: 0.19.0
I0722 17:56:35.702553 13428 main.cpp:135] Git SHA: 51e047524cf744ee257870eb479345646c0428ff
I0722 17:56:35.702590 13428 mesos_containerizer.cpp:124] Using isolation: posix/cpu,posix/mem
I0722 17:56:35.702942 13428 main.cpp:149] Starting Mesos slave
I0722 17:56:35.703721 13428 slave.cpp:143] Slave started on 1)@10.70.6.32:5051
I0722 17:56:35.704082 13428 slave.cpp:255] Slave resources: cpus(*):8; mem(*):29077; disk(*):70336; ports(*):[31000-32000]
I0722 17:56:35.705883 13428 slave.cpp:283] Slave hostname: 10.70.6.32
I0722 17:56:35.705915 13428 slave.cpp:284] Slave checkpoint: true
{code}
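Note the gap in the log: fetching starts at 17:41:26 and the CHECK fires at 17:41:52, leaving roughly 25 seconds during which the container could be destroyed while the fetch was still in flight. The sketch below shows that suspected interleaving using plain threads; the real slave uses libprocess actors rather than mutexes, and all names here are hypothetical.
{code}
// Sketch of the suspected interleaving, assuming an asynchronous fetch whose
// completion step runs long after destroy() may have removed the container's
// entry. Plain threads are used for illustration only.
#include <chrono>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>

std::mutex m;
std::set<std::string> live;  // container IDs whose launch is still tracked

void fetchThenExec(const std::string& id) {
  // Stands in for the ~25 second mesos-fetcher run seen in the log above.
  std::this_thread::sleep_for(std::chrono::seconds(2));
  std::lock_guard<std::mutex> lock(m);
  if (live.count(id) == 0) {
    // This is where the real slave CHECK-fails; skipping is the safe response.
    std::cerr << "container " << id << " vanished during fetch" << std::endl;
    return;
  }
  std::cout << "exec " << id << std::endl;
}

void destroy(const std::string& id) {
  std::lock_guard<std::mutex> lock(m);
  live.erase(id);
}

int main() {
  live.insert("c1");
  std::thread fetcher(fetchThenExec, "c1");
  destroy("c1");  // executor dies while the fetch is still running
  fetcher.join(); // fetch completes and finds the container gone
  return 0;
}
{code}
Run as written, destroy() wins the race and the fetch continuation finds the container already gone, which is exactly the state the CHECK refuses to tolerate.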
[jira] [Commented] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave
[ https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070592#comment-14070592 ] Steven Schlansker commented on MESOS-1193: -- This same issue just took down our entire cluster this morning. Not cool! I wish I had more debugging information, but here's the last 10 log lines:
{code}
I0722 17:41:26.189750 1376 slave.cpp:2552] Cleaning up executor 'chronos.43852277-11c7-11e4-b541-1e5db1e5be60' of framework 201403072353-2969781770-5050-852-
I0722 17:41:26.189893 1381 gc.cpp:56] Scheduling '/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60/runs/941f20d9-ba74-4531-b64e-6c2b05b0277f' for gc 6.9780272days in the future
I0722 17:41:26.189980 1381 gc.cpp:56] Scheduling '/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60' for gc 6.9780201482days in the future
I0722 17:41:26.737553 1380 slave.cpp:933] Got assigned task chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 201403072353-2969781770-5050-852-
I0722 17:41:26.737844 1380 slave.cpp:1043] Launching task chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 201403072353-2969781770-5050-852-
I0722 17:41:26.739146 1375 mesos_containerizer.cpp:537] Starting container '17c2236a-1242-4c36-a6ed-54a31f687e8b' for executor 'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' of framework '201403072353-2969781770-5050-852-'
I0722 17:41:26.739151 1380 slave.cpp:1153] Queuing task 'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' for executor chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 of framework '201403072353-2969781770-5050-852-
I0722 17:41:26.748342 1375 launcher.cpp:117] Forked child with pid '12376' for container '17c2236a-1242-4c36-a6ed-54a31f687e8b'
I0722 17:41:26.752080 1380 mesos_containerizer.cpp:647] Fetching URIs for container '17c2236a-1242-4c36-a6ed-54a31f687e8b' using command '/usr/local/libexec/mesos/mesos-fetcher'
F0722 17:41:52.215634 1377 mesos_containerizer.cpp:862] Check failed: promises.contains(containerId)
{code}
> Check failed: promises.contains(containerId) crashes slave > -- > > Key: MESOS-1193 > URL: https://issues.apache.org/jira/browse/MESOS-1193 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.18.0 >Reporter: Tobi Knaup > > This was observed with four slaves on one machine, one framework (Marathon) > and around 100 tasks per slave. 
> I0404 17:58:58.298075 3939 mesos_containerizer.cpp:891] Executor for > container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited > I0404 17:58:58.298395 3938 slave.cpp:2047] Executor 'web_467-1396634277535' > of framework 201404041625-3823062160-55371-22555- has terminated with > signal Killed > E0404 17:58:58.298475 3929 slave.cpp:2320] Failed to unmonitor container for > executor web_467-1396634277535 of framework > 201404041625-3823062160-55371-22555-: Not monitored > I0404 17:58:58.299075 3938 slave.cpp:1643] Handling status update > TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > from @0.0.0.0:0 > I0404 17:58:58.299232 3932 status_update_manager.cpp:315] Received status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.299360 3932 status_update_manager.cpp:368] Forwarding status > update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > to master@144.76.223.227:5050 > I0404 17:58:58.306967 3932 status_update_manager.cpp:393] Received status > update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task > web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307049 3932 slave.cpp:2186] Cleaning up executor > 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555- > I0404 17:58:58.307122 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a' > for gc 6.9644578667days in the future > I0404 17:58:58.307157 3932 gc.cpp:56] Scheduling > '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535' > for gc 6.9644553185days in the future > F0404 17:58:58.