[jira] [Commented] (MESOS-7085) Consider reducing processing of DECLINE calls log from info to debug

2017-02-08 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858431#comment-15858431
 ] 

Steven Schlansker commented on MESOS-7085:
--

More evidence of confusion over this in the ecosystem: 
https://github.com/mesosphere/marathon/issues/1917

> Consider reducing processing of DECLINE calls log from info to debug
> 
>
> Key: MESOS-7085
> URL: https://issues.apache.org/jira/browse/MESOS-7085
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.0.1
>Reporter: Steven Schlansker
>
> The Mesos master gets resource decline messages as a normal matter of course.
> It repeatedly logs the offers declined from schedulers.  This is critical 
> diagnostics information, but unless your scheduler is broken or buggy, 
> usually uninteresting.
> In our production environment this ended up being a significant fraction of 
> all logging.  One of our operators got paged:
> > Checking to see what I can delete.
> > 90% of the 1.6GB mesos log file is taken up by these ( + we are also 
> > outputting this to syslog ) :
> > I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for 
> > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework 
> > Singularity (Singularity) at 
> > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> > I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for 
> > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework 
> > Singularity (Singularity) at 
> > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> > I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for 
> > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework 
> > Singularity (Singularity) at 
> > scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> ➢  wc -l 
> mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
> 6796024 
> mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
> ➢ grep -c DECLINE 
> mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
> 5846770
> It seems that this line looks scary ("DECLINE" is a scary word to an 
> operator), is a huge percentage of log output, and is part of normal 
> operation.
> Should it be reduced to DEBUG?  Or could Mesos print it out in a time-based 
> manner?  ("654 offers declined in last 1 minute")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7085) Consider reducing processing of DECLINE calls log from info to debug

2017-02-08 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-7085:
-
Description: 
The Mesos master gets resource decline messages as a normal matter of course.

It repeatedly logs the offers declined from schedulers.  This is critical 
diagnostics information, but unless your scheduler is broken or buggy, usually 
uninteresting.

In our production environment this ended up being a significant fraction of all 
logging.  One of our operators got paged:


> Checking to see what I can delete.
> 90% of the 1.6GB mesos log file is taken up by these ( + we are also 
> outputting this to syslog ) :

> I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844

➢  wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
➢ grep -c DECLINE 
mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
5846770

It seems that this line looks scary ("DECLINE" is a scary word to an operator), 
is a huge percentage of log output, and is part of normal operation.
Should it be reduced to DEBUG?  Or could Mesos print it out in a time-based 
manner?  ("654 offers declined in last 1 minute")

  was:
The Mesos master gets resource decline messages as a normal matter of course.

It repeatedly logs the offers declined from schedulers.  This is critical 
diagnostics information, but unless your scheduler is broken or buggy, usually 
uninteresting.

In our production environment this ended up being a significant fraction of all 
logging.  One of our operators got paged:


> Checking to see what I can delete.
> 90% of the 1.6GB mesos log file is taken up by these ( + we are also 
> outputting this to syslog ) :

> I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844

➢  wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
➢ grep -c DECLINE 
mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
5846770

It seems that this line looks scary ("DECLINE" is a scary word to an operator), 
is a huge percentage of log output, and is part of normal operation.
Should it be reduced to DEBUG?  Or could Mesos print it out in a time-based 
manner?  ("654 offers declined in last 1 minute")


> Consider reducing processing of DECLINE calls log from info to debug
> 
>
> Key: MESOS-7085
> URL: https://issues.apache.org/jira/browse/MESOS-7085
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.0.1
>Reporter: Steven Schlansker
>
> The Mesos master gets resource decline messages as a normal matter of course.
> It repeatedly logs the offers declined from schedulers.  This is critical 
> diagnostics information, but unless your scheduler is broken or buggy, 
> usually uninteresting.
> In our production environment this ended up being a significant fraction of 
> all logging.  One of our operators got paged:
> > Checking to see what I can delete.
> > 90% of the 1.6GB mesos log file is taken up by these ( + we are also 
> > outputting this to syslog ) :
> > I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for 
> > offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework 
> > Singularity (Singularity) at

[jira] [Created] (MESOS-7085) Consider reducing processing of DECLINE calls log from info to debug

2017-02-08 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-7085:


 Summary: Consider reducing processing of DECLINE calls log from 
info to debug
 Key: MESOS-7085
 URL: https://issues.apache.org/jira/browse/MESOS-7085
 Project: Mesos
  Issue Type: Improvement
  Components: master
Affects Versions: 1.0.1
Reporter: Steven Schlansker


The Mesos master gets resource decline messages as a normal matter of course.

It repeatedly logs the offers declined from schedulers.  This is critical 
diagnostics information, but unless your scheduler is broken or buggy, usually 
uninteresting.

In our production environment this ended up being a significant fraction of all 
logging.  One of our operators got paged:


> Checking to see what I can delete.
> 90% of the 1.6GB mesos log file is taken up by these ( + we are also 
> outputting this to syslog ) :

> I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for 
> offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework 
> Singularity (Singularity) at 
> scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844

➢  wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
➢ grep -c DECLINE 
mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
5846770

It seems that this line looks scary ("DECLINE" is a scary word to an operator), 
is a huge percentage of log output, and is part of normal operation.
Should it be reduced to DEBUG?  Or could Mesos print it out in a time-based 
manner?  ("654 offers declined in last 1 minute")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-6066) Operator SUBSCRIBE api should include timestamps

2016-08-22 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-6066:


 Summary: Operator SUBSCRIBE api should include timestamps
 Key: MESOS-6066
 URL: https://issues.apache.org/jira/browse/MESOS-6066
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API, json api
Affects Versions: 1.0.0
Reporter: Steven Schlansker


Events coming from the Mesos master are delivered asynchronously.  While 
usually they are processed in a timely fashion, it really scares me that 
updates do not have a timestamp:

{code}
301
{
  "task_updated": {
"agent_id": {
  "value": "fdbb3ff5-47c2-4b49-a521-b52b9acf74dd-S14"
},
"framework_id": {
  "value": "Singularity"
},
"state": "TASK_KILLED",
"task_id": {
  "value": 
"pp-demoservice-steven.2016.07.05T17.00.06-1471901722511-1-mesos_slave17_qa_uswest2.qasql.opentable.com-us_west_2b"
}
  },
  "type": "TASK_UPDATED"
}
{code}

Events should carry a timestamp indicating when they actually occurred; 
otherwise, any timestamp a consumer records includes delivery and processing 
delays.
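
As a hedged illustration of the request (the field name, its placement, and
the value shown are all made up, not the actual Mesos schema), the same event
could carry its origination time:

{code}
{
  "task_updated": {
    "agent_id": {"value": "fdbb3ff5-47c2-4b49-a521-b52b9acf74dd-S14"},
    "framework_id": {"value": "Singularity"},
    "state": "TASK_KILLED",
    "task_id": {"value": "..."},
    "timestamp": 1471901722.511
  },
  "type": "TASK_UPDATED"
}
{code}

With a master-assigned "timestamp", the consumer's clock and the stream's
delivery latency drop out of the measurement entirely.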



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4862) Setting failover_timeout in FrameworkInfo to Double.MAX_VALUE causes it to be set to zero

2016-08-01 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403028#comment-15403028
 ] 

Steven Schlansker commented on MESOS-4862:
--

Is this a duplicate of https://issues.apache.org/jira/browse/MESOS-1575 ?

> Setting failover_timeout in FrameworkInfo to Double.MAX_VALUE causes it to be 
> set to zero
> -
>
> Key: MESOS-4862
> URL: https://issues.apache.org/jira/browse/MESOS-4862
> Project: Mesos
>  Issue Type: Bug
>  Components: master, stout
>Reporter: Timothy Chen
>
> Currently we expose framework failover_timeout as a double in Proto, and if 
> users set the failover_timeout to Double.MAX_VALUE, the Master will actually 
> set it to zero which is the complete opposite of the original intent.
> The problem is that in stout/duration.hpp we only store down to the 
> nanoseconds with int64_t, and it gives an error when we pass double.max as it 
> goes out of the int64_t bounds.
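
A minimal sketch of the failure mode described above (illustrative, not
stout's actual duration.hpp code): with nanoseconds stored in an int64_t,
Double.MAX_VALUE seconds is unrepresentable, and per the report the master
then fell back to zero.

{code}
#include <cstdint>
#include <iostream>
#include <limits>

int main()
{
  const double timeoutSecs = std::numeric_limits<double>::max();
  const double nanos = timeoutSecs * 1e9;  // ~1.8e317, far outside int64_t

  if (nanos > static_cast<double>(std::numeric_limits<int64_t>::max())) {
    // stout reports an error here; treating that as zero inverts the
    // user's intent ("never fail over" becomes "fail over immediately").
    std::cout << "failover_timeout exceeds Duration's int64_t range\n";
  }
  return 0;
}
{code}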



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5936) Operator SUBSCRIBE api should provide more task metadata than just state changes

2016-07-29 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-5936:


 Summary: Operator SUBSCRIBE api should provide more task metadata 
than just state changes
 Key: MESOS-5936
 URL: https://issues.apache.org/jira/browse/MESOS-5936
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API, json api
Affects Versions: 1.0.0
Reporter: Steven Schlansker


I am starting to use the new Operator event subscription API to get notified of 
task changes.  The initial TASK_STAGING event has a good amount of information, 
but unfortunately future events like TASK_RUNNING include almost no metadata.  
This means that any task information that cannot be determined until the task 
launches (in my particular case, the IP address assigned by the Docker 
containerizer) is not available through the event stream.

Here is a gist of a single task that was launched, comparing the information in 
'state.json' vs the subscribed events:

https://gist.github.com/stevenschlansker/c1d32aa9ce37a73f9c4d64347397d3b8

Note particularly how the IP address never shows in the event stream.

Task updates should expose the task information that changed.  If this is too 
onerous, maybe the subscription call can take optional sets of data to include, 
with the first one being additional task metadata.

A possible workaround is to use the task events to fetch 'state.json' 
separately, but this is inherently racy and totally undermines the utility of 
the event stream api.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3866) The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors

2016-07-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398420#comment-15398420
 ] 

Steven Schlansker commented on MESOS-3866:
--

We just independently rediscovered this bug -- since the containerizer sets the 
library path agent side, if you are in a "live upgrade" situation where you have 
upgraded the agent but not yet the library baked into your containers, an 
otherwise safe upgrade turns into a breaking change!

> The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors
> ---
>
> Key: MESOS-3866
> URL: https://issues.apache.org/jira/browse/MESOS-3866
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.25.0
>Reporter: Michael Gummelt
>
> It's set here: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L281
> And passed to the docker executor here: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L844
> This leaks the host path of the library into the docker image, which of 
> course can't see it. This is breaking DCOS Spark, which runs in a docker 
> image that has set its own value for MESOS_NATIVE_JAVA_LIBRARY.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5910) Operator SUBSCRIBE api should provide a method to get all events without requiring 100% uptime

2016-07-26 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394584#comment-15394584
 ] 

Steven Schlansker commented on MESOS-5910:
--

It seems that it actually gives you a current snapshot when you initially 
subscribe, so perhaps this really is only an issue during master failovers.  So 
this is probably of somewhat lower importance than I thought, although 
correctly handling master failover without losing events is still desirable.

> Operator SUBSCRIBE api should provide a method to get all events without 
> requiring 100% uptime
> --
>
> Key: MESOS-5910
> URL: https://issues.apache.org/jira/browse/MESOS-5910
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, json api
>Affects Versions: 1.0.0
>Reporter: Steven Schlansker
>
> The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of 
> events as they occur.  This is going to be extremely useful for monitoring 
> and management jobs, as they can now have timely information about Mesos's 
> operation without requiring repeated polling or other ugly solutions.
> Unfortunately, the SUBSCRIBE call always returns from the time the call is 
> made.  This means that any consumer cannot reliably subscribe to "all 
> events"; if the application goes offline (network blip, code upgrade, etc) 
> all events during that downtime are lost.
> You could instead have a cluster of applications receiving the events and 
> coordinating to deduplicate them to increase reliability, but this pushes a 
> lot of complexity into clients, and I suspect most users would not do this 
> correctly and would potentially lose events.
> It would be extremely useful for a single client to be able to get a reliable 
> event stream without requiring a single HTTP connection to be 100% available.
> One possible solution is to assign every event an ID.  Then, extend the API 
> to take a "start position" in the log.  The API immediately streams out all 
> events from the start event up until the tail of the log, and then continues 
> emitting new events as they occur.  This provides a reliable way for a 
> consumer to get "at least once" semantics on events.  The caveat is that the 
> consumer may only be down for as long as the master retains event history, 
> but this is a much easier pill to swallow.  This is similar to etcd's "watch" 
> api, if you are looking for an actual implementation to reference.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5910) Operator SUBSCRIBE api should provide a method to get all events without requiring 100% uptime

2016-07-26 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-5910:
-
Description: 
The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of 
events as they occur.  This is going to be extremely useful for monitoring and 
management jobs, as they can now have timely information about Mesos's 
operation without requiring repeated polling or other ugly solutions.

Unfortunately, the SUBSCRIBE call always returns from the time the call is 
made.  This means that any consumer cannot reliably subscribe to "all events"; 
if the application goes offline (network blip, code upgrade, etc) all events 
during that downtime are lost.

You could instead have a cluster of applications receiving the events and 
coordinating to deduplicate them to increase reliability, but this pushes a lot 
of complexity into clients, and I suspect most users would not do this 
correctly and would potentially lose events.

It would be extremely useful for a single client to be able to get a reliable 
event stream without requiring a single HTTP connection to be 100% available.

One possible solution is to assign every event an ID.  Then, extend the API to 
take a "start position" in the log.  The API immediately streams out all events 
from the start event up until the tail of the log, and then continues emitting 
new events as they occur.  This provides a reliable way for a consumer to get 
"at least once" semantics on events.  The caveat is that the consumer may only 
be down for as long as the master retains event history, but this is a much 
easier pill to swallow.  This is similar to etcd's "watch" api, if you are 
looking for an actual implementation to reference.

  was:
The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of 
events as they occur.  This is going to be extremely useful for monitoring and 
management jobs, as they can now have timely information about Mesos's 
operation without requiring repeated polling or other ugly solutions.

Unfortunately, the SUBSCRIBE call always returns from the time the call is 
made.  This means that any consumer cannot reliably subscribe to "all events"; 
if the application goes offline (network blip, code upgrade, etc) all events 
during that downtime are lost.

You could instead have a cluster of applications receiving the events and 
coordinating to deduplicate them to increase reliability, but this pushes a lot 
of complexity into clients, and I suspect most users would not do this 
correctly and would potentially lose events.

It would be extremely useful for a single client to be able to get a reliable 
event stream without requiring a single HTTP connection to be 100% available.

One possible solution is to assign every event an ID.  Then, extend the API to 
take a "start position" in the log.  The API immediately streams out all events 
from the start event up until the tail of the log, and then continues emitting 
new events as they occur.  This provides a reliable way for a consumer to get 
"at least once" semantics on events.  The caveat is that the consumer may only 
be down for as long as the master retains event history, but this is a much 
easier pill to swallow.


> Operator SUBSCRIBE api should provide a method to get all events without 
> requiring 100% uptime
> --
>
> Key: MESOS-5910
> URL: https://issues.apache.org/jira/browse/MESOS-5910
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API, json api
>Affects Versions: 1.0.0
>Reporter: Steven Schlansker
>
> The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of 
> events as they occur.  This is going to be extremely useful for monitoring 
> and management jobs, as they can now have timely information about Mesos's 
> operation without requiring repeated polling or other ugly solutions.
> Unfortunately, the SUBSCRIBE call always returns from the time the call is 
> made.  This means that any consumer cannot reliably subscribe to "all 
> events"; if the application goes offline (network blip, code upgrade, etc) 
> all events during that downtime are lost.
> You could instead have a cluster of applications receiving the events and 
> coordinating to deduplicate them to increase reliability, but this pushes a 
> lot of complexity into clients, and I suspect most users would not do this 
> correctly and would potentially lose events.
> It would be extremely useful for a single client to be able to get a reliable 
> event stream without requiring a single HTTP connection to be 100% available.
> One possible solution is to assign every event an ID.  Then, extend the API 
> to take a "start position" in the log.  The API immediately streams out all 
> ev

[jira] [Created] (MESOS-5910) Operator SUBSCRIBE api should provide a method to get all events without requiring 100% uptime

2016-07-26 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-5910:


 Summary: Operator SUBSCRIBE api should provide a method to get all 
events without requiring 100% uptime
 Key: MESOS-5910
 URL: https://issues.apache.org/jira/browse/MESOS-5910
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API, json api
Affects Versions: 1.0.0
Reporter: Steven Schlansker


The v1.0 Operator API adds a new SUBSCRIBE call, which returns a stream of 
events as they occur.  This is going to be extremely useful for monitoring and 
management jobs, as they can now have timely information about Mesos's 
operation without requiring repeated polling or other ugly solutions.

Unfortunately, the SUBSCRIBE call always returns from the time the call is 
made.  This means that any consumer cannot reliably subscribe to "all events"; 
if the application goes offline (network blip, code upgrade, etc) all events 
during that downtime are lost.

You could instead have a cluster of applications receiving the events and 
coordinating to deduplicate them to increase reliability, but this pushes a lot 
of complexity into clients, and I suspect most users would not do this 
correctly and would potentially lose events.

It would be extremely useful for a single client to be able to get a reliable 
event stream without requiring a single HTTP connection to be 100% available.

One possible solution is to assign every event an ID.  Then, extend the API to 
take a "start position" in the log.  The API immediately streams out all events 
from the start event up until the tail of the log, and then continues emitting 
new events as they occur.  This provides a reliable way for a consumer to get 
"at least once" semantics on events.  The caveat is that the consumer may only 
be down for as long as the master retains event history, but this is a much 
easier pill to swallow.
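
A hedged sketch of the proposed semantics (all names are hypothetical; nothing
like this exists in Mesos): the master keeps a bounded, ID-stamped event log,
and a subscriber passes the last ID it saw to receive the retained backlog
before live events, which yields at-least-once delivery.

{code}
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>

struct Event { uint64_t id; std::string body; };

class EventLog
{
public:
  explicit EventLog(size_t capacity) : capacity_(capacity) {}

  void append(std::string body)
  {
    events_.push_back(Event{nextId_++, std::move(body)});
    if (events_.size() > capacity_) {
      events_.pop_front();  // retention bound: old events age out
    }
  }

  // Replay everything at or after 'startId' that is still retained; a
  // consumer may only be down for as long as the master retains history.
  void subscribeFrom(uint64_t startId,
                     const std::function<void(const Event&)>& sink) const
  {
    for (const Event& event : events_) {
      if (event.id >= startId) {
        sink(event);
      }
    }
  }

private:
  const size_t capacity_;
  uint64_t nextId_ = 0;
  std::deque<Event> events_;
};
{code}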



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON

2016-06-07 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319175#comment-15319175
 ] 

Steven Schlansker commented on MESOS-4642:
--

I really think a documentation "fix" is a bad solution for this issue.  This is 
an API that can be broken solely based on user controlled data.  Allowing a 
(potentially malicious) "isolated" process to cause Mesos APIs to produce 
semantically invalid responses is a bad behavior that is worthy of a breaking 
change IMO.

Even if the cluster admin understands the gotcha, end users can break it 
unwittingly.  We never intended to make non-UTF8 data, that was an accident, so 
any existing documentation would have helped us understand and recover but 
could not have prevented this outage.
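
For reference, a minimal sketch of the encoding-based fix (illustrative only,
not the actual stout change): escape everything outside printable ASCII so the
response is always valid JSON. A real fix would additionally pass through
well-formed multi-byte UTF-8 instead of escaping it byte by byte.

{code}
#include <cstdio>
#include <string>

// Make arbitrary log bytes JSON-safe by emitting \u00XX escapes, e.g. the
// 0xac byte from the hexdump below becomes the six characters "\u00ac".
std::string jsonEscape(const std::string& raw)
{
  std::string out;
  for (unsigned char c : raw) {
    if (c == '"' || c == '\\') {
      out += '\\';
      out += static_cast<char>(c);
    } else if (c >= 0x20 && c < 0x7f) {
      out += static_cast<char>(c);  // printable ASCII passes through
    } else {
      char buf[8];
      std::snprintf(buf, sizeof(buf), "\\u%04x", c);
      out += buf;
    }
  }
  return out;
}
{code}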


> Mesos Agent Json API can dump binary data from log files out as invalid JSON
> 
>
> Key: MESOS-4642
> URL: https://issues.apache.org/jira/browse/MESOS-4642
> Project: Mesos
>  Issue Type: Bug
>  Components: json api, slave
>Affects Versions: 0.27.0
>Reporter: Steven Schlansker
>Priority: Critical
>
> One of our tasks accidentally started logging binary data to stderr.  This 
> was not intentional and generally should not happen -- however, it causes 
> severe problems with the Mesos Agent "files/read.json" API, since it gladly 
> dumps this binary data out as invalid JSON.
> {code}
> # hexdump -C /path/to/task/stderr | tail
> 0003d1f0  6f 6e 6e 65 63 74 69 6f  6e 0a 4e 45 54 3a 20 31  |onnection.NET: 1|
> 0003d200  20 6f 6e 72 65 61 64 20  45 4e 4f 45 4e 54 20 32  | onread ENOENT 2|
> 0003d210  39 35 34 35 36 20 32 35  31 20 32 39 35 37 30 37  |95456 251 295707|
> 0003d220  0a 01 00 00 00 00 00 00  ac 57 65 64 2c 20 31 30  |.........Wed, 10|
> 0003d230  20 55 6e 72 65 63 6f 67  6e 69 7a 65 64 20 69 6e  | Unrecognized in|
> 0003d240  70 75 74 20 68 65 61 64  65 72 0a                 |put header.|
> {code}
> {code}
> # curl 
> 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep='
>  | hexdump -C
> 00007970  6e 65 63 74 69 6f 6e 5c  6e 4e 45 54 3a 20 31 20  |nection\nNET: 1 |
> 00007980  6f 6e 72 65 61 64 20 45  4e 4f 45 4e 54 20 32 39  |onread ENOENT 29|
> 00007990  35 34 35 36 20 32 35 31  20 32 39 35 37 30 37 5c  |5456 251 295707\|
> 000079a0  6e 5c 75 30 30 30 31 5c  75 30 30 30 30 5c 75 30  |n\u0001\u0000\u0|
> 000079b0  30 30 30 5c 75 30 30 30  30 5c 75 30 30 30 30 5c  |000\u0000\u0000\|
> 000079c0  75 30 30 30 30 5c 75 30  30 30 30 ac 57 65 64 2c  |u0000\u0000.Wed,|
> 000079d0  20 31 30 20 55 6e 72 65  63 6f 67 6e 69 7a 65 64  | 10 Unrecognized|
> 000079e0  20 69 6e 70 75 74 20 68  65 61 64 65 72 5c 6e 22  | input header\n"|
> 000079f0  2c 22 6f 66 66 73 65 74  22 3a 32 32 30 34 34 33  |,"offset":220443|
> 00007a00  7d                                                |}|
> {code}
> This causes downstream sadness:
> {code}
> ERROR [2016-02-10 18:55:12,303] 
> io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 
> 0ee749630f8b26f1
> ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac
> !  at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: 
> 1, column: 31181]
> ! at 
> com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) 
> ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381)
>  ~[singularity-0.4

[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader

2016-04-14 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241456#comment-15241456
 ] 

Steven Schlansker commented on MESOS-1865:
--

301 is supposed to be "permanent" -- whereas the leader will continue to move 
over time.  Would 307 (Temporary Redirect) be more appropriate?

> Redirect to the leader master when current master is not a leader
> -
>
> Key: MESOS-1865
> URL: https://issues.apache.org/jira/browse/MESOS-1865
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>Assignee: haosdent
>
> Some of the API endpoints, for example /master/tasks.json, will return bogus 
> information if you query a non-leading master:
> {code}
> [steven@Anesthetize:~]% curl 
> http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": [
> {
>   "executor_id": "",
>   "framework_id": "20140724-231003-419644938-5050-1707-",
>   "id": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "name": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "resources": {
> "cpus": 0.25,
> "disk": 0,
> {code}
> This is very hard for end-users to work around.  For example if I query 
> "which master is leading" followed by "leader: which tasks are running" it is 
> possible that the leader fails over in between, leaving me with an incorrect 
> answer and no way to know that this happened.
> In my opinion the API should return the correct response (by asking the 
> current leader?) or an error (500 Not the leader?) but it's unacceptable to 
> return a successful wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader

2016-04-13 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239607#comment-15239607
 ] 

Steven Schlansker commented on MESOS-1865:
--

I also agree that fixing the "common" case by redirecting is okay -- "advanced" 
users that wish to query non-leading masters can easily ignore the redirect, 
and having 'curl' work reliably is of huge benefit.

> Redirect to the leader master when current master is not a leader
> -
>
> Key: MESOS-1865
> URL: https://issues.apache.org/jira/browse/MESOS-1865
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>Assignee: haosdent
>
> Some of the API endpoints, for example /master/tasks.json, will return bogus 
> information if you query a non-leading master:
> {code}
> [steven@Anesthetize:~]% curl 
> http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": [
> {
>   "executor_id": "",
>   "framework_id": "20140724-231003-419644938-5050-1707-",
>   "id": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "name": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "resources": {
> "cpus": 0.25,
> "disk": 0,
> {code}
> This is very hard for end-users to work around.  For example if I query 
> "which master is leading" followed by "leader: which tasks are running" it is 
> possible that the leader fails over in between, leaving me with an incorrect 
> answer and no way to know that this happened.
> In my opinion the API should return the correct response (by asking the 
> current leader?) or an error (500 Not the leader?) but it's unacceptable to 
> return a successful wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader

2016-04-13 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239594#comment-15239594
 ] 

Steven Schlansker commented on MESOS-1865:
--

This issue came up on the mailing list again -- Guillermo wrote,
{noformat}
I have an URL mesos-master.mydomain.com pointing to the leader and that works 
fine because it returns the slave list which I need for my autoscaler. But I'm 
afraid if the master fails the URL will no longer be valid. So I added the 
three IPs to the router (AWS Route53)  so it would round robin, but of course 
this will return an empty list sometimes because it hits a follower which 
returns empty.
{noformat}


> Redirect to the leader master when current master is not a leader
> -
>
> Key: MESOS-1865
> URL: https://issues.apache.org/jira/browse/MESOS-1865
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>Assignee: haosdent
>
> Some of the API endpoints, for example /master/tasks.json, will return bogus 
> information if you query a non-leading master:
> {code}
> [steven@Anesthetize:~]% curl 
> http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": [
> {
>   "executor_id": "",
>   "framework_id": "20140724-231003-419644938-5050-1707-",
>   "id": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "name": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "resources": {
> "cpus": 0.25,
> "disk": 0,
> {code}
> This is very hard for end-users to work around.  For example if I query 
> "which master is leading" followed by "leader: which tasks are running" it is 
> possible that the leader fails over in between, leaving me with an incorrect 
> answer and no way to know that this happened.
> In my opinion the API should return the correct response (by asking the 
> current leader?) or an error (500 Not the leader?) but it's unacceptable to 
> return a successful wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-1865) Redirect to the leader master when current master is not a leader

2016-04-13 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239594#comment-15239594
 ] 

Steven Schlansker edited comment on MESOS-1865 at 4/13/16 4:57 PM:
---

This issue came up on the mailing list again -- Guillermo wrote,
{quote}
I have an URL mesos-master.mydomain.com pointing to the leader and that works 
fine because it returns the slave list which I need for my autoscaler. But I'm 
afraid if the master fails the URL will no longer be valid. So I added the 
three IPs to the router (AWS Route53)  so it would round robin, but of course 
this will return an empty list sometimes because it hits a follower which 
returns empty.
{quote}



was (Author: stevenschlansker):
This issue came up on the mailing list again -- Guillermo wrote,
{noformat}
I have an URL mesos-master.mydomain.com pointing to the leader and that works 
fine because it returns the slave list which I need for my autoscaler. But I'm 
afraid if the master fails the URL will no longer be valid. So I added the 
three IPs to the router (AWS Route53)  so it would round robin, but of course 
this will return an empty list sometimes because it hits a follower which 
returns empty.
{noformat}


> Redirect to the leader master when current master is not a leader
> -
>
> Key: MESOS-1865
> URL: https://issues.apache.org/jira/browse/MESOS-1865
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>Assignee: haosdent
>
> Some of the API endpoints, for example /master/tasks.json, will return bogus 
> information if you query a non-leading master:
> {code}
> [steven@Anesthetize:~]% curl 
> http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": [
> {
>   "executor_id": "",
>   "framework_id": "20140724-231003-419644938-5050-1707-",
>   "id": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "name": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "resources": {
> "cpus": 0.25,
> "disk": 0,
> {code}
> This is very hard for end-users to work around.  For example if I query 
> "which master is leading" followed by "leader: which tasks are running" it is 
> possible that the leader fails over in between, leaving me with an incorrect 
> answer and no way to know that this happened.
> In my opinion the API should return the correct response (by asking the 
> current leader?) or an error (500 Not the leader?) but it's unacceptable to 
> return a successful wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5195) Docker executor: task logs lost on shutdown

2016-04-12 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-5195:


 Summary: Docker executor: task logs lost on shutdown
 Key: MESOS-5195
 URL: https://issues.apache.org/jira/browse/MESOS-5195
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 0.27.2
 Environment: Linux 4.4.2 "Ubuntu 14.04.2 LTS"
Reporter: Steven Schlansker


When you try to kill a task running in the Docker executor (in our case via 
Singularity), the task shuts down cleanly but the last logs to standard out / 
standard error are lost in teardown.

For example, we run dumb-init.  With debugging on, you can see it should write:
{noformat}
DEBUG("Forwarded signal %d to children.\n", signum);
{noformat}

If you attach strace to the process, you can see it clearly writes the text to 
stderr.  But that message is lost and never is written to the sandbox 'stderr' 
file.

We believe the issue starts here, in Docker executor.cpp:

{code}
  void shutdown(ExecutorDriver* driver)
  {
    cout << "Shutting down" << endl;

    if (run.isSome() && !killed) {
      // The docker daemon might still be in progress starting the
      // container, therefore we kill both the docker run process
      // and also ask the daemon to stop the container.

      // Making a mutable copy of the future so we can call discard.
      Future<Option<int>>(run.get()).discard();
      stop = docker->stop(containerName, stopTimeout);
      killed = true;
    }
  }
{code}

Notice how the "run" future is discarded *before* the Docker daemon is told to 
stop -- now what will discarding it do?

{code}
void commandDiscarded(const Subprocess& s, const string& cmd)
{
  VLOG(1) << "'" << cmd << "' is being discarded";
  os::killtree(s.pid(), SIGKILL);
}
{code}

Oops, just sent SIGKILL to the entire process tree...

You can see another (harmless?) side effect in the Docker daemon logs: it never 
gets a chance to kill the task:

{noformat}
ERROR Handler for DELETE 
/v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
 returned error: No such container: 
mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
{noformat}

I suspect that the fix is to wait for 'docker->stop()' to complete before 
discarding the 'run' future.
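
A hedged sketch of that suggestion (the sequencing shown is mine, not an
actual Mesos patch; 'onAny' is libprocess's completion hook): only discard the
'run' future once the stop has completed, so the SIGKILL from
commandDiscarded cannot race the container's final writes to the sandbox.

{code}
  void shutdown(ExecutorDriver* driver)
  {
    cout << "Shutting down" << endl;

    if (run.isSome() && !killed) {
      // Ask the daemon to stop the container first...
      stop = docker->stop(containerName, stopTimeout);

      // ...and only tear down the 'docker run' process tree (which
      // SIGKILLs it via commandDiscarded) after the stop has completed.
      stop.onAny([=]() {
        Future<Option<int>>(run.get()).discard();
      });

      killed = true;
    }
  }
{code}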

Happy to provide more information if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2840) MesosContainerizer support multiple image provisioners

2016-02-23 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159389#comment-15159389
 ] 

Steven Schlansker commented on MESOS-2840:
--

Hi, I'm sorry I can't comment too much on the Mesos internals, but I just 
wanted to throw in my two cents as an end user.  The Docker daemon is 
continually a thorn in our side, their "daemon forks the container process" 
model introduces an unnecessary single point of failure and just generally is 
not stable enough to be in that position.  We are extremely excited to try out 
this work and look forward to being early adopters and finding all the bugs ;)

The design document looks well thought out and seems like an excellent approach.


> MesosContainerizer support multiple image provisioners
> --
>
> Key: MESOS-2840
> URL: https://issues.apache.org/jira/browse/MESOS-2840
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization, docker
>Affects Versions: 0.23.0
>Reporter: Marco Massenzio
>Assignee: Timothy Chen
>  Labels: mesosphere, twitter
>
> We want to utilize the Appc integration interfaces to further make 
> MesosContainerizers to support multiple image formats.
> This allows our future work on isolators to support any container image 
> format.
> Design
> [open to public comments]
> https://docs.google.com/document/d/1oUpJNjJ0l51fxaYut21mKPwJUiAcAdgbdF7SAdAW2PA/edit?usp=sharing
> [original document, requires permission]
> https://docs.google.com/a/twitter.com/document/d/1Fx5TS0LytV7u5MZExQS0-g-gScX2yKCKQg9UPFzhp6U/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON

2016-02-17 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151273#comment-15151273
 ] 

Steven Schlansker commented on MESOS-4642:
--

It seems that the first option, doing proper JSON encoding, is the correct fix 
to me.

> Mesos Agent Json API can dump binary data from log files out as invalid JSON
> 
>
> Key: MESOS-4642
> URL: https://issues.apache.org/jira/browse/MESOS-4642
> Project: Mesos
>  Issue Type: Bug
>  Components: json api, slave
>Affects Versions: 0.27.0
>Reporter: Steven Schlansker
>Priority: Critical
>
> One of our tasks accidentally started logging binary data to stderr.  This 
> was not intentional and generally should not happen -- however, it causes 
> severe problems with the Mesos Agent "files/read.json" API, since it gladly 
> dumps this binary data out as invalid JSON.
> {code}
> # hexdump -C /path/to/task/stderr | tail
> 0003d1f0  6f 6e 6e 65 63 74 69 6f  6e 0a 4e 45 54 3a 20 31  |onnection.NET: 1|
> 0003d200  20 6f 6e 72 65 61 64 20  45 4e 4f 45 4e 54 20 32  | onread ENOENT 2|
> 0003d210  39 35 34 35 36 20 32 35  31 20 32 39 35 37 30 37  |95456 251 295707|
> 0003d220  0a 01 00 00 00 00 00 00  ac 57 65 64 2c 20 31 30  |.........Wed, 10|
> 0003d230  20 55 6e 72 65 63 6f 67  6e 69 7a 65 64 20 69 6e  | Unrecognized in|
> 0003d240  70 75 74 20 68 65 61 64  65 72 0a                 |put header.|
> {code}
> {code}
> # curl 
> 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep='
>  | hexdump -C
> 00007970  6e 65 63 74 69 6f 6e 5c  6e 4e 45 54 3a 20 31 20  |nection\nNET: 1 |
> 00007980  6f 6e 72 65 61 64 20 45  4e 4f 45 4e 54 20 32 39  |onread ENOENT 29|
> 00007990  35 34 35 36 20 32 35 31  20 32 39 35 37 30 37 5c  |5456 251 295707\|
> 000079a0  6e 5c 75 30 30 30 31 5c  75 30 30 30 30 5c 75 30  |n\u0001\u0000\u0|
> 000079b0  30 30 30 5c 75 30 30 30  30 5c 75 30 30 30 30 5c  |000\u0000\u0000\|
> 000079c0  75 30 30 30 30 5c 75 30  30 30 30 ac 57 65 64 2c  |u0000\u0000.Wed,|
> 000079d0  20 31 30 20 55 6e 72 65  63 6f 67 6e 69 7a 65 64  | 10 Unrecognized|
> 000079e0  20 69 6e 70 75 74 20 68  65 61 64 65 72 5c 6e 22  | input header\n"|
> 000079f0  2c 22 6f 66 66 73 65 74  22 3a 32 32 30 34 34 33  |,"offset":220443|
> 00007a00  7d                                                |}|
> {code}
> This causes downstream sadness:
> {code}
> ERROR [2016-02-10 18:55:12,303] 
> io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 
> 0ee749630f8b26f1
> ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac
> !  at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: 
> 1, column: 31181]
> ! at 
> com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) 
> ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1073)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.module.afterburner.deser.SuperSonicBeanDeserializer.deserializeFromObject(SuperSonicBeanDeserializer.java:196)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:142)
>  ~[singularity-0.4.9.jar:0.4.9]
> ! at 
> 

[jira] [Created] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON

2016-02-10 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-4642:


 Summary: Mesos Agent Json API can dump binary data from log files 
out as invalid JSON
 Key: MESOS-4642
 URL: https://issues.apache.org/jira/browse/MESOS-4642
 Project: Mesos
  Issue Type: Bug
  Components: json api, slave
Affects Versions: 0.27.0
Reporter: Steven Schlansker
Priority: Critical


One of our tasks accidentally started logging binary data to stderr.  This was 
not intentional and generally should not happen -- however, it causes severe 
problems with the Mesos Agent "files/read.json" API, since it gladly dumps this 
binary data out as invalid JSON.

{code}
# hexdump -C /path/to/task/stderr | tail
0003d1f0  6f 6e 6e 65 63 74 69 6f  6e 0a 4e 45 54 3a 20 31  |onnection.NET: 1|
0003d200  20 6f 6e 72 65 61 64 20  45 4e 4f 45 4e 54 20 32  | onread ENOENT 2|
0003d210  39 35 34 35 36 20 32 35  31 20 32 39 35 37 30 37  |95456 251 295707|
0003d220  0a 01 00 00 00 00 00 00  ac 57 65 64 2c 20 31 30  |.........Wed, 10|
0003d230  20 55 6e 72 65 63 6f 67  6e 69 7a 65 64 20 69 6e  | Unrecognized in|
0003d240  70 75 74 20 68 65 61 64  65 72 0a                 |put header.|
{code}

{code}
# curl 
'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep='
 | hexdump -C
00007970  6e 65 63 74 69 6f 6e 5c  6e 4e 45 54 3a 20 31 20  |nection\nNET: 1 |
00007980  6f 6e 72 65 61 64 20 45  4e 4f 45 4e 54 20 32 39  |onread ENOENT 29|
00007990  35 34 35 36 20 32 35 31  20 32 39 35 37 30 37 5c  |5456 251 295707\|
000079a0  6e 5c 75 30 30 30 31 5c  75 30 30 30 30 5c 75 30  |n\u0001\u0000\u0|
000079b0  30 30 30 5c 75 30 30 30  30 5c 75 30 30 30 30 5c  |000\u0000\u0000\|
000079c0  75 30 30 30 30 5c 75 30  30 30 30 ac 57 65 64 2c  |u0000\u0000.Wed,|
000079d0  20 31 30 20 55 6e 72 65  63 6f 67 6e 69 7a 65 64  | 10 Unrecognized|
000079e0  20 69 6e 70 75 74 20 68  65 61 64 65 72 5c 6e 22  | input header\n"|
000079f0  2c 22 6f 66 66 73 65 74  22 3a 32 32 30 34 34 33  |,"offset":220443|
00007a00  7d                                                |}|
{code}

This causes downstream sadness:
{code}
ERROR [2016-02-10 18:55:12,303] 
io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 
0ee749630f8b26f1
! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac
!  at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: 
1, column: 31181]
! at 
com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) 
~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:29)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:12)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:523)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:381)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1073)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.module.afterburner.deser.SuperSonicBeanDeserializer.deserializeFromObject(SuperSonicBeanDeserializer.java:196)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:142)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.module.afterburner.deser.SuperSonicBeanDeserializer.deserialize(SuperSonicBeanDeserializer.java:117)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3562)
 ~[singularity-0.4.9.jar:0.4.9]
! at 
com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2648) 
~[singularity-0.4.9.jar:0.4.9]
! at com.hubspot.singularity.data.SandboxManager.read(SandboxManager.java:97) 
~[singularity-0.4.9.j

[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-22 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970064#comment-14970064
 ] 

Steven Schlansker commented on MESOS-3771:
--

Sounds good to me.  I think removing that field from JSON is fine for us.

> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1, 0.26.0
>Reporter: Steven Schlansker
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regards to proper character encoding:
> {code}
> 0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
> 0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
> 0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
> 0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u0000\u0005ur\|
> 0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u0000\u000f[Lsca|
> 0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
> {code}
> I suspect this is because the HTTP api emits the executorInfo.data directly:
> {code}
> JSON::Object model(const ExecutorInfo& executorInfo)
> {
>   JSON::Object object;
>   object.values["executor_id"] = executorInfo.executor_id().value();
>   object.values["name"] = executorInfo.name();
>   object.values["data"] = executorInfo.data();
>   object.values["framework_id"] = executorInfo.framework_id().value();
>   object.values["command"] = model(executorInfo.command());
>   object.values["resources"] = model(executorInfo.resources());
>   return object;
> }
> {code}
> I think this may be because the custom JSON processing library in stout seems 
> to not have any idea of what a byte array is.  I'm guessing that some 
> implicit conversion makes it get written as a String instead, but:
> {code}
> inline std::ostream& operator<<(std::ostream& out, const String& string)
> {
>   // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
>   // See RFC4627 for the JSON string specificiation.
>   return out << picojson::value(string.value).serialize();
> }
> {code}
> Thank you for any assistance here.  Our cluster is currently entirely down -- 
> the frameworks cannot handle parsing the invalid JSON produced (it is not 
> even valid utf-8)
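
For illustration, a hedged sketch of one compatible direction (the resolution
discussed above was to drop the field from the JSON instead; base64::encode is
stout's, from stout/base64.hpp, and the rest mirrors the model() quoted above):

{code}
JSON::Object model(const ExecutorInfo& executorInfo)
{
  JSON::Object object;
  object.values["executor_id"] = executorInfo.executor_id().value();
  object.values["name"] = executorInfo.name();
  // Arbitrary bytes become plain ASCII, so the document stays valid UTF-8.
  object.values["data"] = base64::encode(executorInfo.data());
  object.values["framework_id"] = executorInfo.framework_id().value();
  object.values["command"] = model(executorInfo.command());
  object.values["resources"] = model(executorInfo.resources());
  return object;
}
{code}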



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968622#comment-14968622
 ] 

Steven Schlansker commented on MESOS-2186:
--

That's a bummer.  Thank you, everyone, for your time and for looking into this.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master allowing unauthenticated slave

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968143#comment-14968143
 ] 

Steven Schlansker commented on MESOS-2186:
--

Maybe this will end up being too hard to fix, since it seems to be a limitation 
of the ZK C API.  It's just surprising from an end user perspective that a 
single name failing to resolve (even when two are still happy) causes such a 
disruptive failure.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] M

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968128#comment-14968128
 ] 

Steven Schlansker commented on MESOS-2186:
--

This is true only when the DNS resolution failure is temporary.  If it is not 
temporary, you are still SOL.  Imagine $JUNIOR_ADMIN removes one of the 
ZooKeeper nodes from DNS: with aggressive DNS caching you may then have an 
inoperable Mesos cluster for a long time, even though a ZK quorum is still up 
and alive.
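
A sketch of the defensive behavior being asked for here, under the
assumption that the caller can pre-filter the connection string before
zookeeper_init() ever sees it (filterResolvable is an illustrative name,
not an existing API):

{code}
// Keep only ensemble members that currently resolve, so one stale DNS
// entry cannot abort ZooKeeper handle creation for the whole quorum.
#include <netdb.h>
#include <cstring>
#include <sstream>
#include <string>

std::string filterResolvable(const std::string& hostList)
{
  std::stringstream in(hostList);  // e.g. "zk1:2181,zk2:2181,bad:2181"
  std::string entry;
  std::string result;
  while (std::getline(in, entry, ',')) {
    const std::string host = entry.substr(0, entry.find(':'));
    addrinfo hints;
    addrinfo* res = nullptr;
    std::memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    if (getaddrinfo(host.c_str(), nullptr, &hints, &res) == 0) {
      freeaddrinfo(res);
      if (!result.empty()) {
        result += ',';
      }
      result += entry;  // host resolves today; keep it in the ensemble
    }
    // Unresolvable hosts are silently dropped instead of being fatal.
  }
  return result;
}
{code}

Note this only helps at startup; it does not address a name that stops
resolving later, which is exactly the DNS-caching scenario above.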

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaste

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968073#comment-14968073
 ] 

Steven Schlansker commented on MESOS-2186:
--

If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is 
unrelated, yeah?
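
For what it's worth, the distinction can be probed directly; a minimal
standalone check (assuming the ZooKeeper C client is installed, and using a
deliberately unresolvable hostname) would be:

{code}
// If the handle comes back NULL, the failure happened synchronously
// inside zookeeper_init() itself -- before any of the asynchronous
// retry behavior discussed in ZOOKEEPER-1029 could apply.
#include <zookeeper/zookeeper.h>
#include <cerrno>
#include <cstdio>

int main()
{
  zhandle_t* zh = zookeeper_init(
      "badhost.example.com:2181", nullptr, 10000, nullptr, nullptr, 0);

  if (zh == nullptr) {
    std::printf("zookeeper_init failed synchronously: errno=%d\n", errno);
    return 1;
  }

  zookeeper_close(zh);
  return 0;
}
{code}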

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master 

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968072#comment-14968072
 ] 

Steven Schlansker commented on MESOS-2186:
--

If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is 
unrelated, yeah?

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master 

[jira] [Issue Comment Deleted] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2186:
-
Comment: was deleted

(was: If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 
is unrelated, yeah?)

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master allowing unauthenticated 

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968056#comment-14968056
 ] 

Steven Schlansker commented on MESOS-2186:
--

Well, rgs above called into question whether that is truly the case.  
Additionally, at least as of now, the "Check failure stack trace" is entirely 
in C++ code, seemingly not in the ZooKeeper library (which is pure C).
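
One direction a fix on the C++ side could take, written as a hedged sketch
rather than actual Mesos code (initWithRetry is an illustrative helper):
retry handle creation with backoff instead of treating a synchronous
zookeeper_init() failure as fatal.

{code}
#include <zookeeper/zookeeper.h>
#include <unistd.h>
#include <cstdio>

// Retry zookeeper_init() rather than aborting; a transient getaddrinfo
// failure on one ensemble member then delays startup instead of
// core-dumping the master.
zhandle_t* initWithRetry(const char* ensemble, int timeoutMs, int attempts)
{
  for (int i = 0; i < attempts; ++i) {
    zhandle_t* zh =
        zookeeper_init(ensemble, nullptr, timeoutMs, nullptr, nullptr, 0);
    if (zh != nullptr) {
      return zh;  // handle created; session establishment proceeds async
    }
    std::perror("zookeeper_init");  // e.g. the getaddrinfo failure above
    sleep(1u << (i < 5 ? i : 5));   // capped exponential backoff
  }
  return nullptr;  // let the caller decide; no CHECK-style abort
}
{code}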

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to 

[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022
 ] 

Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:51 PM:


I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat

I1021 21:48:00.308338 32707 group.cpp:674] Trying to get 
'/wat/json.info_00' in ZooKeeper
I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master 
(UPID=master@127.0.1.1:5050) is detected
I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is 
master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e
I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master!
{code}

Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct 
though, so this should be recoverable:
{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat

I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
@ 0x7f9bec6044c2  google::LogMessage::Fail()
@ 0x7f9bec6044c2  google::LogMessage::Fail()
@ 0x7f9bec60440e  google::LogMessage::SendToLog()
@ 0x7f9bec60440e  google::LogMessage::SendToLog()
@ 0x7f9bec603e10  google::LogMessage::Flush()
@ 0x7f9bec603e10  google::LogMessage::Flush()
@ 0x7f9bec603c25  google::LogMessage::~LogMessage()
@ 0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec603c25  google::LogMessage::~LogMessage()
@ 0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec00b825  ZooKeeperProcess::initialize()
@ 0x7f9bec00b825  ZooKeeperProcess::initialize()
@ 0x7f9bec57053d  process::ProcessManager::resume()
@ 0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec57053d  process::ProcessManager::resume()
@ 0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec5779ed  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
@ 0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec577986  
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
@ 0x7f9be828ea40  (unknown)
@ 0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9be7aab182  start_thread
@ 0x7f9be77d847d  (unknown)
Aborted (core dumped)
{code}

[~rgs] I am very sorry if this does not end up being a ZK problem at all; I am 
no C++ expert.  I fully admit the linked ZK bug may not be the root cause.  But 
Mesos is still trivial to crash if one of the ZK members is not valid (even if 
a quorum is).



was (Author: stevenschlansker):
I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat

I1021 21:48:00.308338 32707 group.cpp:674] Trying to get 
'/w

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968027#comment-14968027
 ] 

Steven Schlansker commented on MESOS-2186:
--

I reopened the ticket since it is still a crasher in master.  I hope that is 
appropriate; I apologize in advance if not.  Not trying to be a stick in the 
mud, but this compromises the "high availability" of Mesos, which is a 
critical piece of infrastructure.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:3

[jira] [Updated] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2186:
-
Affects Version/s: 0.26.0

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master allowing unauthenticated slaves to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 

[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022
 ] 

Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:47 PM:


I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat

I1021 21:48:00.308338 32707 group.cpp:674] Trying to get 
'/wat/json.info_00' in ZooKeeper
I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master 
(UPID=master@127.0.1.1:5050) is detected
I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is 
master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e
I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master!
{code}

Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct 
though, so this should be recoverable:
{code}
$ ./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat

I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
@ 0x7f9bec6044c2  google::LogMessage::Fail()
@ 0x7f9bec6044c2  google::LogMessage::Fail()
@ 0x7f9bec60440e  google::LogMessage::SendToLog()
@ 0x7f9bec60440e  google::LogMessage::SendToLog()
@ 0x7f9bec603e10  google::LogMessage::Flush()
@ 0x7f9bec603e10  google::LogMessage::Flush()
@ 0x7f9bec603c25  google::LogMessage::~LogMessage()
@ 0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec603c25  google::LogMessage::~LogMessage()
@ 0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec00b825  ZooKeeperProcess::initialize()
@ 0x7f9bec00b825  ZooKeeperProcess::initialize()
@ 0x7f9bec57053d  process::ProcessManager::resume()
@ 0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec57053d  process::ProcessManager::resume()
@ 0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec5779ed  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
@ 0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec577986  
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
@ 0x7f9be828ea40  (unknown)
@ 0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9be7aab182  start_thread
@ 0x7f9be77d847d  (unknown)
Aborted (core dumped)
{code}

[~rgs] I am very sorry if this does not end up being a ZK problem at all; I am 
no C++ expert.  But Mesos is still trivial to crash if one of the ZK members 
is not valid (even if two are).



was (Author: stevenschlansker):
I am still able to easily reproduce this, even with master built from today:

{code}
./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat
{code}

Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct 
thou

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022
 ] 

Steven Schlansker commented on MESOS-2186:
--

I am still able to easily reproduce this, even with master built from today:

{code}
./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat
{code}

Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct 
though, so this should be recoverable:
{code}
./bin/mesos-master.sh --registry=in_memory 
--zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat
I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: 
getaddrinfo: No such file or directory

F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, 
zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
@ 0x7f9bec6044c2  google::LogMessage::Fail()
@ 0x7f9bec6044c2  google::LogMessage::Fail()
@ 0x7f9bec60440e  google::LogMessage::SendToLog()
@ 0x7f9bec60440e  google::LogMessage::SendToLog()
@ 0x7f9bec603e10  google::LogMessage::Flush()
@ 0x7f9bec603e10  google::LogMessage::Flush()
@ 0x7f9bec603c25  google::LogMessage::~LogMessage()
@ 0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec603c25  google::LogMessage::~LogMessage()
@ 0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec00b825  ZooKeeperProcess::initialize()
@ 0x7f9bec00b825  ZooKeeperProcess::initialize()
@ 0x7f9bec57053d  process::ProcessManager::resume()
@ 0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec57053d  process::ProcessManager::resume()
@ 0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec5779ed  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
@ 0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec577986  
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
@ 0x7f9be828ea40  (unknown)
@ 0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9be7aab182  start_thread
@ 0x7f9be77d847d  (unknown)
Aborted (core dumped)
{code}

[~rgs] I am very sorry if this does not end up being a ZK problem at all; I am 
no C++ expert.  But Mesos is still trivial to crash if one of the ZK members 
is not valid (even if two are).
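
To make the failure mode concrete outside of Mesos entirely, here is a minimal 
sketch (the hostnames are the placeholders from the reproduction above; any 
unresolvable name behaves the same way).  Resolving the whole member list 
eagerly fails as soon as one name does not resolve, which appears to be what 
zookeeper_init does; a more tolerant client could skip the bad member while a 
quorum remains:

{code}
import socket

members = ["zk1.mycorp.com", "zk2.mycorp.com", "badhost.mycorp.com"]

# Eager, all-or-nothing resolution: one bad name fails the whole list,
# mirroring the observed behavior where a single getaddrinfo error
# aborts the master.
try:
    resolved = [socket.getaddrinfo(host, 2181) for host in members]
except socket.gaierror as e:
    print("startup aborted:", e)

# Tolerant alternative: keep the members that do resolve and proceed
# as long as enough of the ensemble remains reachable.
reachable = []
for host in members:
    try:
        socket.getaddrinfo(host, 2181)
        reachable.append(host)
    except socket.gaierror:
        print("skipping unresolvable member:", host)
print("usable members:", reachable)
{code}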


> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from

[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967962#comment-14967962
 ] 

Steven Schlansker commented on MESOS-3771:
--

Okay, I have distilled down the reproduction case.
Using the Python test-framework with the following diff applied:

{code}
diff --git a/src/examples/python/test_framework.py 
b/src/examples/python/test_framework.py
index 6af6d22..95abb97 100755
--- a/src/examples/python/test_framework.py
+++ b/src/examples/python/test_framework.py
@@ -150,6 +150,7 @@ class TestScheduler(mesos.interface.Scheduler):
 print "but received", self.messagesReceived
 sys.exit(1)
 print "All tasks done, and all messages received, exiting"
+time.sleep(30)
 driver.stop()
 
 if __name__ == "__main__":
@@ -158,6 +159,7 @@ if __name__ == "__main__":
 sys.exit(1)
 
 executor = mesos_pb2.ExecutorInfo()
+executor.data = b'\xAC\xED'
 executor.executor_id.value = "default"
 executor.command.value = os.path.abspath("./test-executor")
 executor.name = "Test Executor (Python)"
{code}

If you run the test framework and, during the 30-second wait after it 
finishes, try to grab the {{/master/state.json}} endpoint, you will get a 
response that has invalid UTF-8 in it:

{code}
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start 
byte 0xac
 at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@54c8158d; line: 1, 
column: 6432]
{code}

I tested against both 0.24.1 and current master; both exhibit the bad behavior.
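
For anyone who wants to confirm the behavior, a small sketch of the check 
performed above (the master address is a placeholder; the endpoint is the 
stock {{/master/state.json}}):

{code}
import urllib.request

# Placeholder address; point this at a master while the framework with
# the binary ExecutorInfo.data is still registered.
url = "http://master1.example.com:5050/master/state.json"

body = urllib.request.urlopen(url).read()
try:
    body.decode("utf-8")
    print("response is valid UTF-8")
except UnicodeDecodeError as e:
    print("response is NOT valid UTF-8:", e)
{code}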

> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1, 0.26.0
>Reporter: Steven Schlansker
>Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regards to proper character encoding:
> {code}
> 0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
> 0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
> 0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
> 0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u\u0005ur\|
> 0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u\u000f[Lsca|
> 0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
> {code}
> I suspect this is because the HTTP api emits the executorInfo.data directly:
> {code}
> JSON::Object model(const ExecutorInfo& executorInfo)
> {
>   JSON::Object object;
>   object.values["executor_id"] = executorInfo.executor_id().value();
>   object.values["name"] = executorInfo.name();
>   object.values["data"] = executorInfo.data();
>   object.values["framework_id"] = executorInfo.framework_id().value();
>   object.values["command"] = model(executorInfo.command());
>   object.values["resources"] = model(executorInfo.resources());
>   return object;
> }
> {code}
> I think this may be because the custom JSON processing library in stout seems 
> to not have any idea of what a byte array is.  I'm guessing that some 
> implicit conversion makes it get written as a String instead, but:
> {code}
> inline std::ostream& operator<<(std::ostream& out, const String& string)
> {
>   // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
>   // See RFC4627 for the JSON string specificiation.
>   return out << picojson::value(string.value).serialize();
> }
> {code}
> Thank you for any assistance here.  Our cluster is currently entirely down -- 
> the frameworks cannot handle parsing the invalid JSON produced (it is not 
> even valid utf-8)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-21 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-3771:
-
Affects Version/s: 0.26.0

> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1, 0.26.0
>Reporter: Steven Schlansker
>Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regards to proper character encoding:
> {code}
> 0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
> 0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
> 0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
> 0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u\u0005ur\|
> 0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u\u000f[Lsca|
> 0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
> {code}
> I suspect this is because the HTTP api emits the executorInfo.data directly:
> {code}
> JSON::Object model(const ExecutorInfo& executorInfo)
> {
>   JSON::Object object;
>   object.values["executor_id"] = executorInfo.executor_id().value();
>   object.values["name"] = executorInfo.name();
>   object.values["data"] = executorInfo.data();
>   object.values["framework_id"] = executorInfo.framework_id().value();
>   object.values["command"] = model(executorInfo.command());
>   object.values["resources"] = model(executorInfo.resources());
>   return object;
> }
> {code}
> I think this may be because the custom JSON processing library in stout seems 
> to not have any idea of what a byte array is.  I'm guessing that some 
> implicit conversion makes it get written as a String instead, but:
> {code}
> inline std::ostream& operator<<(std::ostream& out, const String& string)
> {
>   // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
>   // See RFC4627 for the JSON string specificiation.
>   return out << picojson::value(string.value).serialize();
> }
> {code}
> Thank you for any assistance here.  Our cluster is currently entirely down -- 
> the frameworks cannot handle parsing the invalid JSON produced (it is not 
> even valid utf-8)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-20 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965945#comment-14965945
 ] 

Steven Schlansker commented on MESOS-3771:
--

Yeah, we could try to patch Spark.  However, I'm sure someone else will make 
exactly the same mistake down the road -- it seems to work as long as you use 
the protobuf API only.  It really seems wrong to just assume that arbitrary 
bytes are valid UTF-8.  Note that ASCII is a real misnomer here; the only 
things that matter are "arbitrary binary data" (the type of 'data') and "UTF-8" 
(the format that the rendered JSON *must* be in).  I don't see anywhere here 
that ASCII is relevant.

Maybe it's possible to escape the 0xACED sequence we see with \uXXXX escapes, 
but I'm not sure that's possible, as those escapes produce UTF-16 code points, 
not binary data...
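
A short illustration of why the escape route cannot work, and what does, using 
the same two bytes from the hex dump (plain Python, for illustration only): 
JSON \u escapes denote UTF-16 code units, so they can only express characters, 
never raw bytes, while encoding the bytes (e.g. base64) before they reach the 
JSON layer round-trips cleanly:

{code}
import base64
import json

data = b'\xac\xed'  # the two bytes from the dump

# The bytes are not a valid UTF-8 sequence, so they cannot appear
# inside a JSON string as-is:
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("not representable as a JSON string:", e)

# Encoding before serialization round-trips without loss:
doc = json.dumps({"data": base64.b64encode(data).decode("ascii")})
assert base64.b64decode(json.loads(doc)["data"]) == data
print(doc)  # {"data": "rO0="}
{code}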

> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1
>Reporter: Steven Schlansker
>Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regards to proper character encoding:
> {code}
> 0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
> 0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
> 0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
> 0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u\u0005ur\|
> 0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u\u000f[Lsca|
> 0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
> {code}
> I suspect this is because the HTTP api emits the executorInfo.data directly:
> {code}
> JSON::Object model(const ExecutorInfo& executorInfo)
> {
>   JSON::Object object;
>   object.values["executor_id"] = executorInfo.executor_id().value();
>   object.values["name"] = executorInfo.name();
>   object.values["data"] = executorInfo.data();
>   object.values["framework_id"] = executorInfo.framework_id().value();
>   object.values["command"] = model(executorInfo.command());
>   object.values["resources"] = model(executorInfo.resources());
>   return object;
> }
> {code}
> I think this may be because the custom JSON processing library in stout seems 
> to not have any idea of what a byte array is.  I'm guessing that some 
> implicit conversion makes it get written as a String instead, but:
> {code}
> inline std::ostream& operator<<(std::ostream& out, const String& string)
> {
>   // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
>   // See RFC4627 for the JSON string specificiation.
>   return out << picojson::value(string.value).serialize();
> }
> {code}
> Thank you for any assistance here.  Our cluster is currently entirely down -- 
> the frameworks cannot handle parsing the invalid JSON produced (it is not 
> even valid utf-8)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-20 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965762#comment-14965762
 ] 

Steven Schlansker commented on MESOS-3771:
--

Similar, but potentially unrelated, issue: 
https://issues.apache.org/jira/browse/MESOS-3284

> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1
>Reporter: Steven Schlansker
>Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regards to proper character encoding:
> {code}
> 0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
> 0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
> 0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
> 0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u\u0005ur\|
> 0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u\u000f[Lsca|
> 0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
> {code}
> I suspect this is because the HTTP api emits the executorInfo.data directly:
> {code}
> JSON::Object model(const ExecutorInfo& executorInfo)
> {
>   JSON::Object object;
>   object.values["executor_id"] = executorInfo.executor_id().value();
>   object.values["name"] = executorInfo.name();
>   object.values["data"] = executorInfo.data();
>   object.values["framework_id"] = executorInfo.framework_id().value();
>   object.values["command"] = model(executorInfo.command());
>   object.values["resources"] = model(executorInfo.resources());
>   return object;
> }
> {code}
> I think this may be because the custom JSON processing library in stout seems 
> to not have any idea of what a byte array is.  I'm guessing that some 
> implicit conversion makes it get written as a String instead, but:
> {code}
> inline std::ostream& operator<<(std::ostream& out, const String& string)
> {
>   // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
>   // See RFC4627 for the JSON string specificiation.
>   return out << picojson::value(string.value).serialize();
> }
> {code}
> Thank you for any assistance here.  Our cluster is currently entirely down -- 
> the frameworks cannot handle parsing the invalid JSON produced (it is not 
> even valid utf-8)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-20 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-3771:


 Summary: Mesos JSON API creates invalid JSON due to lack of binary 
data / non-ASCII handling
 Key: MESOS-3771
 URL: https://issues.apache.org/jira/browse/MESOS-3771
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API
Affects Versions: 0.24.1
Reporter: Steven Schlansker
Priority: Critical


Spark encodes some binary data into the ExecutorInfo.data field.  This field is 
sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.

If you have such a field, it seems that it is splatted out into JSON without 
any regards to proper character encoding:

{quote}
0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u\u0005ur\|
0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u\u000f[Lsca|
0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
{quote}

I suspect this is because the HTTP api emits the executorInfo.data directly:

{code}
JSON::Object model(const ExecutorInfo& executorInfo)
{
  JSON::Object object;
  object.values["executor_id"] = executorInfo.executor_id().value();
  object.values["name"] = executorInfo.name();
  object.values["data"] = executorInfo.data();
  object.values["framework_id"] = executorInfo.framework_id().value();
  object.values["command"] = model(executorInfo.command());
  object.values["resources"] = model(executorInfo.resources());
  return object;
}
{code}

I think this may be because the custom JSON processing library in stout seems 
to not have any idea of what a byte array is.  I'm guessing that some implicit 
conversion makes it get written as a String instead, but:

{code}
inline std::ostream& operator<<(std::ostream& out, const String& string)
{
  // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
  // See RFC4627 for the JSON string specificiation.
  return out << picojson::value(string.value).serialize();
}
{code}

Thank you for any assistance here.  Our cluster is currently entirely down -- 
the frameworks cannot handle parsing the invalid JSON produced (it is not even 
valid utf-8)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3771) Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII handling

2015-10-20 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-3771:
-
Description: 
Spark encodes some binary data into the ExecutorInfo.data field.  This field is 
sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.

If you have such a field, it seems that it is splatted out into JSON without 
any regards to proper character encoding:

{code}
0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u\u0005ur\|
0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u\u000f[Lsca|
0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
{code}

I suspect this is because the HTTP api emits the executorInfo.data directly:

{code}
JSON::Object model(const ExecutorInfo& executorInfo)
{
  JSON::Object object;
  object.values["executor_id"] = executorInfo.executor_id().value();
  object.values["name"] = executorInfo.name();
  object.values["data"] = executorInfo.data();
  object.values["framework_id"] = executorInfo.framework_id().value();
  object.values["command"] = model(executorInfo.command());
  object.values["resources"] = model(executorInfo.resources());
  return object;
}
{code}

I think this may be because the custom JSON processing library in stout seems 
to not have any idea of what a byte array is.  I'm guessing that some implicit 
conversion makes it get written as a String instead, but:

{code}
inline std::ostream& operator<<(std::ostream& out, const String& string)
{
  // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
  // See RFC4627 for the JSON string specificiation.
  return out << picojson::value(string.value).serialize();
}
{code}

Thank you for any assistance here.  Our cluster is currently entirely down -- 
the frameworks cannot handle parsing the invalid JSON produced (it is not even 
valid utf-8)


  was:
Spark encodes some binary data into the ExecutorInfo.data field.  This field is 
sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.

If you have such a field, it seems that it is splatted out into JSON without 
any regards to proper character encoding:

{quote}
0006b0b0  2e 73 70 61 72 6b 2e 65  78 65 63 75 74 6f 72 2e  |.spark.executor.|
0006b0c0  4d 65 73 6f 73 45 78 65  63 75 74 6f 72 42 61 63  |MesosExecutorBac|
0006b0d0  6b 65 6e 64 22 7d 2c 22  64 61 74 61 22 3a 22 ac  |kend"},"data":".|
0006b0e0  ed 5c 75 30 30 30 30 5c  75 30 30 30 35 75 72 5c  |.\u\u0005ur\|
0006b0f0  75 30 30 30 30 5c 75 30  30 30 66 5b 4c 73 63 61  |u\u000f[Lsca|
0006b100  6c 61 2e 54 75 70 6c 65  32 3b 2e cc 5c 75 30 30  |la.Tuple2;..\u00|
{quote}

I suspect this is because the HTTP api emits the executorInfo.data directly:

{code}
JSON::Object model(const ExecutorInfo& executorInfo)
{
  JSON::Object object;
  object.values["executor_id"] = executorInfo.executor_id().value();
  object.values["name"] = executorInfo.name();
  object.values["data"] = executorInfo.data();
  object.values["framework_id"] = executorInfo.framework_id().value();
  object.values["command"] = model(executorInfo.command());
  object.values["resources"] = model(executorInfo.resources());
  return object;
}
{code}

I think this may be because the custom JSON processing library in stout seems 
to not have any idea of what a byte array is.  I'm guessing that some implicit 
conversion makes it get written as a String instead, but:

{code}
inline std::ostream& operator<<(std::ostream& out, const String& string)
{
  // TODO(benh): This escaping DOES NOT handle unicode, it encodes as ASCII.
  // See RFC4627 for the JSON string specificiation.
  return out << picojson::value(string.value).serialize();
}
{code}

Thank you for any assistance here.  Our cluster is currently entirely down -- 
the frameworks cannot handle parsing the invalid JSON produced (it is not even 
valid utf-8)



> Mesos JSON API creates invalid JSON due to lack of binary data / non-ASCII 
> handling
> ---
>
> Key: MESOS-3771
> URL: https://issues.apache.org/jira/browse/MESOS-3771
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.1
>Reporter: Steven Schlansker
>Priority: Critical
>
> Spark encodes some binary data into the ExecutorInfo.data field.  This field 
> is sent as a "bytes" Protobuf value, which can have arbitrary non-UTF8 data.
> If you have such a field, it seems that it is splatted out into JSON without 
> any regards to proper character encoding:
> {code}
> 0006b0b0  2e 73 70 61 7

[jira] [Commented] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

2015-08-31 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723936#comment-14723936
 ] 

Steven Schlansker commented on MESOS-2684:
--

Here is a similar presumed unintentional crasher that another user reported on 
the mailing list:

tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] 
CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory 


> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --
>
> Key: MESOS-2684
> URL: https://issues.apache.org/jira/browse/MESOS-2684
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.21.1
>Reporter: Steven Schlansker
> Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a 
> task.  If the task fails, that is unfortunate, but not the end of the world.  
> Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire 
> slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left 
> on device Failed to create executor directory 
> '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_KILLED 
> when mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion 
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

2015-08-31 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2684:
-
Comment: was deleted

(was: Here is a similar presumed unintentional crasher that another user 
reported on the mailing list:

tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] 
CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory 
)

> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --
>
> Key: MESOS-2684
> URL: https://issues.apache.org/jira/browse/MESOS-2684
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.21.1
>Reporter: Steven Schlansker
> Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a 
> task.  If the task fails, that is unfortunate, but not the end of the world.  
> Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire 
> slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left 
> on device Failed to create executor directory 
> '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_KILLED 
> when mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion 
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-08-11 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682072#comment-14682072
 ] 

Steven Schlansker edited comment on MESOS-2186 at 8/11/15 4:54 PM:
---

I strongly disagree with closing this bug: it is not fixed, and it is a serious 
issue affecting multiple end users.  We too have suffered production downtime 
directly attributable to this issue.  The ZOOKEEPER- bug tracks the actual fix; 
IMO this bug should then track integrating a fixed library into Mesos.


was (Author: stevenschlansker):
I strongly disagree with closing this bug, it is not fixed, and is a serious 
issue affecting multiple end users.  The ZOOKEEPER- bug tracks the actual fix, 
IMO this bug then should track integrating a fixed library into Mesos.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:   

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-08-11 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682072#comment-14682072
 ] 

Steven Schlansker commented on MESOS-2186:
--

I strongly disagree with closing this bug: it is not fixed, and it is a serious 
issue affecting multiple end users.  The ZOOKEEPER- bug tracks the actual fix; 
IMO this bug should then track integrating a fixed library into Mesos.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated fra

[jira] [Commented] (MESOS-1375) Log rotation capable

2015-05-15 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545915#comment-14545915
 ] 

Steven Schlansker commented on MESOS-1375:
--

Sorry to sound frustrated.  After digging a little bit, I realize that my issue 
is as much with Mesosphere's packaging of Mesos as it is the Mesos 
configuration itself.  There are a couple of issues here that all come together 
to make it very hard to create a "production ready" logrotate setup.

* GLOG's log rotation is wacky.  It seems to rotate logs in part based on 
service restarts, so the interval between rotations is extremely irregular.  We 
will have 10 log files created in quick succession if a slave has issues 
starting up (right now I have 20 files for a single day on which we had a lot 
of issues).  Other times, during periods of great stability but high task load, 
we will end up with a single log file that covers most of a month and grows to 
10GB.
* Mesosphere's init scripts do not allow easy customization of GLOG 
configuration (not that it is very configurable to start with).
* Mesosphere's init scripts hardwire stdout / stderr from mesos-master and 
mesos-slave to go to syslog's user facility, which is overloaded by just about 
every project that uses syslog.

My ideal setup honestly would be to pipe process stdout / stderr through 
something like Apache's 'rotatelogs' command (see the sketch at the end of this 
comment), or to improve the Mesos integration with 'logrotate' so it can signal 
properly and not need 'copytruncate', which has known race conditions.  I tried 
the logrotate 'hack' linked above and we did not find much success over three 
or four iterations.

It may be possible to get it working nicely, in which case maybe the only 
change needed is a documentation fix of "This is the official way to get Mesos 
log rotation to work" along with some user education.  Happy to expand on any 
of these points if that would be helpful.  Thanks!
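
For reference, the kind of external piping described above can be approximated 
in a few lines.  This is only a sketch -- the file name, size cap, and 
retention count are made up, and it is no substitute for rotatelogs itself:

{code}
import os
import sys

LOG = "mesos-master.log"       # made-up file name
MAX_BYTES = 100 * 1024 * 1024  # made-up size cap
KEEP = 5                       # made-up retention count

def rotate():
    # Shift mesos-master.log.1 -> .2 and so on, dropping the oldest.
    for i in range(KEEP - 1, 0, -1):
        src = "%s.%d" % (LOG, i)
        if os.path.exists(src):
            os.rename(src, "%s.%d" % (LOG, i + 1))
    os.rename(LOG, LOG + ".1")

# Usage: mesos-master ... 2>&1 | python rotate.py
out = open(LOG, "ab")
for line in sys.stdin.buffer:
    if out.tell() + len(line) > MAX_BYTES:
        out.close()
        rotate()
        out = open(LOG, "ab")
    out.write(line)
out.close()
{code}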


> Log rotation capable
> 
>
> Key: MESOS-1375
> URL: https://issues.apache.org/jira/browse/MESOS-1375
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.18.0
>Reporter: Damien Hardy
>  Labels: ops, twitter
>
> Please provide a way to let ops manage logs.
> A log4j like configuration would be hard but make rotation capable without 
> restarting the service at least. 
> Based on external logrotate tool would be great :
>  * write to a constant log file name
>  * check for file change (recreated by logrotate) before write



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-1375) Log rotation capable

2015-05-07 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533595#comment-14533595
 ] 

Steven Schlansker edited comment on MESOS-1375 at 5/7/15 11:30 PM:
---

A year later, and our disks are still filling up due to this issue.  This is a 
really sad problem to have in 2015 :(



was (Author: stevenschlansker):
A year later, and our disks are still filling up due to this issue.  This is a 
really sad problem to have in 2015 :(
Standard GLOG mechanics are probably not good enough until MESOS-2193 gets some 
attention. 

> Log rotation capable
> 
>
> Key: MESOS-1375
> URL: https://issues.apache.org/jira/browse/MESOS-1375
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.18.0
>Reporter: Damien Hardy
>  Labels: ops, twitter
>
> Please provide a way to let ops manage logs.
> A log4j like configuration would be hard but make rotation capable without 
> restarting the service at least. 
> Based on external logrotate tool would be great :
>  * write to a constant log file name
>  * check for file change (recreated by logrotate) before write



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1375) Log rotation capable

2015-05-07 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533595#comment-14533595
 ] 

Steven Schlansker commented on MESOS-1375:
--

A year later, and our disks are still filling up due to this issue.  This is a 
really sad problem to have in 2015 :(
Standard GLOG mechanics are probably not good enough until MESOS-2193 gets some 
attention. 

> Log rotation capable
> 
>
> Key: MESOS-1375
> URL: https://issues.apache.org/jira/browse/MESOS-1375
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.18.0
>Reporter: Damien Hardy
>  Labels: ops, twitter
>
> Please provide a way to let ops manage logs.
> A log4j like configuration would be hard but make rotation capable without 
> restarting the service at least. 
> Based on external logrotate tool would be great :
>  * write to a constant log file name
>  * check for file change (recreated by logrotate) before write



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

2015-05-01 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524149#comment-14524149
 ] 

Steven Schlansker commented on MESOS-2684:
--

Sorry if I wasn't clear; I mentioned that these files have no non-application 
output.  For completeness:

stderr:

I0421 21:05:14.850749 13546 exec.cpp:132] Version: 0.21.1
I0421 21:05:14.862670 13559 exec.cpp:206] Executor registered on slave 
20150327-194449-419644938-5050-1649-S71

stdout:

Registered executor on 10.70.8.160
Starting task 
pp-request-bookings-teamcity.2015.04.02T15.58.28-1429650229399-2-10.70.8.160-us_west_2b
Forked command at 13575
/bin/sh -c exit `docker wait mesos-8d3b46d5-99d6-4994-a7e4-df66aa34ae89` 
2015-04-21T21:05:15.954Z, LOGGER ERROR, log client is undefined!, 
{"@timestamp":"2015-04-21T21:05:15.954Z","servicetype":"requestbookings","logname":"LOGGER
 ERROR","formatversion":"v1","type":"requestbookings-LOGGER 
ERROR-v1","host":"10.70.8.160","sequencenumber":1,"message":"log client is 
undefined!"}
2015-04-21T21:05:15.953Z, salesforceConnection, , 
{"@timestamp":"2015-04-21T21:05:15.953Z","servicetype":"requestbookings","logname":"salesforceConnection","formatversion":"v1","type":"requestbookings-salesforceConnection-v1","host":"10.70.8.160","sequencenumber":1,"durationMs":226}
Connection to redis closed. It will reopen when logs will need sending.
Connection to redis closed. It will reopen when logs will need sending.

Yeah, those errors are not great, those pesky end users... but I don't see any 
executor output just the same

> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --
>
> Key: MESOS-2684
> URL: https://issues.apache.org/jira/browse/MESOS-2684
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Affects Versions: 0.21.1
>Reporter: Steven Schlansker
> Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a 
> task.  If the task fails, that is unfortunate, but not the end of the world.  
> Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire 
> slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left 
> on device Failed to create executor directory 
> '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_KILLED 
> when mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion 
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

2015-05-01 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524121#comment-14524121
 ] 

Steven Schlansker commented on MESOS-2684:
--

We are using the built-in Docker executor.  The attachment is the complete 
contents of mesos-slave.INFO.log up through the point in time when new 
containers start launching to replace the lost ones.  I am not aware of 
separate executor logs; is there somewhere else I should look?  The task stderr 
and stdout do not have any non-application output past executor launch.


> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --
>
> Key: MESOS-2684
> URL: https://issues.apache.org/jira/browse/MESOS-2684
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.21.1
>Reporter: Steven Schlansker
> Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a 
> task.  If the task fails, that is unfortunate, but not the end of the world.  
> Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire 
> slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left 
> on device Failed to create executor directory 
> '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_KILLED 
> when mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion 
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

2015-05-01 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2684:
-
Description: 
mesos-slave can encounter a variety of problems while attempting to launch a 
task.  If the task fails, that is unfortunate, but not the end of the world.  
Other tasks should not be affected.

However, if the task failure happens to trigger an assertion, the entire slave 
comes crashing down:

F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on 
device Failed to create executor directory 
'/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'

Immediately afterwards, all tasks on this slave were declared TASK_KILLED when 
mesos-slave restarted.

Something as simple as a 'mkdir' failing is not worthy of an assertion failure.

  was:
mesos-slave can encounter a variety of problems while attempting to launch a 
task.  If the task fails, that is unfortunate, but not the end of the world.  
Other tasks should not be affected.

However, if the task failure happens to trigger an assertion, the entire slave 
comes crashing down:

F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on 
device Failed to create executor directory 
'/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'

Immediately afterwards, all tasks on this slave were declared TASK_LOST when 
mesos-slave restarted.

Something as simple as a 'mkdir' failing is not worthy of an assertion failure.


> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --
>
> Key: MESOS-2684
> URL: https://issues.apache.org/jira/browse/MESOS-2684
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.21.1
>Reporter: Steven Schlansker
> Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a 
> task.  If the task fails, that is unfortunate, but not the end of the world.  
> Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire 
> slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left 
> on device Failed to create executor directory 
> '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_KILLED 
> when mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion 
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

2015-05-01 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524006#comment-14524006
 ] 

Steven Schlansker edited comment on MESOS-2684 at 5/1/15 9:35 PM:
--

I've attached the log from slave restart.  The FATAL error above was the last 
line written before the abort; this is the head of the new log file created on 
restart.  I misspoke about LOST; it was actually KILLED.


was (Author: stevenschlansker):
I've attached the log from slave restart.  The FATAL error above was the last 
line written before the abort, this is the head of the new log file created on 
restart.

> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --
>
> Key: MESOS-2684
> URL: https://issues.apache.org/jira/browse/MESOS-2684
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.21.1
>Reporter: Steven Schlansker
> Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a 
> task.  If the task fails, that is unfortunate, but not the end of the world.  
> Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire 
> slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left 
> on device Failed to create executor directory 
> '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_LOST when 
> mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion 
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

2015-05-01 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2684:
-
Attachment: mesos-slave-restart.txt

I've attached the log from slave restart.  The FATAL error above was the last 
line written before the abort; this is the head of the new log file created on 
restart.

> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --
>
> Key: MESOS-2684
> URL: https://issues.apache.org/jira/browse/MESOS-2684
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.21.1
>Reporter: Steven Schlansker
> Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a 
> task.  If the task fails, that is unfortunate, but not the end of the world.  
> Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire 
> slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left 
> on device Failed to create executor directory 
> '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_LOST when 
> mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion 
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

2015-05-01 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-2684:


 Summary: mesos-slave should not abort when a single task has e.g. 
a 'mkdir' failure
 Key: MESOS-2684
 URL: https://issues.apache.org/jira/browse/MESOS-2684
 Project: Mesos
  Issue Type: Bug
  Components: slave
Affects Versions: 0.21.1
Reporter: Steven Schlansker


mesos-slave can encounter a variety of problems while attempting to launch a 
task.  If the task fails, that is unfortunate, but not the end of the world.  
Other tasks should not be affected.

However, if the task failure happens to trigger an assertion, the entire slave 
comes crashing down:

F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on 
device Failed to create executor directory 
'/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'

Immediately afterwards, all tasks on this slave were declared TASK_LOST when 
mesos-slave restarted.

Something as simple as a 'mkdir' failing is not worthy of an assertion failure.
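
To make the distinction concrete, a sketch in Python rather than in Mesos's 
C++ (the directory path only mimics the shape of the real one): the difference 
is between treating a mkdir failure as a task-level error and treating it as a 
process-level invariant violation.

{code}
import os

# Illustrative path only, shaped like the real executor directory.
executor_dir = "/mnt/mesos/slaves/S71/frameworks/F/executors/E/runs/R"

# Process-level invariant, the current behavior: an uncaught failure
# aborts the whole slave, taking every other running task with it.
# os.makedirs(executor_dir)  # uncaught OSError ~ CHECK_SOME-style abort

# Task-level error, the requested behavior: the one launch fails while
# the slave and its other tasks keep running.
try:
    os.makedirs(executor_dir)
except OSError as e:
    print("TASK_FAILED: could not create executor directory:", e)
{code}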



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1865) Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information

2015-04-14 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494506#comment-14494506
 ] 

Steven Schlansker commented on MESOS-1865:
--

Yes.  Or it should return the correct results.  Really, it should do just about 
anything rather than returning a valid but incorrect result.
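
The race described in the report is easy to see from the client's side.  A 
sketch of the two-step lookup that can silently go wrong (the master address 
is a placeholder, and the 'pid' / 'leader' fields are assumed from 0.20.x-era 
state.json):

{code}
import json
import urllib.request

def get(url):
    return json.loads(urllib.request.urlopen(url).read().decode("utf-8"))

master = "http://master1.example.com:5050"  # placeholder address

# Step 1: ask whether this master currently leads.
state = get(master + "/master/state.json")
pid, leader = state.get("pid"), state.get("leader")
i_am_leader = pid is not None and pid == leader

# Step 2: ask for tasks.  Even if step 1 said this master leads, the
# leader may have failed over in between; a non-leading master answers
# with a successful-looking empty list and no error.
tasks = get(master + "/master/tasks.json")["tasks"]
if not i_am_leader:
    print("warning: queried a non-leading master; answer may be bogus")
{code}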

> Mesos APIs for non-leading masters should return copies of the leader's state 
> or an error, not a success with incorrect information
> ---
>
> Key: MESOS-1865
> URL: https://issues.apache.org/jira/browse/MESOS-1865
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Some of the API endpoints, for example /master/tasks.json, will return bogus 
> information if you query a non-leading master:
> {code}
> [steven@Anesthetize:~]% curl 
> http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": [
> {
>   "executor_id": "",
>   "framework_id": "20140724-231003-419644938-5050-1707-",
>   "id": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "name": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "resources": {
> "cpus": 0.25,
> "disk": 0,
> {code}
> This is very hard for end-users to work around.  For example if I query 
> "which master is leading" followed by "leader: which tasks are running" it is 
> possible that the leader fails over in between, leaving me with an incorrect 
> answer and no way to know that this happened.
> In my opinion the API should return the correct response (by asking the 
> current leader?) or an error (500 Not the leader?) but it's unacceptable to 
> return a successful wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-03-24 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377941#comment-14377941
 ] 

Steven Schlansker commented on MESOS-2162:
--

There is a library linked above which may be a good starting point.  
Alternately, if you want to start off simple, you could consider instead 
executing the 'rkt' command line tool rather than building it into the executor 
directly.  I believe this is how Docker works now.  It may actually be very 
straightforward to clone the Docker approach and replace all Docker options 
with their Rocket equivalents.
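
A minimal sketch of that shell-out approach, using plain std::system rather 
than libprocess's subprocess utilities ('rkt run IMAGE' is assumed as the CLI 
shape; cgroup wiring, sandbox setup, and status reporting are all elided):

{code}
#include <cstdlib>
#include <string>

// Sketch: delegate the container launch to the 'rkt' CLI, mirroring how
// the Docker containerizer shells out to the 'docker' binary.
int launchWithRkt(const std::string& image)
{
  const std::string command = "rkt run " + image;
  return std::system(command.c_str());  // rkt's exit status, shell-encoded
}
{code}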

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: gsoc2015, mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295983#comment-14295983
 ] 

Steven Schlansker commented on MESOS-2162:
--

I would love to help out in any way I can, but I am not much of a C++ guy. But 
at the very least I would happily test it, or if you have other suggestions for 
how I can help...

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295981#comment-14295981
 ] 

Steven Schlansker commented on MESOS-2162:
--

I would love to help out in any way I can, but I am not much of a C++ guy.  But 
at the very least I would happily test it, or if you have other suggestions for 
how I can help...

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2162:
-
Comment: was deleted

(was: I would love to help out in any way I can, but I am not much of a C++ 
guy.  But at the very least I would happily test it, or if you have other 
suggestions for how I can help...)

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295944#comment-14295944
 ] 

Steven Schlansker commented on MESOS-2162:
--

This library may be a good starting point: https://github.com/cdaylward/libappc/

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295939#comment-14295939
 ] 

Steven Schlansker commented on MESOS-2162:
--

Any possibility of getting this scheduled for an upcoming release?

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

2015-01-14 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277742#comment-14277742
 ] 

Steven Schlansker commented on MESOS-1949:
--

Yes, it'd be good enough for this specific case.  But this has been a pattern 
and I'm sure we'll find more cases as we go along :)

> All log messages from master, slave, executor, etc. should be collected on a 
> per-task basis
> ---
>
> Key: MESOS-1949
> URL: https://issues.apache.org/jira/browse/MESOS-1949
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Currently through a task's lifecycle, various debugging information is 
> created at different layers of the Mesos ecosystem.  The framework will log 
> task information, the master deals with resource allocation, the slave 
> actually allocates those resources, and the executor does the work of 
> launching the task.
> If anything through that pipeline fails, the end user is left with little but 
> a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful 
> information (for example a "Docker pull failed because repository didn't 
> exist") is hidden in one of four or five different places, potentially spread 
> across as many different machines.  This leads to unpleasant and repetitive 
> searching through logs looking for a clue to what went wrong.
> Collating logs on a per-task basis would give the end user a much friendlier 
> way of figuring out exactly where in this process something went wrong, and 
> likely much faster resolution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

2015-01-14 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277399#comment-14277399
 ] 

Steven Schlansker edited comment on MESOS-1949 at 1/14/15 6:33 PM:
---

Well, it's not quite as urgent as I thought.  But there's still a lot of 
information that is hidden in log files and is very hard to correlate.  For 
example, I had a task die with 
{code}
I0106 20:08:04.998108  1625 docker.cpp:928] Starting container 
'78065406-449e-4103-85c1-bbfab09d7372' for task 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 (and executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a')
 of framework 'Singularity'
E0106 20:08:05.221181  1624 slave.cpp:2787] Container 
'78065406-449e-4103-85c1-bbfab09d7372' for executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed to start: Port [4111] not included in 
resources
E0106 20:08:05.277864  1622 slave.cpp:2882] Termination of executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed: Unknown container: 
78065406-449e-4103-85c1-bbfab09d7372
{code}
but the "message" field only has "Abnormal executor termination"

Whenever something like this happens, application developers come to me -- they 
don't have the knowledge to trawl through Mesos logs (arguably a developer 
education problem, but the tools could help much more!).  You can find the 
Mesos slave logs through the UI, but you have to do a lot of correlation 
yourself -- you have to find the right slave, dig through the messages looking 
only for the ones relevant to your task, etc.

If all of the logs relevant to one task were collected in one place, this would 
be much easier.  Makes sense?


was (Author: stevenschlansker):
Well, it's not quite as urgent as I thought.  But there's still a lot of 
information that is hidden in log files and is very hard to correlate.  For 
example, I had a task die with 
{code}
I0106 20:08:04.998108  1625 docker.cpp:928] Starting container 
'78065406-449e-4103-85c1-bbfab09d7372' for task 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 (and executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a')
 of framework 'Singularity'
E0106 20:08:05.221181  1624 slave.cpp:2787] Container 
'78065406-449e-4103-85c1-bbfab09d7372' for executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed to start: Port [4111] not included in 
resources
E0106 20:08:05.277864  1622 slave.cpp:2882] Termination of executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed: Unknown container: 
78065406-449e-4103-85c1-bbfab09d7372
{code}
but the "message" field only has "Abnormal executor termination"

Whenever something like this happens, application developers come to me -- they 
don't have any way to see the Mesos slave logs (no login permissions in 
general).  You can find the Mesos slave logs through the UI, but you have to do 
a lot of correlation yourself -- you have to find the right slave, dig through 
the messages, etc.

If all of the logs relevant to one task were collected in one place, this would 
be much easier.  Makes sense?

> All log messages from master, slave, executor, etc. should be collected on a 
> per-task basis
> ---
>
> Key: MESOS-1949
> URL: https://issues.apache.org/jira/browse/MESOS-1949
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Currently through a task's lifecycle, various debugging information is 
> created at different layers of the Mesos ecosystem.  The framework will log 
> task information, the master deals with resource allocation, the slave 
> actually allocates those resources, and the executor does the work of 
> launching the task.
> If anything through that pipeline fails, the end user is left with little but 
> a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful 
> information (for example a "Docker pull failed because repository didn't 
> exist") is hidden in one of four or five different places, potentially spread 
> across as many different machines.  This leads to unpleasant and repetitive 
> searching through logs looking for a clue to what went wrong.
> Collating logs on a per-task basis would give the end user a much friendlier 
> way of figuring out exactly where in this process something went wrong, and 
> likely much faster resolution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

2015-01-14 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277399#comment-14277399
 ] 

Steven Schlansker commented on MESOS-1949:
--

Well, it's not quite as urgent as I thought.  But there's still a lot of 
information that is hidden in log files and is very hard to correlate.  For 
example, I had a task die with 
{code}
I0106 20:08:04.998108  1625 docker.cpp:928] Starting container 
'78065406-449e-4103-85c1-bbfab09d7372' for task 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 (and executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a')
 of framework 'Singularity'
E0106 20:08:05.221181  1624 slave.cpp:2787] Container 
'78065406-449e-4103-85c1-bbfab09d7372' for executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed to start: Port [4111] not included in 
resources
E0106 20:08:05.277864  1622 slave.cpp:2882] Termination of executor 
'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a'
 of framework 'Singularity' failed: Unknown container: 
78065406-449e-4103-85c1-bbfab09d7372
{code}
but the "message" field only has "Abnormal executor termination"

Whenever something like this happens, application developers come to me -- they 
don't have any way to see the Mesos slave logs (no login permissions in 
general).  You can find the Mesos slave logs through the UI, but you have to do 
a lot of correlation yourself -- you have to find the right slave, dig through 
the messages, etc.

If all of the logs relevant to one task were collected in one place, this would 
be much easier.  Makes sense?

> All log messages from master, slave, executor, etc. should be collected on a 
> per-task basis
> ---
>
> Key: MESOS-1949
> URL: https://issues.apache.org/jira/browse/MESOS-1949
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Currently through a task's lifecycle, various debugging information is 
> created at different layers of the Mesos ecosystem.  The framework will log 
> task information, the master deals with resource allocation, the slave 
> actually allocates those resources, and the executor does the work of 
> launching the task.
> If anything through that pipeline fails, the end user is left with little but 
> a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful 
> information (for example a "Docker pull failed because repository didn't 
> exist") is hidden in one of four or five different places, potentially spread 
> across as many different machines.  This leads to unpleasant and repetitive 
> searching through logs looking for a clue to what went wrong.
> Collating logs on a per-task basis would give the end user a much friendlier 
> way of figuring out exactly where in this process something went wrong, and 
> likely much faster resolution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2212) Better handling of errors during `docker wait`

2015-01-09 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2212:
-
Description: 
Currently, the Docker containerizer executes an "exit $(docker wait 
$CONTAINER_NAME)".  This misses a couple of edge cases in the 'docker wait' API 
-- notably, if an OOM condition occurs, it will return "-1" (which is not a 
valid exit code for sh, causing an error; see 
https://issues.apache.org/jira/browse/MESOS-2209).

If a Docker container OOMs, the 'docker inspect' output will set 
'State.OOMKilled' to 'true' and 'docker wait' will return -1.  This should be 
handled more gracefully.  In particular, setting the message to indicate that 
the OOM killer intervened would be very useful, so that end users can know the 
real reason their task died.

{code}
"State": {
"Error": "",
"ExitCode": -1,
"FinishedAt": "2015-01-08T18:38:39.834089879Z",
"OOMKilled": true,
"Paused": false,
"Pid": 0,
"Restarting": false,
"Running": false,
"StartedAt": "2015-01-08T18:38:39.309034983Z"
}
{code}

I've filed a bug on Docker as well: https://github.com/docker/docker/issues/9979

  was:
Currently, the Docker containerizer executes an "exit $(docker wait 
$CONTAINER_NAME)".  This misses a couple of edge cases in the 'docker wait' API 
-- notably, if an OOM condition occurs, it will return "-1" (which is not a 
valid exit code for sh, causing an error; see 
https://issues.apache.org/jira/browse/MESOS-2209).

If a Docker container OOMs, the 'docker inspect' output will set 
'State.OOMKilled' to 'true' and 'docker wait' will return -1.  This should be 
handled more gracefully.

{code}
"State": {
"Error": "",
"ExitCode": -1,
"FinishedAt": "2015-01-08T18:38:39.834089879Z",
"OOMKilled": true,
"Paused": false,
"Pid": 0,
"Restarting": false,
"Running": false,
"StartedAt": "2015-01-08T18:38:39.309034983Z"
}
{code}

I've filed a bug on Docker as well: https://github.com/docker/docker/issues/9979


> Better handling of errors during `docker wait`
> --
>
> Key: MESOS-2212
> URL: https://issues.apache.org/jira/browse/MESOS-2212
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.21.0
>Reporter: Steven Schlansker
>
> Currently, the Docker containerizer executes an "exit $(docker wait 
> $CONTAINER_NAME)".  This misses a couple of edge cases in the 'docker wait' 
> API -- notably, if an OOM condition occurs, it will return "-1" (which is not 
> a valid exit code for sh, causing an error; see 
> https://issues.apache.org/jira/browse/MESOS-2209).
> If a Docker container OOMs, the 'docker inspect' output will set 
> 'State.OOMKilled' to 'true' and 'docker wait' will return -1.  This should be 
> handled more gracefully.  In particular, setting the message to indicate that 
> the OOM killer intervened would be very useful, so that end users can know the 
> real reason their task died.
> {code}
> "State": {
> "Error": "",
> "ExitCode": -1,
> "FinishedAt": "2015-01-08T18:38:39.834089879Z",
> "OOMKilled": true,
> "Paused": false,
> "Pid": 0,
> "Restarting": false,
> "Running": false,
> "StartedAt": "2015-01-08T18:38:39.309034983Z"
> }
> {code}
> I've filed a bug on Docker as well: 
> https://github.com/docker/docker/issues/9979
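
A sketch of the more graceful handling suggested above: after the container 
exits, ask 'docker inspect' whether the kernel OOM killer intervened (the 
'--format' flag and the 'State.OOMKilled' field come from the inspect output 
quoted in the description):

{code}
#include <cstdio>
#include <string>

// Sketch: detect an OOM kill so the task's status message can carry the
// real reason instead of an invalid exit code.
bool wasOOMKilled(const std::string& container)
{
  const std::string command =
    "docker inspect --format '{{.State.OOMKilled}}' " + container;

  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return false;  // could not ask Docker; assume no OOM
  }

  char buffer[8] = {0};
  const char* line = fgets(buffer, sizeof(buffer), pipe);
  pclose(pipe);

  return line != nullptr && std::string(line).rfind("true", 0) == 0;
}
{code}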



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2212) Better handling of errors during `docker wait`

2015-01-09 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-2212:


 Summary: Better handling of errors during `docker wait`
 Key: MESOS-2212
 URL: https://issues.apache.org/jira/browse/MESOS-2212
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 0.21.0
Reporter: Steven Schlansker


Currently, the Docker containerizer executes an "exit $(docker wait 
$CONTAINER_NAME)".  This misses a couple of edge cases in the 'docker wait' API 
-- notably, if an OOM condition occurs, it will return "-1" (which is not a 
valid exit code for sh, causing an error; see 
https://issues.apache.org/jira/browse/MESOS-2209).

If a Docker container OOMs, the 'docker inspect' output will set 
'State.OOMKilled' to 'true' and 'docker wait' will return -1.  This should be 
handled more gracefully.

{code}
"State": {
"Error": "",
"ExitCode": -1,
"FinishedAt": "2015-01-08T18:38:39.834089879Z",
"OOMKilled": true,
"Paused": false,
"Pid": 0,
"Restarting": false,
"Running": false,
"StartedAt": "2015-01-08T18:38:39.309034983Z"
}
{code}

I've filed a bug on Docker as well: https://github.com/docker/docker/issues/9979



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2209) Mesos should not use negative exit codes

2015-01-08 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2209:
-
Description: 
POSIX restricts exit codes to an unsigned 8-bit integer.
Mesos has a number of places where it exits with status -1, which is outside 
that range.  With some shells (notably dash, which is /bin/sh on Debian-based 
systems) this causes an error:

{code}
/bin/sh: 1: exit: Illegal number: -1
{code}

An example of where this is done is in exec.cpp

{code}
  void kill()
  {
VLOG(1) << "Committing suicide by killing the process group";

// TODO(vinod): Invoke killtree without killing ourselves.
// Kill the process group (including ourself).
killpg(0, SIGKILL);

// The signal might not get delivered immediately, so sleep for a
// few seconds. Worst case scenario, exit abnormally.
os::sleep(Seconds(5));
exit(-1);
  }
{code}

The code needs to be audited to always return valid exit codes.

  was:
POSIX restricts exit codes to an unsigned 8-bit integer.
Mesos has a number of places where it exits with status -1, which is outside 
that range.  With some shells (notably dash, which is /bin/sh on Debian-based 
systems) this causes an error:

{code}
/bin/sh: 1: exit: Illegal number: -1
{code}

An example of where this is done is in exec.cpp

{code}
  void kill()
  {
VLOG(1) << "Committing suicide by killing the process group";

// TODO(vinod): Invoke killtree without killing ourselves.
// Kill the process group (including ourself).
killpg(0, SIGKILL);

// The signal might not get delivered immediately, so sleep for a
// few seconds. Worst case scenario, exit abnormally.
os::sleep(Seconds(5));
exit(-1);
  }
{code}



> Mesos should not use negative exit codes
> 
>
> Key: MESOS-2209
> URL: https://issues.apache.org/jira/browse/MESOS-2209
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.21.0
>Reporter: Steven Schlansker
>Priority: Minor
>
> POSIX restricts exit codes to an unsigned 8-bit integer.
> Mesos has a number of places where it exits with status -1, which is outside 
> that range.  With some shells (notably dash, which is /bin/sh on Debian-based 
> systems) this causes an error:
> {code}
> /bin/sh: 1: exit: Illegal number: -1
> {code}
> An example of where this is done is in exec.cpp
> {code}
>   void kill()
>   {
> VLOG(1) << "Committing suicide by killing the process group";
> // TODO(vinod): Invoke killtree without killing ourselves.
> // Kill the process group (including ourself).
> killpg(0, SIGKILL);
> // The signal might not get delivered immediately, so sleep for a
> // few seconds. Worst case scenario, exit abnormally.
> os::sleep(Seconds(5));
> exit(-1);
>   }
> {code}
> The code needs to be audited to always return valid exit codes.
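
One concrete shape for that audit is a single choke point that every exit path 
goes through (a sketch, not Mesos code):

{code}
#include <cstdlib>

// Sketch: guarantee the status handed back to the shell is in the
// POSIX-valid 0-255 range, so 'exit -1' can never be generated.
[[noreturn]] void exitWithStatus(int status)
{
  std::exit(status & 0xFF);  // e.g. -1 becomes 255, not an illegal number
}
{code}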



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2209) Mesos should not use negative exit codes

2015-01-08 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-2209:


 Summary: Mesos should not use negative exit codes
 Key: MESOS-2209
 URL: https://issues.apache.org/jira/browse/MESOS-2209
 Project: Mesos
  Issue Type: Bug
  Components: slave
Affects Versions: 0.21.0
Reporter: Steven Schlansker
Priority: Minor


POSIX restricts exit codes to an unsigned 8-bit integer.
Mesos has a number of places where it exits with status -1, which is outside 
that range.  With some shells (notably dash, which is /bin/sh on Debian-based 
systems) this causes an error:

{code}
/bin/sh: 1: exit: Illegal number: -1
{code}

An example of where this is done is in exec.cpp

{code}
  void kill()
  {
VLOG(1) << "Committing suicide by killing the process group";

// TODO(vinod): Invoke killtree without killing ourselves.
// Kill the process group (including ourself).
killpg(0, SIGKILL);

// The signal might not get delivered immediately, so sleep for a
// few seconds. Worst case scenario, exit abnormally.
os::sleep(Seconds(5));
exit(-1);
  }
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2024) mesos debian packaging should work on a java 8 install without java7

2014-10-31 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192288#comment-14192288
 ] 

Steven Schlansker commented on MESOS-2024:
--

Hm, this might actually be a Mesosphere issue (Mesos should really take the 
packaging into the main build!)
Maybe this PR addresses the issue? 
https://github.com/mesosphere/mesos-deb-packaging/pull/24

> mesos debian packaging should work on a java 8 install without java7
> 
>
> Key: MESOS-2024
> URL: https://issues.apache.org/jira/browse/MESOS-2024
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> The mesos .deb file:
> root@myhost:~# dpkg-query -s mesos
> Version: 0.20.1-1.0.ubuntu1404
> Depends: java7-runtime-headless | java6-runtime-headless, libcurl3
> Recommends: zookeeper, zookeeperd, zookeeper-bin
> We run java8, but installing the mesos package always drags in java7.
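
The referenced PR presumably widens the dependency alternation, along the 
lines of the control-file entry below (the exact virtual package names are an 
assumption about the Debian/Ubuntu archives of the time):

{code}
Depends: java8-runtime-headless | java7-runtime-headless | java6-runtime-headless, libcurl3
{code}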



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2024) mesos debian packaging should work on a java 8 install without java7

2014-10-31 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2024:
-
Summary: mesos debian packaging should work on a java 8 install without 
java7  (was: mesos debian packaging should work on a java 8 install)

> mesos debian packaging should work on a java 8 install without java7
> 
>
> Key: MESOS-2024
> URL: https://issues.apache.org/jira/browse/MESOS-2024
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> The mesos .deb file:
> root@myhost:~# dpkg-query -s mesos
> Version: 0.20.1-1.0.ubuntu1404
> Depends: java7-runtime-headless | java6-runtime-headless, libcurl3
> Recommends: zookeeper, zookeeperd, zookeeper-bin
> We run java8, but installing the mesos package always drags in java7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2024) mesos debian packaging should work on a java 8 install

2014-10-31 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-2024:


 Summary: mesos debian packaging should work on a java 8 install
 Key: MESOS-2024
 URL: https://issues.apache.org/jira/browse/MESOS-2024
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.20.1
Reporter: Steven Schlansker


The mesos .deb file:
root@myhost:~# dpkg-query -s mesos
Version: 0.20.1-1.0.ubuntu1404
Depends: java7-runtime-headless | java6-runtime-headless, libcurl3
Recommends: zookeeper, zookeeperd, zookeeper-bin


We run java8, but installing the mesos package always drags in java7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2023) mesos-execute should allow setting environment variables

2014-10-31 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-2023:


 Summary: mesos-execute should allow setting environment variables
 Key: MESOS-2023
 URL: https://issues.apache.org/jira/browse/MESOS-2023
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.20.1
Reporter: Steven Schlansker


mesos-execute does not allow setting various properties of the 'CommandInfo' 
protobuf.  Most notably, being able to set environment variables and URIs would 
be very useful.
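
For context, the relevant fields already exist on the CommandInfo protobuf; a 
sketch of what a flag-driven mesos-execute would populate (the flag parsing is 
omitted and the example values are hypothetical):

{code}
#include <mesos/mesos.hpp>

// Sketch: fill in the environment and URI fields of CommandInfo that
// mesos-execute currently leaves untouched.
mesos::CommandInfo buildCommand()
{
  mesos::CommandInfo command;
  command.set_value("env && ./run.sh");

  mesos::Environment::Variable* variable =
    command.mutable_environment()->add_variables();
  variable->set_name("SERVICE_PORT");
  variable->set_value("8080");

  command.add_uris()->set_value("http://example.com/artifact.tar.gz");
  return command;
}
{code}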



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1976) Sandbox browse UI has path which is not selectable

2014-10-23 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-1976:


 Summary: Sandbox browse UI has path which is not selectable
 Key: MESOS-1976
 URL: https://issues.apache.org/jira/browse/MESOS-1976
 Project: Mesos
  Issue Type: Bug
  Components: webui
Affects Versions: 0.20.1
Reporter: Steven Schlansker
Priority: Minor


The Sandbox UI displays the path being browsed as a series of links.  It is 
not possible to copy the path from this; it ends up being formatted as, e.g.,

{code}
 mnt
mesos
slaves
20141022-230146-2500085258-5050-1554-3
frameworks
Singularity
executors
ci-discovery-singularity-bridge-steven.2014.10.21T21.00.04-1414092693380-2-10-us_west_2a
runs
554eebb3-126d-42bd-95c2-aa8282b05522 
{code}

instead of the expected

{code}
/mnt/mesos/slaves/20141022-230146-2500085258-5050-1554-3/frameworks/Singularity/executors/ci-discovery-singularity-bridge-steven.2014.10.21T21.00.04-1414092693380-2-10-us_west_2a/runs/554eebb3-126d-42bd-95c2-aa8282b05522
 
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

2014-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178899#comment-14178899
 ] 

Steven Schlansker commented on MESOS-1949:
--

I did, and it turns out at least in Singularity's case I'd just missed it 
(hidden on a detail view I don't click often) 
https://github.com/HubSpot/Singularity/issues/266

> All log messages from master, slave, executor, etc. should be collected on a 
> per-task basis
> ---
>
> Key: MESOS-1949
> URL: https://issues.apache.org/jira/browse/MESOS-1949
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Currently through a task's lifecycle, various debugging information is 
> created at different layers of the Mesos ecosystem.  The framework will log 
> task information, the master deals with resource allocation, the slave 
> actually allocates those resources, and the executor does the work of 
> launching the task.
> If anything through that pipeline fails, the end user is left with little but 
> a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful 
> information (for example a "Docker pull failed because repository didn't 
> exist") is hidden in one of four or five different places, potentially spread 
> across as many different machines.  This leads to unpleasant and repetitive 
> searching through logs looking for a clue to what went wrong.
> Collating logs on a per-task basis would give the end user a much friendlier 
> way of figuring out exactly where in this process something went wrong, and 
> likely much faster resolution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

2014-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178614#comment-14178614
 ] 

Steven Schlansker commented on MESOS-1949:
--

Interesting.  It seems that neither Singularity nor Marathon exposes the 
`message` field of TaskStatus anywhere useful.  Would it be possible to expose 
it via the Mesos web interface?

Having the message would be great and solve 80% of the use cases here.  I still 
think collating the logs is a useful feature, though, for when something more 
complex goes wrong.

> All log messages from master, slave, executor, etc. should be collected on a 
> per-task basis
> ---
>
> Key: MESOS-1949
> URL: https://issues.apache.org/jira/browse/MESOS-1949
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Currently through a task's lifecycle, various debugging information is 
> created at different layers of the Mesos ecosystem.  The framework will log 
> task information, the master deals with resource allocation, the slave 
> actually allocates those resources, and the executor does the work of 
> launching the task.
> If anything through that pipeline fails, the end user is left with little but 
> a "TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful 
> information (for example a "Docker pull failed because repository didn't 
> exist") is hidden in one of four or five different places, potentially spread 
> across as many different machines.  This leads to unpleasant and repetitive 
> searching through logs looking for a clue to what went wrong.
> Collating logs on a per-task basis would give the end user a much friendlier 
> way of figuring out exactly where in this process something went wrong, and 
> likely much faster resolution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

2014-10-20 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-1949:


 Summary: All log messages from master, slave, executor, etc. 
should be collected on a per-task basis
 Key: MESOS-1949
 URL: https://issues.apache.org/jira/browse/MESOS-1949
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Affects Versions: 0.20.1
Reporter: Steven Schlansker


Currently through a task's lifecycle, various debugging information is created 
at different layers of the Mesos ecosystem.  The framework will log task 
information, the master deals with resource allocation, the slave actually 
allocates those resources, and the executor does the work of launching the task.

If anything through that pipeline fails, the end user is left with little but a 
"TASK_FAILED" or "TASK_LOST" -- the actually interesting / useful information 
(for example a "Docker pull failed because repository didn't exist") is hidden 
in one of four or five different places, potentially spread across as many 
different machines.  This leads to unpleasant and repetitive searching through 
logs looking for a clue to what went wrong.

Collating logs on a per-task basis would give the end user a much friendlier 
way of figuring out exactly where in this process something went wrong, and 
likely much faster resolution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1865) Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information

2014-10-03 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-1865:
-
Description: 
Some of the API endpoints, for example /master/tasks.json, will return bogus 
information if you query a non-leading master:

{code}
[steven@Anesthetize:~]% curl 
http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": [
{
  "executor_id": "",
  "framework_id": "20140724-231003-419644938-5050-1707-",
  "id": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "name": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "resources": {
"cpus": 0.25,
"disk": 0,
{code}

This is very hard for end-users to work around.  For example if I query "which 
master is leading" followed by "leader: which tasks are running" it is possible 
that the leader fails over in between, leaving me with an incorrect answer and 
no way to know that this happened.

In my opinion the API should return the correct response (by asking the current 
leader?) or an error (500 Not the leader?) but it's unacceptable to return a 
successful wrong answer.


  was:
Some of the API endpoints, for example /master/tasks.json, will return bogus 
information if you query a non-leading master:

{code}
[steven@Anesthetize:~]% curl 
http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": [
{
  "executor_id": "",
  "framework_id": "20140724-231003-419644938-5050-1707-",
  "id": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "name": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "resources": {
"cpus": 0.25,
"disk": 0,
{code}

This is very hard for end-users to work around.  For example if I query "which 
master is leading" followed by "master: which tasks are running" it is possible 
that the leader fails over in between, leaving me with an incorrect answer and 
no way to know that this happened.

In my opinion the API should return the correct response (by asking the current 
leader?) or an error (500 Not the leader?) but it's unacceptable to return a 
successful wrong answer.



> Mesos APIs for non-leading masters should return copies of the leader's state 
> or an error, not a success with incorrect information
> ---
>
> Key: MESOS-1865
> URL: https://issues.apache.org/jira/browse/MESOS-1865
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Some of the API endpoints, for example /master/tasks.json, will return bogus 
> information if you query a non-leading master:
> {code}
> [steven@Anesthetize:~]% curl 
> http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": [
> {
>   "executor_id": "",
>   "framework_id": "20140724-231003-419644938-5050-1707-",
>   "id": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "name": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "resources": {
> "cpus": 0.25,
> "disk": 0,
> {code}
> This is very hard for end-users to work around.  For example if I query 
> "which master is leading" followed by "leader: which tasks are running" it is 
> possible that the leader fails over in between, leaving me with an incorrect 
> answer and no way to know that this happened.
> In my opinion the API should return the correct response (by asking the 
> current leader?) or an error (500 Not the leader?) but it's unacceptable to 
> return a successful wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1865) Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information

2014-10-03 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-1865:
-
Description: 
Some of the API endpoints, for example /master/tasks.json, will return bogus 
information if you query a non-leading master:

{code}
[steven@Anesthetize:~]% curl 
http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
{
  "tasks": [
{
  "executor_id": "",
  "framework_id": "20140724-231003-419644938-5050-1707-",
  "id": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "name": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "resources": {
"cpus": 0.25,
"disk": 0,
{code}

This is very hard for end-users to work around.  For example if I query "which 
master is leading" followed by "master: which tasks are running" it is possible 
that the leader fails over in between, leaving me with an incorrect answer and 
no way to know that this happened.

In my opinion the API should return the correct response (by asking the current 
leader?) or an error (500 Not the leader?) but it's unacceptable to return a 
successful wrong answer.


  was:
Some of the API endpoints, for example /master/tasks.json, will return bogus 
information if you query a non-leading master:

{quote}
[steven@Anesthetize:~]% curl 
http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    12  100    12    0     0     21      0 --:--:-- --:--:-- --:--:--    21
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    12  100    12    0     0    105      0 --:--:-- --:--:-- --:--:--   106
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43081  100 43081    0     0   196k      0 --:--:-- --:--:-- --:--:--  196k
{
  "tasks": [
{
  "executor_id": "",
  "framework_id": "20140724-231003-419644938-5050-1707-",
  "id": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "name": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "resources": {
"cpus": 0.25,
"disk": 0,
{quote}

This is very hard for end-users to work around.  For example if I query "which 
master is leading" followed by "master: which tasks are running" it is possible 
that the leader fails over in between, leaving me with an incorrect answer and 
no way to know that this happened.

In my opinion the API should return the correct response (by asking the current 
leader?) or an error (500 Not the leader?) but it's unacceptable to return a 
successful wrong answer.



> Mesos APIs for non-leading masters should return copies of the leader's state 
> or an error, not a success with incorrect information
> ---
>
> Key: MESOS-1865
> URL: https://issues.apache.org/jira/browse/MESOS-1865
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>
> Some of the API endpoints, for example /master/tasks.json, will return bogus 
> information if you query a non-leading master:
> {code}
> [steven@Anesthetize:~]% curl 
> http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": [
> {
>   "executor_id": "",
>   "framework_id": "20140724-231003-419644938-5050-1707-",
>   "id": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "name": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "resources": {
> "cpus": 0.25,
> "disk": 0,
> {code}
> This is very hard for end-users to work around.  For example if I query 
> "which master is leading" followed by "master: which tasks are running" it is 
> possible that the leader fails over in between, leaving me with an incorrect 
> answer and no way to know that this happened.
> In my opinion the API should return the correct response (by asking the 
> current leader?) or an error (500 Not the leader?) but it's unacceptable to 
> return a successful wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (MESOS-1865) Mesos APIs for non-leading masters should return copies of the leader's state or an error, not a success with incorrect information

2014-10-03 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-1865:


 Summary: Mesos APIs for non-leading masters should return copies 
of the leader's state or an error, not a success with incorrect information
 Key: MESOS-1865
 URL: https://issues.apache.org/jira/browse/MESOS-1865
 Project: Mesos
  Issue Type: Bug
  Components: json api
Affects Versions: 0.20.1
Reporter: Steven Schlansker


Some of the API endpoints, for example /master/tasks.json, will return bogus 
information if you query a non-leading master:

{quote}
[steven@Anesthetize:~]% curl 
http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    12  100    12    0     0     21      0 --:--:-- --:--:-- --:--:--    21
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    12  100    12    0     0    105      0 --:--:-- --:--:-- --:--:--   106
{
  "tasks": []
}
[steven@Anesthetize:~]% curl 
http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43081  100 43081    0     0   196k      0 --:--:-- --:--:-- --:--:--  196k
{
  "tasks": [
{
  "executor_id": "",
  "framework_id": "20140724-231003-419644938-5050-1707-",
  "id": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "name": 
"pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
  "resources": {
"cpus": 0.25,
"disk": 0,
{quote}

This is very hard for end-users to work around.  For example if I query "which 
master is leading" followed by "master: which tasks are running" it is possible 
that the leader fails over in between, leaving me with an incorrect answer and 
no way to know that this happened.

In my opinion the API should return the correct response (by asking the current 
leader?) or an error (500 Not the leader?) but it's unacceptable to return a 
successful wrong answer.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1755) Add docker support to mesos-execute

2014-09-03 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120061#comment-14120061
 ] 

Steven Schlansker commented on MESOS-1755:
--

Hi Vinod and Timothy, sorry for the confusion.  I looped back with Henning and 
Sean and this commit is indeed unrelated as you suspected.

That said, it is a small commit, and I think it is worth including in 0.20.1 
even if Singularity does not need it -- it will make testing things somewhat 
easier and it seems low-risk.


> Add docker support to mesos-execute
> ---
>
> Key: MESOS-1755
> URL: https://issues.apache.org/jira/browse/MESOS-1755
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Timothy Chen
> Fix For: 0.21.0
>
>
> The fix for this is already committed at https://reviews.apache.org/r/24808/. 
> I'm creating this ticket to track that this patch gets included in 0.20.1 
> release, since apparently Singularity framework depends on this patch to work 
> with Docker !?!? 
> https://groups.google.com/forum/#!topic/singularity-users/GzzswbpI92E
> [~tnachen]: Can you confirm if this has to be included in 0.20.1?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1554) Persistent resources support for storage-like services

2014-09-03 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120041#comment-14120041
 ] 

Steven Schlansker commented on MESOS-1554:
--

It would be nice to be able to manage e.g. Amazon EBS (or generic SAN) volumes 
in this way.  That would be very powerful indeed.

> Persistent resources support for storage-like services
> --
>
> Key: MESOS-1554
> URL: https://issues.apache.org/jira/browse/MESOS-1554
> Project: Mesos
>  Issue Type: Epic
>  Components: general, hadoop
>Reporter: Nikita Vetoshkin
>Priority: Minor
>
> This question came up in [dev mailing 
> list|http://mail-archives.apache.org/mod_mbox/mesos-dev/201406.mbox/%3CCAK8jAgNDs9Fe011Sq1jeNr0h%3DE-tDD9rak6hAsap3PqHx1y%3DKQ%40mail.gmail.com%3E].
> It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use 
> Mesos to manage their instances. But right now, if we'd like to restart an 
> instance (e.g. to spin up a new version), all of the previous instance's 
> sandbox filesystem resources will be recycled by the slave's garbage collector.
> At the moment filesystem resources can be managed out of band - i.e. 
> instances can save their data in some database-specific place that various 
> instances can share (e.g. {{/var/lib/cassandra}}).
> [~benjaminhindman] suggested an idea in the mailing list (though it still 
> needs some fleshing out):
> {quote}
> The idea originally came about because, even today, if we allocate some
> file system space to a task/executor, and then that task/executor
> terminates, we haven't officially "freed" those file system resources until
> after we garbage collect the task/executor sandbox! (We keep the sandbox
> around so a user/operator can get the stdout/stderr or anything else left
> around from their task/executor.)
> To solve this problem we wanted to be able to let a task/executor terminate
> but not *give up* all of its resources, hence: persistent resources.
> Pushing this concept even further you could imagine always reallocating
> resources to a framework that had already been allocated those resources
> for a previous task/executor. Looked at from another perspective, these are
> "late-binding", or "lazy", resource reservations.
> At one point in time we had considered just doing 'right-of-first-refusal'
> for allocations after a task/executor terminate. But this is really
> insufficient for supporting storage-like frameworks well (and likely even
> harder to reliably implement than 'persistent resources' IMHO).
> There are a ton of things that need to get worked out in this model,
> including (but not limited to), how should a file system (or disk) be
> exposed in order to be made persistent? How should persistent resources be
> returned to a master? How many persistent resources can a framework get
> allocated?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-780) Adding support for 3rd party performance and health monitoring.

2014-08-14 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097763#comment-14097763
 ] 

Steven Schlansker commented on MESOS-780:
-

I've started a (very basic) Nagios check:
https://github.com/opentable/nagios-mesos
Hopefully someone finds it useful / contributes improvements :)

> Adding support for 3rd party performance and health monitoring.
> ---
>
> Key: MESOS-780
> URL: https://issues.apache.org/jira/browse/MESOS-780
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework
>Reporter: Bernardo Gomez Palacio
>
> User Story:
> As a SysAdmin I should be able to monitor Mesos (Masters and Slaves) with
> 3rd party tools such as:
> * [Ganglia|http://ganglia.sourceforge.net/]
> * [Graphite|http://graphite.wikidot.com/]
> * [Nagios|http://www.nagios.org/]



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave

2014-07-23 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071950#comment-14071950
 ] 

Steven Schlansker commented on MESOS-1193:
--

No, we aren't.  Maybe we should be...

> Check failed: promises.contains(containerId) crashes slave
> --
>
> Key: MESOS-1193
> URL: https://issues.apache.org/jira/browse/MESOS-1193
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.18.0
>Reporter: Tobi Knaup
>
> This was observed with four slaves on one machine, one framework (Marathon) 
> and around 100 tasks per slave.
> I0404 17:58:58.298075  3939 mesos_containerizer.cpp:891] Executor for 
> container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited
> I0404 17:58:58.298395  3938 slave.cpp:2047] Executor 'web_467-1396634277535' 
> of framework 201404041625-3823062160-55371-22555- has terminated with 
> signal Killed
> E0404 17:58:58.298475  3929 slave.cpp:2320] Failed to unmonitor container for 
> executor web_467-1396634277535 of framework 
> 201404041625-3823062160-55371-22555-: Not monitored
> I0404 17:58:58.299075  3938 slave.cpp:1643] Handling status update 
> TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task 
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- 
> from @0.0.0.0:0
> I0404 17:58:58.299232  3932 status_update_manager.cpp:315] Received status 
> update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task 
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-
> I0404 17:58:58.299360  3932 status_update_manager.cpp:368] Forwarding status 
> update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task 
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- 
> to master@144.76.223.227:5050
> I0404 17:58:58.306967  3932 status_update_manager.cpp:393] Received status 
> update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task 
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-
> I0404 17:58:58.307049  3932 slave.cpp:2186] Cleaning up executor 
> 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555-
> I0404 17:58:58.307122  3932 gc.cpp:56] Scheduling 
> '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a'
>  for gc 6.9644578667days in the future
> I0404 17:58:58.307157  3932 gc.cpp:56] Scheduling 
> '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535'
>  for gc 6.9644553185days in the future
> F0404 17:58:58.597434  3938 mesos_containerizer.cpp:682] Check failed: 
> promises.contains(containerId)
> *** Check failure stack trace: ***
> @ 0x7f5209da6e5d  google::LogMessage::Fail()
> @ 0x7f5209da8c9d  google::LogMessage::SendToLog()
> @ 0x7f5209da6a4c  google::LogMessage::Flush()
> @ 0x7f5209da9599  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f5209ad9f88  
> mesos::internal::slave::MesosContainerizerProcess::exec()
> @ 0x7f5209af3b56  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS6_11ContainerIDEiSA_iEENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSH_FSF_T1_T2_ET3_T4_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f5209cd0bf2  process::ProcessManager::resume()
> @ 0x7f5209cd0eec  process::schedule()
> @ 0x7f5208b48f6e  start_thread
> @ 0x7f52088739cd  (unknown)
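
The fatal line above is a glog CHECK (the google::LogMessageFatal frames in the trace confirm glog): when the containerizer's promises map no longer contains the container, glog logs at FATAL severity and aborts the entire slave process. Below is a minimal standalone sketch of that behavior plus a more defensive variant that fails only the one operation; the map, the function names, and the defensive variant are illustrative stand-ins, not Mesos source, and the only assumption is that glog is installed.

{code}
// Standalone sketch (not Mesos source): how a glog CHECK on a map key
// turns a missing containerId into a whole-process abort, and what a
// defensive variant could look like. All names here are illustrative.
#include <glog/logging.h>

#include <string>
#include <unordered_map>

// Stand-in for the containerizer's ContainerID -> promise bookkeeping.
static std::unordered_map<std::string, int> promises;

// Mirrors the crashing shape: "Check failed: promises.contains(containerId)".
void execCrashing(const std::string& containerId) {
  CHECK(promises.count(containerId) > 0)
      << "promises does not contain " << containerId;
}

// Tolerant alternative: treat a missing entry as "container already
// destroyed" and report failure to the caller instead of aborting.
bool execDefensive(const std::string& containerId) {
  auto it = promises.find(containerId);
  if (it == promises.end()) {
    LOG(WARNING) << "Unknown container " << containerId
                 << "; it may have been destroyed during launch";
    return false;
  }
  return true;
}

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);
  promises["6d4de71c"] = 1;             // container registered at launch
  execDefensive("unknown-container");   // logs a warning, keeps running
  // execCrashing("unknown-container"); // would log FATAL and abort
  return 0;
}
{code}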



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave

2014-07-22 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071032#comment-14071032
 ] 

Steven Schlansker edited comment on MESOS-1193 at 7/22/14 10:45 PM:


That sounds entirely plausible.
We are running Docker containers.  Given the somewhat unstable state of Docker 
support in Mesos, we are using our own Docker launching scripts.  I had just 
updated a base image so all the slaves were busy executing a 'docker pull' to 
grab the new images.

Given that the task is a shell script that executes this pull, it may well be 
past what Mesos thinks of as the "launch" phase.  But it definitely was during 
a lengthy initialization step.

It's worth noting that almost all of our jobs are Marathon tasks.  I believe 
the log messages about Chronos are unrelated; we only have one or two things 
launching with it, and I don't think any were running around the time of the 
crash.


was (Author: stevenschlansker):
That sounds entirely plausible.
We are running Docker containers.  Given the somewhat unstable state of Docker 
support in Mesos, we are using our own Docker launching scripts.  I had just 
updated a base image so all the slaves were busy executing a 'docker pull' to 
grab the new images.

Given that the task is a shell script that executes this pull, it may well be 
past what Mesos thinks of as the "launch" phase.  But it definitely was during 
a lengthy initialization step.

It's worth noting that almost all of our jobs are Marathon tasks.  I believe 
(?) the log messages about Chronos are unrelated, we only have one or two 
things launching with it, and I don't think any were around the time of the 
crash.


[jira] [Commented] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave

2014-07-22 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071032#comment-14071032
 ] 

Steven Schlansker commented on MESOS-1193:
--

That sounds entirely plausible.
We are running Docker containers.  Given the somewhat unstable state of Docker 
support in Mesos, we are using our own Docker launching scripts.  I had just 
updated a base image so all the slaves were busy executing a 'docker pull' to 
grab the new images.

Given that the task is a shell script that executes this pull, it may well be 
past what Mesos thinks of as the "launch" phase.  But it definitely was during 
a lengthy initialization step.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave

2014-07-22 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070592#comment-14070592
 ] 

Steven Schlansker edited comment on MESOS-1193 at 7/22/14 6:08 PM:
---

This same issue just took down our entire cluster this morning.  Not cool!
I wish I had more debugging information, but here are the last 10 log lines:
{code}
I0722 17:41:26.189750  1376 slave.cpp:2552] Cleaning up executor 
'chronos.43852277-11c7-11e4-b541-1e5db1e5be60' of framework 
201403072353-2969781770-5050-852-
I0722 17:41:26.189893  1381 gc.cpp:56] Scheduling 
'/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60/runs/941f20d9-ba74-4531-b64e-6c2b05b0277f'
 for gc 6.9780272days in the future
I0722 17:41:26.189980  1381 gc.cpp:56] Scheduling 
'/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60'
 for gc 6.9780201482days in the future
I0722 17:41:26.737553  1380 slave.cpp:933] Got assigned task 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 
201403072353-2969781770-5050-852-
I0722 17:41:26.737844  1380 slave.cpp:1043] Launching task 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 
201403072353-2969781770-5050-852-
I0722 17:41:26.739146  1375 mesos_containerizer.cpp:537] Starting container 
'17c2236a-1242-4c36-a6ed-54a31f687e8b' for executor 
'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' of framework 
'201403072353-2969781770-5050-852-'
I0722 17:41:26.739151  1380 slave.cpp:1153] Queuing task 
'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' for executor 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 of framework 
'201403072353-2969781770-5050-852-
I0722 17:41:26.748342  1375 launcher.cpp:117] Forked child with pid '12376' for 
container '17c2236a-1242-4c36-a6ed-54a31f687e8b'
I0722 17:41:26.752080  1380 mesos_containerizer.cpp:647] Fetching URIs for 
container '17c2236a-1242-4c36-a6ed-54a31f687e8b' using command 
'/usr/local/libexec/mesos/mesos-fetcher'
F0722 17:41:52.215634  1377 mesos_containerizer.cpp:862] Check failed: 
promises.contains(containerId) 
{code}

The slave's startup log from a few minutes later gives the build and version:
{code}
I0722 17:56:35.702491 13428 main.cpp:126] Build: 2014-06-09 21:08:25 by root
I0722 17:56:35.702517 13428 main.cpp:128] Version: 0.19.0
I0722 17:56:35.702535 13428 main.cpp:131] Git tag: 0.19.0
I0722 17:56:35.702553 13428 main.cpp:135] Git SHA: 
51e047524cf744ee257870eb479345646c0428ff
I0722 17:56:35.702590 13428 mesos_containerizer.cpp:124] Using isolation: 
posix/cpu,posix/mem
I0722 17:56:35.702942 13428 main.cpp:149] Starting Mesos slave
I0722 17:56:35.703721 13428 slave.cpp:143] Slave started on 1)@10.70.6.32:5051
I0722 17:56:35.704082 13428 slave.cpp:255] Slave resources: cpus(*):8; 
mem(*):29077; disk(*):70336; ports(*):[31000-32000]
I0722 17:56:35.705883 13428 slave.cpp:283] Slave hostname: 10.70.6.32
I0722 17:56:35.705915 13428 slave.cpp:284] Slave checkpoint: true
{code}
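
For what it's worth, the 26-second gap between the "Fetching URIs" line (17:41:26) and the fatal check (17:41:52) fits a race in which the container is destroyed while the fetch is still in flight, so the completion handler finds its bookkeeping entry gone. Here is a self-contained sketch of that suspected interleaving; it uses plain std::async and a std::unordered_map as stand-ins for libprocess futures and Mesos's internal state, and is not actual Mesos code.

{code}
// Sketch (illustrative, not Mesos source) of the suspected interleaving:
// a slow asynchronous fetch outlives the container's bookkeeping entry,
// and the completion handler then fails a contains() assertion.
#include <chrono>
#include <cstdlib>
#include <future>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

static std::mutex mu;
static std::unordered_map<std::string, int> promises;

// Runs when the fetch finishes; mirrors the fatal contains() check.
void onFetchFinished(const std::string& containerId) {
  std::lock_guard<std::mutex> lock(mu);
  if (promises.count(containerId) == 0) {
    std::cerr << "Check failed: promises.contains(containerId)\n";
    std::abort();  // glog's CHECK would abort the slave here
  }
}

int main() {
  const std::string containerId = "17c2236a";  // from the log above

  {
    std::lock_guard<std::mutex> lock(mu);
    promises[containerId] = 1;  // registered when the launch starts
  }

  // The fetch (mesos-fetcher, a 'docker pull', ...) runs asynchronously
  // and takes a while -- 26 seconds in the log above, 2 seconds here.
  std::future<void> fetch = std::async(std::launch::async, [] {
    std::this_thread::sleep_for(std::chrono::seconds(2));
  });

  // Meanwhile the container is torn down and its entry erased.
  {
    std::lock_guard<std::mutex> lock(mu);
    promises.erase(containerId);
  }

  fetch.wait();
  onFetchFinished(containerId);  // aborts, mirroring the F0722 line
  return 0;
}
{code}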



[jira] [Commented] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave

2014-07-22 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070592#comment-14070592
 ] 

Steven Schlansker commented on MESOS-1193:
--

This same issue just took down our entire cluster this morning.  Not cool!
I wish I had more debugging information, but here are the last 10 log lines:
{code}
I0722 17:41:26.189750  1376 slave.cpp:2552] Cleaning up executor 
'chronos.43852277-11c7-11e4-b541-1e5db1e5be60' of framework 
201403072353-2969781770-5050-852-
I0722 17:41:26.189893  1381 gc.cpp:56] Scheduling 
'/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60/runs/941f20d9-ba74-4531-b64e-6c2b05b0277f'
 for gc 6.9780272days in the future
I0722 17:41:26.189980  1381 gc.cpp:56] Scheduling 
'/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60'
 for gc 6.9780201482days in the future
I0722 17:41:26.737553  1380 slave.cpp:933] Got assigned task 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 
201403072353-2969781770-5050-852-
I0722 17:41:26.737844  1380 slave.cpp:1043] Launching task 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 
201403072353-2969781770-5050-852-
I0722 17:41:26.739146  1375 mesos_containerizer.cpp:537] Starting container 
'17c2236a-1242-4c36-a6ed-54a31f687e8b' for executor 
'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' of framework 
'201403072353-2969781770-5050-852-'
I0722 17:41:26.739151  1380 slave.cpp:1153] Queuing task 
'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' for executor 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 of framework 
'201403072353-2969781770-5050-852-
I0722 17:41:26.748342  1375 launcher.cpp:117] Forked child with pid '12376' for 
container '17c2236a-1242-4c36-a6ed-54a31f687e8b'
I0722 17:41:26.752080  1380 mesos_containerizer.cpp:647] Fetching URIs for 
container '17c2236a-1242-4c36-a6ed-54a31f687e8b' using command 
'/usr/local/libexec/mesos/mesos-fetcher'
F0722 17:41:52.215634  1377 mesos_containerizer.cpp:862] Check failed: 
promises.contains(containerId) 
{code}
