[jira] [Commented] (MESOS-5197) Log executor commands w/o verbose logs enabled

2016-04-15 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243861#comment-15243861
 ] 

Joseph Wu commented on MESOS-5197:
--

It's currently unclear whether we want to increase the logging level for these 
lines in this way.

The {{src/docker/docker.(hpp|cpp)}} files are shared between the agent (docker 
containerizer) and the docker-command-executor.  The {{INFO}} logging is useful 
for the docker-command-executor, but less so for the docker containerizer.
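
One possible direction, as a minimal sketch (hypothetical constructor flag, not 
the actual {{src/docker/docker.hpp}} API): let each caller pick the level at 
which commands are logged, so the docker-command-executor gets {{INFO}} while 
the docker containerizer keeps {{VLOG(1)}}.

{code}
// Hypothetical sketch only -- not the actual Docker class in
// src/docker/docker.hpp.
#include <string>

#include <glog/logging.h>

class Docker
{
public:
  // The docker-command-executor would pass 'true'; the containerizer 'false'.
  explicit Docker(bool logCommandsAtInfo = false)
    : logCommandsAtInfo(logCommandsAtInfo) {}

  void logCommand(const std::string& command) const
  {
    if (logCommandsAtInfo) {
      LOG(INFO) << "Running " << command;
    } else {
      VLOG(1) << "Running " << command;  // Current behavior: needs GLOG_v=1.
    }
  }

private:
  const bool logCommandsAtInfo;
};
{code}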

> Log executor commands w/o verbose logs enabled
> --
>
> Key: MESOS-5197
> URL: https://issues.apache.org/jira/browse/MESOS-5197
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Gummelt
>Assignee: Yong Tang
>  Labels: mesosphere
>
> To debug executors, it's often necessary to know the command that ran the 
> executor.  For example, when Spark executors fail, I'd like to know the 
> command used to invoke the executor (Spark uses the command executor in a 
> docker container).  Currently, it's only output if GLOG_v is enabled, but I 
> don't think this should be a "verbose" output.  It's a common debugging need.
> https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677
> cc [~kaysoky]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5180) Scheduler driver does not detect disconnection with master and reregister.

2016-04-15 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5180:
--
Sprint:   (was: Mesosphere Sprint 33)

> Scheduler driver does not detect disconnection with master and reregister.
> --
>
> Key: MESOS-5180
> URL: https://issues.apache.org/jira/browse/MESOS-5180
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 0.24.0
>Reporter: Joseph Wu
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with 
> the master under some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master 
> changing.  (Currently, the scheduler driver will only re-register if the 
> master changes).
> If both links break or if just link (1) breaks, the master views the 
> framework as {{inactive}} and {{disconnected}}.  This means the framework 
> will not receive any more events (such as offers) from the master until it 
> re-registers.  There is currently no way for the scheduler to detect a 
> one-way link breakage.
> If link (2) breaks, it makes (almost) no difference to the scheduler.  The 
> scheduler usually uses the link to send messages to the master, but 
> libprocess will create another socket if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should 
> implement an {{::exited}} event handler for the master's {{pid}} and trigger 
> a master (re-)detection upon a disconnection. This in turn should make the 
> driver (re-)register with the master. The scheduler library already does 
> this: 
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
> See the related issue MESOS-5181 for link (1) breakage.
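
For illustration, a minimal libprocess-style sketch of the proposed 
{{::exited}} handler (hypothetical class and helper names; the real driver 
lives in src/sched/sched.cpp):

{code}
// Sketch only: the shape of the proposed fix, not the actual driver code.
#include <process/pid.hpp>
#include <process/process.hpp>

using process::UPID;

class SchedulerProcess : public process::Process<SchedulerProcess>
{
public:
  void registered(const UPID& masterPid)
  {
    master = masterPid;
    link(master);  // Ask libprocess to notify us when this link breaks.
  }

protected:
  // Invoked by libprocess when a linked pid exits or its socket breaks.
  void exited(const UPID& pid) override
  {
    if (pid == master) {
      // Treat the breakage as a disconnection and trigger master
      // (re-)detection, which in turn leads to re-registration.
      detectMaster();  // Hypothetical helper.
    }
  }

private:
  void detectMaster() {}
  UPID master;
};
{code}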



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5181) Master should reject calls from the scheduler driver if the scheduler is not connected.

2016-04-15 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5181:
--
Sprint:   (was: Mesosphere Sprint 33)

> Master should reject calls from the scheduler driver if the scheduler is not 
> connected.
> ---
>
> Key: MESOS-5181
> URL: https://issues.apache.org/jira/browse/MESOS-5181
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 0.24.0
>Reporter: Joseph Wu
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> When a scheduler registers, the master will create a link from master to 
> scheduler.  If this link breaks, the master will consider the scheduler 
> {{inactive}} and mark it as {{disconnected}}.
> This causes a couple of problems:
> 1) The master does not send offers to {{inactive}} schedulers.  But these 
> schedulers might consider themselves "registered" in a one-way network 
> partition scenario.
> 2) Any calls from the {{inactive}} scheduler are still accepted, which leaves 
> the scheduler in a starved but semi-functional state.
> See the related issue for more context: MESOS-5180
> There should be an additional guard for registered, but {{inactive}} 
> schedulers here:
> https://github.com/apache/mesos/blob/94f4f4ebb7d491ec6da1473b619600332981dd8e/src/master/master.cpp#L1977
> The HTTP API already does this:
> https://github.com/apache/mesos/blob/94f4f4ebb7d491ec6da1473b619600332981dd8e/src/master/http.cpp#L459
> Since a 403 cannot be returned to the scheduler driver, it may be necessary 
> to return an {{Event::ERROR}} and force the scheduler to abort.
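
A self-contained toy analogue of such a guard (hypothetical names; the real 
check would sit at the master.cpp line linked above):

{code}
// Toy sketch: reject calls from registered-but-inactive frameworks.
#include <iostream>
#include <string>

struct Framework
{
  bool active = true;
};

// Returns an error to deliver as Event::ERROR, or "" to accept the call.
std::string validateCall(const Framework* framework)
{
  if (framework == nullptr) {
    return "Framework is not registered";
  }
  if (!framework->active) {
    // The driver cannot see an HTTP 403, so the master would instead send
    // an error event, forcing the scheduler to abort.
    return "Framework is registered but inactive";
  }
  return "";
}

int main()
{
  Framework framework;
  framework.active = false;
  std::cout << validateCall(&framework) << std::endl;
}
{code}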



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5060) Requesting /files/read.json with a negative length value causes subsequent /files requests to 404.

2016-04-15 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243420#comment-15243420
 ] 

Greg Mann commented on MESOS-5060:
--

[~dongdong], I had a look at this code with BenM and there is a clear bug; we 
parse the {{length}} parameter as an {{ssize_t}}, which is a signed type, but 
then we use that length value (which may be negative) to initialize an array: 
{{boost::shared_array<char> data(new char[length]);}}.

After discussing with BenM, I think there are a few cases of {{length}} and 
{{offset}} which we need to handle:
* A user-defined {{length}} (strictly positive)
* A default {{length}} if none is specified (perhaps equal to the page size)

* A user-defined {{offset}} (positive, negative, or end-of-file)
* A default {{offset}}

The end-of-file offset is important because this endpoint is used to tail 
files. Unfortunately, we currently use {{offset == -1}} in the code to indicate 
the end-of-file offset. The end-of-file offset is currently the default value 
if no {{offset}} is specified; I don't find this to be very intuitive for 
users, but it may be our best option if we want to allow negative offsets 
(i.e., if we allow negative offsets, how would a user specify the end-of-file 
offset explicitly?).

We can probably just remove support for negative values of {{length}}, and 
allow the user to use the default length by omitting that parameter.
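
For concreteness, a minimal sketch of that validation (hypothetical helper, 
not the actual src/files/files.cpp code):

{code}
// Sketch: parse 'length' as signed, reject negatives before any allocation.
#include <cstddef>
#include <iostream>
#include <string>

#include <boost/shared_array.hpp>

bool readChunk(const std::string& lengthParam)
{
  long long length = 0;
  try {
    length = std::stoll(lengthParam);
  } catch (...) {
    return false;  // Not a number: would reply 400 Bad Request.
  }

  if (length < 0) {
    return false;  // Negative length: reply 400 instead of misbehaving.
  }

  // Only now is it safe to size the buffer.
  boost::shared_array<char> data(new char[static_cast<size_t>(length)]);
  return true;
}

int main()
{
  std::cout << readChunk("-100") << std::endl;  // 0: rejected, no allocation.
}
{code}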

Have a look at the code and let me know what you think. Since this bug breaks 
part of the agent, we'd love to get a fix in soon; do you know when you might 
be able to take a look? Thanks! :-)

> Requesting /files/read.json with a negative length value causes subsequent 
> /files requests to 404.
> --
>
> Key: MESOS-5060
> URL: https://issues.apache.org/jira/browse/MESOS-5060
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.23.0
> Environment: Mesos 0.23.0 on CentOS 6, also Mesos 0.28.0 on OSX
>Reporter: Tom Petr
>Assignee: zhou xing
>Priority: Minor
> Fix For: 0.29.0
>
>
> I accidentally hit a slave's /files/read.json endpoint with a negative length 
> (ex. http://hostname:5051/files/read.json?path=XXX&offset=0&length=-100). The 
> HTTP request timed out after 30 seconds with nothing relevant in the slave 
> logs, and subsequent calls to any of the /files endpoints on that slave 
> immediately returned an HTTP 404 response. We ultimately got things working 
> again by restarting the mesos-slave process (checkpointing FTW!), but it'd be 
> wise to guard against negative lengths on the slave's end too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4778) Add appc/runtime isolator for runtime isolation for appc images.

2016-04-15 Thread Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243361#comment-15243361
 ] 

Srinivas commented on MESOS-4778:
-

I plan to add a set of patches:
1. Add runtime metadata "app", with the ability to specify the working 
directory, command and environment.
https://reviews.apache.org/r/46107
2. Add code to the store and provisioner to propagate the metadata to the 
executor.
https://reviews.apache.org/r/46182
3. Add a runtime isolator to hook into the containerizer. (WIP)


> Add appc/runtime isolator for runtime isolation for appc images.
> 
>
> Key: MESOS-4778
> URL: https://issues.apache.org/jira/browse/MESOS-4778
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Srinivas
>
> An Appc image also contains runtime information such as 'exec', 'env', 
> 'workingDirectory', etc.
> https://github.com/appc/spec/blob/master/spec/aci.md
> Similar to docker images, we need to support a subset of them (mainly 'exec', 
> 'env' and 'workingDirectory').
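
For reference, the runtime fields in question sit under the ACI manifest's 
"app" section, roughly like this (illustrative values; field names per the 
appc spec linked above):

{code}
"app": {
    "exec": ["/usr/bin/my-server", "--port=8080"],
    "environment": [
        {"name": "FOO", "value": "bar"}
    ],
    "workingDirectory": "/opt/app"
}
{code}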



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-15 Thread Tyson Norris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243232#comment-15243232
 ] 

Tyson Norris commented on MESOS-4279:
-

Thanks for the updates.
One note I wanted to add was that we see exactly what [~bydga] describes above, 
in the "there are actually 2 bugs" comment:
- task stdout is truncated (compared to docker container json.log)
- task status is killed (instead of finished)

For example, regarding "You are calling the run->discard method (which causes 
the stderr/stdout streams to close) too early - during the 'stopping period' 
the container can (and usually will) write something about the termination": if 
I check the docker container's log file on disk, it has a series of lines that 
are emitted during shutdown, so I can see that "docker stop" is called and the 
container does actually perform a graceful shutdown. However, the task stdout 
does not receive any of these lines after docker stop is called.



> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from 
> mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.
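
For context on what "graceful" is expected to mean here: as far as I know, the 
docker containerizer stops containers via {{docker stop}}, whose grace period 
comes from the agent's {{--docker_stop_timeout}} flag, i.e. effectively:

{noformat}
# Grace period in seconds taken from --docker_stop_timeout; docker sends
# SIGTERM, waits out the timeout, then SIGKILLs the container.
docker stop -t <timeout> <container>
{noformat}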



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5061) process.cpp:1966] Failed to shutdown socket with fd x: Transport endpoint is not connected

2016-04-15 Thread Dan Osborne (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243063#comment-15243063
 ] 

Dan Osborne commented on MESOS-5061:


I spent some time looking at Zogg's setup and found the following to be true.

After its initialization, the first thing the executor does is register with 
the Slave. Since we're using the network isolator here, the registration 
message should have a src address from the newly initialized networking 
namespace. Since calico is handling the isolation, this means we'll see a 
registration with a src IP from the default 192.168.0.0/16 range, like the 
following example:

I0414 23:31:44.479730   205 slave.cpp:2642] Got registration for executor 
'star_probe-b.0bd467c0-0299-11e6-ad3b-0242ac110005' of framework 
6a1ae9aa-ad50-44c1-8809-58791c5bcbe5- from executor(1)@192.168.0.4:35454

In Zogg's test, we're seeing a registration message that uses the Slave's 
IP address, which is false information. When the slave then tries to complete 
the handshake for the registration request, it fails, since no executor is 
listening at that IP/port. This explains why we see tasks stuck in staging - 
mesos-slave has completely lost contact with the executor.

Can someone shed light on how the Executor picks this IP, or whether it's just 
extracted from the source IP of the registration method?

Versioning info:
Mesos: manually built from 0.27.0
Net-modules (basically latest): 
https://github.com/mesosphere/net-modules/commits/625b67992ceca535cf2c76ea980b64aa8f4b33e1

I'm going to work to get this reproducible using the net-modules docker-compose 
demo. In the meantime, any thoughts?
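
On the IP question: as far as I understand, libprocess advertises 
{{LIBPROCESS_IP}} when set and otherwise an address resolved from the local 
hostname; roughly:

{code}
// Rough sketch of libprocess's address selection (simplified; see
// process::initialize in libprocess for the real logic).
#include <cstdlib>
#include <iostream>
#include <string>

std::string advertisedIp()
{
  // 1. An explicit override wins.
  if (const char* ip = std::getenv("LIBPROCESS_IP")) {
    return ip;
  }

  // 2. Otherwise the local hostname is resolved. Inside a fresh network
  //    namespace this can still yield the host's address (e.g. via an
  //    inherited /etc/hosts), which would explain the bogus src IP above.
  return "<address resolved from gethostname()>";
}

int main()
{
  std::cout << advertisedIp() << std::endl;
}
{code}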

> process.cpp:1966] Failed to shutdown socket with fd x: Transport endpoint is 
> not connected
> --
>
> Key: MESOS-5061
> URL: https://issues.apache.org/jira/browse/MESOS-5061
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, modules
>Affects Versions: 0.27.0, 0.27.1, 0.28.0, 0.27.2
> Environment: Centos 7.1
>Reporter: Zogg
> Fix For: 0.29.0
>
>
> When launching a task through Marathon and asking the task to assign an IP 
> (using Calico networking):
> {noformat}
> {
>     "id": "/calico-apps",
>     "apps": [
>         {
>             "id": "hello-world-1",
>             "cmd": "ip addr && sleep 3",
>             "cpus": 0.1,
>             "mem": 64.0,
>             "ipAddress": {
>                 "groups": ["calico-k8s-network"]
>             }
>         }
>     ]
> }
> {noformat}
> Mesos slave fails to launch the task, which gets stuck in STAGING state 
> forever, with error:
> {noformat}
> [centos@rtmi-worker-001 mesos]$ tail mesos-slave.INFO
> I0325 20:35:43.420171 13495 slave.cpp:2642] Got registration for executor 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' of framework 
> 23b404e4-700a-4348-a7c0-226239348981- from executor(1)@10.0.0.10:33443
> I0325 20:35:43.422652 13495 slave.cpp:1862] Sending queued task 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' to executor 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' of framework 
> 23b404e4-700a-4348-a7c0-226239348981- at executor(1)@10.0.0.10:33443
> E0325 20:35:43.423159 13502 process.cpp:1966] Failed to shutdown socket with 
> fd 22: Transport endpoint is not connected
> I0325 20:35:43.423316 13501 slave.cpp:3481] executor(1)@10.0.0.10:33443 exited
> {noformat}
> However, when deploying a task without the ipAddress field, the mesos slave 
> launches the task successfully. 
> Tested with various Mesos/Marathon/Calico versions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4576) Introduce a stout helper for "which"

2016-04-15 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243064#comment-15243064
 ] 

Guangya Liu commented on MESOS-4576:


I created a patch here https://reviews.apache.org/r/45326/

> Introduce a stout helper for "which"
> 
>
> Key: MESOS-4576
> URL: https://issues.apache.org/jira/browse/MESOS-4576
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Joseph Wu
>Assignee: Disha Singh
>  Labels: mesosphere
>
> We may want to add a helper to {{stout/os.hpp}} that will natively emulate 
> the functionality of the Linux utility {{which}}.  i.e.
> {code}
> Option<string> which(const string& command)
> {
>   Option<string> path = os::getenv("PATH");
>   // Loop through path and return the first one which os::exists(...).
>   return None();
> }
> {code}
> This helper may be useful:
> * for test filters in {{src/tests/environment.cpp}}
> * a few tests in {{src/tests/containerizer/port_mapping_tests.cpp}}
> * the {{sha512}} utility in {{src/common/command_utils.cpp}}
> * as runtime checks in the {{LogrotateContainerLogger}}
> * etc.
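
A fleshed-out version of the stub above might look like this (a sketch built 
on existing stout helpers; the eventual implementation may differ, e.g. by 
also checking the executable bit):

{code}
#include <string>

#include <stout/foreach.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>
#include <stout/os.hpp>
#include <stout/path.hpp>
#include <stout/strings.hpp>

Option<std::string> which(const std::string& command)
{
  Option<std::string> path = os::getenv("PATH");
  if (path.isNone()) {
    return None();
  }

  // Check each PATH entry and return the first existing candidate.
  foreach (const std::string& dir, strings::tokenize(path.get(), ":")) {
    const std::string candidate = path::join(dir, command);
    if (os::exists(candidate)) {
      return candidate;
    }
  }

  return None();
}
{code}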



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4705) Slave failed to sample container with perf event

2016-04-15 Thread Fan Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243029#comment-15243029
 ] 

Fan Du commented on MESOS-4705:
---

{quote}
Which patch? This one? https://reviews.apache.org/r/44379/

It still does not contain the information related to perf stat formats that 
haosdent provided earlier in this thread. Can you add that?
{quote}

[~haosd...@gmail.com] I think I have added the format you mentioned in the 
first reply of this thread, {{value,unit,event,cgroup}}, and this format also 
matches what you describe in 
[MESOS-4655|https://issues.apache.org/jira/browse/MESOS-4655], right?
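
To make the layout difference concrete, a toy tokenizer over the failing line 
from the description (assumed layouts; the real parsing lives in 
src/linux/perf.cpp):

{code}
// Toy sketch: tolerate both the 4-token and the 6-token perf-stat layouts.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
  // 3.10.0-123.el7.x86_64: value,unit,event,cgroup,running,ratio (6 tokens).
  // Format currently assumed by the parser: value,unit,event,cgroup.
  const std::string line =
    "25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,"
    "10059827422,100.00";

  std::vector<std::string> tokens;
  std::stringstream stream(line);
  std::string token;
  while (std::getline(stream, token, ',')) {
    tokens.push_back(token);  // Empty unit fields come through as "".
  }

  if (tokens.size() == 4 || tokens.size() == 6) {
    std::cout << "event=" << tokens[2] << " value=" << tokens[0] << std::endl;
  } else {
    std::cout << "Unexpected number of fields: " << tokens.size() << std::endl;
  }
}
{code}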

> Slave failed to sample container with perf event
> 
>
> Key: MESOS-4705
> URL: https://issues.apache.org/jira/browse/MESOS-4705
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation
>Affects Versions: 0.27.1
>Reporter: Fan Du
>Assignee: Fan Du
>
> When sampling a container with perf event on CentOS 7 with kernel 
> 3.10.0-123.el7.x86_64, the slave complained with the error spew below:
> {code}
> E0218 16:32:00.591181  8376 perf_event.cpp:408] Failed to get perf sample: 
> Failed to parse perf sample: Failed to parse perf sample line 
> '25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,10059827422,100.00':
>  Unexpected number of fields
> {code}
> It's caused by the current perf format [assumption | 
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/linux/perf.cpp;h=1c113a2b3f57877e132bbd65e01fb2f045132128;hb=HEAD#l430]
>  which does not hold for kernel versions below 3.12.
> On the 3.10.0-123.el7.x86_64 kernel, the format has 6 tokens, as below:
> {{value,unit,event,cgroup,running,ratio}}
> A local modification fixed this error on my test bed; please review this 
> ticket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-15 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243004#comment-15243004
 ] 

Alexander Rukletsov commented on MESOS-4279:


Folks,

thanks a lot for reporting, elaborating and proposing solutions. I've 
prioritized the issue; we will be looking into it ASAP.

MESOS-4909 did not aim to solve this problem, but rather to generalize the way 
tasks finalize.

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from 
> mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4279) Graceful restart of docker task

2016-04-15 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4279:
---
Affects Version/s: 0.26.0
                   0.27.2
     Story Points: 5
           Labels: docker mesosphere  (was: )

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from 
> mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4771) Document the network/cni isolator.

2016-04-15 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan reassigned MESOS-4771:


Assignee: Avinash Sridharan  (was: Qian Zhang)

> Document the network/cni isolator.
> --
>
> Key: MESOS-4771
> URL: https://issues.apache.org/jira/browse/MESOS-4771
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Avinash Sridharan
>
> We need to document this isolator in mesos-containerizer.md (e.g., how to 
> configure it, what the prerequisites are, etc.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-15 Thread David Overcash (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242638#comment-15242638
 ] 

David Overcash commented on MESOS-4279:
---

We are seeing this as well.  Quite a few problems come up when you cannot 
reliably handle a SIGTERM or SIGINT as you normally would.  The issue you 
linked ( https://issues.apache.org/jira/browse/MESOS-4909 ) is helpful, but it 
is not the right way to solve this problem.

Is there a reason that this branch 
https://github.com/apache/mesos/compare/master...bydga:dockerfix cannot be 
submitted as a PR?  It looks like a simple enough fix.

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from 
> mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4312) Porting Mesos on Power (ppc64le)

2016-04-15 Thread Chen Zhiwei (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242629#comment-15242629
 ] 

Chen Zhiwei commented on MESOS-4312:


Thanks, Samuel; we have our daily builds on Power for both RHEL and Ubuntu.

> Porting Mesos on Power (ppc64le)
> 
>
> Key: MESOS-4312
> URL: https://issues.apache.org/jira/browse/MESOS-4312
> Project: Mesos
>  Issue Type: Epic
>Reporter: Qian Zhang
>Assignee: Chen Zhiwei
>
> The goal of this ticket is to make IBM Power (ppc64le) a supported 
> hardware platform for Mesos. Currently the latest Mesos code cannot be 
> successfully built on ppc64le; we will resolve the build errors in this 
> ticket, and also make sure the Mesos test suite ("make check") can be run 
> successfully on ppc64le. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-15 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242587#comment-15242587
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

I'm happy I'm not the only one noticing this. Well, to be honest, I absolutely 
gave up on solving this issue. I reported the bug here, I even resolved the 
issues (there were more problems, actually) in my branch on github, I spoke 
with someone (from mesos) on IRC, I was asking on Slack - all without any 
response. And no one cares. In the meantime, they do something like 
https://issues.apache.org/jira/browse/MESOS-4909 - but with the same errors 
again... So right now we are considering writing our own {{Executor}}.

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from 
> mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4312) Porting Mesos on Power (ppc64le)

2016-04-15 Thread Samuel Cozannet (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242570#comment-15242570
 ] 

Samuel Cozannet commented on MESOS-4312:


Hi there, 

I'm new to this, but I could actually build Mesos from the latest master 
(0.29.0) on git on an IBM Power S824L running Ubuntu 14.04. I also have access 
to Xenial 16.04 for at least a couple of weeks, so I can do some testing if 
you'd like. 

Note that building is fine, but I had a LOT of trouble packaging it into a 
proper .deb. Much of the required tooling (fpm, mesos-deb-packaging) is not 
ported to Power, and you have to compile a large number of items from source, 
which in certain cases is very tricky. 
Nevertheless, I have a few packages for you: 
https://s3-us-west-2.amazonaws.com/samnco-static-files/packages/marathon_0.7.5-0.1.20160414202109.ubuntu1404_ppc64el.deb
https://s3-us-west-2.amazonaws.com/samnco-static-files/packages/mesos-0.29.0.20160414202109.ubuntu1404_ppc64el.deb

They're free to download if you want to have a look. Let me know if I can help 
in any way. 

> Porting Mesos on Power (ppc64le)
> 
>
> Key: MESOS-4312
> URL: https://issues.apache.org/jira/browse/MESOS-4312
> Project: Mesos
>  Issue Type: Epic
>Reporter: Qian Zhang
>Assignee: Chen Zhiwei
>
> The goal of this ticket is to make IBM Power (ppc64le) a supported 
> hardware platform for Mesos. Currently the latest Mesos code cannot be 
> successfully built on ppc64le; we will resolve the build errors in this 
> ticket, and also make sure the Mesos test suite ("make check") can be run 
> successfully on ppc64le. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4902) Add authentication to libprocess endpoints

2016-04-15 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242554#comment-15242554
 ] 

Greg Mann edited comment on MESOS-4902 at 4/15/16 7:06 AM:
---

Reviews here:
https://reviews.apache.org/r/46258/
https://reviews.apache.org/r/46259/
https://reviews.apache.org/r/46260/
https://reviews.apache.org/r/46261/
https://reviews.apache.org/r/46262/

The above reviews take care of the {{/logging/toggle}} and 
{{/metrics/snapshot}} endpoints. I'll wait to move this ticket to "Reviewable" 
until the rest of the patches are up.


was (Author: greggomann):
Reviews here:
https://reviews.apache.org/r/46258/
https://reviews.apache.org/r/46259/
https://reviews.apache.org/r/46260/
https://reviews.apache.org/r/46261/
https://reviews.apache.org/r/46262/

> Add authentication to libprocess endpoints
> --
>
> Key: MESOS-4902
> URL: https://issues.apache.org/jira/browse/MESOS-4902
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: authentication, http, mesosphere, security
> Fix For: 0.29.0
>
>
> In addition to the endpoints addressed by MESOS-4850 and MESOS-5152, the 
> following endpoints would also benefit from HTTP authentication:
> * {{/profiler/*}}
> * {{/logging/toggle}}
> * {{/metrics/snapshot}}
> * {{/system/stats.json}}
> Adding HTTP authentication to these endpoints is a bit more complicated 
> because they are defined at the libprocess level.
> While working on MESOS-4850, it became apparent that since our tests use the 
> same instance of libprocess for both master and agent, different default 
> authentication realms must be used for master/agent so that HTTP 
> authentication can be independently enabled/disabled for each.
> We should establish a mechanism for making an endpoint authenticated that 
> allows us to:
> 1) Install an endpoint like {{/files}}, whose code is shared by the master 
> and agent, with different authentication realms for the master and agent
> 2) Avoid hard-coding a default authentication realm into libprocess, to 
> permit the use of different authentication realms for the master and agent 
> and to keep application-level concerns from leaking into libprocess
> Another option would be to use a single default authentication realm and 
> always enable or disable HTTP authentication for *both* the master and agent 
> in tests. However, this wouldn't allow us to test scenarios where HTTP 
> authentication is enabled on one but disabled on the other.
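
As a toy illustration of point 1) above, the same endpoint code installed 
under two realms (hypothetical structure, purely for shape):

{code}
// Toy sketch: realm is an install-time parameter, not a libprocess constant.
#include <iostream>
#include <string>
#include <vector>

struct Endpoint
{
  std::string name;
  std::string realm;  // Which set of credentials protects the endpoint.
};

int main()
{
  std::vector<Endpoint> routes;

  // Shared /files code, parameterized on the realm at install time rather
  // than a default realm hard-coded into libprocess.
  routes.push_back({"/files", "mesos-master"});
  routes.push_back({"/files", "mesos-agent"});

  for (const Endpoint& endpoint : routes) {
    std::cout << endpoint.name << " -> " << endpoint.realm << std::endl;
  }
}
{code}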



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4902) Add authentication to libprocess endpoints

2016-04-15 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242554#comment-15242554
 ] 

Greg Mann commented on MESOS-4902:
--

Reviews here:
https://reviews.apache.org/r/46258/
https://reviews.apache.org/r/46259/
https://reviews.apache.org/r/46260/
https://reviews.apache.org/r/46261/
https://reviews.apache.org/r/46262/

> Add authentication to libprocess endpoints
> --
>
> Key: MESOS-4902
> URL: https://issues.apache.org/jira/browse/MESOS-4902
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: authentication, http, mesosphere, security
> Fix For: 0.29.0
>
>
> In addition to the endpoints addressed by MESOS-4850 and MESOS-5152, the 
> following endpoints would also benefit from HTTP authentication:
> * {{/profiler/*}}
> * {{/logging/toggle}}
> * {{/metrics/snapshot}}
> * {{/system/stats.json}}
> Adding HTTP authentication to these endpoints is a bit more complicated 
> because they are defined at the libprocess level.
> While working on MESOS-4850, it became apparent that since our tests use the 
> same instance of libprocess for both master and agent, different default 
> authentication realms must be used for master/agent so that HTTP 
> authentication can be independently enabled/disabled for each.
> We should establish a mechanism for making an endpoint authenticated that 
> allows us to:
> 1) Install an endpoint like {{/files}}, whose code is shared by the master 
> and agent, with different authentication realms for the master and agent
> 2) Avoid hard-coding a default authentication realm into libprocess, to 
> permit the use of different authentication realms for the master and agent 
> and to keep application-level concerns from leaking into libprocess
> Another option would be to use a single default authentication realm and 
> always enable or disable HTTP authentication for *both* the master and agent 
> in tests. However, this wouldn't allow us to test scenarios where HTTP 
> authentication is enabled on one but disabled on the other.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4951) Enable actors to pass an authentication realm to libprocess

2016-04-15 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242551#comment-15242551
 ] 

Greg Mann commented on MESOS-4951:
--

Reviews here:
https://reviews.apache.org/r/46254/
https://reviews.apache.org/r/46255/

> Enable actors to pass an authentication realm to libprocess
> ---
>
> Key: MESOS-4951
> URL: https://issues.apache.org/jira/browse/MESOS-4951
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, slave
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: authentication, http, mesosphere, security
> Fix For: 0.29.0
>
>
> To prepare for MESOS-4902, the Mesos master and agent need a way to pass the 
> desired authentication realm to libprocess. Since some endpoints (like 
> {{/profiler/*}}) get installed in libprocess, the master/agent should be able 
> to specify during initialization what authentication realm the 
> libprocess-level endpoints will be authenticated under.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5223) MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky

2016-04-15 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5223:
--

 Summary: MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky
 Key: MESOS-5223
 URL: https://issues.apache.org/jira/browse/MESOS-5223
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Guangya Liu


{code}
I0415 06:52:22.243783 31906 cluster.cpp:149] Creating default 'local' authorizer
I0415 06:52:22.365927 31906 leveldb.cpp:174] Opened db in 121.715227ms
I0415 06:52:22.413648 31906 leveldb.cpp:181] Compacted db in 47.651756ms
I0415 06:52:22.413713 31906 leveldb.cpp:196] Created db iterator in 25647ns
I0415 06:52:22.413729 31906 leveldb.cpp:202] Seeked to beginning of db in 1890ns
I0415 06:52:22.413741 31906 leveldb.cpp:271] Iterated through 0 keys in the db 
in 317ns
I0415 06:52:22.413800 31906 replica.cpp:779] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0415 06:52:22.414681 31939 recover.cpp:447] Starting replica recovery
I0415 06:52:22.414999 31939 recover.cpp:473] Replica is in EMPTY status
I0415 06:52:22.416792 31939 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from (17242)@172.17.0.2:44024
I0415 06:52:22.417222 31925 recover.cpp:193] Received a recover response from a 
replica in EMPTY status
I0415 06:52:22.417966 31925 recover.cpp:564] Updating replica status to STARTING
I0415 06:52:22.421860 31933 master.cpp:382] Master 
c4bfcab0-cd45-4c65-953a-f810c14806e0 (37d6f4eebe29) started on 172.17.0.2:44024
I0415 06:52:22.421900 31933 master.cpp:384] Flags at startup: --acls="" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate="false" --authenticate_http="true" --authenticate_slaves="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/ImAAfx/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
--quiet="false" --recovery_slave_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_store_timeout="100secs" --registry_strict="true" 
--root_submissions="true" --slave_ping_timeout="15secs" 
--slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" 
--work_dir="/tmp/ImAAfx/master" --zk_session_timeout="10secs"
I0415 06:52:22.422327 31933 master.cpp:435] Master allowing unauthenticated 
frameworks to register
I0415 06:52:22.422339 31933 master.cpp:438] Master only allowing authenticated 
agents to register
I0415 06:52:22.422349 31933 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/ImAAfx/credentials'
I0415 06:52:22.422750 31933 master.cpp:480] Using default 'crammd5' 
authenticator
I0415 06:52:22.422914 31933 master.cpp:551] Using default 'basic' HTTP 
authenticator
I0415 06:52:22.423054 31933 master.cpp:589] Authorization enabled
I0415 06:52:22.423259 31926 hierarchical.cpp:142] Initialized hierarchical 
allocator process
I0415 06:52:22.423327 31926 whitelist_watcher.cpp:77] No whitelist given
I0415 06:52:22.425593 31937 master.cpp:1832] The newly elected leader is 
master@172.17.0.2:44024 with id c4bfcab0-cd45-4c65-953a-f810c14806e0
I0415 06:52:22.425631 31937 master.cpp:1845] Elected as the leading master!
I0415 06:52:22.425650 31937 master.cpp:1532] Recovering from registrar
I0415 06:52:22.425915 31937 registrar.cpp:331] Recovering registrar
I0415 06:52:22.458044 31928 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 39.766176ms
I0415 06:52:22.458093 31928 replica.cpp:320] Persisted replica status to 
STARTING
I0415 06:52:22.458391 31928 recover.cpp:473] Replica is in STARTING status
I0415 06:52:22.459728 31930 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from (17245)@172.17.0.2:44024
I0415 06:52:22.459952 31928 recover.cpp:193] Received a recover response from a 
replica in STARTING status
I0415 06:52:22.460414 31925 recover.cpp:564] Updating replica status to VOTING
I0415 06:52:22.499866 31925 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 39.170393ms
I0415 06:52:22.499905 31925 replica.cpp:320] Persisted replica status to VOTING
I0415 06:52:22.500013 31927 recover.cpp:578] Successfully joined the Paxos group
I0415 06:52:22.500238 31927 recover.cpp:462] Recover process terminated
I0415 06:52:22.500746 31925 log.cpp:659] Attempting to start the writer
I0415 06:52:22.501936 31927 replica.cpp:493] Replica received implicit promise 
request from (17246)@172.17.0.2:44024 with proposal 1
I0415 06:52:22.541733 31927 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 39.750816ms
I0415 06:52:22.541791 31927 replica.cpp:342] Persisted promised