[jira] [Commented] (MESOS-6848) The default executor does not exit if a single task pod fails.

2017-01-10 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816312#comment-15816312
 ] 

Anand Mazumdar commented on MESOS-6848:
---

Aha, thanks for the tip.

> The default executor does not exit if a single task pod fails.
> --
>
> Key: MESOS-6848
> URL: https://issues.apache.org/jira/browse/MESOS-6848
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Blocker
> Fix For: 1.2.0
>
>
> If a task group has a single task and it exits with a non-zero exit code, the 
> default executor does not commit suicide.
> This happens because we invoke {{shutdown()}} in {{waited()}} when we notice 
> the termination of a single container here: 
> https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L666
> but then we return early here after issuing all the kill calls: 
> https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L751
> However, when there is just one task in the task group, {{__shutdown}} is 
> never called, so the executor never commits suicide.





[jira] [Commented] (MESOS-6848) The default executor does not exit if a single task pod fails.

2017-01-10 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816257#comment-15816257
 ] 

Anand Mazumdar commented on MESOS-6848:
---

Keeping the issue open to backport to the 1.1.x branch.

{noformat}
commit 3efcd33f440c7e56c137bfb7cd953ee35e4b3aa5
Author: Anand Mazumdar 
Date:   Tue Jan 10 13:08:03 2017 -0800

Fixed a bug in the default executor around not committing suicide.

This bug is only observed when the task group contains a single task.
The default executor was not committing suicide when this single task
used to exit with a non-zero status code as per the default restart
policy.

Review: https://reviews.apache.org/r/55157/
{noformat}

> The default executor does not exit if a single task pod fails.
> --
>
> Key: MESOS-6848
> URL: https://issues.apache.org/jira/browse/MESOS-6848
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Blocker
>
> If a task group has a single task and it exits with a non-zero exit code, the 
> default executor does not commit suicide.
> This happens because we invoke {{shutdown()}} in {{waited()}} when we notice 
> the termination of a single container here: 
> https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L666
> but then we return early here after issuing all the kill calls: 
> https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L751
> However, when there is just one task in the task group, {{__shutdown}} is 
> never called, so the executor never commits suicide.





[jira] [Updated] (MESOS-6082) Add scheduler Call and Event based metrics to the master.

2017-01-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6082:
--
Shepherd: Anand Mazumdar

> Add scheduler Call and Event based metrics to the master.
> -
>
> Key: MESOS-6082
> URL: https://issues.apache.org/jira/browse/MESOS-6082
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Abhishek Dasgupta
>Priority: Critical
>
> Currently, the master only has metrics for the old-style messages, and these 
> are unfortunately re-used for calls:
> {code}
>   // Messages from schedulers.
>   process::metrics::Counter messages_register_framework;
>   process::metrics::Counter messages_reregister_framework;
>   process::metrics::Counter messages_unregister_framework;
>   process::metrics::Counter messages_deactivate_framework;
>   process::metrics::Counter messages_kill_task;
>   process::metrics::Counter messages_status_update_acknowledgement;
>   process::metrics::Counter messages_resource_request;
>   process::metrics::Counter messages_launch_tasks;
>   process::metrics::Counter messages_decline_offers;
>   process::metrics::Counter messages_revive_offers;
>   process::metrics::Counter messages_suppress_offers;
>   process::metrics::Counter messages_reconcile_tasks;
>   process::metrics::Counter messages_framework_to_executor;
> {code}
> Now that we've introduced the Call/Event based API, we should have metrics 
> that reflect this. For example:
> {code}
> {
>   scheduler/calls: 100
>   scheduler/calls/decline: 90,
>   scheduler/calls/accept: 10,
>   scheduler/calls/accept/operations/create: 1,
>   scheduler/calls/accept/operations/destroy: 0,
>   scheduler/calls/accept/operations/launch: 4,
>   scheduler/calls/accept/operations/launch_group: 2,
>   scheduler/calls/accept/operations/reserve: 1,
>   scheduler/calls/accept/operations/unreserve: 0,
>   scheduler/calls/kill: 0,
>   // etc
> }
> {code}
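For illustration, such per-call counters could be wired up with the existing {{process::metrics}} primitives. A minimal hedged sketch follows; the struct and metric names are illustrative, not the eventual implementation:

{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Hypothetical container for per-Call counters; only three shown.
struct SchedulerCallMetrics
{
  SchedulerCallMetrics()
    : calls("master/scheduler/calls"),
      calls_accept("master/scheduler/calls/accept"),
      calls_decline("master/scheduler/calls/decline")
  {
    process::metrics::add(calls);
    process::metrics::add(calls_accept);
    process::metrics::add(calls_decline);
  }

  ~SchedulerCallMetrics()
  {
    process::metrics::remove(calls);
    process::metrics::remove(calls_accept);
    process::metrics::remove(calls_decline);
  }

  process::metrics::Counter calls;
  process::metrics::Counter calls_accept;
  process::metrics::Counter calls_decline;
};
{code}

Each incoming {{scheduler::Call}} would then increment the aggregate counter plus the counter matching its type.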





[jira] [Updated] (MESOS-3601) Formalize all headers and metadata for HTTP API Event Stream

2017-01-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-3601:
--
Shepherd: Vinod Kone

Link to proposal:  http://bit.ly/2iovQVe

> Formalize all headers and metadata for HTTP API Event Stream
> 
>
> Key: MESOS-3601
> URL: https://issues.apache.org/jira/browse/MESOS-3601
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.24.0
> Environment: Mesos 0.24.0
>Reporter: Ben Whitehead
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: api, http, mesosphere, wireprotocol
>
> From an HTTP standpoint, the current set of headers returned when connecting 
> to the HTTP scheduler API is insufficient. 
> {code:title=current headers}
> HTTP/1.1 200 OK
> Transfer-Encoding: chunked
> Date: Wed, 30 Sep 2015 21:07:16 GMT
> Content-Type: application/json
> {code}
> Since the response from Mesos is intended to function as a stream, 
> {{Connection: keep-alive}} should be specified so that the connection can 
> remain open.
> If RecordIO is going to be applied to the messages, the headers should 
> include the information necessary for a client to be able to detect RecordIO 
> and set up its response handlers appropriately.
> How RecordIO is expressed will come down to the semantics of what is actually 
> "returned" as the response from {{POST /api/v1/scheduler}}.
> h4. Proposal
> One approach would be to leverage http as much as possible, having a client 
> specify an {{Accept-Encoding}} along with the {{Accept}} header to indicate 
> that it can handle RecordIO {{Content-Encoding}} of {{Content-Type}} 
> messages.  (This approach allows for things like gzip to be woven in fairly 
> easily in the future)
> For this approach I would expect the following:
> {code:title=Request}
> POST /api/v1/scheduler HTTP/1.1
> Host: localhost:5050
> Accept: application/x-protobuf
> Accept-Encoding: recordio
> Content-Type: application/x-protobuf
> Content-Length: 35
> User-Agent: RxNetty Client
> {code}
> {code:title=Response}
> HTTP/1.1 200 OK
> Connection: keep-alive
> Transfer-Encoding: chunked
> Content-Type: application/x-protobuf
> Content-Encoding: recordio
> Cache-Control: no-transform
> {code}
> When {{Content-Encoding}} is used, it is recommended to set {{Cache-Control: 
> no-transform}} to signal to any proxies that no transformation should be 
> applied to the content encoding [Section 14.11 RFC 
> 2616|http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11].
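To make the framing concrete: RecordIO length-prefixes each record with its size in bytes followed by a newline. A hedged illustration of a framed event stream body (the payload is shown as JSON purely for readability; the proposal above uses protobuf):

{noformat}
<length-in-bytes>\n<record bytes><length-in-bytes>\n<record bytes>...

e.g.

20\n{"type":"HEARTBEAT"}
{noformat}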





[jira] [Commented] (MESOS-6884) Add a test to verify that scheduler can launch a TTY container

2017-01-06 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806520#comment-15806520
 ] 

Anand Mazumdar commented on MESOS-6884:
---

I don't think so. A related test, {{AttachContainerInput}}, launches a TTY 
container as a nested sub-container. But we don't yet have a test that tries 
to launch a root-level TTY container using the Scheduler API directly (e.g., 
launching vim).
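Such a test would presumably construct a task that requests a TTY. A hedged sketch of the relevant proto wiring, assuming the v1 {{ContainerInfo.TTYInfo}} field (exact field names may differ; resources and agent id are omitted):

{code}
#include <mesos/v1/mesos.hpp>

mesos::v1::TaskInfo task;
task.set_name("tty-task");
task.mutable_task_id()->set_value("tty-task-1");

// Request a pseudo terminal for the container.
task.mutable_container()->set_type(mesos::v1::ContainerInfo::MESOS);
task.mutable_container()->mutable_tty_info();

// An interactive command that requires a TTY.
task.mutable_command()->set_value("vim");
{code}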

> Add a test to verify that scheduler can launch a TTY container
> --
>
> Key: MESOS-6884
> URL: https://issues.apache.org/jira/browse/MESOS-6884
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Anand Mazumdar
>
> [~anandmazumdar] Is this already done?





[jira] [Updated] (MESOS-6864) Container Exec should be possible with tasks belonging to a task group

2017-01-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6864:
--
Target Version/s: 1.2.0
Priority: Blocker  (was: Major)

> Container Exec should be possible with tasks belonging to a task group
> --
>
> Key: MESOS-6864
> URL: https://issues.apache.org/jira/browse/MESOS-6864
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Blocker
>  Labels: debugging, mesosphere
>
> {{LaunchNestedContainerSession}} currently requires the parent container to 
> be an Executor 
> (https://github.com/apache/mesos/blob/f89f28724f5837ff414dc6cc84e1afb63f3306e5/src/slave/http.cpp#L2189-L2211).
> This works for command tasks, because the task container id is the same as 
> the executor container id.
> But it won't work for pod tasks, whose container id is different from the 
> executor's container id.
> In order to resolve this ticket, we need to allow launching a child container 
> at an arbitrary level.





[jira] [Created] (MESOS-6865) Remove the constraint of being only able to launch 2 level nested containers on Agent API

2017-01-05 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6865:
-

 Summary: Remove the constraint of being only able to launch 2 
level nested containers on Agent API
 Key: MESOS-6865
 URL: https://issues.apache.org/jira/browse/MESOS-6865
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar
Priority: Blocker


Currently, the Agent API has a constraint that it _only_ allows two levels of 
nesting. This was done at the time because the containerizer was still being 
worked on to support arbitrary levels of nesting. Now that this work has been 
completed, we should remove the constraint from the API handlers on the agent. 
Note that this constraint also impacts the Debugging API, i.e., a user currently 
can't attach to a task (child container) of a task group, since we explicitly 
check in the API handler that the top-level container belongs to an executor:
https://github.com/apache/mesos/blob/f89f28724f5837ff414dc6cc84e1afb63f3306e5/src/slave/http.cpp#L2189
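A hedged sketch of the kind of depth check described above (simplified; not the literal code at the link):

{code}
#include <mesos/mesos.hpp>

// Accepts only a container nested exactly one level below a top-level
// (executor) container; anything deeper is rejected.
bool isTwoLevelNested(const mesos::ContainerID& containerId)
{
  return containerId.has_parent() && !containerId.parent().has_parent();
}
{code}

Removing the constraint would mean dropping this depth restriction and only validating that the chain of parents resolves to a known container.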





[jira] [Updated] (MESOS-6859) Document HA behavior during mesos master replacement

2017-01-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6859:
--
Labels: newbie  (was: )

> Document HA behavior during mesos master replacement
> 
>
> Key: MESOS-6859
> URL: https://issues.apache.org/jira/browse/MESOS-6859
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation, master
>Reporter: Charles Allen
>  Labels: newbie
>
> In a discussion in https://mesos.slack.com/archives/general/p1483637159001494, 
> the question was raised of when a "new" master is really fully ready.
> Specifically, in the case where new masters can spin up faster than masters 
> can sync their logs, it is unclear from the HA docs at 
> http://mesos.apache.org/documentation/latest/high-availability/ how to ensure 
> a freshly spawned master is ready to take over leadership.
> There is documentation at 
> http://mesos.apache.org/documentation/latest/monitoring/ about using 
> {{registrar/log/recovered}} to gather this kind of information, but such 
> information is very easy to overlook.
> The ask is that the HA docs be amended to include more information about how 
> to use {{registrar/log/recovered}} properly.
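For reference, the metric in question is exposed via the master's {{/metrics/snapshot}} endpoint. A hedged example of checking it (hostname and surrounding output trimmed for illustration):

{noformat}
$ curl -s http://master.example.com:5050/metrics/snapshot
{
  ...
  "registrar/log/recovered": 1.0,
  ...
}
{noformat}

A value of 1 indicates the replicated log has been recovered, which is the readiness signal the HA docs should call out.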





[jira] [Created] (MESOS-6848) The default executor does not exit if a single task pod fails.

2017-01-03 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6848:
-

 Summary: The default executor does not exit if a single task pod 
fails.
 Key: MESOS-6848
 URL: https://issues.apache.org/jira/browse/MESOS-6848
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar
Priority: Blocker


If a task group has a single task and it exits with a non-zero exit code, the 
default executor does not commit suicide.

This happens because we invoke {{shutdown()}} in {{waited()}} when we notice 
the termination of a single container here: 
https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L666

but then we return early here after issuing all the kill calls: 
https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L751

However, when there is just one task in the task group, {{__shutdown}} is 
never called, so the executor never commits suicide.
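A minimal, self-contained sketch of the buggy control flow (not the actual Mesos source; the names and state are simplified stand-ins for the default executor's):

{code}
#include <iostream>
#include <vector>

struct Container { bool terminated; };

// A task group with a single task whose container has already exited.
std::vector<Container> containers = {{true}};

void __shutdown() { std::cout << "executor exits" << std::endl; }

void shutdown()
{
  int killsIssued = 0;

  for (const Container& container : containers) {
    if (!container.terminated) {
      // A real kill would later produce a termination event that
      // eventually drives __shutdown().
      ++killsIssued;
    }
  }

  if (killsIssued > 0) {
    // Correct: pending terminations will drive __shutdown().
    return;
  }

  // BUG: we also fall out here without calling __shutdown(), even though
  // no termination event will ever arrive for the already-dead task.
  // The fix is to invoke __shutdown() directly in this case.
}

int main()
{
  // waited() noticed the lone task terminated and triggers shutdown(),
  // but the executor never commits suicide.
  shutdown();
  return 0;
}
{code}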





[jira] [Updated] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`

2017-01-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6177:
--
Target Version/s: 1.2.0

> Return unregistered agents recovered from registrar in `GetAgents` and/or 
> `/state.json`
> ---
>
> Key: MESOS-6177
> URL: https://issues.apache.org/jira/browse/MESOS-6177
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> Use case:
> This can be used for any software which talks to the Mesos master to better 
> understand the state of an unregistered agent after a master failover.
> If this information is available, the use case in MESOS-6174 can be handled 
> with a simpler decision of whether the corresponding agent has been removed.





[jira] [Updated] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky

2017-01-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6784:
--
Priority: Major  (was: Blocker)

> IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
> 
>
> Key: MESOS-6784
> URL: https://issues.apache.org/jira/browse/MESOS-6784
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Neil Conway
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
> I1212 13:57:02.641043  2211 containerizer.cpp:220] Using isolation: 
> posix/cpu,filesystem/posix,network/cni
> W1212 13:57:02.641438  2211 backend.cpp:76] Failed to create 'overlay' 
> backend: OverlayBackend requires root privileges, but is running as user nrc
> W1212 13:57:02.641559  2211 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1212 13:57:02.642822  2268 containerizer.cpp:594] Recovering containerizer
> I1212 13:57:02.643975  2253 provisioner.cpp:253] Provisioner recovery complete
> I1212 13:57:02.644953  2255 containerizer.cpp:986] Starting container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework
> I1212 13:57:02.647004  2245 switchboard.cpp:430] Allocated pseudo terminal 
> '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.652305  2245 switchboard.cpp:596] Created I/O switchboard 
> server (pid: 2705) listening on socket file 
> '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.655513  2267 launcher.cpp:133] Forked child with pid '2706' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f'
> I1212 13:57:02.655732  2267 containerizer.cpp:1621] Checkpointing container's 
> forked pid 2706 to 
> '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid'
> I1212 13:57:02.726306  2265 containerizer.cpp:2463] Container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f has exited
> I1212 13:57:02.726352  2265 containerizer.cpp:2100] Destroying container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state
> E1212 13:57:02.726495  2243 switchboard.cpp:861] Unexpected termination of 
> I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.726563  2265 launcher.cpp:149] Asked to destroy container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f
> E1212 13:57:02.783607  2228 switchboard.cpp:799] Failed to remove unix domain 
> socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or 
> directory
> ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure
> Value of: wait.get()->reasons().size() == 1
>   Actual: false
> Expected: true
> *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are 
> using GNU date ***
> PC: @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; 
> stack trace: ***
> @ 0x7faecf855100 (unknown)
> @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> @  0x1be6247 testing::internal::AssertHelper::operator=()
> @  0x19ed751 
> mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody()
> @  0x1c0ed8c 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c09e74 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1beb505 testing::Test::Run()
> @  0x1bebc88 testing::TestInfo::Run()
> @  0x1bec2ce testing::TestCase::Run()
> @  0x1bf2ba8 testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c0f9b1 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c0a9f2 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1bf18ee testing::UnitTest::Run()
> @  0x11bc9e3 RUN_ALL_TESTS()
> @  0x11bc599 main
> @ 0x7faece663b15 __libc_start_main
> @   0xa9c219 (unknown)
> {noformat}





[jira] [Commented] (MESOS-6597) Include v1 Operator API protos in generated JAR and python packages.

2016-12-28 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783388#comment-15783388
 ] 

Anand Mazumdar commented on MESOS-6597:
---

Backported to 1.1.x
{noformat}
commit 2342bc3357ada490701cb1f40c5a27c30bc0ba1b
Author: Anand Mazumdar 
Date:   Wed Dec 28 10:00:05 2016 -0800

Added MESOS-6597 to CHANGELOG for 1.1.1.

commit 010c62dc81e3a8861f363db517cc064d190af85e
Author: Vijay Srinivasaraghavan 
Date:   Mon Dec 5 08:58:05 2016 -0800

Enabled python proto generation for v1 Master/Agent API.

The correspondng master/agent protos are now included in the
generated Mesos pypi package.

Review: https://reviews.apache.org/r/54015/

commit 63f31e0cf17e62ff5479d421112a8c80efa954c1
Author: Vijay Srinivasaraghavan 
Date:   Mon Dec 5 08:58:00 2016 -0800

Enabled java protos generation for v1 Master/Agent API.

The corresponding master/agent protos are now included in the
generated Mesos JAR.

Review: https://reviews.apache.org/r/53825/

commit 1e24a39925474a1a9a3113909f4f2496448f5469
Author: Vijay Srinivasaraghavan 
Date:   Mon Dec 5 08:57:55 2016 -0800

Fixed missing protobuf java package/classname definition.

Review: https://reviews.apache.org/r/54014/
{noformat}

> Include v1 Operator API protos in generated JAR and python packages.
> 
>
> Key: MESOS-6597
> URL: https://issues.apache.org/jira/browse/MESOS-6597
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Vijay Srinivasaraghavan
>Assignee: Vijay Srinivasaraghavan
>Priority: Blocker
> Fix For: 1.1.1, 1.2.0
>
>
> For v1 API support, the build file that generates the Java proto wrappers 
> currently includes only the executor and scheduler protos. 
> (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334) 
> To support the operator HTTP API, we also need to generate Java protos for 
> additional proto definitions such as quota, maintenance, etc. These Java 
> definition files will be used by a standard REST client when using the 
> HTTP API directly.
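For context on the last commit above ("Fixed missing protobuf java package/classname definition"), the missing pieces are per-file protobuf options of the following kind. A hedged sketch; the package and class names are purely illustrative:

{code}
// Hypothetical options enabling Java codegen for an operator proto
// (e.g., quota.proto).
option java_package = "org.apache.mesos.v1.quota";
option java_outer_classname = "Protos";
{code}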





[jira] [Updated] (MESOS-6597) Include v1 Operator API protos in generated JAR and python packages.

2016-12-28 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6597:
--
Summary: Include v1 Operator API protos in generated JAR and python 
packages.  (was: Include missing Mesos Java classes for Protobuf files to 
support Operator HTTP V1 API)

> Include v1 Operator API protos in generated JAR and python packages.
> 
>
> Key: MESOS-6597
> URL: https://issues.apache.org/jira/browse/MESOS-6597
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Vijay Srinivasaraghavan
>Assignee: Vijay Srinivasaraghavan
>Priority: Blocker
>
> For v1 API support, the build file that generates the Java proto wrappers 
> currently includes only the executor and scheduler protos. 
> (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334) 
> To support the operator HTTP API, we also need to generate Java protos for 
> additional proto definitions such as quota, maintenance, etc. These Java 
> definition files will be used by a standard REST client when using the 
> HTTP API directly.





[jira] [Comment Edited] (MESOS-6825) Increase default allocation_interval for tests

2016-12-21 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767510#comment-15767510
 ] 

Anand Mazumdar edited comment on MESOS-6825 at 12/21/16 4:54 PM:
-

> This means that if the host running the tests is slow, a test case might 
> receive a resource offer that it doesn't receive when running on a faster 
> host.

Did you mean the case where the "task" in question in the test reaches a 
terminal state? Otherwise, this shouldn't happen if the "used" resources on the 
agent did not change (i.e., the task is still active).


was (Author: anandmazumdar):
> This means that if the host running the tests is slow, a test case might 
> receive a resource offer that it doesn't receive when running on a faster 
> host.

Did you mean the case where the "task" in question in the test finishes? 
Otherwise, this shouldn't happen if the "used" resources on the agent did not 
change (i.e., the task is still active).

> Increase default allocation_interval for tests
> --
>
> Key: MESOS-6825
> URL: https://issues.apache.org/jira/browse/MESOS-6825
> Project: Mesos
>  Issue Type: Improvement
>  Components: tests
>Reporter: Neil Conway
>  Labels: mesosphere
>
> The default {{allocation_interval}} is 1 second. This means that if the host 
> running the tests is slow, a test case might receive a resource offer that it 
> doesn't receive when running on a faster host. We could work around this by 
> explicitly using {{WillRepeatedly(Return())}}, but that is a bit kludgy and 
> obscures the intent of the test.
> One way to avoid this would be to pause the clock by default in tests 
> (MESOS-4101). That would be quite an involved change, however.
> Instead, we could consider raising the default {{allocation_interval}} to a 
> large value, such as 1 minute or longer. This would significantly reduce the 
> chance of host performance causing test flakiness. Moreover, it would help to 
> highlight tests that rely on real-time batch allocations for correctness: 
> generally tests should avoid doing this (because it causes test slowness). 
> They should instead pause the clock and then advance it by 
> {{allocation_interval}}.
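A hedged sketch of the clock-driven pattern suggested in the last paragraph, using libprocess's test {{Clock}} API (the {{masterFlags}} object and the surrounding test scaffolding are assumed):

{code}
#include <process/clock.hpp>

using process::Clock;

// Inside a test body, instead of waiting for a real-time batch allocation:
Clock::pause();

// ... start the master/agent, subscribe the framework ...

// Deterministically trigger the next batch allocation.
Clock::advance(masterFlags.allocation_interval);
Clock::settle();

Clock::resume();
{code}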





[jira] [Commented] (MESOS-6825) Increase default allocation_interval for tests

2016-12-21 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767510#comment-15767510
 ] 

Anand Mazumdar commented on MESOS-6825:
---

> This means that if the host running the tests is slow, a test case might 
> receive a resource offer that it doesn't receive when running on a faster 
> host.

Did you mean the case where the "task" in question in the test finishes? 
Otherwise, this shouldn't happen if the "used" resources on the agent did not 
change (i.e., the task is still active).

> Increase default allocation_interval for tests
> --
>
> Key: MESOS-6825
> URL: https://issues.apache.org/jira/browse/MESOS-6825
> Project: Mesos
>  Issue Type: Improvement
>  Components: tests
>Reporter: Neil Conway
>  Labels: mesosphere
>
> The default {{allocation_interval}} is 1 second. This means that if the host 
> running the tests is slow, a test case might receive a resource offer that it 
> doesn't receive when running on a faster host. We could work around this by 
> explicitly using {{WillRepeatedly(Return())}}, but that is a bit kludgy and 
> obscures the intent of the test.
> One way to avoid this would be to pause the clock by default in tests 
> (MESOS-4101). That would be quite an involved change, however.
> Instead, we could consider raising the default {{allocation_interval}} to a 
> large value, such as 1 minute or longer. This would significantly reduce the 
> chance of host performance causing test flakiness. Moreover, it would help to 
> highlight tests that rely on real-time batch allocations for correctness: 
> generally tests should avoid doing this (because it causes test slowness). 
> They should instead pause the clock and then advance it by 
> {{allocation_interval}}.





[jira] [Updated] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky

2016-12-20 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6823:
--
Labels: flaky flaky-test newbie  (was: flaky flaky-test)

> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 
> is flaky
> --
>
> Key: MESOS-6823
> URL: https://issues.apache.org/jira/browse/MESOS-6823
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 12/14 both with/without SSL
>Reporter: Anand Mazumdar
>  Labels: flaky, flaky-test, newbie
>
> This showed up on our internal CI
> {code}
> [23:13:01] :   [Step 11/11] [ RUN  ] 
> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0
> [23:13:01] :   [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] 
> Creating default 'local' authorizer
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] 
> Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) 
> started on 172.16.10.213:45407
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" 
> --zk_session_timeout="10secs"
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] 
> Master only allowing authenticated agents to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] 
> Loading credentials for authentication from 
> '/mnt/teamcity/temp/buildTmp/ev3icd/credentials'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using 
> default 'crammd5' authenticator
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] 
> Authorization enabled
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654551 25733 
> whitelist_watcher.cpp:77] No whitelist given
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] 
> Initialized hierarchical allocator process
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] 
> Elected as the leading master!
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] 
> Recovering from registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] 
> Recovering registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] 
> Successfully fetched the registry (0B) in 210944ns
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655385 25731 registrar.cpp:461] 
> Applied 1 operations in 5006ns; attempting to update the registry
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655593 25732 registrar.cpp:506] 
> Successfully updated the registry in 194048ns
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655658 25732 registrar.cpp:392] 
> Successfully recovered registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655799 25732 master.cpp:1684] 
> 

[jira] [Commented] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky

2016-12-20 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764797#comment-15764797
 ] 

Anand Mazumdar commented on MESOS-6823:
---

cc: [~kaysoky] [~sivaramsk]

> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 
> is flaky
> --
>
> Key: MESOS-6823
> URL: https://issues.apache.org/jira/browse/MESOS-6823
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 12/14 both with/without SSL
>Reporter: Anand Mazumdar
>  Labels: flaky, flaky-test
>
> This showed up on our internal CI
> {code}
> [23:13:01] :   [Step 11/11] [ RUN  ] 
> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0
> [23:13:01] :   [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] 
> Creating default 'local' authorizer
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] 
> Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) 
> started on 172.16.10.213:45407
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" 
> --zk_session_timeout="10secs"
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] 
> Master only allowing authenticated agents to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] 
> Loading credentials for authentication from 
> '/mnt/teamcity/temp/buildTmp/ev3icd/credentials'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using 
> default 'crammd5' authenticator
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] 
> Authorization enabled
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654551 25733 
> whitelist_watcher.cpp:77] No whitelist given
> [23:13:01] :   [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] 
> Initialized hierarchical allocator process
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] 
> Elected as the leading master!
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] 
> Recovering from registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] 
> Recovering registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] 
> Successfully fetched the registry (0B) in 210944ns
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655385 25731 registrar.cpp:461] 
> Applied 1 operations in 5006ns; attempting to update the registry
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655593 25732 registrar.cpp:506] 
> Successfully updated the registry in 194048ns
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655658 25732 registrar.cpp:392] 
> Successfully recovered registrar
> [23:13:01] :   [Step 11/11] I1219 23:13:01.655799 25732 master.cpp:1684] 
> 

[jira] [Created] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky

2016-12-20 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6823:
-

 Summary: 
bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 
is flaky
 Key: MESOS-6823
 URL: https://issues.apache.org/jira/browse/MESOS-6823
 Project: Mesos
  Issue Type: Bug
 Environment: Ubuntu 12/14 both with/without SSL
Reporter: Anand Mazumdar


This showed up on our internal CI
{code}
[23:13:01] : [Step 11/11] [ RUN  ] 
bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0
[23:13:01] : [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] 
Creating default 'local' authorizer
[23:13:01] : [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] 
Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) 
started on 172.16.10.213:45407
[23:13:01] : [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags 
at startup: --acls="" --agent_ping_timeout="15secs" 
--agent_reregister_timeout="10mins" --allocation_interval="1secs" 
--allocator="HierarchicalDRF" --authenticate_agents="true" 
--authenticate_frameworks="true" --authenticate_http_frameworks="true" 
--authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" 
--zk_session_timeout="10secs"
[23:13:01] : [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] 
Master only allowing authenticated frameworks to register
[23:13:01] : [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] 
Master only allowing authenticated agents to register
[23:13:01] : [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] 
Master only allowing authenticated HTTP frameworks to register
[23:13:01] : [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] 
Loading credentials for authentication from 
'/mnt/teamcity/temp/buildTmp/ev3icd/credentials'
[23:13:01] : [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using 
default 'crammd5' authenticator
[23:13:01] : [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
[23:13:01] : [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
[23:13:01] : [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
[23:13:01] : [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] 
Authorization enabled
[23:13:01] : [Step 11/11] I1219 23:13:01.654551 25733 
whitelist_watcher.cpp:77] No whitelist given
[23:13:01] : [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] 
Initialized hierarchical allocator process
[23:13:01] : [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] 
Elected as the leading master!
[23:13:01] : [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] 
Recovering from registrar
[23:13:01] : [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] 
Recovering registrar
[23:13:01] : [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] 
Successfully fetched the registry (0B) in 210944ns
[23:13:01] : [Step 11/11] I1219 23:13:01.655385 25731 registrar.cpp:461] 
Applied 1 operations in 5006ns; attempting to update the registry
[23:13:01] : [Step 11/11] I1219 23:13:01.655593 25732 registrar.cpp:506] 
Successfully updated the registry in 194048ns
[23:13:01] : [Step 11/11] I1219 23:13:01.655658 25732 registrar.cpp:392] 
Successfully recovered registrar
[23:13:01] : [Step 11/11] I1219 23:13:01.655799 25732 master.cpp:1684] 
Recovered 0 agents from the registry (174B); allowing 10mins for agents to 
re-register
[23:13:01] : [Step 11/11] I1219 23:13:01.655840 25728 hierarchical.cpp:176] 
Skipping recovery of hierarchical allocator: nothing to recover
[23:13:01] : [Step 11/11] I1219 23:13:01.656813 25712 
containerizer.cpp:220] Using isolation: 
posix/cpu,posix/mem,filesystem/posix,network/cni
[23:13:01] : [Step 

[jira] [Commented] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky

2016-12-16 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755184#comment-15755184
 ] 

Anand Mazumdar commented on MESOS-6784:
---

Committed a fix for the test bug behind the second log snippet that Jie posted. 

{noformat}
commit 28eaa8df7c95130b0c244f7613ad506be899cafd
Author: Anand Mazumdar 
Date:   Wed Dec 14 17:40:47 2016 -0800

Fixed the 'IOSwitchboardTest.KillSwitchboardContainerDestroyed' test.

The container was launched with TTY enabled. This meant that
killing the switchboard would trigger the task to terminate
on its own owing to the "master" end of the TTY dying. This
would make it not go through the code path of the isolator
failing due to resource limit issue.

Review: https://reviews.apache.org/r/54770
{noformat}

The original log in the issue description points to a separate issue in the 
switchboard code itself, and I am working on that. This should make the CI 
green for now.

> IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
> 
>
> Key: MESOS-6784
> URL: https://issues.apache.org/jira/browse/MESOS-6784
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Neil Conway
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
> I1212 13:57:02.641043  2211 containerizer.cpp:220] Using isolation: 
> posix/cpu,filesystem/posix,network/cni
> W1212 13:57:02.641438  2211 backend.cpp:76] Failed to create 'overlay' 
> backend: OverlayBackend requires root privileges, but is running as user nrc
> W1212 13:57:02.641559  2211 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1212 13:57:02.642822  2268 containerizer.cpp:594] Recovering containerizer
> I1212 13:57:02.643975  2253 provisioner.cpp:253] Provisioner recovery complete
> I1212 13:57:02.644953  2255 containerizer.cpp:986] Starting container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework
> I1212 13:57:02.647004  2245 switchboard.cpp:430] Allocated pseudo terminal 
> '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.652305  2245 switchboard.cpp:596] Created I/O switchboard 
> server (pid: 2705) listening on socket file 
> '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.655513  2267 launcher.cpp:133] Forked child with pid '2706' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f'
> I1212 13:57:02.655732  2267 containerizer.cpp:1621] Checkpointing container's 
> forked pid 2706 to 
> '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid'
> I1212 13:57:02.726306  2265 containerizer.cpp:2463] Container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f has exited
> I1212 13:57:02.726352  2265 containerizer.cpp:2100] Destroying container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state
> E1212 13:57:02.726495  2243 switchboard.cpp:861] Unexpected termination of 
> I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.726563  2265 launcher.cpp:149] Asked to destroy container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f
> E1212 13:57:02.783607  2228 switchboard.cpp:799] Failed to remove unix domain 
> socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or 
> directory
> ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure
> Value of: wait.get()->reasons().size() == 1
>   Actual: false
> Expected: true
> *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are 
> using GNU date ***
> PC: @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; 
> stack trace: ***
> @ 0x7faecf855100 (unknown)
> @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> @  0x1be6247 testing::internal::AssertHelper::operator=()
> @  0x19ed751 
> mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody()
> @  0x1c0ed8c 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c09e74 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1beb505 testing::Test::Run()
> @  0x1bebc88 testing::TestInfo::Run()
> @  0x1bec2ce testing::TestCase::Run()
> @  0x1bf2ba8 

[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor

2016-12-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752229#comment-15752229
 ] 

Anand Mazumdar commented on MESOS-6801:
---

{{process::loop}} would use the {{pid}} passed to it as the async execution 
context, i.e., it would implicitly {{defer}} to the actor:
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/loop.hpp#L74
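A hedged sketch of that behavior (assuming the {{process::loop}} signature linked above; the actor and I/O details are illustrative):

{code}
#include <process/future.hpp>
#include <process/io.hpp>
#include <process/loop.hpp>
#include <process/process.hpp>

using namespace process;

class PumpProcess : public Process<PumpProcess>
{
public:
  Future<Nothing> pump(int from)
  {
    // Passing `self()` as the first argument makes `loop` run both
    // lambdas in this actor's context (an implicit `defer`), so
    // capturing `this` inside them is safe.
    return loop(
        self(),
        [=]() {
          return io::read(from, buffer, sizeof(buffer));
        },
        [=](size_t length) -> ControlFlow<Nothing> {
          if (length == 0) {
            return Break(); // EOF.
          }
          // Safe to touch `this` here.
          return Continue();
        });
  }

private:
  char buffer[4096];
};
{code}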

> IOSwitchboard::connect installs continuations capturing this without properly 
> deferring/dispatching to an actor
> ---
>
> Key: MESOS-6801
> URL: https://issues.apache.org/jira/browse/MESOS-6801
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: newbie
>
> In the body of {{IOSwitchboard::connect}}, lambdas capturing {{this}} are 
> created and used as callbacks without properly deferring to a libprocess 
> actor.
> {noformat}
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Nothing&) {
>   ^
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Result& record) -> Future {
>   ^
> {noformat}
> Patterns like this can create use-after-free scenarios or introduce data 
> races, which can often be avoided by installing the callbacks via 
> {{defer}}/{{dispatch}} on some process's actor.
> This code should be revisited to remove existing data races.





[jira] [Commented] (MESOS-5795) Add support for Nvidia GPUs in the docker containerizer

2016-12-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751984#comment-15751984
 ] 

Anand Mazumdar commented on MESOS-5795:
---

^^ [~klueska]

> Add support for Nvidia GPUs in the docker containerizer
> ---
>
> Key: MESOS-5795
> URL: https://issues.apache.org/jira/browse/MESOS-5795
> Project: Mesos
>  Issue Type: Epic
>  Components: docker, isolation
>Reporter: Kevin Klues
>  Labels: gpu, mesosphere
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container. This epic tracks that support in the docker 
> containerizer. The Mesos containerizer support has already been completed in 
> MESOS-5401.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/





[jira] [Commented] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky

2016-12-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751954#comment-15751954
 ] 

Anand Mazumdar commented on MESOS-6799:
---

[~greggomann] Can you take a look at why this started failing when SSL is 
enabled?

> Scheme/HTTPTest.Endpoints/0 is flaky
> 
>
> Key: MESOS-6799
> URL: https://issues.apache.org/jira/browse/MESOS-6799
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: Debian 8, gcc-4.9.2, SSL build w/optimizations and debug 
> symbols
>Reporter: Benjamin Bannier
>  Labels: flaky, flaky-test, ssl
>
> Saw {{Scheme/HTTPTest.Endpoints/0}} fail in internal CI with 
> {{812e5e3d4e4d9e044a1cfe6cc7eaab10efb499b6}},
> {noformat}
> [03:26:43] :   [Step 10/11] [ RUN  ] Scheme/HTTPTest.Endpoints/0
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.221824 23530 
> libevent_ssl_socket.cpp:1141] Socket error: 
> error::lib(0):func(0):reason(0)
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448218 23521 openssl.cpp:419] CA 
> file path is unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448226 23521 openssl.cpp:424] CA 
> directory path unspecified! NOTE: Set CA directory path with 
> LIBPROCESS_SSL_CA_DIR=
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448230 23521 openssl.cpp:429] Will 
> not verify peer certificate!
> [03:26:43]W:   [Step 10/11] NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable 
> peer certificate verification
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448231 23521 openssl.cpp:435] Will 
> only verify peer certificate if presented!
> [03:26:43]W:   [Step 10/11] NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to 
> require peer certificate verification
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.449292 23521 process.cpp:1237] 
> libprocess is initialized on 172.16.10.123:58973 with 8 worker threads
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.452320 23871 process.cpp:3679] 
> Handling HTTP event for process '(75)' with path: '/(75)/body'
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.455099 23870 process.cpp:3679] 
> Handling HTTP event for process '(75)' with path: '/(75)/pipe'
> [03:26:43] :   [Step 10/11] 
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:275: Failure
> [03:26:43] :   [Step 10/11] (future).failure(): failed to decode body
> [03:26:43] :   [Step 10/11] [  FAILED  ] Scheme/HTTPTest.Endpoints/0, where 
> GetParam() = "https" (234 ms)
> {noformat}
> I was not able to trigger this failure again in a couple thousand iterations, 
> so there might be some relation to load or other processes running on the 
> system.
> We should figure out when this problem first occurred, as it might be worth 
> backporting a fix (if this isn't just a test error).





[jira] [Updated] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky

2016-12-15 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6799:
--
Priority: Critical  (was: Major)

> Scheme/HTTPTest.Endpoints/0 is flaky
> 
>
> Key: MESOS-6799
> URL: https://issues.apache.org/jira/browse/MESOS-6799
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: Debian 8, gcc-4.9.2, SSL build w/optimizations and debug 
> symbols
>Reporter: Benjamin Bannier
>Priority: Critical
>  Labels: flaky, flaky-test, ssl
>
> Saw {{Scheme/HTTPTest.Endpoints/0}} fail in internal CI with 
> {{812e5e3d4e4d9e044a1cfe6cc7eaab10efb499b6}},
> {noformat}
> [03:26:43] :   [Step 10/11] [ RUN  ] Scheme/HTTPTest.Endpoints/0
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.221824 23530 
> libevent_ssl_socket.cpp:1141] Socket error: 
> error::lib(0):func(0):reason(0)
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448218 23521 openssl.cpp:419] CA 
> file path is unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448226 23521 openssl.cpp:424] CA 
> directory path unspecified! NOTE: Set CA directory path with 
> LIBPROCESS_SSL_CA_DIR=
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448230 23521 openssl.cpp:429] Will 
> not verify peer certificate!
> [03:26:43]W:   [Step 10/11] NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable 
> peer certificate verification
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448231 23521 openssl.cpp:435] Will 
> only verify peer certificate if presented!
> [03:26:43]W:   [Step 10/11] NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to 
> require peer certificate verification
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.449292 23521 process.cpp:1237] 
> libprocess is initialized on 172.16.10.123:58973 with 8 worker threads
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.452320 23871 process.cpp:3679] 
> Handling HTTP event for process '(75)' with path: '/(75)/body'
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.455099 23870 process.cpp:3679] 
> Handling HTTP event for process '(75)' with path: '/(75)/pipe'
> [03:26:43] :   [Step 10/11] 
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:275: Failure
> [03:26:43] :   [Step 10/11] (future).failure(): failed to decode body
> [03:26:43] :   [Step 10/11] [  FAILED  ] Scheme/HTTPTest.Endpoints/0, where 
> GetParam() = "https" (234 ms)
> {noformat}
> I was not able to trigger this failure again in a couple thousand iterations, 
> so there might be some relation to load or other processes running on the 
> system.
> We should figure out when this problem first occurred, as it might be worth 
> backporting a fix (if this isn't just a test error).





[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor

2016-12-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751943#comment-15751943
 ] 

Anand Mazumdar commented on MESOS-6801:
---

In this case, the {{loop}} abstraction ensures that we are delegating to the 
correct actor.

Can we modify the existing script that checks for such errors to make it aware 
of how {{process::loop}} works? I am under the impression that even if we 
capture {{this}} in the lambda, it would still complain due to a missing 
{{defer}}.

> IOSwitchboard::connect installs continuations capturing this without properly 
> deferring/dispatching to an actor
> ---
>
> Key: MESOS-6801
> URL: https://issues.apache.org/jira/browse/MESOS-6801
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: newbie
>
> In the body of {{IOSwitchboard::connect}}, lambdas capturing {{this}} are 
> created and used as callbacks without properly deferring to a libprocess 
> actor.
> {noformat}
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Nothing&) {
>   ^
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Result& record) -> Future {
>   ^
> {noformat}
> Patterns like this can create use-after-free scenarios or introduce data 
> races which can often be avoided by installing the callbacks via 
> {{defer}}/{{dispatch}} on some process' actor.
> This code should be revisited to remove existing data races.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor

2016-12-15 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6801:
--
Labels: newbie  (was: )

> IOSwitchboard::connect installs continuations capturing this without properly 
> deferring/dispatching to an actor
> ---
>
> Key: MESOS-6801
> URL: https://issues.apache.org/jira/browse/MESOS-6801
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: newbie
>
> In the body of {{IOSwitchboard::connect}} lambdas capturing {{this}} are 
> created and used as callbacks without properly deferring to a libprocess 
> actor.
> {noformat}
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Nothing&) {
>   ^
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Result& record) -> Future {
>   ^
> {noformat}
> Patterns like this can create use-after-free scenarios or introduce data 
> races which can often be avoided by installing the callbacks via 
> {{defer}}/{{dispatch}} on some process' actor.
> This code should be revisited to remove existing data races.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6788) Avoid stack overflow when handling streaming responses in API handlers

2016-12-14 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6788:
--
Shepherd:   (was: Benjamin Hindman)
Assignee: Benjamin Hindman

> Avoid stack overflow when handling streaming responses in API handlers
> --
>
> Key: MESOS-6788
> URL: https://issues.apache.org/jira/browse/MESOS-6788
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Benjamin Hindman
> Fix For: 1.2.0
>
>
> Right now, both the `connect()` helper in src/slave/http.cpp and the 
> `transform()` helper in src/common/recordio.hpp use recursion to read data 
> from one pipe and write it to another.
> The way these helpers are written could cause a stack overflow. Ideally we 
> should be able to leverage the new `process::loop` abstraction for that.
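
A minimal sketch of the problematic shape, simplified from the helpers named 
above (this assumes the libprocess {{http::Pipe}} reader/writer types and is 
not the verbatim code):

{code}
// Every chunk schedules the next read via recursion, so a long stream of
// already-satisfied futures can grow the stack without bound.
Future<Nothing> transfer(http::Pipe::Reader reader, http::Pipe::Writer writer)
{
  return reader.read()
    .then([=](const std::string& chunk) mutable -> Future<Nothing> {
      if (chunk.empty()) {  // An empty read signals EOF.
        return Nothing();
      }
      writer.write(chunk);
      return transfer(reader, writer);  // The step `process::loop` replaces.
    });
}
{code}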



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6082) Add scheduler Call and Event based metrics to the master.

2016-12-13 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6082:
--
Target Version/s: 1.2.0

> Add scheduler Call and Event based metrics to the master.
> -
>
> Key: MESOS-6082
> URL: https://issues.apache.org/jira/browse/MESOS-6082
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Abhishek Dasgupta
>
> Currently, the master only has metrics for the old-style messages, and these 
> are unfortunately re-used for calls:
> {code}
>   // Messages from schedulers.
>   process::metrics::Counter messages_register_framework;
>   process::metrics::Counter messages_reregister_framework;
>   process::metrics::Counter messages_unregister_framework;
>   process::metrics::Counter messages_deactivate_framework;
>   process::metrics::Counter messages_kill_task;
>   process::metrics::Counter messages_status_update_acknowledgement;
>   process::metrics::Counter messages_resource_request;
>   process::metrics::Counter messages_launch_tasks;
>   process::metrics::Counter messages_decline_offers;
>   process::metrics::Counter messages_revive_offers;
>   process::metrics::Counter messages_suppress_offers;
>   process::metrics::Counter messages_reconcile_tasks;
>   process::metrics::Counter messages_framework_to_executor;
> {code}
> Now that we've introduced the Call/Event based API, we should have metrics 
> that reflect this. For example:
> {code}
> {
>   scheduler/calls: 100
>   scheduler/calls/decline: 90,
>   scheduler/calls/accept: 10,
>   scheduler/calls/accept/operations/create: 1,
>   scheduler/calls/accept/operations/destroy: 0,
>   scheduler/calls/accept/operations/launch: 4,
>   scheduler/calls/accept/operations/launch_group: 2,
>   scheduler/calls/accept/operations/reserve: 1,
>   scheduler/calls/accept/operations/unreserve: 0,
>   scheduler/calls/kill: 0,
>   // etc
> }
> {code}
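
A hedged sketch of wiring up one such counter (the metric name follows the 
proposal above; the surrounding code is illustrative, not an actual patch):

{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Register a per-call counter with libprocess metrics.
process::metrics::Counter calls_decline("scheduler/calls/decline");
process::metrics::add(calls_decline);

// In the master's call handler, on receiving a DECLINE call:
++calls_decline;
{code}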



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6781) Mesos containerizer overrides environment variables passed to the executor incorrectly.

2016-12-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6781:
--
Target Version/s: 1.2.0

> Mesos containerizer overrides environment variables passed to the executor 
> incorrectly.
> ---
>
> Key: MESOS-6781
> URL: https://issues.apache.org/jira/browse/MESOS-6781
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Assignee: Jie Yu
>Priority: Blocker
>  Labels: mesosphere
>
> Currently, the mesos containerizer appends the default environment variables 
> of the executor _after_ any environment variables that were overridden by an 
> isolator. This is problematic if, e.g., the CNI isolator overrides 
> {{LIBPROCESS_IP}} in an overlay network: the containerizer gives the 
> executor's default environment variables precedence, meaning that the 
> container would end up inheriting the IP address of the agent!
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L1412-L1421
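
A hedged sketch of the ordering bug (illustrative names, not the actual 
containerizer code): whichever environment map is applied last wins, so 
applying the executor defaults after the isolator overrides silently drops 
the overrides.

{code}
#include <map>
#include <string>

std::map<std::string, std::string> environment = isolatorOverrides;

// Wrong order: the executor defaults (e.g. the agent's LIBPROCESS_IP)
// clobber the values the isolator just set.
for (const auto& entry : executorDefaults) {
  environment[entry.first] = entry.second;
}

// Fix: seed `environment` with the defaults first, then apply the isolator
// overrides on top so they take precedence.
{code}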



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6781) Mesos containerizer overrides environment variables passed to the executor incorrectly.

2016-12-12 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6781:
-

 Summary: Mesos containerizer overrides environment variables 
passed to the executor incorrectly.
 Key: MESOS-6781
 URL: https://issues.apache.org/jira/browse/MESOS-6781
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Anand Mazumdar
Assignee: Jie Yu
Priority: Blocker


Currently, the mesos containerizer appends the default environment variables of 
the executor _after_ any environment variables that were overridden by an 
isolator. This is problematic if, e.g., the CNI isolator overrides 
{{LIBPROCESS_IP}} in an overlay network: the containerizer gives the executor's 
default environment variables precedence, meaning that the container would end 
up inheriting the IP address of the agent!

https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L1412-L1421



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6780) ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably

2016-12-12 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15742681#comment-15742681
 ] 

Anand Mazumdar commented on MESOS-6780:
---

cc: [~vinodkone], This might be related to the recent changes to the test for 
supporting TTY.

> ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably
> ---
>
> Key: MESOS-6780
> URL: https://issues.apache.org/jira/browse/MESOS-6780
> Project: Mesos
>  Issue Type: Bug
> Environment: Mac OS 10.12, clang version 4.0.0 
> (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) 
> (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), 
> libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46
>Reporter: Benjamin Bannier
>
> The test {{ContentType/AgentAPIStreamTest.AttachContainerInput}} (both {{/0}} 
> and {{/1}}) fail consistently for me in an SSL-enabled, optimized build.
> {code}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ContentType/AgentAPIStreamingTest
> [ RUN  ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1212 17:11:12.393844 17362944 master.cpp:380] Master 
> c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on 
> 172.18.8.114:51059
> I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master"
>  --zk_session_timeout="10secs"
> I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1212 17:11:12.394691 17362944 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials'
> I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL
> I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled
> I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master!
> I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar
> I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the 
> registry (0B) in 4.131072ms
> I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in 
> 27us; attempting to update the registry
> I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the 
> registry in 4.10496ms
> I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered 
> registrar
> I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the 
> registry (136B); allowing 10mins for agents to re-register
> I1212 17:11:12.422780 3971208128 containerizer.cpp:220] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix
> I1212 17:11:12.424154 3971208128 cluster.cpp:446] 

[jira] [Updated] (MESOS-6769) The server does not close it's end of the connection after returning a response to a streaming request.

2016-12-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6769:
--
Description: 
Consider this scenario:
- The client starts to send a streaming request to the agent with the 
{{Connection: close}} header set. This means that the client is relying on the 
server to close its end of the connection after sending the response.
- The request fails on the server, e.g., due to validation errors. The server 
sends the response but does not close its end of the socket.
- Some client libraries, e.g., Python Requests, rely on the server to close its 
end of the socket after sending the response. Otherwise, the connection just 
hangs on the client when it has no more streaming data to send in such cases.

Libprocess should close its end of the connection after sending the response in 
such cases.

  was:
Consider this scenario:
- The client starts to send a streaming request to the agent with the 
{{Connection: close}} header set. This means that the client is relying on the 
server to close its end of the connection after sending the response.
- The request fails on the server, e.g., due to validation errors. The server 
sends the response but does not close its end of the socket.
- Some client libraries, e.g., Python Requests, rely on the server to close its 
end of the socket after sending the response. Otherwise, the connection just 
hangs on the client when it has no more streaming data to send in such cases.

Libprocess should close its end of the 


> The server does not close it's end of the connection after returning a 
> response to a streaming request.
> ---
>
> Key: MESOS-6769
> URL: https://issues.apache.org/jira/browse/MESOS-6769
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>  Labels: libprocess, mesosphere
>
> Consider this scenario:
> - The client starts to send a streaming request to the agent with the 
> {{Connection: close}} header set. This means that the client is relying on 
> the server to close its end of the connection after sending the response.
> - The request fails on the server, e.g., due to validation errors. The 
> server sends the response but does not close its end of the socket.
> - Some client libraries, e.g., Python Requests, rely on the server to close 
> its end of the socket after sending the response. Otherwise, the connection 
> just hangs on the client when it has no more streaming data to send in such 
> cases.
> Libprocess should close its end of the connection after sending the response 
> in such cases.
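
A minimal sketch of the intended server-side behavior ({{sendResponse}} and 
{{shutdownSend}} are illustrative names, not libprocess APIs; {{keepAlive}} is 
the real {{http::Request}} member derived from the {{Connection}} header):

{code}
Future<Nothing> sent = sendResponse(socket, response);

sent.onAny([=](const Future<Nothing>&) {
  if (!request.keepAlive) {  // The client sent `Connection: close`.
    shutdownSend(socket);    // Close our end so the client sees EOF.
  }
});
{code}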



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6769) The server does not close it's end of the connection after returning a response to a streaming request.

2016-12-09 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6769:
-

 Summary: The server does not close it's end of the connection 
after returning a response to a streaming request.
 Key: MESOS-6769
 URL: https://issues.apache.org/jira/browse/MESOS-6769
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar


Consider this scenario:
- The client starts to send a streaming request to the agent with the 
{{Connection: close}} header set. This means that the client is relying on the 
server to close its end of the connection after sending the response.
- The request fails on the server, e.g., due to validation errors. The server 
sends the response but does not close its end of the socket.
- Some client libraries, e.g., Python Requests, rely on the server to close its 
end of the socket after sending the response. Otherwise, the connection just 
hangs on the client when it has no more streaming data to send in such cases.

Libprocess should close its end of the 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6760) Make the scheduler heartbeat interval configurable

2016-12-08 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6760:
-

 Summary: Make the scheduler heartbeat interval configurable
 Key: MESOS-6760
 URL: https://issues.apache.org/jira/browse/MESOS-6760
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


Currently, the heartbeats sent by the master to the scheduler are hard-coded to 
a default of 15 seconds. We should think about making this value configurable, 
either as a master flag or by letting the scheduler pick an appropriate value 
via the {{Subscribe}} call.

This might be useful for clusters where the default is too frequent, or (the 
rarer case) where an even smaller interval is wanted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6623) Re-enable tests impacted by request streaming support

2016-12-08 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6623:
--
Sprint:   (was: Mesosphere Sprint 47)

> Re-enable tests impacted by request streaming support
> -
>
> Key: MESOS-6623
> URL: https://issues.apache.org/jira/browse/MESOS-6623
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> We added support for HTTP request streaming in libprocess as part of 
> MESOS-6466. However, this broke a few tests that relied on HTTP request 
> filtering since the handlers no longer have access to the body of the request 
> when {{visit()}} is invoked. We would need to revisit how we do HTTP request 
> filtering and then re-enable these tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6752) Add a `post()` overload to libprocess for streaming requests

2016-12-07 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6752:
-

 Summary: Add a `post()` overload to libprocess for streaming 
requests
 Key: MESOS-6752
 URL: https://issues.apache.org/jira/browse/MESOS-6752
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API, libprocess
Reporter: Anand Mazumdar


Currently, the {{post}}/{{streaming::post}} overloads in [libprocess | 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/http.hpp]
 don't work for streaming requests. The {{streaming::post}} overload works only 
for streaming responses. We should add another overload to handle streaming 
requests.
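
A hedged sketch of what the proposed overload might look like from the 
caller's side (the {{Pipe::Reader}}-taking signature below is the proposal, 
not an existing API):

{code}
http::Pipe pipe;
http::Pipe::Writer writer = pipe.writer();

// Proposed: the request body is streamed from the pipe as chunks arrive.
Future<http::Response> response =
  http::post(url, headers, pipe.reader(), "application/json");

writer.write("{\"type\": \"SUBSCRIBE\"}");  // Stream chunks incrementally.
writer.close();                             // Terminates the chunked request.
{code}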



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6746) IOSwitchboard doesn't properly flush data on ATTACH_CONTAINER_OUTPUT

2016-12-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6746:
--
Shepherd: Vinod Kone
  Sprint: Mesosphere Sprint 47

> IOSwitchboard doesn't properly flush data on ATTACH_CONTAINER_OUTPUT
> 
>
> Key: MESOS-6746
> URL: https://issues.apache.org/jira/browse/MESOS-6746
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Klues
>Assignee: Anand Mazumdar
>  Labels: debugging, mesosphere
> Fix For: 1.2.0
>
>
> Currently we close the write end of all connection pipes when we exit the 
> switchboard, but we don't wait until the reader is flushed before exiting. 
> This can cause some data to get dropped, since the process may exit before 
> the reader is flushed. The current code is:
> {noformat}
> void IOSwitchboardServerProcess::finalize()
> {
>   foreach (HttpConnection& connection, outputConnections) {
>     connection.close();
>   }
>
>   if (failure.isSome()) {
>     promise.fail(failure->message);
>   } else {
>     promise.set(Nothing());
>   }
> }
> {noformat}
> We should change it to:
> {noformat}
> void IOSwitchboardServerProcess::finalize()
> {
>   foreach (HttpConnection& connection, outputConnections) {
>     connection.close();
>     connection.closed().await();
>   }
>
>   if (failure.isSome()) {
>     promise.fail(failure->message);
>   } else {
>     promise.set(Nothing());
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6744) DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky

2016-12-07 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729231#comment-15729231
 ] 

Anand Mazumdar commented on MESOS-6744:
---

From the logs, this looks like a separate issue from what we fixed in 
MESOS-6576 (around status update reordering).

> DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky
> ---
>
> Key: MESOS-6744
> URL: https://issues.apache.org/jira/browse/MESOS-6744
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM, amd64.
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This repros consistently for me (~10 test iterations or fewer). Test log:
> {noformat}
> [ RUN  ] DefaultExecutorTest.KillTaskGroupOnTaskFailure
> I1208 03:26:47.461477 28632 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:26:47.462673 28632 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:26:47.463248 28650 recover.cpp:451] Starting replica recovery
> I1208 03:26:47.463537 28650 recover.cpp:477] Replica is in EMPTY status
> I1208 03:26:47.476333 28651 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(64)@10.0.2.15:46643
> I1208 03:26:47.476618 28650 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:26:47.477242 28649 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:26:47.477496 28649 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:26:47.477607 28649 recover.cpp:477] Replica is in STARTING status
> I1208 03:26:47.478910 28653 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(65)@10.0.2.15:46643
> I1208 03:26:47.479385 28651 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:26:47.479717 28647 recover.cpp:568] Updating replica status to VOTING
> I1208 03:26:47.479996 28648 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:26:47.480077 28648 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:26:47.763380 28651 master.cpp:380] Master 
> 0bcb0250-4cf5-4209-92fe-ce260518b50f (archlinux.vagrant.vm) started on 
> 10.0.2.15:46643
> I1208 03:26:47.763463 28651 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/7lpy50/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7lpy50/master" --zk_session_timeout="10secs"
> I1208 03:26:47.764010 28651 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1208 03:26:47.764070 28651 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1208 03:26:47.764076 28651 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1208 03:26:47.764081 28651 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7lpy50/credentials'
> I1208 03:26:47.764482 28651 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1208 03:26:47.764659 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1208 03:26:47.764981 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1208 03:26:47.765136 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1208 03:26:47.765231 28651 master.cpp:584] Authorization enabled
> I1208 03:26:47.768061 28651 master.cpp:2043] Elected as the leading master!
> I1208 03:26:47.768097 28651 master.cpp:1566] Recovering from registrar
> I1208 03:26:47.768766 28648 log.cpp:553] Attempting to start the writer
> I1208 03:26:47.769899 28653 replica.cpp:493] Replica 

[jira] [Updated] (MESOS-6646) StreamingRequestDecoder incompletely initializes its http_parser_settings

2016-12-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6646:
--
Shepherd: Anand Mazumdar

> StreamingRequestDecoder incompletely initializes its http_parser_settings
> -
>
> Key: MESOS-6646
> URL: https://issues.apache.org/jira/browse/MESOS-6646
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: coverity
> Fix For: 1.2.0
>
>
> Coverity reports in CID1394703 at {{3rdparty/libprocess/src/decoder.hpp:767}}:
> {code}
> CID 1394703 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
> 2. uninit_member: Non-static class member field settings.on_status is not 
> initialized in this constructor nor in any functions that it calls.
> {code}
> It seems like {{StreamingRequestDecoder}} should properly initialize its 
> member {{settings}}, e.g., with {{http_parser_settings_init}}.
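
A minimal sketch of the suggested fix (the callback member names are 
illustrative): zero every parser callback first so nothing is left 
uninitialized, then wire up only the callbacks the decoder actually uses.

{code}
StreamingRequestDecoder::StreamingRequestDecoder()
{
  // Zero-initializes all fields of `settings`, including `on_status`.
  http_parser_settings_init(&settings);

  settings.on_message_begin = &StreamingRequestDecoder::on_message_begin;
  settings.on_body = &StreamingRequestDecoder::on_body;
}
{code}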



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6597) Include missing Mesos Java classes for Protobuf files to support Operator HTTP V1 API

2016-12-05 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723128#comment-15723128
 ] 

Anand Mazumdar commented on MESOS-6597:
---

{noformat}
commit 5abda76d697dcc21e64f9037b03c3a15fc434286
Author: Vijay Srinivasaraghavan 
Date:   Mon Dec 5 08:58:05 2016 -0800

Enabled python proto generation for v1 Master/Agent API.

The correspondng master/agent protos are now included in the
generated Mesos pypi package.

Review: https://reviews.apache.org/r/54015/

commit e1ae5cf8030821e1527466e84a0dfe1864406926
Author: Vijay Srinivasaraghavan 
Date:   Mon Dec 5 08:58:00 2016 -0800

Enabled java protos generation for v1 Master/Agent API.

The corresponding master/agent protos are now included in the
generated Mesos JAR.

Review: https://reviews.apache.org/r/53825/

commit 2786ef6e1b7c91ca68ef4c584d8b4316fe2d6a58
Author: Vijay Srinivasaraghavan 
Date:   Mon Dec 5 08:57:55 2016 -0800

Fixed missing protobuf java package/classname definition.

Review: https://reviews.apache.org/r/54014/
{noformat}

> Include missing Mesos Java classes for Protobuf files to support Operator 
> HTTP V1 API
> -
>
> Key: MESOS-6597
> URL: https://issues.apache.org/jira/browse/MESOS-6597
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Vijay Srinivasaraghavan
>Assignee: Vijay Srinivasaraghavan
>Priority: Blocker
>
> For V1 API support, the build file that generates the Java proto wrappers 
> currently includes only the executor and scheduler protos 
> (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334).
> To support the operator HTTP API, we also need to generate Java protos for 
> additional proto definitions like quota, maintenance, etc. These Java 
> definitions will be used by standard REST clients when using the HTTP API 
> directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6597) Include missing Mesos Java classes for Protobuf files to support Operator HTTP V1 API

2016-12-05 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723131#comment-15723131
 ] 

Anand Mazumdar commented on MESOS-6597:
---

Keeping the issue open while I do the back-port to 1.1.x.

> Include missing Mesos Java classes for Protobuf files to support Operator 
> HTTP V1 API
> -
>
> Key: MESOS-6597
> URL: https://issues.apache.org/jira/browse/MESOS-6597
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Vijay Srinivasaraghavan
>Assignee: Vijay Srinivasaraghavan
>Priority: Blocker
>
> For V1 API support, the build file that generates the Java proto wrappers 
> currently includes only the executor and scheduler protos 
> (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334).
> To support the operator HTTP API, we also need to generate Java protos for 
> additional proto definitions like quota, maintenance, etc. These Java 
> definitions will be used by standard REST clients when using the HTTP API 
> directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6467) Build a Container I/O Switchboard

2016-12-02 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15715951#comment-15715951
 ] 

Anand Mazumdar commented on MESOS-6467:
---

{noformat}
commit 2a73d956af1cb0615d4e66de126ab554fdabb0b5
Author: Kevin Klues 
Date:   Fri Dec 2 10:14:45 2016 -0800

Updated the IOSwitchboard http handler to work with streaming requests.

Review: https://reviews.apache.org/r/54296/
{noformat}

> Build a Container I/O Switchboard
> -
>
> Key: MESOS-6467
> URL: https://issues.apache.org/jira/browse/MESOS-6467
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: debugging, mesosphere
> Fix For: 1.2.0
>
>
> In order to facilitate attach operations for a running container, we plan to 
> introduce a new component into Mesos known as an “I/O switchboard”. The goal 
> of this switchboard is to allow external components to *dynamically* 
> interpose on the {{stdin}}, {{stdout}} and {{stderr}} of the init process of 
> a running Mesos container. It will be implemented as a per-container, 
> stand-alone process launched by the mesos containerizer at the time a 
> container is first launched.
> Each per-container switchboard will be responsible for the following:
>  * Accepting a single dynamic request to register an fd for streaming data to 
> the {{stdin}} of a container’s init process.
>  * Accepting *multiple* dynamic requests to register fds for streaming data 
> from the {{stdout}} and {{stderr}} of a container’s init process to those fds.
>  * Allocating a pty for the new process (if requested), and directing data 
> through the master fd of the pty as necessary.
>  * Passing the *actual* set of file descriptors that should be dup’d onto the 
> {{stdin}}, {{stdout}} and {{stderr}} of a container’s init process back to 
> the containerizer. 
> The idea being that the switchboard will maintain three asynchronous loops 
> (one each for {{stdin}}, {{stdout}} and {{stderr}}) that constantly pipe data 
> to/from a container’s init process to/from all of the file descriptors that 
> have been dynamically registered with it.
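
As a rough illustration of one such loop, heavily simplified (the real 
switchboard fans data out to many dynamically registered fds, whereas 
{{process::io::redirect}} handles a single destination; the fd names are 
illustrative):

{code}
// Continuously splice everything written to the container's stdout fd
// into one registered client fd until EOF or failure.
Future<Nothing> pump = process::io::redirect(containerStdoutFd, clientFd);

pump.onAny([=](const Future<Nothing>&) {
  // The stdout stream ended (or failed); unregister the client here.
});
{code}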



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6662) Some HTTP scheduler calls are missing from the docs

2016-12-01 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6662:
--
Labels: apidocs documentation http newbie scheduler  (was: apidocs 
documentation http scheduler)

> Some HTTP scheduler calls are missing from the docs
> ---
>
> Key: MESOS-6662
> URL: https://issues.apache.org/jira/browse/MESOS-6662
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Reporter: Greg Mann
>  Labels: apidocs, documentation, http, newbie, scheduler
>
> Some of the calls available to HTTP schedulers are missing from the HTTP 
> scheduler API documentation. We should make sure that all of the calls 
> available in the {{Master::Http::scheduler}} handler are in the documentation 
> [here|https://github.com/apache/mesos/blob/master/docs/scheduler-http-api.md].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6655) Corrupted HTTP log output

2016-11-30 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6655:
--
Target Version/s: 1.2.0
Priority: Blocker  (was: Major)

> Corrupted HTTP log output
> -
>
> Key: MESOS-6655
> URL: https://issues.apache.org/jira/browse/MESOS-6655
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Priority: Blocker
>  Labels: mesosphere
>
> The master log when running {{make check}} contains lines like:
> {noformat}
> I1130 12:28:56.747364 11811 http.cpp:391] HTTP GET for /master/state from 
> 10.0.49.2:48494n
> {noformat}
> Where {{BD}} is a binary sequence. This can be repro'd by running 
> {{MasterTest.*}} with revision 0d0805e914bc50717237e1246097af1d3b7ba92a; it 
> was presumably introduced by a prior revision, but I didn't triage it exactly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6472) Build support for ATTACH_CONTAINER_INPUT into the Agent API in Mesos

2016-11-30 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-6472:
-

Assignee: Anand Mazumdar  (was: Vinod Kone)

> Build support for ATTACH_CONTAINER_INPUT into the Agent API in Mesos
> 
>
> Key: MESOS-6472
> URL: https://issues.apache.org/jira/browse/MESOS-6472
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Anand Mazumdar
>  Labels: debugging, mesosphere
>
> Coupled with the ATTACH_CONTAINER_OUTPUT call, this call will attach a remote 
> client to the input/output of the entrypoint of a container. All 
> input/output data will be packed into I/O messages and interleaved with 
> control messages sent between a client and the agent. A single chunked 
> request will be used to stream messages to the agent over the input stream, 
> and a single chunked response will be used to stream messages to the client 
> over the output stream.
> This call will integrate with the I/O switchboard to stream data between the 
> container and the HTTP stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6472) Build support for ATTACH_CONTAINER_INPUT into the Agent API in Mesos

2016-11-30 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6472:
--
Shepherd: Vinod Kone

> Build support for ATTACH_CONTAINER_INPUT into the Agent API in Mesos
> 
>
> Key: MESOS-6472
> URL: https://issues.apache.org/jira/browse/MESOS-6472
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Anand Mazumdar
>  Labels: debugging, mesosphere
>
> Coupled with the ATTACH_CONTAINER_OUTPUT call, this call will attach a remote 
> client to the input/output of the entrypoint of a container. All 
> input/output data will be packed into I/O messages and interleaved with 
> control messages sent between a client and the agent. A single chunked 
> request will be used to stream messages to the agent over the input stream, 
> and a single chunked response will be used to stream messages to the client 
> over the output stream.
> This call will integrate with the I/O switchboard to stream data between the 
> container and the HTTP stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5900) Support Unix domain socket connections in libprocess

2016-11-26 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15698185#comment-15698185
 ] 

Anand Mazumdar commented on MESOS-5900:
---

This is in progress and being worked on as part of the epic MESOS-6460. The 
reviews should be out shortly.

> Support Unix domain socket connections in libprocess
> 
>
> Key: MESOS-5900
> URL: https://issues.apache.org/jira/browse/MESOS-5900
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Neil Conway
>Assignee: Benjamin Hindman
>  Labels: mesosphere
>
> We should consider allowing two programs on the same host using libprocess to 
> communicate via Unix domain sockets rather than TCP. This has a few 
> advantages:
> * Security: remote hosts cannot connect to the Unix socket. Domain sockets 
> also offer additional support for 
> [authentication|https://docs.fedoraproject.org/en-US/Fedora_Security_Team/1/html/Defensive_Coding/sect-Defensive_Coding-Authentication-UNIX_Domain.html].
> * Performance: domain sockets are marginally faster than localhost TCP.
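
A generic POSIX sketch (not libprocess code; the socket path is illustrative 
and error handling is omitted) of what domain-socket support means: binding to 
a filesystem path instead of an IP:port.

{code}
#include <sys/socket.h>
#include <sys/un.h>

#include <cstring>

int makeUnixListener()
{
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);

  struct sockaddr_un addr = {};
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, "/var/run/mesos.sock", sizeof(addr.sun_path) - 1);

  bind(fd, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr));
  listen(fd, SOMAXCONN);  // Only processes on this host can connect.

  return fd;
}
{code}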



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4279) Docker executor truncates task's output when the task is killed.

2016-11-23 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-4279:
--
Labels: docker mesosphere won't-backport  (was: docker mesosphere)

> Docker executor truncates task's output when the task is killed.
> 
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2, 0.28.1
>Reporter: Martin Bydzovsky
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: docker, mesosphere, won't-backport
> Fix For: 1.0.0
>
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> ran into the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876, and the folks from 
> Mesosphere concluded that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (during my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Docker executor truncates task's output when the task is killed.

2016-11-23 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691430#comment-15691430
 ] 

Anand Mazumdar commented on MESOS-4279:
---

Ignoring the backport for 0.28.x since it's not straightforward.

> Docker executor truncates task's output when the task is killed.
> 
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2, 0.28.1
>Reporter: Martin Bydzovsky
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: docker, mesosphere
> Fix For: 1.0.0
>
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> ran into the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876, and the folks from 
> Mesosphere concluded that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (during my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5763) Task stuck in fetching is not cleaned up after --executor_registration_timeout.

2016-11-22 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5763:
--
Target Version/s:   (was: 0.28.3)
   Fix Version/s: 0.28.3

Backport for 0.28.x branch
{noformat}
commit 52a0b0a41482da35dc736ec2fd445b6099e7a4e7
Author: Anand Mazumdar 
Date:   Tue Nov 22 20:38:43 2016 -0800

Added MESOS-5763 to 0.28.3 CHANGELOG.

commit 2d61bde81e3d6fb7400ec5f7078ceedd8d2bb802
Author: Jiang Yan Xu 
Date:   Fri Jul 1 18:12:01 2016 -0700

Made Mesos containerizer error messages more consistent.

We've been using slightly different wordings of the same condition in
multiple places in Mesos containerizer but they don't provide
additional information about where this failure is thrown in a long
continuation chain. Since failures don't capture the location in the
code we'd better distinguish them in a more meaningful way to assist
debugging.

Review: https://reviews.apache.org/r/49653

commit d7f8b8558974ee8739d460d53faf54a52832b754
Author: Jiang Yan Xu 
Date:   Fri Jul 1 18:11:29 2016 -0700

Improved Mesos containerizer invariant checking.

One of the reasons for MESOS-5763 is due to the lack invariant
checking. Mesos containerizer transitions the container state in
particular ways so when continuation chains could potentially be
interleaved with other actions we should verify the state transitions.

Review: https://reviews.apache.org/r/49652

commit 008e04433026aaec49779197c4a7b6655d5bb693
Author: Jiang Yan Xu 
Date:   Fri Jul 1 15:25:54 2016 -0700

Improved Mesos containerizer logging and documentation.

Review: https://reviews.apache.org/r/49651

commit 90b5be8e95c5868ea9142625b97050a75d0664f5
Author: Jiang Yan Xu 
Date:   Wed Jul 6 13:48:34 2016 -0700

Fail container launch if it's destroyed during logger->prepare().

Review: https://reviews.apache.org/r/49725

commit 56b4c561e08a8cc36e5cbc3a786981412bf226dd
Author: Jiang Yan Xu 
Date:   Fri Jul 1 15:27:37 2016 -0700

Fixed Mesos containerizer to set container FETCHING state.

If the container state is not properly set to FETCHING, Mesos agent
cannot detect the terminated executor when the fetcher times out.

Review: https://reviews.apache.org/r/49650
{noformat}

> Task stuck in fetching is not cleaned up after 
> --executor_registration_timeout.
> ---
>
> Key: MESOS-5763
> URL: https://issues.apache.org/jira/browse/MESOS-5763
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.0, 1.0.0
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Blocker
> Fix For: 0.28.3, 1.0.0
>
>
> When the fetching process hangs forever due to reasons such as HDFS issues, 
> the Mesos containerizer would attempt to destroy the container and kill the 
> executor after {{--executor_registration_timeout}}. However, this reliably 
> fails for us: the executor would be killed by the launcher destroy and the 
> container would be destroyed, but the agent would never find out that the 
> executor terminated, thus leaving the task in the STAGING state forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6466) Add support for streaming HTTP requests in Mesos

2016-11-21 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685566#comment-15685566
 ] 

Anand Mazumdar commented on MESOS-6466:
---

{noformat}
commit 0ca11623e6deb98dc05ec90b0339960c251cd64e
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:08:44 2016 -0800

Disabled tests relying on filtering HTTP events.

Some tests that rely on filtering HTTP events based on type won't
work now since the request body is not yet known when `visit()`
is invoked. These would be enabled later as part of a separate JIRA
issue.

Review: https://reviews.apache.org/r/53491/

commit 5cd134fedf69900523f3088de32daae81f216437
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:08:30 2016 -0800

Added a test for request streaming and GZIP compression.

These tests validate that clients can stream requests and send
compressed GZIP requests using the connection abstraction. This
also test the implementation of the streaming decoder indirectly.

Review: https://reviews.apache.org/r/53490/

commit a24cb4985c2333e2d15eeb8f971242f1754f81ab
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:08:18 2016 -0800

Added support for request streaming to the connection abstraction.

This required modifications to the `encode()` method to return
a `Pipe::Reader` instead of the request body. The `send()` then
reads from this pipe to send the request via the socket.

Review: https://reviews.apache.org/r/53489/

commit d06d74562740767f0750e10907a327c5b45fef4c
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:08:12 2016 -0800

Removed `convert()` continuations in favor of using `readAll()`.

Review: https://reviews.apache.org/r/53488/

commit d30039ded0a434cdf9583e0a12b73e1b3661380e
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:08:01 2016 -0800

Wired the libprocess code to use the streaming decoder.

Old libprocess style messages and routes not supporting request
streaming read the body from the piped reader. Otherwise, the
request is forwarded to the handler when the route supports
streaming.

Review: https://reviews.apache.org/r/53487/

commit c95e86c37dd82130016abf3a240ebfd869dfed2c
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:07:54 2016 -0800

Parameterized existing decoder tests on the type of decoder.

This allows us to not duplicate tests for the streaming request
decoder.

Review: https://reviews.apache.org/r/53511/

commit dceacc50ce8577b1d0fd5cde0e55dadef1907fdf
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:07:47 2016 -0800

Removed extraneous socket argument from `DataDecoder` constructor.

This argument is not used anywhere in the code. This makes it
consistent with the streaming request decoder.

Review: https://reviews.apache.org/r/53510/

commit 32e203ea194e3531ea8c5ded4d538bacb7cc2781
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:07:41 2016 -0800

Introduced a streaming request decoder in libprocess.

This would become the default de facto decoder used by libprocess
and replace the existing `DataDecoder`.

Review: https://reviews.apache.org/r/53486/

commit e8e3fe596f242767fc10ccb95cbdcd36c49a89a5
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:07:29 2016 -0800

Introduced a `readAll()` helper on `http::Pipe::Reader`.

The helper reads from the pipe till EOF. This is used later to
read BODY requests from the streaming request decoder.

Review: https://reviews.apache.org/r/53485/

commit 5152728e3eeac8d6fac52545d0ebc5df6f2e42cb
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:07:25 2016 -0800

Introduced `RouteOptions` to support streaming requests.

This allows routes to specify configuration options. Currently, it
only has one member `streaming` i.e, if the route supports request
streaming. This also enables us to add more options in the future
without polluting overloads.

Review: https://reviews.apache.org/r/53484/

commit f6286048a5f897ff6859b38a24c3d64aa3b54d01
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:07:20 2016 -0800

Introduced a reader member to `Request` to support request streaming.

These new members are needed for supporting request streaming i.e.,
the caller can use the writer to stream chunks to the server if
the request body is not known in advance.

Review: https://reviews.apache.org/r/53483/

commit 1003f3d208f6e06e8bf485e395190f9bd4e5fe24
Author: Anand Mazumdar 
Date:   Mon Nov 21 18:07:11 2016 -0800

Initialized the POD type in the `Request` struct.

Previously, the `keepAlive` member was not initialized correctly,
the behavior is undefined if POD types are not correctly
initialized.

Review: 

[jira] [Created] (MESOS-6623) Re-enable tests impacted by request streaming support

2016-11-21 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6623:
-

 Summary: Re-enable tests impacted by request streaming support
 Key: MESOS-6623
 URL: https://issues.apache.org/jira/browse/MESOS-6623
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar
Priority: Blocker


We added support for HTTP request streaming in libprocess as part of 
MESOS-6466. However, this broke a few tests that relied on HTTP request 
filtering since the handlers no longer have access to the body of the request 
when {{visit()}} is invoked. We would need to revisit how we do HTTP request 
filtering and then re-enable these tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5662) Call parent class `SetUpTestCase` function in our test fixtures.

2016-11-20 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5662:
--
Description: 
There are some occurrences in our code where we don't invoke the parent's 
{{SetUpTestCase}} method from a child test fixture. This can be a bit 
problematic if the parent class has its own custom {{SetUpTestCase}} logic or 
someone adds one in the future. It would be good to do a sweep across the code 
and explicitly invoke the parent class's method.

Some examples (there are more):
https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L80
https://github.com/apache/mesos/blob/master/src/tests/module_tests.cpp#L59

  was:
There are some occurrences in our code where we don't invoke the parent's 
{{SetUpTestCase}} method from a child test fixture. This can be a bit 
problematic if someone adds the method in the parent class sometime in the 
future. It would be good to do a sweep across the code and explicitly invoke 
the parent class's method.

Some examples (there are more):
https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L80
https://github.com/apache/mesos/blob/master/src/tests/module_tests.cpp#L59


> Call parent class `SetUpTestCase` function in our test fixtures.
> 
>
> Key: MESOS-5662
> URL: https://issues.apache.org/jira/browse/MESOS-5662
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Anand Mazumdar
>Assignee: Manuwela Kanade
>  Labels: mesosphere, newbie
>
> There are some occurrences in our code where we don't invoke the parent's 
> {{SetUpTestCase}} method from a child test fixture. This can be a bit 
> problematic if the parent class has its own custom {{SetUpTestCase}} logic or 
> someone adds one in the future. It would be good to do a sweep across the 
> code and explicitly invoke the parent class's method.
> Some examples (there are more):
> https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L80
> https://github.com/apache/mesos/blob/master/src/tests/module_tests.cpp#L59
> (see the sketch below)
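
A minimal sketch of the convention being asked for ({{ChildTest}} and 
{{ParentTest}} are illustrative fixture names):

{code}
class ChildTest : public ParentTest
{
public:
  static void SetUpTestCase()
  {
    // Chain to the parent first so any inherited setup logic still runs.
    ParentTest::SetUpTestCase();

    // Child-specific one-time setup goes here.
  }
};
{code}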



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6597) Include missing Mesos Java classes for Protobuf files to support Operator HTTP V1 API

2016-11-16 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6597:
--
Shepherd: Anand Mazumdar
Target Version/s: 1.1.1, 1.2.0
Priority: Blocker  (was: Major)

> Include missing Mesos Java classes for Protobuf files to support Operator 
> HTTP V1 API
> -
>
> Key: MESOS-6597
> URL: https://issues.apache.org/jira/browse/MESOS-6597
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Vijay Srinivasaraghavan
>Assignee: Vijay Srinivasaraghavan
>Priority: Blocker
>
> For V1 API support, the build file that generates the Java proto wrappers 
> currently includes only the executor and scheduler protos 
> (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334).
> To support the operator HTTP API, we also need to generate Java protos for 
> additional proto definitions like quota, maintenance, etc. These Java 
> definitions will be used by standard REST clients when using the HTTP API 
> directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6576) DefaultExecutorTest.KillTaskGroupOnTaskFailure sometimes fails in CI

2016-11-10 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655649#comment-15655649
 ] 

Anand Mazumdar commented on MESOS-6576:
---

The root cause is similar to MESOS-6569, i.e., we erroneously expect the 
{{TASK_RUNNING}} updates for different tasks to arrive in order.
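
A hedged sketch of the usual de-flaking fix (variable names are illustrative): 
accept the two updates in either order and compare the set of task ids instead 
of asserting a fixed arrival order.

{code}
AWAIT_READY(runningUpdate1);
AWAIT_READY(runningUpdate2);

std::set<std::string> observed = {
    runningUpdate1->status().task_id().value(),
    runningUpdate2->status().task_id().value()};

std::set<std::string> expected = {
    task1.task_id().value(),
    task2.task_id().value()};

EXPECT_EQ(expected, observed);  // Order-insensitive comparison.
{code}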

> DefaultExecutorTest.KillTaskGroupOnTaskFailure sometimes fails in CI
> 
>
> Key: MESOS-6576
> URL: https://issues.apache.org/jira/browse/MESOS-6576
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: James Peach
>  Labels: flaky-test, mesosphere, newbie
> Attachments: KillTaskGroupOnTaskFailure.failure.log, 
> KillTaskGroupOnTaskFailure.success.log
>
>
> {{DefaultExecutorTest.KillTaskGroupOnTaskFailure}} sometimes fails in the ASF 
> CI.
> Interesting  pieces of the failing test run:
> {noformat}
> ...
> I1110 20:38:54.775871 29740 status_update_manager.cpp:323] Received status 
> update TASK_KILLED (UUID: a4746389-8155-44e0-ada4-00b8d3e997c1) for task 
> df99cc50-9b0f-4692-afc9-d587c3515a67 of framework 
> 2df0125f-4865-4aba-b13d-02f338815729-
> I1110 20:38:54.776181 29730 slave.cpp:4075] Status update manager 
> successfully handled status update TASK_KILLED (UUID: 
> a4746389-8155-44e0-ada4-00b8d3e997c1) for task 
> df99cc50-9b0f-4692-afc9-d587c3515a67 of framework 
> 2df0125f-4865-4aba-b13d-02f338815729-
> I1110 20:38:55.456354 29738 hierarchical.cpp:1880] Filtered offer with 
> cpus(*):1.7; mem(*):928; disk(*):928; ports(*):[31000-32000] on agent 
> 2df0125f-4865-4aba-b13d-02f338815729-S0 for framework 
> 2df0125f-4865-4aba-b13d-02f338815729-
> I1110 20:38:55.456434 29738 hierarchical.cpp:1694] No allocations performed
> I1110 20:38:55.456468 29738 hierarchical.cpp:1789] No inverse offers to send 
> out!
> I1110 20:38:55.456545 29738 hierarchical.cpp:1286] Performed allocation for 1 
> agents in 745185ns
> I1110 20:38:55.875964 29731 containerizer.cpp:2336] Container 
> a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 has exited
> I1110 20:38:55.876022 29731 containerizer.cpp:1973] Destroying container 
> a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 in RUNNING state
> I1110 20:38:55.876387 29731 launcher.cpp:143] Asked to destroy container 
> a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98
> I1110 20:38:55.881464 29728 provisioner.cpp:324] Ignoring destroy request for 
> unknown container a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98
> I1110 20:38:55.882894 29730 slave.cpp:4672] Executor 'default' of framework 
> 2df0125f-4865-4aba-b13d-02f338815729- exited with status 0
> I1110 20:38:55.883446 29741 master.cpp:5884] Executor 'default' of framework 
> 2df0125f-4865-4aba-b13d-02f338815729- on agent 
> 2df0125f-4865-4aba-b13d-02f338815729-S0 at slave(18)@172.17.0.2:36164 
> (ade222407ffe): exited with status 0
> I1110 20:38:55.883545 29741 master.cpp:7840] Removing executor 'default' with 
> resources cpus(*):0.1; mem(*):32; disk(*):32 of framework 
> 2df0125f-4865-4aba-b13d-02f338815729- on agent 
> 2df0125f-4865-4aba-b13d-02f338815729-S0 at slave(18)@172.17.0.2:36164 
> (ade222407ffe)
> I1110 20:38:55.884820 29729 hierarchical.cpp:1018] Recovered cpus(*):0.1; 
> mem(*):32; disk(*):32 (total: cpus(*):2; mem(*):1024; disk(*):1024; 
> ports(*):[31000-32000], allocated: cpus(*):0.2; mem(*):64; disk(*):64) on 
> agent 2df0125f-4865-4aba-b13d-02f338815729-S0 from framework 
> 2df0125f-4865-4aba-b13d-02f338815729-
> I1110 20:38:55.885892 29737 scheduler.cpp:675] Enqueuing event FAILURE 
> received from http://172.17.0.2:36164/master/api/v1/scheduler
> GMOCK WARNING:
> Uninteresting mock function call - returning directly.
> Function call: failure(0x7ffdc4df11f0, @0x2b639800b6b0 48-byte object 
> 90-82 AC-51 63-2B 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 
> 70-0A 01-98 63-2B 00-00 20-C7 00-98 63-2B 00-00 00-00 00-00 63-2B 00-00)
> ...
> I1110 20:39:04.566794 29732 master.cpp:7715] Updating the state of task 
> e72d5139-0a11-48af-9d43-d4163c1404ee of framework 
> 2df0125f-4865-4aba-b13d-02f338815729- (latest state: TASK_FAILED, status 
> update state: TASK_RUNNING)
> ...
> I1110 20:39:04.569413 29736 scheduler.cpp:675] Enqueuing event UPDATE 
> received from http://172.17.0.2:36164/master/api/v1/scheduler
> ../../src/tests/default_executor_tests.cpp:583: Failure
> Value of: taskStates
>   Actual: { (df99cc50-9b0f-4692-afc9-d587c3515a67, TASK_KILLED), 
> (e72d5139-0a11-48af-9d43-d4163c1404ee, TASK_FAILED) }
> Expected: expectedTaskStates
> Which is: { (df99cc50-9b0f-4692-afc9-d587c3515a67, TASK_RUNNING), 
> (e72d5139-0a11-48af-9d43-d4163c1404ee, TASK_RUNNING) }
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6576) DefaultExecutorTest.KillTaskGroupOnTaskFailure sometimes fails in CI

2016-11-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6576:
--
Labels: flaky-test mesosphere newbie  (was: )

> DefaultExecutorTest.KillTaskGroupOnTaskFailure sometimes fails in CI
> 
>
> Key: MESOS-6576
> URL: https://issues.apache.org/jira/browse/MESOS-6576
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: James Peach
>  Labels: flaky-test, mesosphere, newbie
> Attachments: KillTaskGroupOnTaskFailure.failure.log, 
> KillTaskGroupOnTaskFailure.success.log
>
>
> {{DefaultExecutorTest.KillTaskGroupOnTaskFailure}} sometimes fails in the ASF 
> CI.
> Interesting  pieces of the failing test run:
> {noformat}
> ...
> I1110 20:38:54.775871 29740 status_update_manager.cpp:323] Received status 
> update TASK_KILLED (UUID: a4746389-8155-44e0-ada4-00b8d3e997c1) for task 
> df99cc50-9b0f-4692-afc9-d587c3515a67 of framework 
> 2df0125f-4865-4aba-b13d-02f338815729-
> I1110 20:38:54.776181 29730 slave.cpp:4075] Status update manager 
> successfully handled status update TASK_KILLED (UUID: 
> a4746389-8155-44e0-ada4-00b8d3e997c1) for task 
> df99cc50-9b0f-4692-afc9-d587c3515a67 of framework 
> 2df0125f-4865-4aba-b13d-02f338815729-
> I1110 20:38:55.456354 29738 hierarchical.cpp:1880] Filtered offer with 
> cpus(*):1.7; mem(*):928; disk(*):928; ports(*):[31000-32000] on agent 
> 2df0125f-4865-4aba-b13d-02f338815729-S0 for framework 
> 2df0125f-4865-4aba-b13d-02f338815729-
> I1110 20:38:55.456434 29738 hierarchical.cpp:1694] No allocations performed
> I1110 20:38:55.456468 29738 hierarchical.cpp:1789] No inverse offers to send 
> out!
> I1110 20:38:55.456545 29738 hierarchical.cpp:1286] Performed allocation for 1 
> agents in 745185ns
> I1110 20:38:55.875964 29731 containerizer.cpp:2336] Container 
> a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 has exited
> I1110 20:38:55.876022 29731 containerizer.cpp:1973] Destroying container 
> a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 in RUNNING state
> I1110 20:38:55.876387 29731 launcher.cpp:143] Asked to destroy container 
> a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98
> I1110 20:38:55.881464 29728 provisioner.cpp:324] Ignoring destroy request for 
> unknown container a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98
> I1110 20:38:55.882894 29730 slave.cpp:4672] Executor 'default' of framework 
> 2df0125f-4865-4aba-b13d-02f338815729- exited with status 0
> I1110 20:38:55.883446 29741 master.cpp:5884] Executor 'default' of framework 
> 2df0125f-4865-4aba-b13d-02f338815729- on agent 
> 2df0125f-4865-4aba-b13d-02f338815729-S0 at slave(18)@172.17.0.2:36164 
> (ade222407ffe): exited with status 0
> I1110 20:38:55.883545 29741 master.cpp:7840] Removing executor 'default' with 
> resources cpus(*):0.1; mem(*):32; disk(*):32 of framework 
> 2df0125f-4865-4aba-b13d-02f338815729- on agent 
> 2df0125f-4865-4aba-b13d-02f338815729-S0 at slave(18)@172.17.0.2:36164 
> (ade222407ffe)
> I1110 20:38:55.884820 29729 hierarchical.cpp:1018] Recovered cpus(*):0.1; 
> mem(*):32; disk(*):32 (total: cpus(*):2; mem(*):1024; disk(*):1024; 
> ports(*):[31000-32000], allocated: cpus(*):0.2; mem(*):64; disk(*):64) on 
> agent 2df0125f-4865-4aba-b13d-02f338815729-S0 from framework 
> 2df0125f-4865-4aba-b13d-02f338815729-
> I1110 20:38:55.885892 29737 scheduler.cpp:675] Enqueuing event FAILURE 
> received from http://172.17.0.2:36164/master/api/v1/scheduler
> GMOCK WARNING:
> Uninteresting mock function call - returning directly.
> Function call: failure(0x7ffdc4df11f0, @0x2b639800b6b0 48-byte object 
> 90-82 AC-51 63-2B 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 
> 70-0A 01-98 63-2B 00-00 20-C7 00-98 63-2B 00-00 00-00 00-00 63-2B 00-00)
> ...
> I1110 20:39:04.566794 29732 master.cpp:7715] Updating the state of task 
> e72d5139-0a11-48af-9d43-d4163c1404ee of framework 
> 2df0125f-4865-4aba-b13d-02f338815729- (latest state: TASK_FAILED, status 
> update state: TASK_RUNNING)
> ...
> I1110 20:39:04.569413 29736 scheduler.cpp:675] Enqueuing event UPDATE 
> received from http://172.17.0.2:36164/master/api/v1/scheduler
> ../../src/tests/default_executor_tests.cpp:583: Failure
> Value of: taskStates
>   Actual: { (df99cc50-9b0f-4692-afc9-d587c3515a67, TASK_KILLED), 
> (e72d5139-0a11-48af-9d43-d4163c1404ee, TASK_FAILED) }
> Expected: expectedTaskStates
> Which is: { (df99cc50-9b0f-4692-afc9-d587c3515a67, TASK_RUNNING), 
> (e72d5139-0a11-48af-9d43-d4163c1404ee, TASK_RUNNING) }
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6405) Benchmark call ingestion path on the Mesos master.

2016-11-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6405:
--
Sprint: Mesosphere Sprint 45  (was: Mesosphere Sprint 45, Mesosphere Sprint 
46)

> Benchmark call ingestion path on the Mesos master.
> --
>
> Key: MESOS-6405
> URL: https://issues.apache.org/jira/browse/MESOS-6405
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> [~drexin] reported on the user mailing 
> [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E]
>  that there appears to be a significant performance regression on the call 
> ingestion path on the Mesos master with respect to the scheduler driver (v0 API). 
> We should create a benchmark to first get a sense of the numbers and then go 
> about fixing the performance issues. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6569) MesosContainerizer/DefaultExecutorTest.KillTask/0 failing on ASF CI

2016-11-09 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652704#comment-15652704
 ] 

Anand Mazumdar commented on MESOS-6569:
---

Just as we already allow the {{TASK_KILLED}} updates for the tasks to be 
received in any order, we need equivalent logic to ensure that the scheduler 
can receive the {{TASK_RUNNING}} updates for the tasks in any order.

> MesosContainerizer/DefaultExecutorTest.KillTask/0 failing on ASF CI
> ---
>
> Key: MESOS-6569
> URL: https://issues.apache.org/jira/browse/MESOS-6569
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
> Environment: 
> https://builds.apache.org/job/Mesos/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-6)&&(!ubuntu-eu2)/
>Reporter: Yan Xu
>  Labels: flaky, newbie
>
> {noformat:title=}
> [ RUN  ] MesosContainerizer/DefaultExecutorTest.KillTask/0
> I1110 01:20:11.482097 29700 cluster.cpp:158] Creating default 'local' 
> authorizer
> I1110 01:20:11.485241 29700 leveldb.cpp:174] Opened db in 2.774513ms
> I1110 01:20:11.486237 29700 leveldb.cpp:181] Compacted db in 953614ns
> I1110 01:20:11.486299 29700 leveldb.cpp:196] Created db iterator in 24739ns
> I1110 01:20:11.486325 29700 leveldb.cpp:202] Seeked to beginning of db in 
> 2300ns
> I1110 01:20:11.486344 29700 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 378ns
> I1110 01:20:11.486399 29700 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1110 01:20:11.486933 29733 recover.cpp:451] Starting replica recovery
> I1110 01:20:11.487289 29733 recover.cpp:477] Replica is in EMPTY status
> I1110 01:20:11.488503 29721 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(7318)@172.17.0.3:52462
> I1110 01:20:11.488855 29727 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1110 01:20:11.489398 29729 recover.cpp:568] Updating replica status to 
> STARTING
> I1110 01:20:11.490223 29723 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 575135ns
> I1110 01:20:11.490284 29732 master.cpp:380] Master 
> d28fbae1-c3dc-45fa-8384-32ab9395a975 (3a31be8bf679) started on 
> 172.17.0.3:52462
> I1110 01:20:11.490317 29732 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/k50x7x/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/mesos/mesos-1.2.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/k50x7x/master" --zk_session_timeout="10secs"
> I1110 01:20:11.490696 29732 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1110 01:20:11.490712 29732 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1110 01:20:11.490720 29732 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1110 01:20:11.490730 29732 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/k50x7x/credentials'
> I1110 01:20:11.490281 29723 replica.cpp:320] Persisted replica status to 
> STARTING
> I1110 01:20:11.491210 29732 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1110 01:20:11.491225 29720 recover.cpp:477] Replica is in STARTING status
> I1110 01:20:11.491394 29732 http.cpp:895] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1110 01:20:11.491621 29732 http.cpp:895] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1110 01:20:11.491770 29732 http.cpp:895] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1110 01:20:11.491937 29732 master.cpp:584] Authorization enabled
> I1110 

[jira] [Updated] (MESOS-6552) Add ability to filter events on the subscriber stream for Master API.

2016-11-04 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6552:
--
Description: 
Currently, the v1 Master API allows an operator to subscribe to events 
happening on their clusters, e.g., any time a new task is launched/updated. 
However, there is currently no way for a subscriber to express interest in 
only a particular subset of events on the master, e.g., only in agent-related 
events (add/removal).

This would also take care of use cases where a subscriber is short-lived, 
i.e., only interested in seeing whether a particular task has been launched on 
the cluster by the framework, closing its connection thereafter. Currently, 
such subscribers also receive the entire snapshot of the cluster via the 
{{SNAPSHOT}} event, which can be rather huge for production clusters (we also 
don't support compression on the stream yet). Such subscribers would in the 
future be able to opt out of this event.

  was:
Currently, the v1 Master API allows an operator to subscribe to events 
happening on their clusters e.g., any time a new task is launched/updated. 
However,  there is no ability currently for a subscriber to express its 
interest in a particular subset of events on the master e.g, only in task 
add/updated events

This would also take care of use cases where a subscriber would be short lived 
i.e., is only interested to see if a particular task has been launched on the 
cluster by the framework and then close its connection thereafter. Currently, 
such subscribers also receive the entire snapshot of the cluster via the 
{{SNAPSHOT}} events that can be rather huge for production clusters (we also 
don't support compression on the stream yet). Such subscribers in the future 
would be able to opt out of this event.


> Add ability to filter events on the subscriber stream for Master API.
> -
>
> Key: MESOS-6552
> URL: https://issues.apache.org/jira/browse/MESOS-6552
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> Currently, the v1 Master API allows an operator to subscribe to events 
> happening on their clusters, e.g., any time a new task is launched/updated. 
> However, there is currently no way for a subscriber to express interest in 
> only a particular subset of events on the master, e.g., only in 
> agent-related events (add/removal).
> This would also take care of use cases where a subscriber is short-lived, 
> i.e., only interested in seeing whether a particular task has been launched 
> on the cluster by the framework, closing its connection thereafter. 
> Currently, such subscribers also receive the entire snapshot of the cluster 
> via the {{SNAPSHOT}} event, which can be rather huge for production clusters 
> (we also don't support compression on the stream yet). Such subscribers 
> would in the future be able to opt out of this event.
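To make the proposal concrete, a sketch of a subscriber-specified filter (the types and field names are hypothetical; no such filter exists in the API yet):

{code}
#include <set>

// Hypothetical enum mirroring the kinds of v1 master events.
enum class EventType
{
  SUBSCRIBED, TASK_ADDED, TASK_UPDATED, AGENT_ADDED, AGENT_REMOVED
};

struct Event
{
  EventType type;
  // Payload elided.
};

// The subscriber names the event types it wants on SUBSCRIBE; the
// master would drop everything else from that subscriber's stream.
struct EventFilter
{
  std::set<EventType> types;

  bool wants(const Event& event) const
  {
    // An empty filter preserves today's behavior: send everything.
    return types.empty() || types.count(event.type) > 0;
  }
};
{code}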



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6552) Add ability to filter events on the subscriber stream for Master API.

2016-11-04 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6552:
-

 Summary: Add ability to filter events on the subscriber stream for 
Master API.
 Key: MESOS-6552
 URL: https://issues.apache.org/jira/browse/MESOS-6552
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


Currently, the v1 Master API allows an operator to subscribe to events 
happening on their clusters, e.g., any time a new task is launched/updated. 
However, there is currently no way for a subscriber to express interest in 
only a particular subset of events on the master, e.g., only in task 
add/updated events.

This would also take care of use cases where a subscriber is short-lived, 
i.e., only interested in seeing whether a particular task has been launched on 
the cluster by the framework, closing its connection thereafter. Currently, 
such subscribers also receive the entire snapshot of the cluster via the 
{{SNAPSHOT}} event, which can be rather huge for production clusters (we also 
don't support compression on the stream yet). Such subscribers would in the 
future be able to opt out of this event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6466) Add support for streaming HTTP requests in Mesos

2016-11-04 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637168#comment-15637168
 ] 

Anand Mazumdar edited comment on MESOS-6466 at 11/4/16 8:25 PM:


Review Chain: https://reviews.apache.org/r/53481/

Currently still left:
- Fix all tests relying on filtering HTTP events (see r53491 for more details)
- Parameterize the existing decoder tests thereby making them also work for the 
streaming decoder.
- Include Ben's fix for streaming gzip decompression (MESOS-6530)


was (Author: anandmazumdar):
Review Chain: https://reviews.apache.org/r/53481/

Currently still left:
- Fix all tests relying on filtering HTTP events (see r53491 for more details)
- Parameterize the existing decoder tests thereby making them also work for the 
streaming decoder.

> Add support for streaming HTTP requests in Mesos
> 
>
> Key: MESOS-6466
> URL: https://issues.apache.org/jira/browse/MESOS-6466
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Anand Mazumdar
>  Labels: debugging, mesosphere
>
> We already have support for streaming HTTP responses in Mesos. We now also 
> need to add support for streaming HTTP requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5763) Task stuck in fetching is not cleaned up after --executor_registration_timeout.

2016-11-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5763:
--
Target Version/s: 0.28.3
   Fix Version/s: (was: 0.27.4)
  (was: 0.28.3)

> Task stuck in fetching is not cleaned up after 
> --executor_registration_timeout.
> ---
>
> Key: MESOS-5763
> URL: https://issues.apache.org/jira/browse/MESOS-5763
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.0, 1.0.0
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When the fetching process hangs forever due to reasons such as HDFS issues, 
> Mesos containerizer would attempt to destroy the container and kill the 
> executor after {{--executor_registration_timeout}}. However this reliably 
> fails for us: the executor would be killed by the launcher destroy and the 
> container would be destroyed but the agent would never find out that the 
> executor is terminated thus leaving the task in the STAGING state forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6391) Command task's sandbox should not be owned by root if it uses container image.

2016-11-03 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6391:
--
Target Version/s: 1.0.2, 1.1.0  (was: 0.28.3, 1.0.2, 1.1.0)

Removing 0.28.3 from the target versions since it's not a trivial backport. 
cc: [~jieyu]

> Command task's sandbox should not be owned by root if it uses container image.
> --
>
> Key: MESOS-6391
> URL: https://issues.apache.org/jira/browse/MESOS-6391
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.2, 1.0.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Blocker
> Fix For: 1.0.2, 1.1.0
>
>
> Currently, if the task defines a container image, the command executor will 
> be run as root because it needs to perform pivot_root.
> That means that if the task wants to run as an unprivileged user, the 
> sandbox of that task will not be writable by that user because it's owned 
> by root.
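One plausible shape of a fix, sketched here with stout's {{os::chown}} helper (a sketch, not the actual patch): hand the sandbox over to the task's user while the executor still runs as root.

{code}
#include <string>

#include <stout/nothing.hpp>
#include <stout/os.hpp>
#include <stout/try.hpp>

// Sketch: the command executor must stay root to pivot_root, but the
// sandbox should end up owned by the (unprivileged) task user.
Try<Nothing> prepareSandbox(
    const std::string& user,     // The task's 'user' field.
    const std::string& sandbox)  // The task's sandbox directory.
{
  // Recursively chown the sandbox so the task user can write to it.
  return os::chown(user, sandbox, true);
}
{code}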



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6527) Memory leak in the libprocess request decoder.

2016-11-02 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627944#comment-15627944
 ] 

Anand Mazumdar commented on MESOS-6527:
---

0.28.x backport
{noformat}
commit 4033b37087056c63bc9b90969288ad5a9fa7f4ff
Author: Anand Mazumdar 
Date:   Tue Nov 1 21:32:55 2016 -0700

Fixed memory leak in request/response decoders.

The leak can happen in cases where a client disconnects while the
request/response is in progress.

Review: https://reviews.apache.org/r/53361/

commit 94cdfd01cebdcc8c2ecc52dc9d402fa6191aad87
Author: Anand Mazumdar 
Date:   Tue Nov 1 21:38:42 2016 -0700

Added MESOS-6527 to CHANGELOG for 0.28.3.
{noformat}

1.0.2 backport
{noformat}
commit 07a4a242d7e722840c63e9b0d6a444ad5e6b1ec3
Author: Anand Mazumdar 
Date:   Tue Nov 1 21:32:55 2016 -0700

Fixed memory leak in request/response decoders.

The leak can happen in cases where a client disconnects while the
request/response is in progress.

Review: https://reviews.apache.org/r/53361/

commit 9a218e3edf3d9fac0a83817d26ad689cf7d53f05
Author: Anand Mazumdar 
Date:   Tue Nov 1 21:37:39 2016 -0700

Added MESOS-6527 to CHANGELOG for 1.0.2.
{noformat}

1.1.0 branch
{noformat}
commit eaec806adefa1206c242e0409c6022a3bc115f6d
Author: Anand Mazumdar 
Date:   Tue Nov 1 21:32:55 2016 -0700

Fixed memory leak in request/response decoders.

The leak can happen in cases where a client disconnects while the
request/response is in progress.

Review: https://reviews.apache.org/r/53361/

commit 4f655fd98c91b8e72dca4f4c7c5faf024a78d763
Author: Anand Mazumdar 
Date:   Tue Nov 1 21:36:29 2016 -0700

Added MESOS-6527 to CHANGELOG for 1.1.0.
{noformat}

> Memory leak in the libprocess request decoder.
> --
>
> Key: MESOS-6527
> URL: https://issues.apache.org/jira/browse/MESOS-6527
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 0.28.3, 1.0.2, 1.1.0, 1.2.0
>
>
> The libprocess decoder can leak a {{Request}} object in cases when a client 
> disconnects while the request is in progress. In such cases, the decoder's 
> destructor won't delete the active {{Request}} object that it had allocated 
> on the heap.
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/decoder.hpp#L271



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6527) Memory leak in the libprocess request decoder.

2016-11-02 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6527:
--
Story Points: 2  (was: 1)

> Memory leak in the libprocess request decoder.
> --
>
> Key: MESOS-6527
> URL: https://issues.apache.org/jira/browse/MESOS-6527
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 0.28.3, 1.0.2, 1.1.0, 1.2.0
>
>
> The libprocess decoder can leak a {{Request}} object in cases when a client 
> disconnects while the request is in progress. In such cases, the decoder's 
> destructor won't delete the active {{Request}} object that it had allocated 
> on the heap.
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/decoder.hpp#L271



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6527) Memory leak in the libprocess request decoder.

2016-11-01 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6527:
-

 Summary: Memory leak in the libprocess request decoder.
 Key: MESOS-6527
 URL: https://issues.apache.org/jira/browse/MESOS-6527
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar
Priority: Blocker


The libprocess decoder can leak a {{Request}} object in cases when a client 
disconnects while the request is in progress. In such cases, the decoder's 
destructor won't delete the active {{Request}} object that it had allocated on 
the heap.

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/decoder.hpp#L271
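The shape of the bug (and of the fix) in a self-contained sketch, not the actual libprocess code:

{code}
// Sketch only; the real decoder lives in
// 3rdparty/libprocess/src/decoder.hpp.
struct Request
{
  // Method, URL, headers, body, etc. elided.
};

class RequestDecoder
{
public:
  ~RequestDecoder()
  {
    // The missing cleanup: reclaim an in-progress request. Without
    // this, a client disconnecting mid-request leaks the allocation.
    delete request;  // 'delete nullptr' is a safe no-op.
  }

private:
  // Allocated when an HTTP message begins parsing; ownership is
  // handed off only once the request completes.
  Request* request = nullptr;
};
{code}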



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6466) Add support for streaming HTTP requests in Mesos

2016-10-30 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6466:
--
Shepherd: Benjamin Mahler

> Add support for streaming HTTP requests in Mesos
> 
>
> Key: MESOS-6466
> URL: https://issues.apache.org/jira/browse/MESOS-6466
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Anand Mazumdar
>  Labels: debugging, mesosphere
>
> We already have support for streaming HTTP responses in Mesos. We now also 
> need to add support for streaming HTTP requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6466) Add support for streaming HTTP requests in Mesos

2016-10-30 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-6466:
-

Assignee: Anand Mazumdar  (was: Kevin Klues)

> Add support for streaming HTTP requests in Mesos
> 
>
> Key: MESOS-6466
> URL: https://issues.apache.org/jira/browse/MESOS-6466
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Anand Mazumdar
>  Labels: debugging, mesosphere
>
> We already have support for streaming HTTP responses in Mesos. We now also 
> need to add support for streaming HTTP requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6507) 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails consistently.

2016-10-28 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15617270#comment-15617270
 ] 

Anand Mazumdar edited comment on MESOS-6507 at 10/29/16 2:53 AM:
-

It was an oversight on my part; I had forgotten to backport Ben's UUID 
patches. This should unblock the 1.0.2 release. 

{noformat}
commit 9e0f9505bae40a5f803d9a3eebfebe62287fbe91
Author: Benjamin Mahler 
Date:   Tue Sep 20 14:17:30 2016 -0700

Updated scheduler library to handle UUID parsing error.

Previously this would have thrown an exception.

Review: https://reviews.apache.org/r/52099

commit 4e4d058ea3c012b2e6d4bbed58ef7fbaea5b60fb
Author: Benjamin Mahler 
Date:   Tue Sep 20 14:14:39 2016 -0700

Updated UUID::fromString to not throw an exception on error.

The exception from the string_generator needs to be caught so
that we can surface a Try to the caller.

Review: https://reviews.apache.org/r/52098
{noformat}


was (Author: anandmazumdar):
It was an oversight on my part, had forgotten backporting Ben's UUID patches. 
This should unblock the 1.0.2 release. However, we still need to fix the test 
flakiness on HEAD. I would update the JIRA with the flaky test log.

{noformat}
commit 9e0f9505bae40a5f803d9a3eebfebe62287fbe91
Author: Benjamin Mahler 
Date:   Tue Sep 20 14:17:30 2016 -0700

Updated scheduler library to handle UUID parsing error.

Previously this would have thrown an exception.

Review: https://reviews.apache.org/r/52099

commit 4e4d058ea3c012b2e6d4bbed58ef7fbaea5b60fb
Author: Benjamin Mahler 
Date:   Tue Sep 20 14:14:39 2016 -0700

Updated UUID::fromString to not throw an exception on error.

The exception from the string_generator needs to be caught so
that we can surface a Try to the caller.

Review: https://reviews.apache.org/r/52098
{noformat}

> 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails 
> consistently.
> --
>
> Key: MESOS-6507
> URL: https://issues.apache.org/jira/browse/MESOS-6507
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, test
>Reporter: Gilbert Song
>Priority: Blocker
>  Labels: failure
>
> Here is the log:
> {noformat}
> [23:09:24] :   [Step 10/10] [ RUN  ] 
> DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.304638 31435 docker.cpp:933] 
> Running docker -H unix:///var/run/docker.sock rm -f -v mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.398941 31435 resources.cpp:572] 
> Parsing resources as JSON failed: cpus:1;mem:512
> [23:09:24] :   [Step 10/10] Trying semicolon-delimited string format instead
> [23:09:24] :   [Step 10/10] I1028 23:09:24.399123 31435 docker.cpp:809] 
> Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 
> 536870912 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e 
> MESOS_CONTAINER_NAME=mesos-s1.malformedUUID -v 
> /mnt/teamcity/temp/buildTmp/DockerContainerizerTest_ROOT_DOCKER_SkipRecoverMalformedUUID_rjDyqa:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name mesos-s1.malformedUUID alpine -c 
> sleep 1000
> [23:09:24] :   [Step 10/10] I1028 23:09:24.401227 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.700460 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.804401 31453 docker.cpp:785] 
> Recovering Docker containers
> [23:09:24] :   [Step 10/10] I1028 23:09:24.804477 31453 docker.cpp:1091] 
> Running docker -H unix:///var/run/docker.sock ps -a
> [23:09:24] :   [Step 10/10] I1028 23:09:24.905027 31454 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:25] :   [Step 10/10] W1028 23:09:25.008965 31454 docker.cpp:838] 
> Skipping recovery of executor '' of framework '' because its latest run could 
> not be recovered
> [23:09:25] :   [Step 10/10] I1028 23:09:25.008996 31454 docker.cpp:957] 
> Checking if Docker container named '/mesos-s1.malformedUUID' was started by 
> Mesos
> [23:09:25] :   [Step 10/10] I1028 23:09:25.009019 31454 docker.cpp:967] 
> Checking if Mesos container with ID 'malformedUUID' has been orphaned
> [23:09:25] :   [Step 10/10] I1028 23:09:25.009052 31454 docker.cpp:860] 
> Running docker -H unix:///var/run/docker.sock stop -t 0 
> 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747
> [23:09:25] :   [Step 10/10] I1028 23:09:25.109345 31451 docker.cpp:933] 
> Running docker -H unix:///var/run/docker.sock rm -v 
> 

[jira] [Updated] (MESOS-6507) 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails consistently.

2016-10-28 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6507:
--
Target Version/s:   (was: 1.0.2)

> 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails 
> consistently.
> --
>
> Key: MESOS-6507
> URL: https://issues.apache.org/jira/browse/MESOS-6507
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, test
>Reporter: Gilbert Song
>Priority: Blocker
>  Labels: failure
>
> Here is the log:
> {noformat}
> [23:09:24] :   [Step 10/10] [ RUN  ] 
> DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.304638 31435 docker.cpp:933] 
> Running docker -H unix:///var/run/docker.sock rm -f -v mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.398941 31435 resources.cpp:572] 
> Parsing resources as JSON failed: cpus:1;mem:512
> [23:09:24] :   [Step 10/10] Trying semicolon-delimited string format instead
> [23:09:24] :   [Step 10/10] I1028 23:09:24.399123 31435 docker.cpp:809] 
> Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 
> 536870912 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e 
> MESOS_CONTAINER_NAME=mesos-s1.malformedUUID -v 
> /mnt/teamcity/temp/buildTmp/DockerContainerizerTest_ROOT_DOCKER_SkipRecoverMalformedUUID_rjDyqa:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name mesos-s1.malformedUUID alpine -c 
> sleep 1000
> [23:09:24] :   [Step 10/10] I1028 23:09:24.401227 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.700460 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.804401 31453 docker.cpp:785] 
> Recovering Docker containers
> [23:09:24] :   [Step 10/10] I1028 23:09:24.804477 31453 docker.cpp:1091] 
> Running docker -H unix:///var/run/docker.sock ps -a
> [23:09:24] :   [Step 10/10] I1028 23:09:24.905027 31454 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:25] :   [Step 10/10] W1028 23:09:25.008965 31454 docker.cpp:838] 
> Skipping recovery of executor '' of framework '' because its latest run could 
> not be recovered
> [23:09:25] :   [Step 10/10] I1028 23:09:25.008996 31454 docker.cpp:957] 
> Checking if Docker container named '/mesos-s1.malformedUUID' was started by 
> Mesos
> [23:09:25] :   [Step 10/10] I1028 23:09:25.009019 31454 docker.cpp:967] 
> Checking if Mesos container with ID 'malformedUUID' has been orphaned
> [23:09:25] :   [Step 10/10] I1028 23:09:25.009052 31454 docker.cpp:860] 
> Running docker -H unix:///var/run/docker.sock stop -t 0 
> 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747
> [23:09:25] :   [Step 10/10] I1028 23:09:25.109345 31451 docker.cpp:933] 
> Running docker -H unix:///var/run/docker.sock rm -v 
> 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747
> [23:09:25] :   [Step 10/10] I1028 23:09:25.212870 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:25] :   [Step 10/10] I1028 23:09:25.513255 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:25] :   [Step 10/10] I1028 23:09:25.815946 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:26] :   [Step 10/10] I1028 23:09:26.119107 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:26] :   [Step 10/10] I1028 23:09:26.421722 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:26] :   [Step 10/10] I1028 23:09:26.724777 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:27] :   [Step 10/10] I1028 23:09:27.028252 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:27] :   [Step 10/10] I1028 23:09:27.331799 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:27] :   [Step 10/10] I1028 23:09:27.634660 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:27] :   [Step 10/10] I1028 23:09:27.938190 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:28] :   [Step 10/10] I1028 23:09:28.241756 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:28] :   [Step 10/10] I1028 

[jira] [Commented] (MESOS-6507) 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails consistently.

2016-10-28 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15617270#comment-15617270
 ] 

Anand Mazumdar commented on MESOS-6507:
---

It was an oversight on my part; I had forgotten to backport Ben's UUID 
patches. This should unblock the 1.0.2 release. However, we still need to fix 
the test flakiness on HEAD. I will update the JIRA with the flaky test log.

{noformat}
commit 9e0f9505bae40a5f803d9a3eebfebe62287fbe91
Author: Benjamin Mahler 
Date:   Tue Sep 20 14:17:30 2016 -0700

Updated scheduler library to handle UUID parsing error.

Previously this would have thrown an exception.

Review: https://reviews.apache.org/r/52099

commit 4e4d058ea3c012b2e6d4bbed58ef7fbaea5b60fb
Author: Benjamin Mahler 
Date:   Tue Sep 20 14:14:39 2016 -0700

Updated UUID::fromString to not throw an exception on error.

The exception from the string_generator needs to be caught so
that we can surface a Try to the caller.

Review: https://reviews.apache.org/r/52098
{noformat}
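A sketch of the non-throwing conversion the second patch describes (the real implementation lives in stout's UUID wrapper; names here are illustrative):

{code}
#include <exception>
#include <string>

#include <boost/uuid/string_generator.hpp>
#include <boost/uuid/uuid.hpp>

#include <stout/error.hpp>
#include <stout/try.hpp>

// Catch the string_generator's exception and surface a Try instead.
inline Try<boost::uuids::uuid> fromString(const std::string& s)
{
  try {
    return boost::uuids::string_generator()(s);
  } catch (const std::exception& e) {
    // Malformed input (such as the 'malformedUUID' container name in
    // this test) becomes an Error rather than an uncaught exception.
    return Error(e.what());
  }
}
{code}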

> 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails 
> consistently.
> --
>
> Key: MESOS-6507
> URL: https://issues.apache.org/jira/browse/MESOS-6507
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, test
>Reporter: Gilbert Song
>Priority: Blocker
>  Labels: failure
>
> Here is the log:
> {noformat}
> [23:09:24] :   [Step 10/10] [ RUN  ] 
> DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.304638 31435 docker.cpp:933] 
> Running docker -H unix:///var/run/docker.sock rm -f -v mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.398941 31435 resources.cpp:572] 
> Parsing resources as JSON failed: cpus:1;mem:512
> [23:09:24] :   [Step 10/10] Trying semicolon-delimited string format instead
> [23:09:24] :   [Step 10/10] I1028 23:09:24.399123 31435 docker.cpp:809] 
> Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 
> 536870912 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e 
> MESOS_CONTAINER_NAME=mesos-s1.malformedUUID -v 
> /mnt/teamcity/temp/buildTmp/DockerContainerizerTest_ROOT_DOCKER_SkipRecoverMalformedUUID_rjDyqa:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name mesos-s1.malformedUUID alpine -c 
> sleep 1000
> [23:09:24] :   [Step 10/10] I1028 23:09:24.401227 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.700460 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:24] :   [Step 10/10] I1028 23:09:24.804401 31453 docker.cpp:785] 
> Recovering Docker containers
> [23:09:24] :   [Step 10/10] I1028 23:09:24.804477 31453 docker.cpp:1091] 
> Running docker -H unix:///var/run/docker.sock ps -a
> [23:09:24] :   [Step 10/10] I1028 23:09:24.905027 31454 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:25] :   [Step 10/10] W1028 23:09:25.008965 31454 docker.cpp:838] 
> Skipping recovery of executor '' of framework '' because its latest run could 
> not be recovered
> [23:09:25] :   [Step 10/10] I1028 23:09:25.008996 31454 docker.cpp:957] 
> Checking if Docker container named '/mesos-s1.malformedUUID' was started by 
> Mesos
> [23:09:25] :   [Step 10/10] I1028 23:09:25.009019 31454 docker.cpp:967] 
> Checking if Mesos container with ID 'malformedUUID' has been orphaned
> [23:09:25] :   [Step 10/10] I1028 23:09:25.009052 31454 docker.cpp:860] 
> Running docker -H unix:///var/run/docker.sock stop -t 0 
> 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747
> [23:09:25] :   [Step 10/10] I1028 23:09:25.109345 31451 docker.cpp:933] 
> Running docker -H unix:///var/run/docker.sock rm -v 
> 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747
> [23:09:25] :   [Step 10/10] I1028 23:09:25.212870 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:25] :   [Step 10/10] I1028 23:09:25.513255 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:25] :   [Step 10/10] I1028 23:09:25.815946 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:26] :   [Step 10/10] I1028 23:09:26.119107 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:26] :   [Step 10/10] I1028 23:09:26.421722 31435 docker.cpp:972] 
> Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID
> [23:09:26] :   [Step 10/10] I1028 23:09:26.724777 31435 

[jira] [Updated] (MESOS-6497) Java Scheduler Adapter does not surface MasterInfo.

2016-10-28 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6497:
--
  Sprint: Mesosphere Sprint 46
Story Points: 2

> Java Scheduler Adapter does not surface MasterInfo.
> ---
>
> Key: MESOS-6497
> URL: https://issues.apache.org/jira/browse/MESOS-6497
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Joris Van Remoortere
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere, v1_api
> Fix For: 1.1.0
>
>
> The HTTP adapter does not surface the {{MasterInfo}}. This makes it 
> incompatible with the V0 API, where the {{registered}} and {{reregistered}} 
> callbacks provided the {{MasterInfo}} to the framework.
> cc [~vinodkone]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6500) SlaveRecoveryTest/0.ReconnectHTTPExecutor is flaky

2016-10-28 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6500:
-

 Summary: SlaveRecoveryTest/0.ReconnectHTTPExecutor is flaky
 Key: MESOS-6500
 URL: https://issues.apache.org/jira/browse/MESOS-6500
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar


Showed up on ReviewBot. Unfortunately, the ReviewBot cleaned up the logs.

It seems we are leaving orphan processes behind upon test suite completion, 
which leads to this test failing.
{code}
../../src/tests/environment.cpp:825: Failure
Failed
Tests completed with child processes remaining:
-+- 29429 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-tests 
 \-+- 5970 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-containerizer launch 
--command={"arguments":["mesos-executor","--launcher_dir=\/mesos\/mesos-1.2.0\/_build\/src"],"shell":false,"value":"\/mesos\/mesos-1.2.0\/_build\/src\/mesos-executor"}
 
--environment={"LIBPROCESS_PORT":"0","MESOS_AGENT_ENDPOINT":"172.17.0.2:52560","MESOS_CHECKPOINT":"1","MESOS_DIRECTORY":"\/tmp\/SlaveRecoveryTest_0_ReconnectHTTPExecutor_kyPmzZ\/slaves\/6ace31e5-eac7-41f8-a938-64d648610484-S0\/frameworks\/6ace31e5-eac7-41f8-a938-64d648610484-\/executors\/e4f3e7e4-1acf-46d6-9768-259be617a17a\/runs\/6517ed10-859f-41d1-b5b4-75dc5c0c2a23","MESOS_EXECUTOR_ID":"e4f3e7e4-1acf-46d6-9768-259be617a17a","MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD":"5secs","MESOS_FRAMEWORK_ID":"6ace31e5-eac7-41f8-a938-64d648610484-","M
 
ESOS_HTTP_COMMAND_EXECUTOR":"1","MESOS_RECOVERY_TIMEOUT":"15mins","MESOS_SANDBOX":"\/tmp\/SlaveRecoveryTest_0_ReconnectHTTPExecutor_kyPmzZ\/slaves\/6ace31e5-eac7-41f8-a938-64d648610484-S0\/frameworks\/6ace31e5-eac7-41f8-a938-64d648610484-\/executors\/e4f3e7e4-1acf-46d6-9768-259be617a17a\/runs\/6517ed10-859f-41d1-b5b4-75dc5c0c2a23","MESOS_SLAVE_ID":"6ace31e5-eac7-41f8-a938-64d648610484-S0","MESOS_SLAVE_PID":"agent@172.17.0.2:52560","MESOS_SUBSCRIPTION_BACKOFF_MAX":"2secs","PATH":"\/usr\/local\/sbin:\/usr\/local\/bin:\/usr\/sbin:\/usr\/bin:\/sbin:\/bin"}
 --help=false --pipe_read=72 --pipe_write=77 --pre_exec_commands=[] 
--runtime_directory=/tmp/SlaveRecoveryTest_0_ReconnectHTTPExecutor_FIHcEr/containers/6517ed10-859f-41d1-b5b4-75dc5c0c2a23
 --unshare_namespace_mnt=false --user=mesos 
--working_directory=/tmp/SlaveRecoveryTest_0_ReconnectHTTPExecutor_
 
kyPmzZ/slaves/6ace31e5-eac7-41f8-a938-64d648610484-S0/frameworks/6ace31e5-eac7-41f8-a938-64d648610484-/executors/e4f3e7e4-1acf-46d6-9768-259be617a17a/runs/6517ed10-859f-41d1-b5b4-75dc5c0c2a23
 
   \-+- 5984 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-executor 
--launcher_dir=/mesos/mesos-1.2.0/_build/src 
 \-+- 6015 sh -c sleep 1000 
   \--- 6029 sleep 1000 
[==] 1369 tests from 155 test cases ran. (465777 ms total)
[  PASSED  ] 1368 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SlaveRecoveryTest/0.ReconnectHTTPExecutor, where TypeParam = 
mesos::internal::slave::MesosContainerizer
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6500) SlaveRecoveryTest/0.ReconnectHTTPExecutor is flaky

2016-10-28 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616236#comment-15616236
 ] 

Anand Mazumdar commented on MESOS-6500:
---

cc: [~jieyu] [~gilbert]

> SlaveRecoveryTest/0.ReconnectHTTPExecutor is flaky
> --
>
> Key: MESOS-6500
> URL: https://issues.apache.org/jira/browse/MESOS-6500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>  Labels: flaky, flaky-test
>
> Showed up on ReviewBot. Unfortunately, the ReviewBot cleaned up the logs.
> It seems we are leaving orphan processes behind upon test suite completion, 
> which leads to this test failing.
> {code}
> ../../src/tests/environment.cpp:825: Failure
> Failed
> Tests completed with child processes remaining:
> -+- 29429 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-tests 
>  \-+- 5970 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-containerizer launch 
> --command={"arguments":["mesos-executor","--launcher_dir=\/mesos\/mesos-1.2.0\/_build\/src"],"shell":false,"value":"\/mesos\/mesos-1.2.0\/_build\/src\/mesos-executor"}
>  
> --environment={"LIBPROCESS_PORT":"0","MESOS_AGENT_ENDPOINT":"172.17.0.2:52560","MESOS_CHECKPOINT":"1","MESOS_DIRECTORY":"\/tmp\/SlaveRecoveryTest_0_ReconnectHTTPExecutor_kyPmzZ\/slaves\/6ace31e5-eac7-41f8-a938-64d648610484-S0\/frameworks\/6ace31e5-eac7-41f8-a938-64d648610484-\/executors\/e4f3e7e4-1acf-46d6-9768-259be617a17a\/runs\/6517ed10-859f-41d1-b5b4-75dc5c0c2a23","MESOS_EXECUTOR_ID":"e4f3e7e4-1acf-46d6-9768-259be617a17a","MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD":"5secs","MESOS_FRAMEWORK_ID":"6ace31e5-eac7-41f8-a938-64d648610484-","M
>  
> ESOS_HTTP_COMMAND_EXECUTOR":"1","MESOS_RECOVERY_TIMEOUT":"15mins","MESOS_SANDBOX":"\/tmp\/SlaveRecoveryTest_0_ReconnectHTTPExecutor_kyPmzZ\/slaves\/6ace31e5-eac7-41f8-a938-64d648610484-S0\/frameworks\/6ace31e5-eac7-41f8-a938-64d648610484-\/executors\/e4f3e7e4-1acf-46d6-9768-259be617a17a\/runs\/6517ed10-859f-41d1-b5b4-75dc5c0c2a23","MESOS_SLAVE_ID":"6ace31e5-eac7-41f8-a938-64d648610484-S0","MESOS_SLAVE_PID":"agent@172.17.0.2:52560","MESOS_SUBSCRIPTION_BACKOFF_MAX":"2secs","PATH":"\/usr\/local\/sbin:\/usr\/local\/bin:\/usr\/sbin:\/usr\/bin:\/sbin:\/bin"}
>  --help=false --pipe_read=72 --pipe_write=77 --pre_exec_commands=[] 
> --runtime_directory=/tmp/SlaveRecoveryTest_0_ReconnectHTTPExecutor_FIHcEr/containers/6517ed10-859f-41d1-b5b4-75dc5c0c2a23
>  --unshare_namespace_mnt=false --user=mesos 
> --working_directory=/tmp/SlaveRecoveryTest_0_ReconnectHTTPExecutor_
>  
> kyPmzZ/slaves/6ace31e5-eac7-41f8-a938-64d648610484-S0/frameworks/6ace31e5-eac7-41f8-a938-64d648610484-/executors/e4f3e7e4-1acf-46d6-9768-259be617a17a/runs/6517ed10-859f-41d1-b5b4-75dc5c0c2a23
>  
>\-+- 5984 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-executor 
> --launcher_dir=/mesos/mesos-1.2.0/_build/src 
>  \-+- 6015 sh -c sleep 1000 
>\--- 6029 sleep 1000 
> [==] 1369 tests from 155 test cases ran. (465777 ms total)
> [  PASSED  ] 1368 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] SlaveRecoveryTest/0.ReconnectHTTPExecutor, where TypeParam = 
> mesos::internal::slave::MesosContainerizer
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6497) Java Scheduler Adapter does not surface MasterInfo.

2016-10-28 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6497:
--
Summary: Java Scheduler Adapter does not surface MasterInfo.  (was: HTTP 
Adapter does not surface MasterInfo.)

> Java Scheduler Adapter does not surface MasterInfo.
> ---
>
> Key: MESOS-6497
> URL: https://issues.apache.org/jira/browse/MESOS-6497
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Joris Van Remoortere
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere, v1_api
> Fix For: 1.1.0
>
>
> The HTTP adapter does not surface the {{MasterInfo}}. This makes it 
> incompatible with the V0 API, where the {{registered}} and {{reregistered}} 
> callbacks provided the {{MasterInfo}} to the framework.
> cc [~vinodkone]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-10-28 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15615520#comment-15615520
 ] 

Anand Mazumdar commented on MESOS-6202:
---

Nopes, we can close this issue.

> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 Docker containers on my CoreOS system whose names start with 
> _'mesos-'_: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem, but when I start the third 
> one _('mesos-agent')_, all three containers are killed by the Docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_, everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code, which is marked to be removed after a deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6497) HTTP Adapter does not surface MasterInfo.

2016-10-27 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613287#comment-15613287
 ] 

Anand Mazumdar edited comment on MESOS-6497 at 10/27/16 9:30 PM:
-

We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} 
event, thereby providing the schedulers with this information. Another option 
was adding it to the {{connected}} callback on the scheduler library, but we 
punted on that because in the future schedulers might want to use their own 
detection library, which might not read the contents from Master ZK to 
populate {{MasterInfo}} correctly.


was (Author: anandmazumdar):
We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} event 
thereby providing the schedulers with this information. Another option was 
adding it to the {{connected}} callback on the scheduler library but we punted 
on it because in the future schedulers might want to use their own detection 
library that might not read contents from Master ZK. 

> HTTP Adapter does not surface MasterInfo.
> -
>
> Key: MESOS-6497
> URL: https://issues.apache.org/jira/browse/MESOS-6497
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Joris Van Remoortere
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere, v1_api
>
> The HTTP adapter does not surface the {{MasterInfo}}. This makes it 
> incompatible with the V0 API, where the {{registered}} and {{reregistered}} 
> callbacks provided the {{MasterInfo}} to the framework.
> cc [~vinodkone]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6497) HTTP Adapter does not surface MasterInfo.

2016-10-27 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613287#comment-15613287
 ] 

Anand Mazumdar commented on MESOS-6497:
---

We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} 
event, thereby providing the schedulers with this information. Another option 
was adding it to the {{connected}} callback on the scheduler library, but we 
punted on that because in the future schedulers might want to use their own 
detection library, which might not read contents from Master ZK. 
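A sketch of how a v1 scheduler would consume this, assuming the field lands as an optional {{master_info}} on {{Event.Subscribed}} (treat the exact field name as tentative):

{code}
#include <iostream>

#include <mesos/v1/scheduler.hpp>

using mesos::v1::scheduler::Event;

// Read MasterInfo off the SUBSCRIBED event instead of the v0-style
// registered()/reregistered() callbacks.
void received(const Event& event)
{
  if (event.type() == Event::SUBSCRIBED &&
      event.subscribed().has_master_info()) {
    std::cout << "Subscribed to master "
              << event.subscribed().master_info().id() << std::endl;
  }
}
{code}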

> HTTP Adapter does not surface MasterInfo.
> -
>
> Key: MESOS-6497
> URL: https://issues.apache.org/jira/browse/MESOS-6497
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Joris Van Remoortere
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere, v1_api
>
> The HTTP adapter does not surface the {{MasterInfo}}. This makes it 
> incompatible with the V0 API, where the {{registered}} and {{reregistered}} 
> callbacks provided the {{MasterInfo}} to the framework.
> cc [~vinodkone]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6497) HTTP Adapter does not surface MasterInfo.

2016-10-27 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6497:
--
   Shepherd: Vinod Kone
Description: 
The HTTP adapter does not surface the {{MasterInfo}}. This makes it 
incompatible with the V0 API, where the {{registered}} and {{reregistered}} 
callbacks provided the {{MasterInfo}} to the framework.
cc [~vinodkone]

  was:
The HTTP adapter does not surface the MasterInfo. This makes it not compatible 
with the V0 API where the {{registered}} and {{reregistered}} calls provided 
the MasterInfo to the framework.
cc [~vinodkone]

Summary: HTTP Adapter does not surface MasterInfo.  (was: HTTP Adapter 
does not surface MasterInfo)

> HTTP Adapter does not surface MasterInfo.
> -
>
> Key: MESOS-6497
> URL: https://issues.apache.org/jira/browse/MESOS-6497
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Joris Van Remoortere
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere, v1_api
>
> The HTTP adapter does not surface the {{MasterInfo}}. This makes it 
> incompatible with the v0 API, where the {{registered}} and {{reregistered}} 
> calls provided the MasterInfo to the framework.
> cc [~vinodkone]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6212) Validate the name format of mesos-managed docker containers

2016-10-27 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612536#comment-15612536
 ] 

Anand Mazumdar commented on MESOS-6212:
---

Keeping the JIRA open till I complete the backport to 1.0.2.

> Validate the name format of mesos-managed docker containers
> ---
>
> Key: MESOS-6212
> URL: https://issues.apache.org/jira/browse/MESOS-6212
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.1
>Reporter: Marc Villacorta
>Assignee: Manuwela Kanade
> Fix For: 1.1.0
>
>
> Validate the name format of mesos-managed docker containers in order to avoid 
> false positives when looking for orphaned mesos tasks.
> Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ 
> are wrongly terminated when {{--docker_kill_orphans}} is set to true 
> (default).
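
A minimal sketch of the kind of check described above, assuming the 
containerizer generates names of the form {{mesos-<uuid>}} ({{isMesosManaged}} 
is an illustrative name; the exact naming scheme is per the docker 
containerizer):

{code}
#include <string>

#include <stout/try.hpp>
#include <stout/uuid.hpp>

// Sketch: treat a container as Mesos-managed only if the part after
// the "mesos-" prefix parses as a UUID, so containers named
// "mesos-master", "mesos-agent" or "mesos-dns" are left alone.
bool isMesosManaged(const std::string& name)
{
  const std::string prefix = "mesos-";

  if (name.compare(0, prefix.size(), prefix) != 0) {
    return false;
  }

  // `UUID::fromString` returns an error for anything that was not
  // generated from a UUID.
  return !UUID::fromString(name.substr(prefix.size())).isError();
}
{code}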



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6212) Validate the name format of mesos-managed docker containers

2016-10-27 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6212:
--
Target Version/s: 1.0.2, 1.1.0  (was: 1.0.2)
   Fix Version/s: (was: 1.0.2)
  1.1.0

> Validate the name format of mesos-managed docker containers
> ---
>
> Key: MESOS-6212
> URL: https://issues.apache.org/jira/browse/MESOS-6212
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.1
>Reporter: Marc Villacorta
>Assignee: Manuwela Kanade
> Fix For: 1.1.0
>
>
> Validate the name format of mesos-managed docker containers in order to avoid 
> false positives when looking for orphaned mesos tasks.
> Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ 
> are wrongly terminated when {{--docker_kill_orphans}} is set to true 
> (default).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6458) Add test to check fromString function of stout library

2016-10-27 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6458:
--
Target Version/s:   (was: 1.0.2)
   Fix Version/s: 1.1.0

> Add test to check fromString function of stout library
> --
>
> Key: MESOS-6458
> URL: https://issues.apache.org/jira/browse/MESOS-6458
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Affects Versions: 1.0.1
>Reporter: Manuwela Kanade
>Assignee: Manuwela Kanade
>Priority: Trivial
> Fix For: 1.1.0
>
>
> For the 3rdparty stout library, there is a testcase for checking Malformed 
> UUID. 
> But this testcase does not have a positive test for the fromString function 
> to test if it returns correct UUID when passed a correctly formatted UUID 
> string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6212) Validate the name format of mesos-managed docker containers

2016-10-27 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6212:
--
Shepherd: Timothy Chen  (was: Anand Mazumdar)

> Validate the name format of mesos-managed docker containers
> ---
>
> Key: MESOS-6212
> URL: https://issues.apache.org/jira/browse/MESOS-6212
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.1
>Reporter: Marc Villacorta
>Assignee: Manuwela Kanade
> Fix For: 1.0.2
>
>
> Validate the name format of mesos-managed docker containers in order to avoid 
> false positives when looking for orphaned mesos tasks.
> Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ 
> are wrongly terminated when {{--docker_kill_orphans}} is set to true 
> (default).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5103) Enhance mesos-health-check to send v1:: TaskHealthStatus message

2016-10-24 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602306#comment-15602306
 ] 

Anand Mazumdar commented on MESOS-5103:
---

Looks like we missed closing this issue when we made the command executor 
unversioned. For more context see: 
https://github.com/apache/mesos/commit/00709d0dbb71d61242b902d3d324fa2dd5f12adc

> Enhance mesos-health-check to send v1:: TaskHealthStatus message
> 
>
> Key: MESOS-5103
> URL: https://issues.apache.org/jira/browse/MESOS-5103
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Assignee: haosdent
>
> The existing {{mesos-health-check}} 
> (https://github.com/apache/mesos/blob/master/src/health-check/main.cpp) can 
> only send the unversioned {{TaskHealthStatus}} message. However, with the new 
> executor HTTP library, there will be executors based on the v1 HTTP executor 
> API, and all the protobuf messages they use should be v1 as well 
> (e.g., {{v1::TaskHealthStatus}}).
> So we may either modify the existing {{mesos-health-check}} binary to send 
> {{v1::TaskHealthStatus}} messages in addition to the unversioned ones, or 
> create a new binary for versioned health checks.
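
If we go the first route, one possible shape is the serialize/re-parse trick 
used elsewhere for evolving protobufs, sketched below under the assumption that 
the v1 message mirrors the unversioned one on the wire:

{code}
#include <string>

#include <glog/logging.h>

// Sketch: evolve an unversioned protobuf message into its v1
// counterpart by round-tripping through the wire format. Only valid
// when the two message definitions are wire-compatible.
template <typename T, typename M>
T evolve(const M& message)
{
  T t;

  std::string data;
  CHECK(message.SerializeToString(&data));
  CHECK(t.ParseFromString(data));

  return t;
}

// Usage (assuming a wire-compatible v1::TaskHealthStatus exists):
//   v1::TaskHealthStatus status = evolve<v1::TaskHealthStatus>(unversioned);
{code}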



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6212) Validate the name format of mesos-managed docker containers

2016-10-22 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6212:
--
Shepherd: Anand Mazumdar

> Validate the name format of mesos-managed docker containers
> ---
>
> Key: MESOS-6212
> URL: https://issues.apache.org/jira/browse/MESOS-6212
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.1
>Reporter: Marc Villacorta
>Assignee: Manuwela Kanade
>
> Validate the name format of mesos-managed docker containers in order to avoid 
> false positives when looking for orphaned mesos tasks.
> Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ 
> are wrongly terminated when {{--docker_kill_orphans}} is set to true 
> (default).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6212) Validate the name format of mesos-managed docker containers

2016-10-22 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597997#comment-15597997
 ] 

Anand Mazumdar commented on MESOS-6212:
---

Would be happy to.

> Validate the name format of mesos-managed docker containers
> ---
>
> Key: MESOS-6212
> URL: https://issues.apache.org/jira/browse/MESOS-6212
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.1
>Reporter: Marc Villacorta
>Assignee: Manuwela Kanade
>
> Validate the name format of mesos-managed docker containers in order to avoid 
> false positives when looking for orphaned mesos tasks.
> Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ 
> are wrongly terminated when {{--docker_kill_orphans}} is set to true 
> (default).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6407) Move DEFAULT_v1_xxx macros to the v1 namespace.

2016-10-17 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6407:
-

 Summary: Move DEFAULT_v1_xxx macros to the v1 namespace.
 Key: MESOS-6407
 URL: https://issues.apache.org/jira/browse/MESOS-6407
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar
Assignee: Joris Van Remoortere


We should clean up the existing {{DEFAULT_v1_*}} macros and bring them under the 
{{v1}} namespace, e.g., {{v1::DEFAULT_FRAMEWORK_INFO}}. This is a prerequisite 
for a larger cleanup: we would like to introduce {{createXXX}} helpers for the 
{{v1}} API rather than eventually having to add {{createV1XXX}} functions.
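
Since the preprocessor ignores namespaces, the natural shape for this cleanup is 
inline helpers rather than macros. A sketch of the direction (the helper name 
and namespace nesting are illustrative):

{code}
#include <mesos/v1/mesos.hpp>

namespace mesos {
namespace internal {
namespace tests {
namespace v1 {

// Sketch: a namespaced replacement for the DEFAULT_v1_FRAMEWORK_INFO
// macro, letting v0 and v1 test helpers share names.
inline mesos::v1::FrameworkInfo defaultFrameworkInfo()
{
  mesos::v1::FrameworkInfo framework;
  framework.set_user("test-user");
  framework.set_name("default");
  return framework;
}

} // namespace v1 {
} // namespace tests {
} // namespace internal {
} // namespace mesos {
{code}

Call sites would then read {{tests::v1::defaultFrameworkInfo()}} instead of 
expanding a macro.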



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6405) Benchmark call ingestion path on the Mesos master.

2016-10-17 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-6405:
-

Assignee: Anand Mazumdar

> Benchmark call ingestion path on the Mesos master.
> --
>
> Key: MESOS-6405
> URL: https://issues.apache.org/jira/browse/MESOS-6405
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> [~drexin] reported on the user mailing 
> [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E]
>  that there seems to be a significant regression in performance on the call 
> ingestion path on the Mesos master relative to the scheduler driver (v0 API). 
> We should create a benchmark to first get a sense of the numbers and then go 
> about fixing the performance issues. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6405) Benchmark call ingestion path on the Mesos master.

2016-10-17 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6405:
--
Shepherd: Vinod Kone
  Sprint: Mesosphere Sprint 45
Story Points: 3
  Labels: mesosphere  (was: )

> Benchmark call ingestion path on the Mesos master.
> --
>
> Key: MESOS-6405
> URL: https://issues.apache.org/jira/browse/MESOS-6405
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> [~drexin] reported on the user mailing 
> [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E]
>  that there seems to be a significant regression in performance on the call 
> ingestion path on the Mesos master relative to the scheduler driver (v0 API). 
> We should create a benchmark to first get a sense of the numbers and then go 
> about fixing the performance issues. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6405) Benchmark call ingestion path on the Mesos master.

2016-10-17 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6405:
--
Priority: Critical  (was: Major)

> Benchmark call ingestion path on the Mesos master.
> --
>
> Key: MESOS-6405
> URL: https://issues.apache.org/jira/browse/MESOS-6405
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Priority: Critical
>
> [~drexin] reported on the user mailing 
> [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E]
>  that there seems to be a significant regression in performance on the call 
> ingestion path on the Mesos master relative to the scheduler driver (v0 API). 
> We should create a benchmark to first get a sense of the numbers and then go 
> about fixing the performance issues. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6405) Benchmark call ingestion path on the Mesos master.

2016-10-17 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6405:
-

 Summary: Benchmark call ingestion path on the Mesos master.
 Key: MESOS-6405
 URL: https://issues.apache.org/jira/browse/MESOS-6405
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


[~drexin] reported on the user mailing 
[list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E]
 that there seems to be a significant regression in performance on the call 
ingestion path on the Mesos master relative to the scheduler driver (v0 API). 

We should create a benchmark to first get a sense of the numbers and then go 
about fixing the performance issues. 
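
A minimal sketch of the measurement loop such a benchmark needs; the delivery of 
one call (e.g., an HTTP POST of a serialized {{Call}}, or a driver message) is 
stubbed out behind the {{ingest}} placeholder:

{code}
#include <chrono>
#include <functional>
#include <iostream>

// Sketch: time how long N scheduler calls take to ingest. `ingest`
// stands in for whatever delivers one call to the master.
void benchmark(const std::function<void()>& ingest, int iterations)
{
  const auto start = std::chrono::steady_clock::now();

  for (int i = 0; i < iterations; i++) {
    ingest();
  }

  const std::chrono::duration<double> elapsed =
    std::chrono::steady_clock::now() - start;

  std::cout << iterations << " calls in " << elapsed.count() << "s ("
            << iterations / elapsed.count() << " calls/s)" << std::endl;
}
{code}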



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5222) Add benchmark for writing events on the persistent connection.

2016-10-16 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5222:
--
Description: It would be good to add a benchmark that measures writing events 
on the persistent connection used by HTTP frameworks, as compared with 
driver-based frameworks. The benchmark can be as simple as streaming generated 
reconciliation status update events over the persistent connection between the 
master and the scheduler.  (was: It would be good to add a benchmark for scale 
testing the HTTP frameworks wrt driver based frameworks. The benchmark can be 
as simple as trying to launch N tasks (parameterized) with the old/new API. We 
can then focus on fixing performance issues that we find as a result of this 
exercise.)
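
A minimal sketch of what the write side of such a benchmark could time, using 
libprocess's {{http::Pipe}} (the primitive behind the persistent connection); 
the payload here is a stand-in for a serialized status update event:

{code}
#include <string>

#include <process/http.hpp>

#include <stout/duration.hpp>
#include <stout/stopwatch.hpp>

// Sketch: time writing N generated events into an in-memory
// http::Pipe. A real benchmark would wire this up to an actual
// master/scheduler connection.
Duration benchmarkEventWrites(size_t events)
{
  process::http::Pipe pipe;
  process::http::Pipe::Writer writer = pipe.writer();

  const std::string event = "<serialized reconciliation status update>";

  Stopwatch watch;
  watch.start();

  for (size_t i = 0; i < events; i++) {
    writer.write(event); // Queues one chunk on the connection.
  }

  watch.stop();
  return watch.elapsed();
}
{code}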

> Add benchmark for writing events on the persistent connection.
> --
>
> Key: MESOS-5222
> URL: https://issues.apache.org/jira/browse/MESOS-5222
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> It would be good to add a benchmark that measures writing events on the 
> persistent connection used by HTTP frameworks, as compared with driver-based 
> frameworks. The benchmark can be as simple as streaming generated 
> reconciliation status update events over the persistent connection between 
> the master and the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5222) Add benchmark for writing events on the persistent connection.

2016-10-16 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5222:
--
Summary: Add benchmark for writing events on the persistent connection.  
(was: Create a benchmark for scale testing HTTP frameworks)

> Add benchmark for writing events on the persistent connection.
> --
>
> Key: MESOS-5222
> URL: https://issues.apache.org/jira/browse/MESOS-5222
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> It would be good to add a benchmark for scale testing the HTTP frameworks wrt 
> driver based frameworks. The benchmark can be as simple as trying to launch N 
> tasks (parameterized) with the old/new API. We can then focus on fixing 
> performance issues that we find as a result of this exercise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6373) Add the ability to not accept any connections from a client.

2016-10-11 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6373:
-

 Summary: Add the ability to not accept any connections from a 
client.
 Key: MESOS-6373
 URL: https://issues.apache.org/jira/browse/MESOS-6373
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


Similar to the old {{DROP_MESSAGES}} abstraction that lets us drop all messages 
from a client, we need a way to drop all incoming connection requests from a 
client. When running our tests, we initialize the libprocess instance once, so 
when a client notices a disconnection via the {{Connection}} abstraction, it 
can immediately reconnect, because the test's libprocess instance is still 
running.

Use Cases: 
- When using the mock executor, we can never test that the recovery timeout 
expired, because the executor reconnects to the agent immediately upon a 
disconnection; the libprocess instance running the tests is always alive. See 
the hypothetical sketch below.
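
A hypothetical usage sketch, by analogy with the existing message-dropping test 
primitives ({{DROP_CONNECTIONS}} does not exist yet; its name and signature here 
are illustrative only):

{code}
// Today: drop all protobuf messages flowing from the executor to the
// agent in a test.
DROP_MESSAGES(_, executorPid, slavePid);

// Proposed (hypothetical): also refuse any new connection attempt from
// the executor, so that a disconnection observed via the `Connection`
// abstraction is not immediately healed by a reconnect to the test's
// own libprocess instance.
DROP_CONNECTIONS(executorPid);
{code}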



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6370) The executor library does not invoke the shutdown callback upon recovery timeout.

2016-10-11 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6370:
-

 Summary: The executor library does not invoke the shutdown 
callback upon recovery timeout.
 Key: MESOS-6370
 URL: https://issues.apache.org/jira/browse/MESOS-6370
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar


The executor library does not invoke the {{shutdown}} callback for checkpointed 
frameworks upon recovery timeout before committing suicide. This is 
inconsistent with the executor driver, which does invoke {{shutdown}} upon 
recovery timeout.
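
A sketch of the intended behavior, as a free-standing function with illustrative 
names (the real fix lives inside the library's recovery-timeout path):

{code}
#include <cstdlib>
#include <functional>

#include <glog/logging.h>

// Sketch: upon recovery timeout for a checkpointed framework, invoke
// the executor's shutdown callback first, matching the executor
// driver, and only then commit suicide.
void onRecoveryTimeout(const std::function<void()>& shutdown)
{
  LOG(INFO) << "Recovery timeout elapsed without re-establishing the"
            << " agent connection; invoking shutdown";

  shutdown();

  exit(EXIT_FAILURE);
}
{code}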



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6363) Default executor should not crash with a failed assertion if it notices a disconnection from the agent for non checkpointed frameworks.

2016-10-11 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6363:
--
Target Version/s: 1.1.0

> Default executor should not crash with a failed assertion if it notices a 
> disconnection from the agent for non checkpointed frameworks.
> ---
>
> Key: MESOS-6363
> URL: https://issues.apache.org/jira/browse/MESOS-6363
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> If the executor library detects a disconnection for non-checkpointed 
> frameworks, it injects a {{SHUTDOWN}} event. For checkpointed frameworks, it 
> injects the {{SHUTDOWN}} event after the recovery timeout. In both these 
> cases, the default executor would die with a failed assertion in the 
> {{shutdown()}} handler:
> {code}
> CHECK_EQ(SUBSCRIBED, state);
> {code}
> The executor should commit suicide in both these cases with a successful 
> status code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6368) HTTP API v1 testing abstraction improvements

2016-10-11 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6368:
-

 Summary: HTTP API v1 testing abstraction improvements
 Key: MESOS-6368
 URL: https://issues.apache.org/jira/browse/MESOS-6368
 Project: Mesos
  Issue Type: Epic
Reporter: Anand Mazumdar


This epic covers all improvements needed to the existing testing infrastructure 
for the v1 HTTP APIs (Scheduler/Executor/Operator). Some of the existing 
libprocess testing primitives for the old driver-based schedulers/executors 
cannot be used for the new APIs, since the communication does not happen via 
traditional libprocess message passing. Also, there might be some test helpers 
that exist for the old (v0) protobufs that we should consider introducing for 
the v1 {{Call/Event}} protobufs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6363) Default executor should not crash with a failed assertion if it notices a disconnection from the agent for non checkpointed frameworks.

2016-10-11 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6363:
-

 Summary: Default executor should not crash with a failed assertion 
if it notices a disconnection from the agent for non checkpointed frameworks.
 Key: MESOS-6363
 URL: https://issues.apache.org/jira/browse/MESOS-6363
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar


If the executor library detects a disconnection for non-checkpointed 
frameworks, it injects a {{SHUTDOWN}} event. For checkpointed frameworks, it 
injects the {{SHUTDOWN}} event after the recovery timeout. In both these cases, 
the default executor would die with a failed assertion in the {{shutdown()}} 
handler:
{code}
CHECK_EQ(SUBSCRIBED, state);
{code}

The executor should commit suicide in both these cases with a successful status 
code.
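
A sketch of the suggested fix (state handling simplified; the real handler also 
has to kill any containers of the task group before exiting):

{code}
#include <cstdlib>

#include <glog/logging.h>

enum State { CONNECTED, SUBSCRIBED, DISCONNECTED };

// Sketch: tolerate not being SUBSCRIBED instead of asserting.
void shutdown(State state)
{
  if (state != SUBSCRIBED) {
    LOG(INFO) << "Shutting down while not subscribed";
  }

  // ... kill active containers, if any ...

  // Commit suicide with a successful status code in either case.
  exit(EXIT_SUCCESS);
}
{code}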



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`

2016-10-11 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6177:
--
Target Version/s:   (was: 1.1.0)

> Return unregistered agents recovered from registrar in `GetAgents` and/or 
> `/state.json`
> ---
>
> Key: MESOS-6177
> URL: https://issues.apache.org/jira/browse/MESOS-6177
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> Use case:
> This can be used by any software that talks to the Mesos master to better 
> understand the state of an unregistered agent after a master failover.
> If this information is available, the use case in MESOS-6174 can be handled 
> with a simpler decision of whether the corresponding agent has been removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6356) ASF CI has interleaved logging.

2016-10-10 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564509#comment-15564509
 ] 

Anand Mazumdar commented on MESOS-6356:
---

We need to find a way to un-interleave the logs across *different* test 
executions. Also, how did this work before? I had never seen this until 
recently on the ASF CI.

> ASF CI has interleaved logging.
> ---
>
> Key: MESOS-6356
> URL: https://issues.apache.org/jira/browse/MESOS-6356
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: flaky, flaky-test, mesosphere, test
> Attachments: consoleText.zip
>
>
> It seems that the build output for test runs on the ASF CI has interleaved 
> logging, making it very hard to debug test failures. This looks to have 
> started happening after the unified cgroups isolator patches went in, but we 
> have yet to find a correlation.
> An example ASF CI run with interleaved logs:
> https://builds.apache.org/job/Mesos/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-6)/2762/changes
> (Also attached this to the ticket)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6356) ASF CI has interleaved logging.

2016-10-10 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6356:
--
Attachment: consoleText.zip

> ASF CI has interleaved logging.
> ---
>
> Key: MESOS-6356
> URL: https://issues.apache.org/jira/browse/MESOS-6356
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: flaky, flaky-test, mesosphere, test
> Attachments: consoleText.zip
>
>
> It seems that the build output for test runs on the ASF CI has interleaved 
> logging, making it very hard to debug test failures. This looks to have 
> started happening after the unified cgroups isolator patches went in, but we 
> have yet to find a correlation.
> An example ASF CI run with interleaved logs:
> https://builds.apache.org/job/Mesos/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-6)/2762/changes
> (Also attached this to the ticket)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

