[jira] [Commented] (MESOS-6848) The default executor does not exit if a single task pod fails.
[ https://issues.apache.org/jira/browse/MESOS-6848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816312#comment-15816312 ] Anand Mazumdar commented on MESOS-6848: --- Aha, thanks for the tip. > The default executor does not exit if a single task pod fails. > -- > > Key: MESOS-6848 > URL: https://issues.apache.org/jira/browse/MESOS-6848 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Blocker > Fix For: 1.2.0 > > > If a task group has a single task and it exits with a non-zero exit code, the > default executor does not commit suicide. > This happens because we invoke {{shutdown()}} in > {{waited()}} when we notice the termination of a single container here: > https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L666 > but then we return early here after executing all the kill calls: > https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L751 > However, when there is just one task in the task group, {{__shutdown}} is never > invoked, so the executor never commits suicide. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
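The control flow described in MESOS-6848 can be sketched as follows. This is a hypothetical model, not the actual default_executor.cpp code: `waited()` stands in for the termination callback, `shutdown()` for the kill loop that returned early, and `__shutdown()` for the executor's suicide path. The fix amounts to committing suicide immediately once no containers remain, rather than waiting for further terminations that never arrive in a single-task group.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch (names and structure are illustrative only).
struct ExecutorSketch
{
  std::vector<int> active;     // ids of still-running containers
  bool shuttingDown = false;
  bool suicided = false;       // whether __shutdown() has run

  // Invoked when a container terminates.
  void waited(int container)
  {
    active.erase(
        std::remove(active.begin(), active.end(), container), active.end());

    if (shuttingDown) {
      if (active.empty()) {
        __shutdown();
      }
      return;
    }

    // A task in the group terminated: initiate shutdown of the rest.
    shutdown();
  }

  void shutdown()
  {
    shuttingDown = true;

    // The fix: if no containers remain (single-task group), commit
    // suicide right away instead of returning early and relying on
    // waited() callbacks that will never fire.
    if (active.empty()) {
      __shutdown();
      return;
    }

    // Otherwise, kill() each remaining container; __shutdown() then
    // happens in waited() once the last of them terminates.
  }

  void __shutdown() { suicided = true; }
};
```

With a single-task group, `waited()` on the lone container now reaches `__shutdown()` directly; with multiple tasks, suicide still happens only after the last termination is observed.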
[jira] [Commented] (MESOS-6848) The default executor does not exit if a single task pod fails.
[ https://issues.apache.org/jira/browse/MESOS-6848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816257#comment-15816257 ] Anand Mazumdar commented on MESOS-6848: --- Keeping the issue open to backport to the 1.1.x branch. {noformat} commit 3efcd33f440c7e56c137bfb7cd953ee35e4b3aa5 Author: Anand Mazumdar Date: Tue Jan 10 13:08:03 2017 -0800 Fixed a bug in the default executor around not committing suicide. This bug is only observed when the task group contains a single task. The default executor was not committing suicide when this single task used to exit with a non-zero status code as per the default restart policy. Review: https://reviews.apache.org/r/55157/ {noformat} > The default executor does not exit if a single task pod fails. > -- > > Key: MESOS-6848 > URL: https://issues.apache.org/jira/browse/MESOS-6848 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Blocker > > If a task group has a single task and it exits with a non-zero exit code, the > default executor does not commit suicide. > This happens because we invoke {{shutdown()}} in > {{waited()}} when we notice the termination of a single container here: > https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L666 > but then we return early here after executing all the kill calls: > https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L751 > However, when there is just one task in the task group, {{__shutdown}} is never > invoked, so the executor never commits suicide. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6082) Add scheduler Call and Event based metrics to the master.
[ https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6082: -- Shepherd: Anand Mazumdar > Add scheduler Call and Event based metrics to the master. > - > > Key: MESOS-6082 > URL: https://issues.apache.org/jira/browse/MESOS-6082 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Benjamin Mahler >Assignee: Abhishek Dasgupta >Priority: Critical > > Currently, the master only has metrics for the old-style messages and these > are re-used for calls unfortunately: > {code} > // Messages from schedulers. > process::metrics::Counter messages_register_framework; > process::metrics::Counter messages_reregister_framework; > process::metrics::Counter messages_unregister_framework; > process::metrics::Counter messages_deactivate_framework; > process::metrics::Counter messages_kill_task; > process::metrics::Counter messages_status_update_acknowledgement; > process::metrics::Counter messages_resource_request; > process::metrics::Counter messages_launch_tasks; > process::metrics::Counter messages_decline_offers; > process::metrics::Counter messages_revive_offers; > process::metrics::Counter messages_suppress_offers; > process::metrics::Counter messages_reconcile_tasks; > process::metrics::Counter messages_framework_to_executor; > {code} > Now that we've introduced the Call/Event based API, we should have metrics > that reflect this. For example: > {code} > { > scheduler/calls: 100 > scheduler/calls/decline: 90, > scheduler/calls/accept: 10, > scheduler/calls/accept/operations/create: 1, > scheduler/calls/accept/operations/destroy: 0, > scheduler/calls/accept/operations/launch: 4, > scheduler/calls/accept/operations/launch_group: 2, > scheduler/calls/accept/operations/reserve: 1, > scheduler/calls/accept/operations/unreserve: 0, > scheduler/calls/kill: 0, > // etc > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
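The hierarchical naming proposed above can be modeled with a simple counter map. This is a standalone sketch only: the real master uses `process::metrics::Counter` from libprocess, whereas here a plain `std::map` stands in for it, and the key scheme (`scheduler/calls/<type>`) follows the example in the ticket.

```cpp
#include <map>
#include <string>

// Hypothetical sketch of per-call-type metrics: every scheduler call
// bumps the aggregate "scheduler/calls" counter plus a counter named
// after the call type, e.g. "scheduler/calls/decline".
class CallMetrics
{
public:
  void record(const std::string& call)
  {
    ++counters_["scheduler/calls"];
    ++counters_["scheduler/calls/" + call];
  }

  long value(const std::string& key) const
  {
    auto it = counters_.find(key);
    return it == counters_.end() ? 0 : it->second;
  }

private:
  std::map<std::string, long> counters_;
};
```

Recording one `accept` and two `decline` calls would yield `scheduler/calls: 3`, `scheduler/calls/decline: 2`, and zero for unseen call types, mirroring the JSON example in the ticket.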
[jira] [Updated] (MESOS-3601) Formalize all headers and metadata for HTTP API Event Stream
[ https://issues.apache.org/jira/browse/MESOS-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-3601: -- Shepherd: Vinod Kone Link to proposal: http://bit.ly/2iovQVe > Formalize all headers and metadata for HTTP API Event Stream > > > Key: MESOS-3601 > URL: https://issues.apache.org/jira/browse/MESOS-3601 > Project: Mesos > Issue Type: Improvement >Affects Versions: 0.24.0 > Environment: Mesos 0.24.0 >Reporter: Ben Whitehead >Assignee: Anand Mazumdar >Priority: Blocker > Labels: api, http, mesosphere, wireprotocol > > From an HTTP standpoint the current set of headers returned when connecting > to the HTTP scheduler API are insufficient. > {code:title=current headers} > HTTP/1.1 200 OK > Transfer-Encoding: chunked > Date: Wed, 30 Sep 2015 21:07:16 GMT > Content-Type: application/json > {code} > Since the response from Mesos is intended to function as a stream, > {{Connection: keep-alive}} should be specified so that the connection can > remain open. > If RecordIO is going to be applied to the messages, the headers should > include the information necessary for a client to be able to detect RecordIO > and set up its response handlers appropriately. > How RecordIO is expressed will come down to the semantics of what is actually > "Returned" as the response from {{POST /api/v1/scheduler}}. > h4. Proposal > One approach would be to leverage HTTP as much as possible, having a client > specify an {{Accept-Encoding}} along with the {{Accept}} header to indicate > that it can handle RecordIO {{Content-Encoding}} of {{Content-Type}} > messages. 
(This approach allows for things like gzip to be woven in fairly > easily in the future) > For this approach I would expect the following: > {code:title=Request} > POST /api/v1/scheduler HTTP/1.1 > Host: localhost:5050 > Accept: application/x-protobuf > Accept-Encoding: recordio > Content-Type: application/x-protobuf > Content-Length: 35 > User-Agent: RxNetty Client > {code} > {code:title=Response} > HTTP/1.1 200 OK > Connection: keep-alive > Transfer-Encoding: chunked > Content-Type: application/x-protobuf > Content-Encoding: recordio > Cache-Control: no-transform > {code} > When Content-Encoding is used, it is recommended to set {{Cache-Control: > no-transform}} to signal to any proxies that no transformation should be > applied to the content encoding [Section 14.11 RFC > 2616|http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
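For reference, the RecordIO framing discussed above prefixes each record with its length in base-10 ASCII followed by a newline (e.g. `5\nhello`), as described in the Mesos v1 HTTP API documentation. A minimal standalone encoder/decoder sketch (not Mesos's own recordio code):

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Frame a single record: "<length>\n<bytes>".
std::string recordioEncode(const std::string& record)
{
  return std::to_string(record.size()) + "\n" + record;
}

// Split a complete stream of framed records back into records.
std::vector<std::string> recordioDecode(const std::string& stream)
{
  std::vector<std::string> records;
  size_t pos = 0;

  while (pos < stream.size()) {
    // The length header runs up to the next newline.
    size_t newline = stream.find('\n', pos);
    if (newline == std::string::npos) {
      throw std::runtime_error("truncated length header");
    }

    size_t length = std::stoul(stream.substr(pos, newline - pos));
    if (newline + 1 + length > stream.size()) {
      throw std::runtime_error("truncated record");
    }

    records.push_back(stream.substr(newline + 1, length));
    pos = newline + 1 + length;
  }

  return records;
}
```

A real client would decode incrementally from the chunked response body rather than from a complete string; the framing logic is the same.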
[jira] [Commented] (MESOS-6884) Add a test to verify that scheduler can launch a TTY container
[ https://issues.apache.org/jira/browse/MESOS-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806520#comment-15806520 ] Anand Mazumdar commented on MESOS-6884: --- I don't think so. A related test {{AttachContainerInput}} launches a TTY container as a nested sub-container. But we don't yet have a test that tries to launch a root-level TTY container using the Scheduler API directly (e.g., launching vim). > Add a test to verify that scheduler can launch a TTY container > -- > > Key: MESOS-6884 > URL: https://issues.apache.org/jira/browse/MESOS-6884 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Anand Mazumdar > > [~anandmazumdar] Is this already done? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6864) Container Exec should be possible with tasks belonging to a task group
[ https://issues.apache.org/jira/browse/MESOS-6864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6864: -- Target Version/s: 1.2.0 Priority: Blocker (was: Major) > Container Exec should be possible with tasks belonging to a task group > -- > > Key: MESOS-6864 > URL: https://issues.apache.org/jira/browse/MESOS-6864 > Project: Mesos > Issue Type: Bug >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman >Priority: Blocker > Labels: debugging, mesosphere > > {{LaunchNestedContainerSession}} currently requires the parent container to > be an Executor > (https://github.com/apache/mesos/blob/f89f28724f5837ff414dc6cc84e1afb63f3306e5/src/slave/http.cpp#L2189-L2211). > This works for command tasks, because the task container id is the same as > the executor container id. > But it won't work for pod tasks whose container id is different from > executor’s container id. > In order to resolve this ticket, we need to allow launching a child container > at an arbitrary level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6865) Remove the constraint of being only able to launch 2 level nested containers on Agent API
Anand Mazumdar created MESOS-6865: - Summary: Remove the constraint of being only able to launch 2 level nested containers on Agent API Key: MESOS-6865 URL: https://issues.apache.org/jira/browse/MESOS-6865 Project: Mesos Issue Type: Improvement Reporter: Anand Mazumdar Priority: Blocker Currently, the Agent API has a constraint that it _only_ allows two levels of nesting. This was done at the time because the containerizer was still being worked on to support arbitrary levels of nesting. Now that the work has been completed, we should remove the constraint in the API handlers on the agent. Note that this constraint also impacts the Debugging API, i.e., a user currently can't attach to a task (child container) of a task group since we explicitly check that the top level container belongs to an executor in the API handler. https://github.com/apache/mesos/blob/f89f28724f5837ff414dc6cc84e1afb63f3306e5/src/slave/http.cpp#L2189 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
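Generalizing the handler from "exactly one level of nesting" to arbitrary depth essentially means walking a container's parent chain to its root. A hypothetical sketch (Mesos's real `ContainerID` is a protobuf with an optional `parent` field; a plain struct stands in for it here):

```cpp
#include <string>

// Illustrative stand-in for the ContainerID proto: `parent == nullptr`
// marks a top-level (executor) container.
struct ContainerID
{
  std::string value;
  const ContainerID* parent = nullptr;
};

// Follow parent links to the root container, whatever the nesting depth,
// instead of assuming the parent is already the executor container.
const ContainerID& rootOf(const ContainerID& id)
{
  const ContainerID* current = &id;
  while (current->parent != nullptr) {
    current = current->parent;
  }
  return *current;
}
```

With this, an API handler can validate that the *root* of an arbitrarily nested container belongs to an executor, rather than rejecting anything nested more than one level deep.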
[jira] [Updated] (MESOS-6859) Document HA behavior during mesos master replacement
[ https://issues.apache.org/jira/browse/MESOS-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6859: -- Labels: newbie (was: ) > Document HA behavior during mesos master replacement > > > Key: MESOS-6859 > URL: https://issues.apache.org/jira/browse/MESOS-6859 > Project: Mesos > Issue Type: Documentation > Components: documentation, master >Reporter: Charles Allen > Labels: newbie > > In a discussion in https://mesos.slack.com/archives/general/p1483637159001494 > the question was brought up when a "new" master is really fully ready. > Specifically, in the case where new masters can spin up faster than masters > can sync their logs, it is unclear from the HA docs at > http://mesos.apache.org/documentation/latest/high-availability/ how to ensure > a freshly spawned master is ready to take over leadership. > There is documentation at > http://mesos.apache.org/documentation/latest/monitoring/ about using > {{registrar/log/recovered}} to gather this kind of information, but such > information is very easy to overlook. > This ask is that the HA docs be amended to include more information about how > to use {{registrar/log/recovered}} properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6848) The default executor does not exit if a single task pod fails.
Anand Mazumdar created MESOS-6848: - Summary: The default executor does not exit if a single task pod fails. Key: MESOS-6848 URL: https://issues.apache.org/jira/browse/MESOS-6848 Project: Mesos Issue Type: Bug Affects Versions: 1.1.0 Reporter: Anand Mazumdar Assignee: Anand Mazumdar Priority: Blocker If a task group has a single task and it exits with a non-zero exit code, the default executor does not commit suicide. This happens because we invoke {{shutdown()}} in {{waited()}} when we notice the termination of a single container here: https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L666 but then we return early here after executing all the kill calls: https://github.com/apache/mesos/blob/master/src/launcher/default_executor.cpp#L751 However, when there is just one task in the task group, {{__shutdown}} is never invoked, so the executor never commits suicide. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`
[ https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6177: -- Target Version/s: 1.2.0 > Return unregistered agents recovered from registrar in `GetAgents` and/or > `/state.json` > --- > > Key: MESOS-6177 > URL: https://issues.apache.org/jira/browse/MESOS-6177 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Zhitao Li >Assignee: Zhitao Li > > Use case: > This can be used for any software which talks to Mesos master to better > understand state of an unregistered agent after a master failover. > If this information is available, the use case in MESOS-6174 can be handled > with a simpler decision of whether the corresponding agent is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
[ https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6784: -- Priority: Major (was: Blocker) > IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky > > > Key: MESOS-6784 > URL: https://issues.apache.org/jira/browse/MESOS-6784 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Neil Conway >Assignee: Anand Mazumdar > Labels: mesosphere > > {noformat} > [ RUN ] IOSwitchboardTest.KillSwitchboardContainerDestroyed > I1212 13:57:02.641043 2211 containerizer.cpp:220] Using isolation: > posix/cpu,filesystem/posix,network/cni > W1212 13:57:02.641438 2211 backend.cpp:76] Failed to create 'overlay' > backend: OverlayBackend requires root privileges, but is running as user nrc > W1212 13:57:02.641559 2211 backend.cpp:76] Failed to create 'bind' backend: > BindBackend requires root privileges > I1212 13:57:02.642822 2268 containerizer.cpp:594] Recovering containerizer > I1212 13:57:02.643975 2253 provisioner.cpp:253] Provisioner recovery complete > I1212 13:57:02.644953 2255 containerizer.cpp:986] Starting container > 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework > I1212 13:57:02.647004 2245 switchboard.cpp:430] Allocated pseudo terminal > '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.652305 2245 switchboard.cpp:596] Created I/O switchboard > server (pid: 2705) listening on socket file > '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for > container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.655513 2267 launcher.cpp:133] Forked child with pid '2706' > for container '09e87380-00ab-4987-83c9-fa1c5d86717f' > I1212 13:57:02.655732 2267 containerizer.cpp:1621] Checkpointing container's > forked pid 2706 to > '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid' > 
I1212 13:57:02.726306 2265 containerizer.cpp:2463] Container > 09e87380-00ab-4987-83c9-fa1c5d86717f has exited > I1212 13:57:02.726352 2265 containerizer.cpp:2100] Destroying container > 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state > E1212 13:57:02.726495 2243 switchboard.cpp:861] Unexpected termination of > I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for > container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.726563 2265 launcher.cpp:149] Asked to destroy container > 09e87380-00ab-4987-83c9-fa1c5d86717f > E1212 13:57:02.783607 2228 switchboard.cpp:799] Failed to remove unix domain > socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' > for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or > directory > ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure > Value of: wait.get()->reasons().size() == 1 > Actual: false > Expected: true > *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are > using GNU date *** > PC: @ 0x1bf16d0 testing::UnitTest::AddTestPartResult() > *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; > stack trace: *** > @ 0x7faecf855100 (unknown) > @ 0x1bf16d0 testing::UnitTest::AddTestPartResult() > @ 0x1be6247 testing::internal::AssertHelper::operator=() > @ 0x19ed751 > mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody() > @ 0x1c0ed8c > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x1c09e74 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x1beb505 testing::Test::Run() > @ 0x1bebc88 testing::TestInfo::Run() > @ 0x1bec2ce testing::TestCase::Run() > @ 0x1bf2ba8 testing::internal::UnitTestImpl::RunAllTests() > @ 0x1c0f9b1 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x1c0a9f2 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x1bf18ee testing::UnitTest::Run() > @ 0x11bc9e3 RUN_ALL_TESTS() > 
@ 0x11bc599 main > @ 0x7faece663b15 __libc_start_main > @ 0xa9c219 (unknown) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6597) Include v1 Operator API protos in generated JAR and python packages.
[ https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783388#comment-15783388 ] Anand Mazumdar commented on MESOS-6597: --- Backported to 1.1.x {noformat} commit 2342bc3357ada490701cb1f40c5a27c30bc0ba1b Author: Anand Mazumdar Date: Wed Dec 28 10:00:05 2016 -0800 Added MESOS-6597 to CHANGELOG for 1.1.1. commit 010c62dc81e3a8861f363db517cc064d190af85e Author: Vijay Srinivasaraghavan Date: Mon Dec 5 08:58:05 2016 -0800 Enabled python proto generation for v1 Master/Agent API. The corresponding master/agent protos are now included in the generated Mesos pypi package. Review: https://reviews.apache.org/r/54015/ commit 63f31e0cf17e62ff5479d421112a8c80efa954c1 Author: Vijay Srinivasaraghavan Date: Mon Dec 5 08:58:00 2016 -0800 Enabled java protos generation for v1 Master/Agent API. The corresponding master/agent protos are now included in the generated Mesos JAR. Review: https://reviews.apache.org/r/53825/ commit 1e24a39925474a1a9a3113909f4f2496448f5469 Author: Vijay Srinivasaraghavan Date: Mon Dec 5 08:57:55 2016 -0800 Fixed missing protobuf java package/classname definition. Review: https://reviews.apache.org/r/54014/ {noformat} > Include v1 Operator API protos in generated JAR and python packages. > > > Key: MESOS-6597 > URL: https://issues.apache.org/jira/browse/MESOS-6597 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Vijay Srinivasaraghavan >Assignee: Vijay Srinivasaraghavan >Priority: Blocker > Fix For: 1.1.1, 1.2.0 > > > For V1 API support, the build file that generates the Java protos wrapper > currently includes only executor and scheduler. > (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334) > To support the operator HTTP API, we also need to generate Java protos for > additional proto definitions like quota, maintenance, etc. These Java > definition files will be used by a standard REST client when using the > HTTP API directly. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6597) Include v1 Operator API protos in generated JAR and python packages.
[ https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6597: -- Summary: Include v1 Operator API protos in generated JAR and python packages. (was: Include missing Mesos Java classes for Protobuf files to support Operator HTTP V1 API) > Include v1 Operator API protos in generated JAR and python packages. > > > Key: MESOS-6597 > URL: https://issues.apache.org/jira/browse/MESOS-6597 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Vijay Srinivasaraghavan >Assignee: Vijay Srinivasaraghavan >Priority: Blocker > > For V1 API support, the build file that generates the Java protos wrapper > currently includes only executor and scheduler. > (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334) > To support the operator HTTP API, we also need to generate Java protos for > additional proto definitions like quota, maintenance, etc. These Java > definition files will be used by a standard REST client when using the > HTTP API directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6825) Increase default allocation_interval for tests
[ https://issues.apache.org/jira/browse/MESOS-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767510#comment-15767510 ] Anand Mazumdar edited comment on MESOS-6825 at 12/21/16 4:54 PM: - > This means that if the host running the tests is slow, a test case might > receive a resource offer that it doesn't receive when running on a faster > host. Did you mean that if the "task" in question in the test reaches a terminal state? This shouldn't happen otherwise if the "used" resources on the agent did not change (task is active) was (Author: anandmazumdar): > This means that if the host running the tests is slow, a test case might > receive a resource offer that it doesn't receive when running on a faster > host. Did you mean that if the "task" in question in the test finishes? This shouldn't happen otherwise if the "used" resources on the agent did not change (task is active) > Increase default allocation_interval for tests > -- > > Key: MESOS-6825 > URL: https://issues.apache.org/jira/browse/MESOS-6825 > Project: Mesos > Issue Type: Improvement > Components: tests >Reporter: Neil Conway > Labels: mesosphere > > The default {{allocation_interval}} is 1 second. This means that if the host > running the tests is slow, a test case might receive a resource offer that it > doesn't receive when running on a faster host. We could workaround this by > explicitly using {{WillRepeatedly(Return())}}, but that is a bit kludgy and > obscures the intent of the test. > One way to avoid this would be to pause the clock by default in tests > (MESOS-4101). That would be a quite involved change, however. > Instead, we could consider raising the default {{allocation_interval}} to a > large value, such as 1 minute or longer. This would significantly reduce the > chance of host performance causing test flakiness. 
Moreover, it would help to > highlight tests that rely on real-time batch allocations for correctness: > generally tests should avoid doing this (because it causes test slowness). > They should instead pause the clock and then advance it by > {{allocation_interval}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6825) Increase default allocation_interval for tests
[ https://issues.apache.org/jira/browse/MESOS-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767510#comment-15767510 ] Anand Mazumdar commented on MESOS-6825: --- > This means that if the host running the tests is slow, a test case might > receive a resource offer that it doesn't receive when running on a faster > host. Did you mean that if the "task" in question in the test finishes? This shouldn't happen otherwise if the "used" resources on the agent did not change (task is active) > Increase default allocation_interval for tests > -- > > Key: MESOS-6825 > URL: https://issues.apache.org/jira/browse/MESOS-6825 > Project: Mesos > Issue Type: Improvement > Components: tests >Reporter: Neil Conway > Labels: mesosphere > > The default {{allocation_interval}} is 1 second. This means that if the host > running the tests is slow, a test case might receive a resource offer that it > doesn't receive when running on a faster host. We could workaround this by > explicitly using {{WillRepeatedly(Return())}}, but that is a bit kludgy and > obscures the intent of the test. > One way to avoid this would be to pause the clock by default in tests > (MESOS-4101). That would be a quite involved change, however. > Instead, we could consider raising the default {{allocation_interval}} to a > large value, such as 1 minute or longer. This would significantly reduce the > chance of host performance causing test flakiness. Moreover, it would help to > highlight tests that rely on real-time batch allocations for correctness: > generally tests should avoid doing this (because it causes test slowness). > They should instead pause the clock and then advance it by > {{allocation_interval}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
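The clock-pausing pattern the ticket recommends can be illustrated with a toy model. This is a hypothetical sketch, not libprocess's `Clock` API: a fake clock is advanced explicitly by the allocation interval, so batch allocations fire deterministically regardless of host speed.

```cpp
// Toy fake clock: time only moves when the test advances it.
struct FakeClock
{
  long long nowMs = 0;
  void advance(long long ms) { nowMs += ms; }
};

// Toy periodic allocator: runs one batch allocation per elapsed interval.
struct AllocatorSketch
{
  long long intervalMs;
  long long lastRunMs = 0;
  int allocations = 0;

  void tick(const FakeClock& clock)
  {
    while (clock.nowMs - lastRunMs >= intervalMs) {
      lastRunMs += intervalMs;
      ++allocations;
    }
  }
};
```

Under a paused clock, no allocation happens until the test advances time by `intervalMs`, so a slow host can never trigger an extra, unexpected offer; this is the determinism the ticket argues for over relying on the real-time 1-second default.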
[jira] [Updated] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6823: -- Labels: flaky flaky-test newbie (was: flaky flaky-test) > bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 > is flaky > -- > > Key: MESOS-6823 > URL: https://issues.apache.org/jira/browse/MESOS-6823 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 12/14 both with/without SSL >Reporter: Anand Mazumdar > Labels: flaky, flaky-test, newbie > > This showed up on our internal CI > {code} > [23:13:01] : [Step 11/11] [ RUN ] > bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 > [23:13:01] : [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] > Creating default 'local' authorizer > [23:13:01] : [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] > Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) > started on 172.16.10.213:45407 > [23:13:01] : [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" > --zk_session_timeout="10secs" > [23:13:01] : [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] > Master only allowing authenticated frameworks to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] > Master only allowing authenticated agents to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] > Master only allowing authenticated HTTP frameworks to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] > Loading credentials for authentication from > '/mnt/teamcity/temp/buildTmp/ev3icd/credentials' > [23:13:01] : [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using > default 'crammd5' authenticator > [23:13:01] : [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [23:13:01] : [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' > [23:13:01] : [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' > [23:13:01] : [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] > Authorization enabled > [23:13:01] : [Step 11/11] I1219 23:13:01.654551 25733 > whitelist_watcher.cpp:77] No whitelist given > [23:13:01] : [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] > Initialized hierarchical allocator process > [23:13:01] : [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] > Elected as the leading master! 
> [23:13:01] : [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] > Recovering from registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] > Recovering registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] > Successfully fetched the registry (0B) in 210944ns > [23:13:01] : [Step 11/11] I1219 23:13:01.655385 25731 registrar.cpp:461] > Applied 1 operations in 5006ns; attempting to update the registry > [23:13:01] : [Step 11/11] I1219 23:13:01.655593 25732 registrar.cpp:506] > Successfully updated the registry in 194048ns > [23:13:01] : [Step 11/11] I1219 23:13:01.655658 25732 registrar.cpp:392] > Successfully recovered registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655799 25732 master.cpp:1684] >
[jira] [Commented] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764797#comment-15764797 ] Anand Mazumdar commented on MESOS-6823: --- cc: [~kaysoky] [~sivaramsk] > bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 > is flaky > -- > > Key: MESOS-6823 > URL: https://issues.apache.org/jira/browse/MESOS-6823 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 12/14 both with/without SSL >Reporter: Anand Mazumdar > Labels: flaky, flaky-test > > This showed up on our internal CI > {code} > [23:13:01] : [Step 11/11] [ RUN ] > bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 > [23:13:01] : [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] > Creating default 'local' authorizer > [23:13:01] : [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] > Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) > started on 172.16.10.213:45407 > [23:13:01] : [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" > --zk_session_timeout="10secs" > [23:13:01] : [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] > Master only allowing authenticated frameworks to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] > Master only allowing authenticated agents to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] > Master only allowing authenticated HTTP frameworks to register > [23:13:01] : [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] > Loading credentials for authentication from > '/mnt/teamcity/temp/buildTmp/ev3icd/credentials' > [23:13:01] : [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using > default 'crammd5' authenticator > [23:13:01] : [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [23:13:01] : [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' > [23:13:01] : [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using > default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' > [23:13:01] : [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] > Authorization enabled > [23:13:01] : [Step 11/11] I1219 23:13:01.654551 25733 > whitelist_watcher.cpp:77] No whitelist given > [23:13:01] : [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] > Initialized hierarchical allocator process > [23:13:01] : [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] > Elected as the leading master! 
> [23:13:01] : [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] > Recovering from registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] > Recovering registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] > Successfully fetched the registry (0B) in 210944ns > [23:13:01] : [Step 11/11] I1219 23:13:01.655385 25731 registrar.cpp:461] > Applied 1 operations in 5006ns; attempting to update the registry > [23:13:01] : [Step 11/11] I1219 23:13:01.655593 25732 registrar.cpp:506] > Successfully updated the registry in 194048ns > [23:13:01] : [Step 11/11] I1219 23:13:01.655658 25732 registrar.cpp:392] > Successfully recovered registrar > [23:13:01] : [Step 11/11] I1219 23:13:01.655799 25732 master.cpp:1684] >
[jira] [Created] (MESOS-6823) bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky
Anand Mazumdar created MESOS-6823: - Summary: bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 is flaky Key: MESOS-6823 URL: https://issues.apache.org/jira/browse/MESOS-6823 Project: Mesos Issue Type: Bug Environment: Ubuntu 12/14 both with/without SSL Reporter: Anand Mazumdar This showed up on our internal CI {code} [23:13:01] : [Step 11/11] [ RUN ] bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0 [23:13:01] : [Step 11/11] I1219 23:13:01.653230 25712 cluster.cpp:160] Creating default 'local' authorizer [23:13:01] : [Step 11/11] I1219 23:13:01.654103 25732 master.cpp:380] Master c590a129-814c-4903-9681-e16da4da4c94 (ip-172-16-10-213.mesosphere.io) started on 172.16.10.213:45407 [23:13:01] : [Step 11/11] I1219 23:13:01.654119 25732 master.cpp:382] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/mnt/teamcity/temp/buildTmp/ev3icd/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/mnt/teamcity/temp/buildTmp/ev3icd/master" --zk_session_timeout="10secs" [23:13:01] : [Step 11/11] I1219 23:13:01.654248 25732 master.cpp:432] Master only allowing authenticated frameworks to register [23:13:01] : [Step 11/11] I1219 23:13:01.654254 25732 master.cpp:446] Master only allowing authenticated agents to register [23:13:01] : [Step 11/11] I1219 23:13:01.654258 25732 master.cpp:459] Master only allowing authenticated HTTP frameworks to register [23:13:01] : [Step 11/11] I1219 23:13:01.654261 25732 credentials.hpp:37] Loading credentials for authentication from '/mnt/teamcity/temp/buildTmp/ev3icd/credentials' [23:13:01] : [Step 11/11] I1219 23:13:01.654343 25732 master.cpp:504] Using default 'crammd5' authenticator [23:13:01] : [Step 11/11] I1219 23:13:01.654386 25732 http.cpp:922] Using default 'basic' HTTP authenticator for realm 'mesos-master-readonly' [23:13:01] : [Step 11/11] I1219 23:13:01.654429 25732 http.cpp:922] Using default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' [23:13:01] : [Step 11/11] I1219 23:13:01.654458 25732 http.cpp:922] Using default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' [23:13:01] : [Step 11/11] I1219 23:13:01.654477 25732 master.cpp:584] Authorization enabled [23:13:01] : [Step 11/11] I1219 23:13:01.654551 25733 whitelist_watcher.cpp:77] No whitelist given [23:13:01] : [Step 11/11] I1219 23:13:01.654582 25730 hierarchical.cpp:149] Initialized hierarchical allocator process [23:13:01] : [Step 11/11] I1219 23:13:01.655076 25732 master.cpp:2046] Elected as the leading master! 
[23:13:01] : [Step 11/11] I1219 23:13:01.655086 25732 master.cpp:1568] Recovering from registrar [23:13:01] : [Step 11/11] I1219 23:13:01.655124 25729 registrar.cpp:329] Recovering registrar [23:13:01] : [Step 11/11] I1219 23:13:01.655354 25731 registrar.cpp:362] Successfully fetched the registry (0B) in 210944ns [23:13:01] : [Step 11/11] I1219 23:13:01.655385 25731 registrar.cpp:461] Applied 1 operations in 5006ns; attempting to update the registry [23:13:01] : [Step 11/11] I1219 23:13:01.655593 25732 registrar.cpp:506] Successfully updated the registry in 194048ns [23:13:01] : [Step 11/11] I1219 23:13:01.655658 25732 registrar.cpp:392] Successfully recovered registrar [23:13:01] : [Step 11/11] I1219 23:13:01.655799 25732 master.cpp:1684] Recovered 0 agents from the registry (174B); allowing 10mins for agents to re-register [23:13:01] : [Step 11/11] I1219 23:13:01.655840 25728 hierarchical.cpp:176] Skipping recovery of hierarchical allocator: nothing to recover [23:13:01] : [Step 11/11] I1219 23:13:01.656813 25712 containerizer.cpp:220] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni [23:13:01] : [Step
[jira] [Commented] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
[ https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755184#comment-15755184 ] Anand Mazumdar commented on MESOS-6784: --- Committed a fix for the second log snippet that Jie posted around the test bug. {noformat} commit 28eaa8df7c95130b0c244f7613ad506be899cafd Author: Anand MazumdarDate: Wed Dec 14 17:40:47 2016 -0800 Fixed the 'IOSwitchboardTest.KillSwitchboardContainerDestroyed' test. The container was launched with TTY enabled. This meant that killing the switchboard would trigger the task to terminate on its own owing to the "master" end of the TTY dying. This would make it not go through the code path of the isolator failing due to resource limit issue. Review: https://reviews.apache.org/r/54770 {noformat} The original log in the issue description is a separate issue in the switchboard code itself and I am working on that. This should make the CI green for now. > IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky > > > Key: MESOS-6784 > URL: https://issues.apache.org/jira/browse/MESOS-6784 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Neil Conway >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > {noformat} > [ RUN ] IOSwitchboardTest.KillSwitchboardContainerDestroyed > I1212 13:57:02.641043 2211 containerizer.cpp:220] Using isolation: > posix/cpu,filesystem/posix,network/cni > W1212 13:57:02.641438 2211 backend.cpp:76] Failed to create 'overlay' > backend: OverlayBackend requires root privileges, but is running as user nrc > W1212 13:57:02.641559 2211 backend.cpp:76] Failed to create 'bind' backend: > BindBackend requires root privileges > I1212 13:57:02.642822 2268 containerizer.cpp:594] Recovering containerizer > I1212 13:57:02.643975 2253 provisioner.cpp:253] Provisioner recovery complete > I1212 13:57:02.644953 2255 containerizer.cpp:986] Starting container > 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework > 
I1212 13:57:02.647004 2245 switchboard.cpp:430] Allocated pseudo terminal > '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.652305 2245 switchboard.cpp:596] Created I/O switchboard > server (pid: 2705) listening on socket file > '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for > container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.655513 2267 launcher.cpp:133] Forked child with pid '2706' > for container '09e87380-00ab-4987-83c9-fa1c5d86717f' > I1212 13:57:02.655732 2267 containerizer.cpp:1621] Checkpointing container's > forked pid 2706 to > '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid' > I1212 13:57:02.726306 2265 containerizer.cpp:2463] Container > 09e87380-00ab-4987-83c9-fa1c5d86717f has exited > I1212 13:57:02.726352 2265 containerizer.cpp:2100] Destroying container > 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state > E1212 13:57:02.726495 2243 switchboard.cpp:861] Unexpected termination of > I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for > container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.726563 2265 launcher.cpp:149] Asked to destroy container > 09e87380-00ab-4987-83c9-fa1c5d86717f > E1212 13:57:02.783607 2228 switchboard.cpp:799] Failed to remove unix domain > socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' > for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or > directory > ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure > Value of: wait.get()->reasons().size() == 1 > Actual: false > Expected: true > *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are > using GNU date *** > PC: @ 0x1bf16d0 testing::UnitTest::AddTestPartResult() > *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; > stack trace: *** > @ 0x7faecf855100 
(unknown) > @ 0x1bf16d0 testing::UnitTest::AddTestPartResult() > @ 0x1be6247 testing::internal::AssertHelper::operator=() > @ 0x19ed751 > mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody() > @ 0x1c0ed8c > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x1c09e74 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x1beb505 testing::Test::Run() > @ 0x1bebc88 testing::TestInfo::Run() > @ 0x1bec2ce testing::TestCase::Run() > @ 0x1bf2ba8
[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor
[ https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752229#comment-15752229 ] Anand Mazumdar commented on MESOS-6801: --- {{process::loop}} would use the {{pid}} passed to it as the async execution context i.e., it would implicitly do a {{defer}} to the actor. https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/loop.hpp#L74 > IOSwitchboard::connect installs continuations capturing this without properly > deferring/dispatching to an actor > --- > > Key: MESOS-6801 > URL: https://issues.apache.org/jira/browse/MESOS-6801 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Bannier > Labels: newbie > > In the body of {{IOSwitchboard::connect}} lambdas capturing {{this}} are > created and used as callbacks without properly deferring to a libprocess > actor. > {noformat} > /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: > callback capturing this should be dispatched/deferred to a specific PID > [mesos-this-capture] > [=](const Nothing&) { > ^ > /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: > callback capturing this should be dispatched/deferred to a specific PID > [mesos-this-capture] > [=](const Result& record) -> Future { > ^ > {noformat} > Patterns like this can create use-after-free scenarios or introduce data > races which can often be avoided by installing the callbacks via > {{defer}}/{{dispatch}} on some process' actor. > This code should be revisited to remove existing data races. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
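The implicit-{{defer}} behaviour described in the comment above can be pictured with a self-contained toy (the {{ToyActor}} type and its methods are invented for illustration and are not libprocess APIs): work handed to an actor is queued onto that actor's own run queue and executed serially there, rather than running inline on whatever context produced it.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy model (not libprocess code) of deferring work to an actor: instead of
// running a continuation inline on whatever context completed it, the
// continuation is queued onto the actor's own run queue and executed serially
// by that actor, so the actor's state is never touched concurrently.
struct ToyActor {
  int state = 0;
  std::deque<std::function<void()>> queue;

  // Analogous in spirit to defer(pid, ...): package the work for later
  // execution on this actor's context rather than invoking it immediately.
  void defer(std::function<void()> f) { queue.push_back(std::move(f)); }

  // Drain the queue; in a real actor runtime this happens on the actor's
  // dedicated execution context.
  void run() {
    while (!queue.empty()) {
      std::function<void()> f = std::move(queue.front());
      queue.pop_front();
      f();
    }
  }
};
```

In libprocess the same guarantee comes from {{defer}}/{{dispatch}} on a process' actor; per the comment above, {{process::loop}} applies it implicitly to each iteration when given a {{pid}}.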
[jira] [Commented] (MESOS-5795) Add support for Nvidia GPUs in the docker containerizer
[ https://issues.apache.org/jira/browse/MESOS-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751984#comment-15751984 ] Anand Mazumdar commented on MESOS-5795: --- ^^ [~klueska] , > Add support for Nvidia GPUs in the docker containerizer > --- > > Key: MESOS-5795 > URL: https://issues.apache.org/jira/browse/MESOS-5795 > Project: Mesos > Issue Type: Epic > Components: docker, isolation >Reporter: Kevin Klues > Labels: gpu, mesosphere > > In order to support Nvidia GPUs with docker containers in Mesos, we need to > be able to consolidate all Nvidia libraries into a common volume and inject > that volume into the container. This tracks the support in the docker > containerizer. The mesos containerizer support has already been completed in > MESOS-5401. > More info on why this is necessary here: > https://github.com/NVIDIA/nvidia-docker/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751954#comment-15751954 ] Anand Mazumdar commented on MESOS-6799: --- [~greggomann] Can you take a look as to why this started failing when SSL enabled? > Scheme/HTTPTest.Endpoints/0 is flaky > > > Key: MESOS-6799 > URL: https://issues.apache.org/jira/browse/MESOS-6799 > Project: Mesos > Issue Type: Bug > Components: libprocess, test > Environment: Debian 8, gcc-4.9.2, SSL build w/optimizations and debug > symbols >Reporter: Benjamin Bannier > Labels: flaky, flaky-test, ssl > > Saw {{Scheme/HTTPTest.Endpoints/0}} fail in internal CI with > {{812e5e3d4e4d9e044a1cfe6cc7eaab10efb499b6}}, > {noformat} > [03:26:43] : [Step 10/11] [ RUN ] Scheme/HTTPTest.Endpoints/0 > [03:26:43]W: [Step 10/11] I1215 03:26:43.221824 23530 > libevent_ssl_socket.cpp:1141] Socket error: > error::lib(0):func(0):reason(0) > [03:26:43]W: [Step 10/11] I1215 03:26:43.448218 23521 openssl.cpp:419] CA > file path is unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE= > [03:26:43]W: [Step 10/11] I1215 03:26:43.448226 23521 openssl.cpp:424] CA > directory path unspecified! NOTE: Set CA directory path with > LIBPROCESS_SSL_CA_DIR= > [03:26:43]W: [Step 10/11] I1215 03:26:43.448230 23521 openssl.cpp:429] Will > not verify peer certificate! > [03:26:43]W: [Step 10/11] NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable > peer certificate verification > [03:26:43]W: [Step 10/11] I1215 03:26:43.448231 23521 openssl.cpp:435] Will > only verify peer certificate if presented! 
> [03:26:43]W: [Step 10/11] NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to > require peer certificate verification > [03:26:43]W: [Step 10/11] I1215 03:26:43.449292 23521 process.cpp:1237] > libprocess is initialized on 172.16.10.123:58973 with 8 worker threads > [03:26:43]W: [Step 10/11] I1215 03:26:43.452320 23871 process.cpp:3679] > Handling HTTP event for process '(75)' with path: '/(75)/body' > [03:26:43]W: [Step 10/11] I1215 03:26:43.455099 23870 process.cpp:3679] > Handling HTTP event for process '(75)' with path: '/(75)/pipe' > [03:26:43] : [Step 10/11] > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:275: Failure > [03:26:43] : [Step 10/11] (future).failure(): failed to decode body > [03:26:43] : [Step 10/11] [ FAILED ] Scheme/HTTPTest.Endpoints/0, where > GetParam() = "https" (234 ms) > {noformat} > I was not able to trigger this failure again in a couple thousand iterations, > so there might be some relation to load or other processes running in the > system. > We should figure out when this problem first occurred as it might be worthy > to backport a fix (if this isn't just a test error). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6799: -- Priority: Critical (was: Major) > Scheme/HTTPTest.Endpoints/0 is flaky > > > Key: MESOS-6799 > URL: https://issues.apache.org/jira/browse/MESOS-6799 > Project: Mesos > Issue Type: Bug > Components: libprocess, test > Environment: Debian 8, gcc-4.9.2, SSL build w/optimizations and debug > symbols >Reporter: Benjamin Bannier >Priority: Critical > Labels: flaky, flaky-test, ssl > > Saw {{Scheme/HTTPTest.Endpoints/0}} fail in internal CI with > {{812e5e3d4e4d9e044a1cfe6cc7eaab10efb499b6}}, > {noformat} > [03:26:43] : [Step 10/11] [ RUN ] Scheme/HTTPTest.Endpoints/0 > [03:26:43]W: [Step 10/11] I1215 03:26:43.221824 23530 > libevent_ssl_socket.cpp:1141] Socket error: > error::lib(0):func(0):reason(0) > [03:26:43]W: [Step 10/11] I1215 03:26:43.448218 23521 openssl.cpp:419] CA > file path is unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE= > [03:26:43]W: [Step 10/11] I1215 03:26:43.448226 23521 openssl.cpp:424] CA > directory path unspecified! NOTE: Set CA directory path with > LIBPROCESS_SSL_CA_DIR= > [03:26:43]W: [Step 10/11] I1215 03:26:43.448230 23521 openssl.cpp:429] Will > not verify peer certificate! > [03:26:43]W: [Step 10/11] NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable > peer certificate verification > [03:26:43]W: [Step 10/11] I1215 03:26:43.448231 23521 openssl.cpp:435] Will > only verify peer certificate if presented! 
> [03:26:43]W: [Step 10/11] NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to > require peer certificate verification > [03:26:43]W: [Step 10/11] I1215 03:26:43.449292 23521 process.cpp:1237] > libprocess is initialized on 172.16.10.123:58973 with 8 worker threads > [03:26:43]W: [Step 10/11] I1215 03:26:43.452320 23871 process.cpp:3679] > Handling HTTP event for process '(75)' with path: '/(75)/body' > [03:26:43]W: [Step 10/11] I1215 03:26:43.455099 23870 process.cpp:3679] > Handling HTTP event for process '(75)' with path: '/(75)/pipe' > [03:26:43] : [Step 10/11] > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:275: Failure > [03:26:43] : [Step 10/11] (future).failure(): failed to decode body > [03:26:43] : [Step 10/11] [ FAILED ] Scheme/HTTPTest.Endpoints/0, where > GetParam() = "https" (234 ms) > {noformat} > I was not able to trigger this failure again in a couple thousand iterations, > so there might be some relation to load or other processes running in the > system. > We should figure out when this problem first occurred as it might be worthy > to backport a fix (if this isn't just a test error). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor
[ https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751943#comment-15751943 ] Anand Mazumdar commented on MESOS-6801: --- In this case, the {{loop}} abstraction ensures that we are delegating to the correct actor. Can we modify the existing script that checks for such errors to make it aware of how {{process::loop}} works? I am under the impression that even if we capture {{this}} in the lambda, it would still complain due to a missing {{defer}}? > IOSwitchboard::connect installs continuations capturing this without properly > deferring/dispatching to an actor > --- > > Key: MESOS-6801 > URL: https://issues.apache.org/jira/browse/MESOS-6801 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Bannier > Labels: newbie > > In the body of {{IOSwitchboard::connect}} lambdas capturing {{this}} are > created and used as callbacks without properly deferring to a libprocess > actor. > {noformat} > /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: > callback capturing this should be dispatched/deferred to a specific PID > [mesos-this-capture] > [=](const Nothing&) { > ^ > /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: > callback capturing this should be dispatched/deferred to a specific PID > [mesos-this-capture] > [=](const Result& record) -> Future { > ^ > {noformat} > Patterns like this can create use-after-free scenarios or introduce data > races which can often be avoided by installing the callbacks via > {{defer}}/{{dispatch}} on some process' actor. > This code should be revisited to remove existing data races. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor
[ https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6801: -- Labels: newbie (was: ) > IOSwitchboard::connect installs continuations capturing this without properly > deferring/dispatching to an actor > --- > > Key: MESOS-6801 > URL: https://issues.apache.org/jira/browse/MESOS-6801 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Bannier > Labels: newbie > > In the body of {{IOSwitchboard::connect}} lambdas capturing {{this}} are > created and used as callbacks without properly deferring to a libprocess > actor. > {noformat} > /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: > callback capturing this should be dispatched/deferred to a specific PID > [mesos-this-capture] > [=](const Nothing&) { > ^ > /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: > callback capturing this should be dispatched/deferred to a specific PID > [mesos-this-capture] > [=](const Result& record) -> Future { > ^ > {noformat} > Patterns like this can create use-after-free scenarios or introduce data > races which can often be avoided by installing the callbacks via > {{defer}}/{{dispatch}} on some process' actor. > This code should be revisited to remove existing data races. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6788) Avoid stack overflow when handling streaming responses in API handlers
[ https://issues.apache.org/jira/browse/MESOS-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6788: -- Shepherd: (was: Benjamin Hindman) Assignee: Benjamin Hindman > Avoid stack overflow when handling streaming responses in API handlers > -- > > Key: MESOS-6788 > URL: https://issues.apache.org/jira/browse/MESOS-6788 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Assignee: Benjamin Hindman > Fix For: 1.2.0 > > > Right now both the `connect()` helper in src/slave/http.cpp and the `transform()` > helper in src/common/recordio.hpp use recursion to read data from one pipe > and write to another. > The way these helpers are written could cause a stack overflow. Ideally we > should be able to leverage the new `process::loop` abstraction for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
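The recursion concern can be illustrated without libprocess. Below is a hypothetical, synchronous stand-in (`transferIteratively` is an invented name; the real helpers operate on asynchronous pipes): a "read a chunk, write it, recurse" formulation adds one stack frame per chunk and can overflow on large streams, while an explicit loop keeps stack depth constant regardless of input size — which is what `process::loop` provides in the asynchronous setting.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Copy an input stream to a string chunk by chunk using an explicit loop.
// Unlike a recursive "copy one chunk, then recurse" version, stack usage
// here is constant no matter how large the input is.
std::string transferIteratively(std::istream& in) {
  std::ostringstream out;
  char chunk[64];  // small chunk size to force many iterations
  do {
    in.read(chunk, sizeof(chunk));
    out.write(chunk, in.gcount());  // gcount() handles the final short read
  } while (in.gcount() == static_cast<std::streamsize>(sizeof(chunk)));
  return out.str();
}
```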
[jira] [Updated] (MESOS-6082) Add scheduler Call and Event based metrics to the master.
[ https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6082: -- Target Version/s: 1.2.0 > Add scheduler Call and Event based metrics to the master. > - > > Key: MESOS-6082 > URL: https://issues.apache.org/jira/browse/MESOS-6082 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Benjamin Mahler >Assignee: Abhishek Dasgupta > > Currently, the master only has metrics for the old-style messages and these > are re-used for calls unfortunately: > {code} > // Messages from schedulers. > process::metrics::Counter messages_register_framework; > process::metrics::Counter messages_reregister_framework; > process::metrics::Counter messages_unregister_framework; > process::metrics::Counter messages_deactivate_framework; > process::metrics::Counter messages_kill_task; > process::metrics::Counter messages_status_update_acknowledgement; > process::metrics::Counter messages_resource_request; > process::metrics::Counter messages_launch_tasks; > process::metrics::Counter messages_decline_offers; > process::metrics::Counter messages_revive_offers; > process::metrics::Counter messages_suppress_offers; > process::metrics::Counter messages_reconcile_tasks; > process::metrics::Counter messages_framework_to_executor; > {code} > Now that we've introduced the Call/Event based API, we should have metrics > that reflect this. For example: > {code} > { > scheduler/calls: 100 > scheduler/calls/decline: 90, > scheduler/calls/accept: 10, > scheduler/calls/accept/operations/create: 1, > scheduler/calls/accept/operations/destroy: 0, > scheduler/calls/accept/operations/launch: 4, > scheduler/calls/accept/operations/launch_group: 2, > scheduler/calls/accept/operations/reserve: 1, > scheduler/calls/accept/operations/unreserve: 0, > scheduler/calls/kill: 0, > // etc > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
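The proposed layout above can be modeled with hierarchical, path-style counter keys. This is a sketch with a plain map and invented names (`SchedulerMetrics`, `countCall`), not the actual `process::metrics` API: each Call bumps both the aggregate counter and a per-call-type counter.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Sketch: every scheduler Call increments an aggregate counter plus a
// per-call-type counter, using path-style keys mirroring the proposed layout.
struct SchedulerMetrics {
  std::map<std::string, uint64_t> counters;

  void countCall(const std::string& type) {
    ++counters["scheduler/calls"];
    ++counters["scheduler/calls/" + type];
  }
};
```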
[jira] [Updated] (MESOS-6781) Mesos containerizer overrides environment variables passed to the executor incorrectly.
[ https://issues.apache.org/jira/browse/MESOS-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6781: -- Target Version/s: 1.2.0 > Mesos containerizer overrides environment variables passed to the executor > incorrectly. > --- > > Key: MESOS-6781 > URL: https://issues.apache.org/jira/browse/MESOS-6781 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Assignee: Jie Yu >Priority: Blocker > Labels: mesosphere > > Currently, the mesos containerizer appends the default environment variables > of the executor _after_ any environment variables that were overridden by an > isolator. This is problematic, e.g., if the CNI isolator overrides > {{LIBPROCESS_IP}} in an overlay network. The containerizer would give the > executor environment variables more preference, meaning that the container > would end up inheriting the IP address of the agent! > https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L1412-L1421 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6781) Mesos containerizer overrides environment variables passed to the executor incorrectly.
Anand Mazumdar created MESOS-6781: - Summary: Mesos containerizer overrides environment variables passed to the executor incorrectly. Key: MESOS-6781 URL: https://issues.apache.org/jira/browse/MESOS-6781 Project: Mesos Issue Type: Bug Components: containerization Reporter: Anand Mazumdar Assignee: Jie Yu Priority: Blocker Currently, the mesos containerizer appends the default environment variables of the executor _after_ any environment variables that were overridden by an isolator. This is problematic, e.g., if the CNI isolator overrides {{LIBPROCESS_IP}} in an overlay network. The containerizer would give the executor environment variables more preference, meaning that the container would end up inheriting the IP address of the agent! https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L1412-L1421 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
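The ordering bug above reduces to which environment is merged last. Here is a simplified sketch with `std::map` and invented names (`Env`, `mergeEnvironment`); the real containerizer works with protobuf `Environment` messages, but the principle is the same: apply isolator overrides after the executor defaults so the overrides win.

```cpp
#include <cassert>
#include <map>
#include <string>

using Env = std::map<std::string, std::string>;

// Sketch of the desired ordering: start from the executor's default
// environment and apply isolator overrides *last*, so an isolator-provided
// value (e.g. an overlay-network LIBPROCESS_IP from the CNI isolator) wins
// over the agent-inherited default.
Env mergeEnvironment(const Env& executorDefaults, const Env& isolatorOverrides)
{
  Env result = executorDefaults;
  for (const auto& entry : isolatorOverrides) {
    result[entry.first] = entry.second;  // the override replaces the default
  }
  return result;
}
```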
[jira] [Commented] (MESOS-6780) ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably
[ https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15742681#comment-15742681 ] Anand Mazumdar commented on MESOS-6780: --- cc: [~vinodkone], This might be related to the recent changes to the test for supporting TTY. > ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably > --- > > Key: MESOS-6780 > URL: https://issues.apache.org/jira/browse/MESOS-6780 > Project: Mesos > Issue Type: Bug > Environment: Mac OS 10.12, clang version 4.0.0 > (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) > (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), > libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46 >Reporter: Benjamin Bannier > > The test {{ContentType/AgentAPIStreamTest.AttachContainerInput}} (both {{/0}} > and {{/1}}) fail consistently for me in an SSL-enabled, optimized build. > {code} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from ContentType/AgentAPIStreamingTest > [ RUN ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0 > I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' > authorizer > I1212 17:11:12.393844 17362944 master.cpp:380] Master > c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on > 172.18.8.114:51059 > I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" > --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" 
--http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master" > --zk_session_timeout="10secs" > I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing > authenticated frameworks to register > I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing > authenticated agents to register > I1212 17:11:12.394691 17362944 master.cpp:459] Master only allowing > authenticated HTTP frameworks to register > I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for > authentication from > '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials' > I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' > authenticator > I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL > I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled > I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master! 
> I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar > I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the > registry (0B) in 4.131072ms > I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in > 27us; attempting to update the registry > I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the > registry in 4.10496ms > I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered > registrar > I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the > registry (136B); allowing 10mins for agents to re-register > I1212 17:11:12.422780 3971208128 containerizer.cpp:220] Using isolation: > posix/cpu,posix/mem,filesystem/posix > I1212 17:11:12.424154 3971208128 cluster.cpp:446]
[jira] [Updated] (MESOS-6769) The server does not close its end of the connection after returning a response to a streaming request.
[ https://issues.apache.org/jira/browse/MESOS-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6769: -- Description: Consider this scenario:
- The client starts to send a streaming request to the agent with the {{Connection: close}} header set. This means that the client is relying on the server to close its end of the connection after sending the response.
- If the request fails on the server, e.g., due to validation errors, the server sends the response but does not close its end of the socket.
- Some client libraries, e.g., Python Requests, rely on the server to close its end of the socket after sending the response. Otherwise, the connection just hangs on the client when it has no more streaming data to send in such cases.
Libprocess should close its end of the connection after sending the response in such cases.

was: Consider this scenario:
- The client starts to send a streaming request to the agent with the {{Connection: close}} header set. This means that the client is relying on the server to close its end of the connection after sending the response.
- If the request fails on the server, e.g., due to validation errors, the server sends the response but does not close its end of the socket.
- Some client libraries, e.g., Python Requests, rely on the server to close its end of the socket after sending the response. Otherwise, the connection just hangs on the client when it has no more streaming data to send in such cases.
Libprocess should close its end of the

> The server does not close its end of the connection after returning a response to a streaming request.
>
> Key: MESOS-6769
> URL: https://issues.apache.org/jira/browse/MESOS-6769
> Project: Mesos
> Issue Type: Bug
> Reporter: Anand Mazumdar
> Labels: libprocess, mesosphere
>
> Consider this scenario:
> - The client starts to send a streaming request to the agent with the {{Connection: close}} header set. This means that the client is relying on the server to close its end of the connection after sending the response.
> - If the request fails on the server, e.g., due to validation errors, the server sends the response but does not close its end of the socket.
> - Some client libraries, e.g., Python Requests, rely on the server to close its end of the socket after sending the response. Otherwise, the connection just hangs on the client when it has no more streaming data to send in such cases.
> Libprocess should close its end of the connection after sending the response in such cases.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6769) The server does not close its end of the connection after returning a response to a streaming request.
Anand Mazumdar created MESOS-6769: - Summary: The server does not close its end of the connection after returning a response to a streaming request. Key: MESOS-6769 URL: https://issues.apache.org/jira/browse/MESOS-6769 Project: Mesos Issue Type: Bug Reporter: Anand Mazumdar
Consider this scenario:
- The client starts to send a streaming request to the agent with the {{Connection: close}} header set. This means that the client is relying on the server to close its end of the connection after sending the response.
- If the request fails on the server, e.g., due to validation errors, the server sends the response but does not close its end of the socket.
- Some client libraries, e.g., Python Requests, rely on the server to close its end of the socket after sending the response. Otherwise, the connection just hangs on the client when it has no more streaming data to send in such cases.
Libprocess should close its end of the
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
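For illustration, the behavior MESOS-6769 asks libprocess to adopt can be modeled with a plain socket server: after sending the response to a request that carried {{Connection: close}}, the server closes its end so the client sees EOF instead of hanging. This is a hedged Python sketch, not Mesos/libprocess code; the handler and request below are hypothetical.

```python
import socket
import threading

def serve_one_and_close(server):
    """Accept one request, send an error response, then close our end of the
    socket -- without the close(), a client waiting for EOF would hang."""
    conn, _ = server.accept()
    conn.recv(4096)  # read (part of) the streaming request; assume it fails validation
    conn.sendall(b"HTTP/1.1 400 Bad Request\r\n"
                 b"Connection: close\r\n"
                 b"Content-Length: 0\r\n\r\n")
    conn.close()  # server-side close: the client's recv() will observe EOF

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=serve_one_and_close, args=(server,)).start()

client = socket.create_connection(server.getsockname())
client.sendall(b"POST /call HTTP/1.1\r\nConnection: close\r\n\r\n")
response = b""
while True:
    chunk = client.recv(4096)
    if not chunk:       # EOF: the server closed its end, so we do not hang here
        break
    response += chunk
```

A client library like Python Requests performs essentially the `while` loop above; if the server never closes, that loop never terminates.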
[jira] [Created] (MESOS-6760) Make the scheduler heartbeat interval configurable
Anand Mazumdar created MESOS-6760: - Summary: Make the scheduler heartbeat interval configurable Key: MESOS-6760 URL: https://issues.apache.org/jira/browse/MESOS-6760 Project: Mesos Issue Type: Improvement Reporter: Anand Mazumdar
Currently, the heartbeats sent by the master to the scheduler are hard-coded to a default of 15 seconds. We should think about making this value configurable, either as a master flag or by letting the scheduler pick an appropriate value via the {{Subscribe}} call. This might be useful for clusters where the default value is too frequent, or where an even smaller value is wanted (a rarer case).
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
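The two options mentioned above (a master flag vs. a scheduler-supplied value in {{Subscribe}}) could also be combined. The function and field names below are hypothetical, not Mesos APIs; this is only a sketch of one possible negotiation rule.

```python
DEFAULT_HEARTBEAT_INTERVAL_SECS = 15.0  # the value currently hard-coded in the master

def effective_heartbeat_interval(subscribe_request=None,
                                 master_flag=DEFAULT_HEARTBEAT_INTERVAL_SECS,
                                 floor_secs=1.0):
    """Pick the heartbeat interval: a scheduler-supplied value from the
    (hypothetical) Subscribe field wins over the master flag, clamped to a
    floor so a scheduler cannot request pathologically frequent heartbeats."""
    requested = (subscribe_request or {}).get("heartbeat_interval_secs")
    if requested is None:
        return master_flag
    return max(requested, floor_secs)
```

A cluster that finds 15 seconds too chatty would raise the master flag; a latency-sensitive scheduler could ask for a smaller value down to the floor.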
[jira] [Updated] (MESOS-6623) Re-enable tests impacted by request streaming support
[ https://issues.apache.org/jira/browse/MESOS-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6623: -- Sprint: (was: Mesosphere Sprint 47) > Re-enable tests impacted by request streaming support > - > > Key: MESOS-6623 > URL: https://issues.apache.org/jira/browse/MESOS-6623 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > We added support for HTTP request streaming in libprocess as part of > MESOS-6466. However, this broke a few tests that relied on HTTP request > filtering since the handlers no longer have access to the body of the request > when {{visit()}} is invoked. We would need to revisit how we do HTTP request > filtering and then re-enable these tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6752) Add a `post()` overload to libprocess for streaming requests
Anand Mazumdar created MESOS-6752: - Summary: Add a `post()` overload to libprocess for streaming requests Key: MESOS-6752 URL: https://issues.apache.org/jira/browse/MESOS-6752 Project: Mesos Issue Type: Improvement Components: HTTP API, libprocess Reporter: Anand Mazumdar Currently, the {{post}}/{{streaming::post}} overloads in [libprocess | https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/http.hpp] don't work for streaming requests. The {{streaming::post}} overload works only for streaming responses. We should add another overload to handle streaming requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
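For context, a streaming request body is just RFC 7230 chunked transfer encoding on the wire: each chunk is prefixed with its size in hex, and a zero-length chunk terminates the stream. A minimal encoder, purely illustrative (this is not the proposed libprocess overload):

```python
def encode_chunked(chunks):
    """Encode an iterable of byte chunks as an HTTP/1.1 chunked request body,
    which is what a streaming `post()` would put on the wire."""
    body = b""
    for chunk in chunks:
        if chunk:  # a zero-length chunk would prematurely terminate the stream
            body += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    return body + b"0\r\n\r\n"  # last-chunk marker ends the request body
```

The proposed overload would accept something like a pipe reader and emit chunks as they become available, rather than buffering the whole body first.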
[jira] [Updated] (MESOS-6746) IOSwitchboard doesn't properly flush data on ATTACH_CONTAINER_OUTPUT
[ https://issues.apache.org/jira/browse/MESOS-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6746: -- Shepherd: Vinod Kone Sprint: Mesosphere Sprint 47
> IOSwitchboard doesn't properly flush data on ATTACH_CONTAINER_OUTPUT
>
> Key: MESOS-6746
> URL: https://issues.apache.org/jira/browse/MESOS-6746
> Project: Mesos
> Issue Type: Bug
> Reporter: Kevin Klues
> Assignee: Anand Mazumdar
> Labels: debugging, mesosphere
> Fix For: 1.2.0
>
> Currently we close the write end of all connection pipes when we exit the switchboard, but we don't wait until the reader is flushed before exiting. This can cause some data to get dropped, since the process may exit before the reader is flushed. The current code is:
> {noformat}
> void IOSwitchboardServerProcess::finalize()
> {
>   foreach (HttpConnection& connection, outputConnections) {
>     connection.close();
>   }
>
>   if (failure.isSome()) {
>     promise.fail(failure->message);
>   } else {
>     promise.set(Nothing());
>   }
> }
> {noformat}
> We should change it to:
> {noformat}
> void IOSwitchboardServerProcess::finalize()
> {
>   foreach (HttpConnection& connection, outputConnections) {
>     connection.close();
>     connection.closed().await();
>   }
>
>   if (failure.isSome()) {
>     promise.fail(failure->message);
>   } else {
>     promise.set(Nothing());
>   }
> }
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
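The pattern in the proposed fix — close the write end, then block until the reader has drained everything before exiting — can be mimicked outside Mesos. A toy Python model follows; the class name echoes, but is not, the Mesos {{HttpConnection}} type, and {{closed()}} here returns a plain event rather than a libprocess future.

```python
import os
import threading

class ToyConnection:
    """Stand-in for the switchboard's connection: a pipe with a reader thread,
    where closed() exposes when the reader has drained all pending data."""
    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()
        self.received = b""
        self._drained = threading.Event()
        threading.Thread(target=self._drain).start()

    def _drain(self):
        while True:
            chunk = os.read(self.read_fd, 4096)
            if not chunk:            # EOF: the writer closed; everything is flushed
                break
            self.received += chunk
        self._drained.set()

    def write(self, data):
        os.write(self.write_fd, data)

    def close(self):
        os.close(self.write_fd)      # close the write end, as finalize() does

    def closed(self):
        return self._drained         # future-like handle to await

def finalize(connections):
    for connection in connections:
        connection.close()
        connection.closed().wait()   # the fix: block until the reader is flushed

conn = ToyConnection()
conn.write(b"some final output")
finalize([conn])                     # returns only after the reader drained the pipe
```

Without the `wait()`, the process could exit while `b"some final output"` still sat in the pipe buffer — exactly the truncation described in the ticket.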
[jira] [Commented] (MESOS-6744) DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky
[ https://issues.apache.org/jira/browse/MESOS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729231#comment-15729231 ] Anand Mazumdar commented on MESOS-6744: --- From the logs, this looks like a separate issue from what we fixed in MESOS-6576 (around status update reordering). > DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky > --- > > Key: MESOS-6744 > URL: https://issues.apache.org/jira/browse/MESOS-6744 > Project: Mesos > Issue Type: Bug > Environment: Recent Arch Linux VM, amd64. >Reporter: Neil Conway > Labels: mesosphere > > This repros consistently for me (~10 test iterations or fewer). Test log: > {noformat} > [ RUN ] DefaultExecutorTest.KillTaskGroupOnTaskFailure > I1208 03:26:47.461477 28632 cluster.cpp:160] Creating default 'local' > authorizer > I1208 03:26:47.462673 28632 replica.cpp:776] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1208 03:26:47.463248 28650 recover.cpp:451] Starting replica recovery > I1208 03:26:47.463537 28650 recover.cpp:477] Replica is in EMPTY status > I1208 03:26:47.476333 28651 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from __req_res__(64)@10.0.2.15:46643 > I1208 03:26:47.476618 28650 recover.cpp:197] Received a recover response from > a replica in EMPTY status > I1208 03:26:47.477242 28649 recover.cpp:568] Updating replica status to > STARTING > I1208 03:26:47.477496 28649 replica.cpp:320] Persisted replica status to > STARTING > I1208 03:26:47.477607 28649 recover.cpp:477] Replica is in STARTING status > I1208 03:26:47.478910 28653 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from __req_res__(65)@10.0.2.15:46643 > I1208 03:26:47.479385 28651 recover.cpp:197] Received a recover response from > a replica in STARTING status > I1208 03:26:47.479717 28647 recover.cpp:568] Updating replica status to VOTING > I1208 03:26:47.479996 28648 replica.cpp:320] Persisted replica status to > 
VOTING > I1208 03:26:47.480077 28648 recover.cpp:582] Successfully joined the Paxos > group > I1208 03:26:47.763380 28651 master.cpp:380] Master > 0bcb0250-4cf5-4209-92fe-ce260518b50f (archlinux.vagrant.vm) started on > 10.0.2.15:46643 > I1208 03:26:47.763463 28651 master.cpp:382] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/7lpy50/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" > --registry_max_agent_count="102400" --registry_store_timeout="100secs" > --registry_strict="false" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/7lpy50/master" --zk_session_timeout="10secs" > I1208 03:26:47.764010 28651 master.cpp:432] Master only allowing > authenticated frameworks to register > I1208 03:26:47.764070 28651 master.cpp:446] Master only allowing > authenticated agents to register > I1208 03:26:47.764076 28651 master.cpp:459] Master only allowing > authenticated HTTP frameworks to register > I1208 03:26:47.764081 28651 credentials.hpp:37] Loading credentials for > authentication from '/tmp/7lpy50/credentials' > I1208 03:26:47.764482 28651 
master.cpp:504] Using default 'crammd5' > authenticator > I1208 03:26:47.764659 28651 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1208 03:26:47.764981 28651 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1208 03:26:47.765136 28651 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1208 03:26:47.765231 28651 master.cpp:584] Authorization enabled > I1208 03:26:47.768061 28651 master.cpp:2043] Elected as the leading master! > I1208 03:26:47.768097 28651 master.cpp:1566] Recovering from registrar > I1208 03:26:47.768766 28648 log.cpp:553] Attempting to start the writer > I1208 03:26:47.769899 28653 replica.cpp:493] Replica
[jira] [Updated] (MESOS-6646) StreamingRequestDecoder incompletely initializes its http_parser_settings
[ https://issues.apache.org/jira/browse/MESOS-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6646: -- Shepherd: Anand Mazumdar
> StreamingRequestDecoder incompletely initializes its http_parser_settings
>
> Key: MESOS-6646
> URL: https://issues.apache.org/jira/browse/MESOS-6646
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Labels: coverity
> Fix For: 1.2.0
>
> Coverity reports CID 1394703 at {{3rdparty/libprocess/src/decoder.hpp:767}}:
> {code}
> CID 1394703 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
> 2. uninit_member: Non-static class member field settings.on_status is not initialized in this constructor nor in any functions that it calls.
> {code}
> It seems like {{StreamingRequestDecoder}} should properly initialize its member {{settings}}, e.g., with {{http_parser_settings_init}}.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6597) Include missing Mesos Java classes for Protobuf files to support Operator HTTP V1 API
[ https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723128#comment-15723128 ] Anand Mazumdar commented on MESOS-6597: ---
{noformat}
commit 5abda76d697dcc21e64f9037b03c3a15fc434286
Author: Vijay Srinivasaraghavan
Date: Mon Dec 5 08:58:05 2016 -0800

    Enabled python proto generation for v1 Master/Agent API.

    The corresponding master/agent protos are now included in the
    generated Mesos pypi package.

    Review: https://reviews.apache.org/r/54015/

commit e1ae5cf8030821e1527466e84a0dfe1864406926
Author: Vijay Srinivasaraghavan
Date: Mon Dec 5 08:58:00 2016 -0800

    Enabled java protos generation for v1 Master/Agent API.

    The corresponding master/agent protos are now included in the
    generated Mesos JAR.

    Review: https://reviews.apache.org/r/53825/

commit 2786ef6e1b7c91ca68ef4c584d8b4316fe2d6a58
Author: Vijay Srinivasaraghavan
Date: Mon Dec 5 08:57:55 2016 -0800

    Fixed missing protobuf java package/classname definition.

    Review: https://reviews.apache.org/r/54014/
{noformat}
> Include missing Mesos Java classes for Protobuf files to support Operator HTTP V1 API
>
> Key: MESOS-6597
> URL: https://issues.apache.org/jira/browse/MESOS-6597
> Project: Mesos
> Issue Type: Bug
> Components: build
> Reporter: Vijay Srinivasaraghavan
> Assignee: Vijay Srinivasaraghavan
> Priority: Blocker
>
> For V1 API support, the build file that generates the Java proto wrappers currently includes only executor and scheduler. (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334)
> To support the operator HTTP API, we also need to generate Java protos for additional proto definitions like quota, maintenance, etc. These Java definition files will be used by a standard REST client when using the straight HTTP API.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6597) Include missing Mesos Java classes for Protobuf files to support Operator HTTP V1 API
[ https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723131#comment-15723131 ] Anand Mazumdar commented on MESOS-6597: --- Keeping the issue open while I do the back-port to 1.1.x. > Include missing Mesos Java classes for Protobuf files to support Operator > HTTP V1 API > - > > Key: MESOS-6597 > URL: https://issues.apache.org/jira/browse/MESOS-6597 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Vijay Srinivasaraghavan >Assignee: Vijay Srinivasaraghavan >Priority: Blocker > > For V1 API support, the build file that generates Java protos wrapper as of > now includes only executor and scheduler. > (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334) > To support operator HTTP API, we also need to generate java protos for > additional proto definitions like quota, maintenance etc., These java > definition files will be used by a standard Rest client when using the > straight HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6467) Build a Container I/O Switchboard
[ https://issues.apache.org/jira/browse/MESOS-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15715951#comment-15715951 ] Anand Mazumdar commented on MESOS-6467: ---
{noformat}
commit 2a73d956af1cb0615d4e66de126ab554fdabb0b5
Author: Kevin Klues
Date: Fri Dec 2 10:14:45 2016 -0800

    Updated the IOSwitchboard http handler to work with streaming requests.

    Review: https://reviews.apache.org/r/54296/
{noformat}
> Build a Container I/O Switchboard
>
> Key: MESOS-6467
> URL: https://issues.apache.org/jira/browse/MESOS-6467
> Project: Mesos
> Issue Type: Task
> Reporter: Kevin Klues
> Assignee: Kevin Klues
> Labels: debugging, mesosphere
> Fix For: 1.2.0
>
> In order to facilitate attach operations for a running container, we plan to introduce a new component into Mesos known as an “I/O switchboard”. The goal of this switchboard is to allow external components to *dynamically* interpose on the {{stdin}}, {{stdout}} and {{stderr}} of the init process of a running Mesos container. It will be implemented as a per-container, stand-alone process launched by the mesos containerizer at the time a container is first launched.
> Each per-container switchboard will be responsible for the following:
> * Accepting a single dynamic request to register an fd for streaming data to the {{stdin}} of a container’s init process.
> * Accepting *multiple* dynamic requests to register fds for streaming data from the {{stdout}} and {{stderr}} of a container’s init process to those fds.
> * Allocating a pty for the new process (if requested), and directing data through the master fd of the pty as necessary.
> * Passing the *actual* set of file descriptors that should be dup’d onto the {{stdin}}, {{stdout}} and {{stderr}} of a container’s init process back to the containerizer.
> The idea being that the switchboard will maintain three asynchronous loops > (one each for {{stdin}}, {{stdout}} and {{stderr}}) that constantly pipe data > to/from a container’s init process to/from all of the file descriptors that > have been dynamically registered with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
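At its core, each of the asynchronous loops described above is a pump from one file descriptor to every registered sink. The blocking sketch below illustrates the fan-out; the real switchboard loops are libprocess-based and non-blocking, and the function name here is invented for illustration.

```python
import os

def relay(source_fd, sink_fds, chunk_size=4096):
    """Copy everything readable from source_fd to every registered sink fd
    until EOF, then propagate EOF by closing the sinks."""
    while True:
        chunk = os.read(source_fd, chunk_size)
        if not chunk:                # the container's stdout/stderr hit EOF
            break
        for fd in sink_fds:
            os.write(fd, chunk)      # fan out to every attached client
    for fd in sink_fds:
        os.close(fd)

# Pipes stand in for the container's output and two attached clients.
src_r, src_w = os.pipe()
a_r, a_w = os.pipe()
b_r, b_w = os.pipe()
os.write(src_w, b"task output\n")
os.close(src_w)
relay(src_r, [a_w, b_w])
```

The stdin loop runs the same pump in the opposite direction, with a single registered source instead of multiple sinks.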
[jira] [Updated] (MESOS-6662) Some HTTP scheduler calls are missing from the docs
[ https://issues.apache.org/jira/browse/MESOS-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6662: -- Labels: apidocs documentation http newbie scheduler (was: apidocs documentation http scheduler) > Some HTTP scheduler calls are missing from the docs > --- > > Key: MESOS-6662 > URL: https://issues.apache.org/jira/browse/MESOS-6662 > Project: Mesos > Issue Type: Bug > Components: documentation >Reporter: Greg Mann > Labels: apidocs, documentation, http, newbie, scheduler > > Some of the calls available to HTTP schedulers are missing from the HTTP > scheduler API documentation. We should make sure that all of the calls > available in the {{Master::Http::scheduler}} handler are in the documentation > [here|https://github.com/apache/mesos/blob/master/docs/scheduler-http-api.md]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6655) Corrupted HTTP log output
[ https://issues.apache.org/jira/browse/MESOS-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6655: -- Target Version/s: 1.2.0 Priority: Blocker (was: Major) > Corrupted HTTP log output > - > > Key: MESOS-6655 > URL: https://issues.apache.org/jira/browse/MESOS-6655 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Neil Conway >Priority: Blocker > Labels: mesosphere > > The master log when running {{make check}} contains lines like: > {noformat} > I1130 12:28:56.747364 11811 http.cpp:391] HTTP GET for /master/state from > 10.0.49.2:48494n > {noformat} > Where {{BD}} is a binary sequence. This can be repro'd by running > {{MasterTest.*}} with revision 0d0805e914bc50717237e1246097af1d3b7ba92a; it > was presumably introduced by a prior revision, but I didn't triage it exactly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
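One conventional guard against this class of bug is to escape non-printable bytes before they reach the log. The sketch below is not the fix that landed in Mesos, only an illustration of the idea; the function name is hypothetical.

```python
def sanitize_for_log(raw: bytes) -> str:
    """Render bytes for logging, escaping anything non-printable so stray
    binary sequences (like the one in the report above) cannot corrupt
    terminal output or the log file."""
    return "".join(chr(b) if 32 <= b < 127 else "\\x%02x" % b
                   for b in raw)
```

Applied to the offending request path, the binary tail would show up as visible `\xNN` escapes instead of garbling the line.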
[jira] [Assigned] (MESOS-6472) Build support for ATTACH_CONTAINER_INPUT into the Agent API in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar reassigned MESOS-6472: - Assignee: Anand Mazumdar (was: Vinod Kone)
> Build support for ATTACH_CONTAINER_INPUT into the Agent API in Mesos
>
> Key: MESOS-6472
> URL: https://issues.apache.org/jira/browse/MESOS-6472
> Project: Mesos
> Issue Type: Task
> Reporter: Kevin Klues
> Assignee: Anand Mazumdar
> Labels: debugging, mesosphere
>
> Coupled with the ATTACH_CONTAINER_OUTPUT call, this call will attach a remote client to the input/output of the entrypoint of a container. All input/output data will be packed into I/O messages and interleaved with control messages sent between a client and the agent. A single chunked request will be used to stream messages to the agent over the input stream, and a single chunked response will be used to stream messages to the client over the output stream.
> This call will integrate with the I/O switchboard to stream data between the container and the HTTP stream.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6472) Build support for ATTACH_CONTAINER_INPUT into the Agent API in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6472: -- Shepherd: Vinod Kone
> Build support for ATTACH_CONTAINER_INPUT into the Agent API in Mesos
>
> Key: MESOS-6472
> URL: https://issues.apache.org/jira/browse/MESOS-6472
> Project: Mesos
> Issue Type: Task
> Reporter: Kevin Klues
> Assignee: Anand Mazumdar
> Labels: debugging, mesosphere
>
> Coupled with the ATTACH_CONTAINER_OUTPUT call, this call will attach a remote client to the input/output of the entrypoint of a container. All input/output data will be packed into I/O messages and interleaved with control messages sent between a client and the agent. A single chunked request will be used to stream messages to the agent over the input stream, and a single chunked response will be used to stream messages to the client over the output stream.
> This call will integrate with the I/O switchboard to stream data between the container and the HTTP stream.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
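Interleaving control and I/O messages on one stream, as described above, needs a framing scheme so the receiver can split them back apart. The real API uses protobuf messages over a chunked HTTP stream; the length-prefixed layout below is a hypothetical simplification to show the demultiplexing idea.

```python
import struct

CONTROL, DATA = 0, 1   # hypothetical message kinds sharing one stream

def frame(kind, payload):
    """Prefix a payload with a 1-byte kind and a 4-byte big-endian length."""
    return struct.pack("!BI", kind, len(payload)) + payload

def deframe(stream):
    """Split a byte stream back into (kind, payload) messages."""
    messages, offset = [], 0
    while offset < len(stream):
        kind, length = struct.unpack_from("!BI", stream, offset)
        offset += 5                  # header size: 1 (kind) + 4 (length)
        messages.append((kind, stream[offset:offset + length]))
        offset += length
    return messages

# Control and stdin data interleaved on the single input stream.
wire = frame(CONTROL, b"attach") + frame(DATA, b"ls -la\n") + frame(CONTROL, b"eof")
```

The agent side would route `CONTROL` frames to the call handler and `DATA` frames on to the I/O switchboard.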
[jira] [Commented] (MESOS-5900) Support Unix domain socket connections in libprocess
[ https://issues.apache.org/jira/browse/MESOS-5900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15698185#comment-15698185 ] Anand Mazumdar commented on MESOS-5900: --- This is in progress and being worked upon as part of the epic MESOS-6460. The reviews should be out shortly. > Support Unix domain socket connections in libprocess > > > Key: MESOS-5900 > URL: https://issues.apache.org/jira/browse/MESOS-5900 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Neil Conway >Assignee: Benjamin Hindman > Labels: mesosphere > > We should consider allowing two programs on the same host using libprocess to > communicate via Unix domain sockets rather than TCP. This has a few > advantages: > * Security: remote hosts cannot connect to the Unix socket. Domain sockets > also offer additional support for > [authentication|https://docs.fedoraproject.org/en-US/Fedora_Security_Team/1/html/Defensive_Coding/sect-Defensive_Coding-Authentication-UNIX_Domain.html]. > * Performance: domain sockets are marginally faster than localhost TCP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
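For reference, the primitive the ticket proposes is available from userspace as {{AF_UNIX}}: peers connect by filesystem path rather than host:port, which is what gives the security property above (remote hosts simply cannot reach it). A minimal echo round-trip in plain Python, nothing libprocess-specific:

```python
import os
import socket
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), "ipc.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)       # the address is a filesystem path, not host:port
server.listen(1)

def echo_once(server):
    """Serve one connection: only processes on this host (with filesystem
    access to the path) can ever connect."""
    conn, _ = server.accept()
    conn.sendall(conn.recv(1024))   # echo the request back
    conn.close()
    server.close()

threading.Thread(target=echo_once, args=(server,)).start()

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)
client.sendall(b"ping")
reply = client.recv(1024)
```

Filesystem permissions on the socket path (plus `SO_PEERCRED`-style credential checks) provide the additional authentication hooks mentioned in the ticket.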
[jira] [Updated] (MESOS-4279) Docker executor truncates task's output when the task is killed.
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4279: -- Labels: docker mesosphere won't-backport (was: docker mesosphere)
> Docker executor truncates task's output when the task is killed.
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 0.25.0, 0.26.0, 0.27.2, 0.28.1
> Reporter: Martin Bydzovsky
> Assignee: Benjamin Mahler
> Priority: Critical
> Labels: docker, mesosphere, won't-backport
> Fix For: 1.0.0
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I came across the following issue (it was already discussed on https://github.com/mesosphere/marathon/issues/2876, and the folks from Mesosphere got to the point that it's probably a docker containerizer problem...).
> To sum it up: when I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and run it through Marathon like:
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> then during the app restart I get the expected result - the task receives SIGTERM and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> the task during restart (issued from Marathon) dies immediately without having a chance to do any cleanup.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Docker executor truncates task's output when the task is killed.
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691430#comment-15691430 ] Anand Mazumdar commented on MESOS-4279: --- Ignoring backporting for 0.28.x since it's not straightforward.
> Docker executor truncates task's output when the task is killed.
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 0.25.0, 0.26.0, 0.27.2, 0.28.1
> Reporter: Martin Bydzovsky
> Assignee: Benjamin Mahler
> Priority: Critical
> Labels: docker, mesosphere
> Fix For: 1.0.0
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I came across the following issue (it was already discussed on https://github.com/mesosphere/marathon/issues/2876, and the folks from Mesosphere got to the point that it's probably a docker containerizer problem...).
> To sum it up: when I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and run it through Marathon like:
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> then during the app restart I get the expected result - the task receives SIGTERM and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> the task during restart (issued from Marathon) dies immediately without having a chance to do any cleanup.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5763) Task stuck in fetching is not cleaned up after --executor_registration_timeout.
[ https://issues.apache.org/jira/browse/MESOS-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5763: -- Target Version/s: (was: 0.28.3) Fix Version/s: 0.28.3
Backport for 0.28.x branch:
{noformat}
commit 52a0b0a41482da35dc736ec2fd445b6099e7a4e7
Author: Anand Mazumdar
Date: Tue Nov 22 20:38:43 2016 -0800

    Added MESOS-5763 to 0.28.3 CHANGELOG.

commit 2d61bde81e3d6fb7400ec5f7078ceedd8d2bb802
Author: Jiang Yan Xu
Date: Fri Jul 1 18:12:01 2016 -0700

    Made Mesos containerizer error messages more consistent.

    We've been using slightly different wordings of the same condition in
    multiple places in Mesos containerizer but they don't provide
    additional information about where this failure is thrown in a long
    continuation chain. Since failures don't capture the location in the
    code we'd better distinguish them in a more meaningful way to assist
    debugging.

    Review: https://reviews.apache.org/r/49653

commit d7f8b8558974ee8739d460d53faf54a52832b754
Author: Jiang Yan Xu
Date: Fri Jul 1 18:11:29 2016 -0700

    Improved Mesos containerizer invariant checking.

    One of the reasons for MESOS-5763 is due to the lack of invariant
    checking. Mesos containerizer transitions the container state in
    particular ways, so when continuation chains could potentially be
    interleaved with other actions we should verify the state transitions.

    Review: https://reviews.apache.org/r/49652

commit 008e04433026aaec49779197c4a7b6655d5bb693
Author: Jiang Yan Xu
Date: Fri Jul 1 15:25:54 2016 -0700

    Improved Mesos containerizer logging and documentation.

    Review: https://reviews.apache.org/r/49651

commit 90b5be8e95c5868ea9142625b97050a75d0664f5
Author: Jiang Yan Xu
Date: Wed Jul 6 13:48:34 2016 -0700

    Fail container launch if it's destroyed during logger->prepare().

    Review: https://reviews.apache.org/r/49725

commit 56b4c561e08a8cc36e5cbc3a786981412bf226dd
Author: Jiang Yan Xu
Date: Fri Jul 1 15:27:37 2016 -0700

    Fixed Mesos containerizer to set container FETCHING state.

    If the container state is not properly set to FETCHING, the Mesos
    agent cannot detect the terminated executor when the fetcher times
    out.

    Review: https://reviews.apache.org/r/49650
{noformat}
> Task stuck in fetching is not cleaned up after --executor_registration_timeout.
>
> Key: MESOS-5763
> URL: https://issues.apache.org/jira/browse/MESOS-5763
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 0.28.0, 1.0.0
> Reporter: Yan Xu
> Assignee: Yan Xu
> Priority: Blocker
> Fix For: 0.28.3, 1.0.0
>
> When the fetching process hangs forever due to reasons such as HDFS issues, the Mesos containerizer would attempt to destroy the container and kill the executor after {{--executor_registration_timeout}}. However this reliably fails for us: the executor would be killed by the launcher destroy and the container would be destroyed, but the agent would never find out that the executor is terminated, thus leaving the task in the STAGING state forever.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6466) Add support for streaming HTTP requests in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685566#comment-15685566 ] Anand Mazumdar commented on MESOS-6466: --- {noformat} commit 0ca11623e6deb98dc05ec90b0339960c251cd64e Author: Anand Mazumdar Date: Mon Nov 21 18:08:44 2016 -0800 Disabled tests relying on filtering HTTP events. Some tests that rely on filtering HTTP events based on type won't work now since the request body is not yet known when `visit()` is invoked. These would be enabled later as part of a separate JIRA issue. Review: https://reviews.apache.org/r/53491/ commit 5cd134fedf69900523f3088de32daae81f216437 Author: Anand Mazumdar Date: Mon Nov 21 18:08:30 2016 -0800 Added a test for request streaming and GZIP compression. These tests validate that clients can stream requests and send compressed GZIP requests using the connection abstraction. This also tests the implementation of the streaming decoder indirectly. Review: https://reviews.apache.org/r/53490/ commit a24cb4985c2333e2d15eeb8f971242f1754f81ab Author: Anand Mazumdar Date: Mon Nov 21 18:08:18 2016 -0800 Added support for request streaming to the connection abstraction. This required modifications to the `encode()` method to return a `Pipe::Reader` instead of the request body. The `send()` then reads from this pipe to send the request via the socket. Review: https://reviews.apache.org/r/53489/ commit d06d74562740767f0750e10907a327c5b45fef4c Author: Anand Mazumdar Date: Mon Nov 21 18:08:12 2016 -0800 Removed `convert()` continuations in favor of using `readAll()`. Review: https://reviews.apache.org/r/53488/ commit d30039ded0a434cdf9583e0a12b73e1b3661380e Author: Anand Mazumdar Date: Mon Nov 21 18:08:01 2016 -0800 Wired the libprocess code to use the streaming decoder. Old libprocess-style messages and routes not supporting request streaming read the body from the piped reader. Otherwise, the request is forwarded to the handler when the route supports streaming. 
Review: https://reviews.apache.org/r/53487/ commit c95e86c37dd82130016abf3a240ebfd869dfed2c Author: Anand Mazumdar Date: Mon Nov 21 18:07:54 2016 -0800 Parameterized existing decoder tests on the type of decoder. This allows us to not duplicate tests for the streaming request decoder. Review: https://reviews.apache.org/r/53511/ commit dceacc50ce8577b1d0fd5cde0e55dadef1907fdf Author: Anand Mazumdar Date: Mon Nov 21 18:07:47 2016 -0800 Removed extraneous socket argument from `DataDecoder` constructor. This argument is not used anywhere in the code. This makes it consistent with the streaming request decoder. Review: https://reviews.apache.org/r/53510/ commit 32e203ea194e3531ea8c5ded4d538bacb7cc2781 Author: Anand Mazumdar Date: Mon Nov 21 18:07:41 2016 -0800 Introduced a streaming request decoder in libprocess. This would become the de facto decoder used by libprocess and replace the existing `DataDecoder`. Review: https://reviews.apache.org/r/53486/ commit e8e3fe596f242767fc10ccb95cbdcd36c49a89a5 Author: Anand Mazumdar Date: Mon Nov 21 18:07:29 2016 -0800 Introduced a `readAll()` helper on `http::Pipe::Reader`. The helper reads from the pipe until EOF. This is used later to read BODY requests from the streaming request decoder. Review: https://reviews.apache.org/r/53485/ commit 5152728e3eeac8d6fac52545d0ebc5df6f2e42cb Author: Anand Mazumdar Date: Mon Nov 21 18:07:25 2016 -0800 Introduced `RouteOptions` to support streaming requests. This allows routes to specify configuration options. Currently, it only has one member, `streaming`, i.e., whether the route supports request streaming. This also enables us to add more options in the future without polluting overloads. Review: https://reviews.apache.org/r/53484/ commit f6286048a5f897ff6859b38a24c3d64aa3b54d01 Author: Anand Mazumdar Date: Mon Nov 21 18:07:20 2016 -0800 Introduced a reader member to `Request` to support request streaming. 
These new members are needed to support request streaming, i.e., the caller can use the writer to stream chunks to the server if the request body is not known in advance. Review: https://reviews.apache.org/r/53483/ commit 1003f3d208f6e06e8bf485e395190f9bd4e5fe24 Author: Anand Mazumdar Date: Mon Nov 21 18:07:11 2016 -0800 Initialized the POD type in the `Request` struct. Previously, the `keepAlive` member was not initialized correctly; the behavior is undefined if POD types are not correctly initialized. Review:
[jira] [Created] (MESOS-6623) Re-enable tests impacted by request streaming support
Anand Mazumdar created MESOS-6623: - Summary: Re-enable tests impacted by request streaming support Key: MESOS-6623 URL: https://issues.apache.org/jira/browse/MESOS-6623 Project: Mesos Issue Type: Bug Reporter: Anand Mazumdar Assignee: Anand Mazumdar Priority: Blocker We added support for HTTP request streaming in libprocess as part of MESOS-6466. However, this broke a few tests that relied on HTTP request filtering since the handlers no longer have access to the body of the request when {{visit()}} is invoked. We would need to revisit how we do HTTP request filtering and then re-enable these tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5662) Call parent class `SetUpTestCase` function in our test fixtures.
[ https://issues.apache.org/jira/browse/MESOS-5662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5662: -- Description: There are some occurrences in our code where we don't invoke the parent's {{SetUpTestCase}} method from a child test fixture. This can be a bit problematic if the parent class has its own custom {{SetUpTestCase}} logic or someone adds one in the future. It would be good to do a sweep across the code and explicitly invoke the parent class's method. Some examples (there are more): https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L80 https://github.com/apache/mesos/blob/master/src/tests/module_tests.cpp#L59 was: There are some occurrences in our code where we don't invoke the parent's {{SetUpTestCase}} method from a child test fixture. This can be a bit problematic if someone adds the method in the parent class sometime in the future. It would be good to do a sweep across the code and explicitly invoke the parent class's method. Some examples (there are more): https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L80 https://github.com/apache/mesos/blob/master/src/tests/module_tests.cpp#L59 > Call parent class `SetUpTestCase` function in our test fixtures. > > > Key: MESOS-5662 > URL: https://issues.apache.org/jira/browse/MESOS-5662 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Anand Mazumdar >Assignee: Manuwela Kanade > Labels: mesosphere, newbie > > There are some occurrences in our code where we don't invoke the parent's > {{SetUpTestCase}} method from a child test fixture. This can be a bit > problematic if the parent class has its own custom {{SetUpTestCase}} logic or > someone adds one in the future. It would be good to do a sweep across the > code and explicitly invoke the parent class's method. 
> Some examples (there are more): > https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L80 > https://github.com/apache/mesos/blob/master/src/tests/module_tests.cpp#L59 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6597) Include missing Mesos Java classes for Protobuf files to support Operator HTTP V1 API
[ https://issues.apache.org/jira/browse/MESOS-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6597: -- Shepherd: Anand Mazumdar Target Version/s: 1.1.1, 1.2.0 Priority: Blocker (was: Major) > Include missing Mesos Java classes for Protobuf files to support Operator > HTTP V1 API > - > > Key: MESOS-6597 > URL: https://issues.apache.org/jira/browse/MESOS-6597 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Vijay Srinivasaraghavan >Assignee: Vijay Srinivasaraghavan >Priority: Blocker > > For V1 API support, the build file that generates the Java proto wrappers > currently includes only the executor and scheduler protos. > (https://github.com/apache/mesos/blob/master/src/Makefile.am#L334) > To support the operator HTTP API, we also need to generate Java protos for > additional proto definitions like quota, maintenance, etc. These Java > definition files will be used by a standard REST client when using the > HTTP API directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6576) DefaultExecutorTest.KillTaskGroupOnTaskFailure sometimes fails in CI
[ https://issues.apache.org/jira/browse/MESOS-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655649#comment-15655649 ] Anand Mazumdar commented on MESOS-6576: --- The root cause is similar to MESOS-6569 i.e., we erroneously expect the {{TASK_RUNNING}} updates for different tasks to arrive in order. > DefaultExecutorTest.KillTaskGroupOnTaskFailure sometimes fails in CI > > > Key: MESOS-6576 > URL: https://issues.apache.org/jira/browse/MESOS-6576 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: James Peach > Labels: flaky-test, mesosphere, newbie > Attachments: KillTaskGroupOnTaskFailure.failure.log, > KillTaskGroupOnTaskFailure.success.log > > > {{DefaultExecutorTest.KillTaskGroupOnTaskFailure}} sometimes fails in the ASF > CI. > Interesting pieces of the failing test run: > {noformat} > ... > I1110 20:38:54.775871 29740 status_update_manager.cpp:323] Received status > update TASK_KILLED (UUID: a4746389-8155-44e0-ada4-00b8d3e997c1) for task > df99cc50-9b0f-4692-afc9-d587c3515a67 of framework > 2df0125f-4865-4aba-b13d-02f338815729- > I1110 20:38:54.776181 29730 slave.cpp:4075] Status update manager > successfully handled status update TASK_KILLED (UUID: > a4746389-8155-44e0-ada4-00b8d3e997c1) for task > df99cc50-9b0f-4692-afc9-d587c3515a67 of framework > 2df0125f-4865-4aba-b13d-02f338815729- > I1110 20:38:55.456354 29738 hierarchical.cpp:1880] Filtered offer with > cpus(*):1.7; mem(*):928; disk(*):928; ports(*):[31000-32000] on agent > 2df0125f-4865-4aba-b13d-02f338815729-S0 for framework > 2df0125f-4865-4aba-b13d-02f338815729- > I1110 20:38:55.456434 29738 hierarchical.cpp:1694] No allocations performed > I1110 20:38:55.456468 29738 hierarchical.cpp:1789] No inverse offers to send > out! 
> I1110 20:38:55.456545 29738 hierarchical.cpp:1286] Performed allocation for 1 > agents in 745185ns > I1110 20:38:55.875964 29731 containerizer.cpp:2336] Container > a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 has exited > I1110 20:38:55.876022 29731 containerizer.cpp:1973] Destroying container > a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 in RUNNING state > I1110 20:38:55.876387 29731 launcher.cpp:143] Asked to destroy container > a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 > I1110 20:38:55.881464 29728 provisioner.cpp:324] Ignoring destroy request for > unknown container a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 > I1110 20:38:55.882894 29730 slave.cpp:4672] Executor 'default' of framework > 2df0125f-4865-4aba-b13d-02f338815729- exited with status 0 > I1110 20:38:55.883446 29741 master.cpp:5884] Executor 'default' of framework > 2df0125f-4865-4aba-b13d-02f338815729- on agent > 2df0125f-4865-4aba-b13d-02f338815729-S0 at slave(18)@172.17.0.2:36164 > (ade222407ffe): exited with status 0 > I1110 20:38:55.883545 29741 master.cpp:7840] Removing executor 'default' with > resources cpus(*):0.1; mem(*):32; disk(*):32 of framework > 2df0125f-4865-4aba-b13d-02f338815729- on agent > 2df0125f-4865-4aba-b13d-02f338815729-S0 at slave(18)@172.17.0.2:36164 > (ade222407ffe) > I1110 20:38:55.884820 29729 hierarchical.cpp:1018] Recovered cpus(*):0.1; > mem(*):32; disk(*):32 (total: cpus(*):2; mem(*):1024; disk(*):1024; > ports(*):[31000-32000], allocated: cpus(*):0.2; mem(*):64; disk(*):64) on > agent 2df0125f-4865-4aba-b13d-02f338815729-S0 from framework > 2df0125f-4865-4aba-b13d-02f338815729- > I1110 20:38:55.885892 29737 scheduler.cpp:675] Enqueuing event FAILURE > received from http://172.17.0.2:36164/master/api/v1/scheduler > GMOCK WARNING: > Uninteresting mock function call - returning directly. 
> Function call: failure(0x7ffdc4df11f0, @0x2b639800b6b0 48-byte object > 90-82 AC-51 63-2B 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 > 70-0A 01-98 63-2B 00-00 20-C7 00-98 63-2B 00-00 00-00 00-00 63-2B 00-00) > ... > I1110 20:39:04.566794 29732 master.cpp:7715] Updating the state of task > e72d5139-0a11-48af-9d43-d4163c1404ee of framework > 2df0125f-4865-4aba-b13d-02f338815729- (latest state: TASK_FAILED, status > update state: TASK_RUNNING) > ... > I1110 20:39:04.569413 29736 scheduler.cpp:675] Enqueuing event UPDATE > received from http://172.17.0.2:36164/master/api/v1/scheduler > ../../src/tests/default_executor_tests.cpp:583: Failure > Value of: taskStates > Actual: { (df99cc50-9b0f-4692-afc9-d587c3515a67, TASK_KILLED), > (e72d5139-0a11-48af-9d43-d4163c1404ee, TASK_FAILED) } > Expected: expectedTaskStates > Which is: { (df99cc50-9b0f-4692-afc9-d587c3515a67, TASK_RUNNING), > (e72d5139-0a11-48af-9d43-d4163c1404ee, TASK_RUNNING) } > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6576) DefaultExecutorTest.KillTaskGroupOnTaskFailure sometimes fails in CI
[ https://issues.apache.org/jira/browse/MESOS-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6576: -- Labels: flaky-test mesosphere newbie (was: ) > DefaultExecutorTest.KillTaskGroupOnTaskFailure sometimes fails in CI > > > Key: MESOS-6576 > URL: https://issues.apache.org/jira/browse/MESOS-6576 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: James Peach > Labels: flaky-test, mesosphere, newbie > Attachments: KillTaskGroupOnTaskFailure.failure.log, > KillTaskGroupOnTaskFailure.success.log > > > {{DefaultExecutorTest.KillTaskGroupOnTaskFailure}} sometimes fails in the ASF > CI. > Interesting pieces of the failing test run: > {noformat} > ... > I1110 20:38:54.775871 29740 status_update_manager.cpp:323] Received status > update TASK_KILLED (UUID: a4746389-8155-44e0-ada4-00b8d3e997c1) for task > df99cc50-9b0f-4692-afc9-d587c3515a67 of framework > 2df0125f-4865-4aba-b13d-02f338815729- > I1110 20:38:54.776181 29730 slave.cpp:4075] Status update manager > successfully handled status update TASK_KILLED (UUID: > a4746389-8155-44e0-ada4-00b8d3e997c1) for task > df99cc50-9b0f-4692-afc9-d587c3515a67 of framework > 2df0125f-4865-4aba-b13d-02f338815729- > I1110 20:38:55.456354 29738 hierarchical.cpp:1880] Filtered offer with > cpus(*):1.7; mem(*):928; disk(*):928; ports(*):[31000-32000] on agent > 2df0125f-4865-4aba-b13d-02f338815729-S0 for framework > 2df0125f-4865-4aba-b13d-02f338815729- > I1110 20:38:55.456434 29738 hierarchical.cpp:1694] No allocations performed > I1110 20:38:55.456468 29738 hierarchical.cpp:1789] No inverse offers to send > out! 
> I1110 20:38:55.456545 29738 hierarchical.cpp:1286] Performed allocation for 1 > agents in 745185ns > I1110 20:38:55.875964 29731 containerizer.cpp:2336] Container > a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 has exited > I1110 20:38:55.876022 29731 containerizer.cpp:1973] Destroying container > a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 in RUNNING state > I1110 20:38:55.876387 29731 launcher.cpp:143] Asked to destroy container > a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 > I1110 20:38:55.881464 29728 provisioner.cpp:324] Ignoring destroy request for > unknown container a56ac08b-8f97-4ae4-a2e8-5ef5d55fbe98 > I1110 20:38:55.882894 29730 slave.cpp:4672] Executor 'default' of framework > 2df0125f-4865-4aba-b13d-02f338815729- exited with status 0 > I1110 20:38:55.883446 29741 master.cpp:5884] Executor 'default' of framework > 2df0125f-4865-4aba-b13d-02f338815729- on agent > 2df0125f-4865-4aba-b13d-02f338815729-S0 at slave(18)@172.17.0.2:36164 > (ade222407ffe): exited with status 0 > I1110 20:38:55.883545 29741 master.cpp:7840] Removing executor 'default' with > resources cpus(*):0.1; mem(*):32; disk(*):32 of framework > 2df0125f-4865-4aba-b13d-02f338815729- on agent > 2df0125f-4865-4aba-b13d-02f338815729-S0 at slave(18)@172.17.0.2:36164 > (ade222407ffe) > I1110 20:38:55.884820 29729 hierarchical.cpp:1018] Recovered cpus(*):0.1; > mem(*):32; disk(*):32 (total: cpus(*):2; mem(*):1024; disk(*):1024; > ports(*):[31000-32000], allocated: cpus(*):0.2; mem(*):64; disk(*):64) on > agent 2df0125f-4865-4aba-b13d-02f338815729-S0 from framework > 2df0125f-4865-4aba-b13d-02f338815729- > I1110 20:38:55.885892 29737 scheduler.cpp:675] Enqueuing event FAILURE > received from http://172.17.0.2:36164/master/api/v1/scheduler > GMOCK WARNING: > Uninteresting mock function call - returning directly. 
> Function call: failure(0x7ffdc4df11f0, @0x2b639800b6b0 48-byte object > 90-82 AC-51 63-2B 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 > 70-0A 01-98 63-2B 00-00 20-C7 00-98 63-2B 00-00 00-00 00-00 63-2B 00-00) > ... > I1110 20:39:04.566794 29732 master.cpp:7715] Updating the state of task > e72d5139-0a11-48af-9d43-d4163c1404ee of framework > 2df0125f-4865-4aba-b13d-02f338815729- (latest state: TASK_FAILED, status > update state: TASK_RUNNING) > ... > I1110 20:39:04.569413 29736 scheduler.cpp:675] Enqueuing event UPDATE > received from http://172.17.0.2:36164/master/api/v1/scheduler > ../../src/tests/default_executor_tests.cpp:583: Failure > Value of: taskStates > Actual: { (df99cc50-9b0f-4692-afc9-d587c3515a67, TASK_KILLED), > (e72d5139-0a11-48af-9d43-d4163c1404ee, TASK_FAILED) } > Expected: expectedTaskStates > Which is: { (df99cc50-9b0f-4692-afc9-d587c3515a67, TASK_RUNNING), > (e72d5139-0a11-48af-9d43-d4163c1404ee, TASK_RUNNING) } > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6405) Benchmark call ingestion path on the Mesos master.
[ https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6405: -- Sprint: Mesosphere Sprint 45 (was: Mesosphere Sprint 45, Mesosphere Sprint 46) > Benchmark call ingestion path on the Mesos master. > -- > > Key: MESOS-6405 > URL: https://issues.apache.org/jira/browse/MESOS-6405 > Project: Mesos > Issue Type: Improvement >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Critical > Labels: mesosphere > > [~drexin] reported on the user mailing > [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E] > that there seems to be a significant regression in performance on the call > ingestion path on the Mesos master with respect to the scheduler driver (v0 API). > We should create a benchmark to first get a sense of the numbers and then go > about fixing the performance issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6569) MesosContainerizer/DefaultExecutorTest.KillTask/0 failing on ASF CI
[ https://issues.apache.org/jira/browse/MESOS-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652704#comment-15652704 ] Anand Mazumdar commented on MESOS-6569: --- Similar to the logic that allows the {{TASK_KILLED}} updates for the tasks to be received in any order, we need logic to ensure that the {{TASK_RUNNING}} updates can also be received in any order by the scheduler. > MesosContainerizer/DefaultExecutorTest.KillTask/0 failing on ASF CI > --- > > Key: MESOS-6569 > URL: https://issues.apache.org/jira/browse/MESOS-6569 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 > Environment: > https://builds.apache.org/job/Mesos/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-6)&&(!ubuntu-eu2)/ >Reporter: Yan Xu > Labels: flaky, newbie > > {noformat:title=} > [ RUN ] MesosContainerizer/DefaultExecutorTest.KillTask/0 > I1110 01:20:11.482097 29700 cluster.cpp:158] Creating default 'local' > authorizer > I1110 01:20:11.485241 29700 leveldb.cpp:174] Opened db in 2.774513ms > I1110 01:20:11.486237 29700 leveldb.cpp:181] Compacted db in 953614ns > I1110 01:20:11.486299 29700 leveldb.cpp:196] Created db iterator in 24739ns > I1110 01:20:11.486325 29700 leveldb.cpp:202] Seeked to beginning of db in > 2300ns > I1110 01:20:11.486344 29700 leveldb.cpp:271] Iterated through 0 keys in the > db in 378ns > I1110 01:20:11.486399 29700 replica.cpp:776] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1110 01:20:11.486933 29733 recover.cpp:451] Starting replica recovery > I1110 01:20:11.487289 29733 recover.cpp:477] Replica is in EMPTY status > I1110 01:20:11.488503 29721 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from __req_res__(7318)@172.17.0.3:52462 > I1110 01:20:11.488855 29727 recover.cpp:197] Received a recover response from > a replica in EMPTY status > I1110 
01:20:11.489398 29729 recover.cpp:568] Updating replica status to > STARTING > I1110 01:20:11.490223 29723 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 575135ns > I1110 01:20:11.490284 29732 master.cpp:380] Master > d28fbae1-c3dc-45fa-8384-32ab9395a975 (3a31be8bf679) started on > 172.17.0.3:52462 > I1110 01:20:11.490317 29732 master.cpp:382] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/k50x7x/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" > --registry_max_agent_count="102400" --registry_store_timeout="100secs" > --registry_strict="false" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/mesos/mesos-1.2.0/_inst/share/mesos/webui" > --work_dir="/tmp/k50x7x/master" --zk_session_timeout="10secs" > I1110 01:20:11.490696 29732 master.cpp:432] Master only allowing > authenticated frameworks to register > I1110 01:20:11.490712 29732 master.cpp:446] Master only allowing > authenticated agents to register > I1110 01:20:11.490720 29732 master.cpp:459] Master only allowing > authenticated HTTP frameworks to register > I1110 01:20:11.490730 29732 credentials.hpp:37] Loading 
credentials for > authentication from '/tmp/k50x7x/credentials' > I1110 01:20:11.490281 29723 replica.cpp:320] Persisted replica status to > STARTING > I1110 01:20:11.491210 29732 master.cpp:504] Using default 'crammd5' > authenticator > I1110 01:20:11.491225 29720 recover.cpp:477] Replica is in STARTING status > I1110 01:20:11.491394 29732 http.cpp:895] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1110 01:20:11.491621 29732 http.cpp:895] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1110 01:20:11.491770 29732 http.cpp:895] Using default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1110 01:20:11.491937 29732 master.cpp:584] Authorization enabled > I1110
[jira] [Updated] (MESOS-6552) Add ability to filter events on the subscriber stream for Master API.
[ https://issues.apache.org/jira/browse/MESOS-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6552: -- Description: Currently, the v1 Master API allows an operator to subscribe to events happening on their clusters e.g., any time a new task is launched/updated. However, there is no ability currently for a subscriber to express its interest in a particular subset of events on the master e.g, only in agent related events (add/removal) etc. This would also take care of use cases where a subscriber would be short lived i.e., is only interested to see if a particular task has been launched on the cluster by the framework and then close its connection thereafter. Currently, such subscribers also receive the entire snapshot of the cluster via the {{SNAPSHOT}} events that can be rather huge for production clusters (we also don't support compression on the stream yet). Such subscribers in the future would be able to opt out of this event. was: Currently, the v1 Master API allows an operator to subscribe to events happening on their clusters e.g., any time a new task is launched/updated. However, there is no ability currently for a subscriber to express its interest in a particular subset of events on the master e.g, only in task add/updated events This would also take care of use cases where a subscriber would be short lived i.e., is only interested to see if a particular task has been launched on the cluster by the framework and then close its connection thereafter. Currently, such subscribers also receive the entire snapshot of the cluster via the {{SNAPSHOT}} events that can be rather huge for production clusters (we also don't support compression on the stream yet). Such subscribers in the future would be able to opt out of this event. > Add ability to filter events on the subscriber stream for Master API. 
> - > > Key: MESOS-6552 > URL: https://issues.apache.org/jira/browse/MESOS-6552 > Project: Mesos > Issue Type: Improvement >Reporter: Anand Mazumdar > Labels: mesosphere > > Currently, the v1 Master API allows an operator to subscribe to events > happening on their clusters e.g., any time a new task is launched/updated. > However, there is no ability currently for a subscriber to express its > interest in a particular subset of events on the master e.g, only in agent > related events (add/removal) etc. > This would also take care of use cases where a subscriber would be short > lived i.e., is only interested to see if a particular task has been launched > on the cluster by the framework and then close its connection thereafter. > Currently, such subscribers also receive the entire snapshot of the cluster > via the {{SNAPSHOT}} events that can be rather huge for production clusters > (we also don't support compression on the stream yet). Such subscribers in > the future would be able to opt out of this event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6552) Add ability to filter events on the subscriber stream for Master API.
Anand Mazumdar created MESOS-6552: - Summary: Add ability to filter events on the subscriber stream for Master API. Key: MESOS-6552 URL: https://issues.apache.org/jira/browse/MESOS-6552 Project: Mesos Issue Type: Improvement Reporter: Anand Mazumdar Currently, the v1 Master API allows an operator to subscribe to events happening on their clusters, e.g., any time a new task is launched/updated. However, there is currently no way for a subscriber to express its interest in a particular subset of events on the master, e.g., only in task added/updated events. This would also take care of use cases where a subscriber would be short-lived, i.e., only interested in seeing whether a particular task has been launched on the cluster by the framework and then closing its connection thereafter. Currently, such subscribers also receive the entire snapshot of the cluster via the {{SNAPSHOT}} events, which can be rather huge for production clusters (we also don't support compression on the stream yet). Such subscribers would in the future be able to opt out of this event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6466) Add support for streaming HTTP requests in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637168#comment-15637168 ] Anand Mazumdar edited comment on MESOS-6466 at 11/4/16 8:25 PM: Review Chain: https://reviews.apache.org/r/53481/ Currently still left: - Fix all tests relying on filtering HTTP events (see r53491 for more details) - Parameterize the existing decoder tests thereby making them also work for the streaming decoder. - Include Ben's fix for streaming gzip decompression (MESOS-6530) was (Author: anandmazumdar): Review Chain: https://reviews.apache.org/r/53481/ Currently still left: - Fix all tests relying on filtering HTTP events (see r53491 for more details) - Parameterize the existing decoder tests thereby making them also work for the streaming decoder. > Add support for streaming HTTP requests in Mesos > > > Key: MESOS-6466 > URL: https://issues.apache.org/jira/browse/MESOS-6466 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Assignee: Anand Mazumdar > Labels: debugging, mesosphere > > We already have support for streaming HTTP responses in Mesos. We now also > need to add support for streaming HTTP requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5763) Task stuck in fetching is not cleaned up after --executor_registration_timeout.
[ https://issues.apache.org/jira/browse/MESOS-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5763: -- Target Version/s: 0.28.3 Fix Version/s: (was: 0.27.4) (was: 0.28.3) > Task stuck in fetching is not cleaned up after > --executor_registration_timeout. > --- > > Key: MESOS-5763 > URL: https://issues.apache.org/jira/browse/MESOS-5763 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.28.0, 1.0.0 >Reporter: Yan Xu >Assignee: Yan Xu >Priority: Blocker > Fix For: 1.0.0 > > > When the fetching process hangs forever due to reasons such as HDFS issues, > Mesos containerizer would attempt to destroy the container and kill the > executor after {{--executor_registration_timeout}}. However this reliably > fails for us: the executor would be killed by the launcher destroy and the > container would be destroyed but the agent would never find out that the > executor is terminated thus leaving the task in the STAGING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6391) Command task's sandbox should not be owned by root if it uses container image.
[ https://issues.apache.org/jira/browse/MESOS-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6391: -- Target Version/s: 1.0.2, 1.1.0 (was: 0.28.3, 1.0.2, 1.1.0) Removing the target version from 0.28.3 since it's not a trivial backport. cc: [~jieyu] > Command task's sandbox should not be owned by root if it uses container image. > -- > > Key: MESOS-6391 > URL: https://issues.apache.org/jira/browse/MESOS-6391 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.28.2, 1.0.1 >Reporter: Jie Yu >Assignee: Jie Yu >Priority: Blocker > Fix For: 1.0.2, 1.1.0 > > > Currently, if the task defines a container image, the command executor will > be run under root because it needs to perform pivot_root. > That means if the task wants to run under an unprivileged user, the sandbox > of that task will not be writable because it's owned by root. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6527) Memory leak in the libprocess request decoder.
[ https://issues.apache.org/jira/browse/MESOS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627944#comment-15627944 ] Anand Mazumdar commented on MESOS-6527: --- 0.28.x backport {noformat} commit 4033b37087056c63bc9b90969288ad5a9fa7f4ff Author: Anand Mazumdar Date: Tue Nov 1 21:32:55 2016 -0700 Fixed memory leak in request/response decoders. The leak can happen in cases where a client disconnects while the request/response is in progress. Review: https://reviews.apache.org/r/53361/ commit 94cdfd01cebdcc8c2ecc52dc9d402fa6191aad87 Author: Anand Mazumdar Date: Tue Nov 1 21:38:42 2016 -0700 Added MESOS-6527 to CHANGELOG for 0.28.3. {noformat} 1.0.2 backport {noformat} commit 07a4a242d7e722840c63e9b0d6a444ad5e6b1ec3 Author: Anand Mazumdar Date: Tue Nov 1 21:32:55 2016 -0700 Fixed memory leak in request/response decoders. The leak can happen in cases where a client disconnects while the request/response is in progress. Review: https://reviews.apache.org/r/53361/ commit 9a218e3edf3d9fac0a83817d26ad689cf7d53f05 Author: Anand Mazumdar Date: Tue Nov 1 21:37:39 2016 -0700 Added MESOS-6527 to CHANGELOG for 1.0.2. {noformat} 1.1.0 branch {noformat} commit eaec806adefa1206c242e0409c6022a3bc115f6d Author: Anand Mazumdar Date: Tue Nov 1 21:32:55 2016 -0700 Fixed memory leak in request/response decoders. The leak can happen in cases where a client disconnects while the request/response is in progress. Review: https://reviews.apache.org/r/53361/ commit 4f655fd98c91b8e72dca4f4c7c5faf024a78d763 Author: Anand Mazumdar Date: Tue Nov 1 21:36:29 2016 -0700 Added MESOS-6527 to CHANGELOG for 1.1.0. {noformat} > Memory leak in the libprocess request decoder. 
> -- > > Key: MESOS-6527 > URL: https://issues.apache.org/jira/browse/MESOS-6527 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > Fix For: 0.28.3, 1.0.2, 1.1.0, 1.2.0 > > > The libprocess decoder can leak a {{Request}} object in cases when a client > disconnects while the request is in progress. In such cases, the decoder's > destructor won't delete the active {{Request}} object that it had allocated > on the heap. > https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/decoder.hpp#L271 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
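[Editor's note] The MESOS-6527 leak above follows a common C++ ownership pattern: a parser owns a heap-allocated in-progress object, and its destructor forgets to release it. The sketch below is a minimal, hypothetical model (not the actual libprocess decoder; `Request`, `RequestDecoder`, `feed`, and `complete` are stand-in names) showing how a client disconnect mid-request leaks the object unless the destructor cleans up.

```cpp
#include <string>

// Stand-in for libprocess's http::Request; the real type lives in
// 3rdparty/libprocess and carries method, headers, body, etc.
struct Request
{
  std::string body;
  static int liveCount;  // Track allocations so a leak is observable.
  Request() { ++liveCount; }
  ~Request() { --liveCount; }
};

int Request::liveCount = 0;

// Minimal model of the bug: while parsing, the decoder owns a heap
// Request. Before the fix, the destructor ignored it, so destroying the
// decoder mid-request (client disconnect) leaked the allocation.
class RequestDecoder
{
public:
  ~RequestDecoder()
  {
    delete request;  // The fix: free any request still being parsed.
  }

  void feed(const std::string& chunk)
  {
    if (request == nullptr) {
      request = new Request();  // Allocated when the first bytes arrive.
    }
    request->body += chunk;
  }

  // On a complete request, ownership transfers to the caller; the
  // decoder forgets the pointer so its destructor has nothing to free.
  Request* complete()
  {
    Request* result = request;
    request = nullptr;
    return result;
  }

private:
  Request* request = nullptr;
};
```

Destroying a decoder that has only seen a partial request now releases the pending `Request`, which is exactly the disconnect-in-progress case the commits above describe.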
[jira] [Updated] (MESOS-6527) Memory leak in the libprocess request decoder.
[ https://issues.apache.org/jira/browse/MESOS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6527: -- Story Points: 2 (was: 1) > Memory leak in the libprocess request decoder. > -- > > Key: MESOS-6527 > URL: https://issues.apache.org/jira/browse/MESOS-6527 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > Fix For: 0.28.3, 1.0.2, 1.1.0, 1.2.0 > > > The libprocess decoder can leak a {{Request}} object in cases when a client > disconnects while the request is in progress. In such cases, the decoder's > destructor won't delete the active {{Request}} object that it had allocated > on the heap. > https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/decoder.hpp#L271 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6527) Memory leak in the libprocess request decoder.
Anand Mazumdar created MESOS-6527: - Summary: Memory leak in the libprocess request decoder. Key: MESOS-6527 URL: https://issues.apache.org/jira/browse/MESOS-6527 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Anand Mazumdar Assignee: Anand Mazumdar Priority: Blocker The libprocess decoder can leak a {{Request}} object in cases when a client disconnects while the request is in progress. In such cases, the decoder's destructor won't delete the active {{Request}} object that it had allocated on the heap. https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/decoder.hpp#L271 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6466) Add support for streaming HTTP requests in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6466: -- Shepherd: Benjamin Mahler > Add support for streaming HTTP requests in Mesos > > > Key: MESOS-6466 > URL: https://issues.apache.org/jira/browse/MESOS-6466 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Assignee: Anand Mazumdar > Labels: debugging, mesosphere > > We already have support for streaming HTTP responses in Mesos. We now also > need to add support for streaming HTTP requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6466) Add support for streaming HTTP requests in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar reassigned MESOS-6466: - Assignee: Anand Mazumdar (was: Kevin Klues) > Add support for streaming HTTP requests in Mesos > > > Key: MESOS-6466 > URL: https://issues.apache.org/jira/browse/MESOS-6466 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Assignee: Anand Mazumdar > Labels: debugging, mesosphere > > We already have support for streaming HTTP responses in Mesos. We now also > need to add support for streaming HTTP requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6507) 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails consistently.
[ https://issues.apache.org/jira/browse/MESOS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15617270#comment-15617270 ] Anand Mazumdar edited comment on MESOS-6507 at 10/29/16 2:53 AM: - It was an oversight on my part, had forgotten backporting Ben's UUID patches. This should unblock the 1.0.2 release. {noformat} commit 9e0f9505bae40a5f803d9a3eebfebe62287fbe91 Author: Benjamin Mahler Date: Tue Sep 20 14:17:30 2016 -0700 Updated scheduler library to handle UUID parsing error. Previously this would have thrown an exception. Review: https://reviews.apache.org/r/52099 commit 4e4d058ea3c012b2e6d4bbed58ef7fbaea5b60fb Author: Benjamin Mahler Date: Tue Sep 20 14:14:39 2016 -0700 Updated UUID::fromString to not throw an exception on error. The exception from the string_generator needs to be caught so that we can surface a Try to the caller. Review: https://reviews.apache.org/r/52098 {noformat} was (Author: anandmazumdar): It was an oversight on my part, had forgotten backporting Ben's UUID patches. This should unblock the 1.0.2 release. However, we still need to fix the test flakiness on HEAD. I would update the JIRA with the flaky test log. {noformat} commit 9e0f9505bae40a5f803d9a3eebfebe62287fbe91 Author: Benjamin Mahler Date: Tue Sep 20 14:17:30 2016 -0700 Updated scheduler library to handle UUID parsing error. Previously this would have thrown an exception. Review: https://reviews.apache.org/r/52099 commit 4e4d058ea3c012b2e6d4bbed58ef7fbaea5b60fb Author: Benjamin Mahler Date: Tue Sep 20 14:14:39 2016 -0700 Updated UUID::fromString to not throw an exception on error. The exception from the string_generator needs to be caught so that we can surface a Try to the caller. Review: https://reviews.apache.org/r/52098 {noformat} > 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails > consistently. 
> -- > > Key: MESOS-6507 > URL: https://issues.apache.org/jira/browse/MESOS-6507 > Project: Mesos > Issue Type: Bug > Components: docker, test >Reporter: Gilbert Song >Priority: Blocker > Labels: failure > > Here is the log: > {noformat} > [23:09:24] : [Step 10/10] [ RUN ] > DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.304638 31435 docker.cpp:933] > Running docker -H unix:///var/run/docker.sock rm -f -v mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.398941 31435 resources.cpp:572] > Parsing resources as JSON failed: cpus:1;mem:512 > [23:09:24] : [Step 10/10] Trying semicolon-delimited string format instead > [23:09:24] : [Step 10/10] I1028 23:09:24.399123 31435 docker.cpp:809] > Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory > 536870912 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e > MESOS_CONTAINER_NAME=mesos-s1.malformedUUID -v > /mnt/teamcity/temp/buildTmp/DockerContainerizerTest_ROOT_DOCKER_SkipRecoverMalformedUUID_rjDyqa:/mnt/mesos/sandbox > --net host --entrypoint /bin/sh --name mesos-s1.malformedUUID alpine -c > sleep 1000 > [23:09:24] : [Step 10/10] I1028 23:09:24.401227 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.700460 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.804401 31453 docker.cpp:785] > Recovering Docker containers > [23:09:24] : [Step 10/10] I1028 23:09:24.804477 31453 docker.cpp:1091] > Running docker -H unix:///var/run/docker.sock ps -a > [23:09:24] : [Step 10/10] I1028 23:09:24.905027 31454 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:25] : [Step 10/10] W1028 23:09:25.008965 31454 docker.cpp:838] > Skipping recovery of executor '' of framework '' because its latest run could > not be 
recovered > [23:09:25] : [Step 10/10] I1028 23:09:25.008996 31454 docker.cpp:957] > Checking if Docker container named '/mesos-s1.malformedUUID' was started by > Mesos > [23:09:25] : [Step 10/10] I1028 23:09:25.009019 31454 docker.cpp:967] > Checking if Mesos container with ID 'malformedUUID' has been orphaned > [23:09:25] : [Step 10/10] I1028 23:09:25.009052 31454 docker.cpp:860] > Running docker -H unix:///var/run/docker.sock stop -t 0 > 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747 > [23:09:25] : [Step 10/10] I1028 23:09:25.109345 31451 docker.cpp:933] > Running docker -H unix:///var/run/docker.sock rm -v >
[jira] [Updated] (MESOS-6507) 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails consistently.
[ https://issues.apache.org/jira/browse/MESOS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6507: -- Target Version/s: (was: 1.0.2) > 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails > consistently. > -- > > Key: MESOS-6507 > URL: https://issues.apache.org/jira/browse/MESOS-6507 > Project: Mesos > Issue Type: Bug > Components: docker, test >Reporter: Gilbert Song >Priority: Blocker > Labels: failure > > Here is the log: > {noformat} > [23:09:24] : [Step 10/10] [ RUN ] > DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.304638 31435 docker.cpp:933] > Running docker -H unix:///var/run/docker.sock rm -f -v mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.398941 31435 resources.cpp:572] > Parsing resources as JSON failed: cpus:1;mem:512 > [23:09:24] : [Step 10/10] Trying semicolon-delimited string format instead > [23:09:24] : [Step 10/10] I1028 23:09:24.399123 31435 docker.cpp:809] > Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory > 536870912 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e > MESOS_CONTAINER_NAME=mesos-s1.malformedUUID -v > /mnt/teamcity/temp/buildTmp/DockerContainerizerTest_ROOT_DOCKER_SkipRecoverMalformedUUID_rjDyqa:/mnt/mesos/sandbox > --net host --entrypoint /bin/sh --name mesos-s1.malformedUUID alpine -c > sleep 1000 > [23:09:24] : [Step 10/10] I1028 23:09:24.401227 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.700460 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.804401 31453 docker.cpp:785] > Recovering Docker containers > [23:09:24] : [Step 10/10] I1028 23:09:24.804477 31453 docker.cpp:1091] > Running docker -H unix:///var/run/docker.sock ps -a > [23:09:24] : [Step 10/10] I1028 
23:09:24.905027 31454 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:25] : [Step 10/10] W1028 23:09:25.008965 31454 docker.cpp:838] > Skipping recovery of executor '' of framework '' because its latest run could > not be recovered > [23:09:25] : [Step 10/10] I1028 23:09:25.008996 31454 docker.cpp:957] > Checking if Docker container named '/mesos-s1.malformedUUID' was started by > Mesos > [23:09:25] : [Step 10/10] I1028 23:09:25.009019 31454 docker.cpp:967] > Checking if Mesos container with ID 'malformedUUID' has been orphaned > [23:09:25] : [Step 10/10] I1028 23:09:25.009052 31454 docker.cpp:860] > Running docker -H unix:///var/run/docker.sock stop -t 0 > 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747 > [23:09:25] : [Step 10/10] I1028 23:09:25.109345 31451 docker.cpp:933] > Running docker -H unix:///var/run/docker.sock rm -v > 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747 > [23:09:25] : [Step 10/10] I1028 23:09:25.212870 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:25] : [Step 10/10] I1028 23:09:25.513255 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:25] : [Step 10/10] I1028 23:09:25.815946 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:26] : [Step 10/10] I1028 23:09:26.119107 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:26] : [Step 10/10] I1028 23:09:26.421722 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:26] : [Step 10/10] I1028 23:09:26.724777 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:27] : [Step 10/10] I1028 23:09:27.028252 31435 docker.cpp:972] > Running docker -H 
unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:27] : [Step 10/10] I1028 23:09:27.331799 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:27] : [Step 10/10] I1028 23:09:27.634660 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:27] : [Step 10/10] I1028 23:09:27.938190 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:28] : [Step 10/10] I1028 23:09:28.241756 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:28] : [Step 10/10] I1028
[jira] [Commented] (MESOS-6507) 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails consistently.
[ https://issues.apache.org/jira/browse/MESOS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15617270#comment-15617270 ] Anand Mazumdar commented on MESOS-6507: --- It was an oversight on my part, had forgotten backporting Ben's UUID patches. This should unblock the 1.0.2 release. However, we still need to fix the test flakiness on HEAD. I would update the JIRA with the flaky test log. {noformat} commit 9e0f9505bae40a5f803d9a3eebfebe62287fbe91 Author: Benjamin Mahler Date: Tue Sep 20 14:17:30 2016 -0700 Updated scheduler library to handle UUID parsing error. Previously this would have thrown an exception. Review: https://reviews.apache.org/r/52099 commit 4e4d058ea3c012b2e6d4bbed58ef7fbaea5b60fb Author: Benjamin Mahler Date: Tue Sep 20 14:14:39 2016 -0700 Updated UUID::fromString to not throw an exception on error. The exception from the string_generator needs to be caught so that we can surface a Try to the caller. Review: https://reviews.apache.org/r/52098 {noformat} > 'DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID' fails > consistently. 
> -- > > Key: MESOS-6507 > URL: https://issues.apache.org/jira/browse/MESOS-6507 > Project: Mesos > Issue Type: Bug > Components: docker, test >Reporter: Gilbert Song >Priority: Blocker > Labels: failure > > Here is the log: > {noformat} > [23:09:24] : [Step 10/10] [ RUN ] > DockerContainerizerTest.ROOT_DOCKER_SkipRecoverMalformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.304638 31435 docker.cpp:933] > Running docker -H unix:///var/run/docker.sock rm -f -v mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.398941 31435 resources.cpp:572] > Parsing resources as JSON failed: cpus:1;mem:512 > [23:09:24] : [Step 10/10] Trying semicolon-delimited string format instead > [23:09:24] : [Step 10/10] I1028 23:09:24.399123 31435 docker.cpp:809] > Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory > 536870912 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e > MESOS_CONTAINER_NAME=mesos-s1.malformedUUID -v > /mnt/teamcity/temp/buildTmp/DockerContainerizerTest_ROOT_DOCKER_SkipRecoverMalformedUUID_rjDyqa:/mnt/mesos/sandbox > --net host --entrypoint /bin/sh --name mesos-s1.malformedUUID alpine -c > sleep 1000 > [23:09:24] : [Step 10/10] I1028 23:09:24.401227 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.700460 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:24] : [Step 10/10] I1028 23:09:24.804401 31453 docker.cpp:785] > Recovering Docker containers > [23:09:24] : [Step 10/10] I1028 23:09:24.804477 31453 docker.cpp:1091] > Running docker -H unix:///var/run/docker.sock ps -a > [23:09:24] : [Step 10/10] I1028 23:09:24.905027 31454 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:25] : [Step 10/10] W1028 23:09:25.008965 31454 docker.cpp:838] > Skipping recovery of executor '' of framework '' because its latest run could > not be 
recovered > [23:09:25] : [Step 10/10] I1028 23:09:25.008996 31454 docker.cpp:957] > Checking if Docker container named '/mesos-s1.malformedUUID' was started by > Mesos > [23:09:25] : [Step 10/10] I1028 23:09:25.009019 31454 docker.cpp:967] > Checking if Mesos container with ID 'malformedUUID' has been orphaned > [23:09:25] : [Step 10/10] I1028 23:09:25.009052 31454 docker.cpp:860] > Running docker -H unix:///var/run/docker.sock stop -t 0 > 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747 > [23:09:25] : [Step 10/10] I1028 23:09:25.109345 31451 docker.cpp:933] > Running docker -H unix:///var/run/docker.sock rm -v > 1e9990dbadad6078ceda5d5e0cbfd62b9242c22359126b42dca77d6fdd9a2747 > [23:09:25] : [Step 10/10] I1028 23:09:25.212870 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:25] : [Step 10/10] I1028 23:09:25.513255 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:25] : [Step 10/10] I1028 23:09:25.815946 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:26] : [Step 10/10] I1028 23:09:26.119107 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:26] : [Step 10/10] I1028 23:09:26.421722 31435 docker.cpp:972] > Running docker -H unix:///var/run/docker.sock inspect mesos-s1.malformedUUID > [23:09:26] : [Step 10/10] I1028 23:09:26.724777 31435
[jira] [Updated] (MESOS-6497) Java Scheduler Adapter does not surface MasterInfo.
[ https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6497: -- Sprint: Mesosphere Sprint 46 Story Points: 2 > Java Scheduler Adapter does not surface MasterInfo. > --- > > Key: MESOS-6497 > URL: https://issues.apache.org/jira/browse/MESOS-6497 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere, v1_api > Fix For: 1.1.0 > > > The HTTP adapter does not surface the {{MasterInfo}}. This makes it not > compatible with the V0 API where the {{registered}} and {{reregistered}} > calls provided the MasterInfo to the framework. > cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6500) SlaveRecoveryTest/0.ReconnectHTTPExecutor is flaky
Anand Mazumdar created MESOS-6500: - Summary: SlaveRecoveryTest/0.ReconnectHTTPExecutor is flaky Key: MESOS-6500 URL: https://issues.apache.org/jira/browse/MESOS-6500 Project: Mesos Issue Type: Bug Reporter: Anand Mazumdar Showed up on ReviewBot. Unfortunately, the ReviewBot cleaned up the logs. It seems like we are leaving orphan processes upon the test suite completion that leads to this test failing. {code} ../../src/tests/environment.cpp:825: Failure Failed Tests completed with child processes remaining: -+- 29429 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-tests \-+- 5970 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-containerizer launch --command={"arguments":["mesos-executor","--launcher_dir=\/mesos\/mesos-1.2.0\/_build\/src"],"shell":false,"value":"\/mesos\/mesos-1.2.0\/_build\/src\/mesos-executor"} --environment={"LIBPROCESS_PORT":"0","MESOS_AGENT_ENDPOINT":"172.17.0.2:52560","MESOS_CHECKPOINT":"1","MESOS_DIRECTORY":"\/tmp\/SlaveRecoveryTest_0_ReconnectHTTPExecutor_kyPmzZ\/slaves\/6ace31e5-eac7-41f8-a938-64d648610484-S0\/frameworks\/6ace31e5-eac7-41f8-a938-64d648610484-\/executors\/e4f3e7e4-1acf-46d6-9768-259be617a17a\/runs\/6517ed10-859f-41d1-b5b4-75dc5c0c2a23","MESOS_EXECUTOR_ID":"e4f3e7e4-1acf-46d6-9768-259be617a17a","MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD":"5secs","MESOS_FRAMEWORK_ID":"6ace31e5-eac7-41f8-a938-64d648610484-","M ESOS_HTTP_COMMAND_EXECUTOR":"1","MESOS_RECOVERY_TIMEOUT":"15mins","MESOS_SANDBOX":"\/tmp\/SlaveRecoveryTest_0_ReconnectHTTPExecutor_kyPmzZ\/slaves\/6ace31e5-eac7-41f8-a938-64d648610484-S0\/frameworks\/6ace31e5-eac7-41f8-a938-64d648610484-\/executors\/e4f3e7e4-1acf-46d6-9768-259be617a17a\/runs\/6517ed10-859f-41d1-b5b4-75dc5c0c2a23","MESOS_SLAVE_ID":"6ace31e5-eac7-41f8-a938-64d648610484-S0","MESOS_SLAVE_PID":"agent@172.17.0.2:52560","MESOS_SUBSCRIPTION_BACKOFF_MAX":"2secs","PATH":"\/usr\/local\/sbin:\/usr\/local\/bin:\/usr\/sbin:\/usr\/bin:\/sbin:\/bin"} --help=false --pipe_read=72 --pipe_write=77 --pre_exec_commands=[] 
--runtime_directory=/tmp/SlaveRecoveryTest_0_ReconnectHTTPExecutor_FIHcEr/containers/6517ed10-859f-41d1-b5b4-75dc5c0c2a23 --unshare_namespace_mnt=false --user=mesos --working_directory=/tmp/SlaveRecoveryTest_0_ReconnectHTTPExecutor_ kyPmzZ/slaves/6ace31e5-eac7-41f8-a938-64d648610484-S0/frameworks/6ace31e5-eac7-41f8-a938-64d648610484-/executors/e4f3e7e4-1acf-46d6-9768-259be617a17a/runs/6517ed10-859f-41d1-b5b4-75dc5c0c2a23 \-+- 5984 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-executor --launcher_dir=/mesos/mesos-1.2.0/_build/src \-+- 6015 sh -c sleep 1000 \--- 6029 sleep 1000 [==] 1369 tests from 155 test cases ran. (465777 ms total) [ PASSED ] 1368 tests. [ FAILED ] 1 test, listed below: [ FAILED ] SlaveRecoveryTest/0.ReconnectHTTPExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6500) SlaveRecoveryTest/0.ReconnectHTTPExecutor is flaky
[ https://issues.apache.org/jira/browse/MESOS-6500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616236#comment-15616236 ] Anand Mazumdar commented on MESOS-6500: --- cc: [~jieyu] [~gilbert] > SlaveRecoveryTest/0.ReconnectHTTPExecutor is flaky > -- > > Key: MESOS-6500 > URL: https://issues.apache.org/jira/browse/MESOS-6500 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar > Labels: flaky, flaky-test > > Showed up on ReviewBot. Unfortunately, the ReviewBot cleaned up the logs. > It seems like we are leaving orphan processes upon the test suite completion > that leads to this test failing. > {code} > ../../src/tests/environment.cpp:825: Failure > Failed > Tests completed with child processes remaining: > -+- 29429 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-tests > \-+- 5970 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-containerizer launch > --command={"arguments":["mesos-executor","--launcher_dir=\/mesos\/mesos-1.2.0\/_build\/src"],"shell":false,"value":"\/mesos\/mesos-1.2.0\/_build\/src\/mesos-executor"} > > --environment={"LIBPROCESS_PORT":"0","MESOS_AGENT_ENDPOINT":"172.17.0.2:52560","MESOS_CHECKPOINT":"1","MESOS_DIRECTORY":"\/tmp\/SlaveRecoveryTest_0_ReconnectHTTPExecutor_kyPmzZ\/slaves\/6ace31e5-eac7-41f8-a938-64d648610484-S0\/frameworks\/6ace31e5-eac7-41f8-a938-64d648610484-\/executors\/e4f3e7e4-1acf-46d6-9768-259be617a17a\/runs\/6517ed10-859f-41d1-b5b4-75dc5c0c2a23","MESOS_EXECUTOR_ID":"e4f3e7e4-1acf-46d6-9768-259be617a17a","MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD":"5secs","MESOS_FRAMEWORK_ID":"6ace31e5-eac7-41f8-a938-64d648610484-","M > > 
ESOS_HTTP_COMMAND_EXECUTOR":"1","MESOS_RECOVERY_TIMEOUT":"15mins","MESOS_SANDBOX":"\/tmp\/SlaveRecoveryTest_0_ReconnectHTTPExecutor_kyPmzZ\/slaves\/6ace31e5-eac7-41f8-a938-64d648610484-S0\/frameworks\/6ace31e5-eac7-41f8-a938-64d648610484-\/executors\/e4f3e7e4-1acf-46d6-9768-259be617a17a\/runs\/6517ed10-859f-41d1-b5b4-75dc5c0c2a23","MESOS_SLAVE_ID":"6ace31e5-eac7-41f8-a938-64d648610484-S0","MESOS_SLAVE_PID":"agent@172.17.0.2:52560","MESOS_SUBSCRIPTION_BACKOFF_MAX":"2secs","PATH":"\/usr\/local\/sbin:\/usr\/local\/bin:\/usr\/sbin:\/usr\/bin:\/sbin:\/bin"} > --help=false --pipe_read=72 --pipe_write=77 --pre_exec_commands=[] > --runtime_directory=/tmp/SlaveRecoveryTest_0_ReconnectHTTPExecutor_FIHcEr/containers/6517ed10-859f-41d1-b5b4-75dc5c0c2a23 > --unshare_namespace_mnt=false --user=mesos > --working_directory=/tmp/SlaveRecoveryTest_0_ReconnectHTTPExecutor_ > > kyPmzZ/slaves/6ace31e5-eac7-41f8-a938-64d648610484-S0/frameworks/6ace31e5-eac7-41f8-a938-64d648610484-/executors/e4f3e7e4-1acf-46d6-9768-259be617a17a/runs/6517ed10-859f-41d1-b5b4-75dc5c0c2a23 > >\-+- 5984 /mesos/mesos-1.2.0/_build/src/.libs/lt-mesos-executor > --launcher_dir=/mesos/mesos-1.2.0/_build/src > \-+- 6015 sh -c sleep 1000 >\--- 6029 sleep 1000 > [==] 1369 tests from 155 test cases ran. (465777 ms total) > [ PASSED ] 1368 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] SlaveRecoveryTest/0.ReconnectHTTPExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6497) Java Scheduler Adapter does not surface MasterInfo.
[ https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6497: -- Summary: Java Scheduler Adapter does not surface MasterInfo. (was: HTTP Adapter does not surface MasterInfo.) > Java Scheduler Adapter does not surface MasterInfo. > --- > > Key: MESOS-6497 > URL: https://issues.apache.org/jira/browse/MESOS-6497 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere, v1_api > Fix For: 1.1.0 > > > The HTTP adapter does not surface the {{MasterInfo}}. This makes it not > compatible with the V0 API where the {{registered}} and {{reregistered}} > calls provided the MasterInfo to the framework. > cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'
[ https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15615520#comment-15615520 ] Anand Mazumdar commented on MESOS-6202: --- Nopes, we can close this issue. > Docker containerizer kills containers whose name starts with 'mesos-' > - > > Key: MESOS-6202 > URL: https://issues.apache.org/jira/browse/MESOS-6202 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.1 > Environment: Dockerized > {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}} >Reporter: Marc Villacorta > > I run 3 docker containers in my CoreOS system whose names start with > _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_. > I can start the first two without any problem but when I start the third one > _('mesos-agent')_ all three containers are killed by the docker daemon. > If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and > _'m3s0s-agent'_ everything works. > I tracked down the problem to > [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120] > code which is marked to be removed after deprecation cycle. > I was previously running Mesos 0.28.2 without this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6497) HTTP Adapter does not surface MasterInfo.
[ https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613287#comment-15613287 ] Anand Mazumdar edited comment on MESOS-6497 at 10/27/16 9:30 PM: - We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} event thereby providing the schedulers with this information. Another option was adding it to the {{connected}} callback on the scheduler library but we punted on it because in the future schedulers might want to use their own detection library that might not read contents from Master ZK to populate {{MasterInfo}} correctly. was (Author: anandmazumdar): We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} event thereby providing the schedulers with this information. Another option was adding it to the {{connected}} callback on the scheduler library but we punted on it because in the future schedulers might want to use their own detection library that might not read contents from Master ZK. > HTTP Adapter does not surface MasterInfo. > - > > Key: MESOS-6497 > URL: https://issues.apache.org/jira/browse/MESOS-6497 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere, v1_api > > The HTTP adapter does not surface the {{MasterInfo}}. This makes it not > compatible with the V0 API where the {{registered}} and {{reregistered}} > calls provided the MasterInfo to the framework. > cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6497) HTTP Adapter does not surface MasterInfo.
[ https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613287#comment-15613287 ] Anand Mazumdar commented on MESOS-6497: --- We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} event thereby providing the schedulers with this information. Another option was adding it to the {{connected}} callback on the scheduler library but we punted on it because in the future schedulers might want to use their own detection library that might not read contents from Master ZK. > HTTP Adapter does not surface MasterInfo. > - > > Key: MESOS-6497 > URL: https://issues.apache.org/jira/browse/MESOS-6497 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere, v1_api > > The HTTP adapter does not surface the {{MasterInfo}}. This makes it not > compatible with the V0 API where the {{registered}} and {{reregistered}} > calls provided the MasterInfo to the framework. > cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
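[Editor's note] The design chosen above carries `MasterInfo` as an optional field of the `SUBSCRIBED` event rather than through the `connected()` callback. A consumer therefore has to tolerate the field being absent. The sketch below uses simplified stand-in structs (the real types are protobufs under `include/mesos/v1/scheduler/`; the field and function names here are assumptions for illustration).

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins for the v1 scheduler API types.
struct MasterInfo
{
  std::string hostname;
  int port;
};

struct Subscribed
{
  std::string frameworkId;
  // Per the comment above: MasterInfo rides along as an *optional*
  // field of the SUBSCRIBED event.
  std::optional<MasterInfo> masterInfo;
};

// Sketch of how a scheduler would consume the event: use the master
// info when present, degrade gracefully when it is not.
std::string describeSubscription(const Subscribed& event)
{
  if (event.masterInfo.has_value()) {
    return "Subscribed with framework " + event.frameworkId +
           " to master " + event.masterInfo->hostname + ":" +
           std::to_string(event.masterInfo->port);
  }
  return "Subscribed with framework " + event.frameworkId +
         " (master info not provided)";
}
```

Keeping the field optional is what preserves compatibility: a scheduler using its own master-detection library can simply ignore it.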
[jira] [Updated] (MESOS-6497) HTTP Adapter does not surface MasterInfo.
[ https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6497: -- Shepherd: Vinod Kone Description: The HTTP adapter does not surface the {{MasterInfo}}. This makes it not compatible with the V0 API where the {{registered}} and {{reregistered}} calls provided the MasterInfo to the framework. cc [~vinodkone] was: The HTTP adapter does not surface the MasterInfo. This makes it not compatible with the V0 API where the {{registered}} and {{reregistered}} calls provided the MasterInfo to the framework. cc [~vinodkone] Summary: HTTP Adapter does not surface MasterInfo. (was: HTTP Adapter does not surface MasterInfo) > HTTP Adapter does not surface MasterInfo. > - > > Key: MESOS-6497 > URL: https://issues.apache.org/jira/browse/MESOS-6497 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere, v1_api > > The HTTP adapter does not surface the {{MasterInfo}}. This makes it not > compatible with the V0 API where the {{registered}} and {{reregistered}} > calls provided the MasterInfo to the framework. > cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6212) Validate the name format of mesos-managed docker containers
[ https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612536#comment-15612536 ] Anand Mazumdar commented on MESOS-6212: --- Keeping the JIRA open till I complete the backport to 1.0.2. > Validate the name format of mesos-managed docker containers > --- > > Key: MESOS-6212 > URL: https://issues.apache.org/jira/browse/MESOS-6212 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.1 >Reporter: Marc Villacorta >Assignee: Manuwela Kanade > Fix For: 1.1.0 > > > Validate the name format of mesos-managed docker containers in order to avoid > false positives when looking for orphaned mesos tasks. > Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ > are wrongly terminated when {{--docker_kill_orphans}} is set to true > (default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6212) Validate the name format of mesos-managed docker containers
[ https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6212: -- Target Version/s: 1.0.2, 1.1.0 (was: 1.0.2) Fix Version/s: (was: 1.0.2) 1.1.0 > Validate the name format of mesos-managed docker containers > --- > > Key: MESOS-6212 > URL: https://issues.apache.org/jira/browse/MESOS-6212 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.1 >Reporter: Marc Villacorta >Assignee: Manuwela Kanade > Fix For: 1.1.0 > > > Validate the name format of mesos-managed docker containers in order to avoid > false positives when looking for orphaned mesos tasks. > Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ > are wrongly terminated when {{--docker_kill_orphans}} is set to true > (default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6458) Add test to check fromString function of stout library
[ https://issues.apache.org/jira/browse/MESOS-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6458: -- Target Version/s: (was: 1.0.2) Fix Version/s: 1.1.0 > Add test to check fromString function of stout library > -- > > Key: MESOS-6458 > URL: https://issues.apache.org/jira/browse/MESOS-6458 > Project: Mesos > Issue Type: Improvement > Components: stout >Affects Versions: 1.0.1 >Reporter: Manuwela Kanade >Assignee: Manuwela Kanade >Priority: Trivial > Fix For: 1.1.0 > > > The 3rdparty stout library has a test case for malformed > UUIDs, but no positive test for the fromString function > verifying that it returns the correct UUID when passed a correctly > formatted UUID string. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6212) Validate the name format of mesos-managed docker containers
[ https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6212: -- Shepherd: Timothy Chen (was: Anand Mazumdar) > Validate the name format of mesos-managed docker containers > --- > > Key: MESOS-6212 > URL: https://issues.apache.org/jira/browse/MESOS-6212 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.1 >Reporter: Marc Villacorta >Assignee: Manuwela Kanade > Fix For: 1.0.2 > > > Validate the name format of mesos-managed docker containers in order to avoid > false positives when looking for orphaned mesos tasks. > Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ > are wrongly terminated when {{--docker_kill_orphans}} is set to true > (default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5103) Enhance mesos-health-check to send v1::TaskHealthStatus message
[ https://issues.apache.org/jira/browse/MESOS-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602306#comment-15602306 ] Anand Mazumdar commented on MESOS-5103: --- Looks like we missed closing this issue when we made the command executor unversioned. For more context see: https://github.com/apache/mesos/commit/00709d0dbb71d61242b902d3d324fa2dd5f12adc > Enhance mesos-health-check to send v1::TaskHealthStatus message > > > Key: MESOS-5103 > URL: https://issues.apache.org/jira/browse/MESOS-5103 > Project: Mesos > Issue Type: Bug >Reporter: Qian Zhang >Assignee: haosdent > > The existing {{mesos-health-check}} > (https://github.com/apache/mesos/blob/master/src/health-check/main.cpp) can > only send the unversioned {{TaskHealthStatus}} message. However, with the new > Executor HTTP Library, there will be executors based on the v1 HTTP > executor API, and all the protobuf messages used by them should be v1 as well > (e.g., {{v1::TaskHealthStatus}}). > So we may either modify the existing {{mesos-health-check}} binary to send > {{v1::TaskHealthStatus}} messages in addition to the unversioned ones or > create a new binary for versioned health checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6212) Validate the name format of mesos-managed docker containers
[ https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6212: -- Shepherd: Anand Mazumdar > Validate the name format of mesos-managed docker containers > --- > > Key: MESOS-6212 > URL: https://issues.apache.org/jira/browse/MESOS-6212 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.1 >Reporter: Marc Villacorta >Assignee: Manuwela Kanade > > Validate the name format of mesos-managed docker containers in order to avoid > false positives when looking for orphaned mesos tasks. > Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ > are wrongly terminated when {{--docker_kill_orphans}} is set to true > (default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6212) Validate the name format of mesos-managed docker containers
[ https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597997#comment-15597997 ] Anand Mazumdar commented on MESOS-6212: --- Would be happy to. > Validate the name format of mesos-managed docker containers > --- > > Key: MESOS-6212 > URL: https://issues.apache.org/jira/browse/MESOS-6212 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.1 >Reporter: Marc Villacorta >Assignee: Manuwela Kanade > > Validate the name format of mesos-managed docker containers in order to avoid > false positives when looking for orphaned mesos tasks. > Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ > are wrongly terminated when {{--docker_kill_orphans}} is set to true > (default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6407) Move DEFAULT_v1_xxx macros to the v1 namespace.
Anand Mazumdar created MESOS-6407: - Summary: Move DEFAULT_v1_xxx macros to the v1 namespace. Key: MESOS-6407 URL: https://issues.apache.org/jira/browse/MESOS-6407 Project: Mesos Issue Type: Improvement Reporter: Anand Mazumdar Assignee: Joris Van Remoortere We should clean up the existing {{DEFAULT_v1_*}} macros and bring them under the {{v1}} namespace, e.g., {{v1::DEFAULT_FRAMEWORK_INFO}}. This is necessary for a larger cleanup: we would like to introduce {{createXXX}} helpers for the {{v1}} API rather than eventually adding {{createV1XXX}} functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6405) Benchmark call ingestion path on the Mesos master.
[ https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar reassigned MESOS-6405: - Assignee: Anand Mazumdar > Benchmark call ingestion path on the Mesos master. > -- > > Key: MESOS-6405 > URL: https://issues.apache.org/jira/browse/MESOS-6405 > Project: Mesos > Issue Type: Improvement >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Critical > Labels: mesosphere > > [~drexin] reported on the user mailing > [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E] > that there seems to be a significant regression in performance on the call > ingestion path on the Mesos master relative to the scheduler driver (v0 API). > We should create a benchmark to first get a sense of the numbers and then go > about fixing the performance issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6405) Benchmark call ingestion path on the Mesos master.
[ https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6405: -- Shepherd: Vinod Kone Sprint: Mesosphere Sprint 45 Story Points: 3 Labels: mesosphere (was: ) > Benchmark call ingestion path on the Mesos master. > -- > > Key: MESOS-6405 > URL: https://issues.apache.org/jira/browse/MESOS-6405 > Project: Mesos > Issue Type: Improvement >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Critical > Labels: mesosphere > > [~drexin] reported on the user mailing > [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E] > that there seems to be a significant regression in performance on the call > ingestion path on the Mesos master relative to the scheduler driver (v0 API). > We should create a benchmark to first get a sense of the numbers and then go > about fixing the performance issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6405) Benchmark call ingestion path on the Mesos master.
[ https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6405: -- Priority: Critical (was: Major) > Benchmark call ingestion path on the Mesos master. > -- > > Key: MESOS-6405 > URL: https://issues.apache.org/jira/browse/MESOS-6405 > Project: Mesos > Issue Type: Improvement >Reporter: Anand Mazumdar >Priority: Critical > > [~drexin] reported on the user mailing > [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E] > that there seems to be a significant regression in performance on the call > ingestion path on the Mesos master relative to the scheduler driver (v0 API). > We should create a benchmark to first get a sense of the numbers and then go > about fixing the performance issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6405) Benchmark call ingestion path on the Mesos master.
Anand Mazumdar created MESOS-6405: - Summary: Benchmark call ingestion path on the Mesos master. Key: MESOS-6405 URL: https://issues.apache.org/jira/browse/MESOS-6405 Project: Mesos Issue Type: Improvement Reporter: Anand Mazumdar [~drexin] reported on the user mailing [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E] that there seems to be a significant regression in performance on the call ingestion path on the Mesos master relative to the scheduler driver (v0 API). We should create a benchmark to first get a sense of the numbers and then go about fixing the performance issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5222) Add benchmark for writing events on the persistent connection.
[ https://issues.apache.org/jira/browse/MESOS-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5222: -- Description: It would be good to add a benchmark for testing writing events on the persistent connection for HTTP frameworks wrt driver based frameworks. The benchmark can be as simple as trying to stream generated reconciliation status update events on the persistent connection between the master and the scheduler. (was: It would be good to add a benchmark for scale testing the HTTP frameworks wrt driver based frameworks. The benchmark can be as simple as trying to launch N tasks (parameterized) with the old/new API. We can then focus on fixing performance issues that we find as a result of this exercise.) > Add benchmark for writing events on the persistent connection. > -- > > Key: MESOS-5222 > URL: https://issues.apache.org/jira/browse/MESOS-5222 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: mesosphere > Fix For: 1.0.0 > > > It would be good to add a benchmark for testing writing events on the > persistent connection for HTTP frameworks wrt driver based frameworks. The > benchmark can be as simple as trying to stream generated reconciliation > status update events on the persistent connection between the master and the > scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5222) Add benchmark for writing events on the persistent connection.
[ https://issues.apache.org/jira/browse/MESOS-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5222: -- Summary: Add benchmark for writing events on the persistent connection. (was: Create a benchmark for scale testing HTTP frameworks) > Add benchmark for writing events on the persistent connection. > -- > > Key: MESOS-5222 > URL: https://issues.apache.org/jira/browse/MESOS-5222 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: mesosphere > Fix For: 1.0.0 > > > It would be good to add a benchmark for scale testing the HTTP frameworks wrt > driver based frameworks. The benchmark can be as simple as trying to launch N > tasks (parameterized) with the old/new API. We can then focus on fixing > performance issues that we find as a result of this exercise. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6373) Add the ability to not accept any connections from a client.
Anand Mazumdar created MESOS-6373: - Summary: Add the ability to not accept any connections from a client. Key: MESOS-6373 URL: https://issues.apache.org/jira/browse/MESOS-6373 Project: Mesos Issue Type: Improvement Reporter: Anand Mazumdar Similar to the old {{DROP_MESSAGES}} abstraction allowing us to drop all messages from a client, we need a way to drop all incoming connection requests from a client. When running our tests, we initialize the libprocess instance once. When we notice a disconnection via the {{Connection}} abstraction, we immediately reconnect back due to being able to initiate a connection with the running libprocess instance. Use Cases: - Upon using the mock executor instance, we are never able to test if the recovery timeout expired due to the executor being able to reconnect with the agent immediately upon a disconnection since the libprocess instance running the tests is always alive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6370) The executor library does not invoke the shutdown callback upon recovery timeout.
Anand Mazumdar created MESOS-6370: - Summary: The executor library does not invoke the shutdown callback upon recovery timeout. Key: MESOS-6370 URL: https://issues.apache.org/jira/browse/MESOS-6370 Project: Mesos Issue Type: Bug Reporter: Anand Mazumdar Assignee: Anand Mazumdar The executor library does not invoke the {{shutdown}} callback for checkpointed frameworks upon recovery timeout before committing suicide. This is inconsistent with the executor driver that invokes shutdown upon recovery timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6363) Default executor should not crash with a failed assertion if it notices a disconnection from the agent for non checkpointed frameworks.
[ https://issues.apache.org/jira/browse/MESOS-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6363: -- Target Version/s: 1.1.0 > Default executor should not crash with a failed assertion if it notices a > disconnection from the agent for non checkpointed frameworks. > --- > > Key: MESOS-6363 > URL: https://issues.apache.org/jira/browse/MESOS-6363 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: mesosphere > > If the executor library detects a disconnection for non-checkpointed > frameworks, it injects a {{SHUTDOWN}} event. For checkpointed frameworks, it > injects the {{SHUTDOWN}} event post the recovery timeout. In both these > cases, the default executor would die with a failed assertion in the > {{shutdown()}} handler: > {code} > CHECK_EQ(SUBSCRIBED, state); > {code} > The executor should commit suicide in both these cases with a successful > status code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6368) HTTP API v1 testing abstraction improvements
Anand Mazumdar created MESOS-6368: - Summary: HTTP API v1 testing abstraction improvements Key: MESOS-6368 URL: https://issues.apache.org/jira/browse/MESOS-6368 Project: Mesos Issue Type: Epic Reporter: Anand Mazumdar This epic covers all improvements needed to the existing testing infrastructure for the v1 HTTP APIs (Scheduler/Executor/Operator). Some of the existing testing libprocess primitives for old driver-based schedulers/executors cannot be used for the new APIs since the communication does not happen via traditional libprocess message passing. Also, there might be some test helpers that exist for the old (v0) protobufs that we should consider introducing for the v1 {{Call/Event}} protobufs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6363) Default executor should not crash with a failed assertion if it notices a disconnection from the agent for non checkpointed frameworks.
Anand Mazumdar created MESOS-6363: - Summary: Default executor should not crash with a failed assertion if it notices a disconnection from the agent for non checkpointed frameworks. Key: MESOS-6363 URL: https://issues.apache.org/jira/browse/MESOS-6363 Project: Mesos Issue Type: Bug Reporter: Anand Mazumdar Assignee: Anand Mazumdar If the executor library detects a disconnection for non-checkpointed frameworks, it injects a {{SHUTDOWN}} event. For checkpointed frameworks, it injects the {{SHUTDOWN}} event post the recovery timeout. In both these cases, the default executor would die with a failed assertion in the {{shutdown()}} handler: {code} CHECK_EQ(SUBSCRIBED, state); {code} The executor should commit suicide in both these cases with a successful status code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`
[ https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6177: -- Target Version/s: (was: 1.1.0) > Return unregistered agents recovered from registrar in `GetAgents` and/or > `/state.json` > --- > > Key: MESOS-6177 > URL: https://issues.apache.org/jira/browse/MESOS-6177 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Zhitao Li >Assignee: Zhitao Li > > Use case: > This can be used for any software which talks to Mesos master to better > understand state of an unregistered agent after a master failover. > If this information is available, the use case in MESOS-6174 can be handled > with a simpler decision of whether the corresponding agent is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6356) ASF CI has interleaved logging.
[ https://issues.apache.org/jira/browse/MESOS-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564509#comment-15564509 ] Anand Mazumdar commented on MESOS-6356: --- We need to find a way to (un-)interleave them across *different* test executions. Also, how did this use to work before? I had never seen this until recently on the ASF CI. > ASF CI has interleaved logging. > --- > > Key: MESOS-6356 > URL: https://issues.apache.org/jira/browse/MESOS-6356 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Priority: Critical > Labels: flaky, flaky-test, mesosphere, test > Attachments: consoleText.zip > > > It seems that the build output for test runs from ASF CI has interleaved > logging, making it very hard to debug test failures. This looks to have > started happening after the unified cgroups isolator patches went in > but we are yet to find a correlation. > An example ASF CI run with interleaved logs: > https://builds.apache.org/job/Mesos/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-6)/2762/changes > (Also attached this to the ticket) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6356) ASF CI has interleaved logging.
[ https://issues.apache.org/jira/browse/MESOS-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6356: -- Attachment: consoleText.zip > ASF CI has interleaved logging. > --- > > Key: MESOS-6356 > URL: https://issues.apache.org/jira/browse/MESOS-6356 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Priority: Critical > Labels: flaky, flaky-test, mesosphere, test > Attachments: consoleText.zip > > > It seems that the build output for test runs from ASF CI has interleaved > logging, making it very hard to debug test failures. This looks to have > started happening after the unified cgroups isolator patches went in > but we are yet to find a correlation. > An example ASF CI run with interleaved logs: > https://builds.apache.org/job/Mesos/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-6)/2762/changes > (Also attached this to the ticket) -- This message was sent by Atlassian JIRA (v6.3.4#6332)