[jira] [Updated] (MESOS-7503) Consider improving the WebUI failed to connect dialog.

2017-05-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7503:
--
Description: 
Usually, when the Mesos master is behind a reverse proxy/LB, the keepalive 
timeout is set to a small value, e.g., 60 seconds for nginx. This causes the 
persistent connection between the browser and the Mesos master to break, 
triggering the connection lost dialog (see attached screenshot). This is very 
inconvenient when debugging using the Web UI.

We should consider making the error dialog less intrusive, e.g., updating a 
page element to signify that a reconnection is in progress, similar to what 
other online services like Gmail do.

  was:
Usually, when the Mesos master is behind a reverse proxy/LB, the keepalive 
timeout is set to a small value, e.g., 60 seconds for nginx. This causes the 
persistent connection between the browser and the Mesos master to break, 
triggering the connection lost dialog. This is very inconvenient when 
debugging using the Web UI.

We should consider making the error dialog less intrusive, e.g., updating a 
page element to signify that a reconnection is in progress, similar to what 
other online services like Gmail do.

I am attaching a screenshot of the error message.


> Consider improving the WebUI failed to connect dialog.
> --
>
> Key: MESOS-7503
> URL: https://issues.apache.org/jira/browse/MESOS-7503
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Affects Versions: 1.2.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere, webui
> Attachments: Capture d’écran 2017-05-12 à 15.06.07.png
>
>
> Usually, when the Mesos master is behind a reverse proxy/LB, the keepalive 
> timeout is set to a small value, e.g., 60 seconds for nginx. This causes the 
> persistent connection between the browser and the Mesos master to break, 
> triggering the connection lost dialog (see attached screenshot). This is very 
> inconvenient when debugging using the Web UI.
> We should consider making the error dialog less intrusive, e.g., updating a 
> page element to signify that a reconnection is in progress, similar to what 
> other online services like Gmail do.





[jira] [Updated] (MESOS-7503) Consider improving the WebUI failed to connect dialog.

2017-05-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7503:
--
Attachment: Capture d’écran 2017-05-12 à 15.06.07.png

> Consider improving the WebUI failed to connect dialog.
> --
>
> Key: MESOS-7503
> URL: https://issues.apache.org/jira/browse/MESOS-7503
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Affects Versions: 1.2.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere, webui
> Attachments: Capture d’écran 2017-05-12 à 15.06.07.png
>
>
> Usually, when the Mesos master is behind a reverse proxy/LB, the keepalive 
> timeout is set to a small value, e.g., 60 seconds for nginx. This causes the 
> persistent connection between the browser and the Mesos master to break, 
> triggering the connection lost dialog. This is very inconvenient when 
> debugging using the Web UI.
> We should consider making the error dialog less intrusive, e.g., updating a 
> page element to signify that a reconnection is in progress, similar to what 
> other online services like Gmail do.
> I am attaching a screenshot of the error message.





[jira] [Updated] (MESOS-7503) Consider improving the WebUI failed to connect dialog.

2017-05-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7503:
--
 Labels: mesosphere webui  (was: mesosphere)
Component/s: webui

> Consider improving the WebUI failed to connect dialog.
> --
>
> Key: MESOS-7503
> URL: https://issues.apache.org/jira/browse/MESOS-7503
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Affects Versions: 1.2.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere, webui
>
> Usually, when the Mesos master is behind a reverse proxy/LB, the keepalive 
> timeout is set to a small value, e.g., 60 seconds for nginx. This causes the 
> persistent connection between the browser and the Mesos master to break, 
> triggering the connection lost dialog. This is very inconvenient when 
> debugging using the Web UI.
> We should consider making the error dialog less intrusive, e.g., updating a 
> page element to signify that a reconnection is in progress, similar to what 
> other online services like Gmail do.
> I am attaching a screenshot of the error message.





[jira] [Created] (MESOS-7503) Consider improving the WebUI failed to connect dialog.

2017-05-12 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7503:
-

 Summary: Consider improving the WebUI failed to connect dialog.
 Key: MESOS-7503
 URL: https://issues.apache.org/jira/browse/MESOS-7503
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.2.0
Reporter: Anand Mazumdar


Usually, when the Mesos master is behind a reverse proxy/LB, the keepalive 
timeout is set to a small value, e.g., 60 seconds for nginx. This causes the 
persistent connection between the browser and the Mesos master to break, 
triggering the connection lost dialog. This is very inconvenient when 
debugging using the Web UI.

We should consider making the error dialog less intrusive, e.g., updating a 
page element to signify that a reconnection is in progress, similar to what 
other online services like Gmail do.

I am attaching a screenshot of the error message.





[jira] [Assigned] (MESOS-7502) Build error on Windows when using "int" for a file descriptor

2017-05-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7502:
--

Shepherd: Vinod Kone  (was: Andrew Schwartzmeyer)
Assignee: Alexander Rukletsov
  Sprint: Mesosphere Sprint 57
Story Points: 1
  Labels: health-check mesosphere windows  (was: windows)

> Build error on Windows when using "int" for a file descriptor
> -
>
> Key: MESOS-7502
> URL: https://issues.apache.org/jira/browse/MESOS-7502
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: John Kordich
>Assignee: Alexander Rukletsov
>  Labels: health-check, mesosphere, windows
>
> There is a build error for mesos-tests in src/tests/check_tests.cpp on 
> Windows associated with the use of an "int" file descriptor:
> C:\mesos\mesos\src\tests\check_tests.cpp(1890): error C2440: 'initializing': 
> cannot convert from 'Try<std::array<int_fd,2>,Error>' to 
> 'Try<std::array<int,2>,Error>' 
> [C:\mesos\mesos\build\src\tests\mesos-tests.vcxproj]
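For illustration, a minimal sketch of the mismatch, assuming the usual 
stout-style {{Try}} semantics (all type names below are simplified stand-ins, 
not the real Mesos/stout definitions):

{code}
#include <array>

// Stand-in for the Windows file descriptor wrapper: a distinct type,
// so templates instantiated over `int` and `int_fd` are unrelated.
struct int_fd { int value; };

// Stand-in for stout's Try.
template <typename T>
struct Try { T t; };

// Hypothetical helper returning an FD pair, as a pipe call would.
Try<std::array<int_fd, 2>> makePipes() { return {}; }

int main()
{
  // Declaring the result with plain `int` trips C2440 on Windows:
  //   Try<std::array<int, 2>> pipes = makePipes();
  Try<std::array<int_fd, 2>> pipes = makePipes();  // Matches the FD type.
  (void)pipes;
  return 0;
}
{code}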





[jira] [Updated] (MESOS-7416) Filter results of `/master/slaves` and the v1 call GET_AGENTS

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7416:
--
Sprint: Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 
56)

> Filter results of `/master/slaves` and the v1 call GET_AGENTS
> -
>
> Key: MESOS-7416
> URL: https://issues.apache.org/jira/browse/MESOS-7416
> Project: Mesos
>  Issue Type: Task
>  Components: HTTP API, master
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>  Labels: mesosphere, security
>
> Both the {{/master/slaves}} endpoint and the v1 {{GET_AGENTS}} call return 
> full information about the agent state, which probably needs to be filtered 
> for certain uses, particularly in a multi-tenancy scenario.
> The leaked data includes specific role names and their allocations.
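A minimal sketch of per-object response filtering, assuming an approver-style 
callback similar in spirit to Mesos' {{ObjectApprover}} (all types below are 
simplified stand-ins):

{code}
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Simplified stand-ins for the protobuf/authorizer types.
struct AgentInfo { std::string role; };

struct ObjectApprover
{
  std::string principalRole;  // Role the requesting principal may see.

  // Per-object decision: approve agents visible to this principal.
  bool approved(const AgentInfo& agent) const
  {
    return agent.role == principalRole || agent.role == "*";
  }
};

// Filter the response instead of returning full agent state.
std::vector<AgentInfo> filterAgents(
    const std::vector<AgentInfo>& agents,
    const ObjectApprover& approver)
{
  std::vector<AgentInfo> result;
  std::copy_if(
      agents.begin(), agents.end(), std::back_inserter(result),
      [&](const AgentInfo& agent) { return approver.approved(agent); });
  return result;
}
{code}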





[jira] [Updated] (MESOS-7469) Add local resource provider driver.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7469:
--
Sprint: Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 
56)

> Add local resource provider driver.
> ---
>
> Key: MESOS-7469
> URL: https://issues.apache.org/jira/browse/MESOS-7469
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> Similar to the scheduler/executor drivers, the resource provider driver will 
> be used to connect the resource provider and the master. For local resource 
> providers, the driver will use the agent as a proxy instead of initiating a 
> direct connection to the master. There are multiple reasons to do that:
> * Reduce the load on the master, because there will be fewer connections.
> * Make it easier to control the life cycle of a local resource provider. For 
> instance, it is straightforward to force the subscription of the local 
> resource providers to happen _after_ the agent registration.





[jira] [Updated] (MESOS-7193) Use of `GTEST_IS_THREADSAFE` in asserts is problematic.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7193:
--
Sprint: Mesosphere Sprint 54, Mesosphere Sprint 55, Mesosphere Sprint 56, 
Mesosphere Sprint 57  (was: Mesosphere Sprint 54, Mesosphere Sprint 55, 
Mesosphere Sprint 56)

> Use of `GTEST_IS_THREADSAFE` in asserts is problematic.
> ---
>
> Key: MESOS-7193
> URL: https://issues.apache.org/jira/browse/MESOS-7193
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Some test cases in libprocess use {{ASSERT_TRUE(GTEST_IS_THREADSAFE)}}. This 
> is a misuse of that define, [the documentation in GTest 
> says|https://github.com/google/googletest/blob/master/googletest/include/gtest/internal/gtest-port.h#L155-L163]:
> {noformat}
> Macros indicating which Google Test features are available (a macro
> is defined to 1 if the corresponding feature is supported;
> otherwise UNDEFINED -- it's never defined to 0.).  Google Test
> defines these macros automatically.  Code outside Google Test MUST
> NOT define them.
> {noformat}
> Currently, using {{GTEST_IS_THREADSAFE}} in the assert works, because it is 
> defined to {{1}}. But newer upstream versions of GTest use a more complicated 
> definition that can end up undefined, causing compilation errors.
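A short sketch of the misuse and one possible replacement (the {{#error}} 
guard is one option, not necessarily the fix this ticket lands on):

{code}
#include <gtest/gtest.h>

// Misuse: treats the feature macro as a runtime boolean. If a newer
// GTest leaves it undefined, this no longer compiles.
//   ASSERT_TRUE(GTEST_IS_THREADSAFE);

// Possible replacement: check definedness at preprocessing time, since
// the macro is defined to 1 when supported and undefined otherwise.
#if !defined(GTEST_IS_THREADSAFE)
#error "These tests require a thread-safe Google Test."
#endif
{code}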





[jira] [Updated] (MESOS-7417) Design doc for file-based secrets.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7417:
--
Sprint: Mesosphere Sprint 55, Mesosphere Sprint 56, Mesosphere Sprint 57  
(was: Mesosphere Sprint 55, Mesosphere Sprint 56)

> Design doc for file-based secrets.
> --
>
> Key: MESOS-7417
> URL: https://issues.apache.org/jira/browse/MESOS-7417
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization, master, modules
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>






[jira] [Updated] (MESOS-7347) Prototype resource offer operation handling in the master

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7347:
--
Sprint: Mesosphere Sprint 55, Mesosphere Sprint 56, Mesosphere Sprint 57  
(was: Mesosphere Sprint 55, Mesosphere Sprint 56)

> Prototype resource offer operation handling in the master
> -
>
> Key: MESOS-7347
> URL: https://issues.apache.org/jira/browse/MESOS-7347
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Prototype the following workflow in the master, in accordance with the 
> resource provider design:
> * Handle accept calls including resource provider related offer operations 
> ({{CREATE_VOLUME}}, ...)
> * Implement internal bookkeeping of the disk resources these operations will 
> be applied on
> * Implement resource bookkeeping for resource providers in the master
> * Send resource provider operations to resource providers





[jira] [Updated] (MESOS-7092) Health checker duplicates a lot of checker's functionality.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7092:
--
Sprint: Mesosphere Sprint 54, Mesosphere Sprint 55, Mesosphere Sprint 56, 
Mesosphere Sprint 57  (was: Mesosphere Sprint 54, Mesosphere Sprint 55, 
Mesosphere Sprint 56)

> Health checker duplicates a lot of checker's functionality.
> ---
>
> Key: MESOS-7092
> URL: https://issues.apache.org/jira/browse/MESOS-7092
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.2.0
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: health-check, mesosphere
>
> With the introduction of a general check (MESOS-6906), health checker should 
> leverage a general check plus add interpretation on top. This will avoid code 
> duplication and increase maintainability.





[jira] [Updated] (MESOS-7388) Update allocator interfaces to support resource providers

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7388:
--
Sprint: Mesosphere Sprint 54, Mesosphere Sprint 55, Mesosphere Sprint 56, 
Mesosphere Sprint 57  (was: Mesosphere Sprint 54, Mesosphere Sprint 55, 
Mesosphere Sprint 56)

> Update allocator interfaces to support resource providers
> -
>
> Key: MESOS-7388
> URL: https://issues.apache.org/jira/browse/MESOS-7388
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>






[jira] [Updated] (MESOS-7444) Add support for storing gone agents to the master registry.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7444:
--
Sprint: Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 
56)

> Add support for storing gone agents to the master registry.
> ---
>
> Key: MESOS-7444
> URL: https://issues.apache.org/jira/browse/MESOS-7444
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> We need to add the {{MarkSlaveGone}} registry operation to the master, 
> allowing it to store agents that have been marked as gone. The relevant 
> changes to {{registry.proto}} would also be done as part of this issue.
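A minimal sketch of what such a registry operation could do, mirroring 
existing operations like {{MarkSlaveUnreachable}} (the registry layout and the 
operation interface are simplified stand-ins, not the real protobuf state):

{code}
#include <set>
#include <string>

// Simplified stand-in for the state kept in registry.proto.
struct Registry
{
  std::set<std::string> admitted;  // Registered agent IDs.
  std::set<std::string> gone;      // Agents marked gone (to be added).
};

// Sketch of the proposed operation: move an agent from the admitted
// set to the gone set. Registry operations report whether they mutated
// the registry, which drives persistence.
bool markSlaveGone(Registry* registry, const std::string& slaveId)
{
  const bool mutated = registry->admitted.erase(slaveId) > 0;
  registry->gone.insert(slaveId);
  return mutated;
}
{code}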





[jira] [Updated] (MESOS-7449) Refactor containerizers to not depend on TaskInfo or ExecutorInfo

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7449:
--
Sprint: Mesosphere Sprint 55, Mesosphere Sprint 56, Mesosphere Sprint 57  
(was: Mesosphere Sprint 55, Mesosphere Sprint 56)

> Refactor containerizers to not depend on TaskInfo or ExecutorInfo
> -
>
> Key: MESOS-7449
> URL: https://issues.apache.org/jira/browse/MESOS-7449
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> The Containerizer interfaces should be refactored so that they do not depend 
> on {{TaskInfo}} or {{ExecutorInfo}}, as a standalone container will have 
> neither.
> Currently, the {{launch}} interface depends on those fields.  Instead, we 
> should consistently use {{ContainerInfo}} and {{CommandInfo}} in 
> Containerizer and isolators.
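A hedged sketch of the proposed interface change (signatures heavily 
simplified; the real {{launch}} takes more parameters, and all types here are 
stand-ins for the protobuf messages):

{code}
// Simplified stand-ins for the protobuf types.
struct ContainerID {};
struct TaskInfo {};
struct ExecutorInfo {};
struct ContainerInfo {};
struct CommandInfo {};

class Containerizer
{
public:
  virtual ~Containerizer() = default;

  // Before: launch depends on task/executor concepts, which a
  // standalone container does not have.
  virtual bool launch(
      const ContainerID& id,
      const TaskInfo& task,
      const ExecutorInfo& executor) = 0;

  // After (proposed): only the container description and the command.
  virtual bool launch(
      const ContainerID& id,
      const ContainerInfo& container,
      const CommandInfo& command) = 0;
};
{code}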





[jira] [Updated] (MESOS-7304) Fetcher should not depend on SlaveID.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7304:
--
Sprint: Mesosphere Sprint 55, Mesosphere Sprint 56, Mesosphere Sprint 57  
(was: Mesosphere Sprint 55, Mesosphere Sprint 56)

> Fetcher should not depend on SlaveID.
> -
>
> Key: MESOS-7304
> URL: https://issues.apache.org/jira/browse/MESOS-7304
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, fetcher
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Currently, various Fetcher interfaces depend on SlaveID, which is an 
> unnecessary coupling. For instance:
> {code}
> Try<Nothing> Fetcher::recover(const SlaveID& slaveId, const Flags& flags);
> Future<Nothing> Fetcher::fetch(
> const ContainerID& containerId,
> const CommandInfo& commandInfo,
> const string& sandboxDirectory,
> const Option<string>& user,
> const SlaveID& slaveId,
> const Flags& flags);
> {code}
> It looks like the only reason we need a SlaveID is to calculate the fetcher 
> cache directory. We should calculate the fetcher cache directory in the 
> caller and pass that directory to the Fetcher.
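A small sketch of the proposed decoupling; the directory layout and helper 
below are purely hypothetical (the real code would derive the path from its 
flags and {{SlaveID}} wherever the caller lives):

{code}
#include <string>

// Simplified stand-in for the agent flags.
struct Flags { std::string fetcher_cache_dir; };

// The caller derives the per-agent cache directory from the SlaveID
// once; the Fetcher then only ever sees a plain directory path, e.g.:
//   fetcher->recover(cacheDirectory, flags);
std::string fetcherCacheDirectory(
    const Flags& flags, const std::string& slaveId)
{
  return flags.fetcher_cache_dir + "/" + slaveId;  // Layout assumed.
}
{code}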





[jira] [Updated] (MESOS-7415) Add authorization to master's operator maintenance API in v0 and v1

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7415:
--
Sprint: Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 
56)

> Add authorization to master's operator maintenance API in v0 and v1
> ---
>
> Key: MESOS-7415
> URL: https://issues.apache.org/jira/browse/MESOS-7415
> Project: Mesos
>  Issue Type: Task
>  Components: c++ api, HTTP API, master
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>  Labels: authorization, mesosphere, security
>
> None of the maintenance primitives in either API v0 or API v1 has any kind 
> of authorization, which allows any user with valid credentials to do things 
> such as shutting down a machine, scheduling downtime on an agent, or 
> modifying the maintenance schedule.
> The authorization support needs to be added to the v0 endpoints:
> * {{/master/machine/up}}
> * {{/master/machine/down}}
> * {{/master/maintenance/schedule}}
> * {{/master/maintenance/status}}
> as well as to the v1 calls (a sketch of a handler-side gate follows the 
> list):
> * {{GET_MAINTENANCE_STATUS}}
> * {{GET_MAINTENANCE_SCHEDULE}}
> * {{UPDATE_MAINTENANCE_SCHEDULE}}
> * {{START_MAINTENANCE}}
> * {{STOP_MAINTENANCE}}
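A minimal sketch of the handler-side gate, assuming an authorizer consulted 
per call (types and the action name are simplified stand-ins for Mesos' 
authorizer interface):

{code}
#include <iostream>
#include <set>
#include <string>

// Simplified stand-in for the authorizer each handler consults.
struct Authorizer
{
  std::set<std::string> operators;  // Principals allowed maintenance.

  bool authorized(
      const std::string& principal, const std::string& action) const
  {
    // Real Mesos dispatches on typed authorization actions; in this
    // toy, every maintenance `action` is gated on operator membership.
    return operators.count(principal) > 0;
  }
};

// Sketch of a gated handler: reject before mutating any schedule.
bool startMaintenance(
    const Authorizer& authorizer, const std::string& principal)
{
  if (!authorizer.authorized(principal, "START_MAINTENANCE")) {
    std::cout << "403 Forbidden for '" << principal << "'\n";
    return false;
  }
  // ... apply the maintenance schedule ...
  return true;
}
{code}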





[jira] [Updated] (MESOS-7088) Support private registry credential per container.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7088:
--
Sprint: Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 
56)

> Support private registry credential per container.
> --
>
> Key: MESOS-7088
> URL: https://issues.apache.org/jira/browse/MESOS-7088
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: containerizer
>






[jira] [Updated] (MESOS-7312) Update Resource proto for storage resource providers.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7312:
--
Sprint: Mesosphere Sprint 53, Mesosphere Sprint 54, Mesosphere Sprint 55, 
Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 53, 
Mesosphere Sprint 54, Mesosphere Sprint 55, Mesosphere Sprint 56)

> Update Resource proto for storage resource providers.
> -
>
> Key: MESOS-7312
> URL: https://issues.apache.org/jira/browse/MESOS-7312
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>
> Storage resource provider support requires a number of changes to the 
> {{Resource}} proto (a consumer-side sketch follows the list):
> * support for {{RAW}} and {{BLOCK}} type {{Resource::DiskInfo::Source}}
> * {{ResourceProviderID}} in Resource
> * {{Resource::DiskInfo::Source::Path}} should be {{optional}}.
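For a flavor of what the new source types enable, a hedged consumer-side 
sketch; the {{RAW}}/{{BLOCK}} enum values are taken from the ticket, and this 
would only compile against a Mesos build that already has the proposed proto 
changes:

{code}
#include <mesos/mesos.hpp>

// Does this disk resource still need conversion before use? RAW and
// BLOCK are the proposed unconverted source types.
bool isUnconverted(const mesos::Resource& resource)
{
  if (!resource.has_disk() || !resource.disk().has_source()) {
    return false;
  }

  const mesos::Resource::DiskInfo::Source& source =
    resource.disk().source();

  return source.type() == mesos::Resource::DiskInfo::Source::RAW ||
         source.type() == mesos::Resource::DiskInfo::Source::BLOCK;
}
{code}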





[jira] [Updated] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7500:
--
Sprint: Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 
56)

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via the agent are flaky on Apache CI. Here 
> is an example from one of the failed runs: https://pastebin.com/g2mPgYzu





[jira] [Updated] (MESOS-7349) Document Mesos "check" feature.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7349:
--
Sprint: Mesosphere Sprint 54, Mesosphere Sprint 55, Mesosphere Sprint 56, 
Mesosphere Sprint 57  (was: Mesosphere Sprint 54, Mesosphere Sprint 55, 
Mesosphere Sprint 56)

> Document Mesos "check" feature.
> ---
>
> Key: MESOS-7349
> URL: https://issues.apache.org/jira/browse/MESOS-7349
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: documentaion, mesosphere
>
> This should include recommendations for framework authors about how and when 
> to use general checks, as well as a comparison with health checks.





[jira] [Updated] (MESOS-7414) Enable authorization for master's logging API calls: GET_LOGGING_LEVEL and SET_LOGGING_LEVEL

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7414:
--
Sprint: Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 
56)

> Enable authorization for master's logging API calls: GET_LOGGING_LEVEL  and 
> SET_LOGGING_LEVEL
> -
>
> Key: MESOS-7414
> URL: https://issues.apache.org/jira/browse/MESOS-7414
> Project: Mesos
>  Issue Type: Task
>  Components: HTTP API, master
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>  Labels: mesosphere, operator, security
>
> The Operator API calls {{GET_LOGGING_LEVEL}} and {{SET_LOGGING_LEVEL}} lack 
> authorization, so any recognized user can change the logging level of a 
> given master.
> The v0 endpoint {{/logging/toggle}} has authorization through the 
> {{GET_ENDPOINT_WITH_PATH}} action. We need to decide whether it should also 
> use additional authorization.
> Note that authorization actions for these calls are already defined, since 
> they were already implemented in the agent.





[jira] [Updated] (MESOS-7327) Add a test with multiple tasks and checks for the default executor.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7327:
--
Sprint: Mesosphere Sprint 54, Mesosphere Sprint 55, Mesosphere Sprint 56, 
Mesosphere Sprint 57  (was: Mesosphere Sprint 54, Mesosphere Sprint 55, 
Mesosphere Sprint 56)

> Add a test with multiple tasks and checks for the default executor.
> ---
>
> Key: MESOS-7327
> URL: https://issues.apache.org/jira/browse/MESOS-7327
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: health-check, mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7364) Upgrade vendored GMock / GTest

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7364:
--
Sprint: Mesosphere Sprint 54, Mesosphere Sprint 55, Mesosphere Sprint 56, 
Mesosphere Sprint 57  (was: Mesosphere Sprint 54, Mesosphere Sprint 55, 
Mesosphere Sprint 56)

> Upgrade vendored GMock / GTest
> --
>
> Key: MESOS-7364
> URL: https://issues.apache.org/jira/browse/MESOS-7364
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Reporter: Neil Conway
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> We currently vendor gmock 1.7.0. The latest upstream version of gmock is 
> 1.8.0, which fixes at least one annoying warning (MESOS-6539).





[jira] [Updated] (MESOS-7314) Add offer operations for converting disk resources

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7314:
--
Sprint: Mesosphere Sprint 55, Mesosphere Sprint 56, Mesosphere Sprint 57  
(was: Mesosphere Sprint 55, Mesosphere Sprint 56)

> Add offer operations for converting disk resources
> --
>
> Key: MESOS-7314
> URL: https://issues.apache.org/jira/browse/MESOS-7314
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> One should be able to convert {{RAW}} and {{BLOCK}} disk resources into 
> different types by applying operations to them. The offer operations and the 
> related validation and resource handling need to be implemented.





[jira] [Updated] (MESOS-7443) Add the MARK_AGENT_GONE call to the Operator v1 API protos.

2017-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7443:
--
Sprint: Mesosphere Sprint 56, Mesosphere Sprint 57  (was: Mesosphere Sprint 
56)

> Add the MARK_AGENT_GONE call to the Operator v1 API protos.
> ---
>
> Key: MESOS-7443
> URL: https://issues.apache.org/jira/browse/MESOS-7443
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> We need to add the relevant call to the v1 Operator API protos to mark an 
> agent as GONE. The actual handler implementation on the master would be done 
> in a separate ticket.
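A hedged sketch of how a client could construct the proposed call, assuming 
the message and field names end up as the ticket suggests (the master-side 
handler is explicitly out of scope here):

{code}
#include <string>

#include <mesos/v1/master/master.hpp>

// Build a v1 operator API call marking an agent as gone. The
// MARK_AGENT_GONE type and mark_agent_gone.agent_id field are assumed
// from this ticket's proposal.
mesos::v1::master::Call makeMarkAgentGone(const std::string& agentId)
{
  mesos::v1::master::Call call;
  call.set_type(mesos::v1::master::Call::MARK_AGENT_GONE);
  call.mutable_mark_agent_gone()->mutable_agent_id()->set_value(agentId);
  return call;
}
{code}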





[jira] [Updated] (MESOS-7482) #elif does not match #ifdef when checking the platform.

2017-05-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7482:
---
Summary: #elif does not match #ifdef when checking the platform.  (was: 
#elif does not match #ifdef when checking the platform)

> #elif does not match #ifdef when checking the platform.
> ---
>
> Key: MESOS-7482
> URL: https://issues.apache.org/jira/browse/MESOS-7482
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Trivial
>
> When doing conditional compilation for different platforms, we mostly use 
> {{#ifdef X}} ... {{#elif defined(Y)}} ... {{#endif}}. But there are some 
> places in the codebase that use {{#elif Y}}. Although with the current GCC 
> checking either the existence or the value of a platform macro works, making 
> the checks consistent is preferable.
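A small illustration of the inconsistency (the platform macros shown are just 
common examples):

{code}
// Inconsistent: the first branch tests definedness, the second tests a
// value. An undefined identifier in #elif silently evaluates to 0.
#ifdef __linux__
// Linux-specific code...
#elif __APPLE__
// macOS-specific code...
#endif

// Consistent: test definedness in both branches.
#ifdef __linux__
// Linux-specific code...
#elif defined(__APPLE__)
// macOS-specific code...
#endif
{code}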





[jira] [Commented] (MESOS-6724) The test "HTTPCommandExecutorTest.TerminateWithACK" is flaky

2017-05-12 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008142#comment-16008142
 ] 

Alexander Rukletsov commented on MESOS-6724:


Observed it again on Apache CI for the 1.1.2 release.
{noformat}
[ RUN  ] HTTPCommandExecutorTest.TerminateWithACK
I0504 15:43:05.341382 32064 cluster.cpp:158] Creating default 'local' authorizer
I0504 15:43:05.345090 32064 leveldb.cpp:174] Opened db in 3.444533ms
I0504 15:43:05.345728 32064 leveldb.cpp:181] Compacted db in 603462ns
I0504 15:43:05.345772 32064 leveldb.cpp:196] Created db iterator in 16838ns
I0504 15:43:05.345788 32064 leveldb.cpp:202] Seeked to beginning of db in 1987ns
I0504 15:43:05.345799 32064 leveldb.cpp:271] Iterated through 0 keys in the db 
in 269ns
I0504 15:43:05.345834 32064 replica.cpp:776] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0504 15:43:05.346590 32091 recover.cpp:451] Starting replica recovery
I0504 15:43:05.346793 32091 recover.cpp:477] Replica is in EMPTY status
I0504 15:43:05.347823 32098 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from __req_res__(168)@172.17.0.3:41866
I0504 15:43:05.348352 32090 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I0504 15:43:05.348784 32098 recover.cpp:568] Updating replica status to STARTING
I0504 15:43:05.349874 32095 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 840720ns
I0504 15:43:05.349900 32095 replica.cpp:320] Persisted replica status to 
STARTING
I0504 15:43:05.350070 32088 recover.cpp:477] Replica is in STARTING status
I0504 15:43:05.350971 32102 master.cpp:380] Master 
2075640b-b7dc-44f0-89b5-b0f9af99be7e (41c61dc99119) started on 172.17.0.3:41866
I0504 15:43:05.351112 32088 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from __req_res__(169)@172.17.0.3:41866
I0504 15:43:05.350991 32102 master.cpp:382] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/t7Ea9P/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
--registry_max_agent_count="102400" --registry_store_timeout="100secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/mesos/mesos-1.1.2/_inst/share/mesos/webui" 
--work_dir="/tmp/t7Ea9P/master" --zk_session_timeout="10secs"
I0504 15:43:05.351322 32102 master.cpp:432] Master only allowing authenticated 
frameworks to register
I0504 15:43:05.351335 32102 master.cpp:446] Master only allowing authenticated 
agents to register
I0504 15:43:05.351341 32102 master.cpp:459] Master only allowing authenticated 
HTTP frameworks to register
I0504 15:43:05.351348 32102 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/t7Ea9P/credentials'
I0504 15:43:05.351394 32094 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I0504 15:43:05.351594 32102 master.cpp:504] Using default 'crammd5' 
authenticator
I0504 15:43:05.351850 32102 http.cpp:887] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0504 15:43:05.352252 32102 http.cpp:887] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0504 15:43:05.352270 32090 recover.cpp:568] Updating replica status to VOTING
I0504 15:43:05.352500 32102 http.cpp:887] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0504 15:43:05.352635 32092 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 225076ns
I0504 15:43:05.352660 32092 replica.cpp:320] Persisted replica status to VOTING
I0504 15:43:05.352707 32102 master.cpp:584] Authorization enabled
I0504 15:43:05.352778 32087 recover.cpp:582] Successfully joined the Paxos group
I0504 15:43:05.352880 32091 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I0504 15:43:05.352883 32089 whitelist_watcher.cpp:77] No whitelist given
I0504 15:43:05.353144 32087 recover.cpp:466] Recover process terminated
I0504 15:43:05.355403 32085 master.cpp:2017] Elected as the leading master!
I0504 15:43:05.355437 32085 maste

[jira] [Updated] (MESOS-6724) The test "HTTPCommandExecutorTest.TerminateWithACK" is flaky

2017-05-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6724:
---
Labels: flaky-test mesosphere  (was: )

> The test "HTTPCommandExecutorTest.TerminateWithACK" is flaky
> 
>
> Key: MESOS-6724
> URL: https://issues.apache.org/jira/browse/MESOS-6724
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Qian Zhang
>  Labels: flaky-test, mesosphere
>
> It seems the test "HTTPCommandExecutorTest.TerminateWithACK" may fail when 
> the machine is under heavy load (e.g., using the “stress” utility to generate 
> load on the machine, like {{stress --cpu 4 --io 4 --timeout 120}}).
> {code}
> I1104 21:43:47.768609 31812 authenticator.cpp:98] Creating new server SASL 
> connection
> I1104 21:43:47.768844 31812 authenticatee.cpp:213] Received SASL 
> authentication mechanisms: CRAM-MD5
> I1104 21:43:47.768874 31812 authenticatee.cpp:239] Attempting to authenticate 
> with mechanism 'CRAM-MD5'
> I1104 21:43:47.768960 31812 authenticator.cpp:204] Received SASL 
> authentication start
> I1104 21:43:47.769021 31812 authenticator.cpp:326] Authentication requires 
> more steps
> I1104 21:43:47.769300 31812 authenticatee.cpp:259] Received SASL 
> authentication step
> I1104 21:43:47.770079 31821 authenticator.cpp:232] Received SASL 
> authentication step
> I1104 21:43:47.770108 31817 state.cpp:57] Recovering state from 
> '/tmp/HTTPCommandExecutorTest_TerminateWithACK_zuye4X/meta'
> I1104 21:43:47.770146 31821 auxprop.cpp:109] Request to lookup properties for 
> user: 'test-principal' realm: 'b7fb1902101b' server FQDN: 'b7fb1902101b' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: false
> I1104 21:43:47.770166 31821 auxprop.cpp:181] Looking up auxiliary property 
> '*userPassword'
> I1104 21:43:47.770213 31821 auxprop.cpp:181] Looking up auxiliary property 
> '*cmusaslsecretCRAM-MD5'
> I1104 21:43:47.770244 31821 auxprop.cpp:109] Request to lookup properties for 
> user: 'test-principal' realm: 'b7fb1902101b' server FQDN: 'b7fb1902101b' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: true
> I1104 21:43:47.770258 31821 auxprop.cpp:131] Skipping auxiliary property 
> '*userPassword' since SASL_AUXPROP_AUTHZID == true
> I1104 21:43:47.770268 31821 auxprop.cpp:131] Skipping auxiliary property 
> '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
> I1104 21:43:47.770290 31821 authenticator.cpp:318] Authentication success
> I1104 21:43:47.770388 31826 authenticatee.cpp:299] Authentication success
> I1104 21:43:47.770429 31823 status_update_manager.cpp:203] Recovering status 
> update manager
> I1104 21:43:47.770494 31820 master.cpp:6775] Successfully authenticated 
> principal 'test-principal' at 
> scheduler-3a4e1df5-e0b3-4373-937e-daf6914ba47d@172.17.0.3:50719
> I1104 21:43:47.770536 31815 authenticator.cpp:432] Authentication session 
> cleanup for crammd5-authenticatee(32)@172.17.0.3:50719
> I1104 21:43:47.771021 31814 sched.cpp:502] Successfully authenticated with 
> master master@172.17.0.3:50719
> I1104 21:43:47.771066 31814 sched.cpp:820] Sending SUBSCRIBE call to 
> master@172.17.0.3:50719
> I1104 21:43:47.771214 31814 sched.cpp:853] Will retry registration in 
> 384.083867ms if necessary
> I1104 21:43:47.771524 31827 master.cpp:2612] Received SUBSCRIBE call for 
> framework 'default' at 
> scheduler-3a4e1df5-e0b3-4373-937e-daf6914ba47d@172.17.0.3:50719
> I1104 21:43:47.771775 31827 master.cpp:2069] Authorizing framework principal 
> 'test-principal' to receive offers for role '*'
> I1104 21:43:47.772073 31822 containerizer.cpp:557] Recovering containerizer
> I1104 21:43:47.772267 31827 master.cpp:2688] Subscribing framework default 
> with checkpointing disabled and capabilities [  ]
> I1104 21:43:47.772873 31818 sched.cpp:743] Framework registered with 
> 266cf569-3b26-40cb-be6f-080699ef02a1-
> I1104 21:43:47.772953 31824 hierarchical.cpp:275] Added framework 
> 266cf569-3b26-40cb-be6f-080699ef02a1-
> I1104 21:43:47.773032 31818 sched.cpp:757] Scheduler::registered took 22098ns
> I1104 21:43:47.773084 31824 hierarchical.cpp:1694] No allocations performed
> I1104 21:43:47.773125 31824 hierarchical.cpp:1789] No inverse offers to send 
> out!
> I1104 21:43:47.773246 31824 hierarchical.cpp:1286] Performed allocation for 0 
> agents in 226006ns
> I1104 21:43:47.773977 31821 provisioner.cpp:253] Provisioner recovery complete
> I1104 21:43:47.774346 31814 slave.cpp:5399] Finished recovery
> I1104 21:43:47.788095 31814 slave.cpp:5573] Querying resource estimator for 
> oversubscribable resources
> I1104 21:43:47.788635 31821 status_update_manager.cpp:177] Pausing sending 
> status updates
> I1104 21:43:47.788642 31814 

[jira] [Updated] (MESOS-7482) #elif does not match #ifdef when checking the platform

2017-05-12 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7482:

Target Version/s: 1.4.0
   Fix Version/s: (was: 1.3.0)

> #elif does not match #ifdef when checking the platform
> --
>
> Key: MESOS-7482
> URL: https://issues.apache.org/jira/browse/MESOS-7482
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Trivial
>
> When doing conditional compilation for different platforms, we mostly use 
> {{#ifdef X}} ... {{#elif defined(Y)}} ... {{#endif}}. But there are some 
> places in the codebase that use {{#elif Y}}. Although with the current GCC 
> checking either the existence or the value of a platform macro works, making 
> the checks consistent is preferable.





[jira] [Updated] (MESOS-7471) Provisioner recover should not always assume 'rootfses' dir exists.

2017-05-12 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7471:

Fix Version/s: (was: 1.3.1)
   1.3.0

> Provisioner recover should not always assume 'rootfses' dir exists.
> ---
>
> Key: MESOS-7471
> URL: https://issues.apache.org/jira/browse/MESOS-7471
> Project: Mesos
>  Issue Type: Bug
>  Components: provisioner
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: provisioner
> Fix For: 1.1.2, 1.2.1, 1.3.0, 1.4.0
>
>
> The Mesos agent may restart for many reasons (e.g., disk full). Always 
> assuming the provisioner 'rootfses' dir exists would block the agent from 
> recovering.
> {noformat}
> Failed to perform recovery: Collect failed: Unable to list rootfses belonged 
> to container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847: Unable to list the backend 
> directory: Failed to opendir 
> '/var/lib/mesos/slave/provisioner/containers/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/backends/overlay/rootfses':
>  No such file or directory
> {noformat}
> This issue may occur due to a race between removing the provisioner 
> container dir and an agent restart:
> {noformat}
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.058349 11441 linux_launcher.cpp:429] Launching container 
> a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.072191 11441 systemd.cpp:96] Assigned child process '11577' to 
> 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.075932 11439 containerizer.cpp:1592] Checkpointing container's 
> forked pid 11577 to 
> '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008/executors/node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05/runs/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.081516 11438 linux_launcher.cpp:429] Launching container 
> 03a57a37-eede-46ec-8420-dda3cc54e2e0 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.083516 11438 systemd.cpp:96] Assigned child process '11579' to 
> 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.087345 11444 containerizer.cpp:1592] Checkpointing container's 
> forked pid 11579 to 
> '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/36a25adb-4ea2-49d3-a195-448cff1dc146-0002/executors/66897/runs/03a57a37-eede-46ec-8420-dda3cc54e2e0/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> W0505 02:14:32.213049 11440 fetcher.cpp:896] Begin fetcher log (stderr in 
> sandbox) for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac from running 
> command: 
> /opt/mesosphere/packages/mesos--aaedd03eee0d57f5c0d49c74ff1e5721862cad98/libexec/mesos/mesos-fetcher
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.006201 11561 fetcher.cpp:531] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/36a25adb-4ea2-49d3-a195-448cff1dc146-S34\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/downloads.mesosphere.com\/libmesos-bundle\/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz"}},{"action":"BYPASS_CACHE",
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009678 11561 fetcher.cpp:442] Fetching URI 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009693 11561 fetcher.cpp:283] Fetching directly into the 
> sandbox directory
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009711 11561 fetcher.cpp:220] Fetching URI 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009723 11561 fetcher.cpp:163] Downloading resource from 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
>  to 
> '/var/lib/mesos/slave/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/fra

[jira] [Updated] (MESOS-7471) Provisioner recover should not always assume 'rootfses' dir exists.

2017-05-12 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7471:

Target Version/s: 1.1.2, 1.2.1, 1.3.0, 1.4.0  (was: 1.1.2, 1.2.1, 1.3.1, 
1.4.0)

> Provisioner recover should not always assume 'rootfses' dir exists.
> ---
>
> Key: MESOS-7471
> URL: https://issues.apache.org/jira/browse/MESOS-7471
> Project: Mesos
>  Issue Type: Bug
>  Components: provisioner
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: provisioner
> Fix For: 1.1.2, 1.2.1, 1.3.0, 1.4.0
>
>
> The Mesos agent may restart for many reasons (e.g., disk full). Always 
> assuming the provisioner 'rootfses' dir exists would block the agent from 
> recovering.
> {noformat}
> Failed to perform recovery: Collect failed: Unable to list rootfses belonged 
> to container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847: Unable to list the backend 
> directory: Failed to opendir 
> '/var/lib/mesos/slave/provisioner/containers/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/backends/overlay/rootfses':
>  No such file or directory
> {noformat}
> This issue may occur due to a race between removing the provisioner 
> container dir and an agent restart:
> {noformat}
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.058349 11441 linux_launcher.cpp:429] Launching container 
> a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.072191 11441 systemd.cpp:96] Assigned child process '11577' to 
> 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.075932 11439 containerizer.cpp:1592] Checkpointing container's 
> forked pid 11577 to 
> '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008/executors/node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05/runs/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.081516 11438 linux_launcher.cpp:429] Launching container 
> 03a57a37-eede-46ec-8420-dda3cc54e2e0 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.083516 11438 systemd.cpp:96] Assigned child process '11579' to 
> 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.087345 11444 containerizer.cpp:1592] Checkpointing container's 
> forked pid 11579 to 
> '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/36a25adb-4ea2-49d3-a195-448cff1dc146-0002/executors/66897/runs/03a57a37-eede-46ec-8420-dda3cc54e2e0/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> W0505 02:14:32.213049 11440 fetcher.cpp:896] Begin fetcher log (stderr in 
> sandbox) for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac from running 
> command: 
> /opt/mesosphere/packages/mesos--aaedd03eee0d57f5c0d49c74ff1e5721862cad98/libexec/mesos/mesos-fetcher
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.006201 11561 fetcher.cpp:531] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/36a25adb-4ea2-49d3-a195-448cff1dc146-S34\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/downloads.mesosphere.com\/libmesos-bundle\/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz"}},{"action":"BYPASS_CACHE",
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009678 11561 fetcher.cpp:442] Fetching URI 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009693 11561 fetcher.cpp:283] Fetching directly into the 
> sandbox directory
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009711 11561 fetcher.cpp:220] Fetching URI 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009723 11561 fetcher.cpp:163] Downloading resource from 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
>  to 
> '/var/lib/mesos/slave/slaves/36a25adb-4ea2-49d3-a