[jira] [Assigned] (MESOS-8087) Add operation status update handler in Master.

2017-10-16 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-8087:
---

Assignee: Jan Schlicht

> Add operation status update handler in Master.
> --
>
> Key: MESOS-8087
> URL: https://issues.apache.org/jira/browse/MESOS-8087
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jan Schlicht
>
> Please follow this doc for details.
> https://docs.google.com/document/d/1RrrLVATZUyaURpEOeGjgxA6ccshuLo94G678IbL-Yco/edit#
> This handler will process operation status updates from resource providers. 
> Depending on whether it's an old or a new operation, the logic is slightly 
> different.
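
A rough sketch of what such a handler could look like, assuming a hypothetical 
{{UpdateOperationStatusMessage}} and hypothetical helpers (the actual protobufs 
and routing are specified in the design doc above):

{code}
// Hypothetical sketch; names are illustrative, not the actual Mesos API.
void Master::updateOperationStatus(
    const UpdateOperationStatusMessage& update)
{
  Slave* slave = slaves.registered.get(update.slave_id());
  if (slave == nullptr) {
    LOG(WARNING) << "Ignoring operation status update for unknown agent "
                 << update.slave_id();
    return;
  }

  // New-style operations carry an explicit operation ID, so their status
  // updates must also be forwarded to the framework that requested
  // feedback; old-style operations only need their state recorded.
  if (update.has_operation_id()) {
    forwardOperationStatus(update);  // hypothetical helper
  }

  recordOperationState(update);      // hypothetical helper
}
{code}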



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8087) Add operation status update handler in Master.

2017-10-16 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8087:

  Sprint: Mesosphere Sprint 65
Story Points: 5
  Labels: mesosphere  (was: )

> Add operation status update handler in Master.
> --
>
> Key: MESOS-8087
> URL: https://issues.apache.org/jira/browse/MESOS-8087
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Please follow this doc for details.
> https://docs.google.com/document/d/1RrrLVATZUyaURpEOeGjgxA6ccshuLo94G678IbL-Yco/edit#
> This handler will process operation status updates from resource providers. 
> Depending on whether it's an old or a new operation, the logic is slightly 
> different.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8089) Add messages to publish resources on a resource provider

2017-10-16 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8089:

Sprint: Mesosphere Sprint 65  (was: Mesosphere Sprint 66)

> Add messages to publish resources on a resource provider
> 
>
> Key: MESOS-8089
> URL: https://issues.apache.org/jira/browse/MESOS-8089
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Before launching a task that uses resource provider resources, the resource 
> provider needs to be told to "publish" these resources, as it may need to 
> take some preparatory actions. For external resource providers, resources 
> might also have to be "unpublished" when a task finishes. The resource 
> provider needs to acknowledge these calls once it is ready.
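
A sketch of the intended exchange, with hypothetical message and helper names 
(defining the real protobufs is the point of this ticket):

{code}
// Hypothetical sketch, not the actual Mesos API: the resource provider
// receives a publish request, prepares the resources, and only then acks.
void ResourceProvider::publishResources(
    const PublishResourcesMessage& message)
{
  // E.g. a storage resource provider might mount volumes here.
  prepare(message.resources())  // hypothetical async helper
    .onReady([=]() {
      PublishResourcesAcknowledgement ack;  // hypothetical message
      ack.mutable_uuid()->CopyFrom(message.uuid());
      send(ack);  // deliver the ack to the resource provider manager
    });
}
{code}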



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7509) CniIsolatorPortMapperTest.ROOT_INTERNET_CURL_PortMapper fails on some Linux distros.

2017-10-16 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206154#comment-16206154
 ] 

Joseph Wu commented on MESOS-7509:
--

+1 to disabling the tests on CentOS 6, provided we write up the necessary 
precautionary documentation.

> CniIsolatorPortMapperTest.ROOT_INTERNET_CURL_PortMapper fails on some Linux 
> distros.
> 
>
> Key: MESOS-7509
> URL: https://issues.apache.org/jira/browse/MESOS-7509
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: CentOS 6, Ubuntu 12.04
>Reporter: Alexander Rukletsov
>  Labels: containerizer, flaky-test, isolation, mesosphere
> Attachments: ROOT_DOCKER_DefaultDNS-badrun.txt, 
> ROOT_INTERNET_CURL_PortMapper-badrun.txt, 
> ROOT_INTERNET_CURL_PortMapper_failure_log_1.txt, 
> ROOT_INTERNET_CURL_PortMapper_failure_log_2_centos6.txt
>
>
> I see this test failing consistently on CentOS 6 and Ubuntu 12.04.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8085) No point in deallocate() for a framework for maintenance if it is deactivated.

2017-10-16 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8085:
--
Description: 
The {{UnavailableResources}} sent from the allocator to the master are going to 
be dropped by the master anyway, which results in the following line being 
printed per inactive framework per allocation, spamming the master log. We 
could tune down the log level, but it's better for the allocator to just not 
send the {{UnavailableResources}}.

{code}
LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
  << " because the framework has terminated or is inactive";
{code}

  was:
The {{UnavailableResources}} sent from the allocator to the master are going to 
be dropped by the master anyway, which results in the following line being 
printed per inactive framework per allocation, spamming the master log. We 
could tune down the log level, but it's better to just not send the 
{{UnavailableResources}}.

{code}
LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
  << " because the framework has terminated or is inactive";
{code}


> No point in deallocate() for a framework for maintenance if it is deactivated.
> --
>
> Key: MESOS-8085
> URL: https://issues.apache.org/jira/browse/MESOS-8085
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>  Labels: maintenance
>
> The {{UnavailableResources}} sent from the allocator to the master are going 
> to be dropped by the master anyway, which results in the following line being 
> printed per inactive framework per allocation, spamming the master log. 
> We could tune down the log level, but it's better for the allocator to just 
> not send the {{UnavailableResources}}.
> {code}
> LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
>   << " because the framework has terminated or is inactive";
> {code}
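
A minimal sketch of the proposed guard in the allocator (names approximate), 
skipping deactivated frameworks in the maintenance deallocate() path so the 
master never has to drop and log the inverse offer:

{code}
// Sketch only: inside the hierarchical allocator's deallocate() path.
foreachpair (const FrameworkID& frameworkId,
             const Framework& framework,
             frameworks) {
  // A deactivated framework cannot receive inverse offers; the master
  // would drop them anyway, logging once per framework per allocation.
  if (!framework.active) {
    continue;
  }

  // ... construct and send UnavailableResources as before ...
}
{code}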



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8097) Add filesystem layout for local resource providers.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8097:
--

 Summary: Add filesystem layout for local resource providers.
 Key: MESOS-8097
 URL: https://issues.apache.org/jira/browse/MESOS-8097
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
 Fix For: 1.5.0


We need the following two paths for local resource providers:
1. Checkpoint directory: this should be tied to the agent ID, since the master 
only keeps track of total resources on agents.
2. Resource provider work directory: we may want this to not be tied to agent 
IDs, since it could store persistent information for external CSI plugins, such 
as domain socket files.
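
A sketch of path helpers for this layout, using stout's {{path::join}} (the 
directory names here are illustrative, not final):

{code}
#include <mesos/mesos.hpp>

#include <stout/path.hpp>
#include <stout/stringify.hpp>

// Checkpoint directory: tied to the agent ID, mirroring how the master
// accounts for total resources per agent.
std::string resourceProviderCheckpointDir(
    const std::string& metaDir,
    const mesos::SlaveID& slaveId,
    const mesos::ResourceProviderID& resourceProviderId)
{
  return path::join(
      metaDir,
      "slaves", stringify(slaveId),
      "resource_providers", stringify(resourceProviderId));
}

// Work directory: deliberately NOT tied to the agent ID, so persistent
// state such as CSI plugin domain socket files survives agent ID changes.
std::string resourceProviderWorkDir(
    const std::string& workDir,
    const std::string& type,
    const std::string& name)
{
  return path::join(workDir, "resource_providers", type, name);
}
{code}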



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8098) Benchmark Master failover performance

2017-10-16 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8098:
-

Assignee: Yan Xu

> Benchmark Master failover performance
> -
>
> Key: MESOS-8098
> URL: https://issues.apache.org/jira/browse/MESOS-8098
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> Master failover performance often sheds light on the master's performance in 
> general, as failover is often when the master experiences its highest load. 
> Ways we can benchmark the failover include measuring the time it takes for 
> all agents to reregister, for all frameworks to resubscribe, or for 
> reconciliation to fully complete.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8098) Benchmark Master failover performance

2017-10-16 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8098:
-

 Summary: Benchmark Master failover performance
 Key: MESOS-8098
 URL: https://issues.apache.org/jira/browse/MESOS-8098
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Yan Xu


Master failover performance often sheds light on the master's performance in 
general, as failover is often when the master experiences its highest load. 
Ways we can benchmark the failover include measuring the time it takes for all 
agents to reregister, for all frameworks to resubscribe, or for reconciliation 
to fully complete.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8071) Add agent capability for resource provider.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8071:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Add agent capability for resource provider.
> ---
>
> Key: MESOS-8071
> URL: https://issues.apache.org/jira/browse/MESOS-8071
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> This capability will be used by the master to tell whether an agent is 
> resource provider capable. There will be some logic in the master that 
> depends on this capability.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7305) Adjust the recover logic of MesosContainerizer to allow standalone containers.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7305:
---
Sprint: Mesosphere Sprint 57, Mesosphere Sprint 58, Mesosphere Sprint 59, 
Mesosphere Sprint 60, Mesosphere Sprint 61, Mesosphere Sprint 62, Mesosphere 
Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 65, Mesosphere Sprint 66  
(was: Mesosphere Sprint 57, Mesosphere Sprint 58, Mesosphere Sprint 59, 
Mesosphere Sprint 60, Mesosphere Sprint 61, Mesosphere Sprint 62, Mesosphere 
Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 65)

> Adjust the recover logic of MesosContainerizer to allow standalone containers.
> --
>
> Key: MESOS-7305
> URL: https://issues.apache.org/jira/browse/MESOS-7305
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
>
> The current recovery logic in MesosContainerizer assumes that all top-level 
> containers are tied to some Mesos executors. Adding standalone containers 
> will invalidate this assumption. The recovery logic must be changed to adapt 
> to that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8054) Feedback for offer operations

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8054:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Feedback for offer operations
> -
>
> Key: MESOS-8054
> URL: https://issues.apache.org/jira/browse/MESOS-8054
> Project: Mesos
>  Issue Type: Epic
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>
> Only LAUNCH operations provide feedback on success or failure. All operations 
> should do so: RESERVE, UNRESERVE, CREATE, DESTROY, CREATE_VOLUME, and 
> DESTROY_VOLUME should all provide feedback on success or failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7851) Master stores old resource format in the registry

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7851:
---
Sprint: Mesosphere Sprint 61, Mesosphere Sprint 62, Mesosphere Sprint 63, 
Mesosphere Sprint 64, Mesosphere Sprint 65, Mesosphere Sprint 66  (was: 
Mesosphere Sprint 61, Mesosphere Sprint 62, Mesosphere Sprint 63, Mesosphere 
Sprint 64, Mesosphere Sprint 65)

> Master stores old resource format in the registry
> -
>
> Key: MESOS-7851
> URL: https://issues.apache.org/jira/browse/MESOS-7851
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Michael Park
>  Labels: master, mesosphere, reservation
>
> We intend for the master to store all internal resource representations in 
> the new, post-reservation-refinement format. However, [when persisting 
> registered agents to the 
> registrar|https://github.com/apache/mesos/blob/498a000ac1bb8f51dc871f22aea265424a407a17/src/master/master.cpp#L5861-L5876],
>  the master does not convert the resources; agents provide resources in the 
> pre-reservation-refinement format, and these resources are stored as-is. This 
> means that after recovery, any agents in the master's {{slaves.recovered}} 
> map will have {{SlaveInfo.resources}} in the pre-reservation-refinement 
> format.
> We should update the master to convert these resources before persisting them 
> to the registry.
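
A sketch of the direction (assuming the internal {{convertResourceFormat}} 
helper; the exact call site may differ):

{code}
// Sketch: upgrade the agent's resources to the
// post-reservation-refinement format before persisting the SlaveInfo
// to the registry.
SlaveInfo slaveInfo = slave->info;

convertResourceFormat(
    slaveInfo.mutable_resources(), POST_RESERVATION_REFINEMENT);

registrar->apply(Owned<Operation>(new AdmitSlave(slaveInfo)));
{code}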



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-564) Update Contribution Documentation

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-564:
--
Sprint: Mesosphere Sprint 64, Mesosphere Sprint 65, Mesosphere Sprint 66  
(was: Mesosphere Sprint 64, Mesosphere Sprint 65)

> Update Contribution Documentation
> -
>
> Key: MESOS-564
> URL: https://issues.apache.org/jira/browse/MESOS-564
> Project: Mesos
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Dave Lester
>Assignee: Greg Mann
>  Labels: documentation, mesosphere
>
> Our contribution guide is currently fairly verbose, and it focuses on the 
> ReviewBoard workflow for making code contributions. It would be helpful for 
> new contributors to have a first-time contribution guide which focuses on 
> using GitHub PRs to make small contributions, since that workflow has a 
> smaller barrier to entry for new users.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7975) The command/default/docker executor can incorrectly send a TASK_FINISHED update even when the task is killed

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7975:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> The command/default/docker executor can incorrectly send a TASK_FINISHED 
> update even when the task is killed
> 
>
> Key: MESOS-7975
> URL: https://issues.apache.org/jira/browse/MESOS-7975
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Qian Zhang
>Priority: Critical
>  Labels: mesosphere
>
> Currently, when a task is killed, the default/command/docker executor 
> incorrectly sends a {{TASK_FINISHED}} status update instead of 
> {{TASK_KILLED}}. This is due to an unfortunate missed conditional check when 
> the task exits with a zero status code.
> {code}
>   if (WSUCCEEDED(status)) {
> taskState = TASK_FINISHED;
>   } else if (killed) {
> // Send TASK_KILLED if the task was killed as a result of
> // kill() or shutdown().
> taskState = TASK_KILLED;
>   } else {
> taskState = TASK_FAILED;
>   }
> {code}
> We should modify the code to correctly send {{TASK_KILLED}} status updates 
> when a task is killed.
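
A minimal sketch of the fix, reordering the existing conditionals so that a 
kill racing with a zero exit status still surfaces as {{TASK_KILLED}}:

{code}
  if (killed) {
    // Send TASK_KILLED if the task was killed as a result of
    // kill() or shutdown(), regardless of the exit status.
    taskState = TASK_KILLED;
  } else if (WSUCCEEDED(status)) {
    taskState = TASK_FINISHED;
  } else {
    taskState = TASK_FAILED;
  }
{code}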



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8079) Checkpoint and recover layers used to provision rootfs in provisioner

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8079:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Checkpoint and recover layers used to provision rootfs in provisioner
> -
>
> Key: MESOS-8079
> URL: https://issues.apache.org/jira/browse/MESOS-8079
> Project: Mesos
>  Issue Type: Task
>  Components: provisioner
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: Mesosphere
>
> This information will be necessary for {{provisioner}} to determine all 
> layers of active containers, which we need to retain when image gc happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8089) Add messages to publish resources on a resource provider

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8089:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Add messages to publish resources on a resource provider
> 
>
> Key: MESOS-8089
> URL: https://issues.apache.org/jira/browse/MESOS-8089
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Before launching a task that uses resource provider resources, the resource 
> provider needs to be informed to "publish" these resources as it may take 
> some necessary actions. For external resource providers resources might also 
> have to be "unpublished" when a task is finished. The resource provider needs 
> to ack these calls after it's ready.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7944) Implement jemalloc support for Mesos

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7944:
---
Sprint: Mesosphere Sprint 63, Mesosphere Sprint 65, Mesosphere Sprint 66  
(was: Mesosphere Sprint 63, Mesosphere Sprint 65)

> Implement jemalloc support for Mesos
> 
>
> Key: MESOS-7944
> URL: https://issues.apache.org/jira/browse/MESOS-7944
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>  Labels: mesosphere
>
> After investigation in MESOS-7876 and discussion on the mailing list, this 
> task is for tracking progress on adding out-of-the-box memory profiling 
> support using jemalloc to Mesos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8052) "protoc" not found when running "make -j4 check" directly in stout

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8052:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> "protoc" not found when running "make -j4 check" directly in stout
> --
>
> Key: MESOS-8052
> URL: https://issues.apache.org/jira/browse/MESOS-8052
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: compile-error
> Fix For: 1.4.1
>
>
> If we run {{make -j4 check}} without running {{make}} first, we will get the 
> following error message:
> {noformat}
> 3rdparty/protobuf-3.3.0/src/protoc -I../tests --cpp_out=. 
> ../tests/protobuf_tests.proto
> /bin/bash: 3rdparty/protobuf-3.3.0/src/protoc: No such file or directory
> Makefile:1934: recipe for target 'protobuf_tests.pb.cc' failed
> make: *** [protobuf_tests.pb.cc] Error 127
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-4945) Garbage collect unused docker layers in the store.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4945:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Zhitao Li
>  Labels: Mesosphere
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8055) Design doc for offer operations feedback

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8055:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Design doc for offer operations feedback
> 
>
> Key: MESOS-8055
> URL: https://issues.apache.org/jira/browse/MESOS-8055
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8074) Change Libprocess actor state transitions verbose logs to use VLOG(3) instead of 2

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8074:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Change Libprocess actor state transitions verbose logs to use VLOG(3) instead 
> of 2
> --
>
> Key: MESOS-8074
> URL: https://issues.apache.org/jira/browse/MESOS-8074
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: logging, mesosphere
>
> Without claiming a general change or a holistic approach: the volume of logs 
> about actor states being resumed when running a Mesos cluster with 
> {{GLOG_v=2}} is quite noisy. We should thus use {{VLOG(3)}} for such messages.
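
For illustration, the change boils down to bumping the verbosity of these 
libprocess messages by one level, e.g. (the exact log line is illustrative):

{code}
// Before: emitted every time an actor is resumed, flooding GLOG_v=2 output.
VLOG(2) << "Resuming " << process->pid << " at " << Clock::now();

// After: only emitted at GLOG_v=3 and above.
VLOG(3) << "Resuming " << process->pid << " at " << Clock::now();
{code}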



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8095) ResourceProviderRegistrarTest.AgentRegistrar is flaky.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8095:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> ResourceProviderRegistrarTest.AgentRegistrar is flaky.
> --
>
> Key: MESOS-8095
> URL: https://issues.apache.org/jira/browse/MESOS-8095
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Bannier
>  Labels: flaky-test, mesosphere
> Attachments: AgentRegistrar-badrun.txt
>
>
> Observed it in internal CI. Test log attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7939) Early disk usage check for garbage collection during recovery

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7939:
---
Sprint: Mesosphere Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 65, 
Mesosphere Sprint 66  (was: Mesosphere Sprint 63, Mesosphere Sprint 64, 
Mesosphere Sprint 65)

> Early disk usage check for garbage collection during recovery
> -
>
> Key: MESOS-7939
> URL: https://issues.apache.org/jira/browse/MESOS-7939
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
> Fix For: 1.4.1
>
>
> Currently the default value for `disk_watch_interval` is 1 minute. This is 
> not frequent enough and could lead to the following scenario:
> 1. The disk usage was checked and there was not enough headroom:
> {noformat}
> I0901 17:54:33.00 25510 slave.cpp:5896] Current disk usage 99.87%. Max 
> allowed age: 0ns
> {noformat}
> But no container was pruned because no container had been scheduled for GC.
> 2. A task was completed. The task itself contained a lot of nested 
> containers, each of which used a lot of disk space. Note that there is no way 
> for the Mesos agent to schedule individual nested containers for GC, since 
> nested containers are not necessarily tied to tasks. When the top-level 
> container completed, it was scheduled for GC, and the nested containers would 
> be GC'ed as well: 
> {noformat}
> I0901 17:54:44.00 25510 gc.cpp:59] Scheduling 
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
>  for gc 1.9466483852days in the future
> I0901 17:54:44.00 25510 gc.cpp:59] Scheduling 
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
>  for gc 1.9466405037days in the future
> I0901 17:54:44.00 25510 gc.cpp:59] Scheduling 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
>  for gc 1.946635763days in the future
> I0901 17:54:44.00 25510 gc.cpp:59] Scheduling 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
>  for gc 1.9466324148days in the future
> {noformat}
> 3. Since the next disk usage check was still 40ish seconds away, no GC was 
> performed even though the disk was full. As a result, Mesos agent failed to 
> checkpoint the task status:
> {noformat}
> I0901 17:54:49.00 25513 status_update_manager.cpp:323] Received status 
> update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task 
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> F0901 17:54:49.00 25513 slave.cpp:4748] CHECK_READY(future): is FAILED: 
> Failed to open 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
>  for status updates: No space left on device Failed to handle status update 
> TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task 
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> 4. When the agent restarted, it tried to checkpoint the task status again. 
> However, since the first disk usage check was scheduled 1 minute after 
> startup, the agent failed before GC kicked in, falling into a restart failure 
> loop:
> {noformat}
> F0901 17:55:06.00 31114 slave.cpp:4748] CHECK_READY(future): is FAILED: 
> Failed to open 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
>  for status updates: No space left on device Failed to handle status update 
> TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task 
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> We should kick in GC early, so the agent can recover from this state.
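
One possible shape of the fix (a sketch only; {{completeRecovery()}} is a 
hypothetical hook name, while {{checkDiskUsage()}} and {{disk_watch_interval}} 
are from the agent): run the first disk usage check immediately during 
recovery instead of waiting a full interval:

{code}
// Sketch: at the end of agent recovery, check disk usage right away.
void Slave::completeRecovery()
{
  // Previously the first check was only scheduled one interval out:
  //   delay(flags.disk_watch_interval, self(), &Slave::checkDiskUsage);
  //
  // Checking immediately lets GC prune already-scheduled directories
  // before the agent resumes checkpointing to a possibly full disk;
  // subsequent checks continue on the regular interval as before.
  checkDiskUsage();
}
{code}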

[jira] [Updated] (MESOS-7941) Send TASK_STARTING status from built-in executors

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7941:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Send TASK_STARTING status from built-in executors
> -
>
> Key: MESOS-7941
> URL: https://issues.apache.org/jira/browse/MESOS-7941
> Project: Mesos
>  Issue Type: Improvement
>  Components: executor
>Reporter: Benno Evers
>Assignee: Benno Evers
>  Labels: executor, executors
> Fix For: 1.5.0
>
>
> All executors have the option to send out a TASK_STARTING status update to 
> signal to the scheduler that they received the command to launch the task.
> It would be good if our built-in executors did this, for reasons laid 
> out in 
> https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E
> This will also fix MESOS-6790.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7561) Add storage resource provider specific information in ResourceProviderInfo.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7561:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Add storage resource provider specific information in ResourceProviderInfo.
> ---
>
> Key: MESOS-7561
> URL: https://issues.apache.org/jira/browse/MESOS-7561
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>  Labels: mesosphere, storage
> Fix For: 1.5.0
>
>
> For the storage resource provider, there will be some specific configuration 
> information. For instance, the most important piece is the `ContainerConfig` 
> of the CSI Plugin container.
> That config information will be sent to the corresponding agent that will use 
> the resources provided by the resource provider. For the storage resource 
> provider in particular, the agent needs to launch the CSI Node Plugin to 
> mount the volumes.
> Compared to adding first-class storage resource provider information, an 
> alternative is to add a generic labels field in ResourceProviderInfo and let 
> the resource provider itself figure out the format of the labels. However, I 
> believe a first-class solution is better and clearer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8072) Change Mesos common events verbose logs to use VLOG(2) instead of 1

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8072:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Change Mesos common events verbose logs to use VLOG(2) instead of 1
> ---
>
> Key: MESOS-8072
> URL: https://issues.apache.org/jira/browse/MESOS-8072
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: logging, mesosphere
>
> The original commit 
> https://github.com/apache/mesos/commit/fa6ffdfcd22136c171b43aed2e7949a07fd263d7
>  that started using VLOG(1) for the allocator does not state why this level 
> was chosen, and periodic messages such as "No allocations performed" 
> should be displayed at a higher verbosity level to simplify debugging.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7837) Propagate resource updates from local resource providers to master

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7837:
---
Sprint: Mesosphere Sprint 60, Mesosphere Sprint 61, Mesosphere Sprint 62, 
Mesosphere Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 65, Mesosphere 
Sprint 66  (was: Mesosphere Sprint 60, Mesosphere Sprint 61, Mesosphere Sprint 
62, Mesosphere Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 65)

> Propagate resource updates from local resource providers to master
> --
>
> Key: MESOS-7837
> URL: https://issues.apache.org/jira/browse/MESOS-7837
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere, storage
>
> When a resource provider registers with a resource provider manager, the 
> manager should send a message to its subscribers informing them of the 
> changed resources.
> For the first iteration, where we add agent-specific, local resource 
> providers, the agent would be subscribed to the manager. It should be changed 
> to handle such a resource update by informing the master about its changed 
> resources. In order to support master failovers, we should make sure to 
> similarly inform the master on agent reregistration.
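
A sketch of the agent-side handling, with assumed message and field names (the 
concrete protobuf changes are part of this work):

{code}
// Hypothetical sketch: the agent, subscribed to its local resource
// provider manager, forwards new resource totals to the master. The
// same message should also be (re)sent on agent reregistration so a
// failed-over master learns the current totals.
void Slave::handleResourceProviderUpdate(const Resources& total)
{
  UpdateSlaveMessage message;  // assumed message carrying new totals
  message.mutable_slave_id()->CopyFrom(info.id());
  message.mutable_total_resources()->CopyFrom(total);

  send(master.get(), message);
}
{code}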



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8097) Add filesystem layout for local resource providers.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8097:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Add filesystem layout for local resource providers.
> ---
>
> Key: MESOS-8097
> URL: https://issues.apache.org/jira/browse/MESOS-8097
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
> Fix For: 1.5.0
>
>
> We need the following two paths for local resource providers:
> 1. Checkpoint directory: this should be tied to the agent ID, since the 
> master only keeps track of total resources on agents.
> 2. Resource provider work directory: we may want this to not be tied to agent 
> IDs, since it could store persistent information for external CSI plugins, 
> such as domain socket files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8032) Launching SLRP

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8032:
---
Sprint: Mesosphere Sprint 64, Mesosphere Sprint 65, Mesosphere Sprint 66  
(was: Mesosphere Sprint 64, Mesosphere Sprint 65)

> Launching SLRP
> --
>
> Key: MESOS-8032
> URL: https://issues.apache.org/jira/browse/MESOS-8032
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: storage
> Fix For: 1.5.0
>
>
> Launching a SLRP requires the following steps:
> 1. Verify the configuration
> 2. Launch CSI plugins in standalone containers. It needs to use the V1 API to 
> talk to the agent to launch the plugins, which may require authN/authZ.
> 3. Get the resources from CSI plugins and register to the resource provider 
> manager through the Resource Provider API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7504:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such a failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.522195 17191 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"command":{"shell":true,"value":"

[jira] [Updated] (MESOS-8087) Add operation status update handler in Master.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8087:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Add operation status update handler in Master.
> --
>
> Key: MESOS-8087
> URL: https://issues.apache.org/jira/browse/MESOS-8087
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Please follow this doc for details.
> https://docs.google.com/document/d/1RrrLVATZUyaURpEOeGjgxA6ccshuLo94G678IbL-Yco/edit#
> This handler will process operation status updates from resource providers. 
> Depending on whether it's an old or a new operation, the logic is slightly 
> different.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8070) Bundled GRPC build does not build on Debian 8

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8070:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Bundled GRPC build does not build on Debian 8
> -
>
> Key: MESOS-8070
> URL: https://issues.apache.org/jira/browse/MESOS-8070
> Project: Mesos
>  Issue Type: Bug
>Reporter: Zhitao Li
>Assignee: Chun-Hung Hsiao
> Fix For: 1.5.0
>
>
> Debian 8 includes an outdated version of libc-ares-dev, which prevents the 
> bundled GRPC from building.
> I believe [~chhsia0] already has a fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-4812) Mesos fails to escape command health checks

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4812:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: Andrei Budnik
>  Labels: health-check, mesosphere, tech-debt
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of a sh -c "", doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this 
> JIRA. I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8075) Add RWMutex to libprocess

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8075:
---
Sprint: Mesosphere Sprint 65, Mesosphere Sprint 66  (was: Mesosphere Sprint 
65)

> Add RWMutex to libprocess
> -
>
> Key: MESOS-8075
> URL: https://issues.apache.org/jira/browse/MESOS-8075
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: Mesosphere
>
> We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide 
> better concurrency protection for mutually exclusive actions, while allowing 
> high concurrency for actions that can safely be performed at the same time.
> One use case is image garbage collection: the new API 
> {{provisioner::pruneImages}} needs to be mutually exclusive with 
> {{provisioner::provision}}, but multiple {{provisioner::provision}} calls can 
> safely run concurrently.
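
A sketch of the intended usage, modeled on the existing libprocess {{Mutex}} 
(the {{read_lock}}/{{write_lock}} interface and the helpers below are 
assumptions; the exact interface is what this ticket would define):

{code}
// Hypothetical interface, mirroring process::Mutex:
//   Future<Nothing> read_lock();   // many readers may hold it at once
//   void read_unlock();
//   Future<Nothing> write_lock();  // excludes both readers and writers
//   void write_unlock();
process::RWMutex mutex;

// Multiple provision() calls may run concurrently under read locks.
Future<Nothing> Provisioner::provision()
{
  return mutex.read_lock()
    .then([this]() {
      return doProvision()  // hypothetical helper
        .onAny([this]() { mutex.read_unlock(); });
    });
}

// pruneImages() takes the write lock to exclude in-flight provisions.
Future<Nothing> Provisioner::pruneImages()
{
  return mutex.write_lock()
    .then([this]() {
      return doPruneImages()  // hypothetical helper
        .onAny([this]() { mutex.write_unlock(); });
    });
}
{code}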



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7924) Add a javascript linter to the webui.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7924:
---
Sprint: Mesosphere Sprint 63, Mesosphere Sprint 64  (was: Mesosphere Sprint 
63, Mesosphere Sprint 64, Mesosphere Sprint 66)

> Add a javascript linter to the webui.
> -
>
> Key: MESOS-7924
> URL: https://issues.apache.org/jira/browse/MESOS-7924
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Armand Grillet
>  Labels: tech-debt
> Fix For: 1.5.0
>
>
> As far as I can tell, javascript linters (e.g. ESLint) help catch some 
> functional errors as well; for example, we've made some "strict" mistakes a 
> few times that ESLint can catch: MESOS-6624, MESOS-7912.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8099) Add protobuf for checkpointing resource provider states.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8099:
--

 Summary: Add protobuf for checkpointing resource provider states.
 Key: MESOS-8099
 URL: https://issues.apache.org/jira/browse/MESOS-8099
 Project: Mesos
  Issue Type: Task
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
 Fix For: 1.5.0


Resource providers need to checkpoint operations and resource 
reservations/volume types/etc atomically, so we will need to add a 
{{src/resource_providers/state.proto}} file that contains protobuf messages for 
checkpointing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7991) fatal, check failed !framework->recovered()

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7991:
---
Sprint:   (was: Mesosphere Sprint 66)

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Kapil Arya
>Priority: Blocker
>
> The Mesos master crashed on what appears to be framework recovery.
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> ```
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 
> 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: 
> !framework->recovered()
> *** Check failure stack trace: ***
> @ 0x7f7bf80087ed  google::LogMessage::Fail()
> @ 0x7f7bf800a5a0  google::LogMessage::SendToLog()
> @ 0x7f7bf80083d3  google::LogMessage::Flush()
> @ 0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f7bf736fe7e  
> mesos::internal::master::Master::reconcileKnownSlave()
> @ 0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f7bf73a580e  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS
> 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f7bf7f5e69c  process::ProcessBase::visit()
> @ 0x7f7bf7f71403  process::ProcessManager::resume()
> @ 0x7f7bf7f7c127  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f7bf60b5c80  (unknown)
> @ 0x7f7bf58c86ba  start_thread
> @ 0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8085) No point in deallocate() for a framework for maintenance if it is deactivated.

2017-10-16 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-8085:
-
Shepherd: Joseph Wu

> No point in deallocate() for a framework for maintenance if it is deactivated.
> --
>
> Key: MESOS-8085
> URL: https://issues.apache.org/jira/browse/MESOS-8085
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>  Labels: maintenance
>
> The {{UnavailableResources}} sent from the allocator to the master are going 
> to be dropped by the master anyway, which results in the following line being 
> printed per inactive framework per allocation, spamming the master log. 
> We could tune down the log level, but it's better for the allocator to just 
> not send the {{UnavailableResources}}.
> {code}
> LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
>   << " because the framework has terminated or is inactive";
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8032) Launch CSI plugins in storage local resource providers.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8032:
---
Description: 
Launching a CSI plugin requires the following steps:
1. Verify the configuration.
2. Prepare a directory in the work directory of the resource provider where the 
socket file should be placed, and construct the path of the socket file.
3. If the socket file already exists and the plugin is already running, we 
should not launch another plugin instance.
4. Otherwise, launch a standalone container to run the plugin and connect to it 
through the socket file.

  was:
Launching a SLRP requires the following steps:
1. Verify the configuration
2. Launch CSI plugins in standalone containers. It needs to use the V1 API to 
talk to the agent to launch the plugins, which may require authN/authZ.
3. Get the resources from CSI plugins and register to the resource provider 
manager through the Resource Provider API.

Summary: Launch CSI plugins in storage local resource providers.  (was: 
Launching SLRP)

> Launch CSI plugins in storage local resource providers.
> ---
>
> Key: MESOS-8032
> URL: https://issues.apache.org/jira/browse/MESOS-8032
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: storage
> Fix For: 1.5.0
>
>
> Launching a CSI plugin requires the following steps:
> 1. Verify the configuration.
> 2. Prepare a directory in the work directory of the resource provider where 
> the socket file should be placed, and construct the path of the socket file.
> 3. If the socket file already exists and the plugin is already running, we 
> should not launch another plugin instance.
> 4. Otherwise, launch a standalone container to run the plugin and connect to 
> it through the socket file.
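
A sketch of steps 2-4 with illustrative names ({{probePlugin}} and 
{{launchStandaloneContainer}} are hypothetical helpers, not existing Mesos 
API):

{code}
// Sketch: prepare the socket path under the resource provider's work
// directory, reuse a live plugin if one is already listening, and
// otherwise launch the plugin in a standalone container.
Future<csi::Client> StorageLocalResourceProvider::launchPlugin()
{
  const std::string socketDir = path::join(workDir, "sockets");

  Try<Nothing> mkdir = os::mkdir(socketDir);
  if (mkdir.isError()) {
    return Failure("Failed to create socket directory: " + mkdir.error());
  }

  const std::string socketPath = path::join(socketDir, "endpoint.sock");

  // Step 3: if the socket file already exists and the plugin is still
  // running, do not launch another plugin instance.
  if (os::exists(socketPath)) {
    return probePlugin(socketPath)  // hypothetical connectivity check
      .repair([=](const Future<csi::Client>&) {
        // Stale socket: fall back to launching a fresh container.
        return launchStandaloneContainer(socketPath);  // hypothetical
      });
  }

  // Step 4: launch a standalone container to run the plugin and
  // connect to it through the socket file.
  return launchStandaloneContainer(socketPath);
}
{code}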



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8085) No point in deallocate() for a framework for maintenance if it is deactivated.

2017-10-16 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8085:
-

Assignee: Yan Xu

> No point in deallocate() for a framework for maintenance if it is deactivated.
> --
>
> Key: MESOS-8085
> URL: https://issues.apache.org/jira/browse/MESOS-8085
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Assignee: Yan Xu
>  Labels: maintenance
>
> The {{UnavailableResources}} sent from the allocator to the master are going 
> to be dropped by the master anyway, which results in the following line being 
> printed per inactive framework per allocation, spamming the master log. 
> We could tune down the log level, but it's better for the allocator to just 
> not send the {{UnavailableResources}}.
> {code}
> LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
>   << " because the framework has terminated or is inactive";
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7966) check for maintenance on agent causes fatal error

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7966:
--

Assignee: Armand Grillet  (was: Joseph Wu)

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Armand Grillet
>Priority: Blocker
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the API. This happens relatively frequently, and impacts us when downstream 
> frameworks (Marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7991) fatal, check failed !framework->recovered()

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7991:
--

Assignee: Armand Grillet  (was: Kapil Arya)

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Armand Grillet
>Priority: Blocker
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> ```
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 
> 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: 
> !framework->recovered()
> *** Check failure stack trace: ***
> @ 0x7f7bf80087ed  google::LogMessage::Fail()
> @ 0x7f7bf800a5a0  google::LogMessage::SendToLog()
> @ 0x7f7bf80083d3  google::LogMessage::Flush()
> @ 0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f7bf736fe7e  
> mesos::internal::master::Master::reconcileKnownSlave()
> @ 0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f7bf73a580e  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS
> 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f7bf7f5e69c  process::ProcessBase::visit()
> @ 0x7f7bf7f71403  process::ProcessManager::resume()
> @ 0x7f7bf7f7c127  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f7bf60b5c80  (unknown)
> @ 0x7f7bf58c86ba  start_thread
> @ 0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7991) fatal, check failed !framework->recovered()

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7991:
---
Sprint: Mesosphere Sprint 66
Labels: reliability  (was: )

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> ```
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 
> 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: 
> !framework->recovered()
> *** Check failure stack trace: ***
> @ 0x7f7bf80087ed  google::LogMessage::Fail()
> @ 0x7f7bf800a5a0  google::LogMessage::SendToLog()
> @ 0x7f7bf80083d3  google::LogMessage::Flush()
> @ 0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f7bf736fe7e  
> mesos::internal::master::Master::reconcileKnownSlave()
> @ 0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f7bf73a580e  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS
> 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f7bf7f5e69c  process::ProcessBase::visit()
> @ 0x7f7bf7f71403  process::ProcessManager::resume()
> @ 0x7f7bf7f7c127  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f7bf60b5c80  (unknown)
> @ 0x7f7bf58c86ba  start_thread
> @ 0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7966) check for maintenance on agent causes fatal error

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7966:
---
Shepherd: Vinod Kone

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the API. This happens relatively frequently, and impacts us when downstream 
> frameworks (Marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7966) check for maintenance on agent causes fatal error

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7966:
---
Labels: reliability  (was: )

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the API. This happens relatively frequently, and impacts us when downstream 
> frameworks (Marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7991) fatal, check failed !framework->recovered()

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7991:
---
Shepherd: Vinod Kone

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> ```
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 
> 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: 
> !framework->recovered()
> *** Check failure stack trace: ***
> @ 0x7f7bf80087ed  google::LogMessage::Fail()
> @ 0x7f7bf800a5a0  google::LogMessage::SendToLog()
> @ 0x7f7bf80083d3  google::LogMessage::Flush()
> @ 0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f7bf736fe7e  
> mesos::internal::master::Master::reconcileKnownSlave()
> @ 0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f7bf73a580e  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS
> 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f7bf7f5e69c  process::ProcessBase::visit()
> @ 0x7f7bf7f71403  process::ProcessManager::resume()
> @ 0x7f7bf7f7c127  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f7bf60b5c80  (unknown)
> @ 0x7f7bf58c86ba  start_thread
> @ 0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7921) process::EventQueue sometimes crashes

2017-10-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7921:
---
Target Version/s: 1.4.0, 1.4.1, 1.5.0  (was: 1.4.0)

> process::EventQueue sometimes crashes
> -
>
> Key: MESOS-7921
> URL: https://issues.apache.org/jira/browse/MESOS-7921
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
> Environment: autotools,gcc,--verbose,GLOG_v=1 
> MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details: 
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
>Reporter: Yan Xu
>Assignee: Benjamin Hindman
>Priority: Blocker
> Fix For: 1.4.0
>
> Attachments: 
> FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt, 
> MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on 
> [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/]
>  in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky 
> and shows up in other tests and environments (with or without 
> --enable-lock-free-event-queue) as well.
> {noformat:title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are 
> using GNU date ***
> PC: @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack 
> trace: ***
> @ 0x2b9e29d26330 (unknown)
> @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> @ 0x2b9e25800a40 process::ProcessManager::resume()
> @ 0x2b9e2580f891 
> process::ProcessManager::init_threads()::$_9::operator()()
> @ 0x2b9e2580f7d5 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2b9e2580f7a5 std::_Bind_simple<>::operator()()
> @ 0x2b9e2580f77c std::thread::_Impl<>::_M_run()
> @ 0x2b9e29fe5a60 (unknown)
> @ 0x2b9e29d1e184 start_thread
> @ 0x2b9e2a851ffd (unknown)
> make[3]: *** [CMakeFiles/check] Segmentation fault (core dumped)
> {noformat}
> A bui...@mesos.apache.org query shows many such instances: 
> https://lists.apache.org/list.html?bui...@mesos.apache.org:lte=1M:process%3A%3AEventQueue%3A%3AConsumer%3A%3Aempty



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7921) process::EventQueue sometimes crashes

2017-10-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206683#comment-16206683
 ] 

Benjamin Mahler commented on MESOS-7921:


Re-opening since this has shown up in CI a number of times again.

> process::EventQueue sometimes crashes
> -
>
> Key: MESOS-7921
> URL: https://issues.apache.org/jira/browse/MESOS-7921
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
> Environment: autotools,gcc,--verbose,GLOG_v=1 
> MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details: 
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
>Reporter: Yan Xu
>Assignee: Benjamin Hindman
>Priority: Blocker
> Fix For: 1.4.0
>
> Attachments: 
> FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt, 
> MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on 
> [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/]
>  in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky 
> and shows up in other tests and environments (with or without 
> --enable-lock-free-event-queue) as well.
> {noformat:title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are 
> using GNU date ***
> PC: @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack 
> trace: ***
> @ 0x2b9e29d26330 (unknown)
> @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> @ 0x2b9e25800a40 process::ProcessManager::resume()
> @ 0x2b9e2580f891 
> process::ProcessManager::init_threads()::$_9::operator()()
> @ 0x2b9e2580f7d5 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2b9e2580f7a5 std::_Bind_simple<>::operator()()
> @ 0x2b9e2580f77c std::thread::_Impl<>::_M_run()
> @ 0x2b9e29fe5a60 (unknown)
> @ 0x2b9e29d1e184 start_thread
> @ 0x2b9e2a851ffd (unknown)
> make[3]: *** [CMakeFiles/check] Segmentation fault (core dumped)
> {noformat}
> A bui...@mesos.apache.org query shows many such instances: 
> https://lists.apache.org/list.html?bui...@mesos.apache.org:lte=1M:process%3A%3AEventQueue%3A%3AConsumer%3A%3Aempty



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8100) Authorize standalone container calls from local resource providers.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8100:
--

 Summary: Authorize standalone container calls from local resource 
providers.
 Key: MESOS-8100
 URL: https://issues.apache.org/jira/browse/MESOS-8100
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
 Fix For: 1.5.0


We need to add authorization for a local resource provider to call the 
standalone container API, to prevent the provider from manipulating arbitrary 
containers. We can use the same JWT-based authN/authZ mechanism that we use 
for executors, where the agent will create an auth token for each local 
resource provider instance:
{noformat}
class LocalResourceProvider
{
public:
  static Try<process::Owned<LocalResourceProvider>> create(
      const process::http::URL& url,
      const std::string& workDir,
      const mesos::ResourceProviderInfo& info,
      const Option<std::string>& authToken);

  ...
};
{noformat}
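
For illustration, a hedged sketch of the agent-side call; {{generateAuthToken()}} 
is a stand-in for whatever secret generator the agent uses, not an actual Mesos 
API:
{noformat}
// The agent mints a per-instance token and hands it to the provider,
// which attaches it to its standalone container API calls.
Option<std::string> authToken = generateAuthToken(info);  // Hypothetical.

Try<process::Owned<LocalResourceProvider>> provider =
  LocalResourceProvider::create(url, workDir, info, authToken);
{noformat}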



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8032) Launch CSI plugins in storage local resource providers.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8032:
---
Labels: mesosphere storage  (was: storage)

> Launch CSI plugins in storage local resource providers.
> ---
>
> Key: MESOS-8032
> URL: https://issues.apache.org/jira/browse/MESOS-8032
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: mesosphere, storage
> Fix For: 1.5.0
>
>
> Launching a CSI plugin requires the following steps:
> 1. Verify the configuration.
> 2. Prepare a directory in the work directory of the resource provider where 
> the socket file should be placed, and construct the path of the socket file.
> 3. If the socket file already exists and the plugin is already running, we 
> should not launch another plugin instance.
> 4. Otherwise, launch a standalone container to run the plugin and connect to 
> it through the socket file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8097) Add filesystem layout for local resource providers.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8097:
---
Labels: mesosphere storage  (was: )

> Add filesystem layout for local resource providers.
> ---
>
> Key: MESOS-8097
> URL: https://issues.apache.org/jira/browse/MESOS-8097
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: mesosphere, storage
> Fix For: 1.5.0
>
>
> We need the following two paths for local resource providers:
> 1. Checkpoint directory: this should be tied to the agent ID, since the 
> master only keeps track of total resources on agents.
> 2. Resource provider work directory: we may want this to not be tied to agent 
> IDs since it could store persistent information for external CSI plugins such 
> as domain socket files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8032) Launch CSI plugins in storage local resource provider.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-8032:
---
Summary: Launch CSI plugins in storage local resource provider.  (was: 
Launch CSI plugins in storage local resource providers.)

> Launch CSI plugins in storage local resource provider.
> --
>
> Key: MESOS-8032
> URL: https://issues.apache.org/jira/browse/MESOS-8032
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: mesosphere, storage
> Fix For: 1.5.0
>
>
> Launching a CSI plugin requires the following steps:
> 1. Verify the configuration.
> 2. Prepare a directory in the work directory of the resource provider where 
> the socket file should be placed, and construct the path of the socket file.
> 3. If the socket file already exists and the plugin is already running, we 
> should not launch another plugin instance.
> 4. Otherwise, launch a standalone container to run the plugin and connect to 
> it through the socket file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8101) Import resources from CSI plugins in storage local resource provider.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8101:
--

 Summary: Import resources from CSI plugins in storage local 
resource provider.
 Key: MESOS-8101
 URL: https://issues.apache.org/jira/browse/MESOS-8101
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
 Fix For: 1.5.0


The following lists the steps to import resources from a CSI plugin (an 
illustrative sketch of the call sequence follows the list):
1. Launch the node plugin
1.1 GetSupportedVersions
1.2 GetPluginInfo
1.3 ProbeNode
1.4 GetNodeCapabilities
2. Launch the controller plugin
2.1 GetSupportedVersions
2.2 GetPluginInfo
2.3 GetControllerCapabilities
3. GetCapacity
4. ListVolumes
5. Report to the resource provider through UPDATE_TOTAL_RESOURCES
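
Illustrative only: {{csi::Client}} below is a stand-in for a gRPC client 
wrapper over the CSI node/controller services, and the helpers are 
hypothetical; the method names follow the RPCs listed above:
{noformat}
Try<Resources> importResources(csi::Client& node, csi::Client& controller)
{
  // 1. Interrogate the node plugin.
  node.GetSupportedVersions();
  node.GetPluginInfo();
  node.ProbeNode();
  node.GetNodeCapabilities();

  // 2. Interrogate the controller plugin.
  controller.GetSupportedVersions();
  controller.GetPluginInfo();
  controller.GetControllerCapabilities();

  // 3./4. Discover total capacity and any pre-existing volumes.
  auto capacity = controller.GetCapacity();
  auto volumes = controller.ListVolumes();

  // 5. Convert to Mesos resources and report them back.
  Resources total = convertToResources(capacity, volumes);  // Hypothetical.
  send(createUpdateTotalResources(total));                  // Hypothetical.

  return total;
}
{noformat}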



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8102) Add a test CSI plugin for storage local resource provider.

2017-10-16 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8102:
--

 Summary: Add a test CSI plugin for storage local resource provider.
 Key: MESOS-8102
 URL: https://issues.apache.org/jira/browse/MESOS-8102
 Project: Mesos
  Issue Type: Task
  Components: test
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
 Fix For: 1.5.0


We need a dummy CSI plugin for testing storage local resource providers. The 
test CSI plugin would just create subdirectories under its working directory 
to mimic the behavior of creating volumes, then bind-mount those volumes to 
mimic publishing, as sketched below.
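
A minimal sketch of that mimicking behavior, assuming Linux and stout helpers; 
the function names and directory layout are illustrative:
{noformat}
#include <sys/mount.h>

// "Create" a volume by making a subdirectory under the work directory.
Try<Nothing> createVolume(const std::string& workDir,
                          const std::string& volumeId)
{
  return os::mkdir(path::join(workDir, "volumes", volumeId));
}

// "Publish" a volume by bind-mounting its directory onto the target path
// (assumes targetPath already exists).
Try<Nothing> publishVolume(const std::string& workDir,
                           const std::string& volumeId,
                           const std::string& targetPath)
{
  const std::string source = path::join(workDir, "volumes", volumeId);

  if (::mount(source.c_str(), targetPath.c_str(),
              nullptr, MS_BIND, nullptr) != 0) {
    return ErrnoError("Failed to bind mount '" + source + "'");
  }

  return Nothing();
}
{noformat}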



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8095) ResourceProviderRegistrarTest.AgentRegistrar is flaky.

2017-10-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8095:
---
Shepherd: Alexander Rukletsov

> ResourceProviderRegistrarTest.AgentRegistrar is flaky.
> --
>
> Key: MESOS-8095
> URL: https://issues.apache.org/jira/browse/MESOS-8095
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Bannier
>  Labels: flaky-test, mesosphere
> Fix For: 1.5.0
>
> Attachments: AgentRegistrar-badrun.txt
>
>
> Observed it in internal CI. Test log attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-2013) Slave read endpoint doesn't encode non-ascii characters correctly

2017-10-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206899#comment-16206899
 ] 

Benjamin Mahler commented on MESOS-2013:


[~jgehrcke] the original intention was just to pass through the file contents 
directly. Unfortunately at the time we decided to return the data via JSON and 
didn't quite realize that doing so consisted of interpreting the data as UTF-8 
(this is the default JSON encoding and JSON strings are unicode). The original 
implementation would have had to encode the data (e.g. base64) into the JSON 
string for us to have been agnostic of encoding.

I will resolve this ticket since the description above is fixed (it now works 
without issue if the file contains UTF-8). 

The current state is that if the file contains any encoding other than UTF-8, 
we'll spit out the wrong thing from the endpoint. I linked to MESOS-4642 which 
is one case of this. Ideally, we can instead spit out base64 data at the 
minimum, with potential support for the client telling us which encoding the 
file is expected to be in (and us returning an error if it is invalid), or us 
trying to detect it, or something else.

Note that in the V1 API, this issue is resolved. The 
[Response::ReadFile|https://github.com/apache/mesos/blob/1.4.0/include/mesos/v1/master/master.proto#L278-L284]
 returns a {{bytes}} in protobuf or a base64 JSON string if the client wants 
JSON.

[~jgehrcke] How did you run into this ticket?

> Slave read endpoint doesn't encode non-ascii characters correctly
> -
>
> Key: MESOS-2013
> URL: https://issues.apache.org/jira/browse/MESOS-2013
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Reporter: Whitney Sorenson
>Assignee: Anand Mazumdar
>
> Create a file in a sandbox with a non-ascii character, like this one: 
> http://www.fileformat.info/info/unicode/char/2018/index.htm
> Hit the read endpoint for that file.
> The response will have something like: 
> data: "\u00E2\u0080\u0098"
> It should actually be:
> data: "\u2018"
> If you put either into JSON.parse() in the browser you will see the first 
> does not render correctly but the second does.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8103) The /files/read endpoint assumes UTF-8 encoding of file contents.

2017-10-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8103:
--

 Summary: The /files/read endpoint assumes UTF-8 encoding of file 
contents.
 Key: MESOS-8103
 URL: https://issues.apache.org/jira/browse/MESOS-8103
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Mahler


The original intention of the /files/read endpoint was just to pass through the 
file contents directly to the client.

Unfortunately at the time we decided to return the data via a JSON string and 
didn't quite realize that doing so consisted of interpreting the data as UTF-8 
(this is the default JSON encoding and JSON strings are unicode). The original 
implementation would have had to encode the data (e.g. base64) into the JSON 
string for us to have been agnostic of encoding.

If the file contains any encoding other than UTF-8, the endpoint will emit an 
incorrect or invalid UTF-8 reinterpretation of it.

Ideally, we can instead emit base64 data at the minimum, with potential 
support for the client telling us which encoding the file is expected to be 
in (and us returning an error if it is invalid), or us trying to detect it, 
or something else.

Note that in the V1 API, this issue is resolved. The Response::ReadFile returns 
a bytes in protobuf or a base64 JSON string if the client wants JSON.
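
A minimal sketch of the base64 approach for the old endpoint, using stout's 
{{base64::encode}}; the handler plumbing and variable names here are 
illustrative:
{noformat}
#include <stout/base64.hpp>
#include <stout/json.hpp>

// Encode the raw bytes before embedding them in JSON, so the payload
// stays agnostic of the file's actual encoding.
JSON::Object response;
response.values["offset"] = offset;                  // Illustrative field.
response.values["data"] = base64::encode(fileData);  // fileData: std::string.
{noformat}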



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7790) Design hierarchical quota allocation.

2017-10-16 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7790:

Sprint: Mesosphere Sprint 66

> Design hierarchical quota allocation.
> -
>
> Key: MESOS-7790
> URL: https://issues.apache.org/jira/browse/MESOS-7790
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>  Labels: multitenancy
>
> When quota is assigned in the role hierarchy (see MESOS-6375), it's possible 
> for there to be "undelegated" quota for a role. For example:
> {noformat}
>                  ^
>                /   \
>              /       \
>     eng (90 cpus)   sales (10 cpus)
>          ^
>        /   \
>      /       \
>  ads (50 cpus)   build (10 cpus)
> {noformat}
> Here, the "eng" role has 60 of its 90 cpus of quota delegated to its 
> children, and 30 cpus remain undelegated. We need to design how to allocate 
> these 30 undelegated cpus. Are they allocated entirely to the "eng" 
> role? Are they allocated to the "eng" role tree? If so, how do we determine 
> how much is allocated to each role in the "eng" tree (i.e. "eng", "eng/ads", 
> "eng/build")?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8104) Add code coverage to continuous integration.

2017-10-16 Thread James Peach (JIRA)
James Peach created MESOS-8104:
--

 Summary: Add code coverage to continuous integration.
 Key: MESOS-8104
 URL: https://issues.apache.org/jira/browse/MESOS-8104
 Project: Mesos
  Issue Type: Bug
  Components: build, test
Reporter: James Peach


We should integrate code coverage into the CI testing. Adding the right 
compiler options looks like 
[this|https://github.com/apache/trafficserver/commit/be237ea7ee874355c6f8209b4793dfe4c4fedd88]
 in automake (though we need to remove the bugs). We can push the coverage data 
to coveralls.io from specific test build configurations, and there's even 
precedent for the ASF infra team wiring it into GitHub directly.
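
For instance, a gcov-based run can typically be produced by passing GCC's 
standard coverage flag through configure; this invocation is illustrative, not 
a finalized CI recipe:
{noformat}
./bootstrap && ./configure CFLAGS="--coverage" CXXFLAGS="--coverage" LDFLAGS="--coverage" && make check
{noformat}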



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8104) Add code coverage to continuous integration.

2017-10-16 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207022#comment-16207022
 ] 

James Peach commented on MESOS-8104:


/cc [~vinodkone] [~karya] [~mpark]

> Add code coverage to continuous integration.
> 
>
> Key: MESOS-8104
> URL: https://issues.apache.org/jira/browse/MESOS-8104
> Project: Mesos
>  Issue Type: Bug
>  Components: build, test
>Reporter: James Peach
>
> We should integrate code coverage into the CI testing. Adding the right 
> compiler options looks like 
> [this|https://github.com/apache/trafficserver/commit/be237ea7ee874355c6f8209b4793dfe4c4fedd88]
>  in automake (though we need to remove the bugs). We can push the coverage 
> data to coveralls.io from specific test build configurations, and there's 
> even precedent for the ASF infra team wiring it into GitHub directly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)