[jira] [Created] (MESOS-9411) Validation of JWT tokens using HS256 hashing algorithm is not thread safe.

2018-11-21 Thread Alexander Rojas (JIRA)
Alexander Rojas created MESOS-9411:
--

 Summary: Validation of JWT tokens using HS256 hashing algorithm is 
not thread safe.
 Key: MESOS-9411
 URL: https://issues.apache.org/jira/browse/MESOS-9411
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 1.8.0
Reporter: Alexander Rojas
Assignee: Alexander Rojas


From the [OpenSSL 
documentation|https://www.openssl.org/docs/man1.0.2/crypto/hmac.html]:

{quote}
It places the result in {{md}} (which must have space for the output of the 
hash function, which is no more than {{EVP_MAX_MD_SIZE}} bytes). If {{md}} is 
{{NULL}}, the digest is placed in a static array. The size of the output is 
placed in {{md_len}}, unless it is {{NULL}}. Note: passing a {{NULL}} value for 
{{md}} to use the static array is not thread safe.
{quote}

We are calling {{HMAC()}} as follows:

{code}
  unsigned int md_len = 0;

  unsigned char* rc = HMAC(
  EVP_sha256(),
  secret.data(),
  secret.size(),
  reinterpret_cast<const unsigned char*>(message.data()),
  message.size(),
  nullptr,   // <- This is `md`
  &md_len);

  if (rc == nullptr) {
return Error(addErrorReason("HMAC failed"));
  }

  return string(reinterpret_cast<const char*>(rc), md_len);
{code}

Given that this code does not run inside a process, race conditions could occur.
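
For reference, a minimal sketch of a thread-safe variant (assuming the same 
surrounding {{secret}} and {{message}} variables): passing a caller-owned 
buffer instead of {{NULL}} keeps OpenSSL away from its shared static array.

{code}
  // Sketch only: same call, but with a caller-owned buffer for `md` so that
  // OpenSSL does not fall back to the non-thread-safe static array.
  unsigned char md[EVP_MAX_MD_SIZE];
  unsigned int md_len = 0;

  unsigned char* rc = HMAC(
      EVP_sha256(),
      secret.data(),
      secret.size(),
      reinterpret_cast<const unsigned char*>(message.data()),
      message.size(),
      md,        // <- Caller-owned buffer instead of `nullptr`.
      &md_len);

  if (rc == nullptr) {
    return Error(addErrorReason("HMAC failed"));
  }

  return string(reinterpret_cast<const char*>(md), md_len);
{code}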



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9411) Validation of JWT tokens using HS256 hashing algorithm is not thread safe.

2018-11-21 Thread Alexander Rojas (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694723#comment-16694723
 ] 

Alexander Rojas commented on MESOS-9411:


[r/69412/|https://reviews.apache.org/r/69412/]: Fixed thread safety issue in 
jwt signature validation.

> Validation of JWT tokens using HS256 hashing algorithm is not thread safe.
> --
>
> Key: MESOS-9411
> URL: https://issues.apache.org/jira/browse/MESOS-9411
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.8.0
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>Priority: Major
>  Labels: mesosphere
>
> From the [OpenSSL 
> documentation|https://www.openssl.org/docs/man1.0.2/crypto/hmac.html]:
> {quote}
> It places the result in {{md}} (which must have space for the output of the 
> hash function, which is no more than {{EVP_MAX_MD_SIZE}} bytes). If {{md}} is 
> {{NULL}}, the digest is placed in a static array. The size of the output is 
> placed in {{md_len}}, unless it is {{NULL}}. Note: passing a {{NULL}} value 
> for {{md}} to use the static array is not thread safe.
> {quote}
> We are calling {{HMAC()}} as follows:
> {code}
>   unsigned int md_len = 0;
>   unsigned char* rc = HMAC(
>   EVP_sha256(),
>   secret.data(),
>   secret.size(),
>   reinterpret_cast<const unsigned char*>(message.data()),
>   message.size(),
>   nullptr,   // <- This is `md`
>   &md_len);
>   if (rc == nullptr) {
> return Error(addErrorReason("HMAC failed"));
>   }
>   return string(reinterpret_cast<const char*>(rc), md_len);
> {code}
> Given that this code does not run inside a process, race conditions could 
> occur.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8810) Grant non-root task user the permissions to access the SANDBOX_PATH volume of PARENT type

2018-11-21 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550481#comment-16550481
 ] 

Qian Zhang edited comment on MESOS-8810 at 11/21/18 2:40 PM:
-

RR: https://reviews.apache.org/r/69342/


was (Author: qianzhang):
RR: https://reviews.apache.org/r/67996/

> Grant non-root task user the permissions to access the SANDBOX_PATH volume of 
> PARENT type
> -
>
> Key: MESOS-8810
> URL: https://issues.apache.org/jira/browse/MESOS-8810
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> See [design 
> doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p]
>  for why we need to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9387) Surface errors for publishing CSI volumes in task status updates.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9387:
--

Assignee: (was: Chun-Hung Hsiao)

> Surface errors for publishing CSI volumes in task status updates.
> -
>
> Key: MESOS-9387
> URL: https://issues.apache.org/jira/browse/MESOS-9387
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> Currently, if a CSI volume fails to publish (e.g., due to {{mkfs}} 
> errors), the framework will get a {{TASK_FAILED}} with reason 
> {{REASON_CONTAINER_LAUNCH_FAILED}} or {{REASON_CONTAINER_UPDATE_FAILED}} and 
> the message "{{Failed to publish resources for resource provider XXX: Received 
> FAILED status}}", which is not informative. We should surface the actual 
> error message in the {{TASK_FAILED}} status update.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9387) Surface errors for publishing CSI volumes in task status updates.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695092#comment-16695092
 ] 

Chun-Hung Hsiao commented on MESOS-9387:


Thoughts dump:

We need to at least deliver the error message through the RP API.
However, since we're changing the API, I'm thinking about something more 
aggressive, and more compliant with the ERP story in the RP API:
{noformat}
message Event {
  message PublishResources {
required UUID uuid = 1;

// The set of resources that are required to be published.
repeated Resource required = 2;

// The set of resources that are allowed to be published. Any resource
// beyond this set should be unpublished. This set should contain the set of
// required resources.
repeated Resource allowed = 3;
  }
}

message Call {
  enum Type {
UPDATE_PUBLISHED_RESOURCES = 4; // See 'UpdatePublishedResources'.
  }

  message UpdatePublishedResources {
enum Status {
  UNKNOWN = 0;

  // All required resources are published and all resources that are not in
  // the set of allowed resources are unpublished. In this case, the set of
  // published resources in the `resources` field would be a superset of the
  // required resources and a subset of the allowed resources.
  OK = 1;

  // The resource provider fails to publish certain required resources, or
  // fails to unpublish certain resources that are not in the set of allowed
  // resources. In this case, the set of published resources should still be
  // reported through the `resources` field, and more human-readable
  // information should be provided in the `message` field.
  FAILED = 2;
}

required UUID uuid = 1;
required Status status = 2;
repeated Resource resources = 3;
optional string message = 4;
  }

  optional UpdatePublishedResources updated_published_resources = 6;
}
{noformat}
The {{UpdatePublishedResources}} message is backward-compatible with the 
original {{UpdatePublishResourcesStatus}} message.

The reason for the changes in {{PublishResources}} and 
{{UpdatePublishedResources}} is to make the call idempotent and applicable to 
resources without identifiers.
The agent should keep track of the current set of published resources through 
the {{UpdatePublishedResources.resources}} field.
When launching a task, it should call {{PUBLISH_RESOURCES}} with {{required}} 
set to the set of used resources + resources for the new task,
and {{allowed}} set to the set of required resources + the set of published 
resources,
then examine the new set of published resources to determine whether the task 
is good to be launched, even if it receives a {{FAILED}} status.
If the new set of published resources does not contain the set of resources 
used by the task, the error message is surfaced in the task status update.
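
To make the intended bookkeeping concrete, here is a toy illustration of the 
required/allowed/published logic, using plain string sets in place of Mesos 
{{Resources}} (all names are illustrative, not actual Mesos code):

{code:cpp}
#include <algorithm>
#include <iostream>
#include <set>
#include <string>

// Stand-in for Mesos `Resources`; real code would use Resources arithmetic.
using ResourceSet = std::set<std::string>;

// Returns true if `superset` contains every element of `subset`.
bool contains(const ResourceSet& superset, const ResourceSet& subset)
{
  return std::includes(
      superset.begin(), superset.end(), subset.begin(), subset.end());
}

int main()
{
  ResourceSet used = {"volume-1"};     // Used by already-running tasks.
  ResourceSet newTask = {"volume-2"};  // Needed by the task being launched.
  ResourceSet published = {"volume-1", "volume-3"};  // Currently published.

  // `required` = the set of used resources + resources for the new task.
  ResourceSet required = used;
  required.insert(newTask.begin(), newTask.end());

  // `allowed` = the set of required resources + the set of published ones.
  ResourceSet allowed = required;
  allowed.insert(published.begin(), published.end());

  // Suppose the provider reports this new set of published resources
  // (possibly alongside a FAILED status):
  ResourceSet newPublished = {"volume-1", "volume-2"};

  // The task is good to launch iff all of its resources are published;
  // otherwise the `message` field is surfaced in the task status update.
  std::cout << (contains(newPublished, required) ? "launch" : "surface error")
            << std::endl;
}
{code}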

> Surface errors for publishing CSI volumes in task status updates.
> -
>
> Key: MESOS-9387
> URL: https://issues.apache.org/jira/browse/MESOS-9387
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> Currently, if a CSI volume fails to publish (e.g., due to {{mkfs}} 
> errors), the framework will get a {{TASK_FAILED}} with reason 
> {{REASON_CONTAINER_LAUNCH_FAILED}} or {{REASON_CONTAINER_UPDATE_FAILED}} and 
> the message "{{Failed to publish resources for resource provider XXX: Received 
> FAILED status}}", which is not informative. We should surface the actual 
> error message in the {{TASK_FAILED}} status update.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9254) Make SLRP be able to update its volumes and storage pools.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695121#comment-16695121
 ] 

Chun-Hung Hsiao commented on MESOS-9254:


Thought dumps:

1. Change {{reconcileStoragePools}} to reconcile both storage pools and 
preprovisioned volumes, similar to {{reconcileResourceProviderState}}.
2. Invoke {{reconcileStoragePools}} periodically, with a proper default 
interval.
3. Remove the calls to {{reconcileStoragePools}} in {{watchProfiles}} and 
{{applyDestroyDisk}}. These reconciliations could be done together with 2. 
Alternatively, we can keep the call in {{watchProfiles}} to avoid the delay, 
but that needs some coordination between that call and 2.
4. Most importantly, don't drop operations during reconciliations, since this 
leads to poor user experience. The approach I have in mind is that, in 
{{applyOperation}}, we wait for any ongoing reconciliation to finish before 
checking the resource version and accepting the operation (see the sketch 
after this list). The invariant here is that once an operation has been 
accepted, it is guaranteed to be applicable to the current set of total 
resources, so the next reconciliation would have to wait for these operations 
to become terminal.
5. Optionally, we can optimistically apply the operation even if the resource 
version does not match after the reconciliation to improve user experience, 
provided that we handle the pipeline dependency properly.
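
A toy sketch of the gating in item 4, using {{std::future}} in place of 
libprocess futures (names are illustrative):

{code:cpp}
#include <future>
#include <iostream>

int main()
{
  // Completed by the in-flight reconciliation when it finishes.
  std::promise<void> reconciliation;
  std::shared_future<void> reconciled = reconciliation.get_future().share();

  // An operation waits for any ongoing reconciliation to finish before the
  // resource version is checked and the operation is accepted.
  auto applyOperation = [reconciled](int operationVersion, int currentVersion) {
    reconciled.wait();
    return operationVersion == currentVersion;
  };

  auto pending = std::async(std::launch::async, applyOperation, 1, 1);

  reconciliation.set_value();  // Reconciliation finishes; operation proceeds.
  std::cout << (pending.get() ? "accepted" : "dropped") << std::endl;
}
{code}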

> Make SLRP be able to update its volumes and storage pools.
> --
>
> Key: MESOS-9254
> URL: https://issues.apache.org/jira/browse/MESOS-9254
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> We should consider making SLRP update its resources periodically, or adding 
> an endpoint to trigger that, for the following reasons:
> 1. Mesos currently assumes all profiles have disjoint storage pools. This is 
> because Mesos models each resource independently. However, in practice an 
> operator can set up, say, two profiles, one for linear volumes and one for 
> raid volumes, and an "LVM" resource provider that can provision both linear 
> and raid volumes. Because the storage pools of the linear and raid profiles 
> are correlated, provisioning a volume of one type reduces the capacity of 
> the other pool. To reflect the actual sizes of correlated storage pools, we 
> need a way to make SLRP update its resources.
> 2. The SLRP now only queries the CSI plugin to report a list of volumes 
> during startup, so if a new device is added, the operator will have to 
> restart the agent to trigger another SLRP startup, which is inconvenient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9254) Make SLRP be able to update its volumes and storage pools.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9254:
--

Assignee: (was: Chun-Hung Hsiao)

> Make SLRP be able to update its volumes and storage pools.
> --
>
> Key: MESOS-9254
> URL: https://issues.apache.org/jira/browse/MESOS-9254
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> We should consider making SLRP update its resources periodically, or adding 
> an endpoint to trigger that, for the following reasons:
> 1. Mesos currently assumes all profiles have disjoint storage pools. This is 
> because Mesos models each resource independently. However, in practice an 
> operator can set up, say, two profiles, one for linear volumes and one for 
> raid volumes, and an "LVM" resource provider that can provision both linear 
> and raid volumes. Because the storage pools of the linear and raid profiles 
> are correlated, provisioning a volume of one type reduces the capacity of 
> the other pool. To reflect the actual sizes of correlated storage pools, we 
> need a way to make SLRP update its resources.
> 2. The SLRP now only queries the CSI plugin to report a list of volumes 
> during startup, so if a new device is added, the operator will have to 
> restart the agent to trigger another SLRP startup, which is inconvenient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8370) `liburi_volume_profile.la` cannot be built standalone.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8370:
--

Assignee: (was: Chun-Hung Hsiao)

> `liburi_volume_profile.la` cannot be built standalone.
> --
>
> Key: MESOS-8370
> URL: https://issues.apache.org/jira/browse/MESOS-8370
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Priority: Major
>
> Currently the `liburi_volume_profile.la` module cannot be built standalone. 
> The reason is that this module depends on the following three generated 
> header files:
> {noformat}
> ../include/csi/csi.grpc.pb.h
> ../include/csi/csi.pb.h
> resource_provider/storage/volume_profile.pb.h
> {noformat}
> But there is no way in autotools to specify such dependencies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9109) Windows agent uses reserved character :(colon) for file name and crashes when attempting to remove link

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9109:
--

Assignee: (was: Chun-Hung Hsiao)

> Windows agent uses reserved character :(colon) for file name and crashes when 
> attempting to remove link
> ---
>
> Key: MESOS-9109
> URL: https://issues.apache.org/jira/browse/MESOS-9109
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.7.0
>Reporter: Constantin Eduard Staniloiu
>Priority: Blocker
>  Labels: windows
>
> I have a hybrid cluster running Mesos Agents on Windows, and I am using 
> Chronos to launch jobs on Windows Agents.
> Chronos is using the character : (colon) internally when spawning jobs. The 
> Windows Mesos Agent spawns those jobs and creates the paths on disk, but when 
> the job terminates and the agent attempts to remove the link, it crashes with 
> the following error message:
>   
> {code:java}
> I0719 09:20:00.621385 14788 gc.cpp:129] Unscheduling 
> 'D:\ws\mes-wd\meta\slaves\5563b512-518e-44c6-bdc1-3c927d0622da-S1\frameworks\77a0fb6f-3c43-4d7b-ae16-af2dfd728567-\executors\ct:153200640:0
> :sample-child-job-lv2:' from gc
> I0719 09:20:00.622387 24124 slave.cpp:2406] Authorizing task 
> 'ct:153200640:0:sample-child-job2:' for framework 
> 77a0fb6f-3c43-4d7b-ae16-af2dfd728567-
> I0719 09:20:00.630340 24124 slave.cpp:2406] Authorizing task 
> 'ct:153200640:0:sample-child-job-lv2:' for framework 
> 77a0fb6f-3c43-4d7b-ae16-af2dfd728567-
> I0719 09:20:00.644341 24124 slave.cpp:2849] Launching task 
> 'ct:153200640:0:sample-child-job2:' for framework 
> 77a0fb6f-3c43-4d7b-ae16-af2dfd728567-
> I0719 09:20:00.649345 24124 paths.cpp:748] Creating sandbox 
> 'D:\ws\mes-wd\slaves\5563b512-518e-44c6-bdc1-3c927d0622da-S1\frameworks\77a0fb6f-3c43-4d7b-ae16-af2dfd728567-\executors\ct:153200640
> :0:sample-child-job2:\runs\cecbf7ab-ace3-4f45-a208-9c104f69624c'
> F0719 09:20:00.653342 24124 paths.cpp:763] CHECK_SOME(os::rm(latest)): The 
> filename, directory name, or volume label syntax is incorrect.
> Failed to remove latest symlink 
> 'D:\ws\mes-wd\slaves\5563b512-518e-44c6-bdc1-3c927d0622da-S1\frameworks\77a0fb6f-3c43-4d7b-ae16-af2dfd728567-\executors\ct:153200640:0:sample-child-job2:\runs\
> latest'
> *** Check failure stack trace: ***
> {code}
>  
> The problem seems to be the job name: 
> {code:java}
> 'ct:153200640:0:sample-child-job2:'
> {code}
> Chronos internally uses : (colon), which is a reserved character on 
> Windows: 
> https://docs.microsoft.com/en-us/windows/desktop/FileIO/naming-a-file
>  
> I believe it's the responsibility of the agent to check and sanitize the task 
> names against restricted characters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8428) SLRP recovery tests leak file descriptors.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8428:
--

Assignee: (was: Chun-Hung Hsiao)

> SLRP recovery tests leak file descriptors.
> --
>
> Key: MESOS-8428
> URL: https://issues.apache.org/jira/browse/MESOS-8428
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> The {{CreateDestroyVolumeRecovery}} (formerly {{NewVolumeRecovery}}) and 
> {{PublishResourcesRecovery}} (formerly {{LaunchTaskRecovery}}) tests leak 
> fds. When running them repeatedly, either the following error will 
> manifest:
> {noformat}
> process_posix.hpp:257] CHECK_SOME(pipe): Too many open files
> {noformat}
> or the plugin container will exit, possibly due to running out of fds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695138#comment-16695138
 ] 

Chun-Hung Hsiao commented on MESOS-9223:


There are two problems here:
1. How to make the agent more robust when handling SLRP failures.
2. How to surface the SLRP failures.

It seems to me 2 can be addressed by MESOS-8380.

Thought dumps for 1:
We can make the {{LocalResourceProviderDaemon}} act like systemd: retry 
launching the SLRP when there is a launch failure, potentially with an 
exponential backoff.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough. If, 
> e.g., an RP container fails to come up during node start, a warning would be 
> logged, but an operator still needs to detect the degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or retry the 
> operation (optionally up to some maximum number of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9223:
--

Assignee: (was: Chun-Hung Hsiao)

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough. If, 
> e.g., an RP container fails to come up during node start, a warning would be 
> logged, but an operator still needs to detect the degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or retry the 
> operation (optionally up to some maximum number of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9109) Windows agent uses reserved character :(colon) for file name and crashes when attempting to remove link

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695132#comment-16695132
 ] 

Chun-Hung Hsiao commented on MESOS-9109:


Unassigned myself for now since I'm going on call. [~bbannier] It would be great 
if you could pick this up. Otherwise I'll pick it up after my on-call rotation.
My proposed approach is in the following link:
https://lists.apache.org/thread.html/49961f30445eac7d61ef6cf1a384646760dad3754d3c332d9a9ae18a@%3Cuser.mesos.apache.org%3E

> Windows agent uses reserved character :(colon) for file name and crashes when 
> attempting to remove link
> ---
>
> Key: MESOS-9109
> URL: https://issues.apache.org/jira/browse/MESOS-9109
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.7.0
>Reporter: Constantin Eduard Staniloiu
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>  Labels: windows
>
> I have a hybrid cluster running Mesos Agents on Windows, and I am using 
> Chronos to launch jobs on Windows Agents.
> Chronos is using the character : (colon) internally when spawning jobs. The 
> Windows Mesos Agent spawns those jobs and creates the paths on disk, but when 
> the job terminates and the agent attempts to remove the link, it crashes with 
> the following error message:
>   
> {code:java}
> I0719 09:20:00.621385 14788 gc.cpp:129] Unscheduling 
> 'D:\ws\mes-wd\meta\slaves\5563b512-518e-44c6-bdc1-3c927d0622da-S1\frameworks\77a0fb6f-3c43-4d7b-ae16-af2dfd728567-\executors\ct:153200640:0
> :sample-child-job-lv2:' from gc
> I0719 09:20:00.622387 24124 slave.cpp:2406] Authorizing task 
> 'ct:153200640:0:sample-child-job2:' for framework 
> 77a0fb6f-3c43-4d7b-ae16-af2dfd728567-
> I0719 09:20:00.630340 24124 slave.cpp:2406] Authorizing task 
> 'ct:153200640:0:sample-child-job-lv2:' for framework 
> 77a0fb6f-3c43-4d7b-ae16-af2dfd728567-
> I0719 09:20:00.644341 24124 slave.cpp:2849] Launching task 
> 'ct:153200640:0:sample-child-job2:' for framework 
> 77a0fb6f-3c43-4d7b-ae16-af2dfd728567-
> I0719 09:20:00.649345 24124 paths.cpp:748] Creating sandbox 
> 'D:\ws\mes-wd\slaves\5563b512-518e-44c6-bdc1-3c927d0622da-S1\frameworks\77a0fb6f-3c43-4d7b-ae16-af2dfd728567-\executors\ct:153200640
> :0:sample-child-job2:\runs\cecbf7ab-ace3-4f45-a208-9c104f69624c'
> F0719 09:20:00.653342 24124 paths.cpp:763] CHECK_SOME(os::rm(latest)): The 
> filename, directory name, or volume label syntax is incorrect.
> Failed to remove latest symlink 
> 'D:\ws\mes-wd\slaves\5563b512-518e-44c6-bdc1-3c927d0622da-S1\frameworks\77a0fb6f-3c43-4d7b-ae16-af2dfd728567-\executors\ct:153200640:0:sample-child-job2:\runs\
> latest'
> *** Check failure stack trace: ***
> {code}
>  
> The problem seems to be the job name: 
> {code:java}
> 'ct:153200640:0:sample-child-job2:'
> {code}
> Chronos internally uses : (colon), which is a reserved character on 
> Windows: 
> https://docs.microsoft.com/en-us/windows/desktop/FileIO/naming-a-file
>  
> I believe it's the responsibility of the agent to check and sanitize the task 
> names against restricted characters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8467) Destroyed executors might be used after `Slave::publishResource()`.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8467:
--

Assignee: (was: Chun-Hung Hsiao)

> Destroyed executors might be used after `Slave::publishResource()`.
> ---
>
> Key: MESOS-8467
> URL: https://issues.apache.org/jira/browse/MESOS-8467
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Priority: Critical
>
> In the following code from 
> [https://github.com/apache/mesos/blob/7b30b9ccd63dbcd3375e012dae6e2ffb9dc6a79f/src/slave/slave.cpp#L2652:]
> {code:cpp}
> publishResources()
>   .then(defer(self(), [=] {
> return containerizer->update(
> executor->containerId,
> executor->allocatedResources());
>   }))
> {code}
> A destroyed executor might be dereferenced if it has been moved to 
> {{Framework.completedExecutors}} and kicked out of this circular buffer. We 
> should refactor {{Slave::publishResources()}} and its uses to make the code 
> less fragile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695138#comment-16695138
 ] 

Chun-Hung Hsiao edited comment on MESOS-9223 at 11/21/18 7:18 PM:
--

There are two problems here:
 1. How to make the agent more robust when handling SLRP failures.
 2. How to surface the SLRP failures.

It seems to me 2 can be addressed by MESOS-8380.

Thought dumps for 1:

 We can make the {{LocalResourceProviderDaemon}} act like systemd: retry 
launching the SLRP when there is a launch failure, potentially with an 
exponential backoff.
 To achieve this, we'll need a way for the daemon to monitor SLRP failures.
Since the daemon is not aware of the resource provider manager (and should not 
be, to keep the coupling low),
 we could add the following virtual method to {{LocalResourceProvider}}, 
similar to {{ContainerDaemon::wait}}:
{noformat}
  // Returns a future that only reaches a terminal state when a local resource
  // provider is terminated. This is intended to capture any fatal error
  // encountered by the resource provider.
  virtual process::Future<Nothing> wait() = 0;
{noformat}
Then we retry launching a new SLRP instance whenever the future returned by 
{{wait}} fails, as sketched below.
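
A toy sketch of such a supervision loop, with a plain bool-returning callable 
standing in for launching the SLRP and awaiting its {{wait}} future (names are 
illustrative):

{code:cpp}
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Relaunch a resource provider with exponential backoff whenever its
// `wait()` equivalent reports a failure. `launchAndWait` returns true when
// the provider terminates cleanly and false when it fails.
void superviseWithBackoff(
    const std::function<bool()>& launchAndWait,
    std::chrono::milliseconds backoff,
    std::chrono::milliseconds maxBackoff)
{
  while (!launchAndWait()) {
    std::cerr << "Provider failed; relaunching in " << backoff.count()
              << "ms" << std::endl;
    std::this_thread::sleep_for(backoff);
    backoff = std::min(backoff * 2, maxBackoff);
  }
}
{code}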


was (Author: chhsia0):
There are two problems here:
1. How to make the agent more robust when handling SLRP failures.
2. How to surface the SLRP failures.

It seems to me 2 can be addressed by MESOS-8380.

Thought dumps for 1:
We can make the {{LocalResourceProviderDaemon}} act like systemd: retry 
launching the SLRP when there is a launch failure, potentially with an 
exponential backoff.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough. If, 
> e.g., an RP container fails to come up during node start, a warning would be 
> logged, but an operator still needs to detect the degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or retry the 
> operation (optionally up to some maximum number of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8400) Retry logic for CSI calls when plugin crashes

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695160#comment-16695160
 ] 

Chun-Hung Hsiao commented on MESOS-8400:


Thought dumps:

This can be tackled in either of two ways:
1. Add retry logic with an exponential backoff in the 
{{StorageLocalResourceProviderProcess::call}} method.
2. Fail the resource provider and simply rely on MESOS-9223 to restart a new 
instance. Pros and cons:
* + SLRP no longer needs to manage its container daemon; it can just do a 
launch and fail itself if {{Probe}} fails. In the future we may want an 
external orchestrator, e.g., Marathon, to manage the lifecycle of a local 
resource provider, to enable features like rolling upgrades. To achieve this, 
we can add a very simple relaunch policy to the default executor, and make it 
responsible for relaunching the SLRP pod containing an SLRP task and a CSI 
task upon failure.
* - A failure would lead to RP reregistration, and therefore multiple 
{{UpdateSlaveMessage}}s.

In that future vision, 1 might still be needed if the relaunch policy is on a 
per-task basis instead of a per-pod basis.
So we can go for 1 for now (see the sketch below), and do the remaining 
refactoring in the future.
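
A toy sketch of the per-call retry in 1, with a bool-returning callable 
standing in for the gRPC-based CSI call (names are illustrative):

{code:cpp}
#include <chrono>
#include <functional>
#include <thread>

// Retry a CSI call with exponential backoff, bounded by `maxAttempts`.
// Returns true once the call succeeds, or false after exhausting retries.
bool callWithRetry(
    const std::function<bool()>& call,
    int maxAttempts,
    std::chrono::milliseconds backoff)
{
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (call()) {
      return true;
    }

    std::this_thread::sleep_for(backoff * (1 << attempt));  // 1x, 2x, 4x, ...
  }

  return false;
}
{code}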

> Retry logic for CSI calls when plugin crashes
> -
>
> Key: MESOS-8400
> URL: https://issues.apache.org/jira/browse/MESOS-8400
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its 
> corresponding {{csi::Client}} service future. However, if there is a racy CSI 
> call, the call may be issued before the future is reset, resulting in a 
> failure for that CSI call. This could be avoided by introducing a retry 
> logic. The following lists two possibilities:
> 1. If a GRPC channel can continue to work after its underlying domain socket 
> is unbound, removed, and bound to the same filename (but a different fd) 
> again, then we can consider implementing the retry logic in `csi::Client`. 
> The downside is that the racy call would go to the old future and all 
> subsequent calls would go to the new future set up by the container daemon.
> 2. If the GRPC channel is bound to the domain socket fd, then we need to 
> implement the retry logic in SLRP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8561) Test profile checkpointing in SLRP

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8561:
--

Assignee: (was: Chun-Hung Hsiao)

> Test profile checkpointing in SLRP
> --
>
> Key: MESOS-8561
> URL: https://issues.apache.org/jira/browse/MESOS-8561
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> Once MESOS-8492 is addressed, we should add a test to verify that RP can 
> recover and complete pending `CREATE_VOLUME` and `CREATE_BLOCK` operations 
> even if the disk profile adaptor no longer knows about the profiles these 
> operations use.
> Note that this is currently not doable since there is no way to "pause" the 
> progress of a `CREATE_VOLUME` and fail the resource provider. Once MESOS-9003 
> is done this becomes viable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8592) Avoid failure for invalid profile in `UriDiskProfileAdaptor`

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8592:
--

Assignee: (was: Chun-Hung Hsiao)

> Avoid failure for invalid profile in `UriDiskProfileAdaptor`
> 
>
> Key: MESOS-8592
> URL: https://issues.apache.org/jira/browse/MESOS-8592
> Project: Mesos
>  Issue Type: Bug
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> We should be defensive and not fail the profile module when the user provides 
> an invalid profile in the profile matrix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8560) Test resource provider selection for URI disk profile adaptor.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8560:
--

Assignee: (was: Chun-Hung Hsiao)

> Test resource provider selection for URI disk profile adaptor.
> --
>
> Key: MESOS-8560
> URL: https://issues.apache.org/jira/browse/MESOS-8560
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> The {{DiskProfileAdaptor}} module provides an interface to filter resource 
> providers for a given profile, and is implemented in 
> {{UriDiskProfileAdaptor}}. We should add tests for the filtering feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9004) Add unit tests for dropping operations during SLRP reconciliation.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695165#comment-16695165
 ] 

Chun-Hung Hsiao commented on MESOS-9004:


Note: this is subject to the changes in MESOS-9254.

> Add unit tests for dropping operations during SLRP reconciliation.
> --
>
> Key: MESOS-9004
> URL: https://issues.apache.org/jira/browse/MESOS-9004
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> We should add unit tests to verify that the SLRP will reject certain 
> operations and not others during reconciliation.
> Note that this is currently not doable since there is no way to "pause" the 
> progress of a reconciliation. Once MESOS-9003 is done this becomes viable by 
> delaying the response of the {{GetCapacity}} CSI call through a mock plugin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9016) Make all SLRP tests become non-root tests.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9016:
--

Assignee: (was: Chun-Hung Hsiao)

> Make all SLRP tests become non-root tests.
> --
>
> Key: MESOS-9016
> URL: https://issues.apache.org/jira/browse/MESOS-9016
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage, test
>
> All SLRP and Resource Provider Config API tests enable {{filesystem/linux}} 
> isolation and thus require root permissions. However, we could instead remove 
> the isolation and unmount all paths under the test sandboxes to make them 
> non-root tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8745) Add a `LIST_RESOURCE_PROVIDER_CONFIGS` agent API call.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8745:
--

Assignee: (was: Chun-Hung Hsiao)

> Add a `LIST_RESOURCE_PROVIDER_CONFIGS` agent API call.
> --
>
> Key: MESOS-8745
> URL: https://issues.apache.org/jira/browse/MESOS-8745
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Chun-Hung Hsiao
>Priority: Minor
>  Labels: mesosphere, storage
>
> For API completeness, it would be nice if we could provide a call to list all 
> valid resource provider configs on an agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9016) Make all SLRP tests become non-root tests.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695173#comment-16695173
 ] 

Chun-Hung Hsiao commented on MESOS-9016:


Most of the work has been finished by [~bbannier]: 
https://reviews.apache.org/r/68472/.
The remaining work is to make tests related to publishing resources non-root.
We might be able to achieve this by doing {{fs::unmountAll}} in the 
{{TearDown}} function of the test fixture.
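
A minimal sketch of that idea, assuming the fixture's {{sandbox}} member and 
{{fs::unmountAll}} from {{linux/fs.hpp}} (the fixture name and details are 
illustrative):

{code:cpp}
// Unmount everything left under the test sandbox in TearDown so that the
// test no longer needs the `filesystem/linux` isolator (and hence root).
class StorageLocalResourceProviderTest : public MesosTest
{
protected:
  void TearDown() override
  {
    // `sandbox` is the per-test work directory provided by the fixture.
    if (sandbox.isSome()) {
      EXPECT_SOME(fs::unmountAll(sandbox.get()));
    }

    MesosTest::TearDown();
  }
};
{code}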

> Make all SLRP tests become non-root tests.
> --
>
> Key: MESOS-9016
> URL: https://issues.apache.org/jira/browse/MESOS-9016
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage, test
>
> All SLRP and Resource Provider Config API tests enable {{filesystem/linux}} 
> isolation and thus require root permissions. However, we could instead remove 
> the isolation and unmount all paths under the test sandboxes to make them 
> non-root tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9004) Add unit tests for dropping operations during SLRP reconciliation.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9004:
--

Assignee: (was: Chun-Hung Hsiao)

> Add unit tests for dropping operations during SLRP reconciliation.
> --
>
> Key: MESOS-9004
> URL: https://issues.apache.org/jira/browse/MESOS-9004
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> We should add unit tests to verify that the SLRP will reject certain 
> operations and not others during reconciliation.
> Note that this is currently not doable since there is no way to "pause" the 
> progress of a reconciliation. Once MESOS-9003 is done this becomes viable by 
> delaying the response of the {{GetCapacity}} CSI call through a mock plugin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8760) Make resource provider aware of workloads.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695192#comment-16695192
 ] 

Chun-Hung Hsiao commented on MESOS-8760:


If we go with MESOS-9387, an alternative approach for this is to add a 
framework-supplied workload ID in the {{Volume}} protobuf, then do proper 
unpublish/publish in SLRP.

> Make resource provider aware of workloads.
> --
>
> Key: MESOS-8760
> URL: https://issues.apache.org/jira/browse/MESOS-8760
> Project: Mesos
>  Issue Type: Task
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>
> Since the {{NodePublishVolume}} CSI call is supposed to be called for each 
> workload, SLRP itself should be aware of workloads. Potentially, we could 
> have the following event in the resource provider API:
> {noformat}
> // Received when the master or agent wants to update the resource usage of
> // this resource provider for each workload (e.g., framework or container).
> message ApplyResourceUsage {
>   required UUID uuid = 1;
>   // A map from a workload identifier (e.g., FrameworkID or ContainerID) to
>   // the resources used by the workload.
>   map<string, Resources> resources = 2;
> }
> {noformat}
> For SLRP or any local resource provider, a workload is a container, and SLRP 
> can implement {{ApplyResourceUsage}} by checking if a resource is used by a 
> new workload, and calling {{NodeUnpublishVolume}} and {{NodePublishVolume}} 
> accordingly.
> For ERP, a workload can be a framework, so the resource provider can 
> checkpoint which framework is using what resources and provide such 
> information to the allocator after a failover.
> Note that the {{ApplyResourceUsage}} call should report *all* resources being 
> used on an agent, so it can handle resources without identifiers (such as 
> cpus, mem) correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9369) Avoid blocking `Future::get()` calls

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9369:
--

Assignee: (was: Chun-Hung Hsiao)

> Avoid blocking `Future::get()` calls
> 
>
> Key: MESOS-9369
> URL: https://issues.apache.org/jira/browse/MESOS-9369
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: libprocess
>
> {{Future::get()}} blocks if the future is still pending. If this is 
> accidentally called in an actor, the actor will be blocked. We should avoid 
> calling {{Future::get()}} in the code. The plan would be:
>  # Introduce {{Future::value()}}: crash if not READY
>  # Make {{Future::operator*}} and {{Future::operator->}} akin to 
> {{Future::value()}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (MESOS-9312) SLRP reports the same pre-existing disk multiple times.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-9312:
---
Comment: was deleted

(was: This shares the same root cause as MESOS-9395.)

> SLRP reports the same pre-existing disk multiple times.
> ---
>
> Key: MESOS-9312
> URL: https://issues.apache.org/jira/browse/MESOS-9312
> Project: Mesos
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 1.5.0, 1.6.0, 1.7.0
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> We have observed the following error messages in our testing cluster:
> {noformat}
> W1011 21:04:17.00 4261 provider.cpp:1393] Missing converted resource 
> 'disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11, dss-asset-id: 
> lvm:vg(lvm-double-1539097173-374):e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11})])[RAW(xvdk,)]:3072'.
>  This might cause further operations to fail.
> W1011 21:04:17.00 4261 provider.cpp:1393] Missing converted resource 
> 'disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-eacdd75c-6f06-417f-9d4c-3aabc467f687, dss-asset-id: 
> lvm:vg(lvm-single-1539097161-144):eacdd75c-6f06-417f-9d4c-3aabc467f687})])[RAW(xvdi,)]:2048'.
>  This might cause further operations to fail.
> W1011 21:04:17.00 4261 provider.cpp:1393] Missing converted resource 
> 'disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11, dss-asset-id: 
> lvm:vg(lvm-double-1539097173-374):e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11})])[RAW(xvdj,)]:3072'.
>  This might cause further operations to fail.
> I1011 21:04:17.00 4261 provider.cpp:3515] Sending UPDATE_STATE call with 
> resources 'disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvda,)]:40960; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvda1,)]:40958.984; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvde,)]:256000; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdf,)]:102400; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdg,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdh,)]:512000; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdp,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11, dss-asset-id: 
> lvm:vg(lvm-double-1539097173-374):e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11})])[RAW(xvdk,)]:3072;
>  disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-eacdd75c-6f06-417f-9d4c-3aabc467f687, dss-asset-id: 
> lvm:vg(lvm-single-1539097161-144):eacdd75c-6f06-417f-9d4c-3aabc467f687})])[RAW(xvdi,)]:2048;
>  disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11, dss-asset-id: 
> lvm:vg(lvm-double-1539097173-374):e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11})])[RAW(xvdj,)]:3072;
>  disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdm,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdn,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdo,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdj,)]:3072; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdi,)]:2048; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdk,)]:3072' and 2 
> operations to agent 57f52a33-b0db-49e8-b41c-913bbd75b543-S21{noformat}
> However, judging by the last line, the {{xvdi}}, {{xvdj}} and {{xvdk}} disks 
> were not missing, and the SLRP reported them twice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9312) SLRP reports the same pre-existing disk multiple times.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695196#comment-16695196
 ] 

Chun-Hung Hsiao commented on MESOS-9312:


This shares the same root cause as MESOS-9395.

> SLRP reports the same pre-existing disk multiple times.
> ---
>
> Key: MESOS-9312
> URL: https://issues.apache.org/jira/browse/MESOS-9312
> Project: Mesos
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 1.5.0, 1.6.0, 1.7.0
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> We have observed the following error messages in our testing cluster:
> {noformat}
> W1011 21:04:17.00 4261 provider.cpp:1393] Missing converted resource 
> 'disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11, dss-asset-id: 
> lvm:vg(lvm-double-1539097173-374):e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11})])[RAW(xvdk,)]:3072'.
>  This might cause further operations to fail.
> W1011 21:04:17.00 4261 provider.cpp:1393] Missing converted resource 
> 'disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-eacdd75c-6f06-417f-9d4c-3aabc467f687, dss-asset-id: 
> lvm:vg(lvm-single-1539097161-144):eacdd75c-6f06-417f-9d4c-3aabc467f687})])[RAW(xvdi,)]:2048'.
>  This might cause further operations to fail.
> W1011 21:04:17.00 4261 provider.cpp:1393] Missing converted resource 
> 'disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11, dss-asset-id: 
> lvm:vg(lvm-double-1539097173-374):e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11})])[RAW(xvdj,)]:3072'.
>  This might cause further operations to fail.
> I1011 21:04:17.00 4261 provider.cpp:3515] Sending UPDATE_STATE call with 
> resources 'disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvda,)]:40960; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvda1,)]:40958.984; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvde,)]:256000; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdf,)]:102400; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdg,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdh,)]:512000; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdp,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11, dss-asset-id: 
> lvm:vg(lvm-double-1539097173-374):e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11})])[RAW(xvdk,)]:3072;
>  disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-eacdd75c-6f06-417f-9d4c-3aabc467f687, dss-asset-id: 
> lvm:vg(lvm-single-1539097161-144):eacdd75c-6f06-417f-9d4c-3aabc467f687})])[RAW(xvdi,)]:2048;
>  disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal),(DYNAMIC,dcos-storage/claimed,storage-principal,{dss-claim-id:
>  createvg-e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11, dss-asset-id: 
> lvm:vg(lvm-double-1539097173-374):e8bb2b69-c3c3-4f30-9cf4-6b61fe38dc11})])[RAW(xvdj,)]:3072;
>  disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdm,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdn,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdo,)]:51200; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdj,)]:3072; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdi,)]:2048; 
> disk(reservations: 
> [(DYNAMIC,dcos-storage,storage-principal)])[RAW(xvdk,)]:3072' and 2 
> operations to agent 57f52a33-b0db-49e8-b41c-913bbd75b543-S21{noformat}
> However, judging by the last line, the {{xvdi}}, {{xvdj}} and {{xvdk}} disks 
> were not missing, and the SLRP reported them twice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8760) Make resource provider aware of workloads.

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8760:
--

Assignee: (was: Chun-Hung Hsiao)

> Make resource provider aware of workloads.
> --
>
> Key: MESOS-8760
> URL: https://issues.apache.org/jira/browse/MESOS-8760
> Project: Mesos
>  Issue Type: Task
>Reporter: Chun-Hung Hsiao
>Priority: Major
>
> Since the {{NodePublishVolume}} CSI call is supposed to be called for each 
> workload, SLRP itself should be aware of workloads. Potentially, we could 
> have the following event in the resource provider API:
> {noformat}
> // Received when the master or agent wants to update the resource usage of
> // this resource provider for each workload (e.g., framework or container).
> message ApplyResourceUsage {
>   required UUID uuid = 1;
>   // A map from a workload identifier (e.g., FrameworkID or ContainerID) to
>   // the resources used by the workload.
>   map<string, Resources> resources = 2;
> }
> {noformat}
> For SLRP or any local resource provider, a workload is a container, and SLRP 
> can implement {{ApplyResourceUsage}} by checking if a resource is used by a 
> new workload, and calling {{NodeUnpublishVolume}} and {{NodePublishVolume}} 
> accordingly.
> For ERP, a workload can be a framework, so the resource provider can 
> checkpoint which framework is using what resources and provide such 
> information to the allocator after a failover.
> Note that the {{ApplyResourceUsage}} call should report *all* resources being 
> used on an agent, so it can handle resources without identifiers (such as 
> cpus, mem) correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)