date:20161215


 [ 
https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6780:
---
Attachment: attach_container_input_no_ssl.log

This test fails consistently for me on OSX Sierra without SSL enabled. Verbose 
test log attached.

> ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably
> ---
>
> Key: MESOS-6780
> URL: https://issues.apache.org/jira/browse/MESOS-6780
> Project: Mesos
>  Issue Type: Bug
> Environment: Mac OS 10.12, clang version 4.0.0 
> (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) 
> (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), 
> libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46
>Reporter: Benjamin Bannier
> Attachments: attach_container_input_no_ssl.log
>
>
> The test {{ContentType/AgentAPIStreamTest.AttachContainerInput}} (both {{/0}} 
> and {{/1}}) fails consistently for me in an SSL-enabled, optimized build.
> {code}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ContentType/AgentAPIStreamingTest
> [ RUN  ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1212 17:11:12.393844 17362944 master.cpp:380] Master 
> c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on 
> 172.18.8.114:51059
> I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master"
>  --zk_session_timeout="10secs"
> I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1212 17:11:12.394691 17362944 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials'
> I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL
> I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled
> I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master!
> I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar
> I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the 
> registry (0B) in 4.131072ms
> I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in 
> 27us; attempting to update the registry
> I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the 
> registry in 4.10496ms
> I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered 
> registrar
> I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the 
> registry (136B); allowing 10mins for agents to re-register
> I1212 17:11:12.422780 3971208128 containerizer.cpp:220] Using isolation: 
>

[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`

[
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Alex Clemmer updated MESOS-6803:

Description:
When an agent registers, there is currently a somewhat subtle difference in
behavior between the cases when it does and does not authenticate:

* In the case that a credential IS NOT passed to the agent, we will choose a
random time between 0 and the agent `registration_backoff_factor` to initiate
registration. The reason for this is to avoid every agent hitting the master at
once during master failover. (We also employ backoff to help with this.) See: [1]
* In the case that a credential IS passed to the agent, we always attempt to
authenticate and register the Agent immediately. So currently, in authenticated
clusters, all agents will immediately try to register with the master after a
failover, though this is mitigated somewhat by the fact that the authenticated
codepath still uses backoff. See: [2]
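
As a rough illustration of the unauthenticated behavior described above (this
is a sketch, not the Mesos implementation), the initial delay can be modeled as
a uniform draw between 0 and the backoff factor. The function name
`initialRegistrationDelay` and the use of `std::chrono` and `std::mt19937` are
assumptions made for the example:

```cpp
#include <cassert>
#include <chrono>
#include <random>

// Hypothetical helper, not the Mesos implementation: pick a uniformly random
// initial delay in [0, backoffFactor) so that agents do not all contact the
// master at the same instant after a failover.
std::chrono::milliseconds initialRegistrationDelay(
    std::chrono::milliseconds backoffFactor,
    std::mt19937& rng)
{
  // A non-positive factor degenerates to "register immediately", which is
  // how the authenticated codepath behaves today.
  if (backoffFactor.count() <= 0) {
    return std::chrono::milliseconds(0);
  }

  std::uniform_int_distribution<long long> dist(0, backoffFactor.count() - 1);
  return std::chrono::milliseconds(dist(rng));
}
```

Applying the same draw on the authenticated codepath would spread the first
authentication attempts over the same window instead of issuing them all at
once.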

It is important to resolve this disparity, not only to make the system more
resilient, but also because it directly blocks us from passing many tests on
platforms where authentication is not supported at all (Windows in particular).
Note that there are several solutions that won't work in this case:

* The default credential in the agent tests is CRAM-MD5. Windows doesn't
support this, so passing CRAM-MD5 as the default will cause these tests to fail
on Windows. So the only sensible default here, at least if we want the Windows
tests to pass, is `none`.
* We can't remove the `delay` from the non-authenticated codepath. This
provides real value to large-scale users.
* Setting `registration_backoff_factor` to 0 or -1 will change the semantics
of backoff for tests: specifically, it will cause all attempts to `delay`
registration to execute immediately. It is highly undesirable to exercise a
different registration backoff in tests.

So, the best long-term solution is probably to just fix the tests to work in
both the `delay`'d and non-`delay`'d cases.

For some time, we have meant to make both the authenticated and unauthenticated
codepaths use a random `delay` to begin. See Adam's TODO in [3]. Historically,
people seem to have had a few problems with this:

1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end
up trying to authenticate twice, if a new master is detected before the auth is
processed. It seems to me that this should not be an issue (or at least, not
any more).
2. Many of our tests depend on authenticated registration happening even if
`Clock::pause()` has been called; that is, because our first attempt at
authentication and Agent registration are dispatched for immediate execution,
even when we pause the clock, these events should still happen. If we use a
`delay`, then they are scheduled to happen in the future, and any tests
employing `Clock::pause` during this time will fail.
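
The interaction in item 2 can be sketched with a toy paused clock. This is a
simplified model, not libprocess; `FakeClock`, `schedule`, and `advance` are
hypothetical names chosen for the example. An event dispatched with no delay
runs even though the clock is paused, while a delayed event waits until the
test advances the clock:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Toy model of a paused test clock. All names are hypothetical and only
// mimic the behavior described above.
class FakeClock
{
public:
  // Schedule `f` to run `delayMs` from now. A delay of 0 models an event
  // dispatched for immediate execution: it runs right away, paused or not.
  void schedule(long delayMs, std::function<void()> f)
  {
    if (delayMs <= 0) {
      f();
      return;
    }
    timers.emplace(now + delayMs, std::move(f));
  }

  // Move the clock forward and fire every timer that has become due.
  void advance(long ms)
  {
    now += ms;
    while (!timers.empty() && timers.begin()->first <= now) {
      std::function<void()> f = std::move(timers.begin()->second);
      timers.erase(timers.begin());
      f();
    }
  }

private:
  long now = 0; // Milliseconds. Time only moves via advance() (paused clock).
  std::multimap<long, std::function<void()>> timers;
};
```

Under this model, a test that pauses the clock and then waits for registration
would hang once registration is scheduled with a `delay`, unless the test also
advances the clock by at least the chosen delay.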

SUCCESS CRITERIA: The resolution of this bug, at minimum, involves:

* Fixing the semantics of the above tests to pass when `HAS_AUTHENTICATION` is
set to false.
* Adding `delay` to the authentication codepath as well.
* Making sure that all tests pass with `delay` and with `HAS_AUTHENTICATION`
set to both true AND false.

In terms of resolution, it is useful to know the specific tests that will fail
if we add a `delay` and `HAS_AUTHENTICATION` is set to `true`:

```
[ FAILED ] ExamplesTest.V1JavaFramework
[ FAILED ] ExamplesTest.PythonFramework
[ FAILED ] FaultToleranceTest.FrameworkReregister
[ FAILED ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator >
[ FAILED ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module
[ FAILED ] MasterTest.EndpointsForHalfRemovedSlave
[ FAILED ] MasterTest.UnreachableTaskAfterFailover
[ FAILED ] MasterTest.CancelRecoveredSlaveRemoval
[ FAILED ] MasterTest.RecoveredFramework
[ FAILED ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
[ FAILED ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
[ FAILED ] OversubscriptionTest.Reregistration
[ FAILED ] PartitionTest.ReregisterSlavePartitionAware
[ FAILED ] PartitionTest.ReregisterSlaveNotPartitionAware
[ FAILED ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
[ FAILED ] PartitionTest.PartitionedSlaveOrphanedTask
[ FAILED ] PartitionTest.SpuriousSlaveReregistration
[ FAILED ] PartitionTest.PartitionedSlaveStatusUpdates
[ FAILED ] PartitionTest.RegistryGcByCount
[ FAILED ]

[jira] [Created] (MESOS-6805) Check unreachable task cache for task ID collisions on launch

Neil Conway created MESOS-6805:
--

 Summary: Check unreachable task cache for task ID collisions on 
launch
 Key: MESOS-6805
 URL: https://issues.apache.org/jira/browse/MESOS-6805
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Neil Conway
Assignee: Neil Conway


As discussed in MESOS-6785, it is possible to crash the master by launching a 
task that reuses the ID of an unreachable/partitioned task. A complete solution 
to this problem will be quite involved, but an incremental improvement is easy: 
when we see a task launch operation, reject the launch attempt if the task ID 
collides with an ID in the per-framework {{unreachableTasks}} cache. This 
doesn't catch all situations in which IDs are reused, but it is better than 
nothing.
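
A minimal sketch of the proposed check, assuming a much simplified
per-framework state. The `Framework` struct, the `LaunchResult` enum, and
`tryLaunch` are hypothetical stand-ins for illustration and are not the
master's real data structures:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical, simplified stand-in for the master's per-framework state.
struct Framework
{
  std::unordered_set<std::string> runningTasks;
  std::unordered_set<std::string> unreachableTasks; // Cache; may be incomplete.
};

enum class LaunchResult
{
  ACCEPTED,
  DUPLICATE_RUNNING,
  COLLIDES_WITH_UNREACHABLE
};

// Reject a launch whose task ID is still present in the unreachable cache;
// otherwise record the task as running.
LaunchResult tryLaunch(Framework& framework, const std::string& taskId)
{
  if (framework.unreachableTasks.count(taskId) > 0) {
    return LaunchResult::COLLIDES_WITH_UNREACHABLE;
  }

  if (!framework.runningTasks.insert(taskId).second) {
    return LaunchResult::DUPLICATE_RUNNING;
  }

  return LaunchResult::ACCEPTED;
}
```

Because the cache may not hold every unreachable task, a lookup miss does not
prove the ID is unused, which is why this is only an incremental improvement.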



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (MESOS-6785) CHECK failure on duplicate task IDs


 [ 
https://issues.apache.org/jira/browse/MESOS-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6785:
---
Sprint:   (was: Mesosphere Sprint 48)

> CHECK failure on duplicate task IDs
> ---
>
> Key: MESOS-6785
> URL: https://issues.apache.org/jira/browse/MESOS-6785
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
> Attachments: duplicate_id_check_fail_test-1.patch
>
>
> The master crashes with a CHECK failure in the following scenario:
> # Framework launches task X on agent A1. The framework may or may not be 
> partition-aware; let's assume it is not partition-aware.
> # A1 becomes partitioned from the master.
> # Framework launches task X on agent A2.
> # Master fails over.
> # Agents A1 and A2 both re-register with the master. Because the master has 
> failed over, the task on A1 is _not_ terminated ("non-strict registry 
> semantics").
> This results in two running tasks with the same ID, which causes a master 
> {{CHECK}} failure among other badness:
> {noformat}
> master.hpp:2299] Check failed: !tasks.contains(task->task_id()) Duplicate 
> task b88153a2-571a-41e7-9e9b-c297fef4f3cd of framework 
> eaef1879-8cc9-412f-928d-86c9925a7abb-
> {noformat}
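
The scenario above can be modeled in a few lines. This is a toy model with
hypothetical names (`ToyMaster`, `reregisterAgent`), not the master's code: a
master that has just failed over starts with no task state and learns about
tasks only from re-registering agents, so the second agent reporting task X
produces a duplicate.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a freshly failed-over master.
struct ToyMaster
{
  std::set<std::string> tasks;

  // Returns the task IDs that were already known when the agent
  // re-registered. The real master hits the CHECK in this situation.
  std::vector<std::string> reregisterAgent(
      const std::vector<std::string>& agentTasks)
  {
    std::vector<std::string> duplicates;
    for (const std::string& id : agentTasks) {
      if (!tasks.insert(id).second) {
        duplicates.push_back(id);
      }
    }
    return duplicates;
  }
};
```

Re-registering A1 with task X and then A2 with task X returns X as a duplicate
on the second call.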




[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`


 [ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-6803:

Description: 
When an agent registers, there is currently a somewhat subtle difference in 
behavior between the cases when it does and does not authenticate:

* In the case that a credential IS NOT passed to the agent, we will choose a
random time between 0 and the agent `registration_backoff_factor` to initiate
registration. The reason for this is to avoid every agent hitting the master at
once during master failover. (We also employ backoff to help with this.) See: [1]
* In the case that a credential IS passed to the agent, we always attempt to
authenticate and register the Agent immediately. So currently, in authenticated
clusters, all agents will immediately try to register with the master after a
failover, though this is mitigated somewhat by the fact that the authenticated
codepath still uses backoff. See: [2]

It is important to resolve this disparity, not only to make the system more 
resilient, but also because it directly blocks us from passing many tests on 
platforms where authentication is not supported at all (Windows in particular). 
Note that there are several solutions that won't work in this case:

* The default credential in the agent tests is CRAM-MD5. Windows doesn't
support this, so passing CRAM-MD5 as the default will cause these tests to fail
on Windows. So the only sensible default here, at least if we want the Windows
tests to pass, is `none`.
* We can't remove the `delay` from the non-authenticated codepath. This 
provides real value to large-scale users.
* Setting `registration_backoff_factor` to 0 or -1 will change the semantics
of backoff for tests: specifically, it will cause all attempts to `delay`
registration to execute immediately. It is highly undesirable to exercise a
different registration backoff in tests.

So, the best long-term solution is probably to just fix the tests to work in 
both the `delay`'d and non-`delay`'d cases.

For some time, we have meant to make both the authenticated and unauthenticated 
codepaths use a random `delay` to begin. See Adam's TODO in [3]. Historically, 
people seem to have had a few problems with this:

1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end 
up trying to authenticate twice, if a new master is detected before the auth is 
processed. It seems to me that this should not be an issue (or at least, not 
any more).
2. Many of our tests depend on authenticated registration happening even if 
`Clock::pause()` has been called; that is, because our first attempt at 
authentication and Agent registration are dispatched for immediate execution, 
even when we pause the clock, these events should still happen. If we use a 
`delay`, then they are scheduled to happen in the future, and any tests 
employing `Clock::pause` during this time will fail.

The resolution of this bug, at minimum, involves fixing the semantics of the 
above tests to pass when `HAS_AUTHENTICATION` is set to false. Following this, 
it is realistic to expect that we add `delay` to the authentication codepath as 
well.

In terms of resolution, it is useful to know the specific tests that will fail 
if `HAS_AUTHENTICATION` is set to false:

```
[  FAILED  ] ExamplesTest.V1JavaFramework
[  FAILED  ] ExamplesTest.PythonFramework
[  FAILED  ] FaultToleranceTest.FrameworkReregister
[  FAILED  ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator >
[  FAILED  ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module
[  FAILED  ] MasterTest.EndpointsForHalfRemovedSlave
[  FAILED  ] MasterTest.UnreachableTaskAfterFailover
[  FAILED  ] MasterTest.CancelRecoveredSlaveRemoval
[  FAILED  ] MasterTest.RecoveredFramework
[  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
[  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
[  FAILED  ] OversubscriptionTest.Reregistration
[  FAILED  ] PartitionTest.ReregisterSlavePartitionAware
[  FAILED  ] PartitionTest.ReregisterSlaveNotPartitionAware
[  FAILED  ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
[  FAILED  ] PartitionTest.PartitionedSlaveOrphanedTask
[  FAILED  ] PartitionTest.SpuriousSlaveReregistration
[  FAILED  ] PartitionTest.PartitionedSlaveStatusUpdates
[  FAILED  ] PartitionTest.RegistryGcByCount
[  FAILED  ] PartitionTest.RegistryGcByAge
[  FAILED  ] PartitionTest.RegistryGcRace
[  FAILED  ] OneWayPartitionTest.MasterToSlave
[

[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`


 [ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-6803:

Description: 
When an agent registers, there is currently a somewhat subtle difference in 
behavior between the cases when it does and does not authenticate:

* In the case that a credential IS NOT passed to the agent, we will choose a
random time between 0 and the agent `registration_backoff_factor` to initiate
registration. The reason for this is to avoid every agent hitting the master at
once during master failover. (We also employ backoff to help with this.) See: [1]
* In the case that a credential IS passed to the agent, we always attempt to
authenticate and register the Agent immediately. So currently, in authenticated
clusters, all agents will immediately try to register with the master after a
failover, though this is mitigated somewhat by the fact that the authenticated
codepath still uses backoff. See: [2]

It is important to resolve this disparity, not only to make the system more 
resilient, but also because it directly blocks us from passing many tests on 
platforms where authentication is not supported at all (Windows in particular). 
Note that there are several solutions that won't work in this case:

* The default credential in the agent tests is CRAM-MD5. Windows doesn't
support this, so passing CRAM-MD5 as the default will cause these tests to fail
on Windows. So the only sensible default here, at least if we want the Windows
tests to pass, is `none`.
* We can't remove the `delay` from the non-authenticated codepath. This 
provides real value to large-scale users.

So, the best long-term solution is probably to just fix the tests to work in 
both the `delay`'d and non-`delay`'d cases.

For some time, we have meant to make both the authenticated and unauthenticated 
codepaths use a random `delay` to begin. See Adam's TODO in [3]. Historically, 
people seem to have had a few problems with this:

1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end 
up trying to authenticate twice, if a new master is detected before the auth is 
processed. It seems to me that this should not be an issue (or at least, not 
any more).
2. Many of our tests depend on authenticated registration happening even if 
`Clock::pause()` has been called; that is, because our first attempt at 
authentication and Agent registration are dispatched for immediate execution, 
even when we pause the clock, these events should still happen. If we use a 
`delay`, then they are scheduled to happen in the future, and any tests 
employing `Clock::pause` during this time will fail.

The resolution of this bug, at minimum, involves fixing the semantics of the 
above tests to pass when `HAS_AUTHENTICATION` is set to false. Following this, 
it is realistic to expect that we add `delay` to the authentication codepath as 
well.

In terms of resolution, it is useful to know the specific tests that will fail 
if `HAS_AUTHENTICATION` is set to false:

```
[  FAILED  ] ExamplesTest.V1JavaFramework
[  FAILED  ] ExamplesTest.PythonFramework
[  FAILED  ] FaultToleranceTest.FrameworkReregister
[  FAILED  ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator >
[  FAILED  ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module
[  FAILED  ] MasterTest.EndpointsForHalfRemovedSlave
[  FAILED  ] MasterTest.UnreachableTaskAfterFailover
[  FAILED  ] MasterTest.CancelRecoveredSlaveRemoval
[  FAILED  ] MasterTest.RecoveredFramework
[  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
[  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
[  FAILED  ] OversubscriptionTest.Reregistration
[  FAILED  ] PartitionTest.ReregisterSlavePartitionAware
[  FAILED  ] PartitionTest.ReregisterSlaveNotPartitionAware
[  FAILED  ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
[  FAILED  ] PartitionTest.PartitionedSlaveOrphanedTask
[  FAILED  ] PartitionTest.SpuriousSlaveReregistration
[  FAILED  ] PartitionTest.PartitionedSlaveStatusUpdates
[  FAILED  ] PartitionTest.RegistryGcByCount
[  FAILED  ] PartitionTest.RegistryGcByAge
[  FAILED  ] PartitionTest.RegistryGcRace
[  FAILED  ] OneWayPartitionTest.MasterToSlave
[  FAILED  ] ReconciliationTest.ReconcileStatusUpdateTaskState
[  FAILED  ] ReservationTest.ACLMultipleOperations
[  FAILED  ] ReservationTest.WithoutAuthenticationWithoutPrincipal
[  FAILED  ] ReservationTest.WithoutAuthenticationWithPrincipal
[  FAILED  ]

[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`


 [ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-6803:

Description: 
When an agent registers, there is currently a somewhat subtle difference in 
behavior between the cases when it does and does not authenticate:

* In the case that a credential IS NOT passed to the agent, we will choose a
random time between 0 and the agent `registration_backoff_factor` to initiate
registration. The reason for this is to avoid every agent hitting the master at
once during master failover. (We also employ backoff to help with this.) See: [1]
* In the case that a credential IS passed to the agent, we always attempt to
authenticate and register the Agent immediately. So currently, in authenticated
clusters, all agents will immediately try to register with the master after a
failover, though this is mitigated somewhat by the fact that the authenticated
codepath still uses backoff. See: [2]

It is important to resolve this disparity, not only to make the system more 
resilient, but also because it directly blocks us from passing many tests on 
platforms where authentication is not supported at all (Windows in particular). 
Note that there are several solutions that won't work in this case:

* The default credential in the agent tests is CRAM-MD5. Windows doesn't
support this, so passing CRAM-MD5 as the default will cause these tests to fail
on Windows. So the only sensible default here, at least if we want the Windows
tests to pass, is `none`.
* We can't remove the `delay` from the non-authenticated codepath. This 
provides real value to large-scale users.

For some time, we have meant to make both the authenticated and unauthenticated 
codepaths use a random `delay` to begin. See Adam's TODO in [3]. Historically, 
people seem to have had a few problems with this:

1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end 
up trying to authenticate twice, if a new master is detected before the auth is 
processed. It seems to me that this should not be an issue (or at least, not 
any more).
2. Many of our tests depend on authenticated registration happening even if 
`Clock::pause()` has been called; that is, because our first attempt at 
authentication and Agent registration are dispatched for immediate execution, 
even when we pause the clock, these events should still happen. If we use a 
`delay`, then they are scheduled to happen in the future, and any tests 
employing `Clock::pause` during this time will fail.

The resolution of this bug, at minimum, involves fixing the semantics of the 
above tests to pass when `HAS_AUTHENTICATION` is set to false. Following this, 
it is realistic to expect that we add `delay` to the authentication codepath as 
well.

In terms of resolution, it is useful to know the specific tests that will fail 
if `HAS_AUTHENTICATION` is set to false:

```
[  FAILED  ] ExamplesTest.V1JavaFramework
[  FAILED  ] ExamplesTest.PythonFramework
[  FAILED  ] FaultToleranceTest.FrameworkReregister
[  FAILED  ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator >
[  FAILED  ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module
[  FAILED  ] MasterTest.EndpointsForHalfRemovedSlave
[  FAILED  ] MasterTest.UnreachableTaskAfterFailover
[  FAILED  ] MasterTest.CancelRecoveredSlaveRemoval
[  FAILED  ] MasterTest.RecoveredFramework
[  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
[  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
[  FAILED  ] OversubscriptionTest.Reregistration
[  FAILED  ] PartitionTest.ReregisterSlavePartitionAware
[  FAILED  ] PartitionTest.ReregisterSlaveNotPartitionAware
[  FAILED  ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
[  FAILED  ] PartitionTest.PartitionedSlaveOrphanedTask
[  FAILED  ] PartitionTest.SpuriousSlaveReregistration
[  FAILED  ] PartitionTest.PartitionedSlaveStatusUpdates
[  FAILED  ] PartitionTest.RegistryGcByCount
[  FAILED  ] PartitionTest.RegistryGcByAge
[  FAILED  ] PartitionTest.RegistryGcRace
[  FAILED  ] OneWayPartitionTest.MasterToSlave
[  FAILED  ] ReconciliationTest.ReconcileStatusUpdateTaskState
[  FAILED  ] ReservationTest.ACLMultipleOperations
[  FAILED  ] ReservationTest.WithoutAuthenticationWithoutPrincipal
[  FAILED  ] ReservationTest.WithoutAuthenticationWithPrincipal
[  FAILED  ] SlaveTest.DuplicateTerminalUpdateBeforeAck
[  FAILED  ] SlaveTest.StateEndpoint
[  FAILED  ] SlaveTest.PingTimeoutNoPings
[  FAILED  ]

[jira] [Updated] (MESOS-6785) CHECK failure on duplicate task IDs


 [ 
https://issues.apache.org/jira/browse/MESOS-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6785:
---
Attachment: duplicate_id_check_fail_test-1.patch

Attached a patch that adds a unit test for this situation.

> CHECK failure on duplicate task IDs
> ---
>
> Key: MESOS-6785
> URL: https://issues.apache.org/jira/browse/MESOS-6785
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
> Attachments: duplicate_id_check_fail_test-1.patch
>
>
> The master crashes with a CHECK failure in the following scenario:
> # Framework launches task X on agent A1. The framework may or may not be 
> partition-aware; let's assume it is not partition-aware.
> # A1 becomes partitioned from the master.
> # Framework launches task X on agent A2.
> # Master fails over.
> # Agents A1 and A2 both re-register with the master. Because the master has 
> failed over, the task on A1 is _not_ terminated ("non-strict registry 
> semantics").
> This results in two running tasks with the same ID, which causes a master 
> {{CHECK}} failure among other badness:
> {noformat}
> master.hpp:2299] Check failed: !tasks.contains(task->task_id()) Duplicate 
> task b88153a2-571a-41e7-9e9b-c297fef4f3cd of framework 
> eaef1879-8cc9-412f-928d-86c9925a7abb-
> {noformat}




[jira] [Updated] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky

2016-12-15 Thread Kevin Klues (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-6784:
---
Assignee: Anand Mazumdar  (was: Kevin Klues)

> IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
> 
>
> Key: MESOS-6784
> URL: https://issues.apache.org/jira/browse/MESOS-6784
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Neil Conway
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
> I1212 13:57:02.641043  2211 containerizer.cpp:220] Using isolation: 
> posix/cpu,filesystem/posix,network/cni
> W1212 13:57:02.641438  2211 backend.cpp:76] Failed to create 'overlay' 
> backend: OverlayBackend requires root privileges, but is running as user nrc
> W1212 13:57:02.641559  2211 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1212 13:57:02.642822  2268 containerizer.cpp:594] Recovering containerizer
> I1212 13:57:02.643975  2253 provisioner.cpp:253] Provisioner recovery complete
> I1212 13:57:02.644953  2255 containerizer.cpp:986] Starting container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework
> I1212 13:57:02.647004  2245 switchboard.cpp:430] Allocated pseudo terminal 
> '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.652305  2245 switchboard.cpp:596] Created I/O switchboard 
> server (pid: 2705) listening on socket file 
> '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.655513  2267 launcher.cpp:133] Forked child with pid '2706' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f'
> I1212 13:57:02.655732  2267 containerizer.cpp:1621] Checkpointing container's 
> forked pid 2706 to 
> '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid'
> I1212 13:57:02.726306  2265 containerizer.cpp:2463] Container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f has exited
> I1212 13:57:02.726352  2265 containerizer.cpp:2100] Destroying container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state
> E1212 13:57:02.726495  2243 switchboard.cpp:861] Unexpected termination of 
> I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.726563  2265 launcher.cpp:149] Asked to destroy container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f
> E1212 13:57:02.783607  2228 switchboard.cpp:799] Failed to remove unix domain 
> socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or 
> directory
> ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure
> Value of: wait.get()->reasons().size() == 1
>   Actual: false
> Expected: true
> *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are 
> using GNU date ***
> PC: @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; 
> stack trace: ***
> @ 0x7faecf855100 (unknown)
> @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> @  0x1be6247 testing::internal::AssertHelper::operator=()
> @  0x19ed751 
> mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody()
> @  0x1c0ed8c 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c09e74 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1beb505 testing::Test::Run()
> @  0x1bebc88 testing::TestInfo::Run()
> @  0x1bec2ce testing::TestCase::Run()
> @  0x1bf2ba8 testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c0f9b1 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c0a9f2 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1bf18ee testing::UnitTest::Run()
> @  0x11bc9e3 RUN_ALL_TESTS()
> @  0x11bc599 main
> @ 0x7faece663b15 __libc_start_main
> @   0xa9c219 (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (MESOS-6804) Running 'tty' inside a debug container that has a tty reports "Not a tty"

2016-12-15 Thread Kevin Klues (JIRA)

Kevin Klues created MESOS-6804:
--

 Summary: Running 'tty' inside a debug container that has a tty 
reports "Not a tty"
 Key: MESOS-6804
 URL: https://issues.apache.org/jira/browse/MESOS-6804
 Project: Mesos
  Issue Type: Bug
Reporter: Kevin Klues


We need to inject `/dev/console` into the container and map it to the slave end 
of the TTY we are attached to.




[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`


 [ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-6803:

Priority: Blocker  (was: Critical)

> Agent authentication does not have an initial `delay`
> -
>
> Key: MESOS-6803
> URL: https://issues.apache.org/jira/browse/MESOS-6803
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, scheduler driver
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>Priority: Blocker
>  Labels: microsoft, security, windows-mvp
>
> When an agent registers, there is currently a somewhat subtle difference in 
> behavior between the cases when it does and does not authenticate:
> * In the case that it DOES NOT authenticate, we will choose a random time 
> between 0 and the agent `registration_backoff_factor` to initiate 
> registration. The reason for this is to avoid every agent hitting the master 
> at once during master failover. (We also employ backoff to help this.) See: 
> [1]
> * In the case that it DOES authenticate, we always attempt to authenticate 
> and register the Agent immediately. So currently, in authenticated clusters, 
> all agents will try to register with the master immediately after a 
> failover, though this is helped somewhat by the fact that the 
> authenticated codepath still uses backoff. See: [2]
> It is important to resolve this disparity, not only to make the system more 
> resilient, but also because it directly blocks us from passing many tests on 
> platforms where authentication is not supported at all (Windows in 
> particular).
> For some time, we have meant to make both the authenticated and 
> unauthenticated codepaths use a random `delay` to begin. See Adam's TODO in 
> [3]. Historically, people seem to have had a few problems with this:
> 1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end 
> up trying to authenticate twice, if a new master is detected before the auth 
> is processed. It seems to me that this should not be an issue (or at least, 
> not any more).
> 2. Many of our tests depend on authenticated registration happening even if 
> `Clock::pause()` has been called; that is, because our first attempt at 
> authentication and Agent registration are dispatched for immediate execution, 
> even when we pause the clock, these events should still happen. If we use a 
> `delay`, then they are scheduled to happen in the future, and any tests 
> employing `Clock::pause` during this time will fail.
> The resolution of this bug, at minimum, involves fixing the semantics of the 
> above tests to pass when `HAS_AUTHENTICATION` is set to false. Following 
> this, it is realistic to expect that we add `delay` to the authentication 
> codepath as well.
> In terms of resolution, it is useful to know the specific tests that will 
> fail if `HAS_AUTHENTICATION` is set to false:
> ```
> [  FAILED  ] ExamplesTest.V1JavaFramework
> [  FAILED  ] ExamplesTest.PythonFramework
> [  FAILED  ] FaultToleranceTest.FrameworkReregister
> [  FAILED  ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where 
> TypeParam = 
> mesos::internal::master::allocator::MesosAllocator  mesos::internal::master::allocator::DRFSorter, 
> mesos::internal::master::allocator::DRFSorter> >
> [  FAILED  ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where 
> TypeParam = mesos::internal::tests::Module (mesos::internal::tests::ModuleID)6>
> [  FAILED  ] MasterTest.EndpointsForHalfRemovedSlave
> [  FAILED  ] MasterTest.UnreachableTaskAfterFailover
> [  FAILED  ] MasterTest.CancelRecoveredSlaveRemoval
> [  FAILED  ] MasterTest.RecoveredFramework
> [  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
> [  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
> [  FAILED  ] OversubscriptionTest.Reregistration
> [  FAILED  ] PartitionTest.ReregisterSlavePartitionAware
> [  FAILED  ] PartitionTest.ReregisterSlaveNotPartitionAware
> [  FAILED  ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
> [  FAILED  ] PartitionTest.PartitionedSlaveOrphanedTask
> [  FAILED  ] PartitionTest.SpuriousSlaveReregistration
> [  FAILED  ] PartitionTest.PartitionedSlaveStatusUpdates
> [  FAILED  ] PartitionTest.RegistryGcByCount
> [  FAILED  ] PartitionTest.RegistryGcByAge
> [  FAILED  ] PartitionTest.RegistryGcRace
> [  FAILED  ] OneWayPartitionTest.MasterToSlave
> [  FAILED  ] ReconciliationTest.ReconcileStatusUpdateTaskState
> [  FAILED  ] ReservationTest.ACLMultipleOperations
> [  FAILED  ] ReservationTest.WithoutAuthenticationWithoutPrincipal
> [  FAILED  ] ReservationTest.WithoutAuthenticationWithPrincipal
> [

[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor


[ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752265#comment-15752265
 ] 

Benjamin Bannier commented on MESOS-6801:
-

Ah sorry, I missed the explicit {{self()}} passed to {{process::loop}}; I had 
also been looking at {{process::internal::loop}}.

> IOSwitchboard::connect installs continuations capturing this without properly 
> deferring/dispatching to an actor
> ---
>
> Key: MESOS-6801
> URL: https://issues.apache.org/jira/browse/MESOS-6801
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: newbie
>
> In the body of {{IOSwitchboard::connect}} lambdas capturing {{this}} are 
> created and used as callbacks without properly deferring to a libprocess 
> actor.
> {noformat}
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Nothing&) {
>   ^
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Result& record) -> Future {
>   ^
> {noformat}
> Patterns like this can create use-after-free scenarios or introduce data 
> races which can often be avoided by installing the callbacks via 
> {{defer}}/{{dispatch}} on some process' actor.
> This code should be revisited to remove existing data races.
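
To make the concern above concrete, here is a minimal single-threaded sketch of why routing a callback through its owner avoids touching a destroyed object. It is not the libprocess API: `Mailbox`, `Switchboard`, and `deferTo` are hypothetical stand-ins, and the weak reference plays the role that a PID plays when a callback is installed via {{defer}}.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Toy stand-in for an actor's mailbox; this is NOT the libprocess API.
class Mailbox
{
public:
  void enqueue(std::function<void()> f) { queue.push(std::move(f)); }

  // Runs all queued callbacks on the caller's thread, one at a time.
  void drain()
  {
    while (!queue.empty()) {
      std::function<void()> f = std::move(queue.front());
      queue.pop();
      f();
    }
  }

private:
  std::queue<std::function<void()>> queue;
};

// Hypothetical owner object standing in for the class that captures `this`.
struct Switchboard
{
  int connections = 0;
};

// Hypothetical `deferTo`: instead of capturing a raw `this`, the returned
// callback holds a weak reference and is routed through the owner's mailbox.
// If the owner is gone by the time the callback runs, it is dropped.
std::function<void()> deferTo(
    Mailbox& mailbox,
    std::weak_ptr<Switchboard> owner,
    std::function<void(Switchboard&)> f)
{
  return [&mailbox, owner, f]() {
    mailbox.enqueue([owner, f]() {
      if (std::shared_ptr<Switchboard> alive = owner.lock()) {
        f(*alive);
      }
    });
  };
}
```

Invoking the returned callback only enqueues work; the owner's state is touched when the mailbox is drained, and only if the owner still exists. A lambda that captures a raw `this` has neither property.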




[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`


 [ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6803:
--
Priority: Critical  (was: Major)

> Agent authentication does not have an initial `delay`
> -
>
> Key: MESOS-6803
> URL: https://issues.apache.org/jira/browse/MESOS-6803
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, scheduler driver
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>Priority: Critical
>  Labels: microsoft, security, windows-mvp
>

[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`


 [ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6803:
--
Component/s: scheduler driver

> Agent authentication does not have an initial `delay`
> -
>
> Key: MESOS-6803
> URL: https://issues.apache.org/jira/browse/MESOS-6803
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, scheduler driver
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>Priority: Critical
>  Labels: microsoft, security, windows-mvp
>

[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`


 [ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6803:
--
Labels: microsoft security windows-mvp  (was: microsoft windows-mvp)

> Agent authentication does not have an initial `delay`
> -
>
> Key: MESOS-6803
> URL: https://issues.apache.org/jira/browse/MESOS-6803
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, scheduler driver
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>Priority: Critical
>  Labels: microsoft, security, windows-mvp
>

[jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`


 [ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6803:
--
Target Version/s: 1.2.0

> Agent authentication does not have an initial `delay`
> -
>
> Key: MESOS-6803
> URL: https://issues.apache.org/jira/browse/MESOS-6803
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, scheduler driver
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>Priority: Critical
>  Labels: microsoft, security, windows-mvp
>

[jira] [Commented] (MESOS-6803) Agent authentication does not have an initial `delay`


[ 
https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752242#comment-15752242
 ] 

Adam B commented on MESOS-6803:
---

Thanks for re-motivating this! cc: [~tillt], [~karya] [~bbannier]
Let's also consider this for framework authentication, as per 
https://github.com/apache/mesos/blob/1.1.0/src/sched/sched.cpp#L336
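
As a rough illustration of the initial `delay` under discussion, the sketch below picks a random wait in [0, backoff factor) before the first attempt. `initialDelay` and its parameters are hypothetical names, not Mesos code; the assumption is that the agent and the scheduler driver would each pass such a duration to libprocess `delay` before the first authentication attempt.

```cpp
#include <cassert>
#include <chrono>
#include <random>

// Hypothetical helper (not Mesos code): picks a uniformly random initial
// delay in [0, backoffFactor) so that clients do not all contact the master
// at the same instant after a failover.
std::chrono::milliseconds initialDelay(
    std::chrono::milliseconds backoffFactor,
    std::mt19937& rng)
{
  // A non-positive factor would leave no interval to draw from.
  assert(backoffFactor.count() > 0);

  // The distribution bounds are inclusive, hence the "- 1".
  std::uniform_int_distribution<std::chrono::milliseconds::rep> dist(
      0, backoffFactor.count() - 1);

  return std::chrono::milliseconds(dist(rng));
}
```

Using the same helper on both the authenticated and unauthenticated paths would give them the same startup behavior, which is the disparity the ticket describes.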

> Agent authentication does not have an initial `delay`
> -
>
> Key: MESOS-6803
> URL: https://issues.apache.org/jira/browse/MESOS-6803
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: microsoft, windows-mvp
>

[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor


[ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752229#comment-15752229
 ] 

Anand Mazumdar commented on MESOS-6801:
---

{{process::loop}} uses the {{pid}} passed to it as the async execution 
context; i.e., it implicitly does a {{defer}} to the actor.
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/loop.hpp#L74
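
A toy model of that behavior, for readers unfamiliar with the pattern: every iteration is scheduled on the context handed to the loop rather than run inline by whoever triggers it. `Context` and this `loop` are hypothetical and single-threaded; they only imitate the shape of the idea and are not the {{process::loop}} signature.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Toy execution context standing in for an actor identified by a pid.
class Context
{
public:
  void enqueue(std::function<void()> f) { queue.push(std::move(f)); }

  // Runs queued steps one at a time, including steps queued while draining.
  void drain()
  {
    while (!queue.empty()) {
      std::function<void()> f = std::move(queue.front());
      queue.pop();
      f();
    }
  }

private:
  std::queue<std::function<void()>> queue;
};

// Hypothetical loop: `iterate` produces a value, `body` consumes it and
// returns true to continue. Each step is enqueued on `context`, so the
// callbacks never run outside of it.
void loop(
    Context& context,
    std::function<int()> iterate,
    std::function<bool(int)> body)
{
  context.enqueue([&context, iterate, body]() {
    if (body(iterate())) {
      loop(context, iterate, body);
    }
  });
}
```

Because both callbacks only ever run from `drain()`, state owned by the context is never touched from another call site, which is what the implicit {{defer}} provides in the real code.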

> IOSwitchboard::connect installs continuations capturing this without properly 
> deferring/dispatching to an actor
> ---
>
> Key: MESOS-6801
> URL: https://issues.apache.org/jira/browse/MESOS-6801
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: newbie
>
> In the body of {{IOSwitchboard::connect}} lambdas capturing {{this}} are 
> created and used as callbacks without properly deferring to a libprocess 
> actor.
> {noformat}
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Nothing&) {
>   ^
> /tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: 
> callback capturing this should be dispatched/deferred to a specific PID 
> [mesos-this-capture]
>   [=](const Result& record) -> Future {
>   ^
> {noformat}
> Patterns like this can create use-after-free scenarios or introduce data 
> races which can often be avoided by installing the callbacks via 
> {{defer}}/{{dispatch}} on some process' actor.
> This code should be revisited to remove existing data races.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (MESOS-6803) Agent authentication does not have an initial `delay`

Alex Clemmer created MESOS-6803:
---

 Summary: Agent authentication does not have an initial `delay`
 Key: MESOS-6803
 URL: https://issues.apache.org/jira/browse/MESOS-6803
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Alex Clemmer
Assignee: Alex Clemmer


When an agent registers, there is currently a somewhat subtle difference in 
behavior between the cases when it does and does not authenticate:

* In the case that it DOES NOT authenticate, we will choose a random time 
between 0 and the agent `registration_backoff_factor` to initiate registration. 
The reason for this is to avoid every agent hitting the master at once during 
master failover. (We also employ backoff to help with this.) See: [1]
* In the case that it DOES authenticate, we always attempt to authenticate and 
register the Agent immediately. So currently, in authenticated clusters, all 
agents will try to register with the master immediately after failover, though 
this is mitigated somewhat by the fact that the authenticated codepath still 
uses backoff. See: [2]
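
The jittered start used by the unauthenticated path can be sketched in 
standalone C++ as follows. This is only an illustration: the helper name 
`initialRegistrationDelay` is hypothetical and is not Mesos code; the real 
agent derives the bound from `registration_backoff_factor`.

```cpp
#include <chrono>
#include <random>

// Hypothetical helper (not a Mesos API): picks a uniformly random delay in
// [0, backoffFactor] so that agents do not all contact the master at once.
// Assumes `backoffFactor` is non-negative.
std::chrono::milliseconds initialRegistrationDelay(
    std::chrono::milliseconds backoffFactor,
    std::mt19937& rng)
{
  std::uniform_int_distribution<std::chrono::milliseconds::rep> dist(
      0, backoffFactor.count());

  return std::chrono::milliseconds(dist(rng));
}
```

Giving the authenticated path the same behavior would amount to scheduling the 
first authentication attempt after such a delay instead of starting it 
immediately.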

It is important to resolve this disparity, not only to make the system more 
resilient, but also because it directly blocks us from passing many tests on 
platforms where authentication is not supported at all (Windows in particular).

For some time, we have meant to make both the authenticated and unauthenticated 
codepaths use a random `delay` to begin. See Adam's TODO in [3]. Historically, 
people seem to have had a few problems with this:

1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end 
up trying to authenticate twice, if a new master is detected before the auth is 
processed. It seems to me that this should not be an issue (or at least, not 
any more).
2. Many of our tests depend on authenticated registration happening even if 
`Clock::pause()` has been called; that is, because our first attempt at 
authentication and Agent registration are dispatched for immediate execution, 
even when we pause the clock, these events should still happen. If we use a 
`delay`, then they are scheduled to happen in the future, and any tests 
employing `Clock::pause` during this time will fail.
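
The second problem can be illustrated with a toy clock. The sketch below is 
not the libprocess `Clock`; `FakeClock` and its methods are hypothetical names 
that only model the difference between work dispatched immediately and work 
scheduled with a delay while time stands still.

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <utility>

// Toy stand-in for a paused test clock; NOT the libprocess `Clock`.
// Time only moves when `advance()` is called.
class FakeClock
{
public:
  using Duration = std::chrono::milliseconds;

  // Runs `f` right away, independent of the clock (models an immediate
  // dispatch).
  void dispatch(const std::function<void()>& f) { f(); }

  // Schedules `f` to run once the clock has advanced by at least `d`.
  void delay(Duration d, std::function<void()> f)
  {
    timers.emplace(now + d, std::move(f));
  }

  // Moves time forward explicitly and fires every timer that is now due.
  void advance(Duration d)
  {
    now += d;
    while (!timers.empty() && timers.begin()->first <= now) {
      std::function<void()> f = std::move(timers.begin()->second);
      timers.erase(timers.begin());
      f();
    }
  }

private:
  Duration now{0};
  std::multimap<Duration, std::function<void()>> timers;
};
```

In this model a test that relies on registration happening while the clock is 
paused keeps working with `dispatch`, but with `delay` it has to advance the 
clock past the chosen delay before it can expect the registration.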

The resolution of this bug, at minimum, involves fixing the semantics of the 
above tests to pass when `HAS_AUTHENTICATION` is set to false. Following this, 
it is realistic to expect that we add `delay` to the authentication codepath as 
well.

In terms of resolution, it is useful to know the specific tests that will fail 
if `HAS_AUTHENTICATION` is set to false:

```
[  FAILED  ] ExamplesTest.V1JavaFramework
[  FAILED  ] ExamplesTest.PythonFramework
[  FAILED  ] FaultToleranceTest.FrameworkReregister
[  FAILED  ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam 
= 
mesos::internal::master::allocator::MesosAllocator >
[  FAILED  ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam 
= mesos::internal::tests::Module
[  FAILED  ] MasterTest.EndpointsForHalfRemovedSlave
[  FAILED  ] MasterTest.UnreachableTaskAfterFailover
[  FAILED  ] MasterTest.CancelRecoveredSlaveRemoval
[  FAILED  ] MasterTest.RecoveredFramework
[  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
[  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
[  FAILED  ] OversubscriptionTest.Reregistration
[  FAILED  ] PartitionTest.ReregisterSlavePartitionAware
[  FAILED  ] PartitionTest.ReregisterSlaveNotPartitionAware
[  FAILED  ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
[  FAILED  ] PartitionTest.PartitionedSlaveOrphanedTask
[  FAILED  ] PartitionTest.SpuriousSlaveReregistration
[  FAILED  ] PartitionTest.PartitionedSlaveStatusUpdates
[  FAILED  ] PartitionTest.RegistryGcByCount
[  FAILED  ] PartitionTest.RegistryGcByAge
[  FAILED  ] PartitionTest.RegistryGcRace
[  FAILED  ] OneWayPartitionTest.MasterToSlave
[  FAILED  ] ReconciliationTest.ReconcileStatusUpdateTaskState
[  FAILED  ] ReservationTest.ACLMultipleOperations
[  FAILED  ] ReservationTest.WithoutAuthenticationWithoutPrincipal
[  FAILED  ] ReservationTest.WithoutAuthenticationWithPrincipal
[  FAILED  ] SlaveTest.DuplicateTerminalUpdateBeforeAck
[  FAILED  ] SlaveTest.StateEndpoint
[  FAILED  ] SlaveTest.PingTimeoutNoPings
[  FAILED  ] SlaveTest.PingTimeoutSomePings
[  FAILED  ] SlaveTest.ReregisterWithStatusUpdateTaskState
[  FAILED  ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag
[  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunchFalse/0, where 
GetParam() = application/x-protobuf
[  FAILED  ]
```

[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor


[ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752209#comment-15752209
 ] 

Benjamin Bannier commented on MESOS-6801:
-

[~anandmazumdar]: Other member functions in {{IOSwitchboard}} access 
{{IOSwitchboard::infos}} directly; at the moment {{IOSwitchboard::recover}} and 
{{IOSwitchboard::_prepare}} write to it. Would {{process::loop}} abort 
iteration in its {{process}} if, e.g., {{IOSwitchboard::recover}} were called 
elsewhere?





[jira] [Assigned] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky


 [ 
https://issues.apache.org/jira/browse/MESOS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-6799:


Assignee: Greg Mann

> Scheme/HTTPTest.Endpoints/0 is flaky
> 
>
> Key: MESOS-6799
> URL: https://issues.apache.org/jira/browse/MESOS-6799
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: Debian 8, gcc-4.9.2, SSL build w/optimizations and debug 
> symbols
>Reporter: Benjamin Bannier
>Assignee: Greg Mann
>Priority: Critical
>  Labels: flaky, flaky-test, ssl
>
> Saw {{Scheme/HTTPTest.Endpoints/0}} fail in internal CI with 
> {{812e5e3d4e4d9e044a1cfe6cc7eaab10efb499b6}},
> {noformat}
> [03:26:43] :   [Step 10/11] [ RUN  ] Scheme/HTTPTest.Endpoints/0
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.221824 23530 
> libevent_ssl_socket.cpp:1141] Socket error: 
> error::lib(0):func(0):reason(0)
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448218 23521 openssl.cpp:419] CA 
> file path is unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448226 23521 openssl.cpp:424] CA 
> directory path unspecified! NOTE: Set CA directory path with 
> LIBPROCESS_SSL_CA_DIR=
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448230 23521 openssl.cpp:429] Will 
> not verify peer certificate!
> [03:26:43]W:   [Step 10/11] NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable 
> peer certificate verification
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.448231 23521 openssl.cpp:435] Will 
> only verify peer certificate if presented!
> [03:26:43]W:   [Step 10/11] NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to 
> require peer certificate verification
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.449292 23521 process.cpp:1237] 
> libprocess is initialized on 172.16.10.123:58973 with 8 worker threads
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.452320 23871 process.cpp:3679] 
> Handling HTTP event for process '(75)' with path: '/(75)/body'
> [03:26:43]W:   [Step 10/11] I1215 03:26:43.455099 23870 process.cpp:3679] 
> Handling HTTP event for process '(75)' with path: '/(75)/pipe'
> [03:26:43] :   [Step 10/11] 
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:275: Failure
> [03:26:43] :   [Step 10/11] (future).failure(): failed to decode body
> [03:26:43] :   [Step 10/11] [  FAILED  ] Scheme/HTTPTest.Endpoints/0, where 
> GetParam() = "https" (234 ms)
> {noformat}
> I was not able to trigger this failure again in a couple thousand iterations, 
> so there might be some relation to load or other processes running in the 
> system.
> We should figure out when this problem first occurred as it might be worth 
> backporting a fix (if this isn't just a test error).




[jira] [Commented] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky


[ 
https://issues.apache.org/jira/browse/MESOS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752045#comment-15752045
 ] 

Greg Mann commented on MESOS-6799:
--

[~anandmazumdar] these tests were recently parametrized to run with SSL 
enabled: https://reviews.apache.org/r/50737/

We discovered a bug in which the SSL socket can either fail to receive an EOF, 
or can lose data when an EOF is received. This failure is triggered by the 
latter case, when the test's client socket drops some data and delivers an 
unexpected EOF before the entire HTTP response has been received.

This issue should be resolved by [this 
patch|https://reviews.apache.org/r/53802/], which you can find in MESOS-6802.





[jira] [Assigned] (MESOS-6802) SSL socket can lose bytes in the case of EOF


 [ 
https://issues.apache.org/jira/browse/MESOS-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-6802:


Assignee: Greg Mann

> SSL socket can lose bytes in the case of EOF
> 
>
> Key: MESOS-6802
> URL: https://issues.apache.org/jira/browse/MESOS-6802
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: libevent, libprocess, ssl
>
> During recent work on SSL-enabled tests in libprocess (MESOS-5966), we 
> discovered a bug in {{LibeventSSLSocketImpl}}, wherein the socket can either 
> fail to receive an EOF, or lose data when an EOF is received.
> The {{LibeventSSLSocketImpl::event_callback(short events)}} method 
> immediately sets any pending {{RecvRequest}}'s promise to zero upon receipt 
> of an EOF. However, at the time the promise is set, there may actually be 
> data waiting to be read by libevent. Upon receipt of an EOF, we should 
> attempt to read the socket's bufferevent first to ensure that we aren't 
> losing any data previously received by the socket.




[jira] [Commented] (MESOS-6802) SSL socket can lose bytes in the case of EOF


[ 
https://issues.apache.org/jira/browse/MESOS-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752026#comment-15752026
 ] 

Greg Mann commented on MESOS-6802:
--

Reviews here:
https://reviews.apache.org/r/53802/
https://reviews.apache.org/r/53803/





[jira] [Created] (MESOS-6802) SSL socket can lose bytes in the case of EOF

Greg Mann created MESOS-6802:


 Summary: SSL socket can lose bytes in the case of EOF
 Key: MESOS-6802
 URL: https://issues.apache.org/jira/browse/MESOS-6802
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Greg Mann


During recent work on SSL-enabled tests in libprocess (MESOS-5966), we 
discovered a bug in {{LibeventSSLSocketImpl}}, wherein the socket can either 
fail to receive an EOF, or lose data when an EOF is received.

The {{LibeventSSLSocketImpl::event_callback(short events)}} method immediately 
sets any pending {{RecvRequest}}'s promise to zero upon receipt of an EOF. 
However, at the time the promise is set, there may actually be data waiting to 
be read by libevent. Upon receipt of an EOF, we should attempt to read the 
socket's bufferevent first to ensure that we aren't losing any data previously 
received by the socket.
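
A minimal sketch of the proposed ordering, in standalone C++. It does not use 
libevent or the real {{LibeventSSLSocketImpl}}; {{FakeSocket}} and 
{{completeRecv}} are hypothetical names, and the string buffer merely stands 
in for the socket's bufferevent input buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Toy model of a socket's input buffer; a stand-in for libevent's
// bufferevent, not the real `LibeventSSLSocketImpl`.
struct FakeSocket
{
  std::string buffered; // Bytes received but not yet handed to the reader.
  bool eof = false;     // Set once the peer has closed the connection.
};

// Tries to complete a pending read of up to `size` bytes (assumes size > 0).
// Returns the number of bytes appended to `out`, 0 for end-of-file, or -1 if
// the read must stay pending (no data yet and no EOF). Buffered data is
// always delivered before EOF is reported, even if EOF was seen earlier.
long completeRecv(FakeSocket& socket, std::string& out, std::size_t size)
{
  if (!socket.buffered.empty()) {
    const std::size_t n = std::min(size, socket.buffered.size());
    out.append(socket.buffered, 0, n);
    socket.buffered.erase(0, n);
    return static_cast<long>(n);
  }

  return socket.eof ? 0 : -1;
}
```

The point of the sketch is only the order of the checks: the buffer is 
inspected first, and the zero-length result that signals EOF is produced only 
once nothing is left to read.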




[jira] [Updated] (MESOS-6802) SSL socket can lose bytes in the case of EOF


 [ 
https://issues.apache.org/jira/browse/MESOS-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6802:
-
Shepherd: Joseph Wu





[jira] [Commented] (MESOS-5795) Add support for Nvidia GPUs in the docker containerizer


[ 
https://issues.apache.org/jira/browse/MESOS-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751984#comment-15751984
 ] 

Anand Mazumdar commented on MESOS-5795:
---

^^ [~klueska] , 

> Add support for Nvidia GPUs in the docker containerizer
> ---
>
> Key: MESOS-5795
> URL: https://issues.apache.org/jira/browse/MESOS-5795
> Project: Mesos
>  Issue Type: Epic
>  Components: docker, isolation
>Reporter: Kevin Klues
>  Labels: gpu, mesosphere
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container. This tracks the support in the docker 
> containerizer. The mesos containerizer support has already been completed in 
> MESOS-5401.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/




[jira] [Commented] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky


[ 
https://issues.apache.org/jira/browse/MESOS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751954#comment-15751954
 ] 

Anand Mazumdar commented on MESOS-6799:
---

[~greggomann] Can you take a look at why this started failing when SSL is 
enabled?





[jira] [Updated] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky


 [ 
https://issues.apache.org/jira/browse/MESOS-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6799:
--
Priority: Critical  (was: Major)





[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor


[ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751943#comment-15751943
 ] 

Anand Mazumdar commented on MESOS-6801:
---

In this case, the {{loop}} abstraction ensures that we are delegating to the 
correct actor.

Can we modify the existing script that checks for such errors to make it aware 
of how {{process::loop}} works? I am under the impression that even if we 
capture {{this}} in the lambda, it would still complain due to a missing 
{{defer}}.





[jira] [Updated] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor


 [ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6801:
--
Labels: newbie  (was: )





[jira] [Commented] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor


[ 
https://issues.apache.org/jira/browse/MESOS-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751878#comment-15751878
 ] 

Benjamin Bannier commented on MESOS-6801:
-

cc [~jieyu] [~klueska] [~vinodkone]





[jira] [Created] (MESOS-6801) IOSwitchboard::connect installs continuations capturing this without properly deferring/dispatching to an actor

Benjamin Bannier created MESOS-6801:
---

 Summary: IOSwitchboard::connect installs continuations capturing 
this without properly deferring/dispatching to an actor
 Key: MESOS-6801
 URL: https://issues.apache.org/jira/browse/MESOS-6801
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Bannier


In the body of {{IOSwitchboard::connect}}, lambdas capturing {{this}} are 
created and used as callbacks without properly deferring to a libprocess actor.
{noformat}
/tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:686:7: warning: 
callback capturing this should be dispatched/deferred to a specific PID 
[mesos-this-capture]
  [=](const Nothing&) {
  ^
/tmp/SRC/src/slave/containerizer/mesos/io/switchboard.cpp:1492:7: warning: 
callback capturing this should be dispatched/deferred to a specific PID 
[mesos-this-capture]
  [=](const Result& record) -> Future {
  ^
{noformat}

Patterns like this can create use-after-free scenarios or introduce data races, 
which can often be avoided by installing the callbacks via 
{{defer}}/{{dispatch}} on some process' actor.

This code should be revisited to remove existing data races.
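
The idea behind deferring a callback can be sketched with a toy actor in 
standalone C++. This is not the libprocess API: {{Actor}}, {{enqueue}}, 
{{drain}} and {{deferTo}} are hypothetical names, and the sketch only shows 
how wrapping a callback moves its execution into the actor's own serialized 
context.

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Toy actor with a mailbox; a stand-in for a libprocess process, NOT its API.
class Actor
{
public:
  // Enqueues `f`; it runs later, on whichever thread calls `drain()`.
  void enqueue(std::function<void()> f)
  {
    std::lock_guard<std::mutex> lock(mutex);
    mailbox.push(std::move(f));
  }

  // Runs queued callbacks one at a time, so they never race with each other.
  void drain()
  {
    for (;;) {
      std::function<void()> f;
      {
        std::lock_guard<std::mutex> lock(mutex);
        if (mailbox.empty()) {
          return;
        }
        f = std::move(mailbox.front());
        mailbox.pop();
      }
      f(); // Invoked outside the lock so callbacks may enqueue more work.
    }
  }

private:
  std::mutex mutex;
  std::queue<std::function<void()>> mailbox;
};

// Hypothetical `defer`: wraps `f` so that invoking the wrapper only enqueues
// `f` on `actor` instead of running it in the caller's context.
std::function<void()> deferTo(Actor& actor, std::function<void()> f)
{
  return [&actor, f]() { actor.enqueue(f); };
}
```

A callback installed as {{deferTo(actor, ...)}} touches the actor's state only 
when the actor drains its mailbox, which is what removes the data race. The 
toy holds the actor by reference, so it illustrates serialization only and 
does nothing for lifetime safety.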




[jira] [Created] (MESOS-6800) IOSwitchBoardTest.KillSwitchboardContainerDestroyed is flaky

Benjamin Bannier created MESOS-6800:
---

 Summary: IOSwitchBoardTest.KillSwitchboardContainerDestroyed is 
flaky
 Key: MESOS-6800
 URL: https://issues.apache.org/jira/browse/MESOS-6800
 Project: Mesos
  Issue Type: Bug
 Environment: Linux
Reporter: Benjamin Bannier


The test {{IOSwitchboardTest.KillSwitchboardContainerDestroyed}} failed in 
internal CI on a number of Linux platforms:
{noformat}
[02:53:36] : [Step 11/11] [ RUN  ] 
IOSwitchboardTest.KillSwitchboardContainerDestroyed
[02:53:36] : [Step 11/11] I1215 02:53:36.159004 23129 
containerizer.cpp:220] Using isolation: posix/cpu,filesystem/posix,network/cni
[02:53:36] : [Step 11/11] I1215 02:53:36.159701 23146 
containerizer.cpp:594] Recovering containerizer
[02:53:36] : [Step 11/11] I1215 02:53:36.160013 23144 provisioner.cpp:253] 
Provisioner recovery complete
[02:53:36] : [Step 11/11] I1215 02:53:36.160274 23146 
containerizer.cpp:986] Starting container ee8415af-5253-4ba2-9a98-2072af434f0f 
for executor 'executor' of framework 
[02:53:36] : [Step 11/11] I1215 02:53:36.160823 23150 switchboard.cpp:430] 
Allocated pseudo terminal '/dev/pts/0' for container 
ee8415af-5253-4ba2-9a98-2072af434f0f
[02:53:36] : [Step 11/11] I1215 02:53:36.160953 23150 switchboard.cpp:567] 
Launching 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
--help="false" 
--socket_address="/tmp/mesos-io-switchboard-a2bf4732-420c-4d91-b5b3-50f65c4db73c"
 --stderr_from_fd="25" --stderr_to_fd="2" --stdin_to_fd="25" 
--stdout_from_fd="25" --stdout_to_fd="1" --tty="true" 
--wait_for_connection="false"' for container 
ee8415af-5253-4ba2-9a98-2072af434f0f
[02:53:36] : [Step 11/11] I1215 02:53:36.163383 23150 switchboard.cpp:597] 
Created I/O switchboard server (pid: 10711) listening on socket file 
'/tmp/mesos-io-switchboard-a2bf4732-420c-4d91-b5b3-50f65c4db73c' for container 
ee8415af-5253-4ba2-9a98-2072af434f0f
[02:53:36] : [Step 11/11] I1215 02:53:36.164247 23144 
containerizer.cpp:1535] Launching 'mesos-containerizer' with flags 
'--help="false" --launch_info="{"command":{"shell":true,"value":"sleep 
1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","value":"\/mnt\/teamcity\/temp\/buildTmp\/IOSwitchboardTest_KillSwitchboardContainerDestroyed_b4902D"}]},"err":{"fd":26,"type":"FD"},"in":{"fd":26,"type":"FD"},"out":{"fd":26,"type":"FD"},"tty_slave_path":"\/dev\/pts\/0","working_directory":"\/mnt\/teamcity\/temp\/buildTmp\/IOSwitchboardTest_KillSwitchboardContainerDestroyed_b4902D"}"
 --pipe_read="25" --pipe_write="27" 
--runtime_directory="/mnt/teamcity/temp/buildTmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_OJVRdU/containers/ee8415af-5253-4ba2-9a98-2072af434f0f"
 --unshare_namespace_mnt="false"'
[02:53:36] : [Step 11/11] I1215 02:53:36.165638 23144 launcher.cpp:133] 
Forked child with pid '10712' for container 
'ee8415af-5253-4ba2-9a98-2072af434f0f'
[02:53:36] : [Step 11/11] I1215 02:53:36.165937 23144 
containerizer.cpp:1634] Checkpointing container's forked pid 10712 to 
'/mnt/teamcity/temp/buildTmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Yo7Yoa/meta/slaves/frameworks/executors/executor/runs/ee8415af-5253-4ba2-9a98-2072af434f0f/pids/forked.pid'
[02:53:36] : [Step 11/11] I1215 02:53:36.167196 23148 fetcher.cpp:349] 
Starting to fetch URIs for container: ee8415af-5253-4ba2-9a98-2072af434f0f, 
directory: 
/mnt/teamcity/temp/buildTmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_b4902D
[02:53:36] : [Step 11/11] E1215 02:53:36.243254 23148 switchboard.cpp:880] 
Unexpected termination of I/O switchboard server: 'IOSwitchboard' exited with 
signal: Killed for container ee8415af-5253-4ba2-9a98-2072af434f0f
[02:53:36] : [Step 11/11] I1215 02:53:36.243259 23150 
containerizer.cpp:2493] Container ee8415af-5253-4ba2-9a98-2072af434f0f has 
reached its limit for resource {} and will be terminated
[02:53:36] : [Step 11/11] I1215 02:53:36.243288 23150 
containerizer.cpp:2113] Destroying container 
ee8415af-5253-4ba2-9a98-2072af434f0f in RUNNING state
[02:53:36] : [Step 11/11] I1215 02:53:36.243319 23150 
containerizer.cpp:2476] Container ee8415af-5253-4ba2-9a98-2072af434f0f has 
exited
[02:53:36] : [Step 11/11] I1215 02:53:36.243332 23150 launcher.cpp:149] 
Asked to destroy container ee8415af-5253-4ba2-9a98-2072af434f0f
[02:53:36] : [Step 11/11] E1215 02:53:36.247699 23145 switchboard.cpp:801] 
Failed to remove unix domain socket file 
'/tmp/mesos-io-switchboard-a2bf4732-420c-4d91-b5b3-50f65c4db73c' for container 
'ee8415af-5253-4ba2-9a98-2072af434f0f': No such file or directory
[02:53:36] : [Step 11/11] I1215 02:53:36.248097 23150 provisioner.cpp:324] 
Ignoring destroy request for unknown container 
ee8415af-5253-4ba2-9a98-2072af434f0f
[02:53:36] : [Step 11/11] 
../../src/tests/containerizer/io_switchboard_tests.cpp:885: Failure
[02:53:36]

[jira] [Created] (MESOS-6799) Scheme/HTTPTest.Endpoints/0 is flaky