[jira] [Commented] (MESOS-3774) Migrate Future tests from process_tests.cpp to future_tests.cpp

2016-03-24 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211404#comment-15211404
 ] 

Anand Mazumdar commented on MESOS-3774:
---

{code}
commit 43a684e349bb9267a2562dd2716248397daf7197
Author: Cong Wang xiyou.wangc...@gmail.com
Date:   Thu Mar 10 13:17:32 2016 -0500

Moved future tests into `future_tests.cpp`.

Review: https://reviews.apache.org/r/44026/
{code}

> Migrate Future tests from process_tests.cpp to future_tests.cpp
> ---
>
> Key: MESOS-3774
> URL: https://issues.apache.org/jira/browse/MESOS-3774
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Gilbert Song
>Priority: Minor
>  Labels: mesosphere, newbie, testing
> Fix For: 0.29.0
>
>
> Currently we do not have many `Future` tests in
> /mesos/3rdparty/libprocess/src/tests/future_tests.cpp.
> It would be clearer to move all future-related tests
> from: /mesos/3rdparty/libprocess/src/tests/process_tests.cpp
> to: /mesos/3rdparty/libprocess/src/tests/future_tests.cpp



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5023) MesosContainerizerProvisionerTest.DestroyWhileProvisioning is flaky.

2016-03-24 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-5023:

Summary: MesosContainerizerProvisionerTest.DestroyWhileProvisioning is 
flaky.  (was: MesosContainerizerProvisionerTest.ProvisionFailed is flaky.)

> MesosContainerizerProvisionerTest.DestroyWhileProvisioning is flaky.
> 
>
> Key: MESOS-5023
> URL: https://issues.apache.org/jira/browse/MESOS-5023
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> Observed on the Apache Jenkins.
> {noformat}
> [ RUN  ] MesosContainerizerProvisionerTest.ProvisionFailed
> I0324 13:38:56.284261  2948 containerizer.cpp:666] Starting container 
> 'test_container' for executor 'executor' of framework ''
> I0324 13:38:56.285825  2939 containerizer.cpp:1421] Destroying container 
> 'test_container'
> I0324 13:38:56.285854  2939 containerizer.cpp:1424] Waiting for the 
> provisioner to complete for container 'test_container'
> [   OK ] MesosContainerizerProvisionerTest.ProvisionFailed (7 ms)
> [ RUN  ] MesosContainerizerProvisionerTest.DestroyWhileProvisioning
> I0324 13:38:56.291187  2944 containerizer.cpp:666] Starting container 
> 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' for executor 'executor' of framework ''
> I0324 13:38:56.292157  2944 containerizer.cpp:1421] Destroying container 
> 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2'
> I0324 13:38:56.292179  2944 containerizer.cpp:1424] Waiting for the 
> provisioner to complete for container 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2'
> F0324 13:38:56.292899  2944 containerizer.cpp:752] Check failed: 
> containers_.contains(containerId)
> *** Check failure stack trace: ***
> @ 0x2ac9973d0ae4  google::LogMessage::Fail()
> @ 0x2ac9973d0a30  google::LogMessage::SendToLog()
> @ 0x2ac9973d0432  google::LogMessage::Flush()
> @ 0x2ac9973d3346  google::LogMessageFatal::~LogMessageFatal()
> @ 0x2ac996af897c  
> mesos::internal::slave::MesosContainerizerProcess::_launch()
> @ 0x2ac996b1f18a  
> _ZZN7process8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS1_11ContainerIDERK6OptionINS1_8TaskInfoEERKNS1_12ExecutorInfoERKSsRKS8_ISsERKNS1_7SlaveIDERKNS_3PIDINS3_5SlaveEEEbRKS8_INS3_13ProvisionInfoEES5_SA_SD_SsSI_SL_SQ_bSU_EENS_6FutureIT_EERKNSO_IT0_EEMS10_FSZ_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_ENKUlPNS_11ProcessBaseEE_clES1P_
> @ 0x2ac996b479d9  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERK6OptionINS5_8TaskInfoEERKNS5_12ExecutorInfoERKSsRKSC_ISsERKNS5_7SlaveIDERKNS0_3PIDINS7_5SlaveEEEbRKSC_INS7_13ProvisionInfoEES9_SE_SH_SsSM_SP_SU_bSY_EENS0_6FutureIT_EERKNSS_IT0_EEMS14_FS13_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x2ac997334fef  std::function<>::operator()()
> @ 0x2ac99731b1c7  process::ProcessBase::visit()
> @ 0x2ac997321154  process::DispatchEvent::visit()
> @   0x9a699c  process::ProcessBase::serve()
> @ 0x2ac9973173c0  process::ProcessManager::resume()
> @ 0x2ac99731445a  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x2ac997320916  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x2ac9973208c6  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x2ac997320858  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x2ac9973207af  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x2ac997320748  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x2ac9989aea60  (unknown)
> @ 0x2ac999125182  start_thread
> @ 0x2ac99943547d  (unknown)
> make[4]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
> make[4]: *** [check-local] Aborted
> make[3]: *** [check-am] Error 2
> make[3]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
> make[2]: *** [check] Error 2
> make[2]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
> make[1]: *** [check-recursive] Error 1
> make[1]: Leaving directory `/mesos/mesos-0.29.0/_build'
> make: *** [distcheck] Error 1
> Build step 'Execute shell' marked build as failure
> {noformat}

[jira] [Commented] (MESOS-3902) The Location header when non-leading master redirects to leading master is incomplete.

2016-03-24 Thread Ashwin Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211256#comment-15211256
 ] 

Ashwin Murthy commented on MESOS-3902:
--

Have sent you the updated diff with the final changes after validating this
end-to-end with 3 masters and ZooKeeper. It works as expected.




> The Location header when non-leading master redirects to leading master is 
> incomplete.
> --
>
> Key: MESOS-3902
> URL: https://issues.apache.org/jira/browse/MESOS-3902
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Affects Versions: 0.25.0
> Environment: 3 masters, 10 slaves
>Reporter: Ben Whitehead
>Assignee: Ashwin Murthy
>  Labels: mesosphere
>
> The master now sets a location header, but it's incomplete. The path of the 
> URL isn't set. Consider an example:
> {code}
> > cat /tmp/subscribe-1072944352375841456 | httpp POST 
> > 127.1.0.3:5050/api/v1/scheduler Content-Type:application/x-protobuf
> POST /api/v1/scheduler HTTP/1.1
> Accept: application/json
> Accept-Encoding: gzip, deflate
> Connection: keep-alive
> Content-Length: 123
> Content-Type: application/x-protobuf
> Host: 127.1.0.3:5050
> User-Agent: HTTPie/0.9.0
> +-+
> | NOTE: binary data not shown in terminal |
> +-+
> HTTP/1.1 307 Temporary Redirect
> Content-Length: 0
> Date: Fri, 26 Feb 2016 00:54:41 GMT
> Location: //127.1.0.1:5050
> {code}
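
For illustration, a minimal sketch (hypothetical helper, not the actual master code) of what a complete redirect target would look like: the original request path is appended so that following the redirect hits the same endpoint on the leading master.

{code}
// Hypothetical sketch only: build a Location value that keeps the request
// path, e.g. "//127.1.0.1:5050/api/v1/scheduler" rather than the bare
// "//127.1.0.1:5050" shown in the response above.
#include <string>

std::string redirectLocation(
    const std::string& leaderHostport,  // e.g. "127.1.0.1:5050"
    const std::string& requestPath)     // e.g. "/api/v1/scheduler"
{
  return "//" + leaderHostport + requestPath;
}
{code}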



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5028) Copy provisioner does not work for docker image layers with dangling symlink

2016-03-24 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211203#comment-15211203
 ] 

Zhitao Li commented on MESOS-5028:
--

[~gilbert] and I took a quick look, and this is actually caused by the layer 
trying to replace a directory with a symlink, which is not allowed by `cp -aT` 
(sorry, my previous description was a bit misleading).
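
For context, a sketch of how a copy-based backend could handle this case (illustration only, using std::filesystem rather than the actual Mesos copy backend): the conflicting destination directory is removed before the symlink from the upper layer is recreated.

{code}
#include <filesystem>

namespace fs = std::filesystem;

// Illustration only, not the Mesos copy backend: `cp -aT` refuses to overwrite
// a directory with a symlink, so drop the conflicting destination entry first
// and then recreate the (possibly dangling) symlink from the upper layer.
void copyLayerEntry(const fs::path& src, const fs::path& dst)
{
  if (fs::is_symlink(src)) {
    fs::remove_all(dst);         // e.g. remove the lower layer's 'etc/apt' directory
    fs::copy_symlink(src, dst);  // dangling targets are fine; the link is copied as-is
    return;
  }

  fs::copy(src, dst,
           fs::copy_options::recursive | fs::copy_options::overwrite_existing);
}
{code}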

> Copy provisioner does not work for docker image layers with dangling symlink
> 
>
> Key: MESOS-5028
> URL: https://issues.apache.org/jira/browse/MESOS-5028
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Gilbert Song
>
> I'm trying to play with the new image provisioner on our custom docker 
> images, but one of the layers failed to get copied, possibly due to a dangling 
> symlink.
> Error log with Glog_v=1:
> {quote}
> I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path 
> '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs'
>  to rootfs 
> '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6'
> E0324 05:42:49.028506 15062 slave.cpp:3773] Container 
> '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework 
> 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: 
> Collect failed: Failed to copy layer: cp: cannot overwrite directory 
> ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’
>  with non-directory
> {quote}
> Content of 
> _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_
>  points to a non-existing absolute path (cannot provide exact path but it's a 
> result of us trying to mount apt keys into docker container at build time).
> I believe what happened is that we executed a script at build time, which 
> contains equivalent of:
> {quote}
> rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5028) Copy provisioner cannot replace directory with symlink

2016-03-24 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-5028:
-
Summary: Copy provisioner cannot replace directory with symlink  (was: Copy 
provisioner does not work for docker image layers with dangling symlink)

> Copy provisioner cannot replace directory with symlink
> --
>
> Key: MESOS-5028
> URL: https://issues.apache.org/jira/browse/MESOS-5028
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Gilbert Song
>
> I'm trying to play with the new image provisioner on our custom docker 
> images, but one of the layers failed to get copied, possibly due to a dangling 
> symlink.
> Error log with Glog_v=1:
> {quote}
> I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path 
> '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs'
>  to rootfs 
> '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6'
> E0324 05:42:49.028506 15062 slave.cpp:3773] Container 
> '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework 
> 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: 
> Collect failed: Failed to copy layer: cp: cannot overwrite directory 
> ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’
>  with non-directory
> {quote}
> Content of 
> _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_
>  points to a non-existing absolute path (cannot provide exact path but it's a 
> result of us trying to mount apt keys into docker container at build time).
> I believe what happened is that we executed a script at build time, which 
> contains equivalent of:
> {quote}
> rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5023) MesosContainerizerProvisionerTest.ProvisionFailed is flaky.

2016-03-24 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma reassigned MESOS-5023:
---

Assignee: Klaus Ma

> MesosContainerizerProvisionerTest.ProvisionFailed is flaky.
> ---
>
> Key: MESOS-5023
> URL: https://issues.apache.org/jira/browse/MESOS-5023
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> Observed on the Apache Jenkins.
> {noformat}
> [ RUN  ] MesosContainerizerProvisionerTest.ProvisionFailed
> I0324 13:38:56.284261  2948 containerizer.cpp:666] Starting container 
> 'test_container' for executor 'executor' of framework ''
> I0324 13:38:56.285825  2939 containerizer.cpp:1421] Destroying container 
> 'test_container'
> I0324 13:38:56.285854  2939 containerizer.cpp:1424] Waiting for the 
> provisioner to complete for container 'test_container'
> [   OK ] MesosContainerizerProvisionerTest.ProvisionFailed (7 ms)
> [ RUN  ] MesosContainerizerProvisionerTest.DestroyWhileProvisioning
> I0324 13:38:56.291187  2944 containerizer.cpp:666] Starting container 
> 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' for executor 'executor' of framework ''
> I0324 13:38:56.292157  2944 containerizer.cpp:1421] Destroying container 
> 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2'
> I0324 13:38:56.292179  2944 containerizer.cpp:1424] Waiting for the 
> provisioner to complete for container 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2'
> F0324 13:38:56.292899  2944 containerizer.cpp:752] Check failed: 
> containers_.contains(containerId)
> *** Check failure stack trace: ***
> @ 0x2ac9973d0ae4  google::LogMessage::Fail()
> @ 0x2ac9973d0a30  google::LogMessage::SendToLog()
> @ 0x2ac9973d0432  google::LogMessage::Flush()
> @ 0x2ac9973d3346  google::LogMessageFatal::~LogMessageFatal()
> @ 0x2ac996af897c  
> mesos::internal::slave::MesosContainerizerProcess::_launch()
> @ 0x2ac996b1f18a  
> _ZZN7process8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS1_11ContainerIDERK6OptionINS1_8TaskInfoEERKNS1_12ExecutorInfoERKSsRKS8_ISsERKNS1_7SlaveIDERKNS_3PIDINS3_5SlaveEEEbRKS8_INS3_13ProvisionInfoEES5_SA_SD_SsSI_SL_SQ_bSU_EENS_6FutureIT_EERKNSO_IT0_EEMS10_FSZ_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_ENKUlPNS_11ProcessBaseEE_clES1P_
> @ 0x2ac996b479d9  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERK6OptionINS5_8TaskInfoEERKNS5_12ExecutorInfoERKSsRKSC_ISsERKNS5_7SlaveIDERKNS0_3PIDINS7_5SlaveEEEbRKSC_INS7_13ProvisionInfoEES9_SE_SH_SsSM_SP_SU_bSY_EENS0_6FutureIT_EERKNSS_IT0_EEMS14_FS13_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x2ac997334fef  std::function<>::operator()()
> @ 0x2ac99731b1c7  process::ProcessBase::visit()
> @ 0x2ac997321154  process::DispatchEvent::visit()
> @   0x9a699c  process::ProcessBase::serve()
> @ 0x2ac9973173c0  process::ProcessManager::resume()
> @ 0x2ac99731445a  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x2ac997320916  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x2ac9973208c6  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x2ac997320858  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x2ac9973207af  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x2ac997320748  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x2ac9989aea60  (unknown)
> @ 0x2ac999125182  start_thread
> @ 0x2ac99943547d  (unknown)
> make[4]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
> make[4]: *** [check-local] Aborted
> make[3]: *** [check-am] Error 2
> make[3]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
> make[2]: *** [check] Error 2
> make[2]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
> make[1]: *** [check-recursive] Error 1
> make[1]: Leaving directory `/mesos/mesos-0.29.0/_build'
> make: *** [distcheck] Error 1
> Build step 'Execute shell' marked build as failure
> {noformat}
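
The abort comes from the CHECK at containerizer.cpp:752 firing after the container has already been removed from the containerizer's map. A minimal sketch of the guard logic involved (an assumed simplification, not the actual fix):

{code}
#include <set>
#include <string>

// Assumed simplification, not the actual patch: if the container was destroyed
// while the provisioner was still completing, the launch continuation should
// fail gracefully instead of aborting on
// "Check failed: containers_.contains(containerId)".
bool shouldContinueLaunch(
    const std::set<std::string>& trackedContainers,
    const std::string& containerId,
    std::string* error)
{
  if (trackedContainers.count(containerId) == 0) {
    *error = "Container '" + containerId +
             "' was destroyed while it was being provisioned";
    return false;  // the caller returns a failure rather than CHECK-ing
  }

  return true;
}
{code}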



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5020) Drop `404 Not Found` and `307 Temporary Redirect` in the scheduler library.

2016-03-24 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211124#comment-15211124
 ] 

Yong Tang commented on MESOS-5020:
--

Hi [~anandmazumdar] [~vinodkone], I created a review request:
https://reviews.apache.org/r/45327/
Please take a look if you have time and let me know if there are any issues.

> Drop `404 Not Found` and `307 Temporary Redirect` in the scheduler library.
> ---
>
> Key: MESOS-5020
> URL: https://issues.apache.org/jira/browse/MESOS-5020
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Assignee: Yong Tang
>  Labels: mesosphere, newbie
>
> Currently, the scheduler library does not drop {{404 Not Found}} responses but 
> treats them as {{Event::ERROR}}. The library can receive this if the master 
> has not yet set up its HTTP routes. The executor library already deals with 
> this.
> Secondly, in some cases, the {{detector}} can detect a new master without the 
> master realizing that it has been elected as the new master. In such cases, 
> the master responds with {{307 Temporary Redirect}}. We would like to drop 
> these status codes too instead of treating them as {{Event::ERROR}}.
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L547
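
A hedged sketch of the intended behavior (the helper below is an assumption for illustration, not the library's actual internals): both status codes are treated as transient and simply dropped, so the library keeps waiting for a usable master instead of surfacing an error.

{code}
// Illustration only (hypothetical helper, not scheduler.cpp): 404 and 307
// responses are dropped instead of being turned into Event::ERROR.
bool shouldDropResponse(int statusCode)
{
  // 404 Not Found: the master has not set up its HTTP routes yet.
  // 307 Temporary Redirect: the detector saw a new master before that master
  // realized it has been elected leader.
  return statusCode == 404 || statusCode == 307;
}
{code}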



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5020) Drop `404 Not Found` and `307 Temporary Redirect` in the scheduler library.

2016-03-24 Thread Yong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Tang reassigned MESOS-5020:


Assignee: Yong Tang

> Drop `404 Not Found` and `307 Temporary Redirect` in the scheduler library.
> ---
>
> Key: MESOS-5020
> URL: https://issues.apache.org/jira/browse/MESOS-5020
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Assignee: Yong Tang
>  Labels: mesosphere, newbie
>
> Currently, the scheduler library does not drop {{404 Not Found}} responses but 
> treats them as {{Event::ERROR}}. The library can receive this if the master 
> has not yet set up its HTTP routes. The executor library already deals with 
> this.
> Secondly, in some cases, the {{detector}} can detect a new master without the 
> master realizing that it has been elected as the new master. In such cases, 
> the master responds with {{307 Temporary Redirect}}. We would like to drop 
> these status codes too instead of treating them as {{Event::ERROR}}.
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L547



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4759) Add network/cni isolator for Mesos containerizer.

2016-03-24 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1520#comment-1520
 ] 

Jie Yu commented on MESOS-4759:
---

commit 4bf5833e58324df65e48f607a0d2b73b56f23f40
Author: Qian Zhang 
Date:   Thu Mar 24 16:09:26 2016 -0700

Implemented prepare() method of "network/cni" isolator.

Review: https://reviews.apache.org/r/44514/

> Add network/cni isolator for Mesos containerizer.
> -
>
> Key: MESOS-4759
> URL: https://issues.apache.org/jira/browse/MESOS-4759
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Qian Zhang
>
> See the design doc for more context (MESOS-4742).
> The isolator will interact with CNI plugins to create the network for the 
> container to join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3573) Mesos does not kill orphaned docker containers

2016-03-24 Thread Ian Babrou (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211051#comment-15211051
 ] 

Ian Babrou commented on MESOS-3573:
---

It shouldn't be, but it is.

> The agent would itself kill the container the executor is running in after 2 
> seconds (EXECUTOR_REREGISTER_TIMEOUT)

Not always, just like I showed.

> Of course, if the docker daemon is still stuck and the agent is not able to 
> invoke docker->stop on the container, it would fail.

Docker is healthy, "docker stop" does not happen: 
https://issues.apache.org/jira/browse/MESOS-3573?focusedCommentId=15075015=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15075015

> Mesos does not kill orphaned docker containers
> --
>
> Key: MESOS-3573
> URL: https://issues.apache.org/jira/browse/MESOS-3573
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Reporter: Ian Babrou
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> After upgrade to 0.24.0 we noticed hanging containers appearing. Looks like 
> there were changes between 0.23.0 and 0.24.0 that broke cleanup.
> Here's how to trigger this bug:
> 1. Deploy app in docker container.
> 2. Kill corresponding mesos-docker-executor process
> 3. Observe hanging container
> Here are the logs after kill:
> {noformat}
> slave_1| I1002 12:12:59.362002  7791 docker.cpp:1576] Executor for 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
> slave_1| I1002 12:12:59.362284  7791 docker.cpp:1374] Destroying 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1| I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop 
> on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1| I1002 12:12:59.363876  7791 slave.cpp:3399] Executor 
> 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 
> 20150923-122130-2153451692-5050-1- terminated with signal Terminated
> slave_1| I1002 12:12:59.367570  7791 slave.cpp:2696] Handling status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- from @0.0.0.0:0
> slave_1| I1002 12:12:59.367842  7791 slave.cpp:5094] Terminating task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
> slave_1| W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating 
> unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
> slave_1| I1002 12:12:59.368671  7791 status_update_manager.cpp:322] 
> Received status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> slave_1| I1002 12:12:59.368741  7791 status_update_manager.cpp:826] 
> Checkpointing UPDATE for status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> slave_1| I1002 12:12:59.370636  7791 status_update_manager.cpp:376] 
> Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) 
> for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to the slave
> slave_1| I1002 12:12:59.371335  7791 slave.cpp:2975] Forwarding the 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to master@172.16.91.128:5050
> slave_1| I1002 12:12:59.371908  7791 slave.cpp:2899] Status update 
> manager successfully handled status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> master_1   | I1002 12:12:59.37204711 master.cpp:4069] Status update 
> TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- from slave 
> 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 
> (172.16.91.128)
> master_1   | I1002 12:12:59.37253411 master.cpp:4108] Forwarding status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> master_1   | I1002 12:12:59.37301811 master.cpp:5576] Updating the latest 
> state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to TASK_FAILED
> master_1   | I1002 12:12:59.37344711 hierarchical.hpp:814] Recovered 
> cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total: 

[jira] [Updated] (MESOS-5018) FrameworkInfo Capability enum does not support upgrades.

2016-03-24 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5018:
---
Fix Version/s: 0.27.3
   0.28.1

> FrameworkInfo Capability enum does not support upgrades.
> 
>
> Key: MESOS-5018
> URL: https://issues.apache.org/jira/browse/MESOS-5018
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0, 0.26.0, 0.27.0, 
> 0.27.1, 0.28.0, 0.27.2, 0.26.1, 0.25.1
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 0.29.0, 0.28.1, 0.27.3
>
>
> See MESOS-4997 for the general issue around enum usage. This ticket tracks 
> fixing the FrameworkInfo Capability enum to support upgrades in a backwards 
> compatible way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5029) Add labels to ExecutorInfo

2016-03-24 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-5029:
-
Shepherd: Benjamin Mahler

> Add labels to ExecutorInfo
> --
>
> Key: MESOS-5029
> URL: https://issues.apache.org/jira/browse/MESOS-5029
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>
> We want to allow frameworks to populate metadata on the ExecutorInfo object.
> A use case would be custom labels inspected by the QosController.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5030) Expose TaskInfo's metadata to ResourceUsage struct

2016-03-24 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-5030:
-
Shepherd: Benjamin Mahler

> Expose TaskInfo's metadata to ResourceUsage struct
> --
>
> Key: MESOS-5030
> URL: https://issues.apache.org/jira/browse/MESOS-5030
> Project: Mesos
>  Issue Type: Improvement
>  Components: oversubscription
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: qos
>
> So the QosController can use metadata information from TaskInfo.
> Based on conversations in the Mesos working group, we would at least include:
> - task id;
> - name;
> - labels;
> (I think resources and kill_policy should probably also be included.)
> An alternative would be to just purge fields like `data`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5014) Call and Event Type enums in scheduler.proto should be optional

2016-03-24 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210974#comment-15210974
 ] 

Yong Tang commented on MESOS-5014:
--

Hi [~vinodkone], I added a review request:
https://reviews.apache.org/r/45317/
Let me know if there are any issues.

> Call and Event Type enums in scheduler.proto should be optional
> ---
>
> Key: MESOS-5014
> URL: https://issues.apache.org/jira/browse/MESOS-5014
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Yong Tang
>
> Having a 'required' Type enum has backwards compatibility issues when adding 
> new enum types. See MESOS-4997 for details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5014) Call and Event Type enums in scheduler.proto should be optional

2016-03-24 Thread Yong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Tang reassigned MESOS-5014:


Assignee: Yong Tang

> Call and Event Type enums in scheduler.proto should be optional
> ---
>
> Key: MESOS-5014
> URL: https://issues.apache.org/jira/browse/MESOS-5014
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Yong Tang
>
> Having a 'required' Type enum has backwards compatibility issues when adding 
> new enum types. See MESOS-4997 for details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5030) Expose TaskInfo's metadata to ResourceUsage struct

2016-03-24 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-5030:


 Summary: Expose TaskInfo's metadata to ResourceUsage struct
 Key: MESOS-5030
 URL: https://issues.apache.org/jira/browse/MESOS-5030
 Project: Mesos
  Issue Type: Improvement
  Components: oversubscription
Reporter: Zhitao Li
Assignee: Zhitao Li


So the QosController can use metadata information from TaskInfo.

Based on conversations in the Mesos working group, we would at least include:
- task id;
- name;
- labels;

(I think resources and kill_policy should probably also be included.)

An alternative would be to just purge fields like `data`.
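
As a rough sketch of the intent (assuming the standard protobuf C++ accessors for these TaskInfo fields; the exact shape of the ResourceUsage change may differ), only the whitelisted metadata would be copied and payload fields like `data` left out:

{code}
#include <mesos/mesos.pb.h>

// Rough sketch of the intent, not the actual change: copy only the metadata
// fields listed above into the TaskInfo handed to the QoS controller, so
// payload fields such as `data` are purged.
mesos::TaskInfo taskMetadata(const mesos::TaskInfo& task)
{
  mesos::TaskInfo result;
  result.mutable_task_id()->CopyFrom(task.task_id());
  result.set_name(task.name());

  if (task.has_labels()) {
    result.mutable_labels()->CopyFrom(task.labels());
  }

  // Resources (and possibly kill_policy) could be included as well.
  result.mutable_resources()->CopyFrom(task.resources());

  return result;
}
{code}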



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5029) Add labels to ExecutorInfo

2016-03-24 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-5029:


 Summary: Add labels to ExecutorInfo
 Key: MESOS-5029
 URL: https://issues.apache.org/jira/browse/MESOS-5029
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li
Assignee: Zhitao Li
Priority: Minor


We want to allow frameworks to populate metadata on the ExecutorInfo object.

A use case would be custom labels inspected by the QosController.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3573) Mesos does not kill orphaned docker containers

2016-03-24 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210909#comment-15210909
 ] 

Anand Mazumdar commented on MESOS-3573:
---

[~bobrik] This shouldn't be a problem. The transient error that you are linking 
to happens due to this:

When the agent is recovering, it tries to send a {{ReconnectExecutorMessage}} 
to reconnect with the executor. If that fails for some reason, as in your logs, 
probably because the executor process itself is hung or has already exited, the 
agent would itself kill the container the executor is running in after 2 
seconds (EXECUTOR_REREGISTER_TIMEOUT): 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4700

Of course, if the docker daemon is still stuck and the agent is not able to 
invoke {{docker->stop}} on the container, it would fail. We cannot do anything 
about that, as remarked in point 1 of my earlier comment. Let me know if you 
have any further queries.

> Mesos does not kill orphaned docker containers
> --
>
> Key: MESOS-3573
> URL: https://issues.apache.org/jira/browse/MESOS-3573
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Reporter: Ian Babrou
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> After upgrade to 0.24.0 we noticed hanging containers appearing. Looks like 
> there were changes between 0.23.0 and 0.24.0 that broke cleanup.
> Here's how to trigger this bug:
> 1. Deploy app in docker container.
> 2. Kill corresponding mesos-docker-executor process
> 3. Observe hanging container
> Here are the logs after kill:
> {noformat}
> slave_1| I1002 12:12:59.362002  7791 docker.cpp:1576] Executor for 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
> slave_1| I1002 12:12:59.362284  7791 docker.cpp:1374] Destroying 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1| I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop 
> on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1| I1002 12:12:59.363876  7791 slave.cpp:3399] Executor 
> 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 
> 20150923-122130-2153451692-5050-1- terminated with signal Terminated
> slave_1| I1002 12:12:59.367570  7791 slave.cpp:2696] Handling status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- from @0.0.0.0:0
> slave_1| I1002 12:12:59.367842  7791 slave.cpp:5094] Terminating task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
> slave_1| W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating 
> unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
> slave_1| I1002 12:12:59.368671  7791 status_update_manager.cpp:322] 
> Received status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> slave_1| I1002 12:12:59.368741  7791 status_update_manager.cpp:826] 
> Checkpointing UPDATE for status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> slave_1| I1002 12:12:59.370636  7791 status_update_manager.cpp:376] 
> Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) 
> for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to the slave
> slave_1| I1002 12:12:59.371335  7791 slave.cpp:2975] Forwarding the 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to master@172.16.91.128:5050
> slave_1| I1002 12:12:59.371908  7791 slave.cpp:2899] Status update 
> manager successfully handled status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> master_1   | I1002 12:12:59.37204711 master.cpp:4069] Status update 
> TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- from slave 
> 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 
> (172.16.91.128)
> master_1   | I1002 12:12:59.37253411 master.cpp:4108] Forwarding status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> master_1   | I1002 12:12:59.37301811 master.cpp:5576] 

[jira] [Updated] (MESOS-5028) Copy provisioner does not work for docker image layers with dangling symlink

2016-03-24 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5028:
--
Assignee: Gilbert Song  (was: Jie Yu)

> Copy provisioner does not work for docker image layers with dangling symlink
> 
>
> Key: MESOS-5028
> URL: https://issues.apache.org/jira/browse/MESOS-5028
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Gilbert Song
>
> I'm trying to play with the new image provisioner on our custom docker 
> images, but one of the layers failed to get copied, possibly due to a dangling 
> symlink.
> Error log with Glog_v=1:
> {quote}
> I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path 
> '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs'
>  to rootfs 
> '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6'
> E0324 05:42:49.028506 15062 slave.cpp:3773] Container 
> '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework 
> 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: 
> Collect failed: Failed to copy layer: cp: cannot overwrite directory 
> ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’
>  with non-directory
> {quote}
> Content of 
> _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_
>  points to a non-existing absolute path (cannot provide exact path but it's a 
> result of us trying to mount apt keys into docker container at build time).
> I believe what happened is that we executed a script at build time, which 
> contains equivalent of:
> {quote}
> rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5028) Copy provisioner does not work for docker image layers with dangling symlink

2016-03-24 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5028:
--
Shepherd: Jie Yu

> Copy provisioner does not work for docker image layers with dangling symlink
> 
>
> Key: MESOS-5028
> URL: https://issues.apache.org/jira/browse/MESOS-5028
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Gilbert Song
>
> I'm trying to play with the new image provisioner on our custom docker 
> images, but one of the layers failed to get copied, possibly due to a dangling 
> symlink.
> Error log with Glog_v=1:
> {quote}
> I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path 
> '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs'
>  to rootfs 
> '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6'
> E0324 05:42:49.028506 15062 slave.cpp:3773] Container 
> '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework 
> 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: 
> Collect failed: Failed to copy layer: cp: cannot overwrite directory 
> ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’
>  with non-directory
> {quote}
> Content of 
> _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_
>  points to a non-existing absolute path (cannot provide exact path but it's a 
> result of us trying to mount apt keys into docker container at build time).
> I believe what happened is that we executed a script at build time, which 
> contains equivalent of:
> {quote}
> rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5028) Copy provisioner does not work for docker image layers with dangling symlink

2016-03-24 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-5028:


 Summary: Copy provisioner does not work for docker image layers 
with dangling symlink
 Key: MESOS-5028
 URL: https://issues.apache.org/jira/browse/MESOS-5028
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Zhitao Li
Assignee: Jie Yu


I'm trying to play with the new image provisioner on our custom docker images, 
but one of the layers failed to get copied, possibly due to a dangling symlink.

Error log with Glog_v=1:

{quote}
I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path 
'/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs'
 to rootfs 
'/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6'
E0324 05:42:49.028506 15062 slave.cpp:3773] Container 
'5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework 
75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: 
Collect failed: Failed to copy layer: cp: cannot overwrite directory 
‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’
 with non-directory
{quote}

Content of 
_/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_
 points to a non-existing absolute path (cannot provide exact path but it's a 
result of us trying to mount apt keys into docker container at build time).

I believe what happened is that we executed a script at build time, which 
contains equivalent of:
{quote}
rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt
{quote}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3573) Mesos does not kill orphaned docker containers

2016-03-24 Thread Ian Babrou (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210830#comment-15210830
 ] 

Ian Babrou commented on MESOS-3573:
---

Cleanup is not always successful; there might be transient errors: 
https://issues.apache.org/jira/browse/MESOS-3573?focusedCommentId=15121914=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15121914

> Mesos does not kill orphaned docker containers
> --
>
> Key: MESOS-3573
> URL: https://issues.apache.org/jira/browse/MESOS-3573
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Reporter: Ian Babrou
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> After upgrade to 0.24.0 we noticed hanging containers appearing. Looks like 
> there were changes between 0.23.0 and 0.24.0 that broke cleanup.
> Here's how to trigger this bug:
> 1. Deploy app in docker container.
> 2. Kill corresponding mesos-docker-executor process
> 3. Observe hanging container
> Here are the logs after kill:
> {noformat}
> slave_1| I1002 12:12:59.362002  7791 docker.cpp:1576] Executor for 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
> slave_1| I1002 12:12:59.362284  7791 docker.cpp:1374] Destroying 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1| I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop 
> on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1| I1002 12:12:59.363876  7791 slave.cpp:3399] Executor 
> 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 
> 20150923-122130-2153451692-5050-1- terminated with signal Terminated
> slave_1| I1002 12:12:59.367570  7791 slave.cpp:2696] Handling status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- from @0.0.0.0:0
> slave_1| I1002 12:12:59.367842  7791 slave.cpp:5094] Terminating task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
> slave_1| W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating 
> unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
> slave_1| I1002 12:12:59.368671  7791 status_update_manager.cpp:322] 
> Received status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> slave_1| I1002 12:12:59.368741  7791 status_update_manager.cpp:826] 
> Checkpointing UPDATE for status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> slave_1| I1002 12:12:59.370636  7791 status_update_manager.cpp:376] 
> Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) 
> for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to the slave
> slave_1| I1002 12:12:59.371335  7791 slave.cpp:2975] Forwarding the 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to master@172.16.91.128:5050
> slave_1| I1002 12:12:59.371908  7791 slave.cpp:2899] Status update 
> manager successfully handled status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> master_1   | I1002 12:12:59.37204711 master.cpp:4069] Status update 
> TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- from slave 
> 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 
> (172.16.91.128)
> master_1   | I1002 12:12:59.37253411 master.cpp:4108] Forwarding status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> master_1   | I1002 12:12:59.37301811 master.cpp:5576] Updating the latest 
> state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to TASK_FAILED
> master_1   | I1002 12:12:59.37344711 hierarchical.hpp:814] Recovered 
> cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total: cpus(*):4; 
> mem(*):1001; disk(*):52869; ports(*):[31000-32000], allocated: 
> cpus(*):8.32667e-17) on slave 20151002-120829-2153451692-5050-1-S0 from 
> framework 20150923-122130-2153451692-5050-1-
> {noformat}
> Another issue: if you restart mesos-slave on the host with orphaned docker 
> containers, 

[jira] [Commented] (MESOS-3573) Mesos does not kill orphaned docker containers

2016-03-24 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210821#comment-15210821
 ] 

Anand Mazumdar commented on MESOS-3573:
---

There are a few things at play here:

A container can be orphaned due to any of the following:

1. The agent sends an explicit {{Shutdown}} request to the docker executor. The 
docker executor in turn executes a {{docker->stop}}, which hangs due to some 
issue with the docker daemon itself. We currently just kill the 
{{mesos-docker-executor}} process if it's a command executor, or the executor 
process if it's a custom executor. The associated container is now an orphan. 
We kill all such orphaned containers when the agent process starts up during 
the recovery phase. We can't do anything more than that for this scenario.

2. A container can also be orphaned due to the following: 
  - The agent gets partitioned off from the master. The master sends it an 
explicit {{Shutdown}} request that instructs the agent to commit suicide after 
killing all its tasks. Some of the containers can now be orphans due to 1. The 
agent process then starts off as a fresh instance with a new {{SlaveID}}.
  - The agent process is accidentally started with a new {{SlaveID}} due to 
specifying an incorrect {{work_dir}}. 
  In both of the above cases, all the existing containers would be treated as 
orphans, since, as of now, we only invoke {{docker ps}} for containers that 
match the current {{SlaveID}}. 
We should ideally be killing all other containers that don't match the current 
{{SlaveID}} but have the {{mesos-}} prefix.
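
To make the last point concrete, a sketch of the filtering this would imply (the container-name format is an assumption here, purely for illustration):

{code}
#include <string>

// Illustrative filter only. Assumption: agent-launched docker containers carry
// a "mesos-" name prefix that embeds the SlaveID, so a container with the
// prefix but without the current SlaveID was left behind by a previous agent
// instance and should be cleaned up.
bool isOrphanContainer(const std::string& containerName,
                       const std::string& currentSlaveId)
{
  const std::string prefix = "mesos-";

  return containerName.compare(0, prefix.size(), prefix) == 0 &&
         containerName.find(currentSlaveId) == std::string::npos;
}
{code}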

> Mesos does not kill orphaned docker containers
> --
>
> Key: MESOS-3573
> URL: https://issues.apache.org/jira/browse/MESOS-3573
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Reporter: Ian Babrou
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> After upgrade to 0.24.0 we noticed hanging containers appearing. Looks like 
> there were changes between 0.23.0 and 0.24.0 that broke cleanup.
> Here's how to trigger this bug:
> 1. Deploy app in docker container.
> 2. Kill corresponding mesos-docker-executor process
> 3. Observe hanging container
> Here are the logs after kill:
> {noformat}
> slave_1| I1002 12:12:59.362002  7791 docker.cpp:1576] Executor for 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
> slave_1| I1002 12:12:59.362284  7791 docker.cpp:1374] Destroying 
> container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1| I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop 
> on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1| I1002 12:12:59.363876  7791 slave.cpp:3399] Executor 
> 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 
> 20150923-122130-2153451692-5050-1- terminated with signal Terminated
> slave_1| I1002 12:12:59.367570  7791 slave.cpp:2696] Handling status 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- from @0.0.0.0:0
> slave_1| I1002 12:12:59.367842  7791 slave.cpp:5094] Terminating task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
> slave_1| W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating 
> unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
> slave_1| I1002 12:12:59.368671  7791 status_update_manager.cpp:322] 
> Received status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> slave_1| I1002 12:12:59.368741  7791 status_update_manager.cpp:826] 
> Checkpointing UPDATE for status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> slave_1| I1002 12:12:59.370636  7791 status_update_manager.cpp:376] 
> Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) 
> for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to the slave
> slave_1| I1002 12:12:59.371335  7791 slave.cpp:2975] Forwarding the 
> update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1- to master@172.16.91.128:5050
> slave_1| I1002 12:12:59.371908  7791 slave.cpp:2899] Status update 
> manager successfully handled status update TASK_FAILED (UUID: 
> 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task 
> sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 
> 20150923-122130-2153451692-5050-1-
> 

[jira] [Created] (MESOS-5027) Enable authenticated login in the webui

2016-03-24 Thread Greg Mann (JIRA)
Greg Mann created MESOS-5027:


 Summary: Enable authenticated login in the webui
 Key: MESOS-5027
 URL: https://issues.apache.org/jira/browse/MESOS-5027
 Project: Mesos
  Issue Type: Improvement
  Components: master, security, webui
Reporter: Greg Mann


The webui hits a number of endpoints to get the data that it displays: 
{{/state}}, {{/metrics/snapshot}}, {{/files/browse}}, {{/files/read}}, and 
maybe others? Once authentication is enabled on these endpoints, we need to add 
a login prompt to the webui so that users can provide credentials.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4951) Enable actors to pass an authentication realm to libprocess

2016-03-24 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-4951:
-
Assignee: (was: Greg Mann)

> Enable actors to pass an authentication realm to libprocess
> ---
>
> Key: MESOS-4951
> URL: https://issues.apache.org/jira/browse/MESOS-4951
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, slave
>Reporter: Greg Mann
>  Labels: authentication, http, mesosphere, security
>
> To prepare for MESOS-4902, the Mesos master and agent need a way to pass the 
> desired authentication realm to libprocess. Since some endpoints (like 
> {{/profiler/*}}) get installed in libprocess, the master/agent should be able 
> to specify during initialization what authentication realm the 
> libprocess-level endpoints will be authenticated under.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5025) FetcherCacheTest.LocalUncached is flaky

2016-03-24 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5025:
--
Affects Version/s: 0.29.0

> FetcherCacheTest.LocalUncached is flaky
> ---
>
> Key: MESOS-5025
> URL: https://issues.apache.org/jira/browse/MESOS-5025
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.29.0
> Environment: CentOS 7
>Reporter: Anand Mazumdar
>  Labels: flaky, flaky-test
>
> Showed up on an internal CI:
> {code}
> [17:57:05] :   [Step 11/11] [ RUN  ] FetcherCacheTest.LocalUncached
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.653718  1813 cluster.cpp:139] 
> Creating default 'local' authorizer
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.659001  1813 leveldb.cpp:174] 
> Opened db in 5.09329ms
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.660393  1813 leveldb.cpp:181] 
> Compacted db in 1.367077ms
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.660434  1813 leveldb.cpp:196] 
> Created db iterator in 13516ns
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.660446  1813 leveldb.cpp:202] 
> Seeked to beginning of db in 1531ns
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.660454  1813 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 284ns
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.660478  1813 replica.cpp:779] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.660815  1831 recover.cpp:447] 
> Starting replica recovery
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.661001  1831 recover.cpp:473] 
> Replica is in EMPTY status
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.661866  1830 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> (1886)@172.30.2.131:51675
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.662237  1831 recover.cpp:193] 
> Received a recover response from a replica in EMPTY status
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.662652  1827 recover.cpp:564] 
> Updating replica status to STARTING
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.663151  1829 master.cpp:376] 
> Master 2574ed73-b254-4829-9efc-f76d89150396 (ip-172-30-2-131.mesosphere.io) 
> started on 172.30.2.131:51675
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.663172  1829 master.cpp:378] Flags 
> at startup: --acls="" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate="true" 
> --authenticate_http="true" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/VRasc1/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="100secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/VRasc1/master" 
> --zk_session_timeout="10secs"
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.663374  1829 master.cpp:427] 
> Master only allowing authenticated frameworks to register
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.663384  1829 master.cpp:432] 
> Master only allowing authenticated slaves to register
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.663390  1829 credentials.hpp:35] 
> Loading credentials for authentication from '/tmp/VRasc1/credentials'
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.663595  1829 master.cpp:474] Using 
> default 'crammd5' authenticator
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.663725  1829 master.cpp:545] Using 
> default 'basic' HTTP authenticator
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.663880  1829 master.cpp:583] 
> Authorization enabled
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.664001  1827 hierarchical.cpp:144] 
> Initialized hierarchical allocator process
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.664010  1831 
> whitelist_watcher.cpp:77] No whitelist given
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.664114  1834 leveldb.cpp:304] 
> Persisting metadata (8 bytes) to leveldb took 1.271881ms
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.664136  1834 replica.cpp:320] 
> Persisted replica status to STARTING
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.664353  1833 recover.cpp:473] 
> Replica is in STARTING status
> [17:57:05]W:   [Step 11/11] I0324 17:57:05.665315  1833 replica.cpp:673] 
> Replica in STARTING status received a broadcasted recover request from 
> 

[jira] [Commented] (MESOS-5015) Call and Event Type enums in executor.proto should be optional

2016-03-24 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210751#comment-15210751
 ] 

Yong Tang commented on MESOS-5015:
--

Hi [~vinodkone] Just created a review request:
https://reviews.apache.org/r/45304/
Let me know if there are any issues.

> Call and Event Type enums in executor.proto should be optional
> --
>
> Key: MESOS-5015
> URL: https://issues.apache.org/jira/browse/MESOS-5015
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Yong Tang
>
> Having a 'required' Type enum has backwards compatibility issues when adding 
> new enum types. See MESOS-4997 for details.
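
For illustration, a minimal C++ sketch of the pattern an optional Type enum enables (the types and names below are hypothetical stand-ins, not the classes generated from executor.proto): with an optional field, a message carrying an enum value the receiver does not recognize still parses and can simply be skipped, whereas a required enum would make the whole message unparseable.

{code}
// Hypothetical stand-in types; not the generated protobuf classes.
#include <iostream>
#include <optional>

enum class EventType { SUBSCRIBED, LAUNCH, KILL };

struct Event {
  // Unset when the sender used a newer value this build does not know about.
  std::optional<EventType> type;
};

void handle(const Event& event) {
  if (!event.type) {
    // With a required enum this situation would surface as a parse failure;
    // with an optional enum the receiver can just skip the event.
    std::cerr << "Ignoring event with unrecognized type\n";
    return;
  }
  switch (*event.type) {
    case EventType::SUBSCRIBED: std::cout << "subscribed\n"; break;
    case EventType::LAUNCH:     std::cout << "launch\n";     break;
    case EventType::KILL:       std::cout << "kill\n";       break;
  }
}

int main() {
  handle(Event{});                   // unrecognized type: skipped, not fatal
  handle(Event{EventType::LAUNCH});  // known type: dispatched normally
}
{code}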



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5015) Call and Event Type enums in executor.proto should be optional

2016-03-24 Thread Yong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Tang reassigned MESOS-5015:


Assignee: Yong Tang

> Call and Event Type enums in executor.proto should be optional
> --
>
> Key: MESOS-5015
> URL: https://issues.apache.org/jira/browse/MESOS-5015
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Yong Tang
>
> Having a 'required' Type enum has backwards compatibility issues when adding 
> new enum types. See MESOS-4997 for details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5025) FetcherCacheTest.LocalUncached is flaky

2016-03-24 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-5025:
-

 Summary: FetcherCacheTest.LocalUncached is flaky
 Key: MESOS-5025
 URL: https://issues.apache.org/jira/browse/MESOS-5025
 Project: Mesos
  Issue Type: Bug
  Components: fetcher
 Environment: CentOS 7
Reporter: Anand Mazumdar


Showed up on an internal CI:

{code}
[17:57:05] : [Step 11/11] [ RUN  ] FetcherCacheTest.LocalUncached
[17:57:05]W: [Step 11/11] I0324 17:57:05.653718  1813 cluster.cpp:139] 
Creating default 'local' authorizer
[17:57:05]W: [Step 11/11] I0324 17:57:05.659001  1813 leveldb.cpp:174] 
Opened db in 5.09329ms
[17:57:05]W: [Step 11/11] I0324 17:57:05.660393  1813 leveldb.cpp:181] 
Compacted db in 1.367077ms
[17:57:05]W: [Step 11/11] I0324 17:57:05.660434  1813 leveldb.cpp:196] 
Created db iterator in 13516ns
[17:57:05]W: [Step 11/11] I0324 17:57:05.660446  1813 leveldb.cpp:202] 
Seeked to beginning of db in 1531ns
[17:57:05]W: [Step 11/11] I0324 17:57:05.660454  1813 leveldb.cpp:271] 
Iterated through 0 keys in the db in 284ns
[17:57:05]W: [Step 11/11] I0324 17:57:05.660478  1813 replica.cpp:779] 
Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
[17:57:05]W: [Step 11/11] I0324 17:57:05.660815  1831 recover.cpp:447] 
Starting replica recovery
[17:57:05]W: [Step 11/11] I0324 17:57:05.661001  1831 recover.cpp:473] 
Replica is in EMPTY status
[17:57:05]W: [Step 11/11] I0324 17:57:05.661866  1830 replica.cpp:673] 
Replica in EMPTY status received a broadcasted recover request from 
(1886)@172.30.2.131:51675
[17:57:05]W: [Step 11/11] I0324 17:57:05.662237  1831 recover.cpp:193] 
Received a recover response from a replica in EMPTY status
[17:57:05]W: [Step 11/11] I0324 17:57:05.662652  1827 recover.cpp:564] 
Updating replica status to STARTING
[17:57:05]W: [Step 11/11] I0324 17:57:05.663151  1829 master.cpp:376] 
Master 2574ed73-b254-4829-9efc-f76d89150396 (ip-172-30-2-131.mesosphere.io) 
started on 172.30.2.131:51675
[17:57:05]W: [Step 11/11] I0324 17:57:05.663172  1829 master.cpp:378] Flags 
at startup: --acls="" --allocation_interval="1secs" 
--allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" 
--authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/VRasc1/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
--quiet="false" --recovery_slave_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_store_timeout="100secs" --registry_strict="true" 
--root_submissions="true" --slave_ping_timeout="15secs" 
--slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/VRasc1/master" 
--zk_session_timeout="10secs"
[17:57:05]W: [Step 11/11] I0324 17:57:05.663374  1829 master.cpp:427] 
Master only allowing authenticated frameworks to register
[17:57:05]W: [Step 11/11] I0324 17:57:05.663384  1829 master.cpp:432] 
Master only allowing authenticated slaves to register
[17:57:05]W: [Step 11/11] I0324 17:57:05.663390  1829 credentials.hpp:35] 
Loading credentials for authentication from '/tmp/VRasc1/credentials'
[17:57:05]W: [Step 11/11] I0324 17:57:05.663595  1829 master.cpp:474] Using 
default 'crammd5' authenticator
[17:57:05]W: [Step 11/11] I0324 17:57:05.663725  1829 master.cpp:545] Using 
default 'basic' HTTP authenticator
[17:57:05]W: [Step 11/11] I0324 17:57:05.663880  1829 master.cpp:583] 
Authorization enabled
[17:57:05]W: [Step 11/11] I0324 17:57:05.664001  1827 hierarchical.cpp:144] 
Initialized hierarchical allocator process
[17:57:05]W: [Step 11/11] I0324 17:57:05.664010  1831 
whitelist_watcher.cpp:77] No whitelist given
[17:57:05]W: [Step 11/11] I0324 17:57:05.664114  1834 leveldb.cpp:304] 
Persisting metadata (8 bytes) to leveldb took 1.271881ms
[17:57:05]W: [Step 11/11] I0324 17:57:05.664136  1834 replica.cpp:320] 
Persisted replica status to STARTING
[17:57:05]W: [Step 11/11] I0324 17:57:05.664353  1833 recover.cpp:473] 
Replica is in STARTING status
[17:57:05]W: [Step 11/11] I0324 17:57:05.665315  1833 replica.cpp:673] 
Replica in STARTING status received a broadcasted recover request from 
(1888)@172.30.2.131:51675
[17:57:05]W: [Step 11/11] I0324 17:57:05.665621  1827 recover.cpp:193] 
Received a recover response from a replica in STARTING status
[17:57:05]W: [Step 11/11] I0324 17:57:05.666237  1833 master.cpp:1826] The 
newly elected leader is master@172.30.2.131:51675 with id 
2574ed73-b254-4829-9efc-f76d89150396
[17:57:05]W: [Step 

[jira] [Commented] (MESOS-3548) Investigate federations of Mesos masters

2016-03-24 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210687#comment-15210687
 ] 

John Omernik commented on MESOS-3548:
-

This is a very interesting topic to me, and one that I would love to help test 
with Mesos as it evolves.

> Investigate federations of Mesos masters
> 
>
> Key: MESOS-3548
> URL: https://issues.apache.org/jira/browse/MESOS-3548
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>  Labels: federation, mesosphere, multi-dc
>
> In a large Mesos installation, the operator might want to ensure that even if 
> the Mesos masters are inaccessible or failed, new tasks can still be 
> scheduled (across multiple different frameworks). HA masters are only a 
> partial solution here: the masters might still be inaccessible due to a 
> correlated failure (e.g., Zookeeper misconfiguration/human error).
> To support this, we could support the notion of "hierarchies" or 
> "federations" of Mesos masters. In a Mesos installation with 10k machines, 
> the operator might configure 10 Mesos masters (each of which might be HA) to 
> manage 1k machines each. Then an additional "meta-Master" would manage the 
> allocation of cluster resources to the 10 masters. Hence, the failure of any 
> individual master would impact 1k machines at most. The meta-master might not 
> have a lot of work to do: e.g., it might be limited to occasionally 
> reallocating cluster resources among the 10 masters, or ensuring that newly 
> added cluster resources are allocated among the masters as appropriate. 
> Hence, the failure of the meta-master would not prevent any of the individual 
> masters from scheduling new tasks. A single framework instance probably 
> wouldn't be able to use more resources than have been assigned to a single 
> Master, but that seems like a reasonable restriction.
> This feature might also be a good fit for a multi-datacenter deployment of 
> Mesos: each Mesos master instance would manage a single DC. Naturally, 
> reducing the traffic between frameworks and the meta-master would be 
> important for performance reasons in a configuration like this.
> Operationally, this might be simpler if Mesos processes were self-hosting 
> ([MESOS-3547]).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4909) Introduce kill policy for tasks.

2016-03-24 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210609#comment-15210609
 ] 

Alexander Rukletsov edited comment on MESOS-4909 at 3/24/16 5:36 PM:
-

{noformat}
Commit: 7ab6a478b9cef548a2470d18bd281aee5610b62a [7ab6a47]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 24 Mar 2016 17:30:31 CET
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 24 Mar 2016 18:21:03 CET

Introduced KillPolicy protobuf.

Describes a kill policy for a task. Currently does not express
different policies (e.g. hitting HTTP endpoints), only controls
how long to wait between graceful and forcible task kill.

Review: https://reviews.apache.org/r/44656/
{noformat}
{noformat}
Commit: 1fe6221aa30f35f31378433412d8cb725009bd47 [1fe6221]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 24 Mar 2016 17:30:42 CET
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 24 Mar 2016 18:21:03 CET

Added validation for task's kill policy.

Review: https://reviews.apache.org/r/44707/
{noformat}
{noformat}
Commit: d13de4c42b39037c8bd8f79122e7a9ac0d82317f [d13de4c]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 24 Mar 2016 17:30:52 CET
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 24 Mar 2016 18:21:03 CET

Used KillPolicy and shutdown grace period in command executor.

The command executor determines how much time it allots the
underlying task to clean up (effectively how long to wait for
the task to comply to SIGTERM before sending SIGKILL) based
on both optional task's KillPolicy and optional
shutdown_grace_period field in ExecutorInfo.

Manual testing was performed to ensure newly introduced protobuf
fields are respected. To do that, "mesos-execute" was modified to
support KillPolicy and CommandInfo.shell=false. To simulate a
task that does not exit in the allotted period, a tiny app
(https://github.com/rukletsov/unresponsive-process) that ignores
SIGTERM was used. More details on testing in the review request.

Review: https://reviews.apache.org/r/44657/
{noformat}
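
A rough sketch of the grace-period selection the commit message above describes (illustrative only; the real command executor works with Mesos protobuf and Duration types rather than std::chrono, and the function name here is made up):

{code}
// Sketch: prefer the task's KillPolicy grace period when present, otherwise
// fall back to the executor-wide shutdown grace period. The result is how
// long to wait after SIGTERM before escalating to SIGKILL.
#include <chrono>
#include <iostream>
#include <optional>

std::chrono::seconds killGracePeriod(
    const std::optional<std::chrono::seconds>& taskKillPolicy,
    const std::chrono::seconds& executorShutdownGracePeriod)
{
  return taskKillPolicy ? *taskKillPolicy : executorShutdownGracePeriod;
}

int main()
{
  using namespace std::chrono_literals;
  std::cout << killGracePeriod(10s, 3s).count() << "s\n";           // 10s
  std::cout << killGracePeriod(std::nullopt, 3s).count() << "s\n";  // 3s
}
{code}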


was (Author: alexr):
{noformat}
Commit: 7ab6a478b9cef548a2470d18bd281aee5610b62a [7ab6a47]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 24 Mar 2016 17:30:31 CET
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 24 Mar 2016 18:21:03 CET

Introduced KillPolicy protobuf.

Describes a kill policy for a task. Currently does not express
different policies (e.g. hitting HTTP endpoints), only controls
how long to wait between graceful and forcible task kill.

Review: https://reviews.apache.org/r/44656/
{noformat}
{noformat}
Commit: 1fe6221aa30f35f31378433412d8cb725009bd47 [1fe6221]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 24 Mar 2016 17:30:42 CET
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 24 Mar 2016 18:21:03 CET

Added validation for task's kill policy.

Review: https://reviews.apache.org/r/44707/
{noformat}

> Introduce kill policy for tasks.
> 
>
> Key: MESOS-4909
> URL: https://issues.apache.org/jira/browse/MESOS-4909
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> A task may require some time to clean up or even a special mechanism to issue 
> a kill request (currently it's a SIGTERM followed by SIGKILL). Introducing 
> kill policies per task will help address these issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2043) framework auth fail with timeout error and never get authenticated

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-2043:
--
Shepherd: Adam B

> framework auth fail with timeout error and never get authenticated
> --
>
> Key: MESOS-2043
> URL: https://issues.apache.org/jira/browse/MESOS-2043
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver, security, slave
>Affects Versions: 0.21.0
>Reporter: Bhuvan Arumugam
>Priority: Critical
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
> Attachments: aurora-scheduler.20141104-1606-1706.log, master.log, 
> mesos-master.20141104-1606-1706.log, slave.log
>
>
> I'm facing this issue in master as of 
> https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4
> As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm 
> running 1 master and 1 scheduler (aurora). The framework authentication fails 
> due to a timeout:
> error on mesos master:
> {code}
> I1104 19:37:17.741449  8329 master.cpp:3874] Authenticating 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083
> I1104 19:37:17.741585  8329 master.cpp:3885] Using default CRAM-MD5 
> authenticator
> I1104 19:37:17.742106  8336 authenticator.hpp:169] Creating new server SASL 
> connection
> W1104 19:37:22.742959  8329 master.cpp:3953] Authentication timed out
> W1104 19:37:22.743548  8329 master.cpp:3930] Failed to authenticate 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: 
> Authentication discarded
> {code}
> scheduler error:
> {code}
> I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master 
> master@MASTER_IP:PORT
> I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL 
> connection
> I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL 
> authentication mechanisms: CRAM-MD5
> I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate 
> with mechanism 'CRAM-MD5'
> W1104 19:39:02.891196 49005 sched.cpp:378] Authentication timed out
> I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master 
> master@MASTER_IP:PORT: Authentication discarded
> {code}
> Looks like 2 instances {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} & 
> {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}} of the same framework are 
> trying to authenticate and failing.
> {code}
> W1104 19:36:30.769420  8319 master.cpp:3930] Failed to authenticate 
> scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to 
> communicate with authenticatee
> I1104 19:36:42.701441  8328 master.cpp:3860] Queuing up authentication 
> request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 
> because authentication is still in progress
> {code}
> Restarting master and scheduler didn't fix it. 
> This particular issue happens with 1 master and 1 scheduler after MESOS-1866 
> is fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4909) Introduce kill policy for tasks.

2016-03-24 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210609#comment-15210609
 ] 

Alexander Rukletsov commented on MESOS-4909:


{noformat}
Commit: 7ab6a478b9cef548a2470d18bd281aee5610b62a [7ab6a47]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 24 Mar 2016 17:30:31 CET
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 24 Mar 2016 18:21:03 CET

Introduced KillPolicy protobuf.

Describes a kill policy for a task. Currently does not express
different policies (e.g. hitting HTTP endpoints), only controls
how long to wait between graceful and forcible task kill.

Review: https://reviews.apache.org/r/44656/
{noformat}
{noformat}
Commit: 1fe6221aa30f35f31378433412d8cb725009bd47 [1fe6221]
Author: Alexander Rukletsov ruklet...@gmail.com
Date: 24 Mar 2016 17:30:42 CET
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 24 Mar 2016 18:21:03 CET

Added validation for task's kill policy.

Review: https://reviews.apache.org/r/44707/
{noformat}

> Introduce kill policy for tasks.
> 
>
> Key: MESOS-4909
> URL: https://issues.apache.org/jira/browse/MESOS-4909
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> A task may require some time to clean up or even a special mechanism to issue 
> a kill request (currently it's a SIGTERM followed by SIGKILL). Introducing 
> kill policies per task will help address these issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4992) sandbox uri does not work outside mesos http server

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4992:
--
Story Points: 3

> sandbox uri does not work outside mesos http server
> ---
>
> Key: MESOS-4992
> URL: https://issues.apache.org/jira/browse/MESOS-4992
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 0.27.1
>Reporter: Stavros Kontopoulos
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> The sandbox URI of a framework does not work if I just copy-paste it into the 
> browser.
> For example the following sandbox uri:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse
> should redirect to:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80
> yet it fails with the message:
> "Failed to find slaves.
> Navigate to the slave's sandbox via the Mesos UI."
> and redirects to:
> http://172.17.0.1:5050/#/
> It is an issue for me because I'm working on expanding the Mesos Spark UI with 
> the sandbox URI. The other option is to get the slave info, parse the JSON 
> file there and extract the executor paths, which is not so straightforward or 
> elegant.
> Moreover, I don't see the runs/container_id in the Mesos Proto API. I guess 
> this is hidden info; it is the piece of info needed to rewrite the URI 
> without redirection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4933) Registrar HTTP Authentication.

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4933:
--
Story Points: 3  (was: 2)

> Registrar HTTP Authentication.
> --
>
> Key: MESOS-4933
> URL: https://issues.apache.org/jira/browse/MESOS-4933
> Project: Mesos
>  Issue Type: Task
>Reporter: Joerg Schad
>Assignee: Jan Schlicht
>  Labels: authentication, mesosphere, security
>
> Now that the master (and agents, in progress) provide HTTP authentication, the 
> registrar should do the same. 
> See http://mesos.apache.org/documentation/latest/endpoints/registrar/registry/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2043) framework auth fail with timeout error and never get authenticated

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-2043:
--
Story Points: 5

> framework auth fail with timeout error and never get authenticated
> --
>
> Key: MESOS-2043
> URL: https://issues.apache.org/jira/browse/MESOS-2043
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver, security, slave
>Affects Versions: 0.21.0
>Reporter: Bhuvan Arumugam
>Priority: Critical
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
> Attachments: aurora-scheduler.20141104-1606-1706.log, master.log, 
> mesos-master.20141104-1606-1706.log, slave.log
>
>
> I'm facing this issue in master as of 
> https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4
> As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm 
> running 1 master and 1 scheduler (aurora). The framework authentication fails 
> due to a timeout:
> error on mesos master:
> {code}
> I1104 19:37:17.741449  8329 master.cpp:3874] Authenticating 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083
> I1104 19:37:17.741585  8329 master.cpp:3885] Using default CRAM-MD5 
> authenticator
> I1104 19:37:17.742106  8336 authenticator.hpp:169] Creating new server SASL 
> connection
> W1104 19:37:22.742959  8329 master.cpp:3953] Authentication timed out
> W1104 19:37:22.743548  8329 master.cpp:3930] Failed to authenticate 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: 
> Authentication discarded
> {code}
> scheduler error:
> {code}
> I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master 
> master@MASTER_IP:PORT
> I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL 
> connection
> I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL 
> authentication mechanisms: CRAM-MD5
> I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate 
> with mechanism 'CRAM-MD5'
> W1104 19:39:02.891196 49005 sched.cpp:378] Authentication timed out
> I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master 
> master@MASTER_IP:PORT: Authentication discarded
> {code}
> Looks like 2 instances {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} & 
> {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}} of the same framework are 
> trying to authenticate and failing.
> {code}
> W1104 19:36:30.769420  8319 master.cpp:3930] Failed to authenticate 
> scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to 
> communicate with authenticatee
> I1104 19:36:42.701441  8328 master.cpp:3860] Queuing up authentication 
> request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 
> because authentication is still in progress
> {code}
> Restarting master and scheduler didn't fix it. 
> This particular issue happens with 1 master and 1 scheduler after MESOS-1866 
> is fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4933) Registrar HTTP Authentication.

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4933:
--
Sprint: Mesosphere Sprint 32

> Registrar HTTP Authentication.
> --
>
> Key: MESOS-4933
> URL: https://issues.apache.org/jira/browse/MESOS-4933
> Project: Mesos
>  Issue Type: Task
>Reporter: Joerg Schad
>Assignee: Jan Schlicht
>  Labels: authentication, mesosphere, security
>
> Now that the master (and agents, in progress) provide HTTP authentication, the 
> registrar should do the same. 
> See http://mesos.apache.org/documentation/latest/endpoints/registrar/registry/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4949) Executor shutdown grace period should be configurable.

2016-03-24 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4949:
---
Story Points: 3  (was: 1)

> Executor shutdown grace period should be configurable.
> --
>
> Key: MESOS-4949
> URL: https://issues.apache.org/jira/browse/MESOS-4949
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Currently, executor shutdown grace period is specified by an agent flag, 
> which is propagated to executors via the 
> {{MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD}} environment variable. There is no 
> way to adjust this timeout for the needs of a particular executor.
> To tackle this problem, we propose to introduce an optional 
> {{shutdown_grace_period}} field in {{ExecutorInfo}}.
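
A tiny sketch of the fallback this implies. Assumptions for this sketch: the proposed {{shutdown_grace_period}} override is modeled as an optional number of seconds, and the environment variable's value is treated as a plain number here, whereas the real value is a stringified Duration such as "5secs" that Mesos parses itself.

{code}
// Sketch only: a per-executor shutdown_grace_period (if set) overrides the
// agent-wide value propagated via MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD.
#include <cstdlib>
#include <iostream>
#include <optional>

double shutdownGracePeriodSecs(const std::optional<double>& executorInfoOverride)
{
  if (executorInfoOverride) {
    return *executorInfoOverride;  // proposed ExecutorInfo-level override
  }
  const char* env = std::getenv("MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD");
  // Assumption: treat the value as plain seconds; the real variable holds a
  // Duration string (e.g. "5secs") that Mesos parses itself.
  return env ? std::atof(env) : 5.0;  // 5s default is illustrative
}

int main()
{
  std::cout << shutdownGracePeriodSecs(std::nullopt) << "s\n";
  std::cout << shutdownGracePeriodSecs(30.0) << "s\n";
}
{code}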



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2043) framework auth fail with timeout error and never get authenticated

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-2043:
--
Fix Version/s: 0.29.0

> framework auth fail with timeout error and never get authenticated
> --
>
> Key: MESOS-2043
> URL: https://issues.apache.org/jira/browse/MESOS-2043
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver, security, slave
>Affects Versions: 0.21.0
>Reporter: Bhuvan Arumugam
>Priority: Critical
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
> Attachments: aurora-scheduler.20141104-1606-1706.log, master.log, 
> mesos-master.20141104-1606-1706.log, slave.log
>
>
> I'm facing this issue in master as of 
> https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4
> As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm 
> running 1 master and 1 scheduler (aurora). The framework authentication fails 
> due to a timeout:
> error on mesos master:
> {code}
> I1104 19:37:17.741449  8329 master.cpp:3874] Authenticating 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083
> I1104 19:37:17.741585  8329 master.cpp:3885] Using default CRAM-MD5 
> authenticator
> I1104 19:37:17.742106  8336 authenticator.hpp:169] Creating new server SASL 
> connection
> W1104 19:37:22.742959  8329 master.cpp:3953] Authentication timed out
> W1104 19:37:22.743548  8329 master.cpp:3930] Failed to authenticate 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: 
> Authentication discarded
> {code}
> scheduler error:
> {code}
> I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master 
> master@MASTER_IP:PORT
> I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL 
> connection
> I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL 
> authentication mechanisms: CRAM-MD5
> I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate 
> with mechanism 'CRAM-MD5'
> W1104 19:39:02.891196 49005 sched.cpp:378] Authentication timed out
> I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master 
> master@MASTER_IP:PORT: Authentication discarded
> {code}
> Looks like 2 instances {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} & 
> {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}} of the same framework are 
> trying to authenticate and failing.
> {code}
> W1104 19:36:30.769420  8319 master.cpp:3930] Failed to authenticate 
> scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to 
> communicate with authenticatee
> I1104 19:36:42.701441  8328 master.cpp:3860] Queuing up authentication 
> request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 
> because authentication is still in progress
> {code}
> Restarting master and scheduler didn't fix it. 
> This particular issue happens with 1 master and 1 scheduler after MESOS-1866 
> is fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4992) sandbox uri does not work outside mesos http server

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4992:
--
Fix Version/s: 0.29.0

> sandbox uri does not work outside mesos http server
> ---
>
> Key: MESOS-4992
> URL: https://issues.apache.org/jira/browse/MESOS-4992
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 0.27.1
>Reporter: Stavros Kontopoulos
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> The sandbox URI of a framework does not work if I just copy-paste it into the 
> browser.
> For example the following sandbox uri:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse
> should redirect to:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80
> yet it fails with the message:
> "Failed to find slaves.
> Navigate to the slave's sandbox via the Mesos UI."
> and redirects to:
> http://172.17.0.1:5050/#/
> It is an issue for me because I'm working on expanding the Mesos Spark UI with 
> the sandbox URI. The other option is to get the slave info, parse the JSON 
> file there and extract the executor paths, which is not so straightforward or 
> elegant.
> Moreover, I don't see the runs/container_id in the Mesos Proto API. I guess 
> this is hidden info; it is the piece of info needed to rewrite the URI 
> without redirection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3003) Support mounting in default configuration files/volumes into every new container

2016-03-24 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210523#comment-15210523
 ] 

Jie Yu commented on MESOS-3003:
---

I'm closing this ticket because all the network-related /etc/* files should be 
handled by the network isolator, since it (and only it) knows the IP and 
hostname information. Blindly mounting in host /etc/* files does not make sense.

> Support mounting in default configuration files/volumes into every new 
> container
> 
>
> Key: MESOS-3003
> URL: https://issues.apache.org/jira/browse/MESOS-3003
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Timothy Chen
>Assignee: Gilbert Song
>  Labels: mesosphere, unified-containerizer-mvp
>
> Most container images leave out system configuration (e.g., /etc/*) and expect 
> the container runtime to mount in specific configuration, such as 
> /etc/resolv.conf, from the host into the container as needed.
> We need to support mounting in specific configuration files for the command 
> executor to work, and also allow the user to optionally define other 
> configuration files to mount in via flags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5024) local docker puller uses colon in tarball names

2016-03-24 Thread James Peach (JIRA)
James Peach created MESOS-5024:
--

 Summary: local docker puller uses colon in tarball names
 Key: MESOS-5024
 URL: https://issues.apache.org/jira/browse/MESOS-5024
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: James Peach
Priority: Trivial


The local docker puller for the unified containerizer expects tagged docker 
repository images to be named {{repository:tag.tar}}. However, tar(1) would 
normally interpret that as a remote archive:

{quote}
   -f, --file=ARCHIVE
...
  An archive name that has a colon in it specifies a file or device on a 
remote machine. The part before the colon is taken as the machine name or IP 
address, and the part after it as the file or device pathname
...
{quote}

This works correctly only because the puller always passes an absolute path to 
tar(1), which causes it to interpret the name as a local archive again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks

2016-03-24 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210399#comment-15210399
 ] 

Joris Van Remoortere commented on MESOS-4694:
-

{code}
commit 6a8738f89b01ac3ddd70c418c49f350e17fa
Author: Dario Rexin 
Date:   Thu Mar 24 14:10:31 2016 +0100

Allocator Performance: Exited early to avoid needless computation.

Review: https://reviews.apache.org/r/43668/
{code}
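
For readers skimming the review, a stripped-down sketch of the early-exit idea (point 2 in the issue description quoted below). This is not the actual HierarchicalAllocator code; resources are reduced to two doubles and the minimum thresholds are illustrative.

{code}
// Sketch: once an agent has nothing allocatable left, stop looping over the
// remaining frameworks for that agent instead of evaluating them needlessly.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Agent { double cpus; double mem; };

bool allocatable(const Agent& agent)
{
  return agent.cpus >= 0.01 || agent.mem >= 32.0;  // illustrative minimums
}

void allocate(std::map<std::string, Agent>& agents,
              const std::vector<std::string>& frameworks)
{
  for (auto& [agentId, agent] : agents) {
    for (const std::string& framework : frameworks) {
      if (!allocatable(agent)) {
        break;  // the early exit: skip the remaining frameworks for this agent
      }
      // Illustrative "offer": hand the whole remainder to this framework.
      std::cout << "offer " << agent.cpus << " cpus on " << agentId
                << " to " << framework << '\n';
      agent.cpus = 0;
      agent.mem = 0;
    }
  }
}

int main()
{
  std::map<std::string, Agent> agents = {{"agent-1", {4.0, 1024.0}}};
  allocate(agents, {"framework-1", "framework-2", "framework-3"});
}
{code}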

> DRFAllocator takes very long to allocate resources with a large number of 
> frameworks
> 
>
> Key: MESOS-4694
> URL: https://issues.apache.org/jira/browse/MESOS-4694
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.28.0, 0.27.2, 0.28.1
>Reporter: Dario Rexin
>Assignee: Dario Rexin
>
> With a growing number of connected frameworks, the allocation time grows to 
> very high numbers. The addition of quota in 0.27 had an additional impact on 
> these numbers. Running `mesos-tests.sh --benchmark 
> --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us 
> the following numbers:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 2.921202secs to make 200 offers
> round 1 allocate took 2.85045secs to make 200 offers
> round 2 allocate took 2.823768secs to make 200 offers
> {noformat}
> Increasing the number of frameworks to 2000:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 28.209454secs to make 2000 offers
> round 1 allocate took 28.469419secs to make 2000 offers
> round 2 allocate took 28.138086secs to make 2000 offers
> {noformat}
> I was able to reduce this time by a substantial amount. After applying the 
> patches:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 1.016226secs to make 2000 offers
> round 1 allocate took 1.102729secs to make 2000 offers
> round 2 allocate took 1.102624secs to make 2000 offers
> {noformat}
> And with 2000 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 12.563203secs to make 2000 offers
> round 1 allocate took 12.437517secs to make 2000 offers
> round 2 allocate took 12.470708secs to make 2000 offers
> {noformat}
> The patches do 3 things to improve the performance of the allocator.
> 1) The total values in the DRFSorter will be pre-calculated per resource type.
> 2) In the allocate method, when no resources are available to allocate, we 
> break out of the innermost loop to prevent looping over a large number of 
> frameworks when we have nothing to allocate.
> 3) When a framework suppresses offers, we remove it from the sorter instead 
> of just calling continue in the allocation loop - this greatly improves 
> performance in the sorter and prevents looping over frameworks that don't 
> need resources.
> Assuming that most of the frameworks behave nicely and suppress offers when 
> they have nothing to schedule, it is fair to assume that point 3) has the 
> biggest impact on the performance. If we suppress offers for 90% of the 
> frameworks in the benchmark test, we see the following numbers:
> {noformat}
> ==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 200 slaves and 2000 frameworks
> round 0 allocate took 11626us to make 200 offers
> round 1 allocate took 22890us to make 200 offers
> round 2 allocate took 21346us to make 200 offers
> {noformat}
> And for 200 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 

[jira] [Updated] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks

2016-03-24 Thread Joris Van Remoortere (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van Remoortere updated MESOS-4694:

Affects Version/s: 0.28.1
   0.28.0
   0.27.2

> DRFAllocator takes very long to allocate resources with a large number of 
> frameworks
> 
>
> Key: MESOS-4694
> URL: https://issues.apache.org/jira/browse/MESOS-4694
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.28.0, 0.27.2, 0.28.1
>Reporter: Dario Rexin
>Assignee: Dario Rexin
>
> With a growing number of connected frameworks, the allocation time grows to 
> very high numbers. The addition of quota in 0.27 had an additional impact on 
> these numbers. Running `mesos-tests.sh --benchmark 
> --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us 
> the following numbers:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 2.921202secs to make 200 offers
> round 1 allocate took 2.85045secs to make 200 offers
> round 2 allocate took 2.823768secs to make 200 offers
> {noformat}
> Increasing the number of frameworks to 2000:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 28.209454secs to make 2000 offers
> round 1 allocate took 28.469419secs to make 2000 offers
> round 2 allocate took 28.138086secs to make 2000 offers
> {noformat}
> I was able to reduce this time by a substantial amount. After applying the 
> patches:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 200 frameworks
> round 0 allocate took 1.016226secs to make 2000 offers
> round 1 allocate took 1.102729secs to make 2000 offers
> round 2 allocate took 1.102624secs to make 2000 offers
> {noformat}
> And with 2000 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 12.563203secs to make 2000 offers
> round 1 allocate took 12.437517secs to make 2000 offers
> round 2 allocate took 12.470708secs to make 2000 offers
> {noformat}
> The patches do 3 things to improve the performance of the allocator.
> 1) The total values in the DRFSorter will be pre-calculated per resource type.
> 2) In the allocate method, when no resources are available to allocate, we 
> break out of the innermost loop to prevent looping over a large number of 
> frameworks when we have nothing to allocate.
> 3) When a framework suppresses offers, we remove it from the sorter instead 
> of just calling continue in the allocation loop - this greatly improves 
> performance in the sorter and prevents looping over frameworks that don't 
> need resources.
> Assuming that most of the frameworks behave nicely and suppress offers when 
> they have nothing to schedule, it is fair to assume that point 3) has the 
> biggest impact on the performance. If we suppress offers for 90% of the 
> frameworks in the benchmark test, we see the following numbers:
> {noformat}
> ==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 200 slaves and 2000 frameworks
> round 0 allocate took 11626us to make 200 offers
> round 1 allocate took 22890us to make 200 offers
> round 2 allocate took 21346us to make 200 offers
> {noformat}
> And for 200 frameworks:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test
> [ RUN  ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
> Using 2000 slaves and 2000 frameworks
> round 0 allocate took 1.11178secs to make 2000 offers
> round 1 allocate took 1.062649secs to make 2000 offers
> round 2 allocate took 1.080181secs to make 2000 offers
> {noformat}
> Review requests:
> 

[jira] [Created] (MESOS-5023) MesosContainerizerProvisionerTest.ProvisionFailed is flaky.

2016-03-24 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-5023:
--

 Summary: MesosContainerizerProvisionerTest.ProvisionFailed is 
flaky.
 Key: MESOS-5023
 URL: https://issues.apache.org/jira/browse/MESOS-5023
 Project: Mesos
  Issue Type: Bug
Reporter: Alexander Rukletsov


Observed on the Apache Jenkins.

{noformat}
[ RUN  ] MesosContainerizerProvisionerTest.ProvisionFailed
I0324 13:38:56.284261  2948 containerizer.cpp:666] Starting container 
'test_container' for executor 'executor' of framework ''
I0324 13:38:56.285825  2939 containerizer.cpp:1421] Destroying container 
'test_container'
I0324 13:38:56.285854  2939 containerizer.cpp:1424] Waiting for the provisioner 
to complete for container 'test_container'
[   OK ] MesosContainerizerProvisionerTest.ProvisionFailed (7 ms)
[ RUN  ] MesosContainerizerProvisionerTest.DestroyWhileProvisioning
I0324 13:38:56.291187  2944 containerizer.cpp:666] Starting container 
'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' for executor 'executor' of framework ''
I0324 13:38:56.292157  2944 containerizer.cpp:1421] Destroying container 
'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2'
I0324 13:38:56.292179  2944 containerizer.cpp:1424] Waiting for the provisioner 
to complete for container 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2'
F0324 13:38:56.292899  2944 containerizer.cpp:752] Check failed: 
containers_.contains(containerId)
*** Check failure stack trace: ***
@ 0x2ac9973d0ae4  google::LogMessage::Fail()
@ 0x2ac9973d0a30  google::LogMessage::SendToLog()
@ 0x2ac9973d0432  google::LogMessage::Flush()
@ 0x2ac9973d3346  google::LogMessageFatal::~LogMessageFatal()
@ 0x2ac996af897c  
mesos::internal::slave::MesosContainerizerProcess::_launch()
@ 0x2ac996b1f18a  
_ZZN7process8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS1_11ContainerIDERK6OptionINS1_8TaskInfoEERKNS1_12ExecutorInfoERKSsRKS8_ISsERKNS1_7SlaveIDERKNS_3PIDINS3_5SlaveEEEbRKS8_INS3_13ProvisionInfoEES5_SA_SD_SsSI_SL_SQ_bSU_EENS_6FutureIT_EERKNSO_IT0_EEMS10_FSZ_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_ENKUlPNS_11ProcessBaseEE_clES1P_
@ 0x2ac996b479d9  
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERK6OptionINS5_8TaskInfoEERKNS5_12ExecutorInfoERKSsRKSC_ISsERKNS5_7SlaveIDERKNS0_3PIDINS7_5SlaveEEEbRKSC_INS7_13ProvisionInfoEES9_SE_SH_SsSM_SP_SU_bSY_EENS0_6FutureIT_EERKNSS_IT0_EEMS14_FS13_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
@ 0x2ac997334fef  std::function<>::operator()()
@ 0x2ac99731b1c7  process::ProcessBase::visit()
@ 0x2ac997321154  process::DispatchEvent::visit()
@   0x9a699c  process::ProcessBase::serve()
@ 0x2ac9973173c0  process::ProcessManager::resume()
@ 0x2ac99731445a  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x2ac997320916  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x2ac9973208c6  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x2ac997320858  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x2ac9973207af  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
@ 0x2ac997320748  
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
@ 0x2ac9989aea60  (unknown)
@ 0x2ac999125182  start_thread
@ 0x2ac99943547d  (unknown)
make[4]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
make[4]: *** [check-local] Aborted
make[3]: *** [check-am] Error 2
make[3]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
make[2]: *** [check] Error 2
make[2]: Leaving directory `/mesos/mesos-0.29.0/_build/src'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/mesos/mesos-0.29.0/_build'
make: *** [distcheck] Error 1
Build step 'Execute shell' marked build as failure
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4933) Registrar HTTP Authentication.

2016-03-24 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210388#comment-15210388
 ] 

Jan Schlicht edited comment on MESOS-4933 at 3/24/16 3:17 PM:
--

Yes, the approach will be similar to MESOS-4956.


was (Author: nfnt):
Yes, my approach will be similar to MESOS-4956.

> Registrar HTTP Authentication.
> --
>
> Key: MESOS-4933
> URL: https://issues.apache.org/jira/browse/MESOS-4933
> Project: Mesos
>  Issue Type: Task
>Reporter: Joerg Schad
>Assignee: Jan Schlicht
>  Labels: authentication, mesosphere, security
>
> Now that the master (and agents, in progress) provide HTTP authentication, the 
> registrar should do the same. 
> See http://mesos.apache.org/documentation/latest/endpoints/registrar/registry/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4933) Registrar HTTP Authentication.

2016-03-24 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210388#comment-15210388
 ] 

Jan Schlicht commented on MESOS-4933:
-

Yes, my approach will be similar to MESOS-4956.

> Registrar HTTP Authentication.
> --
>
> Key: MESOS-4933
> URL: https://issues.apache.org/jira/browse/MESOS-4933
> Project: Mesos
>  Issue Type: Task
>Reporter: Joerg Schad
>Assignee: Jan Schlicht
>  Labels: authentication, mesosphere, security
>
> Now that the master (and agents, in progress) provide HTTP authentication, the 
> registrar should do the same. 
> See http://mesos.apache.org/documentation/latest/endpoints/registrar/registry/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1420) Shorten slave, framework and run IDs

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-1420:
--
Shepherd:   (was: Adam B)

> Shorten slave, framework and run IDs
> 
>
> Key: MESOS-1420
> URL: https://issues.apache.org/jira/browse/MESOS-1420
> Project: Mesos
>  Issue Type: Bug
>Reporter: Robert Lacroix
>Priority: Minor
>
> Slave, framework and run IDs are currently quite long and therefore clutter 
> paths to sandboxes. Typically a sandbox path looks like this:
> {code}
> /tmp/mesos/slaves/2014-05-23-16:21:05-16777343-5050-3204-0/frameworks/2014-05-23-16:21:05-16777343-5050-3204-/executors//runs/c22e4dc3-95e5-49fc-9793-b0d22a4f244c
> {code}
> I'd propose shorter and uniform IDs for slaves, frameworks and runs (and 
> probably everywhere else where we use IDs) that look like this:
> {code}
> [a-z0-9]{13}
> {code}
> This has about 65bit keyspace compared to 128bit of a UUID, but I think it 
> should be random enough.
> With that the path would be roughly 80 chars shorter (179 vs 99) and a lot 
> more readable:
> {code}
> /tmp/mesos/slaves/i0b195fb1j14n/frameworks/gtq5kgba60ll4/executors//runs/7qisjqsb581io
> {code}
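
A throwaway sketch of generating such an ID: 13 characters over [a-z0-9] gives roughly 2^67 combinations, in the ballpark of the 65 bits mentioned above.

{code}
// Sketch: generate a 13-character [a-z0-9] identifier as proposed.
#include <cstddef>
#include <iostream>
#include <random>
#include <string>

std::string shortId(std::size_t length = 13)
{
  static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
  static thread_local std::mt19937_64 rng{std::random_device{}()};
  std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);

  std::string id;
  id.reserve(length);
  for (std::size_t i = 0; i < length; ++i) {
    id += alphabet[pick(rng)];
  }
  return id;
}

int main()
{
  std::cout << shortId() << '\n';  // e.g. something like "i0b195fb1j14n"
}
{code}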



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4308) Reliably report executor terminations to framework schedulers.

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4308:
--
Shepherd:   (was: Adam B)

> Reliably report executor terminations to framework schedulers.
> --
>
> Key: MESOS-4308
> URL: https://issues.apache.org/jira/browse/MESOS-4308
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Charles Reiss
>  Labels: mesosphere
>
> Now that executor terminations are reported (unreliably), we should 
> investigate queuing up these messages (on the agent?) and resending them 
> periodically until we get an acknowledgement, much like status updates do.
> From MESOS-313: The Scheduler interface has a callback for executorLost, but 
> currently it is never called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3656) Port process/socket.hpp to Windows

2016-03-24 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210024#comment-15210024
 ] 

Joris Van Remoortere commented on MESOS-3656:
-

{code}
commit 4e19c3e6f09eaa2793f4717e414429e0e6335e0f
Author: Daniel Pravat 
Date:   Thu Mar 24 09:33:05 2016 +0100

Windows: [2/2] Lifted socket API into Stout.

Review: https://reviews.apache.org/r/44139/

commit 6f8544cf5e2748a58ac979e6d12336b2dccbf1fb
Author: Daniel Pravat 
Date:   Thu Mar 24 09:32:57 2016 +0100

Windows: [1/2] Lifted socket API into Stout.

Review: https://reviews.apache.org/r/44138/
{code}

> Port process/socket.hpp to Windows
> --
>
> Key: MESOS-3656
> URL: https://issues.apache.org/jira/browse/MESOS-3656
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, windows
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4431) Sharing of persistent volumes via reference counting

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4431:
--
Shepherd:   (was: Adam B)

> Sharing of persistent volumes via reference counting
> 
>
> Key: MESOS-4431
> URL: https://issues.apache.org/jira/browse/MESOS-4431
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>  Labels: external-volumes, persistent-volumes
>
> Add capability for specific resources to be shared amongst tasks within or 
> across frameworks/roles. Enable this functionality for persistent volumes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4893) Allow setting permissions and access control on persistent volumes

2016-03-24 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209993#comment-15209993
 ] 

Deshi Xiao commented on MESOS-4893:
---

I have read the design doc. I would prefer to add backward-compatibility support.

> Allow setting permissions and access control on persistent volumes
> --
>
> Key: MESOS-4893
> URL: https://issues.apache.org/jira/browse/MESOS-4893
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>  Labels: external-volumes, persistent-volumes
>
> Currently, persistent volumes are exclusive, i.e. if a persistent volume 
> is used by one task or executor, it cannot be concurrently used by another 
> task or executor. 
> With the introduction of shared volumes, persistent volumes can be used 
> simultaneously by multiple tasks or executors. As a result, we need to 
> introduce setting up ownership of persistent volumes at volume creation, 
> which tasks then need to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4760) Expose metrics and gauges for fetcher cache usage and hit rate

2016-03-24 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209957#comment-15209957
 ] 

Deshi Xiao commented on MESOS-4760:
---

+1

> Expose metrics and gauges for fetcher cache usage and hit rate
> --
>
> Key: MESOS-4760
> URL: https://issues.apache.org/jira/browse/MESOS-4760
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher, statistics
>Reporter: Michael Browning
>Priority: Minor
>  Labels: features, fetcher, statistics
>
> To evaluate the fetcher cache and calibrate the value of the 
> fetcher_cache_size flag, it would be useful to have metrics and gauges on 
> agents that expose operational statistics like cache hit rate, occupied cache 
> size, and time spent downloading resources that were not present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4812) Mesos fails to escape command health checks

2016-03-24 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-4812:

Assignee: (was: Benjamin Bannier)

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of a sh -c "", doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this JIRA. 
> I don't know if this only affects the command health check.
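
For what it's worth, a small sketch of the kind of escaping being asked for when a user command is embedded in sh -c "..." (the health-check command in main() is made up for the example; the escaped characters are the ones POSIX double quoting treats specially):

{code}
// Sketch: escape a command so it can be safely wrapped as  sh -c "<command>".
// Inside double quotes, the shell treats \, $, ` and " specially, so each is
// prefixed with a backslash; otherwise an embedded quote ends the string early.
#include <iostream>
#include <string>

std::string escapeForDoubleQuotes(const std::string& command)
{
  std::string escaped;
  for (char c : command) {
    if (c == '"' || c == '\\' || c == '$' || c == '`') {
      escaped += '\\';
    }
    escaped += c;
  }
  return escaped;
}

int main()
{
  // Hypothetical health-check command containing double quotes.
  const std::string check = "curl -s http://localhost/health | grep \"ok\"";
  std::cout << "sh -c \"" << escapeForDoubleQuotes(check) << "\"\n";
}
{code}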



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5022) Provide LDAP as default authorisation

2016-03-24 Thread Klaus Ma (JIRA)
Klaus Ma created MESOS-5022:
---

 Summary: Provide LDAP as default authorisation
 Key: MESOS-5022
 URL: https://issues.apache.org/jira/browse/MESOS-5022
 Project: Mesos
  Issue Type: Epic
Reporter: Klaus Ma
Assignee: Klaus Ma


The default authorisation/ACL is configured by a {{json}} file; the operator has 
to restart the master if any new user is added. It would be better to provide 
LDAP as the default ACL:

1. Provide a real example of the ACL interface
2. Provide a default auth plugin for users to use



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4984) MasterTest.SlavesEndpointTwoSlaves is flaky

2016-03-24 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4984:
--
  Sprint: Mesosphere Sprint 31
Story Points: 2

> MasterTest.SlavesEndpointTwoSlaves is flaky
> ---
>
> Key: MESOS-4984
> URL: https://issues.apache.org/jira/browse/MESOS-4984
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>Assignee: Anand Mazumdar
>  Labels: flaky-test, mesosphere, tech-debt
> Fix For: 0.29.0
>
> Attachments: slaves_endpoint_flaky_4984_verbose_log.txt
>
>
> Observed on Arch Linux with GCC 6, running in a virtualbox VM:
> [ RUN  ] MasterTest.SlavesEndpointTwoSlaves
> /mesos-2/src/tests/master_tests.cpp:1710: Failure
> Value of: array.get().values.size()
>   Actual: 1
> Expected: 2u
> Which is: 2
> [  FAILED  ] MasterTest.SlavesEndpointTwoSlaves (86 ms)
> Seems to fail non-deterministically, perhaps more often when there is 
> concurrent CPU load on the machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)